[Python] How to Extract (Filter) Data Matching Conditions in a Pandas DataFrame

2026年1月6日

Extracting specific rows that meet certain criteria from a Pandas DataFrame holding large amounts of data is one of the most critical steps in data analysis. Pandas uses a mechanism called “Boolean Indexing” to allow for intuitive data extraction, similar to the SQL WHERE clause or Excel’s filter function.

This article explains practical filtering techniques, ranging from numerical filtering using comparison operators to string extraction using regular expressions.

Data Extraction via Boolean Indexing

Boolean indexing is a method where you pass an array (Series) of truth values—determining whether each row satisfies a condition (True) or not (False)—to the DataFrame to extract only the rows where the value is True.

Basic Filtering Using Comparison Operators

Here is an example of performing magnitude comparisons or exact match checks on numerical data. We will use a scenario involving a library book catalog.

import pandas as pd

def basic_filtering_demo():
    """
    Function to demonstrate basic filtering techniques using comparison operators
    """
    # Create book data
    book_data = {
        "Title": ["Python Basics", "Mastering Java", "SQL Guide", "Data Science 101", "Advanced C++"],
        "Category": ["Programming", "Programming", "Database", "Analysis", "Programming"],
        "Price": [2800, 3500, 1500, 4200, 3800],
        "Stock": [15, 4, 20, 7, 0]
    }
    df = pd.DataFrame(book_data)

    print("--- Original Book List ---")
    print(df)
    print("\n")

    # 1. Extraction by Exact Match (Books where Category is 'Programming')
    programming_books = df[df["Category"] == "Programming"]
    print("--- Programming Books ---")
    print(programming_books)

    # 2. Extraction by Comparison (Books where Price is greater than 3000)
    expensive_books = df[df["Price"] > 3000]
    print("\n--- Books over 3000 Yen ---")
    print(expensive_books)

    # 3. Extract only even-numbered rows of the index
    # Perform operation on the index to generate a True/False sequence
    even_rows = df[df.index % 2 == 0]
    print("\n--- Data in Even Index Rows ---")
    print(even_rows)

if __name__ == "__main__":
    basic_filtering_demo()

Extraction Combining Multiple Conditions (AND / OR)

When performing filtering by combining multiple conditions, enclose each condition in parentheses () and use bitwise operators (& or |).

AND condition (&): Extracts rows that satisfy all conditions.
OR condition (|): Extracts rows that satisfy any of the conditions.

import pandas as pd

def multiple_conditions_demo():
    """
    Function to demonstrate filtering with combined conditions
    """
    data = {
        "Model": ["Standard-A", "Standard-B", "Pro-X", "Pro-Y", "Ultra-Z"],
        "Sales": [120, 85, 200, 150, 300],
        "Price": [15000, 18000, 45000, 48000, 95000]
    }
    df = pd.DataFrame(data)

    # Example: Extract models where Price is 20000 or more AND Sales are 200 or more
    # Enclosing each condition in () is mandatory
    condition_and = (df["Price"] >= 20000) & (df["Sales"] >= 200)
    filtered_df = df[condition_and]

    print("--- High Margin Models (Price>=20000 & Sales>=200) ---")
    print(filtered_df)

if __name__ == "__main__":
    multiple_conditions_demo()

Advanced Filtering Using String Operations and Regular Expressions

For string-type columns, you can use powerful search functions via the .str accessor. The contains method is particularly useful for extracting strings that contain specific patterns.

import pandas as pd

def string_regex_filtering():
    """
    Function to demonstrate filtering using string operations and regular expressions
    """
    # Create customer feedback data
    feedback_data = {
        "Customer": ["User_01", "User_02", "User_03", "User_04", "User_05"],
        "Comment": [
            "The design is excellent.",
            "Performance is slow.",
            "Great performance and support.",
            "Not good value for money.",
            "Excellent performance overall."
        ]
    }
    df = pd.DataFrame(feedback_data)

    # 1. Extract rows containing specific keywords
    # Extract comments containing the word "performance" (case-insensitive setting)
    performance_feedback = df[df["Comment"].str.contains("performance", case=False)]
    
    print("--- Feedback regarding Performance ---")
    print(performance_feedback)

    # 2. Complex extraction using Regular Expressions
    # Extract comments starting with "Excellent" or "Great"
    # ^ means start of line, | means OR in regex
    regex_condition = df["Comment"].str.contains(r"^(Excellent|Great)", regex=True)
    positive_feedback = df[regex_condition]

    print("\n--- Positive Feedback (Regex Extraction) ---")
    print(positive_feedback)

if __name__ == "__main__":
    string_regex_filtering()

Explanation of Execution Results

Comparison Operations: When you write df[df["Price"] > 3000], a Series of boolean values (True/False) is generated internally for each row, and only the rows matching the condition are reconstructed.
Multiple Conditions: When using the & operator, you cannot use the Python and keyword in Pandas. You must use the bitwise operator & and enclose the conditional expressions in parentheses to clarify the order of operations.
str.contains: This allows for easy partial match searches. By setting regex=True (the default), you can perform powerful pattern matching using regular expressions.

Mastering these filtering techniques allows you to instantly pinpoint necessary information from vast amounts of data and improve the precision of your analysis.

よかったらシェアしてね！