Extracting specific rows that meet certain criteria from a Pandas DataFrame holding large amounts of data is one of the most critical steps in data analysis. Pandas uses a mechanism called “Boolean Indexing” to allow for intuitive data extraction, similar to the SQL WHERE clause or Excel’s filter function.
This article explains practical filtering techniques, ranging from numerical filtering using comparison operators to string extraction using regular expressions.
Data Extraction via Boolean Indexing
Boolean indexing is a method where you pass an array (Series) of truth values—determining whether each row satisfies a condition (True) or not (False)—to the DataFrame to extract only the rows where the value is True.
Basic Filtering Using Comparison Operators
Here is an example of performing magnitude comparisons or exact match checks on numerical data. We will use a scenario involving a library book catalog.
import pandas as pd
def basic_filtering_demo():
"""
Function to demonstrate basic filtering techniques using comparison operators
"""
# Create book data
book_data = {
"Title": ["Python Basics", "Mastering Java", "SQL Guide", "Data Science 101", "Advanced C++"],
"Category": ["Programming", "Programming", "Database", "Analysis", "Programming"],
"Price": [2800, 3500, 1500, 4200, 3800],
"Stock": [15, 4, 20, 7, 0]
}
df = pd.DataFrame(book_data)
print("--- Original Book List ---")
print(df)
print("\n")
# 1. Extraction by Exact Match (Books where Category is 'Programming')
programming_books = df[df["Category"] == "Programming"]
print("--- Programming Books ---")
print(programming_books)
# 2. Extraction by Comparison (Books where Price is greater than 3000)
expensive_books = df[df["Price"] > 3000]
print("\n--- Books over 3000 Yen ---")
print(expensive_books)
# 3. Extract only even-numbered rows of the index
# Perform operation on the index to generate a True/False sequence
even_rows = df[df.index % 2 == 0]
print("\n--- Data in Even Index Rows ---")
print(even_rows)
if __name__ == "__main__":
basic_filtering_demo()
Extraction Combining Multiple Conditions (AND / OR)
When performing filtering by combining multiple conditions, enclose each condition in parentheses () and use bitwise operators (& or |).
- AND condition (
&): Extracts rows that satisfy all conditions. - OR condition (
|): Extracts rows that satisfy any of the conditions.
import pandas as pd
def multiple_conditions_demo():
"""
Function to demonstrate filtering with combined conditions
"""
data = {
"Model": ["Standard-A", "Standard-B", "Pro-X", "Pro-Y", "Ultra-Z"],
"Sales": [120, 85, 200, 150, 300],
"Price": [15000, 18000, 45000, 48000, 95000]
}
df = pd.DataFrame(data)
# Example: Extract models where Price is 20000 or more AND Sales are 200 or more
# Enclosing each condition in () is mandatory
condition_and = (df["Price"] >= 20000) & (df["Sales"] >= 200)
filtered_df = df[condition_and]
print("--- High Margin Models (Price>=20000 & Sales>=200) ---")
print(filtered_df)
if __name__ == "__main__":
multiple_conditions_demo()
Advanced Filtering Using String Operations and Regular Expressions
For string-type columns, you can use powerful search functions via the .str accessor. The contains method is particularly useful for extracting strings that contain specific patterns.
import pandas as pd
def string_regex_filtering():
"""
Function to demonstrate filtering using string operations and regular expressions
"""
# Create customer feedback data
feedback_data = {
"Customer": ["User_01", "User_02", "User_03", "User_04", "User_05"],
"Comment": [
"The design is excellent.",
"Performance is slow.",
"Great performance and support.",
"Not good value for money.",
"Excellent performance overall."
]
}
df = pd.DataFrame(feedback_data)
# 1. Extract rows containing specific keywords
# Extract comments containing the word "performance" (case-insensitive setting)
performance_feedback = df[df["Comment"].str.contains("performance", case=False)]
print("--- Feedback regarding Performance ---")
print(performance_feedback)
# 2. Complex extraction using Regular Expressions
# Extract comments starting with "Excellent" or "Great"
# ^ means start of line, | means OR in regex
regex_condition = df["Comment"].str.contains(r"^(Excellent|Great)", regex=True)
positive_feedback = df[regex_condition]
print("\n--- Positive Feedback (Regex Extraction) ---")
print(positive_feedback)
if __name__ == "__main__":
string_regex_filtering()
Explanation of Execution Results
- Comparison Operations: When you write
df[df["Price"] > 3000], a Series of boolean values (True/False) is generated internally for each row, and only the rows matching the condition are reconstructed. - Multiple Conditions: When using the
&operator, you cannot use the Pythonandkeyword in Pandas. You must use the bitwise operator&and enclose the conditional expressions in parentheses to clarify the order of operations. - str.contains: This allows for easy partial match searches. By setting
regex=True(the default), you can perform powerful pattern matching using regular expressions.
Mastering these filtering techniques allows you to instantly pinpoint necessary information from vast amounts of data and improve the precision of your analysis.
