Basics of Data Aggregation with GroupBy
In data analysis, grouping data by specific attributes and checking their statistical values is a crucial task. Using the groupby method in the Pandas library, you can aggregate data just like the SQL GROUP BY clause to easily calculate averages, totals, and more.
Preparing Sample Data
First, create a DataFrame for aggregation. Here, we assume a scenario containing data on product categories, unit prices, and stock quantities.
import pandas as pd
# Definition of original data
# Product Category, Price, Stock
inventory_data = {
"Category": ["Electronics", "Furniture", "Electronics", "Stationery", "Furniture", "Stationery"],
"Price": [55000, 12000, 48000, 500, 15000, 450],
"Stock": [15, 8, 22, 100, 5, 120]
}
df = pd.DataFrame(inventory_data)
print("--- Original DataFrame ---")
print(df)
Aggregation Methods for Statistics
After grouping with groupby, you can calculate categorical figures by calling statistical methods.
Calculating Various Statistics
Here is how to individually calculate major statistics: mean, sum, minimum (min), maximum (max), variance (var), and standard deviation (std).
# Group by Category
grouped = df.groupby("Category")
# Calculate Mean
mean_df = grouped.mean(numeric_only=True)
print("\n--- Mean (mean) ---")
print(mean_df)
# Calculate Sum
sum_df = grouped.sum(numeric_only=True)
print("\n--- Sum (sum) ---")
print(sum_df)
# Calculate Minimum
min_df = grouped.min()
print("\n--- Minimum (min) ---")
print(min_df)
# Calculate Maximum
max_df = grouped.max()
print("\n--- Maximum (max) ---")
print(max_df)
# Calculate Variance
var_df = grouped.var(numeric_only=True)
print("\n--- Variance (var) ---")
print(var_df)
# Calculate Standard Deviation
std_df = grouped.std(numeric_only=True)
print("\n--- Standard Deviation (std) ---")
print(std_df)
Execution Result
Running the code above yields the following output.
--- Original DataFrame ---
Category Price Stock
0 Electronics 55000 15
1 Furniture 12000 8
2 Electronics 48000 22
3 Stationery 500 100
4 Furniture 15000 5
5 Stationery 450 120
--- Mean (mean) ---
Price Stock
Category
Electronics 51500.0 18.5
Furniture 13500.0 6.5
Stationery 475.0 110.0
--- Sum (sum) ---
Price Stock
Category
Electronics 103000 37
Furniture 27000 13
Stationery 950 220
--- Minimum (min) ---
Price Stock
Category
Electronics 48000 15
Furniture 12000 5
Stationery 450 100
--- Maximum (max) ---
Price Stock
Category
Electronics 55000 22
Furniture 15000 8
Stationery 500 120
--- Variance (var) ---
Price Stock
Category
Electronics 24500000.0 24.5
Furniture 4500000.0 4.5
Stationery 1250.0 200.0
--- Standard Deviation (std) ---
Price Stock
Category
Electronics 4949.747468 4.949747
Furniture 2121.320344 2.121320
Stationery 35.355339 14.142136
Meaning of Statistics and Supplemental Information
- Variance: Indicates the degree of dispersion in data. It is defined as the average of the squared differences between each number and the mean (divided by n-1 for unbiased variance).
- Standard Deviation: The positive square root of the variance. It is an indicator expressing dispersion in the same unit as the original data.
Pandas calculates unbiased variance and unbiased standard deviation by default. Using these allows for a detailed understanding of data trends by category.
