[Python] How to Use groupby to Aggregate Statistics by Category in Pandas

目次

Basics of Data Aggregation with GroupBy

In data analysis, grouping data by specific attributes and checking their statistical values is a crucial task. Using the groupby method in the Pandas library, you can aggregate data just like the SQL GROUP BY clause to easily calculate averages, totals, and more.

Preparing Sample Data

First, create a DataFrame for aggregation. Here, we assume a scenario containing data on product categories, unit prices, and stock quantities.

import pandas as pd

# Definition of original data
# Product Category, Price, Stock
inventory_data = {
    "Category": ["Electronics", "Furniture", "Electronics", "Stationery", "Furniture", "Stationery"],
    "Price": [55000, 12000, 48000, 500, 15000, 450],
    "Stock": [15, 8, 22, 100, 5, 120]
}

df = pd.DataFrame(inventory_data)
print("--- Original DataFrame ---")
print(df)

Aggregation Methods for Statistics

After grouping with groupby, you can calculate categorical figures by calling statistical methods.

Calculating Various Statistics

Here is how to individually calculate major statistics: mean, sum, minimum (min), maximum (max), variance (var), and standard deviation (std).

# Group by Category
grouped = df.groupby("Category")

# Calculate Mean
mean_df = grouped.mean(numeric_only=True)
print("\n--- Mean (mean) ---")
print(mean_df)

# Calculate Sum
sum_df = grouped.sum(numeric_only=True)
print("\n--- Sum (sum) ---")
print(sum_df)

# Calculate Minimum
min_df = grouped.min()
print("\n--- Minimum (min) ---")
print(min_df)

# Calculate Maximum
max_df = grouped.max()
print("\n--- Maximum (max) ---")
print(max_df)

# Calculate Variance
var_df = grouped.var(numeric_only=True)
print("\n--- Variance (var) ---")
print(var_df)

# Calculate Standard Deviation
std_df = grouped.std(numeric_only=True)
print("\n--- Standard Deviation (std) ---")
print(std_df)

Execution Result

Running the code above yields the following output.

--- Original DataFrame ---
      Category  Price  Stock
0  Electronics  55000     15
1    Furniture  12000      8
2  Electronics  48000     22
3   Stationery    500    100
4    Furniture  15000      5
5   Stationery    450    120

--- Mean (mean) ---
               Price  Stock
Category                   
Electronics  51500.0   18.5
Furniture    13500.0    6.5
Stationery     475.0  110.0

--- Sum (sum) ---
              Price  Stock
Category                  
Electronics  103000     37
Furniture     27000     13
Stationery      950    220

--- Minimum (min) ---
             Price  Stock
Category                 
Electronics  48000     15
Furniture    12000      5
Stationery     450    100

--- Maximum (max) ---
             Price  Stock
Category                 
Electronics  55000     22
Furniture    15000      8
Stationery     500    120

--- Variance (var) ---
                  Price  Stock
Category                      
Electronics  24500000.0   24.5
Furniture     4500000.0    4.5
Stationery       1250.0  200.0

--- Standard Deviation (std) ---
                   Price      Stock
Category                           
Electronics  4949.747468   4.949747
Furniture    2121.320344   2.121320
Stationery     35.355339  14.142136

Meaning of Statistics and Supplemental Information

  • Variance: Indicates the degree of dispersion in data. It is defined as the average of the squared differences between each number and the mean (divided by n-1 for unbiased variance).
  • Standard Deviation: The positive square root of the variance. It is an indicator expressing dispersion in the same unit as the original data.

Pandas calculates unbiased variance and unbiased standard deviation by default. Using these allows for a detailed understanding of data trends by category.

よかったらシェアしてね!
  • URLをコピーしました!
  • URLをコピーしました!

この記事を書いた人

私が勉強したこと、実践したこと、してることを書いているブログです。
主に資産運用について書いていたのですが、
最近はプログラミングに興味があるので、今はそればっかりです。

目次