[Python] Calculating Basic Statistics with Pandas (Mean, Max, Min, etc.)

In the initial stages of data analysis, it is crucial to understand the overall picture of your data by checking basic statistics (descriptive statistics) such as the “mean,” “median,” and “standard deviation.” Pandas DataFrames provide a wealth of methods to calculate these figures.

This article explains how to calculate individual statistics and how to check major statistics all at once using the describe() method.

1. List of Major Statistics Calculation Methods

The main methods that can be called on a DataFrame or Series are as follows. By default, these are calculated for each “column.”

Method      Meaning             Remarks
count()     Number of elements  Excludes missing values (NaN).
mean()      Mean value          Arithmetic average.
median()    Median value        The middle value when the data is sorted.
mode()      Mode                The most frequently occurring value (there may be more than one).
max()       Maximum value       The largest value.
min()       Minimum value       The smallest value.
std()       Standard deviation  Measure of data dispersion; sample standard deviation (ddof=1 by default).
var()       Variance            Square of the standard deviation; unbiased sample variance (ddof=1 by default).
sample()    Random sampling     Not a statistic, but used for sampling data.
describe()  Summary statistics  Computes count, mean, std, min, quartiles, and max in one call.
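
The remarks in the table about missing values and the unbiased (ddof=1) default can be checked with a small sketch. The DataFrame `df_demo` and its values are hypothetical and are not part of the sample in the next section; `skipna`, `ddof`, and `axis` are standard pandas parameters.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in "Area"
df_demo = pd.DataFrame({
    "Price": [3500, 4200, 2800],
    "Area": [45.5, np.nan, 38.2],
})

# count() ignores NaN, so "Area" reports 2 rather than 3
counts = df_demo.count()

# mean() also skips NaN by default (skipna=True)
area_mean = df_demo["Area"].mean()

# std() defaults to ddof=1 (sample standard deviation);
# ddof=0 gives the population standard deviation instead
sample_std = df_demo["Price"].std()
population_std = df_demo["Price"].std(ddof=0)

# axis=1 computes the statistic across each row instead of each column
row_max = df_demo.max(axis=1)

print(counts)
print(area_mean, sample_std, population_std)
print(row_max)
```

With only three rows the gap between `sample_std` and `population_std` is large; it shrinks as the number of rows grows.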

2. Implementation Sample Code

Here, we calculate various statistics for a list of real estate properties (Price, Area, Age).

import pandas as pd

def calculate_statistics():
    """
    Calculate basic statistics using a Pandas DataFrame.
    """

    # 1. Create sample data: real estate properties
    # Price: price (10,000 yen), Area: floor area (m2), Age: building age (years)
    property_data = {
        "Price": [3500, 4200, 2800, 5500, 4200],
        "Area": [45.5, 60.0, 38.2, 85.0, 55.0],
        "Age": [15, 5, 25, 2, 12]
    }

    df = pd.DataFrame(property_data)

    print("--- Original dataset ---")
    print(df)
    print("\n")


    # 2. Calculate individual statistics
    print("=== Individual statistics ===")

    # Mean
    # The mean of each column is returned as a Series
    mean_val = df.mean()
    print(f"[Mean]\n{mean_val}\n")

    # Median
    median_val = df.median()
    print(f"[Median]\n{median_val}\n")

    # Mode
    # There may be more than one mode, so a single column returns a Series
    # (DataFrame.mode() returns a DataFrame)
    mode_val = df["Price"].mode()
    print(f"[Mode of Price]: {mode_val[0]} (10,000 yen units)\n")

    # Standard deviation (sample standard deviation, ddof=1)
    std_val = df.std()
    print(f"[Standard deviation]\n{std_val}\n")


    # 3. Get summary statistics in one call (describe)
    print("=== Summary statistics (describe) ===")
    # count, mean, std, min, 25%, 50%, 75%, and max are computed at once
    description = df.describe()
    print(description)
    print("\n")


    # 4. Random sampling (sample)
    print("=== Random sampling ===")
    # Extract 2 rows at random
    # Fixing random_state makes the result reproducible
    sampled_df = df.sample(n=2, random_state=1)
    print(sampled_df)

if __name__ == "__main__":
    calculate_statistics()

3. Execution Result

--- Original dataset ---
   Price  Area  Age
0   3500  45.5   15
1   4200  60.0    5
2   2800  38.2   25
3   5500  85.0    2
4   4200  55.0   12


=== Individual statistics ===
[Mean]
Price    4040.00
Area       56.74
Age        11.80
dtype: float64

[Median]
Price    4200.0
Area       55.0
Age        12.0
dtype: float64

[Mode of Price]: 4200 (10,000 yen units)

[Standard deviation]
Price    1001.498877
Area       17.904971
Age         9.038805
dtype: float64

=== Summary statistics (describe) ===
             Price       Area        Age
count     5.000000   5.000000   5.000000
mean   4040.000000  56.740000  11.800000
std    1001.498877  17.904971   9.038805
min    2800.000000  38.200000   2.000000
25%    3500.000000  45.500000   5.000000
50%    4200.000000  55.000000  12.000000
75%    4200.000000  60.000000  15.000000
max    5500.000000  85.000000  25.000000


=== Random sampling ===
   Price  Area  Age
2   2800  38.2   25
1   4200  60.0    5
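
The sample above prints only the first mode of the Price column. As a minimal sketch of what happens when several values tie, the following uses hypothetical data (the `ages` Series and the small DataFrame are illustrative only):

```python
import pandas as pd

# Hypothetical series with two values tied for most frequent
ages = pd.Series([5, 12, 5, 12, 25])

# Series.mode() returns every tied value, sorted in ascending order
modes = ages.mode()
print(modes.tolist())

# DataFrame.mode() returns a DataFrame; a column with fewer modes
# than the others is padded with NaN
df_modes = pd.DataFrame({
    "Price": [4200, 4200, 2800],
    "Age": [5, 12, 25],
}).mode()
print(df_modes)
```

Because the result can hold more than one row, it may be safer to inspect the whole result before taking only its first element.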

4. Explanation: Convenience of the describe() Method

The describe() method summarizes the distribution of each numeric column in a single call. Its output contains the following rows:

  • count: Number of data points (excluding missing values).
  • mean: Average value.
  • std: Standard deviation.
  • min / max: Minimum and Maximum values.
  • 25%, 50%, 75%: Quartiles (50% is the same as the median).
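
The rows listed above are the defaults for numeric columns. As a hedged sketch, the `percentiles` and `include` parameters of describe() can change which rows and columns appear; the `Layout` column and its values here are hypothetical.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Price": [3500, 4200, 2800, 5500, 4200],
    "Layout": ["1LDK", "2LDK", "1K", "3LDK", "2LDK"],  # hypothetical text column
})

# By default only numeric columns are summarized
numeric_summary = df_demo.describe()

# percentiles= replaces the default 25% / 50% / 75% rows
custom_summary = df_demo.describe(percentiles=[0.1, 0.5, 0.9])

# include="all" also summarizes non-numeric columns,
# reporting unique / top / freq for them
full_summary = df_demo.describe(include="all")

print(numeric_summary.columns.tolist())
print(custom_summary.index.tolist())
print(full_summary.loc[["unique", "top", "freq"], "Layout"])
```

For a text column, `top` is the most frequent value and `freq` is how often it occurs, which plays a role similar to mode() for categorical data.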

Running df.describe() immediately after loading a DataFrame is standard practice in data analysis: it shows the general distribution of each column and helps you spot outliers (for example, extreme max or min values).
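
One way to turn that outlier check into code is to compare each value against fences derived from the quartiles that describe() reports. The sketch below assumes the conventional 1.5 x IQR rule of thumb, which is a common choice rather than something describe() applies itself; the `prices` data is hypothetical.

```python
import pandas as pd

# Hypothetical prices; 9800 is deliberately extreme
prices = pd.Series([3500, 4200, 2800, 5500, 4200, 9800], name="Price")

summary = prices.describe()

# Interquartile range taken from the describe() output
iqr = summary["75%"] - summary["25%"]

# Conventional 1.5 * IQR fences (an assumed rule of thumb)
lower_fence = summary["25%"] - 1.5 * iqr
upper_fence = summary["75%"] + 1.5 * iqr

# Keep only the values that fall outside the fences
outliers = prices[(prices < lower_fence) | (prices > upper_fence)]
print(outliers)
```

Values flagged this way are candidates for inspection, not automatically errors; an expensive property may simply be a legitimate extreme.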
