In the initial stages of data analysis, it is crucial to understand the overall picture of your data by checking basic statistics (descriptive statistics) such as the “mean,” “median,” and “standard deviation.” Pandas DataFrames provide a wealth of methods to calculate these figures.
This article explains how to calculate individual statistics and how to check major statistics all at once using the describe() method.
1. List of Major Statistics Calculation Methods
The main methods that can be called on a DataFrame or Series are listed below. On a DataFrame, the statistics are computed for each column by default.
| Method | Meaning | Remarks |
| --- | --- | --- |
| count() | Number of elements | Excludes missing values (NaN). |
| mean() | Mean value | Average. |
| median() | Median value | The middle value when data is sorted. |
| mode() | Mode | The most frequently occurring value (multiple may exist). |
| max() | Maximum value | The largest value. |
| min() | Minimum value | The smallest value. |
| std() | Standard deviation | Measure of data dispersion. Uses the sample formula (ddof=1) by default. |
| var() | Variance | Square of the standard deviation. Unbiased sample variance (ddof=1) by default. |
| sample() | Random sampling | Not a statistic, but used for sampling data. |
| describe() | Summary statistics | Returns count, mean, std, min, quartiles, and max in one call. |
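The table notes that std() and var() return the unbiased (sample) versions and that mode() can return more than one value. A minimal sketch of both points follows; the `scores` Series is hypothetical data chosen so that two values tie for the mode.

```python
import pandas as pd

# Hypothetical scores, chosen so that 80 and 90 tie for the mode
scores = pd.Series([70, 80, 80, 90, 90])

# std() and var() divide by n - 1 by default (ddof=1)
sample_std = scores.std()
sample_var = scores.var()

# ddof=0 divides by n instead (population standard deviation / variance)
population_std = scores.std(ddof=0)
population_var = scores.var(ddof=0)

# mode() returns a Series because several values can tie
modes = scores.mode()

print(f"sample std: {sample_std:.4f}, population std: {population_std:.4f}")
print(f"modes: {modes.tolist()}")
```

With only a handful of rows the two standard deviations differ noticeably, so it is worth knowing which one a given tool reports (NumPy's `np.std`, for example, defaults to ddof=0).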
2. Implementation Sample Code
Here, we will calculate various statistics using a list of real estate properties (Price, Area, Age) as the subject.
import pandas as pd


def calculate_statistics():
    """
    Calculate basic statistics with a pandas DataFrame.
    """
    # 1. Create sample data: real estate listings
    # Price: price (10,000 yen), Area: floor area (m2), Age: building age (years)
    property_data = {
        "Price": [3500, 4200, 2800, 5500, 4200],
        "Area": [45.5, 60.0, 38.2, 85.0, 55.0],
        "Age": [15, 5, 25, 2, 12],
    }
    df = pd.DataFrame(property_data)

    print("--- Original dataset ---")
    print(df)
    print("\n")

    # 2. Calculate individual statistics
    print("=== Individual statistics ===")

    # Mean: the mean of each column is returned as a Series
    mean_val = df.mean()
    print(f"[Mean]\n{mean_val}\n")

    # Median
    median_val = df.median()
    print(f"[Median]\n{median_val}\n")

    # Mode: several values can tie, so a Series is returned
    # even when there is only one mode
    mode_val = df["Price"].mode()
    print(f"[Mode of Price]: {mode_val.iloc[0]} (unit: 10,000 yen)\n")

    # Standard deviation (sample standard deviation, ddof=1)
    std_val = df.std()
    print(f"[Standard deviation]\n{std_val}\n")

    # 3. Get summary statistics all at once (describe)
    print("=== Summary statistics (describe) ===")
    # count, mean, std, min, 25%, 50%, 75%, and max are calculated in one call
    description = df.describe()
    print(description)
    print("\n")

    # 4. Random sampling (sample)
    print("=== Random sampling ===")
    # Draw 2 rows at random;
    # fixing random_state keeps the result reproducible
    sampled_df = df.sample(n=2, random_state=1)
    print(sampled_df)


if __name__ == "__main__":
    calculate_statistics()
3. Execution Result
--- Original dataset ---
   Price  Area  Age
0   3500  45.5   15
1   4200  60.0    5
2   2800  38.2   25
3   5500  85.0    2
4   4200  55.0   12


=== Individual statistics ===
[Mean]
Price    4040.00
Area       56.74
Age        11.80
dtype: float64

[Median]
Price    4200.0
Area       55.0
Age        12.0
dtype: float64

[Mode of Price]: 4200 (unit: 10,000 yen)

[Standard deviation]
Price    1001.498877
Area       17.904971
Age         9.038805
dtype: float64

=== Summary statistics (describe) ===
             Price       Area        Age
count     5.000000   5.000000   5.000000
mean   4040.000000  56.740000  11.800000
std    1001.498877  17.904971   9.038805
min    2800.000000  38.200000   2.000000
25%    3500.000000  45.500000   5.000000
50%    4200.000000  55.000000  12.000000
75%    4200.000000  60.000000  15.000000
max    5500.000000  85.000000  25.000000


=== Random sampling ===
   Price  Area  Age
2   2800  38.2   25
1   4200  60.0    5
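The sample data has no missing values, so every column reports a count of 5. As a rough sketch of how the same methods behave when NaN is present, the snippet below uses a hypothetical, shortened variant of the property data (`df_nan`) with one missing Area value.

```python
import numpy as np
import pandas as pd

# Hypothetical variant of the property data with one missing Area value
df_nan = pd.DataFrame({
    "Price": [3500, 4200, 2800],
    "Area": [45.5, np.nan, 38.2],
})

# count() ignores NaN, so Area reports one fewer element than Price
counts = df_nan.count()

# mean() skips NaN by default (skipna=True)
area_mean = df_nan["Area"].mean()

# With skipna=False, a single NaN makes the result NaN
area_mean_strict = df_nan["Area"].mean(skipna=False)

print(counts)
print(area_mean, area_mean_strict)
```

Comparing count() against len(df) is a quick way to see which columns contain missing values before interpreting the other statistics.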
4. Explanation: Convenience of the describe() Method
The describe() method summarizes the distribution of every numeric column in a single call. Each row of its output has the following meaning:
- count: Number of data points (excluding missing values).
- mean: Average value.
- std: Standard deviation.
- min / max: Minimum and Maximum values.
- 25%, 50%, 75%: Quartiles (50% is the same as the median).
Running df.describe() right after loading data, to check the overall distribution and to spot outliers (for example, extreme min or max values), is standard practice in data analysis.
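One way to turn the outlier check into code is to build thresholds from the quartile rows of the describe() output. The sketch below applies the common 1.5 × IQR rule to the property data. This rule is an assumption added here for illustration, not part of describe() itself, and with only five rows the flags are not statistically meaningful.

```python
import pandas as pd

df = pd.DataFrame({
    "Price": [3500, 4200, 2800, 5500, 4200],
    "Area": [45.5, 60.0, 38.2, 85.0, 55.0],
    "Age": [15, 5, 25, 2, 12],
})

summary = df.describe()

# Interquartile range (IQR) per column, read from the describe() rows
iqr = summary.loc["75%"] - summary.loc["25%"]

# Thresholds of the 1.5 x IQR rule (an illustrative choice of cutoff)
lower = summary.loc["25%"] - 1.5 * iqr
upper = summary.loc["75%"] + 1.5 * iqr

# Boolean mask of values outside [lower, upper], evaluated column by column
outliers = (df < lower) | (df > upper)

# Number of flagged values in each column
outlier_counts = outliers.sum()
print(outlier_counts)
```

Here the 5500 price and the 85.0 m2 area fall above their upper thresholds, while no Age value is flagged.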
