During data analysis, simply dropping rows that contain missing values (NaN) risks discarding important data and breaking the continuity of a time series. It is therefore common to fill (impute) these gaps with “0”, the “mean”, or “surrounding values”, depending on the nature of the data.
This article explains how to fill missing values with appropriate values using the Pandas fillna, ffill, and bfill methods.
Basic Strategy for Filling Missing Values
The appropriate method for filling depends on the type of data.
- Filling with a fixed value: Using the number 0 or the string "Unknown".
- Filling with statistics: Using the mean, median, or mode. This keeps the filled values close to the column's typical value, although filling many gaps with a single statistic reduces the column's variance.
- Filling with surrounding values: Effective for time-series data where it can be assumed that the most recent state continues.
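The three strategies above can be mixed within one DataFrame. The following is a minimal sketch, assuming a small table whose "Host" column and values are invented purely for illustration. It relies on the fact that fillna accepts a dict mapping column names to fill values.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: the "Host" column and all values are
# assumptions made for this illustration only.
df = pd.DataFrame({
    "Host": ["web-01", None, "web-03"],
    "CPU_Usage": [45.0, np.nan, 52.0],
    "Sessions": [100.0, 100.0, np.nan],
})

# Fixed value for "Host", a statistic (median) for "CPU_Usage".
# fillna with a dict only touches the columns named in the dict.
filled = df.fillna({
    "Host": "Unknown",
    "CPU_Usage": df["CPU_Usage"].median(),
})

# Surrounding value for "Sessions": carry the last valid count forward.
filled["Sessions"] = filled["Sessions"].ffill()

print(filled)
```

Choosing the strategy per column keeps each decision explicit, which tends to be easier to review than a single blanket fill.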
Implementation Sample Code
Here, we will use a web server’s monitoring logs (CPU usage, memory usage, active session count) as the subject. We start with a state where some data is missing due to communication errors or similar issues.
import pandas as pd
import numpy as np


def demonstrate_fillna():
    """
    Function to demonstrate various methods for filling missing values using Pandas
    """
    # 1. Create Sample Data (Server Monitoring Logs)
    # CPU: Percentage
    # Memory: Usage (GB)
    # Sessions: Connection count
    server_logs = {
        "CPU_Usage": [45.0, 50.5, np.nan, 52.0, np.nan],
        "Memory_GB": [12.0, np.nan, 12.5, np.nan, 13.0],
        "Sessions": [100, 100, 105, np.nan, 120]
    }
    df = pd.DataFrame(server_logs)
    print("--- Original Dataset (With Missing Values) ---")
    print(df)
    print("\n")

    # 2. Filling with a Fixed Value (e.g., 0)
    print("=== Filling with Fixed Value (fillna) ===")
    # Fill all missing values with 0
    # Used when "no data" implies "not running"
    df_fill_zero = df.fillna(0)
    print("--- Filled with 0 ---")
    print(df_fill_zero)

    # Filling only a specific column with a fixed value
    # Process a copy to avoid affecting the original DataFrame
    df_col_fill = df.copy()
    df_col_fill["CPU_Usage"] = df_col_fill["CPU_Usage"].fillna(0)
    print("\n--- CPU Column only filled with 0 ---")
    print(df_col_fill["CPU_Usage"])
    print("\n")

    # 3. Filling with Statistics (Mean, Median, Mode)
    print("=== Filling with Statistics ===")
    # Fill with Mean
    # Common when you want to maintain data distribution
    df_fill_mean = df.fillna(df.mean())
    print("--- Filled with Mean ---")
    print(df_fill_mean)
    print("(CPU Mean: {:.2f}, Mem Mean: {:.2f})".format(df["CPU_Usage"].mean(), df["Memory_GB"].mean()))

    # Fill with Median
    # Less affected by outliers
    df_fill_median = df.fillna(df.median())
    print("\n--- Filled with Median ---")
    print(df_fill_median)

    # Fill with Mode
    # Used when you want to adopt the most frequent value (e.g., session counts)
    # mode() returns a DataFrame, so we need to get the first row with iloc[0]
    mode_values = df.mode().iloc[0]
    df_fill_mode = df.fillna(mode_values)
    print("\n--- Filled with Mode ---")
    print(df_fill_mode)
    print("\n")

    # 4. Filling with Surrounding Values (For Time Series)
    print("=== Filling with Surrounding Values (ffill / bfill) ===")
    # Forward Fill (ffill)
    # Copies the last valid value forward
    # Valid when assuming "the log was interrupted, but the previous state continues"
    df_ffill = df.ffill()
    print("--- Forward Fill (ffill) ---")
    print(df_ffill)

    # Backward Fill (bfill)
    # Copies the next valid value backward
    df_bfill = df.bfill()
    print("\n--- Backward Fill (bfill) ---")
    print(df_bfill)


if __name__ == "__main__":
    demonstrate_fillna()
Execution Results
--- Original Dataset (With Missing Values) ---
   CPU_Usage  Memory_GB  Sessions
0       45.0       12.0     100.0
1       50.5        NaN     100.0
2        NaN       12.5     105.0
3       52.0        NaN       NaN
4        NaN       13.0     120.0


=== Filling with Fixed Value (fillna) ===
--- Filled with 0 ---
   CPU_Usage  Memory_GB  Sessions
0       45.0       12.0     100.0
1       50.5        0.0     100.0
2        0.0       12.5     105.0
3       52.0        0.0       0.0
4        0.0       13.0     120.0

--- CPU Column only filled with 0 ---
0    45.0
1    50.5
2     0.0
3    52.0
4     0.0
Name: CPU_Usage, dtype: float64


=== Filling with Statistics ===
--- Filled with Mean ---
   CPU_Usage  Memory_GB  Sessions
0  45.000000      12.00    100.00
1  50.500000      12.50    100.00
2  49.166667      12.50    105.00
3  52.000000      12.50    106.25
4  49.166667      13.00    120.00
(CPU Mean: 49.17, Mem Mean: 12.50)

--- Filled with Median ---
   CPU_Usage  Memory_GB  Sessions
0       45.0       12.0     100.0
1       50.5       12.5     100.0
2       50.5       12.5     105.0
3       52.0       12.5     102.5
4       50.5       13.0     120.0

--- Filled with Mode ---
   CPU_Usage  Memory_GB  Sessions
0       45.0       12.0     100.0
1       50.5       12.0     100.0
2       45.0       12.5     105.0
3       52.0       12.0     100.0
4       45.0       13.0     120.0


=== Filling with Surrounding Values (ffill / bfill) ===
--- Forward Fill (ffill) ---
   CPU_Usage  Memory_GB  Sessions
0       45.0       12.0     100.0
1       50.5       12.0     100.0
2       50.5       12.5     105.0
3       52.0       12.5     105.0
4       52.0       13.0     120.0

--- Backward Fill (bfill) ---
   CPU_Usage  Memory_GB  Sessions
0       45.0       12.0     100.0
1       50.5       12.5     100.0
2       52.0       12.5     105.0
3       52.0       13.0     120.0
4        NaN       13.0     120.0
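A note on the mean-filled result: filling with the mean leaves the column's mean unchanged, but because every gap receives the same value, the spread of the column shrinks. The short check below is only a sketch that reuses the CPU_Usage values from the sample data. The variable names are chosen for illustration.

```python
import numpy as np
import pandas as pd

# Same CPU_Usage values as in the sample data above.
cpu = pd.Series([45.0, 50.5, np.nan, 52.0, np.nan], name="CPU_Usage")

# Fill the gaps with the mean of the observed values.
cpu_filled = cpu.fillna(cpu.mean())

# The mean is preserved (NaN is skipped when computing it).
print("mean before/after:", cpu.mean(), cpu_filled.mean())

# The standard deviation drops, because the filled points
# sit exactly on the mean and add no spread.
print("std before/after: ", cpu.std(), cpu_filled.std())
```

If the variance of a column matters for later analysis, this effect is worth checking before settling on a statistic-based fill.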
Explanation: Behavior of ffill and bfill
In time-series data processing, ffill (forward fill) and bfill (backward fill) are very important.
- ffill: Fills by assuming “the value of the previous time period continues.” Looking at index 1 (Memory_GB) in the execution result, you can see that the value of the immediately preceding index 0 (12.0) has been copied.
- bfill: Used when you want to “infer from the value of the subsequent time period.” However, be aware that if no value exists afterwards, as with the end of the data (CPU_Usage at index 4), it remains NaN.
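Two practical follow-ups to the behavior described above, shown as a minimal sketch rather than a recommended recipe: chaining bfill and ffill closes a gap at the end of the data, and the limit parameter caps how far a value is propagated. The second series (gappy) is made up for illustration.

```python
import numpy as np
import pandas as pd

# Same CPU_Usage values as in the sample data above.
cpu = pd.Series([45.0, 50.5, np.nan, 52.0, np.nan], name="CPU_Usage")

# bfill alone leaves the trailing NaN at index 4;
# a follow-up ffill fills it from the last valid value.
both = cpu.bfill().ffill()
print(both)

# Hypothetical series with a long run of missing values.
gappy = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# limit=1 fills at most one consecutive NaN after each valid value,
# so a long outage is not silently papered over.
limited = gappy.ffill(limit=1)
print(limited)
```

Whether a long gap should be filled at all depends on the data. Leaving the remaining NaN values visible is often the safer default for monitoring logs.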
