[Python] Filling Missing Values in Pandas (fillna / ffill / bfill)

In data analysis, simply dropping missing values (NaN) can discard useful data and break the continuity of a time series. For that reason, it is common to fill (impute) the gaps with “0”, the “mean”, or “surrounding values”, depending on the nature of the data.

This article explains how to fill missing values with appropriate values using the Pandas fillna, ffill, and bfill methods.

Basic Strategy for Filling Missing Values

The appropriate method for filling depends on the type of data.

  • Filling with a fixed value: Using numerical 0 or the string "Unknown".
  • Filling with statistics: Using the mean, median, or mode. This keeps the column's typical value intact, although it narrows the spread because every gap receives the same number.
  • Filling with surrounding values: Effective for time-series data where it can be assumed that the most recent state continues.
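
The first two strategies can be combined in a single call, because fillna also accepts a dictionary that maps column names to fill values. The following is a minimal sketch under stated assumptions: the small DataFrame and its "Status" column are hypothetical and exist only to show a text column next to the numeric ones.

```python
import numpy as np
import pandas as pd

# Hypothetical monitoring data; "Status" is an illustrative text column.
df = pd.DataFrame({
    "CPU_Usage": [45.0, np.nan, 52.0],
    "Memory_GB": [12.0, 12.5, np.nan],
    "Status": ["ok", None, "ok"],
})

# One fill value per column: a fixed number, a statistic, and a fixed string.
fill_values = {
    "CPU_Usage": 0.0,                       # fixed value
    "Memory_GB": df["Memory_GB"].median(),  # statistic computed from the observed values
    "Status": "Unknown",                    # fixed string for a text column
}
filled = df.fillna(value=fill_values)
print(filled)
```

Columns that are not listed in the dictionary are left untouched, so this form is convenient when only some columns should be imputed.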

Implementation Sample Code

Here we use a web server’s monitoring logs (CPU usage, memory usage, active session count) as sample data. We start from a state where some readings are missing because of communication errors or similar issues.

import pandas as pd
import numpy as np

def demonstrate_fillna():
    """
    Function to demonstrate various methods for filling missing values using Pandas
    """
    
    # 1. Create Sample Data (Server Monitoring Logs)
    # CPU: Percentage
    # Memory: Usage (GB)
    # Sessions: Connection count
    server_logs = {
        "CPU_Usage": [45.0, 50.5, np.nan, 52.0, np.nan],
        "Memory_GB": [12.0, np.nan, 12.5, np.nan, 13.0],
        "Sessions":  [100, 100, 105, np.nan, 120]
    }
    
    df = pd.DataFrame(server_logs)
    
    print("--- Original Dataset (With Missing Values) ---")
    print(df)
    print("\n")


    # 2. Filling with a Fixed Value (e.g., 0)
    print("=== Filling with Fixed Value (fillna) ===")
    
    # Fill all missing values with 0
    # Used when "no data" implies "not running"
    df_fill_zero = df.fillna(0)
    
    print("--- Filled with 0 ---")
    print(df_fill_zero)
    
    # Filling only a specific column with a fixed value
    # Process a copy to avoid affecting the original DataFrame
    df_col_fill = df.copy()
    df_col_fill["CPU_Usage"] = df_col_fill["CPU_Usage"].fillna(0)
    print("\n--- CPU Column only filled with 0 ---")
    print(df_col_fill["CPU_Usage"])
    print("\n")


    # 3. Filling with Statistics (Mean, Median, Mode)
    print("=== Filling with Statistics ===")
    
    # Fill with Mean
    # Keeps each column's mean unchanged (but reduces its variance)
    df_fill_mean = df.fillna(df.mean())
    print("--- Filled with Mean ---")
    print(df_fill_mean)
    print("(CPU Mean: {:.2f}, Mem Mean: {:.2f})".format(df["CPU_Usage"].mean(), df["Memory_GB"].mean()))

    # Fill with Median
    # Less affected by outliers
    df_fill_median = df.fillna(df.median())
    print("\n--- Filled with Median ---")
    print(df_fill_median)

    # Fill with Mode
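    # Used when you want to adopt the most frequent value (e.g., session counts)
    # mode() returns a DataFrame (ties produce several rows), so take the first row with iloc[0]
    # If every value in a column is unique, the first row is simply the smallest value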
    mode_values = df.mode().iloc[0]
    df_fill_mode = df.fillna(mode_values)
    print("\n--- Filled with Mode ---")
    print(df_fill_mode)
    print("\n")


    # 4. Filling with Surrounding Values (For Time Series)
    print("=== Filling with Surrounding Values (ffill / bfill) ===")
    
    # Forward Fill (ffill)
    # Copies the last valid value forward
    # Valid when assuming "the log was interrupted, but the previous state continues"
    df_ffill = df.ffill()
    print("--- Forward Fill (ffill) ---")
    print(df_ffill)
    
    # Backward Fill (bfill)
    # Copies the next valid value backward
    df_bfill = df.bfill()
    print("\n--- Backward Fill (bfill) ---")
    print(df_bfill)

if __name__ == "__main__":
    demonstrate_fillna()

Execution Results

--- Original Dataset (With Missing Values) ---
   CPU_Usage  Memory_GB  Sessions
0       45.0       12.0     100.0
1       50.5        NaN     100.0
2        NaN       12.5     105.0
3       52.0        NaN       NaN
4        NaN       13.0     120.0


=== Filling with Fixed Value (fillna) ===
--- Filled with 0 ---
   CPU_Usage  Memory_GB  Sessions
0       45.0       12.0     100.0
1       50.5        0.0     100.0
2        0.0       12.5     105.0
3       52.0        0.0       0.0
4        0.0       13.0     120.0

--- CPU Column only filled with 0 ---
0    45.0
1    50.5
2     0.0
3    52.0
4     0.0
Name: CPU_Usage, dtype: float64


=== Filling with Statistics ===
--- Filled with Mean ---
   CPU_Usage  Memory_GB  Sessions
0  45.000000      12.00    100.00
1  50.500000      12.50    100.00
2  49.166667      12.50    105.00
3  52.000000      12.50    106.25
4  49.166667      13.00    120.00
(CPU Mean: 49.17, Mem Mean: 12.50)

--- Filled with Median ---
   CPU_Usage  Memory_GB  Sessions
0       45.0       12.0     100.0
1       50.5       12.5     100.0
2       50.5       12.5     105.0
3       52.0       12.5     102.5
4       50.5       13.0     120.0

--- Filled with Mode ---
   CPU_Usage  Memory_GB  Sessions
0       45.0       12.0     100.0
1       50.5       12.0     100.0
2       45.0       12.5     105.0
3       52.0       12.0     100.0
4       45.0       13.0     120.0


=== Filling with Surrounding Values (ffill / bfill) ===
--- Forward Fill (ffill) ---
   CPU_Usage  Memory_GB  Sessions
0       45.0       12.0     100.0
1       50.5       12.0     100.0
2       50.5       12.5     105.0
3       52.0       12.5     105.0
4       52.0       13.0     120.0

--- Backward Fill (bfill) ---
   CPU_Usage  Memory_GB  Sessions
0       45.0       12.0     100.0
1       50.5       12.5     100.0
2       52.0       12.5     105.0
3       52.0       13.0     120.0
4        NaN       13.0     120.0
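
As a side check on the mean-filled result, the sketch below compares the CPU_Usage column before and after filling with its mean. It reuses the sample values from the code above; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

cpu = pd.Series([45.0, 50.5, np.nan, 52.0, np.nan], name="CPU_Usage")
cpu_mean_filled = cpu.fillna(cpu.mean())

# The mean is unchanged, because every filled value equals the mean itself.
print(cpu.mean(), cpu_mean_filled.mean())

# The standard deviation shrinks, because the filled values add no spread.
print(cpu.std(), cpu_mean_filled.std())
```

This is worth keeping in mind when a later step depends on the variability of the data and not only on its average.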

Explanation: Behavior of ffill and bfill

In time-series data processing, ffill (forward fill) and bfill (backward fill) are the usual way to carry a neighboring observation into a gap.

  • ffill: Fills by assuming “the value of the previous time period continues.” Looking at index 1 (Memory_GB) in the execution result, you can see that the value of the immediately preceding index 0 (12.0) has been copied.
  • bfill: Used when you want to “infer from the value of the subsequent time period.” However, if no later value exists, as at the end of the data (CPU_Usage at index 4), the cell remains NaN.
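
The edge cases described above can be sketched as follows. The series is hypothetical (it adds a leading gap that the article's sample data does not have), and chaining the two methods is one possible way to handle gaps at both ends, not the only one. The limit parameter caps how many consecutive NaN values are filled.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a leading gap, an inner gap, and a trailing gap.
cpu = pd.Series([np.nan, 45.0, np.nan, np.nan, 52.0, np.nan], name="CPU_Usage")

# ffill alone cannot fill the leading gap; bfill alone cannot fill the trailing gap.
print(cpu.ffill())
print(cpu.bfill())

# Chaining the two lets the second pass fill what the first could not reach.
both = cpu.ffill().bfill()
print(both)

# limit=1 fills at most one consecutive NaN after each valid value.
limited = cpu.ffill(limit=1)
print(limited)
```

Using limit is a reasonable safeguard when a long outage should stay visible as missing instead of being papered over with a stale value.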