When reading CSV files from external systems or text files created in legacy environments, you may run into garbled text (mojibake) or decoding errors because the character encoding is unknown.
For estimating a file's encoding in Python, the chardet library is the de facto standard. In this article, I will explain how to choose the right detection approach for your file size and provide implementation code.
Table of Contents
- Installing chardet
- Detecting Encoding for Small Files
- High-Speed Detection for Large Files
- About the Result Dictionary
Installing chardet
Since chardet is an external library, install it using the following command:
pip install chardet
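To verify the installation, you can print the library version (chardet exposes a __version__ attribute):

python -c "import chardet; print(chardet.__version__)"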
1. Detecting Encoding for Small Files
For configuration files or small text data, the easiest method is to read the entire content of the file and then perform detection.
Implementation Example: Batch Detection
The following code reads the entire file in binary mode and estimates the character code from that byte sequence.
import chardet


def detect_small_file_encoding(file_path):
    """
    Detects the character encoding of a small file and displays the result.
    """
    try:
        # It is mandatory to open the file in binary mode ('rb')
        with open(file_path, mode="rb") as f:
            # Read the entire file as bytes
            raw_data = f.read()

        # Pass the bytes to chardet.detect() for detection
        result = chardet.detect(raw_data)

        # Output the results
        print(f"--- Result for {file_path} ---")
        print(f"Estimated Encoding: {result['encoding']}")
        print(f"Confidence        : {result['confidence']}")
        print(f"Language          : {result['language']}")
        print(f"Details           : {result}")
    except FileNotFoundError:
        print(f"Error: File '{file_path}' was not found.")
if __name__ == "__main__":
    # Create a dummy file for testing
    # (In practice, specify the path of an existing file you want to analyze)
    target_file = "sample_sjis.txt"
    with open(target_file, "w", encoding="shift_jis") as f:
        # Japanese text is required here: an ASCII-only string produces
        # identical bytes in Shift_JIS and ASCII, so chardet would simply
        # report 'ascii'
        f.write("これはShift_JISで書かれたサンプルテキストです。")

    detect_small_file_encoding(target_file)
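Detection is usually only half the job: once you have a result, you decode the raw bytes with it. The helper below is a minimal sketch of that pattern; the function name read_with_detected_encoding and the 0.5 confidence cutoff are my own illustrative choices, not part of chardet.

import chardet


def read_with_detected_encoding(file_path):
    """
    Reads a file as text by first detecting its encoding.
    Raises ValueError when detection is too unreliable to trust.
    """
    with open(file_path, mode="rb") as f:
        raw_data = f.read()

    result = chardet.detect(raw_data)
    encoding = result["encoding"]
    # encoding is None when chardet gives up entirely
    if encoding is None or result["confidence"] < 0.5:
        raise ValueError(f"Unreliable detection for '{file_path}': {result}")

    # Decode the bytes we already read; no second disk access is needed
    return raw_data.decode(encoding)

Reusing the bytes read for detection avoids opening the file twice, which is convenient when the file comes from a network source rather than local disk.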
2. High-Speed Detection for Large Files
For files ranging from hundreds of MB to several GB, such as log files, reading the entire file into memory for chardet.detect() can exhaust memory or take a long time.
In such cases, use the UniversalDetector class. It lets you feed data incrementally (stream processing) and stop as soon as the encoding has been determined, enabling fast, low-memory detection even for huge files.
Implementation Example: Detecting Huge Log Files
from chardet.universaldetector import UniversalDetector


def detect_large_file_encoding(file_path):
    """
    Reads a large file incrementally to efficiently detect its character encoding.
    """
    # Create an instance of the detector
    detector = UniversalDetector()
    try:
        with open(file_path, mode="rb") as f:
            # Read the file line by line (binary)
            for binary_line in f:
                # Feed data to the detector
                detector.feed(binary_line)
                # Stop as soon as detection is complete
                # (done becomes True once the encoding is determined)
                if detector.done:
                    break

        # Always call close() when data feeding is finished
        detector.close()

        # Get the result
        result = detector.result
        print(f"--- Result for {file_path} (Large File) ---")
        print(f"Estimated Encoding: {result['encoding']}")
        print(f"Confidence        : {result['confidence']}")
    except FileNotFoundError:
        print(f"Error: File '{file_path}' was not found.")
if __name__ == "__main__":
    # Dummy file for testing
    large_target = "system_log_utf8.log"
    with open(large_target, "w", encoding="utf_8") as f:
        # Non-ASCII characters are needed: a pure-ASCII log would be
        # detected as 'ascii' rather than 'utf-8'. Usually only a small
        # amount of data is required for detection.
        f.write("2023-10-01 [INFO] サーバー起動処理を開始します。\n" * 100)

    detect_large_file_encoding(large_target)
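Iterating line by line works well for log files, but a file with few or no newline bytes can yield one enormous "line" and defeat the purpose of streaming. Reading fixed-size chunks avoids this. The sketch below shows one way to write it; the function name detect_encoding_chunked and the 64 KB chunk size are my own assumptions, not chardet API.

from chardet.universaldetector import UniversalDetector


def detect_encoding_chunked(file_path, chunk_size=64 * 1024):
    """
    Feeds fixed-size chunks to UniversalDetector so memory use stays
    bounded even when the file contains no newlines.
    """
    detector = UniversalDetector()
    with open(file_path, mode="rb") as f:
        while not detector.done:
            chunk = f.read(chunk_size)
            if not chunk:  # end of file reached
                break
            detector.feed(chunk)
    detector.close()
    return detector.result

If you scan many files in one run, a single detector instance can also be reused by calling detector.reset() before each file instead of constructing a new one every time.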
About the Result Dictionary
The detection result (result) is returned as a dictionary with the following keys:
- encoding: The estimated character code (e.g., 'utf-8', 'SHIFT_JIS', 'EUC-JP'). Returns None if detection fails.
- confidence: The certainty of the detection. A floating-point number between 0.0 and 1.0; the closer to 1.0, the higher the reliability.
- language: The estimated language (e.g., 'Japanese').
When batch-processing an unspecified number of files, it is a good idea to check this confidence value: if it is low, skip the file or issue a warning instead of decoding blindly, as in the sketch below.
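As a concrete example, the following sketch applies such a threshold in a batch loop. The 0.7 cutoff and the function name detect_or_warn are illustrative values I chose for this article, so tune them for your own data.

import chardet

CONFIDENCE_THRESHOLD = 0.7  # illustrative cutoff; tune for your data


def detect_or_warn(file_path):
    """Returns the detected encoding, or None with a warning if unreliable."""
    with open(file_path, mode="rb") as f:
        result = chardet.detect(f.read())

    if result["encoding"] is None or result["confidence"] < CONFIDENCE_THRESHOLD:
        print(f"Warning: skipping '{file_path}' (unreliable result: {result})")
        return None
    return result["encoding"]


if __name__ == "__main__":
    for path in ["sample_sjis.txt", "system_log_utf8.log"]:
        encoding = detect_or_warn(path)
        if encoding is not None:
            print(f"{path}: {encoding}")

Note that this reads each file in full, so it suits batches of small files; for large files, combine the same threshold check with the UniversalDetector approach from section 2.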
