When reading CSV files from external systems or text files created in legacy environments, you may run into garbled text (mojibake) or decoding errors because the character encoding is unknown.
For estimating a file's encoding in Python, the chardet library is the de facto standard. In this article, I will explain how to choose the right detection approach for your file size and provide implementation code.
Table of Contents
- Installing chardet
- Detecting Encoding for Small Files
- High-Speed Detection for Large Files
- About the Result Dictionary
Installing chardet
Since chardet is an external library, install it using the following command:
pip install chardet
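To verify the installation, you can print the library version (chardet exposes a __version__ attribute):

python -c "import chardet; print(chardet.__version__)"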
1. Detecting Encoding for Small Files
For configuration files or small text data, the easiest method is to read the entire content of the file and then perform detection.
Implementation Example: Batch Detection
The following code reads the entire file in binary mode and estimates the character code from that byte sequence.
import chardet


def detect_small_file_encoding(file_path):
    """
    Detects the character encoding of a small file and displays the result.
    """
    try:
        # It is mandatory to open the file in binary mode ('rb')
        with open(file_path, mode="rb") as f:
            # Read the entire file as bytes
            raw_data = f.read()

        # Pass the bytes to chardet.detect() for detection
        result = chardet.detect(raw_data)

        # Output the results
        print(f"--- Result for {file_path} ---")
        print(f"Estimated Encoding: {result['encoding']}")
        print(f"Confidence        : {result['confidence']}")
        print(f"Language          : {result['language']}")
        print(f"Details           : {result}")
    except FileNotFoundError:
        print(f"Error: File '{file_path}' was not found.")
if __name__ == "__main__":
    # Create a dummy file for testing
    # (In practice, specify the path of an existing file you want to analyze)
    target_file = "sample_sjis.txt"
    with open(target_file, "w", encoding="shift_jis") as f:
        # Japanese text is required here: an ASCII-only string produces
        # identical bytes in Shift_JIS and ASCII, so chardet would simply
        # report 'ascii'
        f.write("これはShift_JISで書かれたサンプルテキストです。")

    detect_small_file_encoding(target_file)
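Detection is usually only half the job: once you have a result, you decode the raw bytes with it. The helper below is a minimal sketch of that pattern; the function name read_with_detected_encoding and the 0.5 confidence cutoff are my own illustrative choices, not part of chardet.

import chardet


def read_with_detected_encoding(file_path):
    """
    Reads a file as text by first detecting its encoding.
    Raises ValueError when detection is too unreliable to trust.
    """
    with open(file_path, mode="rb") as f:
        raw_data = f.read()

    result = chardet.detect(raw_data)
    encoding = result["encoding"]
    # encoding is None when chardet gives up entirely
    if encoding is None or result["confidence"] < 0.5:
        raise ValueError(f"Unreliable detection for '{file_path}': {result}")

    # Decode the bytes we already read; no second disk access is needed
    return raw_data.decode(encoding)

Reusing the bytes read for detection avoids opening the file twice, which is convenient when the file comes from a network source rather than local disk.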
2. High-Speed Detection for Large Files
For files ranging from hundreds of MB to several GB, such as log files, reading the entire file into memory for chardet.detect() can exhaust memory or take a long time.
In such cases, use the UniversalDetector class. It lets you feed data incrementally (stream processing) and stop as soon as the encoding has been determined, enabling fast, low-memory detection even for huge files.
Implementation Example: Detecting Huge Log Files
from chardet.universaldetector import UniversalDetector


def detect_large_file_encoding(file_path):
    """
    Reads a large file incrementally to efficiently detect its character encoding.
    """
    # Create an instance of the detector
    detector = UniversalDetector()
    try:
        with open(file_path, mode="rb") as f:
            # Read the file line by line (binary)
            for binary_line in f:
                # Feed data to the detector
                detector.feed(binary_line)
                # Stop as soon as detection is complete
                # (done becomes True once the encoding is determined)
                if detector.done:
                    break

        # Always call close() when data feeding is finished
        detector.close()

        # Get the result
        result = detector.result
        print(f"--- Result for {file_path} (Large File) ---")
        print(f"Estimated Encoding: {result['encoding']}")
        print(f"Confidence        : {result['confidence']}")
    except FileNotFoundError:
        print(f"Error: File '{file_path}' was not found.")
if __name__ == "__main__":
    # Dummy file for testing
    large_target = "system_log_utf8.log"
    with open(large_target, "w", encoding="utf_8") as f:
        # Non-ASCII characters are needed: a pure-ASCII log would be
        # detected as 'ascii' rather than 'utf-8'. Usually only a small
        # amount of data is required for detection.
        f.write("2023-10-01 [INFO] サーバー起動処理を開始します。\n" * 100)

    detect_large_file_encoding(large_target)
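Iterating line by line works well for log files, but a file with few or no newline bytes can yield one enormous "line" and defeat the purpose of streaming. Reading fixed-size chunks avoids this. The sketch below shows one way to write it; the function name detect_encoding_chunked and the 64 KB chunk size are my own assumptions, not chardet API.

from chardet.universaldetector import UniversalDetector


def detect_encoding_chunked(file_path, chunk_size=64 * 1024):
    """
    Feeds fixed-size chunks to UniversalDetector so memory use stays
    bounded even when the file contains no newlines.
    """
    detector = UniversalDetector()
    with open(file_path, mode="rb") as f:
        while not detector.done:
            chunk = f.read(chunk_size)
            if not chunk:  # end of file reached
                break
            detector.feed(chunk)
    detector.close()
    return detector.result

If you scan many files in one run, a single detector instance can also be reused by calling detector.reset() before each file instead of constructing a new one every time.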
About the Result Dictionary
The detection result (result) is returned as a dictionary with the following keys:
- encoding: The estimated character code (e.g., 'utf-8', 'SHIFT_JIS', 'EUC-JP'). Returns None if detection fails.
- confidence: The certainty of the detection. A floating-point number between 0.0 and 1.0; the closer to 1.0, the higher the reliability.
- language: The estimated language (e.g., 'Japanese').
When batch-processing an unspecified number of files, it is a good idea to check this confidence value: if it is low, skip the file or issue a warning instead of decoding blindly, as in the sketch below.
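As a concrete example, the following sketch applies such a threshold in a batch loop. The 0.7 cutoff and the function name detect_or_warn are illustrative values I chose for this article, so tune them for your own data.

import chardet

CONFIDENCE_THRESHOLD = 0.7  # illustrative cutoff; tune for your data


def detect_or_warn(file_path):
    """Returns the detected encoding, or None with a warning if unreliable."""
    with open(file_path, mode="rb") as f:
        result = chardet.detect(f.read())

    if result["encoding"] is None or result["confidence"] < CONFIDENCE_THRESHOLD:
        print(f"Warning: skipping '{file_path}' (unreliable result: {result})")
        return None
    return result["encoding"]


if __name__ == "__main__":
    for path in ["sample_sjis.txt", "system_log_utf8.log"]:
        encoding = detect_or_warn(path)
        if encoding is not None:
            print(f"{path}: {encoding}")

Note that this reads each file in full, so it suits batches of small files; for large files, combine the same threshold check with the UniversalDetector approach from section 2.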
