[Python] Avoiding Garbled Text with Requests: Manual Setting and Automatic Detection of Response Encoding

The Python requests library automatically guesses the character encoding from the HTTP response headers. However, if the header lacks charset information or declares the wrong one (e.g., a Shift_JIS site being detected as ISO-8859-1), accessing r.text produces "garbled text" (mojibake).

In such cases, you can resolve the issue by manually assigning the correct encoding name to the r.encoding property.


Executable Sample Code

The following code demonstrates retrieving data from a website and manually setting the encoding to obtain correct text. It also introduces apparent_encoding, which makes a more accurate guess of the encoding from the content itself.

import requests

def fetch_and_fix_encoding():
    # Assuming older sites or specific Japanese sites
    # (Here, we use Yahoo! JAPAN's top page as an example)
    target_url = "https://www.yahoo.co.jp/"

    try:
        print(f"Connecting to: {target_url}")
        response = requests.get(target_url, timeout=10)

        # -------------------------------------------------------
        # 1. Check Default Behavior
        # -------------------------------------------------------
        # requests estimates encoding from the Content-Type header
        print(f"\n[Default Detection] Encoding: {response.encoding}")
        
        # -------------------------------------------------------
        # 2. Handling Garbled Text (Manual Setting)
        # -------------------------------------------------------
        # If the encoding is unintended (like ISO-8859-1), 
        # assign the correct encoding string before accessing .text.
        
        # Example: Explicitly specifying UTF-8
        # response.encoding = 'utf-8'
        
        # Example: Old Windows-based sites (Shift_JIS/CP932)
        # response.encoding = 'cp932'

        # [Recommended] Apply encoding estimated from the content body
        # apparent_encoding uses a library (chardet/charset_normalizer) 
        # to analyze the byte sequence and return the result.
        detected_encoding = response.apparent_encoding
        print(f"[Apparent Detection] Encoding: {detected_encoding}")

        # Adopt the estimated encoding as the official setting
        response.encoding = detected_encoding

        # -------------------------------------------------------
        # 3. Retrieve Text
        # -------------------------------------------------------
        # When .text is referenced after the setting change, 
        # it is decoded with the correct character code.
        page_title = extract_title(response.text)
        print(f"\nPage Title: {page_title}")

    except requests.RequestException as e:
        print(f"Error: {e}")

def extract_title(html_content):
    """Helper function to simply extract the content of the title tag from HTML"""
    try:
        # Simple string search (BeautifulSoup is recommended for actual work)
        start_tag = "<title>"
        end_tag = "</title>"
        start_idx = html_content.find(start_tag)
        end_idx = html_content.find(end_tag)
        
        if start_idx != -1 and end_idx != -1:
            return html_content[start_idx + len(start_tag) : end_idx]
        return "Title not found"
    except Exception:
        return "Extraction failed"

if __name__ == "__main__":
    fetch_and_fix_encoding()

Explanation: Why Text Gets Garbled and How to Fix It

1. Automatic Detection Logic of Requests

requests determines r.encoding by looking at the Content-Type header sent from the server (e.g., text/html; charset=utf-8).

However, if there is no charset parameter in this header, requests falls back to ISO-8859-1 (Latin-1, Western European) for text/* responses, following the old HTTP standard (RFC 2616). When this happens on a Japanese site, all the Japanese text becomes garbled.
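This failure mode can be reproduced without any network access. In the sketch below, the byte string stands in for a UTF-8 response body (the sample text is chosen purely for illustration); decoding it as ISO-8859-1 yields mojibake, while decoding it as UTF-8 recovers the original text.

```python
# Simulate a UTF-8 response body that arrived without a charset header.
body = "こんにちは、世界".encode("utf-8")  # raw bytes as they come off the wire

garbled = body.decode("iso-8859-1")  # the RFC 2616 fallback: mojibake
correct = body.decode("utf-8")       # the intended decoding

print(garbled)   # a run of accented Latin-1 characters
print(correct)   # こんにちは、世界
```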

2. Assigning to r.encoding

The r.text property decodes r.content (byte sequence) using the current value of r.encoding every time it is accessed. Therefore, by rewriting r.encoding before reading r.text, you can change the decoding method.

# Example of manual setting
r.encoding = 'utf-8'   # General websites
r.encoding = 'cp932'   # Old Japanese Windows-based sites (Shift_JIS extension)
r.encoding = 'euc-jp'  # Old UNIX-based sites
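The re-decoding behavior can also be observed offline by building a Response object by hand. Note that _content is a private attribute of requests, set directly here purely for illustration; in real code the bytes come from the server.

```python
from requests.models import Response

resp = Response()
# Pretend these bytes came from an old Shift_JIS (CP932) site.
resp._content = "日本語のページ".encode("cp932")  # private attribute; illustration only
resp.encoding = "iso-8859-1"  # a wrong header-based guess

print(resp.text)         # garbled: the bytes are decoded as Latin-1

resp.encoding = "cp932"  # fix the encoding...
print(resp.text)         # ...and .text is re-decoded correctly on the next access
```

Because .text is recomputed from r.content on every access, changing r.encoding at any point before reading it is enough.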

3. Utilizing r.apparent_encoding

If the encoding is unknown, it is standard practice to use r.apparent_encoding. This analyzes the byte-sequence statistics of the response body and usually returns the correct encoding (such as Windows-1252 or UTF-8).

# Standard code for fixing garbled text
r.encoding = r.apparent_encoding
print(r.text)
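The detection itself can be exercised offline as well, again by feeding bytes into a hand-built Response (the private _content attribute and the sample sentence are assumptions for illustration):

```python
from requests.models import Response

resp = Response()
# A UTF-8 body with no charset information attached.
resp._content = "これは文字コード判定のテスト用の日本語テキストです。".encode("utf-8")

detected = resp.apparent_encoding  # chardet/charset_normalizer result, e.g. 'utf-8'
resp.encoding = detected
print(detected)
print(resp.text)  # decoded with the detected encoding
```

Detection is statistical, so for very short bodies the guess can be wrong; with a reasonable amount of valid UTF-8 text it is highly reliable.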