[Python] Converting Between Strings (str) and Bytes (bytes) and Specifying Encodings

In network communication, reading and writing binary files, or integrating with legacy systems, converting between human-readable strings (str) and computer-readable bytes (bytes) is an unavoidable process.

Python uses the encode() and decode() methods to perform these conversions. Especially in Japanese environments, failing to specify the appropriate character code (encoding)—such as shift_jis, utf_8, or utf_8_sig (with Byte Order Mark)—can lead to “garbled text” or errors.

In this article, I will explain how to properly convert data between different character codes, assuming a scenario where you are receiving data from IoT devices.

目次

Table of Contents

  1. Logic of String and Byte Conversion
  2. Implementation Example: Data Conversion for Legacy Device Communication
  3. Technical Points

1. Logic of String and Byte Conversion

  • Encode: Converting a String (str) into Bytes (bytes). This is the process of turning “characters” into a “sequence of numbers.”
  • Decode: Converting Bytes (bytes) into a String (str). This is the process of turning a “sequence of numbers” back into human-readable “characters.”

2. Implementation Example: Data Conversion for Legacy Device Communication

The following code simulates the exchange of status strings between an old measurement device using Shift_JIS and a modern system using UTF-8.

def main():
    # 1. Conversion from string to bytes (encode)
    # Status data we want to send ("Operating normally")
    original_text = "正常動作中"

    print(f"Original String: {original_text}\n")

    # Case A: Encode in UTF-8 (Modern Standard)
    # Used in general Web APIs and Linux environments
    bytes_utf8 = original_text.encode("utf_8")
    print(f"UTF-8 Bytes   : {bytes_utf8}")
    # Output example: b'\xe6\xad\xa3\xe5\xb8\xb8...' (Basically 3 bytes per Japanese character)

    # Case B: Encode in Shift_JIS (Legacy Environment)
    # Used in old Windows apps and older Japanese embedded devices
    bytes_sjis = original_text.encode("shift_jis")
    print(f"Shift_JIS Bytes: {bytes_sjis}")
    # Output example: b'\x90\xb3\x8f\xed...' (Basically 2 bytes per Japanese character)

    # Case C: UTF-8 with BOM (utf_8_sig)
    # Format used to prevent garbled text when opening CSVs in Excel
    bytes_bom = original_text.encode("utf_8_sig")
    print(f"UTF-8(BOM) Bytes: {bytes_bom}")
    # A signature (BOM) \xef\xbb\xbf is added to the beginning

    print("\n" + "="*30 + "\n")

    # 2. Restoration from bytes to string (decode)
    # Assuming this is byte data received from outside, we convert it back to a string

    # Correctly decoding the Shift_JIS byte sequence
    # You must specify the same format used during encoding
    received_text = bytes_sjis.decode("shift_jis")
    print(f"[Received/Restored] Shift_JIS Data: {received_text}")

    # 3. Error when trying to decode with a different encoding
    try:
        # Trying to read Shift_JIS data as UTF-8 will cause an error
        _ = bytes_sjis.decode("utf_8")
    except UnicodeDecodeError as e:
        print(f"[Conversion Error]: {e}")

if __name__ == "__main__":
    main()

3. Technical Points

Typical Encodings

  • utf_8: The most standard global format. Japanese characters are usually represented by 3 bytes.
  • shift_jis (or cp932): A format widely used in Japanese Windows environments and older systems. Japanese characters are usually 2 bytes.
  • ascii: Only for alphanumeric characters. An error occurs if Japanese characters are included.
  • utf_8_sig: A format with a mark (BOM) at the beginning of the file indicating “This is UTF-8.” If this is missing, text may appear garbled when handling CSV files in Windows Excel.

Error Handling

If the encoding formats do not match, a UnicodeDecodeError will occur. When importing external data, it is crucial to confirm which character code the data uses (e.g., by checking the specifications). While libraries like chardet can estimate the encoding if it is unknown, the best practice is to explicitly specify it according to the specifications.

よかったらシェアしてね!
  • URLをコピーしました!
  • URLをコピーしました!

この記事を書いた人

私が勉強したこと、実践したこと、してることを書いているブログです。
主に資産運用について書いていたのですが、
最近はプログラミングに興味があるので、今はそればっかりです。

目次