In network communication, reading and writing binary files, or integrating with legacy systems, converting between human-readable strings (str) and computer-readable bytes (bytes) is an unavoidable process.
Python performs these conversions with the str.encode() and bytes.decode() methods. Especially in Japanese environments, failing to specify the appropriate character encoding, such as shift_jis, utf_8, or utf_8_sig (UTF-8 with a Byte Order Mark), can lead to garbled text (mojibake) or errors.
In this article, I will explain how to properly convert data between different character codes, assuming a scenario where you are receiving data from IoT devices.
Table of Contents
- Logic of String and Byte Conversion
- Implementation Example: Data Conversion for Legacy Device Communication
- Technical Points
1. Logic of String and Byte Conversion
- Encode: Converting a string (str) into bytes (bytes). This is the process of turning “characters” into a “sequence of numbers.”
- Decode: Converting bytes (bytes) into a string (str). This is the process of turning a “sequence of numbers” back into human-readable “characters.”
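As a minimal sketch of the two directions described above, the following round trip encodes a short string and decodes it back. The sample value is a hypothetical placeholder, not data from a real device.

```python
text = "温度"  # hypothetical sample value ("temperature")

# Encode: str -> bytes
data = text.encode("utf_8")
print(type(data).__name__, len(data))  # bytes 6 (two characters, 3 bytes each)

# Decode: bytes -> str, using the same encoding
restored = data.decode("utf_8")
print(restored == text)  # True
```

The round trip is lossless only when the same encoding is used in both directions.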
2. Implementation Example: Data Conversion for Legacy Device Communication
The following code simulates the exchange of status strings between an old measurement device using Shift_JIS and a modern system using UTF-8.
def main():
    # 1. Conversion from string to bytes (encode)
    # Status data we want to send ("Operating normally")
    original_text = "正常動作中"
    print(f"Original String: {original_text}\n")

    # Case A: Encode in UTF-8 (Modern Standard)
    # Used in general Web APIs and Linux environments
    bytes_utf8 = original_text.encode("utf_8")
    print(f"UTF-8 Bytes : {bytes_utf8}")
    # Output example: b'\xe6\xad\xa3\xe5\xb8\xb8...' (Basically 3 bytes per Japanese character)

    # Case B: Encode in Shift_JIS (Legacy Environment)
    # Used in old Windows apps and older Japanese embedded devices
    bytes_sjis = original_text.encode("shift_jis")
    print(f"Shift_JIS Bytes: {bytes_sjis}")
    # Output example: b'\x90\xb3\x8f\xed...' (Basically 2 bytes per Japanese character)

    # Case C: UTF-8 with BOM (utf_8_sig)
    # Format used to prevent garbled text when opening CSVs in Excel
    bytes_bom = original_text.encode("utf_8_sig")
    print(f"UTF-8(BOM) Bytes: {bytes_bom}")
    # A signature (BOM) \xef\xbb\xbf is added to the beginning

    print("\n" + "=" * 30 + "\n")

    # 2. Restoration from bytes to string (decode)
    # Assuming this is byte data received from outside, we convert it back to a string
    # Correctly decoding the Shift_JIS byte sequence
    # You must specify the same format used during encoding
    received_text = bytes_sjis.decode("shift_jis")
    print(f"[Received/Restored] Shift_JIS Data: {received_text}")

    # 3. Error when trying to decode with a different encoding
    try:
        # Trying to read Shift_JIS data as UTF-8 will cause an error
        _ = bytes_sjis.decode("utf_8")
    except UnicodeDecodeError as e:
        print(f"[Conversion Error]: {e}")


if __name__ == "__main__":
    main()
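In practice, the modern side often needs to pass the device's data on in UTF-8. One way to do that, sketched below, is to decode the Shift_JIS bytes to str and immediately re-encode them. The function name transcode_sjis_to_utf8 and the variable device_payload are hypothetical names chosen for illustration.

```python
def transcode_sjis_to_utf8(payload: bytes) -> bytes:
    """Decode a Shift_JIS payload and re-encode it as UTF-8 (illustrative helper)."""
    text = payload.decode("shift_jis")  # bytes -> str
    return text.encode("utf_8")         # str -> bytes


# Stand-in for bytes received from the legacy device
device_payload = "正常動作中".encode("shift_jis")
utf8_payload = transcode_sjis_to_utf8(device_payload)

print(len(device_payload), len(utf8_payload))  # 10 15
print(utf8_payload.decode("utf_8"))            # 正常動作中
```

Going through str as the intermediate form keeps the two encodings independent: there is no direct bytes-to-bytes conversion to maintain.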
3. Technical Points
Typical Encodings
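- utf_8: The most widely used encoding worldwide. Japanese characters are usually represented by 3 bytes.
- shift_jis (or cp932): A format widely used in Japanese Windows environments and older systems. Japanese characters are usually 2 bytes. Note that cp932 is Microsoft's extended variant of Shift_JIS; it accepts vendor-specific characters such as ① that Python's shift_jis codec rejects, so cp932 is usually the safer choice for data produced on Windows.
- ascii: Only for alphanumeric characters and basic symbols. A UnicodeEncodeError occurs if Japanese characters are included.
- utf_8_sig: A format with a mark (BOM) at the beginning of the file indicating “This is UTF-8.” If this is missing, text may appear garbled when opening CSV files in Excel on Windows.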
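The differences between these encodings are easy to check in an interactive session. The following is a small sketch rather than an exhaustive comparison; the sample strings are arbitrary.

```python
text = "正常"

# Byte length of the same two characters in each encoding
for name in ("utf_8", "shift_jis", "cp932", "utf_8_sig"):
    print(f"{name:<10} {len(text.encode(name))} bytes")
# utf_8 -> 6, shift_jis -> 4, cp932 -> 4, utf_8_sig -> 9 (3-byte BOM + 6)

# ascii cannot represent Japanese characters
try:
    text.encode("ascii")
except UnicodeEncodeError as e:
    print(f"ascii failed: {e.reason}")

# Decoding with utf_8_sig strips the BOM; plain utf_8 keeps it as U+FEFF
with_bom = text.encode("utf_8_sig")
print(repr(with_bom.decode("utf_8")))      # '\ufeff正常'
print(repr(with_bom.decode("utf_8_sig")))  # '正常'

# cp932 accepts vendor extension characters that shift_jis rejects
circled_one = "①"
print(circled_one.encode("cp932"))  # b'\x87@'
try:
    circled_one.encode("shift_jis")
except UnicodeEncodeError:
    print("shift_jis cannot encode ①")
```

When reading a file that may or may not start with a BOM, decoding with utf_8_sig handles both cases, since the BOM is removed only if it is present.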
Error Handling
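If the encoding used for decoding does not match the one used for encoding, decode() typically raises a UnicodeDecodeError. In some combinations it raises nothing and silently produces garbled text instead, so the absence of an exception does not prove the encoding was correct. When importing external data, it is crucial to confirm which character encoding the data uses (e.g., by checking the specifications). While libraries like chardet can estimate the encoding if it is unknown, the best practice is to explicitly specify it according to the specifications.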
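When the specification is unavailable and the data must still be read, one possible fallback is to try a short list of candidate encodings in order. The sketch below is a heuristic under that assumption, not a substitute for knowing the real encoding; decode_with_fallback and its default candidate list are hypothetical choices for illustration.

```python
def decode_with_fallback(payload: bytes, encodings=("utf_8", "cp932")) -> str:
    """Try each candidate encoding in order (illustrative heuristic)."""
    for name in encodings:
        try:
            return payload.decode(name)
        except UnicodeDecodeError:
            continue
    # Last resort: keep going and mark undecodable bytes with U+FFFD
    return payload.decode(encodings[0], errors="replace")


sjis_bytes = "正常動作中".encode("shift_jis")
utf8_bytes = "正常動作中".encode("utf_8")

print(decode_with_fallback(sjis_bytes))  # 正常動作中 (utf_8 fails, cp932 succeeds)
print(decode_with_fallback(utf8_bytes))  # 正常動作中 (utf_8 succeeds)

# A mismatch does not always raise: latin_1 accepts every byte value,
# so this decodes "successfully" into garbled text.
garbled = utf8_bytes.decode("latin_1")
print(garbled == "正常動作中")  # False
```

Strict UTF-8 is tried first because it rejects most byte sequences produced by other encodings, which makes a successful UTF-8 decode a reasonably strong signal. The same cannot be said for permissive single-byte encodings such as latin_1.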
