[Python] High-Speed Full-width/Half-width Conversion with mojimoji: The Definitive Guide to Data Normalization

December 8, 2025

When handling text data that mixes full-width (Zenkaku) and half-width (Hankaku) characters, such as product data on e-commerce sites or customer lists, "normalization" is an essential step for standardizing inconsistent formatting.

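Python's standard library can handle these conversions (for example with unicodedata.normalize or str.translate), but the mojimoji library is considerably faster on large inputs and offers finer control over which character classes are converted in Japanese text.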
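As a baseline, here is a minimal sketch of the standard-library route using unicodedata.normalize with the NFKC form. The helper name nfkc is made up for this illustration. It shows that NFKC is all-or-nothing: it narrows full-width alphanumerics, but it also widens half-width Katakana and folds unrelated compatibility characters, which is not always what you want.

```python
import unicodedata


def nfkc(text: str) -> str:
    """Apply Unicode NFKC normalization (standard library only)."""
    return unicodedata.normalize("NFKC", text)


# Full-width letters, digits, and spaces become half-width; Katakana is untouched.
print(nfkc("Ｓｕｐｅｒ　ＰＣ　モデルＡ　１９８０００円"))  # Super PC モデルA 198000円

# Half-width Katakana is converted to full-width, with no option to turn it off.
print(nfkc("ﾓﾃﾞﾙ"))  # モデル

# Other compatibility characters are folded as well.
print(nfkc("①㈱"))  # 1(株)
```

If per-category control matters (for example, keeping digits full-width), NFKC alone cannot express it.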

In this article, assuming a product management system scenario, I will explain how to neatly format messy product data using mojimoji.

Installing mojimoji
Implementation Example: Normalizing Product Data
Code Explanation

1. Installing mojimoji

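mojimoji is a fast third-party library implemented as a compiled extension module (written in Cython), so it is not part of the standard library and must be installed from PyPI.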

pip install mojimoji

2. Implementation Example: Normalizing Product Data

A common requirement is: “We want to unify English letters and numbers to half-width, but keep Katakana in full-width for readability.” With mojimoji, you can control this behavior with a single argument.

The following code demonstrates how to format product data that contains a mixture of full-width alphanumerics, full-width spaces, and full-width Katakana.

import mojimoji

def main():
    # Product name with inconsistent formatting registered in the database
    # "Ｓｕｐｅｒ　ＰＣ　モデルＡ　１９８０００円"
    # Contains full-width letters, full-width spaces, full-width Katakana, and full-width digits.
    product_name = "Ｓｕｐｅｒ　ＰＣ　モデルＡ　１９８０００円"

    print(f"Original Data:\n{product_name}\n")

    # 1. Convert everything to half-width (Default)
    # Alphanumerics, spaces, and Katakana all become half-width.
    normalized_all = mojimoji.zen_to_han(product_name)
    print(f"Convert All (All Half-width): {normalized_all}")

    # 2. Keep Katakana 'Full-width', convert only alphanumerics/symbols (kana=False)
    # Ideal when you want to unify alphanumerics while maintaining Japanese readability.
    normalized_smart = mojimoji.zen_to_han(product_name, kana=False)
    print(f"Exclude Kana (Alphanumerics only): {normalized_smart}")

    # 3. Keep numbers 'Full-width' (digit=False)
    # Use this if you want to maintain price notations or numbers in full-width.
    normalized_no_digit = mojimoji.zen_to_han(product_name, digit=False)
    print(f"Exclude Digits (Keep numbers Full-width): {normalized_no_digit}")

    # 4. Keep letters and symbols 'Full-width' (ascii=False)
    normalized_no_ascii = mojimoji.zen_to_han(product_name, ascii=False)
    print(f"Exclude ASCII (Keep alphabets Full-width): {normalized_no_ascii}")

    # --- Example of Inverse Conversion ---
    # 5. Convert Half-width back to Full-width (han_to_zen)
    # Useful when a legacy system requires a fixed full-width format.
    half_width_text = "MacBook Pro M3"
    full_width_text = mojimoji.han_to_zen(half_width_text)
    print(f"\n[Inverse] Half -> Full: {full_width_text}")

if __name__ == "__main__":
    main()

3. Code Explanation

`mojimoji.zen_to_han(text, ...)`

This function converts full-width characters to half-width characters. Its key strength is that you can narrow the conversion targets with the following optional arguments, each of which defaults to True:

kana=False: Excludes Katakana characters from conversion. This is very useful when you want to handle Katakana (e.g., product names or phonetic readings of names) in full-width.
digit=False: Excludes digits (０ to ９) from conversion.
ascii=False: Excludes letters and symbols (the full-width forms of ASCII characters) from conversion.
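To make the ascii and digit switches concrete, here is a pure-Python approximation built on str.translate. It is an illustration only, not mojimoji's implementation. The name zen_to_han_sketch is made up for this example, Katakana is deliberately not handled, and grouping the full-width space with the ASCII set is an assumption of this sketch.

```python
# Full-width ASCII occupies U+FF01..U+FF5E, exactly 0xFEE0 above '!'..'~'.
OFFSET = 0xFEE0

# Full-width digits ０..９ (U+FF10..U+FF19) mapped to '0'..'9'.
FULLWIDTH_DIGITS = {cp: cp - OFFSET for cp in range(0xFF10, 0xFF1A)}

# Remaining full-width letters and symbols mapped to their ASCII counterparts.
FULLWIDTH_ASCII = {
    cp: cp - OFFSET
    for cp in range(0xFF01, 0xFF5F)
    if cp not in FULLWIDTH_DIGITS
}
FULLWIDTH_ASCII[0x3000] = 0x20  # full-width (ideographic) space -> ASCII space


def zen_to_han_sketch(text: str, ascii: bool = True, digit: bool = True) -> str:
    """Hypothetical helper; keyword names mirror mojimoji's for readability."""
    table = {}
    if ascii:
        table.update(FULLWIDTH_ASCII)
    if digit:
        table.update(FULLWIDTH_DIGITS)
    return text.translate(table)


name = "ＰＣ　モデルＡ　１２３"
print(zen_to_han_sketch(name))               # PC モデルA 123
print(zen_to_han_sketch(name, digit=False))  # PC モデルA １２３
print(zen_to_han_sketch(name, ascii=False))  # ＰＣ　モデルＡ　123
```

Each flag simply decides whether one group of characters is included in the translation table. The Katakana case is more involved (a voiced character such as デ becomes two half-width characters), which is one reason a dedicated library is convenient.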

Practical Use Cases

While it is possible to write similar processing with regular expressions, mojimoji keeps the code simple, leaves less room for bugs than hand-written patterns, and is significantly faster. It is particularly effective in batch jobs that process large amounts of log data or CSV files.
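A batch job of this kind might look like the following minimal sketch. The function normalize_csv, the column names, and the in-memory sample data are all hypothetical. The guarded import is only there so the sketch still runs where mojimoji is not installed; the NFKC fallback is an assumption of this example and behaves differently in edge cases such as half-width Katakana.

```python
import csv
import io
import unicodedata

try:
    import mojimoji  # third-party: pip install mojimoji

    def normalize(text: str) -> str:
        # Narrow alphanumerics and symbols, keep Katakana full-width.
        return mojimoji.zen_to_han(text, kana=False)
except ImportError:
    def normalize(text: str) -> str:
        # Fallback for environments without mojimoji (not equivalent in all cases).
        return unicodedata.normalize("NFKC", text)


def normalize_csv(src, dst, columns):
    """Copy CSV rows from src to dst, normalizing only the named columns."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        for col in columns:
            row[col] = normalize(row[col])
        writer.writerow(row)


# In-memory demo; real code would pass file objects opened with newline="".
src = io.StringIO("id,name,price\n1,ＰＣモデルＡ,１９８０００\n")
dst = io.StringIO()
normalize_csv(src, dst, ["name", "price"])
result = dst.getvalue()
print(result)
```

Processing row by row keeps memory use flat for large files, and restricting normalization to named columns avoids touching fields such as free-text notes that should be stored as entered.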
