When handling text data that mixes full-width (Zenkaku) and half-width (Hankaku) characters, such as product data on e-commerce sites or customer lists, “normalization” is an essential step for standardizing inconsistent formatting.
While the Python standard library can handle some of these conversions, the mojimoji library is considerably faster on large inputs and gives finer control over which character classes are converted in Japanese text.
Using a product management system as the scenario, this article explains how to clean up messy product data with mojimoji.
Table of Contents
- Installing mojimoji
- Implementation Example: Normalizing Product Data
- Code Explanation
1. Installing mojimoji
mojimoji is a fast third-party library implemented as a compiled extension (Cython) rather than in pure Python. Install it with pip:
pip install mojimoji
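To confirm that the installation succeeded before running the examples, a quick availability check can help. The sketch below uses only the standard library; the helper name `is_installed` is a hypothetical name chosen for illustration.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if the module can be found on the current import path."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Prints True once `pip install mojimoji` has succeeded in this environment.
    print(is_installed("mojimoji"))
```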
2. Implementation Example: Normalizing Product Data
A common requirement is: “We want to unify English letters and numbers to half-width, but keep Katakana in full-width for readability.” With mojimoji, you can control this behavior with a single argument.
The following code demonstrates how to format product data that contains a mixture of full-width alphanumerics, full-width Katakana, and full-width symbols.
import mojimoji


def main():
    # Product name with inconsistent formatting registered in the database.
    # Contains full-width letters, a full-width space, full-width Katakana,
    # and full-width digits.
    product_name = "Super PC モデルA 198000円"
    print(f"Original Data:\n{product_name}\n")

    # 1. Convert everything to half-width (default)
    # Alphanumerics, spaces, and Katakana all become half-width.
    normalized_all = mojimoji.zen_to_han(product_name)
    print(f"Convert All (All Half-width): {normalized_all}")

    # 2. Keep Katakana full-width, convert only alphanumerics/symbols (kana=False)
    # Ideal when you want to unify alphanumerics while keeping Japanese readable.
    normalized_smart = mojimoji.zen_to_han(product_name, kana=False)
    print(f"Exclude Kana (Alphanumerics only): {normalized_smart}")

    # 3. Keep digits full-width (digit=False)
    # Use this if price notations or other numbers must stay full-width.
    normalized_no_digit = mojimoji.zen_to_han(product_name, digit=False)
    print(f"Exclude Digits (Keep numbers Full-width): {normalized_no_digit}")

    # 4. Keep letters and symbols full-width (ascii=False)
    normalized_no_ascii = mojimoji.zen_to_han(product_name, ascii=False)
    print(f"Exclude ASCII (Keep letters Full-width): {normalized_no_ascii}")

    # --- Example of Inverse Conversion ---
    # 5. Convert half-width back to full-width (han_to_zen)
    # Useful when a legacy system requires a fixed full-width format.
    half_width_text = "MacBook Pro M3"
    full_width_text = mojimoji.han_to_zen(half_width_text)
    print(f"\n[Inverse] Half -> Full: {full_width_text}")


if __name__ == "__main__":
    main()
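For comparison, the standard-library route mentioned in the introduction might look like the following. This is a sketch that assumes `unicodedata.normalize` with the NFKC form is the conversion in question; the article does not name a specific function. The sample strings are illustrative.

```python
import unicodedata

# A mixed-width product name like the one described in the example above.
product_name = "Super PC モデルA 198000円"

# NFKC folds full-width alphanumerics and the full-width space to ASCII,
# while full-width Katakana and Kanji are left as they are.
nfkc = unicodedata.normalize("NFKC", product_name)
print(nfkc)

# NFKC also rewrites characters beyond width variants:
# half-width Katakana becomes full-width and the circled digit becomes "1".
print(unicodedata.normalize("NFKC", "ﾓﾃﾞﾙ①"))
```

NFKC is an all-or-nothing transformation: it has no switch for keeping digits full-width or for producing half-width Katakana, which is where the per-class arguments of mojimoji come in.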
3. Code Explanation
mojimoji.zen_to_han(text, ...)
This function converts full-width characters to half-width characters. Its main strength is that you can choose which character classes are converted with the following optional arguments, each of which defaults to True:
- kana=False: Excludes Katakana characters from conversion. This is very useful when you want to keep Katakana (e.g., product names or phonetic readings of names) in full-width.
- digit=False: Excludes numbers (0-9) from conversion.
- ascii=False: Excludes alphabets and symbols (ASCII characters) from conversion.
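To make the effect of these switches concrete, here is a minimal pure-Python sketch of per-class filtering for the ascii and digit classes only. It is not mojimoji's implementation: the function name `zen_to_han_sketch` and its lookup tables are hypothetical, Katakana is left out for brevity, and mojimoji may map some symbols differently.

```python
# Full-width ASCII (U+FF01..U+FF5E) sits exactly 0xFEE0 above plain ASCII.
_OFFSET = 0xFEE0
_DIGITS = {cp: cp - _OFFSET for cp in range(0xFF10, 0xFF1A)}
_ASCII = {cp: cp - _OFFSET for cp in range(0xFF01, 0xFF5F) if cp not in _DIGITS}
_ASCII[0x3000] = 0x20  # ideographic (full-width) space -> ASCII space


# The parameter name `ascii` shadows the builtin on purpose, to mirror
# the keyword used in the article.
def zen_to_han_sketch(text: str, ascii: bool = True, digit: bool = True) -> str:
    """Convert only the enabled character classes; leave everything else untouched."""
    table = {}
    if ascii:
        table.update(_ASCII)
    if digit:
        table.update(_DIGITS)
    return text.translate(table)


# Letters and the space are converted; digits stay full-width.
print(zen_to_han_sketch("PC 198000円", digit=False))
```

Each flag simply decides whether one class's mapping table takes part in the conversion, which is why the options can be combined freely.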
Practical Use Cases
While similar processing can be written with regular expressions, code that uses mojimoji is simpler to write, less prone to bugs, and significantly faster. It is particularly effective in batch jobs that process large volumes of log data or CSV files.
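A batch job of this kind could be sketched as follows. The helper `normalize_rows`, the column names, and the sample CSV content are hypothetical. So that the sketch runs without extra dependencies, the default normalizer is a standard-library stand-in (NFKC) rather than mojimoji.

```python
import csv
import io
import unicodedata
from typing import Callable, Iterable, Iterator


def nfkc(text: str) -> str:
    """Stand-in normalizer built on the standard library."""
    return unicodedata.normalize("NFKC", text)


def normalize_rows(
    rows: Iterable[dict],
    columns: Iterable[str],
    normalize: Callable[[str], str] = nfkc,
) -> Iterator[dict]:
    """Yield each row with the selected columns normalized, one row at a time."""
    targets = list(columns)
    for row in rows:
        cleaned = dict(row)
        for name in targets:
            if cleaned.get(name) is not None:
                cleaned[name] = normalize(cleaned[name])
        yield cleaned


# Hypothetical CSV content; a real job would open a file instead of StringIO.
source = io.StringIO("id,name\n1,Super PC\n2,モデルA\n")
result = list(normalize_rows(csv.DictReader(source), ["name"]))
print(result)
```

With mojimoji installed, the same loop can use it by passing `normalize=lambda s: mojimoji.zen_to_han(s, kana=False)`. Because rows are processed lazily, memory use stays flat even for large files.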
