[Python] Efficient String Replacement with Regex re.sub

2025年12月8日

When replacing strings in Python, the standard replace() method is sufficient for simple, fixed-text substitutions.

However, when you need to perform flexible replacements based on patterns—such as “all whitespace characters” or “numbers only”—the sub() function from the regular expression module (re) is extremely powerful.

In this article, I will explain how to bulk convert strings containing spaces (like user input) into a system-friendly format (such as snake_case).

Basic Syntax of re.sub
Implementation Example: Filename Normalization
Code Explanation
Summary

1. Basic Syntax of re.sub

re.sub is a function that replaces parts of a string that match a regular expression pattern with a specified string.

Syntax:

re.sub(regex_pattern, replacement_string, target_text)

The biggest advantage of using this function is that you can specify various types of characters—such as spaces (half-width/full-width), tabs, and newlines—using a single condition (meta-character).

2. Implementation Example: Filename Normalization

Below is an example of code that replaces “whitespace” in filenames (e.g., uploaded by users) with “underscores (_)” to prevent system errors.

import re

def main():
    # Simulated user-input filename
    # Contains a mix of half-width and full-width spaces
    raw_filename = "My Project Plan　2025.pdf"

    print(f"Original: '{raw_filename}'")

    # Replacement using regular expressions
    # r"\s" is a meta-character representing "whitespace characters"
    # (matches half-width spaces, full-width spaces, tabs, newlines, etc.)
    normalized_filename = re.sub(r"\s", "_", raw_filename)

    # Output result
    print(f"Normalized: '{normalized_filename}'")

    # --- Application: Converting consecutive spaces to a single underscore ---
    # By using r"\s+", multiple consecutive spaces are combined into one
    messy_text = "Data    Analysis   Report.txt"
    clean_text = re.sub(r"\s+", "_", messy_text)
    print(f"Consecutive spaces: '{messy_text}' -> '{clean_text}'")

if __name__ == "__main__":
    main()

3. Code Explanation

The Power of Meta-character \s

The regular expression r"\s" used in the code matches all “whitespace characters” in Unicode. This includes:

Half-width space ()
Tab (\t)
Newline (\n, \r)
Full-width space (commonly used in Japanese environments)

While the standard replace(" ", "_") can only replace half-width spaces (leaving full-width spaces and tabs behind), re.sub(r"\s", ...) handles all of them at once.

Recommendation of Raw Strings r”…”

When writing regular expression patterns, it is recommended to use Raw Strings by prefixing the string with r. This tells Python to treat backslashes (\) as literal characters rather than escape characters. This is the standard practice for handling regular expressions in Python.

4. Summary

By using re.sub, you can perform advanced replacement operations based on character types and patterns, rather than just simple text matching.

\s: Targets all whitespace characters.
\d: Targets digits (e.g., masking numbers in personal information).
+: Allows repetition of the preceding pattern, replacing consecutive characters as a single group.

This function is essential for data cleansing and formatting tasks.

よかったらシェアしてね！