When replacing strings in Python, the standard replace() method is sufficient for simple, fixed-text substitutions.
However, when you need to perform flexible replacements based on patterns—such as “all whitespace characters” or “numbers only”—the sub() function from the regular expression module (re) is extremely powerful.
In this article, I will explain how to bulk convert strings containing spaces (like user input) into a system-friendly format (such as snake_case).
Table of Contents
- Basic Syntax of re.sub
- Implementation Example: Filename Normalization
- Code Explanation
- Summary
1. Basic Syntax of re.sub
re.sub is a function that replaces parts of a string that match a regular expression pattern with a specified string.
Syntax:
re.sub(regex_pattern, replacement_string, target_text)
The biggest advantage of using this function is that you can specify various types of characters—such as spaces (half-width/full-width), tabs, and newlines—using a single condition (meta-character).
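As a minimal sketch of the syntax above, the following snippet calls re.sub on a short sample string. The sample text and variable names are made up for illustration; the optional count argument is part of the standard re.sub signature and limits how many matches are replaced.

```python
import re

# Sample text containing a space, a tab, and a newline (illustrative only)
text = "a b\tc\nd"

# Argument order: pattern, replacement, target text
result = re.sub(r"\s", "-", text)
print(result)  # a-b-c-d

# count=1 replaces only the first match and leaves the rest untouched
first_only = re.sub(r"\s", "-", text, count=1)
print(repr(first_only))
```

Here a single pattern, r"\s", covers three different whitespace characters that would otherwise need three separate replace() calls.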
2. Implementation Example: Filename Normalization
Below is an example of code that replaces “whitespace” in filenames (e.g., uploaded by users) with “underscores (_)” to prevent system errors.
import re


def main():
    # Simulated user-input filename
    # Contains a mix of half-width and full-width spaces
    # (the full-width space is written as the escape \u3000 so it stays visible)
    raw_filename = "My Project\u3000Plan 2025.pdf"
    print(f"Original: '{raw_filename}'")

    # Replacement using regular expressions
    # r"\s" is a meta-character representing "whitespace characters"
    # (matches half-width spaces, full-width spaces, tabs, newlines, etc.)
    normalized_filename = re.sub(r"\s", "_", raw_filename)

    # Output result
    print(f"Normalized: '{normalized_filename}'")

    # --- Application: Converting consecutive spaces to a single underscore ---
    # By using r"\s+", multiple consecutive spaces are combined into one
    messy_text = "Data   Analysis    Report.txt"
    clean_text = re.sub(r"\s+", "_", messy_text)
    print(f"Consecutive spaces: '{messy_text}' -> '{clean_text}'")


if __name__ == "__main__":
    main()
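The example above keeps the original letter case. If the goal is a snake_case-style name, as mentioned in the introduction, one possible approach is to trim the ends, collapse whitespace runs, and lowercase the result. The helper name to_snake_case_name is hypothetical and not part of the original code; this is a sketch under those assumptions rather than a complete filename sanitizer.

```python
import re


def to_snake_case_name(raw_name: str) -> str:
    """Hypothetical helper: trim, collapse whitespace runs to '_', lowercase."""
    # strip() removes leading/trailing whitespace so no underscore appears at the ends
    trimmed = raw_name.strip()
    # \s+ turns each run of whitespace (of any kind) into a single underscore
    return re.sub(r"\s+", "_", trimmed).lower()


print(to_snake_case_name("  My Project\u3000Plan  2025.pdf "))
# my_project_plan_2025.pdf
```

Stripping before substituting is a design choice: without it, leading or trailing spaces would become underscores at the start or end of the name.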
3. Code Explanation
The Power of the \s Meta-character
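The regular expression r"\s" used in the code matches all Unicode whitespace characters when the pattern is a str, which is the default in Python 3. (With a bytes pattern or the re.ASCII flag, it matches ASCII whitespace only.) This includes: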
- Half-width space (" ")
- Tab (\t)
- Newline (\n, \r)
- Full-width space (commonly used in Japanese environments)
While the standard replace(" ", "_") can only replace half-width spaces (leaving full-width spaces and tabs behind), re.sub(r"\s", ...) handles all of them at once.
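The difference can be checked with a short comparison. The sample string below is invented for illustration and writes the full-width space as the escape \u3000 so that it is visible in the source.

```python
import re

# Half-width space, full-width space (U+3000), and a tab between the letters
sample = "a b\u3000c\td"

# str.replace only touches the exact character it was given
with_replace = sample.replace(" ", "_")

# re.sub with \s handles every kind of whitespace in one pass
with_sub = re.sub(r"\s", "_", sample)

print(repr(with_replace))  # the full-width space and tab are still present
print(repr(with_sub))      # 'a_b_c_d'
```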
Recommendation: Raw Strings r"..."
When writing regular expression patterns, it is recommended to use Raw Strings by prefixing the string with r. This tells Python to treat backslashes (\) as literal characters rather than escape characters. This is the standard practice for handling regular expressions in Python.
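A case where the r prefix changes the result is \b. In a normal string, "\b" is the backspace character, while in a raw string the backslash survives and re interprets \b as a word boundary. The sample text below is made up for illustration.

```python
import re

# "\b" is one character (backspace); r"\b" is two (backslash + "b")
print(len("\b"), len(r"\b"))  # 1 2

text = "cat concat cat"

# Raw string: \b reaches the regex engine and means "word boundary"
word_boundary = re.sub(r"\bcat\b", "dog", text)
print(word_boundary)  # dog concat dog

# Normal string: the pattern looks for literal backspace characters, so nothing matches
backspace = re.sub("\bcat\b", "dog", text)
print(backspace)  # cat concat cat
```

Sequences such as "\s" happen to work without the prefix because Python does not recognize them as string escapes, but recent Python versions emit a warning for them, which is another reason to use raw strings consistently.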
4. Summary
By using re.sub, you can perform advanced replacement operations based on character types and patterns, rather than just simple text matching.
- \s: Targets all whitespace characters.
- \d: Targets digits (e.g., masking numbers in personal information).
- +: Allows repetition of the preceding pattern, replacing consecutive characters as a single group.
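The \d and + items can be illustrated with a small masking sketch. The record string is fictional sample data, and the choice of "*" and "#" as replacement characters is arbitrary.

```python
import re

# Fictional sample data for illustration
record = "Tel: 090-1234-5678, ID 42"

# \d matches one digit at a time, so every digit is replaced individually
masked = re.sub(r"\d", "*", record)
print(masked)  # Tel: ***-****-****, ID **

# \d+ matches each run of digits as one unit, so each run becomes a single "#"
collapsed = re.sub(r"\d+", "#", record)
print(collapsed)  # Tel: #-#-#, ID #
```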
This function is essential for data cleansing and formatting tasks.
