[Python] Splitting Strings with Multiple Delimiters Using re.split

December 11, 2025

When splitting a string into a list in Python, the built-in str.split() method is the usual choice. However, str.split() accepts only one separator string per call, so it is a poor fit for data in which commas, semicolons, spaces, and other separators are mixed together.

By using the split() function from the re (regular expression) module, you can specify multiple delimiters or complex patterns to flexibly split strings.

Basics and Benefits of re.split
Implementation Example: Splitting Tag Information with Inconsistent Formatting
Code Explanation
Summary

1. Basics and Benefits of `re.split`

re.split is a function that splits a string at the points matching a regular expression pattern.

Syntax:

re.split(regex_pattern, target_string)

A major advantage is that you can use a negated character class ([^...]) to express rules like “treat everything except specific characters as a delimiter.” This makes it easy to clean up “messy data” in which the delimiter format is inconsistent.
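
As a minimal sketch of the difference, the snippet below splits the same string with str.split() and with re.split(). The sample string and variable names are made up for illustration.

```python
import re

text = "red,green;blue yellow"

# str.split() takes one separator string, so only the comma is handled.
by_comma = text.split(",")
print(by_comma)      # ['red', 'green;blue yellow']

# A character class lists every accepted delimiter explicitly.
by_class = re.split(r"[,; ]", text)
print(by_class)      # ['red', 'green', 'blue', 'yellow']

# A negated class treats anything that is not a lowercase letter as a delimiter.
by_negated = re.split(r"[^a-z]", text)
print(by_negated)    # ['red', 'green', 'blue', 'yellow']
```

re.split also accepts optional maxsplit and flags arguments, which are not used in this article.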

2. Implementation Example: Splitting Tag Information with Inconsistent Formatting

The following example takes a user-entered tag string with inconsistent delimiters (commas, spaces, semicolons) and converts it into a clean list.

import re

def main():
    # Input data with inconsistent delimiters
    # A mix of commas, semicolons, and spaces
    raw_tags = "Python, Programming;Code  Development"

    print(f"Original: '{raw_tags}'")

    # Splitting using regular expressions
    # Explanation of r"[^a-zA-Z0-9]+":
    # [^...]    : Matches characters "other than" those in brackets (Negation)
    # a-zA-Z0-9 : Alphanumeric characters
    # +         : One or more repetitions
    # In short, it treats "blocks of non-alphanumeric characters" as delimiters.
    tag_list = re.split(r"[^a-zA-Z0-9]+", raw_tags)

    # Output result
    print(f"Split List: {tag_list}")

    # Note: If there are delimiters at the start or end of the string, empty strings may be included.
    # To remove them, use list comprehension for filtering.
    clean_list = [t for t in tag_list if t]
    print(f"Clean List: {clean_list}")

if __name__ == "__main__":
    main()
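
The sample input above has no delimiters at its edges, so the filtering step has nothing to remove. The short sketch below uses a hypothetical input with leading and trailing delimiters to show where the empty strings come from. It also shows re.findall as one possible alternative that collects the tokens directly instead of splitting on the gaps.

```python
import re

# Hypothetical input with delimiters at both ends
raw = "  ;Python, Code; "

parts = re.split(r"[^a-zA-Z0-9]+", raw)
print(parts)    # ['', 'Python', 'Code', '']

# Filtering drops the empty strings produced at the edges.
clean = [p for p in parts if p]
print(clean)    # ['Python', 'Code']

# Alternative: match the tokens themselves rather than the delimiters.
tokens = re.findall(r"[a-zA-Z0-9]+", raw)
print(tokens)   # ['Python', 'Code']
```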

3. Code Explanation

Using Negated Character Classes `[^...]`

The regular expression r"[^a-zA-Z0-9]+" used in the code treats “anything that is not an ASCII letter (a-z, A-Z) or a digit (0-9)” as a delimiter.

This allows commas (","), semicolons (";"), and spaces (" ") to be handled as separators without listing them individually. The trade-off is that hyphens, underscores, and non-ASCII letters are also treated as delimiters, so add any characters you want to keep to the class.
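
To make that behavior concrete, here is a small comparison between an explicit delimiter list and the negated class. The input string is invented for this sketch and deliberately contains a pipe, a tab, and a hyphenated tag.

```python
import re

# Hypothetical input: pipe and tab as unexpected delimiters, plus a hyphenated tag
tags = "Python|Code\tDev, scikit-learn"

# Explicit list: only commas, semicolons, and spaces are delimiters.
explicit = re.split(r"[,; ]+", tags)
print(explicit)   # ['Python|Code\tDev', 'scikit-learn']

# Negated class: every non-alphanumeric character is a delimiter,
# so the pipe and tab are handled, but the hyphen is split as well.
negated = re.split(r"[^a-zA-Z0-9]+", tags)
print(negated)    # ['Python', 'Code', 'Dev', 'scikit', 'learn']

# Adding "-" to the class keeps hyphenated tags intact.
keep_hyphen = re.split(r"[^a-zA-Z0-9-]+", tags)
print(keep_hyphen)  # ['Python', 'Code', 'Dev', 'scikit-learn']
```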

Importance of `+` (One or more repetitions)

The + at the end of the regular expression is very important. Without it, every non-alphanumeric character counts as its own delimiter, so two consecutive spaces ("  ") would be split between them, producing an empty string in the list.

By specifying +, consecutive delimiters (e.g., ", " or "  ") are treated together as “one large delimiter,” resulting in a cleaner list.
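
The effect can be checked by running both patterns on the same input as the main example:

```python
import re

raw_tags = "Python, Programming;Code  Development"

# Without +, each delimiter character splits on its own,
# so ", " and "  " each leave an empty string behind.
without_plus = re.split(r"[^a-zA-Z0-9]", raw_tags)
print(without_plus)
# ['Python', '', 'Programming', 'Code', '', 'Development']

# With +, a run of delimiter characters counts as a single split point.
with_plus = re.split(r"[^a-zA-Z0-9]+", raw_tags)
print(with_plus)
# ['Python', 'Programming', 'Code', 'Development']
```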

4. Summary

re.split is a powerful tool for parsing text data that a single fixed delimiter cannot handle.

str.split(): Fast and suitable when data is clean and the delimiter is clear.
re.split(): Effective when delimiters are inconsistent or when multiple types are mixed.

This technique is particularly useful in log analysis and Natural Language Processing (NLP) preprocessing for creating word lists while removing unnecessary symbols.
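
As a rough illustration of that use case, the sketch below turns a made-up log line into a lowercase word list. The log format is an assumption for this example, not a real application's output.

```python
import re

# Hypothetical log line
line = "ERROR: disk full (code=28); retrying..."

# Lowercase first, split on runs of non-alphanumeric characters,
# then drop the empty string left by the trailing "...".
words = [w for w in re.split(r"[^a-zA-Z0-9]+", line.lower()) if w]
print(words)
# ['error', 'disk', 'full', 'code', '28', 'retrying']
```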
