[Python] Complete Guide to the re Module (Functions, Special Characters, and Raw Strings)

The re module is indispensable for advanced text processing in Python. In this article, I will systematically explain its main functions, special characters (metacharacters), and the behavior of Raw Strings.

Regular expression syntax can often become complex, so this list is organized to serve as a quick reference during implementation.

Table of Contents

  1. Using Raw Strings
  2. Key Functions of the re Module
  3. List of Common Regular Expressions (Metacharacters)
  4. Practical Code Examples
  5. Summary

1. Using Raw Strings

In regular expressions, the backslash (\) is frequently used as an escape character. However, Python’s standard string literals also treat the backslash as an escape character, so every backslash meant for the regex engine must be written as \\ in the source. Matching a single literal backslash then takes four of them (\\\\), which makes the code messy and hard to read.

By using “Raw Strings” (prefixing the string literal with r), you tell Python to leave backslashes as they are, so the pattern reaches the regex engine exactly as written. Only the regex-level escaping remains, which dramatically improves readability.

import re

# Standard string: each backslash is doubled for the regex, then doubled again for Python (hard to read)
# Intent: match a literal \my-host\ followed by any characters
pattern_normal = "\\\\my-host\\\\.*"

# Raw string: only the regex-level escaping (\\ for one literal backslash) remains (recommended)
pattern_raw = r"\\my-host\\.*"
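
To confirm that the two spellings above really describe the same pattern, the following minimal sketch compares them and tries the raw version against a sample path. The sample path and the variable names same_pattern and matched_text are assumptions made for illustration.

```python
import re

# Both literals produce the same pattern text: \\my-host\\.*
pattern_normal = "\\\\my-host\\\\.*"
pattern_raw = r"\\my-host\\.*"
same_pattern = pattern_normal == pattern_raw

# Hypothetical sample input: a Windows-style path beginning with \my-host\
sample_path = r"\my-host\share\report.txt"

# re.match anchors at the start of the string
match = re.match(pattern_raw, sample_path)
matched_text = match.group(0) if match else None

print(same_pattern)
print(matched_text)
```

Because the strings are identical once Python has parsed them, the choice between the two forms is purely about how readable the source code is.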

2. Key Functions of the re Module

The following table summarizes the four most frequently used functions and their return values.

Function Name   Return Value              Explanation
re.search()     re.Match object or None   Scans through the string and returns the first match found. Returns None if there is no match.
re.findall()    list of strings or tuples Finds all non-overlapping matches and returns them as a list. The items are tuples when the pattern contains more than one group.
re.split()      list of strings           Splits the string at each occurrence of the pattern.
re.sub()        str (replaced string)     Returns a new string in which matches of the pattern are replaced with the specified string.
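
As a quick check of the return values listed above, here is a small sketch that calls each function on one sample string. The sample text and the variable names are assumptions chosen for illustration.

```python
import re

text = "apple=3, banana=12"

# search: a Match object for the first hit, or None when nothing matches
first = re.search(r"\d+", text)
missing = re.search(r"cherry", text)

# findall: strings with zero or one group, tuples with two or more groups
numbers = re.findall(r"\d+", text)
pairs = re.findall(r"(\w+)=(\d+)", text)

# split: the pieces between matches of the delimiter pattern
fields = re.split(r",\s*", text)

# sub: returns a new string; the original is left unchanged
masked = re.sub(r"\d+", "N", text)

print(first.group(), missing)
print(numbers, pairs)
print(fields)
print(masked)
```

Since re.search can return None, it is safer to test the result before calling group() on it, as the longer example in section 4 does.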

3. List of Common Regular Expressions (Metacharacters)

The table below lists the special characters used to build regex patterns, with the meaning of each and a short example.

Regex   Meaning                                           Example   Explanation of Example
.       Any character (except newline)                    a.c       Matches “abc”, “a1c”, etc.
^       Start of string (of each line with re.MULTILINE)  ^Start    Matches if the text starts with “Start”.
$       End of string (of each line with re.MULTILINE)    End$      Matches if the text ends with “End”.
*       0 or more occurrences of the preceding element    ab*c      Matches “ac”, “abc”, “abbc”.
+       1 or more occurrences of the preceding element    ab+c      Matches “abc”, “abbc” (does not match “ac”).
?       0 or 1 occurrence of the preceding element        ab?c      Matches only “ac” or “abc”.
|       Either pattern (OR)                               Apple|…   Matches “Apple” or the alternative given after the bar.
(...)   Grouping                                          (ab)+     Matches repetitions like “ab”, “abab”.
[...]   Character set (any one character)                 [a-z]     Matches one lowercase letter.
[^...]  Negated set (any one character not listed)        [^0-9]    Matches any one character that is not a digit.
\       Escape                                            \.        Treats the special character . as a literal dot.
{n}     Exactly n occurrences                             \d{3}     Matches exactly 3 digits (e.g., 123).
{n,}    n or more occurrences                             \d{3,}    Matches 3 or more digits.
{n,m}   Between n and m occurrences                       \d{3,5}   Matches between 3 and 5 digits.
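
Several of the rows above can be checked directly. The sketch below uses re.fullmatch, which succeeds only when the whole string matches the pattern, so each case pairs one string that should match with one that should not. The cases list and the sample strings are assumptions for illustration.

```python
import re

# (pattern, text that should match in full, text that should not)
cases = [
    (r"a.c", "a1c", "ac"),
    (r"ab*c", "ac", "adc"),
    (r"ab+c", "abbc", "ac"),
    (r"ab?c", "abc", "abbc"),
    (r"(ab)+", "abab", "aba"),
    (r"[a-z]", "q", "Q"),
    (r"[^0-9]", "x", "7"),
    (r"\.", ".", "a"),
    (r"\d{3}", "123", "12"),
    (r"\d{3,5}", "12345", "123456"),
]

results = [
    (pattern, bool(re.fullmatch(pattern, good)), bool(re.fullmatch(pattern, bad)))
    for pattern, good, bad in cases
]
for pattern, good_ok, bad_ok in results:
    print(f"{pattern!r}: matched={good_ok}, rejected={not bad_ok}")

# ^ anchors to the start of the whole string unless re.MULTILINE is passed
lines = "Start here\nthe End"
starts_default = re.findall(r"^\w+", lines)
starts_multiline = re.findall(r"^\w+", lines, flags=re.MULTILINE)
print(starts_default, starts_multiline)
```

The last two calls show why ^ and $ are described in terms of the string by default: they act line by line only when the re.MULTILINE flag is given.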

4. Practical Code Examples

The example below combines the functions and regex symbols explained above to extract and format information from unstructured text. It assumes a scenario in which product IDs, prices, and the order date are pulled from an order email.

import re

def main():
    # Text data to be analyzed (e.g., excerpt from an order email)
    text_data = """
    Order Date: 2023-11-05
    [Item] ID:PROD-001 Price:1200JPY Note:SALE
    [Item] ID:PROD-888 Price:3500JPY Note:Standard
    [Item] ID:ACC-99   Price:500JPY  Note:Bulk
    End of list.
    """

    print("--- 1. Extract specific patterns with re.findall ---")
    # Product ID pattern: uppercase letters + hyphen + digits
    # The parentheses form a group, so findall returns only the ID part without the "ID:" prefix.
    id_list = re.findall(r"ID:([A-Z]+-\d+)", text_data)
    print(f"Extracted IDs: {id_list}")

    print("\n--- 2. Search for date with re.search ---")
    # Date format: 4 digits - 2 digits - 2 digits
    date_match = re.search(r"Date:\s*(\d{4}-\d{2}-\d{2})", text_data)
    if date_match:
        # Get the content inside the parenthesis using group(1)
        print(f"Order Date: {date_match.group(1)}")

    print("\n--- 3. Format data with re.sub ---")
    # Rewrite each price, e.g. "Price:1200JPY" becomes "Price: 1200 Yen"
    # \1 in the replacement refers to the digits captured by the first group
    clean_text = re.sub(r"Price:(\d+)JPY", r"Price: \1 Yen", text_data)

    # Display only Item lines for verification
    # Process line by line using splitlines()
    print("Formatted Data (Excerpt):")
    for line in clean_text.splitlines():
        if "[Item]" in line:
            print(line.strip())

    print("\n--- 4. Split by delimiter with re.split ---")
    # Split a line like "ID:PROD-001    Price:1200JPY" by whitespace (spaces/tabs)
    sample_line = "ID:PROD-001    Price:1200JPY"
    # \s+ matches one or more whitespace characters
    parts = re.split(r"\s+", sample_line)
    print(f"Original Data: '{sample_line}'")
    print(f"Split Result : {parts}")

if __name__ == "__main__":
    main()
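
As a possible extension of the scenario above, the following sketch pairs each product ID with its price in one pass by using named groups and re.finditer. The helper parse_items, the ITEM_PATTERN constant, and the shortened sample text are hypothetical and not part of the original example.

```python
import re

# Hypothetical pattern: named groups keep the ID and the price together
ITEM_PATTERN = re.compile(r"ID:(?P<id>[A-Z]+-\d+)\s+Price:(?P<price>\d+)JPY")


def parse_items(text):
    """Return one dict per item line, with the price converted to int."""
    return [
        {"id": m.group("id"), "price": int(m.group("price"))}
        for m in ITEM_PATTERN.finditer(text)
    ]


sample = """
[Item] ID:PROD-001 Price:1200JPY Note:SALE
[Item] ID:PROD-888 Price:3500JPY Note:Standard
[Item] ID:ACC-99   Price:500JPY  Note:Bulk
"""

items = parse_items(sample)
total = sum(item["price"] for item in items)
print(items)
print(f"Total: {total} JPY")
```

Using \s+ between the two fields lets the pattern tolerate the uneven spacing on the ACC-99 line, and converting the price to int makes it usable for arithmetic such as the total.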

5. Summary

Python’s re module enables complex pattern matching that cannot be handled by simple string searches.

  • Use Raw Strings (r"...") to avoid the complexity of escaping backslashes.
  • Choose between search (find the first match) and findall (extract all matches) depending on your needs.
  • Understand the characteristics of Metacharacters to build flexible search patterns.

By properly combining these tools, you can significantly improve efficiency in tasks such as log analysis, web scraping, and data cleansing.

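About the author

This blog records what I have studied, what I have put into practice, and what I am currently working on.
I used to write mainly about asset management,
but lately I have become interested in programming, so that is all I write about now.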