The re module is indispensable for advanced text processing in Python. In this article, I will systematically explain its main functions, special characters (metacharacters), and the behavior of Raw Strings.
Regular expression syntax can often become complex, so this article is organized to serve as a quick reference during implementation.
Table of Contents
- Using Raw Strings
- Key Functions of the re Module
- List of Common Regular Expressions (Metacharacters)
- Practical Code Examples
- Summary
1. Using Raw Strings
In regular expressions, the backslash (\) is frequently used as an escape character. However, since Python’s standard strings also treat the backslash as an escape character, passing a literal backslash to the regex engine requires writing it as \\, which can make the code messy and hard to read.
By using “Raw Strings” (prefixing the string with r), you can instruct Python to interpret backslashes literally. This dramatically improves readability.
import re
# Standard string: every regex backslash must itself be escaped (hard to read)
# Intention: match a literal \my-host\ followed by any characters
pattern_normal = "\\\\my-host\\\\.*"
# Raw string: the pattern is written the way the regex engine sees it (recommended)
pattern_raw = r"\\my-host\\.*"
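As a quick check, the following minimal sketch confirms that the two spellings above produce the same pattern and that it matches a path beginning with a literal backslash. The sample string `\my-host\share` is an assumption made up for illustration.

```python
import re

# Both literals produce the same pattern text: \\my-host\\.*
pattern_normal = "\\\\my-host\\\\.*"
pattern_raw = r"\\my-host\\.*"
same_pattern = pattern_normal == pattern_raw

# Hypothetical sample input: a path containing literal backslashes.
sample = r"\my-host\share"
match = re.search(pattern_raw, sample)

# Without the r prefix, "\n" is one newline character;
# with it, r"\n" is two characters: a backslash and the letter n.
len_normal = len("\n")
len_raw = len(r"\n")

print(same_pattern)          # True
print(match.group(0))        # \my-host\share
print(len_normal, len_raw)   # 1 2
```

Note that the raw string does not remove the need for `\\` inside the pattern: the regex engine still needs two backslashes to match one literal backslash. The `r` prefix only removes the second layer of escaping that Python string literals would add.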
2. Key Functions of the re Module
The following table summarizes the four most frequently used functions and their return values.
| Function Name | Return Value | Explanation |
| --- | --- | --- |
| `re.search()` | `re.Match` object or `None` | Scans the entire string and returns the first match found. Returns `None` if not found. |
| `re.findall()` | list (of strings, or of tuples when the pattern has several groups) | Finds all non-overlapping matches in the string and returns them as a list. |
| `re.split()` | list (of strings) | Splits the string at each occurrence of the pattern. |
| `re.sub()` | str (replaced string) | Replaces the matches of the pattern with a specified string. |
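The four functions in the table can be compared side by side with a small sketch like the one below. The sample text and variable names are assumptions chosen for illustration.

```python
import re

# Hypothetical sample text used only for this comparison.
text = "apple 10, banana 25, cherry 7"

# re.search: first match only, as a Match object (or None).
first = re.search(r"\d+", text)
first_number = first.group(0) if first else None

# re.findall: every match, as a list of strings.
all_numbers = re.findall(r"\d+", text)

# With two groups in the pattern, findall returns a list of tuples.
pairs = re.findall(r"([a-z]+) (\d+)", text)

# re.split: cut the string at each comma plus optional whitespace.
parts = re.split(r",\s*", text)

# re.sub: replace every run of digits with a placeholder.
masked = re.sub(r"\d+", "#", text)

# A pattern that never occurs yields None from re.search.
missing = re.search(r"grape", text)

print(first_number)  # 10
print(all_numbers)   # ['10', '25', '7']
print(pairs)         # [('apple', '10'), ('banana', '25'), ('cherry', '7')]
print(parts)         # ['apple 10', 'banana 25', 'cherry 7']
print(masked)        # apple #, banana #, cherry #
print(missing)       # None
```

Because `re.search()` can return `None`, it is usually safest to test the result before calling `.group()`, as done for `first_number` above.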
3. List of Common Regular Expressions (Metacharacters)
This list explains the meanings and examples of special characters that construct regex patterns.
| Regex | Meaning | Example | Explanation of Example |
| --- | --- | --- | --- |
| `.` | Any character (except newline) | `a.c` | Matches “abc”, “a1c”, etc. |
| `^` | Start of the string (start of each line with `re.MULTILINE`) | `^Start` | Matches if the text starts with “Start”. |
| `$` | End of the string (end of each line with `re.MULTILINE`) | `End$` | Matches if the text ends with “End”. |
| `*` | 0 or more occurrences of the preceding element | `ab*c` | Matches “ac”, “abc”, “abbc”. |
| `+` | 1 or more occurrences of the preceding element | `ab+c` | Matches “abc”, “abbc” (does not match “ac”). |
| `?` | 0 or 1 occurrence of the preceding element | `ab?c` | Matches only “ac” or “abc”. |
| `\|` | Either pattern (OR) | `Apple\|…` | Matches “Apple” or the alternative written after the `\|`. |
| `(...)` | Grouping | `(ab)+` | Matches repetitions like “ab”, “abab”. |
| `[...]` | Character set (any one character) | `[a-z]` | Matches one lowercase letter. |
| `[^...]` | Negated character set (not in set) | `[^0-9]` | Matches any one character that is not a digit. |
| `\` | Escape | `\.` | Treats the special character `.` as a literal dot. |
| `{n}` | Exactly n occurrences | `\d{3}` | Matches exactly 3 digits (e.g., 123). |
| `{n,}` | n or more occurrences | `\d{3,}` | Matches 3 or more digits. |
| `{n,m}` | Between n and m occurrences | `\d{3,5}` | Matches between 3 and 5 digits. |
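Several rows of the table can be verified with a short sketch. It uses `re.fullmatch()` so that a candidate counts only when the whole string fits the pattern. The helper name `matches` and the candidate strings are assumptions for illustration. The last two lines also show that `^` anchors to the start of the whole string by default and to the start of each line only when `re.MULTILINE` is passed.

```python
import re


def matches(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]


# Quantifiers applied to the single character "b".
star = matches(r"ab*c", ["ac", "abc", "abbc"])
plus = matches(r"ab+c", ["ac", "abc", "abbc"])
optional = matches(r"ab?c", ["ac", "abc", "abbc"])

# A quantifier applied to a whole group.
group = matches(r"(ab)+", ["ab", "abab", "aba"])

# A bounded repetition count.
counted = matches(r"\d{3,5}", ["12", "123", "12345", "123456"])

# Anchors: "Start" sits at the beginning of the second line only.
lines = "Intro\nStart here"
default_anchor = re.search(r"^Start", lines)
multiline_anchor = re.search(r"^Start", lines, flags=re.MULTILINE)

print(star)       # ['ac', 'abc', 'abbc']
print(plus)       # ['abc', 'abbc']
print(optional)   # ['ac', 'abc']
print(group)      # ['ab', 'abab']
print(counted)    # ['123', '12345']
print(default_anchor)             # None
print(multiline_anchor.group(0))  # Start
```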
4. Practical Code Examples
Below is example code that combines the functions and regex symbols explained above to extract and format information from unstructured text. The scenario is extracting the order date, product IDs, and prices from an order email.
import re


def main():
    # Text data to be analyzed (e.g., excerpt from an order email)
    text_data = """
Order Date: 2023-11-05
[Item] ID:PROD-001 Price:1200JPY Note:SALE
[Item] ID:PROD-888 Price:3500JPY Note:Standard
[Item] ID:ACC-99 Price:500JPY Note:Bulk
End of list.
"""

    print("--- 1. Extract specific patterns with re.findall ---")
    # Product ID pattern: uppercase letters + hyphen + digits
    # The single group in r"ID:([A-Z]+-\d+)" makes findall return only the ID part.
    id_list = re.findall(r"ID:([A-Z]+-\d+)", text_data)
    print(f"Extracted IDs: {id_list}")

    print("\n--- 2. Search for date with re.search ---")
    # Date format: 4 digits - 2 digits - 2 digits
    date_match = re.search(r"Date:\s*(\d{4}-\d{2}-\d{2})", text_data)
    if date_match:
        # Get the content inside the parentheses using group(1)
        print(f"Order Date: {date_match.group(1)}")

    print("\n--- 3. Format data with re.sub ---")
    # Rewrite each price from "Price:1200JPY" to "Price: 1200 Yen".
    # \1 in the replacement refers to the digits captured by (\d+).
    clean_text = re.sub(r"Price:(\d+)JPY", r"Price: \1 Yen", text_data)

    # Display only the Item lines for verification,
    # processing the text line by line with splitlines()
    print("Formatted Data (Excerpt):")
    for line in clean_text.splitlines():
        if "[Item]" in line:
            print(line.strip())

    print("\n--- 4. Split by delimiter with re.split ---")
    # Split a line like "ID:PROD-001 Price:1200JPY" by whitespace (spaces/tabs)
    sample_line = "ID:PROD-001 Price:1200JPY"
    # \s+ matches one or more whitespace characters
    parts = re.split(r"\s+", sample_line)
    print(f"Original Data: '{sample_line}'")
    print(f"Split Result : {parts}")


if __name__ == "__main__":
    main()
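The script above extracts IDs and prices in separate steps. If each ID needs to stay paired with its price, one possible variation is `re.finditer()` with named groups, sketched below. The names `ITEM_PATTERN` and `parse_items` are hypothetical and not part of the script above, and the pattern assumes the same `ID:... Price:...JPY` layout as the sample email.

```python
import re

# Hypothetical pattern: named groups keep each ID next to its price.
ITEM_PATTERN = re.compile(r"ID:(?P<id>[A-Z]+-\d+)\s+Price:(?P<price>\d+)JPY")


def parse_items(text):
    """Return a list of (product_id, price) tuples, with price as an int."""
    return [
        (m.group("id"), int(m.group("price")))
        for m in ITEM_PATTERN.finditer(text)
    ]


# Item lines copied from the sample email.
sample = """
[Item] ID:PROD-001 Price:1200JPY Note:SALE
[Item] ID:PROD-888 Price:3500JPY Note:Standard
[Item] ID:ACC-99 Price:500JPY Note:Bulk
"""

items = parse_items(sample)
total = sum(price for _, price in items)

print(items)  # [('PROD-001', 1200), ('PROD-888', 3500), ('ACC-99', 500)]
print(total)  # 5200
```

Compiling the pattern once with `re.compile()` is a convenience when the same pattern is applied repeatedly. The module-level functions such as `re.findall()` would work here as well.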
5. Summary
Python’s re module enables complex pattern matching that cannot be handled by simple string searches.
- Use Raw Strings (`r"..."`) to avoid the complexity of escaping backslashes.
- Choose between `search` (find single match) and `findall` (extract all matches) depending on your needs.
- Understand the characteristics of Metacharacters to build flexible search patterns.
By properly combining these tools, you can significantly improve efficiency in tasks such as log analysis, web scraping, and data cleansing.
