In regular expressions, grouping with parentheses () lets you extract specific parts of a match rather than the whole matched string.
I will explain how to turn text data into a structured list by combining the findall function from Python’s standard re module with grouping.
Table of Contents
- The Problem to Solve
- Implementation Example: Parsing Server Access Logs
- Source Code
- Execution Result
- Explanation
The Problem to Solve
You often want to isolate and extract specific items (such as IDs, codes, or names) from log files or fixed-format text data. A pattern without groups returns the entire matched text, whereas grouping lets you obtain only the necessary elements, one tuple per match.
Implementation Example: Parsing Server Access Logs
In this scenario, we will extract three elements (“Request ID,” “Method,” and “Path”) from text data simulating server access logs.
Source Code
import re
# Text data to be analyzed (Assuming server logs)
# Format: [RequestId] Method Path
log_data = """
[Req001] GET /index.html
[Req002] POST /api/login
[Req003] GET /css/style.css
[Req004] DELETE /api/users/10
[Req005] PUT /api/settings
"""
# Definition of regex pattern
# Parts enclosed in () are extracted as groups
# 1. \[(Req\d+)\] : Extracts "Req" + digits inside []
# 2. ([A-Z]+) : Extracts one or more uppercase letters (Method)
# 3. (/[a-zA-Z0-9/._-]+) : Extracts the path
pattern = r'\[(Req\d+)\]\s+([A-Z]+)\s+(/[a-zA-Z0-9/._-]+)'
# Search for all parts matching the pattern with re.findall
# Because grouping is used, each match becomes a tuple of (Group 1, Group 2, Group 3)
matches = re.findall(pattern, log_data)
# Output results
print(f"Extracted count: {len(matches)}")
print("-" * 30)
for request_id, method, path in matches:
    print(f"ID: {request_id} | Method: {method:6} | Path: {path}")
Execution Result
Extracted count: 5
------------------------------
ID: Req001 | Method: GET    | Path: /index.html
ID: Req002 | Method: POST   | Path: /api/login
ID: Req003 | Method: GET    | Path: /css/style.css
ID: Req004 | Method: DELETE | Path: /api/users/10
ID: Req005 | Method: PUT    | Path: /api/settings
Explanation
About Regex Grouping
When you use () within a regular expression pattern, that part is treated as a “capture group.”
If the pattern contains two or more groups, re.findall returns a list of tuples containing the strings matched by the groups, rather than the entire matched string.
- Pattern: r'\[(Req\d+)\]\s+([A-Z]+)\s+(/[a-zA-Z0-9/._-]+)'
- Return Value: [('Req001', 'GET', '/index.html'), ('Req002', 'POST', '/api/login'), ...]
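To make the difference concrete, here is a minimal sketch that runs the same pattern with and without capture groups. The two-line log_data string is a shortened sample used only for illustration.

```python
import re

# Shortened sample in the same format as the log above (illustrative only)
log_data = "[Req001] GET /index.html\n[Req002] POST /api/login\n"

# No capture groups: findall returns each whole match as a string
whole = re.findall(r'\[Req\d+\]\s+[A-Z]+\s+/[a-zA-Z0-9/._-]+', log_data)

# Three capture groups: findall returns one 3-tuple per match
grouped = re.findall(r'\[(Req\d+)\]\s+([A-Z]+)\s+(/[a-zA-Z0-9/._-]+)', log_data)

print(whole)    # ['[Req001] GET /index.html', '[Req002] POST /api/login']
print(grouped)  # [('Req001', 'GET', '/index.html'), ('Req002', 'POST', '/api/login')]
```

The text being matched is identical in both calls; only the shape of the return value changes.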
Because only text that fits the whole pattern produces a match, malformed lines are simply skipped. This makes the approach safer and more concise than splitting strings with methods like split.
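The comparison with split can be sketched as follows. The helper names parse_with_split and parse_with_regex and the malformed line "disk usage high" are hypothetical, introduced only for this illustration.

```python
import re

lines = [
    "[Req001] GET /index.html",
    "disk usage high",  # hypothetical non-request line that happens to have three fields
]

pattern = re.compile(r'\[(Req\d+)\]\s+([A-Z]+)\s+(/[a-zA-Z0-9/._-]+)')

def parse_with_split(line):
    # Splits on whitespace and trusts the field count; nothing checks the content
    request_id, method, path = line.split()
    return request_id.strip("[]"), method, path

def parse_with_regex(line):
    # Returns the captured tuple, or None when the line does not fit the format
    hits = pattern.findall(line)
    return hits[0] if hits else None

split_results = [parse_with_split(line) for line in lines]
regex_results = [parse_with_regex(line) for line in lines]

print(split_results)  # the malformed line is silently accepted as a record
print(regex_results)  # the malformed line becomes None
```

In this sketch the split version turns the stray line into what looks like a valid record, while the regex version rejects it because the brackets, the uppercase method, and the leading slash are all part of the pattern.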
Important Note
- re.findall returns a list of all non-overlapping matches.
- If there is only one group in the pattern, it returns a list of strings instead of tuples. If you want to extract multiple items, define multiple groups as shown in the code above.
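Both points can be checked with a short sketch. The sample strings below are illustrative assumptions, not part of the original log.

```python
import re

# Shortened sample in the same format as the log above (illustrative only)
log_data = "[Req001] GET /index.html\n[Req002] POST /api/login\n"

# One group: the result is a flat list of strings
ids = re.findall(r'\[(Req\d+)\]', log_data)

# Two groups: the result is a list of tuples
pairs = re.findall(r'\[(Req\d+)\]\s+([A-Z]+)', log_data)

# Non-overlapping: scanning resumes after the end of each match,
# so "23" and "45" are never reported
digits = re.findall(r'\d\d', '12345')

print(ids)     # ['Req001', 'Req002']
print(pairs)   # [('Req001', 'GET'), ('Req002', 'POST')]
print(digits)  # ['12', '34']
```

If code downstream always unpacks tuples, keep this difference in mind when a pattern is reduced to a single group.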
