[Python] How to Extract Specific Information from Text Using re.findall and Grouping

In regular expressions, grouping with parentheses () is a powerful feature for extracting specific parts of a matched string rather than the whole match.

I will explain how to efficiently structure text data into a list by combining the findall function from Python’s standard re module with grouping.

Table of Contents

  1. The Problem to Solve
  2. Implementation Example: Parsing Server Access Logs
  3. Source Code
  4. Execution Result
  5. Explanation

The Problem to Solve

You often want to isolate and extract specific items (such as IDs, codes, or names) from log files or fixed-format text data. A pattern without groups returns the entire matched string, whereas a pattern with groups returns only the elements you need, packed into a tuple.


Implementation Example: Parsing Server Access Logs

In this scenario, we will extract three elements (“Request ID,” “Method,” and “Path”) from text data simulating server access logs.

Source Code

import re

# Text data to be analyzed (Assuming server logs)
# Format: [RequestId] Method Path
log_data = """
[Req001] GET /index.html
[Req002] POST /api/login
[Req003] GET /css/style.css
[Req004] DELETE /api/users/10
[Req005] PUT /api/settings
"""

# Definition of regex pattern
# Parts enclosed in () are extracted as groups
# 1. \[(Req\d+)\]         : Extracts "Req" + digits inside []
# 2. ([A-Z]+)             : Extracts uppercase letters (Method)
# 3. (/[a-zA-Z0-9/._-]+)  : Extracts the path
pattern = r'\[(Req\d+)\]\s+([A-Z]+)\s+(/[a-zA-Z0-9/._-]+)'

# Search for all parts matching the pattern with re.findall
# Because grouping is used, each match becomes a tuple of (Group 1, Group 2, Group 3)
matches = re.findall(pattern, log_data)

# Output results
print(f"Extracted count: {len(matches)}")
print("-" * 30)

for request_id, method, path in matches:
    print(f"ID: {request_id} | Method: {method:6} | Path: {path}")

Execution Result

Extracted count: 5
------------------------------
ID: Req001 | Method: GET    | Path: /index.html
ID: Req002 | Method: POST   | Path: /api/login
ID: Req003 | Method: GET    | Path: /css/style.css
ID: Req004 | Method: DELETE | Path: /api/users/10
ID: Req005 | Method: PUT    | Path: /api/settings

Explanation

About Regex Grouping

When you use () within a regular expression pattern, that part is treated as a “capture group.”

If groups exist within the pattern, re.findall returns a list of tuples containing the strings matched by the groups, rather than the entire matched string.

  • Pattern: r'\[(Req\d+)\]\s+([A-Z]+)\s+(/[a-zA-Z0-9/._-]+)'
  • Return Value: [('Req001', 'GET', '/index.html'), ('Req002', 'POST', '/api/login'), ...]
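
To make the difference concrete, here is a minimal sketch that runs the same pattern with and without parentheses on a single log line. The variable names (line, no_groups, with_groups) are chosen only for this illustration.

```python
import re

line = "[Req001] GET /index.html"

# No capture groups: findall returns each whole match as a string
no_groups = re.findall(r'\[Req\d+\]\s+[A-Z]+\s+/[a-zA-Z0-9/._-]+', line)

# Three capture groups: findall returns one tuple per match
with_groups = re.findall(r'\[(Req\d+)\]\s+([A-Z]+)\s+(/[a-zA-Z0-9/._-]+)', line)

print(no_groups)    # ['[Req001] GET /index.html']
print(with_groups)  # [('Req001', 'GET', '/index.html')]
```

Note that the brackets around the ID disappear in the grouped version because \[ and \] sit outside the parentheses.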

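This lets you check the format and extract the fields in a single step. Compared with splitting strings using methods like split, it is more robust, because lines that do not fit the pattern are simply skipped, and more concise, because delimiters such as the brackets are stripped by the pattern itself.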
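
The comparison with split can be sketched as follows. The sample text is hypothetical and contains one deliberately malformed line, which is an assumption made for this illustration and not part of the original log data.

```python
import re

# Hypothetical sample: the second line is malformed on purpose
sample = "[Req001] GET /index.html\nthis line is noise\n[Req002] POST /api/login"

pattern = r'\[(Req\d+)\]\s+([A-Z]+)\s+(/[a-zA-Z0-9/._-]+)'

# Regex: only lines that fit the format produce a tuple
regex_rows = re.findall(pattern, sample)

# split: every line is cut on whitespace, so the noise line is kept
# and the IDs still carry their brackets
split_rows = [row.split() for row in sample.splitlines()]

print(regex_rows)
print(split_rows)
```

With split you would still need extra code to drop the noise line and strip the brackets, while the regex handles both in the pattern.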

Important Note

  • re.findall returns a list of all non-overlapping matches.
  • If the pattern contains exactly one group, re.findall returns a list of strings instead of tuples. If it contains no groups, it returns the whole matched strings. To extract multiple items per match, define multiple groups as shown in the code above.
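
The single-group behavior can be checked with a short sketch like the one below. The second pattern uses a non-capturing group (?:...), which is standard re syntax but is not used in the article's code. It is shown here only as one way to group alternatives without adding an element to the result.

```python
import re

log_data = "[Req001] GET /index.html\n[Req002] POST /api/login"

# One capture group: the result is a flat list of strings, not tuples
ids = re.findall(r'\[(Req\d+)\]', log_data)

# (?:...) groups the alternatives without capturing them,
# so the path remains the only captured element
paths = re.findall(r'(?:GET|POST)\s+(/\S+)', log_data)

print(ids)    # ['Req001', 'Req002']
print(paths)  # ['/index.html', '/api/login']
```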