[Python] Leveraging Regex Flags: A Complete Guide to Multiline and DOTALL Modes

When you process text that contains newlines with Python’s re module, the flags you pass decide how ^, $, and . treat those newlines.

In particular, if you do not correctly understand the behavior of re.MULTILINE (for line-by-line matching) and re.DOTALL (for matching across newlines), you may not get the intended results.

In this article, I will explain the impact of these flags on special characters (^, $, .) and how to use them effectively, using chat logs as an example.

Table of Contents

  1. Correspondence Table of Special Characters and Flags
  2. Implementation Example: Chat Log Analysis
  3. Source Code
  4. Execution Result
  5. Explanation

1. Correspondence Table of Special Characters and Flags

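| Special Character | Default Behavior | re.MULTILINE | re.DOTALL |
|---|---|---|---|
| . (Dot) | Matches any single character except a newline | No change | Matches any character, including a newline |
| ^ (Caret) | Start of the entire string only | Also matches at the start of each line | No change |
| $ (Dollar) | End of the entire string (or just before a single trailing newline) | Also matches at the end of each line | No change |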
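
The rows of the table can be checked on a small string. The following is a minimal sketch; the two-line sample `text` is made up for illustration and is not part of the chat log used later.

```python
import re

# Hypothetical two-line sample (not the article's chat log)
text = "one\ntwo"

# ^ : default vs. re.MULTILINE
caret_default = re.findall(r"^\w+", text)                        # ['one']
caret_multiline = re.findall(r"^\w+", text, flags=re.MULTILINE)  # ['one', 'two']

# $ : default vs. re.MULTILINE
dollar_default = re.findall(r"\w+$", text)                        # ['two']
dollar_multiline = re.findall(r"\w+$", text, flags=re.MULTILINE)  # ['one', 'two']

# . : default vs. re.DOTALL
dot_default = re.findall(r".+", text)                   # ['one', 'two']
dot_dotall = re.findall(r".+", text, flags=re.DOTALL)   # ['one\ntwo']

print(caret_default, caret_multiline)
print(dollar_default, dollar_multiline)
print(dot_default, dot_dotall)
```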

2. Implementation Example: Chat Log Analysis

We will assume a chat log where user messages may span multiple lines, and we will perform two different types of extractions.

Scenario

From the log format below, we want to extract:

  1. Only the header line of each message (Username and Timestamp).
  2. The entire speech block, including the header and the message body.

3. Source Code

import re

# Analysis target: Chat app log data
# Contains username and timestamp lines, followed by messages (potentially multi-line)
chat_log = """[UserA] 10:00
Hello everyone.
Check this out.

[UserB] 10:05
Good morning!
I will check it later.

[UserC] 10:10
Thanks."""

# Pattern A: Extract only lines starting with [User...]
# ^ : Start of line, .+ : 1 or more characters
pattern_line = r"^\[User.*\].+$"

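# Pattern B: Extract the entire block from [User...] to the next empty line (or end)
# ^ : Start of line, .+? : Lazy match that may cross newlines under DOTALL
# (?=\n\n|\Z) : Stop just before a blank line or the end of the string.
# A plain $ does not work here: under MULTILINE it matches at the end of
# every line, so the lazy .+? would stop at the end of the header line.
pattern_block = r"^\[User.*?\].+?(?=\n\n|\Z)"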

print("--- 1. Using re.MULTILINE only ---")
# Match ^ and $ to the start/end of "each line"
# Result: Only header lines are extracted (message bodies are ignored)
headers = re.findall(pattern_line, chat_log, flags=re.MULTILINE)
for h in headers:
    print(f"Header found: {h}")

print("\n--- 2. Combining re.MULTILINE | re.DOTALL ---")
# MULTILINE : Lets ^ match at the start of each line, so every header can begin a match
# DOTALL    : Lets . match newlines, so one match can span a multi-line message
# Result: The entire speech block for each user is extracted
blocks = re.findall(pattern_block, chat_log, flags=re.MULTILINE | re.DOTALL)

for i, block in enumerate(blocks, 1):
    print(f"--- Block {i} ---\n{block}")

4. Execution Result

--- 1. Using re.MULTILINE only ---
Header found: [UserA] 10:00
Header found: [UserB] 10:05
Header found: [UserC] 10:10

--- 2. Combining re.MULTILINE | re.DOTALL ---
--- Block 1 ---
[UserA] 10:00
Hello everyone.
Check this out.
--- Block 2 ---
[UserB] 10:05
Good morning!
I will check it later.
--- Block 3 ---
[UserC] 10:10
Thanks.

5. Explanation

1. Processing each line individually (re.MULTILINE)

By default, ^ only matches the very beginning of the entire string (before [UserA]).

By specifying re.MULTILINE, ^ is interpreted as the “start of each line,” allowing it to detect the start of the lines for [UserB] and [UserC] as well. This is effective when you want to list only the headers of a log.
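
As a small illustration of the difference, the sketch below lists user names with and without the flag. The shortened `sample_log` and the `name_pattern` are assumptions made for this example, not the article's own pattern.

```python
import re

# Hypothetical shortened log in the same format as the article's example
sample_log = "[UserA] 10:00\nHello everyone.\n\n[UserB] 10:05\nGood morning!"

# Capture the name inside the brackets at the start of a line
name_pattern = r"^\[(User\w+)\]"

# Without the flag, ^ can only match at position 0 of the whole string
without_flag = re.findall(name_pattern, sample_log)

# With re.MULTILINE, ^ also matches right after every newline
with_flag = re.findall(name_pattern, sample_log, flags=re.MULTILINE)

print(without_flag)  # ['UserA']
print(with_flag)     # ['UserA', 'UserB']
```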

2. Grouping ranges including newlines (re.DOTALL)

If the message body contains newlines, the standard . will stop matching at the newline.

By specifying re.DOTALL, . will also match the newline character (\n), allowing you to capture text blocks that span multiple lines at once.
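
A minimal sketch of that behavior, using a two-line message as a made-up sample: the same pattern finds nothing by default and matches across the newline once re.DOTALL is set.

```python
import re

# Hypothetical two-line message body
message = "Hello everyone.\nCheck this out."

# The pattern has to cross the newline to get from "Hello" to "out."
span_pattern = r"Hello.*out\."

# Default: . refuses to match "\n", so the search fails
no_dotall = re.search(span_pattern, message)

# re.DOTALL: . matches "\n" too, so the whole message is matched
with_dotall = re.search(span_pattern, message, flags=re.DOTALL)

print(no_dotall)                  # None
print(repr(with_dotall.group()))  # 'Hello everyone.\nCheck this out.'
```

re.S is a shorter alias for re.DOTALL, just as re.M is for re.MULTILINE.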

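You can specify these two flags simultaneously using the bitwise OR operator |, like flags=re.MULTILINE | re.DOTALL. This enables flexible searches that “start from the beginning of a line and include content spanning multiple lines.” Keep in mind that under re.MULTILINE, $ still matches at the end of every line, so a lazy .+?$ stops at the end of the first line. To capture a whole block, end the pattern with an explicit terminator such as the lookahead (?=\n\n|\Z), which stops just before a blank line or the end of the string.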
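
When both flags are combined, how the block ends matters as much as how it starts. The sketch below compares a lazy match that ends in $ with one that ends in a lookahead for a blank line or the end of the string. The sample log and the names `naive_pattern`, `block_pattern`, and `inline_pattern` are assumptions for illustration.

```python
import re

# Hypothetical shortened log in the same format as the article's example
sample_log = (
    "[UserA] 10:00\nHello everyone.\nCheck this out.\n\n"
    "[UserB] 10:05\nGood morning!"
)
both_flags = re.MULTILINE | re.DOTALL

# Under MULTILINE, $ matches at the end of every line, so the lazy .+?
# is satisfied as soon as it reaches the end of the header line.
naive_pattern = r"^\[User\w+\].+?$"
naive_blocks = re.findall(naive_pattern, sample_log, flags=both_flags)

# A lookahead for a blank line (\n\n) or the end of the string (\Z)
# lets the lazy match run to the end of the whole block instead.
block_pattern = r"^\[User\w+\].+?(?=\n\n|\Z)"
blocks = re.findall(block_pattern, sample_log, flags=both_flags)

# The same flags can be written inline at the start of the pattern: (?ms)
inline_pattern = r"(?ms)^\[User\w+\].+?(?=\n\n|\Z)"
inline_blocks = re.findall(inline_pattern, sample_log)

print(naive_blocks)  # ['[UserA] 10:00', '[UserB] 10:05']
for block in blocks:
    print(repr(block))
```

The inline form (?ms) is convenient when the pattern is stored somewhere a flags argument cannot be passed, such as a configuration value.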
