Overview
This article explains how to extract header information (Subject, From, To) and the message body (plain text or HTML) from an EmailMessage object created by Python's email module. It uses the get_body() method, which selects a body part from a multipart email according to a priority order you specify.
Specifications (Input/Output)
- Input: An email.message.EmailMessage object (assumed to have been created using message_from_bytes(..., policy=policy.default)).
- Output:
  - Strings for Subject, From, and To headers.
  - String for the email body (HTML or Plain Text).
Basic Usage
Header information can be retrieved by specifying keys, similar to a dictionary. By using the get_body() method, you can retrieve the body without needing to worry about the complex multipart structure.
from email import policy, message_from_bytes

# Assuming raw_email is the received byte data
msg = message_from_bytes(raw_email, policy=policy.default)

# 1. Get header information
print(f"Subject: {msg.get('Subject')}")
print(f"From: {msg.get('From')}")

# 2. Get the body (Priority: Plain Text)
body = msg.get_body(preferencelist=('plain', 'html'))
if body is not None:
    print(body.get_content())
Full Code
This complete example builds an object from raw bytes and extracts the headers and the body (HTML if available, otherwise plain text).
from email import message_from_bytes, policy


def parse_email_object():
    """
    Demo function to create an EmailMessage object from sample raw email data
    and extract the subject, recipients, and body.
    """
    # Sample raw email data (bytes)
    # Usually, this is retrieved from a server using imaplib or similar libraries.
    # The blank lines matter: one separates the headers from the body, and one
    # separates each part's headers from that part's content.
    raw_email_data = b"""\
MIME-Version: 1.0
Subject: =?utf-8?B?44OG44K544OI44Oh44O844Or44Gu5Lu25ZCN?=
From: sender@example.com
To: receiver@example.com
Content-Type: multipart/alternative; boundary="boundary_text"

--boundary_text
Content-Type: text/plain; charset="utf-8"

This is the plain text body.
--boundary_text
Content-Type: text/html; charset="utf-8"

<html><body><h1>This is the HTML body.</h1></body></html>
--boundary_text--
"""

    # 1. Create EmailMessage object
    # Specifying policy=policy.default is very important as it enables header decoding
    # and makes the get_body() method available.
    msg = message_from_bytes(raw_email_data, policy=policy.default)

    print("--- Header Information ---")
    # Retrieve using msg.get(header_name). Returns None if the key does not exist.
    # (The sample has no Date header, so date is None here.)
    subject = msg.get("Subject")
    sender = msg.get("From")
    receiver = msg.get("To")
    date = msg.get("Date")
    print(f"Subject : {subject}")
    print(f"From : {sender}")
    print(f"To : {receiver}")
    print(f"Date : {date}")

    print("\n--- Extracting the Body ---")
    # 2. Identify the body part (get_body)
    # Use preferencelist to specify the priority of formats you want to retrieve.
    # ('html', 'plain') -> Gets HTML if available, otherwise Text.
    # ('plain', 'html') -> Gets Text if available, otherwise HTML.
    body_part = msg.get_body(preferencelist=('html', 'plain'))

    if body_part is not None:
        # 3. Extract content (get_content)
        # Retrieves the actual string data (automatically decoded).
        content = body_part.get_content()
        # Check which type was retrieved
        content_type = body_part.get_content_type()
        print(f"Content Type: {content_type}")
        print("Content:")
        print(content)
    else:
        print("Body not found (it might be an email with only attachments).")


if __name__ == "__main__":
    parse_email_object()
Customization Points
Main Methods of EmailMessage Objects
The following table lists the main methods and attributes used for parsing.
| Method / Attribute | Description | Example |
| --- | --- | --- |
| msg.get("Header-Name") | Retrieves the value of a specific header, or None (or a default passed as the second argument) if it is missing. If policy.default is applied, encoded values such as a Japanese subject are returned as decoded strings. | msg.get("Subject") |
| msg["Header-Name"] | Access headers in a dictionary style. Unlike a real dict, a missing header returns None instead of raising KeyError; use get() when you want a different default. | msg["From"] |
| msg.get_body(preferencelist=...) | Returns the best-matching body part (as an EmailMessage object) according to the specified priority list (e.g., html, plain) in a multipart email, or None if there is no candidate. | msg.get_body(preferencelist=('plain',)) |
| part.get_content() | Returns the decoded payload (content) of that part: a string for text parts, a byte sequence for most other types. | body_part.get_content() |
| part.iter_attachments() | Returns an iterator for the parts treated as attachments. | for f in msg.iter_attachments(): |
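As a rough illustration of the table, the sketch below builds a small message in memory and exercises header lookup and iter_attachments(). The addresses, the file name data.bin, and the X-Missing header are made up for the example.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

# Build a throwaway message so the example is self-contained (all values are made up).
draft = EmailMessage()
draft["Subject"] = "Report"
draft["From"] = "sender@example.com"
draft["To"] = "receiver@example.com"
draft.set_content("See the attached file.")
draft.add_attachment(
    b"\x00\x01\x02",
    maintype="application",
    subtype="octet-stream",
    filename="data.bin",
)

# Round-trip through bytes, as if the message had been fetched from a server.
msg = message_from_bytes(draft.as_bytes(), policy=policy.default)

# Both access styles return None when the header is absent.
missing_by_index = msg["X-Missing"]
missing_by_get = msg.get("X-Missing")
# get() also accepts a custom default as its second argument.
missing_with_default = msg.get("X-Missing", "n/a")
print(missing_by_index, missing_by_get, missing_with_default)

# Collect the file names of the parts treated as attachments.
attachment_names = [part.get_filename() for part in msg.iter_attachments()]
print(attachment_names)
```

The only practical difference between the two header access styles is that get() lets you choose the fallback value.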
Preference List (preferencelist)
The preferencelist argument in the get_body method takes a sequence (tuple or list) of subtype strings chosen from 'related', 'html', and 'plain' (e.g., 'html' matches text/html). If you omit it, the default is ('related', 'html', 'plain').
- ('html', 'plain'): Use this when you prefer a rich visual display.
- ('plain', 'html'): Use this when you prefer simple text processing or log saving.
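To make the effect of the ordering concrete, here is a minimal sketch that asks the same message for its body with both orderings. The sample message and its boundary string are invented for the example.

```python
from email import message_from_bytes, policy

# A made-up multipart/alternative message with a plain and an HTML version.
raw = b"""\
MIME-Version: 1.0
Subject: Demo
Content-Type: multipart/alternative; boundary="b"

--b
Content-Type: text/plain; charset="utf-8"

plain version
--b
Content-Type: text/html; charset="utf-8"

<p>html version</p>
--b--
"""

msg = message_from_bytes(raw, policy=policy.default)

# Same message, different priorities.
html_first = msg.get_body(preferencelist=("html", "plain"))
plain_first = msg.get_body(preferencelist=("plain", "html"))

print(html_first.get_content_type())   # text/html
print(plain_first.get_content_type())  # text/plain
```

Both calls return a part here because the message carries both versions; the order only decides which one wins.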
Important Notes
Forgetting the Policy Specification
If you omit the policy argument, as in message_from_bytes(data), the legacy compat32 policy is applied and the result is an email.message.Message rather than an EmailMessage. That class has no get_body() method (calling it raises AttributeError), and encoded subjects such as Japanese text are returned undecoded (e.g., =?utf-8?B?...?=). Always specify policy=policy.default.
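The difference can be checked directly. The following sketch parses the same bytes twice, once without a policy and once with policy.default; the sample subject is a made-up encoded word that decodes to "テスト".

```python
from email import message_from_bytes, policy

# A made-up single-part message with a Base64-encoded (RFC 2047) subject.
raw = (
    b"Subject: =?utf-8?B?44OG44K544OI?=\n"
    b"From: sender@example.com\n"
    b"\n"
    b"body\n"
)

legacy = message_from_bytes(raw)                         # compat32 policy
modern = message_from_bytes(raw, policy=policy.default)  # modern API

# The legacy object lacks get_body() and leaves the subject encoded.
print(type(legacy).__name__, hasattr(legacy, "get_body"))
print(legacy["Subject"])

# The modern object has get_body() and decodes the subject.
print(type(modern).__name__, hasattr(modern, "get_body"))
print(modern["Subject"])
```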
Cases with No Body
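If an email contains only attachments, or has no text part of a subtype listed in preferencelist, get_body() returns None. Always check the result before calling get_content(). Comparing with None explicitly (if body_part is not None:) is the safest form, because an EmailMessage with no headers has length 0 and is therefore falsy.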
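A minimal sketch of this case, assuming a message that carries a single binary attachment and no text part (the subject and file name are invented for the example):

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

# Build a message that has an attachment but no text body (values are made up).
draft = EmailMessage()
draft["Subject"] = "Files only"
draft.add_attachment(
    b"\x00\x01\x02",
    maintype="application",
    subtype="octet-stream",
    filename="data.bin",
)

msg = message_from_bytes(draft.as_bytes(), policy=policy.default)

# No plain or HTML part exists, so get_body() returns None.
body_part = msg.get_body(preferencelist=("plain", "html"))

# Guard before calling get_content(); fall back to an empty string here.
text = body_part.get_content() if body_part is not None else ""
print(repr(text))
```

Falling back to an empty string is only one option; raising an error or logging the message ID may suit other programs better.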
Encoding
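The get_content() method decodes text content using the charset parameter in the Content-Type header. If the sender's settings are incorrect, the result may be garbled: by default, bytes that cannot be decoded are replaced with U+FFFD instead of raising an error, so a UnicodeDecodeError appears only if you request strict decoding with get_content(errors='strict'). A charset name that Python does not recognize raises LookupError.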
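One defensive pattern is to wrap get_content() and fall back to decoding the raw payload yourself when the declared charset cannot be used. The sketch below is only an illustration: the helper name safe_text, the fallback charset, and the bogus charset in the sample message are all assumptions, not part of the email module.

```python
from email import message_from_bytes, policy


def safe_text(part, fallback_charset="utf-8"):
    """Return the text of a part, tolerating a bad or unknown charset.

    Hypothetical helper: tries the normal decoding first, then decodes the
    transfer-decoded bytes with a fallback charset, replacing bad bytes.
    """
    try:
        return part.get_content()
    except (LookupError, UnicodeError):
        raw_bytes = part.get_payload(decode=True) or b""
        return raw_bytes.decode(fallback_charset, errors="replace")


# A made-up message that declares a charset Python does not know.
raw = (
    b'Content-Type: text/plain; charset="x-unknown-charset"\n'
    b"\n"
    b"hello\n"
)

msg = message_from_bytes(raw, policy=policy.default)
part = msg.get_body(preferencelist=("plain",))
recovered = safe_text(part)
print(recovered)
```

Guessing a fallback charset can still produce wrong characters, so it is worth logging when the fallback path is taken.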
Advanced Usage
This example falls back to stripping HTML tags to obtain text when only an HTML body exists.
from email import policy, message_from_bytes
import re


def extract_text_forcefully(raw_data):
    msg = message_from_bytes(raw_data, policy=policy.default)

    # Search for the text part
    body_part = msg.get_body(preferencelist=('plain',))
    if body_part is not None:
        return body_part.get_content()

    # If there is no text and only HTML exists
    html_part = msg.get_body(preferencelist=('html',))
    if html_part is not None:
        html_content = html_part.get_content()
        # Simple tag removal (using regex)
        # For professional use, please use a library like BeautifulSoup.
        text_content = re.sub(r'<[^>]+>', '', html_content)
        return text_content.strip()

    # Neither a plain text nor an HTML body was found
    return ""
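If adding a third-party library is not an option, the standard library's html.parser can serve as a middle ground between the regex and BeautifulSoup. The sketch below is one possible approach, not a full HTML-to-text converter; the class name _TextCollector and the function html_to_text are made up for this example.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()  # convert_charrefs=True decodes entities like &amp;
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join the chunks and collapse runs of whitespace into single spaces.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html_content):
    parser = _TextCollector()
    parser.feed(html_content)
    parser.close()
    return parser.text()


sample = (
    "<html><body><h1>Hello &amp; welcome</h1>"
    "<style>h1{color:red}</style><p>Bye</p></body></html>"
)
print(html_to_text(sample))
```

Compared with the regex, this decodes character references and ignores style and script contents, but it still discards layout such as line breaks and lists.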
Summary
To parse an EmailMessage object, start by creating it with policy.default. After that, you can simply use msg.get("Subject") for headers and msg.get_body() for the content. This allows you to utilize email data in your program without worrying about complex MIME structures.
