To parse tables from a web page or a local HTML file and load them as DataFrames, use the pd.read_html function.
This article walks through a worked example, presenting the target HTML file and the Python script separately, and explains the key parameters.
Key Parameters of read_html
The following are the main options for adjusting how read_html interprets table data.
| Parameter | Meaning / Role | Example Specification |
| header | Specifies the row number to use as the header (column names). 0 uses the 1st row, 1 uses the 2nd row. | 0, 1 |
| index_col | Specifies the column to use as the DataFrame index (row label). Can be specified by column number (0-based). | 0, 1 |
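| flavor | Specifies the HTML parsing engine. By default pandas tries lxml first; switch to "bs4" or "html5lib" (more lenient parsing through BeautifulSoup and html5lib) if the HTML is malformed. | "lxml", "bs4", "html5lib" |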
| attrs | Specifies attributes in a dictionary to extract only tables with specific id or class. | {"id": "target_table"} |
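As a rough illustration of how attrs and index_col work together, the sketch below parses a small inline HTML string rather than a real page. The HTML snippet, the table ids, and the column names are invented for this example, and it assumes lxml is installed. The string is wrapped in StringIO because recent pandas versions deprecate passing raw HTML text directly.

```python
from io import StringIO

import pandas as pd

# Hypothetical HTML with two tables; only the second one is of interest.
html = """
<table id="skip_me">
  <tr><th>x</th></tr>
  <tr><td>1</td></tr>
</table>
<table id="target_table">
  <tr><th>Code</th><th>Label</th></tr>
  <tr><td>A1</td><td>alpha</td></tr>
  <tr><td>B2</td><td>beta</td></tr>
</table>
"""

# attrs keeps only tables whose attributes match the dictionary.
# index_col=0 turns the first column ("Code") into the row labels.
tables = pd.read_html(
    StringIO(html),
    attrs={"id": "target_table"},
    index_col=0,
)

df = tables[0]
print(f"Tables matched: {len(tables)}")
print(df)
```

Because the first row consists only of th cells, pandas treats it as the header without an explicit header argument; the result is still a list, so the DataFrame is taken with tables[0].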
Implementation Sample
This sample reads a locally saved HTML file (table_page.html) with Python.
1. Target HTML File
First, create the HTML file to be parsed. Here, we place two tables with different structures. The file name is table_page.html.
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Sample Table Page</title>
</head>
<body>
<h1>Product List (With Header)</h1>
<table id="product_table" border="1">
<thead>
<tr>
<th>Product_ID</th>
<th>Name</th>
<th>Price</th>
</tr>
</thead>
<tbody>
<tr>
<td>101</td>
<td>Desktop PC</td>
<td>120000</td>
</tr>
<tr>
<td>102</td>
<td>Monitor</td>
<td>35000</td>
</tr>
</tbody>
</table>
<br>
<h1>Employee List (No Header / Irregular)</h1>
<table id="employee_table" border="1">
<tr>
<td>Emp_001</td>
<td>T.Suzuki</td>
<td>Sales</td>
</tr>
<tr>
<td>Emp_002</td>
<td>M.Sato</td>
<td>Engineering</td>
</tr>
</table>
</body>
</html>
2. Python Code to Read Tables
Next is the Python code that reads the HTML file above and processes it as a DataFrame.
read_html can accept not only page URLs but also local file paths as arguments.
import pandas as pd


def read_html_tables():
    """
    Function to read table data from an HTML file and
    verify behavior differences based on parameters.
    """
    # File path of the target HTML
    file_path = "table_page.html"

    print("=== 1. Basic Reading ===")
    # read_html returns all detected tables as a "list of DataFrames".
    # Therefore, it is common to use a plural variable name (tables).
    try:
        tables = pd.read_html(file_path)
    except ValueError as e:
        # Raised when no table is found in the document
        print(f"Error: {e}")
        return

    print(f"Number of tables detected: {len(tables)}")

    # Display the first table (Product_ID etc. are recognized as headers)
    print("\n--- Table 1 (df_products) ---")
    df_products = tables[0]
    print(df_products)

    print("\n=== 2. Reading with Parameters ===")
    # Target the second table (Employee List)
    # header=None: Do not force a header row. This is also the default; pandas
    #              still infers headers from <thead>/<th> rows, and the employee
    #              table has none, so its first row is treated as data.
    # index_col=0: Use the 0th column (Emp_ID) as the index
    # flavor="bs4": Use BeautifulSoup as the parsing engine (lenient parsing;
    #               requires beautifulsoup4 and html5lib)
    tables_custom = pd.read_html(
        file_path,
        header=None,
        index_col=0,
        flavor="bs4",  # You might also specify html5lib etc.
    )

    # Get the second table
    # Stored in the order they appear in the HTML, so specify index 1
    if len(tables_custom) > 1:
        df_employees = tables_custom[1]
        # Manually set column names for clarity
        df_employees.columns = ["Name", "Department"]
        print("\n--- Table 2 (df_employees) ---")
        print(df_employees)


if __name__ == "__main__":
    # Please ensure 'table_page.html' exists in the current directory beforehand
    read_html_tables()
Execution Result
=== 1. Basic Reading ===
Number of tables detected: 2

--- Table 1 (df_products) ---
   Product_ID        Name   Price
0         101  Desktop PC  120000
1         102     Monitor   35000

=== 2. Reading with Parameters ===

--- Table 2 (df_employees) ---
             Name   Department
0
Emp_001  T.Suzuki        Sales
Emp_002    M.Sato  Engineering
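The script above picks the employee table by its position in the list (tables_custom[1]). A possible alternative, sketched here under a few assumptions, is to select it by its id with attrs so the code does not depend on table order. To stay self-contained, the sketch writes a trimmed copy of table_page.html into a temporary directory; the trimmed HTML and the index name Emp_ID are choices made for this example, and lxml is assumed to be installed.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Trimmed copy of table_page.html (one product row, both employee rows).
html = """<!DOCTYPE html>
<html><body>
<table id="product_table">
<thead><tr><th>Product_ID</th><th>Name</th><th>Price</th></tr></thead>
<tbody><tr><td>101</td><td>Desktop PC</td><td>120000</td></tr></tbody>
</table>
<table id="employee_table">
<tr><td>Emp_001</td><td>T.Suzuki</td><td>Sales</td></tr>
<tr><td>Emp_002</td><td>M.Sato</td><td>Engineering</td></tr>
</table>
</body></html>
"""

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "table_page.html"
    path.write_text(html, encoding="utf-8")

    # Read from a local file path, keeping only the table with this id.
    tables = pd.read_html(
        str(path),
        attrs={"id": "employee_table"},
        index_col=0,
    )

# Only one table matches, so it sits at position 0 regardless of page order.
df_employees = tables[0]
df_employees.index.name = "Emp_ID"
df_employees.columns = ["Name", "Department"]
print(df_employees)
```

Selecting by id tends to be more robust than a positional index when the page may gain or lose tables over time, at the cost of relying on the id staying stable.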
Important Notes
- Dependencies: read_html needs an HTML parser library in addition to pandas: lxml, or beautifulsoup4 together with html5lib (e.g., pip install lxml html5lib beautifulsoup4). Note that flavor="bs4", as used in the sample above, requires both beautifulsoup4 and html5lib.
- Return Value Type: read_html always returns a list. Even if there is only one table, the result has the form [DataFrame], so you need to extract individual DataFrames by index, such as tables[0].
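The list return value and the "no tables" case can be handled together. The following is a minimal sketch, not a recommended API: read_first_table is a hypothetical helper name, the inline HTML is invented, and flavor="lxml" is fixed on the assumption that lxml is installed, so that a failed match surfaces as ValueError rather than as an error from a fallback parser.

```python
from io import StringIO

import pandas as pd

# Hypothetical single-table document.
html = "<table><tr><th>a</th><th>b</th></tr><tr><td>1</td><td>2</td></tr></table>"

# Even with exactly one table, the result is a list of length 1.
tables = pd.read_html(StringIO(html), flavor="lxml")
print(type(tables).__name__, len(tables))
df = tables[0]


def read_first_table(source, **kwargs):
    """Hypothetical helper: return the first matching table, or None if none is found."""
    try:
        return pd.read_html(source, flavor="lxml", **kwargs)[0]
    except ValueError:
        # read_html raises ValueError when no table matches.
        return None


found = read_first_table(StringIO(html))
missing = read_first_table(StringIO(html), attrs={"id": "does_not_exist"})
print(found is not None, missing is None)
```

Returning None is only one possible convention; re-raising with a clearer message or returning an empty DataFrame may suit other code bases better.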
