[Python] How to Read HTML Tables with Pandas (read_html)

To parse table data from websites or local HTML files and import them as a DataFrame, use the pd.read_html function.

This article explains implementation examples distinguishing between the target HTML file and the Python script, along with key parameters.

目次

Key Parameters of read_html

The following are the main options for adjusting how read_html interprets table data.

ParameterMeaning / RoleExample Specification
headerSpecifies the row number to use as the header (column names). 0 uses the 1st row, 1 uses the 2nd row.0, 1
index_colSpecifies the column to use as the DataFrame index (row label). Can be specified by column number (0-based).0, 1
flavorSpecifies the HTML parsing engine. Change this if the HTML syntax is broken."lxml", "bs4", "html5lib"
attrsSpecifies attributes in a dictionary to extract only tables with specific id or class.{"id": "target_table"}

Implementation Sample

Here, we assume a situation where a locally saved HTML file (table_page.html) is read using Python.

1. Target HTML File

First, create the HTML file to be parsed. Here, we place two tables with different structures. The file name is table_page.html.

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Sample Table Page</title>
</head>
<body>

    <h1>Product List (With Header)</h1>
    <table id="product_table" border="1">
        <thead>
            <tr>
                <th>Product_ID</th>
                <th>Name</th>
                <th>Price</th>
            </tr>
        </thead>
        <tbody>
            <tr>
                <td>101</td>
                <td>Desktop PC</td>
                <td>120000</td>
            </tr>
            <tr>
                <td>102</td>
                <td>Monitor</td>
                <td>35000</td>
            </tr>
        </tbody>
    </table>

    <br>

    <h1>Employee List (No Header / Irregular)</h1>
    <table id="employee_table" border="1">
        <tr>
            <td>Emp_001</td>
            <td>T.Suzuki</td>
            <td>Sales</td>
        </tr>
        <tr>
            <td>Emp_002</td>
            <td>M.Sato</td>
            <td>Engineering</td>
        </tr>
    </table>

</body>
</html>

2. Python Code to Read Tables

Next is the Python code that reads the HTML file above and processes it as a DataFrame.

read_html can accept not only page URLs but also local file paths as arguments.

import pandas as pd

def read_html_tables():
    """
    Function to read table data from an HTML file and 
    verify behavior differences based on parameters.
    """
    # File path of the target HTML
    file_path = "table_page.html"

    print("=== 1. Basic Reading ===")
    # read_html returns all detected tables as a "list of DataFrames".
    # Therefore, it is common to use a plural variable name (tables).
    try:
        tables = pd.read_html(file_path)
    except ValueError as e:
        print(f"Error: {e}")
        return

    print(f"Number of tables detected: {len(tables)}")
    
    # Display the first table (Product_ID etc. are recognized as headers)
    print("\n--- Table 1 (df_products) ---")
    df_products = tables[0]
    print(df_products)


    print("\n=== 2. Reading with Parameters ===")
    
    # Target the second table (Employee List)
    # header=None: Explicitly state there is no header row (treat row 0 as data)
    # index_col=0: Use the 0th column (Emp_ID) as the index
    # flavor="bs4": Specify BeautifulSoup as the parsing engine (flexible parsing)
    
    tables_custom = pd.read_html(
        file_path,
        header=None,
        index_col=0,
        flavor="bs4"  # You might also specify html5lib etc.
    )
    
    # Get the second table
    # Stored in the order they appear in the HTML, so specify index 1
    if len(tables_custom) > 1:
        df_employees = tables_custom[1]
        
        # Manually set column names for clarity
        df_employees.columns = ["Name", "Department"]
        
        print("\n--- Table 2 (df_employees) ---")
        print(df_employees)

if __name__ == "__main__":
    # Please ensure 'table_page.html' exists in the current directory beforehand
    read_html_tables()

Execution Result

=== 1. Basic Reading ===
Number of tables detected: 2

--- Table 1 (df_products) ---
   Product_ID        Name   Price
0         101  Desktop PC  120000
1         102     Monitor   35000

=== 2. Reading with Parameters ===

--- Table 2 (df_employees) ---
             Name   Department
0                             
Emp_001  T.Suzuki        Sales
Emp_002    M.Sato  Engineering

Important Notes

  • Dependencies: To use this feature, you need to install additional HTML parser libraries such as lxml, html5lib, or beautifulsoup4 (e.g., pip install lxml html5lib beautifulsoup4).
  • Return Value Type: The return value of read_html is always a list. Even if there is only one table, it returns in the format [DataFrame], so you need to extract individual DataFrames using an index, such as tables[0].
よかったらシェアしてね!
  • URLをコピーしました!
  • URLをコピーしました!

この記事を書いた人

私が勉強したこと、実践したこと、してることを書いているブログです。
主に資産運用について書いていたのですが、
最近はプログラミングに興味があるので、今はそればっかりです。

目次