【Python】Getting Element Text, Attributes, and HTML with Selenium

Overview

This article explains how to extract information from HTML elements (WebElements) located with Selenium. It covers three kinds of data: the "text" displayed in the browser, the "attribute values" (such as href, src, id) defined in the HTML tag, and the "HTML source" of the element itself. In web scraping, this is the step that actually retrieves the data.

Specifications (Input/Output)

  • Input: A WebElement object already obtained using methods like find_element.
  • Output: Extracted strings (text, URLs, HTML code, etc.).
  • Requirement: Selenium WebDriver must be working correctly.

Basic Usage

This is the basic pattern to get an element’s text and its link URL (href attribute).

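from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
try:
    # The page and locator are placeholders; replace them with your own target
    driver.get("https://example.com")
    element = driver.find_element(By.TAG_NAME, "a")

    # 1. Get visible text
    # Gets the text enclosed by the tags
    print(f"Text: {element.text}")

    # 2. Get attribute value
    # Gets the "https://..." part of <a href="https://...">
    link_url = element.get_attribute("href")
    print(f"Link URL: {link_url}")
finally:
    driver.quit()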

Full Code

This practical demo code shows how to get HTML data attributes (such as data-price) and the internal HTML source (innerHTML). To make it easy to test, it uses the data: scheme to generate and load a simple HTML page on the fly.

from selenium import webdriver
from selenium.webdriver.common.by import By

def extract_element_info():
    """
    Demo function to extract text, attributes, and HTML source from elements.
    """
    driver = webdriver.Chrome()

    # Dummy HTML content (mimicking a product card structure)
    # <div id="product-card" class="item" data-id="999" data-price="1500">
    #     <h2 class="title">Official Python T-Shirt</h2>
    #     <a href="/buy/999">Buy Now</a>
    # </div>
    # Built as a single string so the URL has no leading whitespace or line breaks
    html_content = (
        "data:text/html;charset=utf-8,"
        "<div id='product-card' class='item' data-id='999' data-price='1500'>"
        "<h2 class='title'>Official Python T-Shirt</h2>"
        "<a href='/buy/999'>Buy Now</a>"
        "<span style='display:none'>Hidden Info</span>"
        "</div>"
    )

    try:
        # 1. Open the page
        driver.get(html_content)
        
        # 2. Locate the target element (Parent: product-card)
        card_element = driver.find_element(By.ID, "product-card")

        print("--- Basic Information ---")
        # .text property
        # Text from child elements is included, but hidden elements (display:none) are excluded.
        print(f"[text] Visible Text:\n{card_element.text}")

        print("\n--- Getting Attributes ---")
        # .get_attribute(attribute_name)
        product_id = card_element.get_attribute("data-id")
        price = card_element.get_attribute("data-price")
        class_name = card_element.get_attribute("class")
        
        print(f"[data-id]    : {product_id}")
        print(f"[data-price] : {price}")
        print(f"[class]      : {class_name}")

        print("\n--- Getting HTML Source ---")
        # innerHTML: Gets the content inside the element (including tags)
        inner = card_element.get_attribute("innerHTML")
        print(f"[innerHTML]: {inner.strip()}")
        
        # outerHTML: Gets the full HTML including the element itself
        outer = card_element.get_attribute("outerHTML")
        print(f"[outerHTML]: {outer.strip()}")

    finally:
        driver.quit()

if __name__ == "__main__":
    extract_element_info()
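
The demo above writes its data: URL by hand, which works for simple markup. As a minimal sketch for less tidy test pages, the markup can be percent-encoded with the standard library so that characters such as `#` or `%` are not misread as URL syntax. The helper name `to_data_url` is hypothetical and not part of Selenium.

```python
from urllib.parse import quote, unquote


def to_data_url(html: str) -> str:
    """Wrap an HTML snippet in a data: URL (hypothetical helper, not a Selenium API)."""
    # Percent-encode every character that is not URL-safe, so that '#'
    # (fragment marker) or '%' cannot truncate or corrupt the page.
    return "data:text/html;charset=utf-8," + quote(html, safe="")


snippet = "<div id='product-card' data-price='1500'><a href='#top'>Buy Now</a></div>"
url = to_data_url(snippet)
print(url)

# Decoding the payload gives back the original markup unchanged.
print(unquote(url.split(",", 1)[1]) == snippet)
```

The resulting string can be passed to driver.get(url) in the same way as the hand-written URL.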

Customization Points

Mapping Table for Properties and Methods

These are the main properties and methods of a Selenium WebElement object.

| Syntax | Type | Content Retrieved | Example |
|---|---|---|---|
| element.text | Property | Visible text inside the element | Top Page |
| element.get_attribute("href") | Method | Target URL of a link | https://example.com |
| element.get_attribute("src") | Method | Source URL of images or scripts | img/logo.png |
| element.get_attribute("value") | Method | Value entered in a form | MyPassword123 |
| element.get_attribute("innerHTML") | Method | HTML between the start and end tags | <span>Text</span> |
| element.get_attribute("outerHTML") | Method | Complete HTML including the element itself | <div><span>Text</span></div> |

Important Notes

The Trap of the text Property (Hidden Elements)

element.text returns only the text that is visible to the user in the browser. For an element hidden with CSS such as display: none; or visibility: hidden;, it returns an empty string. To read hidden text, use element.get_attribute("textContent").
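
One way to handle this is a small fallback helper, sketched below. The function name `get_text_including_hidden` is hypothetical, and the `SimpleNamespace` object is only a stand-in for a real WebElement so that the snippet runs without a browser.

```python
from types import SimpleNamespace


def get_text_including_hidden(element) -> str:
    """Return the visible text, falling back to textContent for hidden elements."""
    visible = element.text
    if visible:
        return visible
    # textContent is read from the DOM and is not affected by CSS visibility.
    raw = element.get_attribute("textContent")
    return raw.strip() if raw else ""


# Stand-in for a hidden <span style='display:none'>Hidden Info</span>
# (assumption: a real WebElement exposes the same .text and .get_attribute).
hidden_span = SimpleNamespace(
    text="",
    get_attribute={"textContent": "  Hidden Info "}.get,
)
print(get_text_including_hidden(hidden_span))  # Hidden Info
```

With a real driver, the same function would be called on the object returned by find_element.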

If an Attribute Does Not Exist

If the specified attribute is not present on the HTML tag, get_attribute() returns None. This does not raise an error by itself, but later processing that assumes a string (such as calling .strip() or converting to int) will fail, so check for None before using the value.
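
A defensive pattern for this case is sketched below. The helper `get_int_attribute` and the `data-stock` attribute are hypothetical, and the `SimpleNamespace` object stands in for a WebElement so the example runs without a browser.

```python
from types import SimpleNamespace


def get_int_attribute(element, name: str, default: int = 0) -> int:
    """Read an attribute and convert it to int, tolerating a missing attribute."""
    raw = element.get_attribute(name)  # None when the attribute is absent
    if raw is None:
        return default
    try:
        return int(raw.strip())
    except ValueError:
        # The attribute exists but does not hold a whole number.
        return default


# Stand-in for <div data-price='1500'> (no data-stock attribute).
card = SimpleNamespace(get_attribute={"data-price": "1500"}.get)
print(get_int_attribute(card, "data-price"))      # 1500
print(get_int_attribute(card, "data-stock", -1))  # -1
```

Returning a default keeps the calling code free of repeated None checks; raising an exception instead would be the better choice when a missing attribute signals a changed page layout.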

URL Completion

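When you read href or src with get_attribute(), you usually receive an absolute URL (http://site.com/page/1) even if the HTML contains a relative path (/page/1), because the browser resolves the value against the page URL. In Selenium 4, element.get_dom_attribute("href") returns the value as it is written in the HTML.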
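
Since the returned value may be relative or absolute depending on how it was read, one option is to normalize it yourself. The sketch below uses only the standard library; `to_absolute_url` is a hypothetical helper name, and the URLs are illustrative.

```python
from typing import Optional
from urllib.parse import urljoin


def to_absolute_url(page_url: str, href: Optional[str]) -> Optional[str]:
    """Resolve href against the page URL; absolute URLs are returned unchanged."""
    if not href:
        return None  # attribute missing or empty
    return urljoin(page_url, href)


print(to_absolute_url("http://site.com/list?page=2", "/page/1"))
# http://site.com/page/1
print(to_absolute_url("http://site.com/list", "https://example.com/"))
# https://example.com/
```

With a live session, this could be called as to_absolute_url(driver.current_url, element.get_attribute("href")).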

Advanced Usage

This example shows how to retrieve table data as a list of rows, where each row is a list of cell texts.

from selenium import webdriver
from selenium.webdriver.common.by import By

def scrape_table_data():
    driver = webdriver.Chrome()
    # Sample table
    driver.get("data:text/html;charset=utf-8,<table><tr><td>Apple</td><td>100</td></tr><tr><td>Orange</td><td>50</td></tr></table>")

    try:
        # Get all tr tags (rows)
        rows = driver.find_elements(By.TAG_NAME, "tr")
        
        data_list = []
        for row in rows:
            # Get td tags (cells) inside each row
            cells = row.find_elements(By.TAG_NAME, "td")
            
            # Extract text and convert to a list
            # Format: [Apple, 100], [Orange, 50]
            row_data = [cell.text for cell in cells]
            data_list.append(row_data)

        print(f"Extracted Data: {data_list}")

    finally:
        driver.quit()

if __name__ == "__main__":
    scrape_table_data()
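
Because cell.text always returns strings, the extracted rows often need a second pass before they are useful. The following sketch turns the list of lists into dictionaries and converts the price to an integer. The function `rows_to_records` and the column names `name` and `price` are assumptions made for illustration.

```python
def rows_to_records(rows, columns):
    """Turn [[cell, ...], ...] into a list of dicts keyed by column name."""
    records = []
    for row in rows:
        # Skip rows whose cell count does not match, for example a header
        # row made of <th> cells, which yields an empty list of <td> cells.
        if len(row) != len(columns):
            continue
        records.append(dict(zip(columns, row)))
    return records


data_list = [["Apple", "100"], ["Orange", "50"]]
records = rows_to_records(data_list, ["name", "price"])
for record in records:
    record["price"] = int(record["price"])

print(records)
# [{'name': 'Apple', 'price': 100}, {'name': 'Orange', 'price': 50}]
```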

Summary

  • Use element.text if you want the text visible on the page.
  • Use element.get_attribute("attribute_name") if you want values that are not displayed (URLs, IDs, form inputs).
  • Use element.get_attribute("outerHTML") if you want the HTML structure itself.

By choosing among these three, you can bring most of the information on a web page into Python.
