In web scraping, mastering the find() method is key to extracting only the data you need from a full HTML document.
Beyond simple tag-name searches, you can filter by id and class attributes, or search further down the hierarchy from an element you have already retrieved, which lets you handle complex web pages.
Here, using the HTML structure of a movie review site as an example, I will explain four main patterns for pinpointing the desired elements.
Executable Sample Code
The following script extracts the site title, the featured block, the rating score, and the top-ranked movie from an HTML string, using a different search condition for each.
from bs4 import BeautifulSoup


def main():
    # HTML to be scraped (Assuming a movie review site)
    html_doc = """
    <html>
    <body>
        <header>
            <h1>Movie Database</h1>
        </header>
        <div id="featured-content" class="highlight-box">
            <h2>Inception</h2>
            <div class="meta-data">
                <span class="genre">Sci-Fi</span>
                <span class="rating">Score: 9.8</span>
            </div>
            <p class="description">A thief who steals corporate secrets...</p>
        </div>
        <div id="sidebar">
            <h2>Top Ranking</h2>
            <ul>
                <li class="rank-item">The Dark Knight</li>
                <li class="rank-item">Interstellar</li>
            </ul>
        </div>
    </body>
    </html>
    """

    # Create BeautifulSoup object
    soup = BeautifulSoup(html_doc, "html5lib")

    print("=== 1. Search by Tag Name Only ===")
    # Retrieve the first <h1> tag in the page
    h1_tag = soup.find("h1")
    print(f"Site Title: {h1_tag.text}")

    print("\n=== 2. Search by ID Attribute ===")
    # Identify the tag with id="featured-content"
    # Since IDs are unique within a page, this is optimal for accessing specific elements
    featured_div = soup.find(id="featured-content")
    # Display the content (tag structure) of the retrieved Tag object
    # Using prettify() formats the output
    print(f"Found Block:\n{featured_div.prettify().strip()[:100]}...")

    print("\n=== 3. Search by Tag Name and Class Attribute ===")
    # Retrieve <span> tag with class="rating"
    # To distinguish from Python's reserved word 'class', the argument name is 'class_'
    rating_span = soup.find("span", class_="rating")
    print(f"Rating: {rating_span.text}")

    print("\n=== 4. Narrowing Down Using Hierarchical Structure (Method Chaining) ===")
    # First, retrieve the parent element (sidebar)
    sidebar = soup.find(id="sidebar")
    # Search for a child element (first li tag) inside the parent element
    # This limits the scope compared to searching the entire document, preventing false positives
    if sidebar:
        top_rank = sidebar.find("li", class_="rank-item")
        print(f"Top 1 Movie: {top_rank.text}")


if __name__ == "__main__":
    main()
Explanation: How to Specify Search Conditions
1. Specifying by Tag Name
soup.find("h1")
This is the most basic search method. It returns the first element in document order that matches the specified tag name, or None if no such element exists.
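As a minimal sketch of this behavior, the snippet below searches a made-up HTML fragment. The fragment and variable names are illustrative only, and Python's built-in "html.parser" is assumed here instead of html5lib so that nothing extra needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment for illustration; not taken from a real site
html = "<div><h2>Inception</h2><h2>Top Ranking</h2></div>"
soup = BeautifulSoup(html, "html.parser")

first_h2 = soup.find("h2")  # only the first <h2> in document order
missing = soup.find("h3")   # no <h3> exists, so find() returns None

print(first_h2.text)
print(missing)

# Check for None before accessing .text to avoid an AttributeError
title = first_h2.text if first_h2 is not None else "(not found)"
```

Because a failed search yields None rather than raising an exception, checking the result before reading .text is usually worthwhile on real pages whose structure may change.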
2. Specifying by ID Attribute
soup.find(id="specific-id")
Since an id is meant to be a unique identifier within an HTML page, it is well suited to pinpointing a specific location (such as a main content frame) in a single call.
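The following sketch shows a few equivalent ways to express an id search. The HTML fragment and variable names are assumptions made for illustration, and the built-in "html.parser" is used for convenience.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment for illustration
html = """
<div id="featured-content"><h2>Inception</h2></div>
<div id="sidebar"><h2>Top Ranking</h2></div>
"""
soup = BeautifulSoup(html, "html.parser")

by_keyword = soup.find(id="sidebar")              # keyword argument form
by_attrs = soup.find(attrs={"id": "sidebar"})     # dictionary form
by_tag_and_id = soup.find("div", id="sidebar")    # tag name plus id

# All three calls locate the same element
print(by_keyword.h2.text)
print(by_attrs.get("id"), by_tag_and_id.get("id"))
```

Adding the tag name is optional here, but it documents which kind of element you expect to find.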
3. Specifying by Class Attribute (Important)
soup.find("div", class_="some-class")
In Python, class is a reserved word used for class definitions, so it cannot be used as a keyword argument name. Use class_ (with a trailing underscore) instead.
You can also omit the tag name and write soup.find(class_="some-class"), which returns the first element with that class regardless of its tag type.
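A short sketch of class-based searches follows. The fragment is invented for illustration (the second class name "wide" in particular is hypothetical), and the built-in "html.parser" is assumed.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; the "wide" class is made up for this example
html = """
<div id="featured-content" class="highlight-box wide">
  <span class="genre">Sci-Fi</span>
  <span class="rating">Score: 9.8</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

rating = soup.find("span", class_="rating")   # tag name plus class
any_tag = soup.find(class_="genre")           # class only, any tag type
box = soup.find(class_="highlight-box")       # matches one of several classes
same_box = soup.find("div", attrs={"class": "wide"})  # dictionary form

print(rating.text)
print(any_tag.name)
print(box["class"])  # the class attribute is exposed as a list of names
```

An element with several classes matches when any one of them equals the string you pass, which is why both "highlight-box" and "wide" find the same div.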
4. Re-searching from Retrieved Tag Objects
div = soup.find(id="container")
p = div.find("p")
When a match exists, find() returns a Tag object (otherwise it returns None), and that Tag has its own find() method.
By utilizing this, you can narrow down the search by “getting a specific area (parent element) and looking for a specific element (child element) inside it.” This is an effective technique to avoid retrieving incorrect elements when many identical tags exist on the page.
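The sketch below contrasts a document-wide search with a scoped one. The fragment is made up so that the same class appears in two areas, and find_in is a hypothetical helper (not part of Beautiful Soup) that tolerates a missing parent. The built-in "html.parser" is assumed.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment in which "rank-item" appears in two different areas
html = """
<div id="featured-content"><ul><li class="rank-item">Inception</li></ul></div>
<div id="sidebar"><ul><li class="rank-item">The Dark Knight</li></ul></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Searching the whole document picks up the first match anywhere
global_first = soup.find("li", class_="rank-item")


def find_in(parent, *args, **kwargs):
    """Hypothetical helper: call find() on parent, or return None if parent is None."""
    return parent.find(*args, **kwargs) if parent is not None else None


# Searching inside the sidebar restricts matches to that area
sidebar = soup.find(id="sidebar")
scoped = find_in(sidebar, "li", class_="rank-item")

# A missing parent yields None instead of raising AttributeError
nothing = find_in(soup.find(id="no-such-id"), "li")

print(global_first.text)
print(scoped.text)
print(nothing)
```

Wrapping the chained call this way is one possible design. An explicit "if sidebar is not None:" check, as in the sample script, works just as well when there are only one or two levels.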
