The basic flow of "scraping", collecting specific data from web pages, involves fetching HTML with requests, parsing the structure with BeautifulSoup, and identifying the tags that hold the data you need.
Here, we will write code that extracts a list of product titles and the URLs of their detail pages, targeting Books to Scrape, a public sandbox site designed for scraping practice.
Executable Sample Code
This code extracts “Product Name” and “Link URL” from a fictional bookstore site and outputs them to the console. Pay attention to how it correctly traces the hierarchy of nested tags (article > h3 > a).
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin


def scrape_product_list():
    # Site for scraping practice (book list)
    target_url = "http://books.toscrape.com/"
    print(f"Fetching: {target_url} ...")

    try:
        # 1. Fetch HTML
        response = requests.get(target_url, timeout=10)
        response.raise_for_status()

        # 2. Create BeautifulSoup object
        #    Using the standard html.parser (html5lib is also acceptable)
        soup = BeautifulSoup(response.text, "html.parser")

        # 3. Identify the product list
        #    Identify the parent or repeating elements according to the site structure.
        #    Here, <article class="product_pod"> is the container for each product.
        products = soup.find_all("article", class_="product_pod")
        print(f"Found {len(products)} products.\n")

        # 4. Extract information for each product in a loop
        for product in products:
            # Look for the h3 > a tag inside the article tag
            h3_tag = product.find("h3")
            if h3_tag:
                link_tag = h3_tag.find("a")
                # If the a tag is found, get its text and href attribute
                if link_tag:
                    # The title attribute holds the full, unabbreviated name
                    title = link_tag.get("title")
                    # href is a relative path
                    href = link_tag.get("href")
                    # Convert the relative path to an absolute URL (practical processing)
                    full_url = urljoin(target_url, href)

                    print(f"Title: {title}")
                    print(f"URL: {full_url}")
                    print("-" * 40)

    except requests.RequestException as e:
        print(f"Communication error occurred: {e}")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")


if __name__ == "__main__":
    scrape_product_list()
Explanation: Key Points of Scraping
1. Distinction between find and find_all
soup.find_all(...): Retrieves every element that matches (e.g., product listings, news article lists) and returns them in a list-like format, which you then process one by one in a for loop.
tag.find(...): Searches inside a specific (parent) element and returns only the first matching child element (e.g., the title within that product).
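To illustrate the difference, the sketch below parses a small inline HTML fragment (a hypothetical stand-in for the real page, mimicking its article > h3 > a structure):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with the same structure as the product list
html = """
<article class="product_pod"><h3><a title="Book A" href="a.html">A...</a></h3></article>
<article class="product_pod"><h3><a title="Book B" href="b.html">B...</a></h3></article>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all: every matching element, as a list-like ResultSet
articles = soup.find_all("article", class_="product_pod")
print(len(articles))  # 2

# find: only the first match inside a given parent element
first_title = articles[0].find("h3").find("a").get("title")
print(first_title)  # Book A
```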
2. Retrieving Attribute Values (.get)
Attribute information such as link URLs is embedded inside HTML tags, not in the tag's text (.text). To extract the value of href from a tag like <a href="catalogue/..." ...>, use link_tag.get("href"). Unlike dictionary-style access (link_tag["href"]), .get returns None instead of raising an error when the attribute is missing.
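A minimal sketch of the difference between text and attributes, using a hypothetical anchor tag like those on the practice site:

```python
from bs4 import BeautifulSoup

# Hypothetical anchor tag similar to the site's "next page" link
tag = BeautifulSoup('<a href="catalogue/page-1.html">Next</a>', "html.parser").a

print(tag.text)          # Next  (the tag's text content)
print(tag.get("href"))   # catalogue/page-1.html  (an attribute value)
print(tag.get("title"))  # None  (.get returns None for a missing attribute)
```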
3. Joining URLs (urljoin)
If the retrieved link is a “relative path” like catalogue/page-1.html, it cannot be opened in a browser as is. A common processing pattern is to use urllib.parse.urljoin to combine it with the base URL and convert it into an “absolute path.”
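The conversion can be sketched in isolation like this (the second call shows that urljoin leaves an already-absolute URL untouched, so it is safe to apply to every link):

```python
from urllib.parse import urljoin

base = "http://books.toscrape.com/"

# A relative path is joined onto the base URL
print(urljoin(base, "catalogue/page-1.html"))
# http://books.toscrape.com/catalogue/page-1.html

# An already-absolute URL is returned unchanged
print(urljoin(base, "http://example.com/other"))
# http://example.com/other
```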
