Python BeautifulSoup Tutorial: Web scraping in 20 lines of code
Learn how to scrape a web page in 20 lines of code using Python and BeautifulSoup.
Web scraping is the process of extracting data from websites. Python, combined with BeautifulSoup, makes web scraping simple and efficient. In this tutorial, you’ll learn how to scrape a web page in just 20 lines of code.
What is BeautifulSoup?
BeautifulSoup is a Python library that makes it easy to scrape information from web pages. It sits on top of an HTML or XML parser and provides Pythonic idioms for iterating, searching, and modifying the parse tree.
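For a quick taste of those idioms, here is a minimal sketch that parses an in-memory HTML string; the snippet and its class name are made up for illustration, and no network access is needed:

```python
from bs4 import BeautifulSoup

# A tiny HTML document to demonstrate parsing, no web request required
html = "<html><body><h1>Hello</h1><p class='intro'>Welcome!</p></body></html>"
soup = BeautifulSoup(html, 'html.parser')

# Tags can be accessed as attributes, or searched with find()
print(soup.h1.text)
print(soup.find('p', class_='intro').text)
```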
Installation
First, install the required libraries:
pip install beautifulsoup4 requests
Or, if you also want the faster and more lenient lxml parser:
pip install beautifulsoup4 lxml
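Whichever parser you install, the only thing that changes in your code is the parser name passed as the second argument to BeautifulSoup; everything else in this tutorial stays the same:

```python
from bs4 import BeautifulSoup

html = "<p>Same API, different parser</p>"

# 'html.parser' ships with Python and needs no extra dependency
soup_builtin = BeautifulSoup(html, 'html.parser')
print(soup_builtin.p.text)

# With lxml installed, only the parser name changes:
# soup_fast = BeautifulSoup(html, 'lxml')
```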
Basic Web Scraping Example
Here’s a simple example that scrapes the title and all links from a webpage:
import requests
from bs4 import BeautifulSoup
# Fetch the webpage
url = 'https://example.com'
response = requests.get(url)
html_content = response.content
# Parse the HTML
soup = BeautifulSoup(html_content, 'html.parser')
# Extract title
title = soup.find('title').text
print(f"Page Title: {title}")
# Extract all links
links = soup.find_all('a')
for link in links:
    href = link.get('href')
    text = link.text
    print(f"{text}: {href}")
Scraping Specific Elements
Find by Tag
# Find first paragraph
paragraph = soup.find('p')
print(paragraph.text)
# Find all paragraphs
paragraphs = soup.find_all('p')
for p in paragraphs:
    print(p.text)
Find by Class
# Find element by class
div = soup.find('div', class_='content')
print(div.text)
# Find all elements with specific class
items = soup.find_all('div', class_='item')
for item in items:
    print(item.text)
Find by ID
# Find element by ID
header = soup.find(id='header')
print(header.text)
Find by Attributes
# Find by attribute
images = soup.find_all('img', src=True)
for img in images:
print(img['src'])
Complete Example: Scraping Product Information
import requests
from bs4 import BeautifulSoup
def scrape_products(url):
    # Fetch webpage
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find all product containers
    products = soup.find_all('div', class_='product')

    results = []
    for product in products:
        # Extract product name
        name = product.find('h2', class_='product-name').text.strip()

        # Extract price
        price = product.find('span', class_='price').text.strip()

        # Extract image URL
        img = product.find('img')
        image_url = img['src'] if img else None

        results.append({
            'name': name,
            'price': price,
            'image': image_url
        })

    return results
# Usage
products = scrape_products('https://example-shop.com/products')
for product in products:
    print(f"{product['name']}: {product['price']}")
Handling Errors
import requests
from bs4 import BeautifulSoup
def safe_scrape(url):
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # Raise an exception for bad status codes
        soup = BeautifulSoup(response.content, 'html.parser')
        return soup
    except requests.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return None
Best Practices
- Respect robots.txt: Check if scraping is allowed
- Add delays: Don’t overwhelm servers with requests
- Use headers: Set User-Agent to identify your scraper
- Handle errors: Always include error handling
- Be respectful: Don’t scrape too frequently
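The practices above can be sketched as two small helpers. Note that the helper names (`is_allowed`, `polite_get`), the User-Agent string, and the one-second delay are illustrative choices for this sketch, not a standard API:

```python
import time
import urllib.robotparser

import requests

def is_allowed(robots_txt, url, user_agent):
    """Check an already-fetched robots.txt body for permission to scrape url."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

def polite_get(url, user_agent='MyScraper/1.0', delay=1.0):
    """Fetch with an identifying User-Agent, raise on errors, then pause."""
    headers = {'User-Agent': user_agent}
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    time.sleep(delay)  # don't overwhelm the server with rapid requests
    return response
```

In a real scraper you would fetch `https://example.com/robots.txt` once, pass its text to `is_allowed` before each request, and route all fetches through `polite_get`.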
Advanced: Using CSS Selectors
# Select elements using CSS selectors
titles = soup.select('h1.title')
prices = soup.select('div.price')
links = soup.select('a[href^="https"]') # Links starting with https
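A detail worth knowing: select() always returns a list, while select_one() returns the first match or None. Here is a self-contained sketch using made-up HTML:

```python
from bs4 import BeautifulSoup

html = """
<div class="price">$10</div>
<div class="price">$20</div>
<a href="https://example.com">secure</a>
<a href="http://old.example.com">plain</a>
"""
soup = BeautifulSoup(html, 'html.parser')

prices = soup.select('div.price')              # list of all matches
first = soup.select_one('div.price')           # first match, or None
https_links = soup.select('a[href^="https"]')  # attribute prefix match

print([p.text for p in prices])
print(first.text)
print(len(https_links))
```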
Saving Scraped Data
import csv
import requests
from bs4 import BeautifulSoup
def scrape_and_save(url, filename):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    items = soup.find_all('div', class_='item')

    with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(['Title', 'Link', 'Description'])

        for item in items:
            title = item.find('h2').text
            link = item.find('a')['href']
            desc = item.find('p').text
            writer.writerow([title, link, desc])
# Usage
scrape_and_save('https://example.com', 'scraped_data.csv')
Conclusion
Web scraping with BeautifulSoup is a powerful skill for data extraction, automation, and research. Start with simple examples and gradually build more complex scrapers as you learn.