Python BeautifulSoup Tutorial: Web scraping in 20 lines of code
Learn how to scrape a web page in 20 lines of code using Python and BeautifulSoup.
Web scraping is the process of extracting data from websites. Python, combined with BeautifulSoup, makes web scraping simple and efficient. In this tutorial, you’ll learn how to scrape a web page in just 20 lines of code.
What is BeautifulSoup?
BeautifulSoup is a Python library that makes it easy to scrape information from web pages. It sits on top of an HTML or XML parser and provides Pythonic idioms for iterating, searching, and modifying the parse tree.
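For a quick taste of those idioms, here is a minimal sketch that parses an in-memory HTML string; the snippet and its class name are made up for illustration, and no network access is needed:

```python
from bs4 import BeautifulSoup

# A tiny HTML document to demonstrate parsing, no web request required
html = "<html><body><h1>Hello</h1><p class='intro'>Welcome!</p></body></html>"
soup = BeautifulSoup(html, 'html.parser')

# Tags can be accessed as attributes, or searched with find()
print(soup.h1.text)
print(soup.find('p', class_='intro').text)
```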
Installation
First, install the required libraries:
pip install beautifulsoup4 requests
Or, if you also want the faster and more lenient lxml parser:
pip install beautifulsoup4 lxml
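Whichever parser you install, the only thing that changes in your code is the parser name passed as the second argument to BeautifulSoup; everything else in this tutorial stays the same:

```python
from bs4 import BeautifulSoup

html = "<p>Same API, different parser</p>"

# 'html.parser' ships with Python and needs no extra dependency
soup_builtin = BeautifulSoup(html, 'html.parser')
print(soup_builtin.p.text)

# With lxml installed, only the parser name changes:
# soup_fast = BeautifulSoup(html, 'lxml')
```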
Basic Web Scraping Example
Here’s a simple example that scrapes the title and all links from a webpage:
import requests
from bs4 import BeautifulSoup
# Fetch the webpage
url = 'https://example.com'
response = requests.get(url)
html_content = response.content
# Parse the HTML
soup = BeautifulSoup(html_content, 'html.parser')
# Extract title
title = soup.find('title').text
print(f"Page Title: {title}")
# Extract all links
links = soup.find_all('a')
for link in links:
    href = link.get('href')
    text = link.text
    print(f"{text}: {href}")
Scraping Specific Elements
Find by Tag
# Find first paragraph
paragraph = soup.find('p')
print(paragraph.text)
# Find all paragraphs
paragraphs = soup.find_all('p')
for p in paragraphs:
    print(p.text)
Find by Class
# Find element by class
div = soup.find('div', class_='content')
print(div.text)
# Find all elements with specific class
items = soup.find_all('div', class_='item')
for item in items:
    print(item.text)
Find by ID
# Find element by ID
header = soup.find(id='header')
print(header.text)
Find by Attributes
# Find by attribute
images = soup.find_all('img', src=True)
for img in images:
print(img['src'])
Complete Example: Scraping Product Information
import requests
from bs4 import BeautifulSoup
def scrape_products(url):
    # Fetch webpage
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find all product containers
    products = soup.find_all('div', class_='product')

    results = []
    for product in products:
        # Extract product name
        name = product.find('h2', class_='product-name').text.strip()

        # Extract price
        price = product.find('span', class_='price').text.strip()

        # Extract image URL
        img = product.find('img')
        image_url = img['src'] if img else None

        results.append({
            'name': name,
            'price': price,
            'image': image_url
        })

    return results
# Usage
products = scrape_products('https://example-shop.com/products')
for product in products:
    print(f"{product['name']}: {product['price']}")
Handling Errors
import requests
from bs4 import BeautifulSoup
def safe_scrape(url):
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # Raise an exception for bad status codes
        soup = BeautifulSoup(response.content, 'html.parser')
        return soup
    except requests.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return None
Best Practices
- Respect robots.txt: Check if scraping is allowed
- Add delays: Don’t overwhelm servers with requests
- Use headers: Set User-Agent to identify your scraper
- Handle errors: Always include error handling
- Be respectful: Don’t scrape too frequently
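The practices above can be sketched as two small helpers. Note that the helper names (`is_allowed`, `polite_get`), the User-Agent string, and the one-second delay are illustrative choices for this sketch, not a standard API:

```python
import time
import urllib.robotparser

import requests

def is_allowed(robots_txt, url, user_agent):
    """Check an already-fetched robots.txt body for permission to scrape url."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

def polite_get(url, user_agent='MyScraper/1.0', delay=1.0):
    """Fetch with an identifying User-Agent, raise on errors, then pause."""
    headers = {'User-Agent': user_agent}
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    time.sleep(delay)  # don't overwhelm the server with rapid requests
    return response
```

In a real scraper you would fetch `https://example.com/robots.txt` once, pass its text to `is_allowed` before each request, and route all fetches through `polite_get`.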
Advanced: Using CSS Selectors
# Select elements using CSS selectors
titles = soup.select('h1.title')
prices = soup.select('div.price')
links = soup.select('a[href^="https"]') # Links starting with https
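A detail worth knowing: select() always returns a list, while select_one() returns the first match or None. Here is a self-contained sketch using made-up HTML:

```python
from bs4 import BeautifulSoup

html = """
<div class="price">$10</div>
<div class="price">$20</div>
<a href="https://example.com">secure</a>
<a href="http://old.example.com">plain</a>
"""
soup = BeautifulSoup(html, 'html.parser')

prices = soup.select('div.price')              # list of all matches
first = soup.select_one('div.price')           # first match, or None
https_links = soup.select('a[href^="https"]')  # attribute prefix match

print([p.text for p in prices])
print(first.text)
print(len(https_links))
```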
Saving Scraped Data
import csv
import requests
from bs4 import BeautifulSoup
def scrape_and_save(url, filename):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    items = soup.find_all('div', class_='item')

    with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(['Title', 'Link', 'Description'])

        for item in items:
            title = item.find('h2').text
            link = item.find('a')['href']
            desc = item.find('p').text
            writer.writerow([title, link, desc])
# Usage
scrape_and_save('https://example.com', 'scraped_data.csv')
Conclusion
Web scraping with BeautifulSoup is a powerful skill for data extraction, automation, and research. Start with simple examples and gradually build more complex scrapers as you learn.