Build a Complete Web Scraper with Requests and BeautifulSoup in Python

Scrape multiple paginated pages from a website using Requests and BeautifulSoup, with retry logic, error handling, and CSV export.

Medium | Python 3.9+ | Jun 27, 2026 | Automation & scripting

Requires third-party packages — install first
pip install requests beautifulsoup4

Python code

import requests
from bs4 import BeautifulSoup, Tag
import csv
import time
from typing import List, Dict, Optional
from urllib.parse import urljoin

class WebScraper:
    def __init__(self, base_url: str, output_file: str = "scraped_data.csv"):
        self.base_url = base_url
        self.output_file = output_file
        # Reuse one session so connections stay open between requests.
        self.session = requests.Session()
        # URL of the most recently fetched page; relative links resolve against it.
        self.last_url = base_url

    def fetch_page(self, url: str, retries: int = 3) -> Optional[BeautifulSoup]:
        """Fetch and parse url, returning None once every attempt has failed."""
        for attempt in range(retries):
            try:
                response = self.session.get(url, timeout=10)
                response.raise_for_status()
                self.last_url = response.url
                return BeautifulSoup(response.text, 'html.parser')
            except requests.RequestException as e:
                if attempt == retries - 1:
                    print(f"Error fetching {url}: {e}")
                    return None
                time.sleep(1)  # pause before the next attempt
        return None  # only reached when retries < 1

    @staticmethod
    def _text(parent: Tag, selector: str) -> str:
        """Return the stripped text of the first match, or '' if nothing matches."""
        node = parent.select_one(selector)
        return node.get_text(strip=True) if node else ''

    def parse_page(self, soup: BeautifulSoup) -> List[Dict[str, str]]:
        # The class names below are placeholders; they must match the target site's markup.
        items = []
        for product in soup.select('.product-item'):
            item = {
                'name': self._text(product, '.product-name'),
                'price': self._text(product, '.product-price'),
                'rating': self._text(product, '.rating')
            }
            items.append(item)
        return items

    def get_next_page(self, soup: BeautifulSoup) -> Optional[str]:
        next_link = soup.select_one('a.next-page')
        if next_link and next_link.get('href'):
            # urljoin handles relative and absolute hrefs alike.
            return urljoin(self.last_url, next_link['href'])
        return None
    
    def scrape_all_pages(self) -> List[Dict[str, str]]:
        all_items = []
        current_url = self.base_url
        page_num = 1
        
        while current_url:
            print(f"Scraping page {page_num}: {current_url}")
            soup = self.fetch_page(current_url)
            if not soup:
                break
                
            items = self.parse_page(soup)
            all_items.extend(items)
            print(f"Found {len(items)} items on page {page_num}")
            
            current_url = self.get_next_page(soup)
            page_num += 1
            time.sleep(0.5)
            
        return all_items
    
    def export_to_csv(self, data: List[Dict[str, str]]):
        if not data:
            print("No data to export")
            return
            
        with open(self.output_file, 'w', newline='', encoding='utf-8') as f:
            writer = csv.DictWriter(f, fieldnames=['name', 'price', 'rating'])
            writer.writeheader()
            writer.writerows(data)
        print(f"Exported {len(data)} items to {self.output_file}")

if __name__ == "__main__":
    scraper = WebScraper("http://books.toscrape.com/catalogue/page-1.html")
    all_data = scraper.scrape_all_pages()
    scraper.export_to_csv(all_data)

Output

stdout
Scraping page 1: http://books.toscrape.com/catalogue/page-1.html
Found 20 items on page 1
Scraping page 2: http://books.toscrape.com/catalogue/page-2.html
Found 20 items on page 2
...
Scraping page 50: http://books.toscrape.com/catalogue/page-50.html
Found 20 items on page 50
Exported 1000 items to scraped_data.csv
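
Output like the above depends on selectors that match the target site. The class names in the listing (.product-item, .next-page) are generic, so they would need adapting for books.toscrape.com. The sketch below assumes that site marks up each book as an article.product_pod card, with the full title in the title attribute of the h3 link, the price in a .price_color element, the rating as the second class on a p.star-rating element, and the pagination link inside li.next. Check these in the browser's developer tools before relying on them. The function names and the inline HTML sample are illustrative, not part of the original script.

```python
from typing import Dict, List, Optional
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical sample that mimics the assumed markup of one catalogue page.
SAMPLE_HTML = """
<article class="product_pod">
  <p class="star-rating Three"></p>
  <h3><a href="a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
  <p class="price_color">£51.77</p>
</article>
<ul class="pager"><li class="next"><a href="page-2.html">next</a></li></ul>
"""

def parse_books(soup: BeautifulSoup) -> List[Dict[str, str]]:
    """Extract name, price, and rating from each assumed product card."""
    items = []
    for card in soup.select('article.product_pod'):
        link = card.select_one('h3 a')
        price = card.select_one('.price_color')
        stars = card.select_one('p.star-rating')
        rating = ''
        if stars:
            # The class list looks like ['star-rating', 'Three'].
            extra = [c for c in stars.get('class', []) if c != 'star-rating']
            rating = extra[0] if extra else ''
        items.append({
            'name': link.get('title', link.get_text(strip=True)) if link else '',
            'price': price.get_text(strip=True) if price else '',
            'rating': rating,
        })
    return items

def next_page_url(soup: BeautifulSoup, current_url: str) -> Optional[str]:
    """Resolve the assumed li.next link against the page it was found on."""
    link = soup.select_one('li.next a')
    if link and link.get('href'):
        return urljoin(current_url, link['href'])
    return None

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
books = parse_books(soup)
next_url = next_page_url(soup, "http://books.toscrape.com/catalogue/page-1.html")
print(books)
print(next_url)
```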

How it works

The script defines a WebScraper class that keeps one requests.Session so connections are reused across requests. fetch_page makes up to 3 attempts per URL, pausing 1 second after each failure, and returns None if every attempt fails. parse_page uses BeautifulSoup's .select() to find the product cards and, inside each card, reads the name, price, and rating with .get_text(strip=True), falling back to an empty string when an element is missing. get_next_page looks for an <a class="next-page"> link and turns its href into the URL of the next page. scrape_all_pages repeats these steps, sleeping 0.5 seconds between pages, until there is no next link or a fetch fails. Finally, export_to_csv writes the collected items to a CSV file with csv.DictWriter. The selectors must match the target site's markup for any items to be found.
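
The retry loop in fetch_page waits a fixed second between attempts. A common refinement is exponential backoff, where each wait is twice the previous one. The helper below is a generic sketch, not part of the original class: retry_with_backoff, its parameters, and the flaky stand-in function are hypothetical names, and the sleep function is injectable so the behaviour can be checked without real waiting.

```python
import time
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")

def retry_with_backoff(
    func: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call func until it succeeds; wait base_delay * 2**n after the n-th failure."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception as exc:  # in a scraper, narrow this to requests.RequestException
            if attempt == attempts - 1:
                print(f"Giving up after {attempts} attempts: {exc}")
                return None
            sleep(base_delay * 2 ** attempt)
    return None  # only reached when attempts < 1

# Stand-in for a network call that fails twice and then succeeds.
calls = {"count": 0}

def flaky() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"

# Record the requested delays instead of actually sleeping.
delays: List[float] = []
result = retry_with_backoff(flaky, attempts=3, base_delay=1.0, sleep=delays.append)
print(result, delays)
```

In the scraper, func would be a small closure around self.session.get for one URL, and the default time.sleep would be left in place.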

Common mistakes

  • Forgetting to set a User-Agent header on the session: some sites block or throttle the default python-requests User-Agent.
  • Calling `.get_text()` on the result of `find` or `select_one` without checking for `None`: this raises `AttributeError` when the element is missing, so guard it with a conditional or a default value.
  • Not adding a delay between requests (`time.sleep`): this can overload the server or trigger rate limiting.
  • Assuming the next-page link is always relative: resolve it with `urllib.parse.urljoin`, which handles both absolute and relative URLs.
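
For the last point, urljoin from the standard library resolves an href against the URL of the page it was found on, whether the href is relative, root-relative, or already absolute. A brief sketch, using the sample site's catalogue URL as the current page; resolve_link is a hypothetical helper name and the example.com address is purely illustrative.

```python
from urllib.parse import urljoin

def resolve_link(current_url: str, href: str) -> str:
    """Turn an href found on current_url into an absolute URL."""
    return urljoin(current_url, href)

current = "http://books.toscrape.com/catalogue/page-1.html"

relative = resolve_link(current, "page-2.html")
root_relative = resolve_link(current, "/catalogue/page-3.html")
absolute = resolve_link(current, "https://example.com/other/page-4.html")

# Plain string concatenation glues the href onto the file name instead.
concatenated = current + "page-2.html"

print(relative)
print(root_relative)
print(absolute)
print(concatenated)
```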

Variations

  1. Use `aiohttp` with `asyncio` for asynchronous requests to scrape faster across many pages.
  2. Replace CSV export with SQLite insertion using `sqlite3` for persistent storage with query capabilities.
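
The second variation could look roughly like this. The sketch uses only the standard-library sqlite3 module; the table name products, the function name export_to_sqlite, the default file name, and the sample rows are assumptions, and the columns mirror the name, price, and rating fields of the CSV export.

```python
import sqlite3
from typing import Dict, List

def export_to_sqlite(data: List[Dict[str, str]], db_path: str = "scraped_data.db") -> int:
    """Insert scraped rows into a products table and return its row count."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "CREATE TABLE IF NOT EXISTS products (name TEXT, price TEXT, rating TEXT)"
            )
            # Named placeholders are filled from the keys of each dict.
            conn.executemany(
                "INSERT INTO products (name, price, rating) VALUES (:name, :price, :rating)",
                data,
            )
        return conn.execute("SELECT COUNT(*) FROM products").fetchone()[0]
    finally:
        conn.close()

# Made-up rows; ":memory:" keeps the database in RAM for this demonstration.
rows = [
    {"name": "Sample item A", "price": "10.00", "rating": "4"},
    {"name": "Sample item B", "price": "12.50", "rating": "5"},
]
count = export_to_sqlite(rows, db_path=":memory:")
print(count)
```

Storing price as TEXT matches what the scraper collects; converting it to a numeric column would need parsing that depends on the site's currency format.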

Real-world use cases

  • Collecting product listings and pricing data from e-commerce sites for competitive analysis.
  • Archiving job postings from a multi-page job board into a spreadsheet for offline review.
  • Monitoring changes in news headlines by scraping article titles across paginated archives daily.

Run locally

This sample needs third-party packages, so it cannot run in the browser IDE. Copy the code above, install the packages shown at the top, then run it in your own Python environment.
