Build a Complete Web Scraper with Requests and BeautifulSoup in Python
Scrape multiple paginated pages from a website using Requests and BeautifulSoup, with retry logic, error handling, and CSV export.
pip install requests beautifulsoup4
Python code
import csv
import time
from typing import Dict, List, Optional
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


class WebScraper:
    def __init__(self, base_url: str, output_file: str = "scraped_data.csv"):
        self.base_url = base_url
        self.output_file = output_file
        # Reuse one session so connections are pooled across requests.
        self.session = requests.Session()

    def fetch_page(self, url: str, retries: int = 3) -> Optional[BeautifulSoup]:
        """Fetch and parse a page, retrying when the request fails."""
        for attempt in range(retries):
            try:
                response = self.session.get(url, timeout=10)
                response.raise_for_status()
                return BeautifulSoup(response.text, 'html.parser')
            except requests.RequestException as e:
                if attempt == retries - 1:
                    print(f"Error fetching {url}: {e}")
                    return None
                time.sleep(1)  # Wait before the next attempt.
        return None

    def parse_page(self, soup: BeautifulSoup) -> List[Dict[str, str]]:
        """Extract one dict per product card; missing fields become ''."""
        items = []
        # Placeholder selectors: adapt them to the target site's markup.
        for product in soup.select('.product-item'):
            item = {
                'name': product.select_one('.product-name').get_text(strip=True) if product.select_one('.product-name') else '',
                'price': product.select_one('.product-price').get_text(strip=True) if product.select_one('.product-price') else '',
                'rating': product.select_one('.rating').get_text(strip=True) if product.select_one('.rating') else ''
            }
            items.append(item)
        return items

    def get_next_page(self, soup: BeautifulSoup, current_url: str) -> Optional[str]:
        """Return the URL of the next page, or None on the last page."""
        next_link = soup.select_one('a.next-page')
        if next_link and next_link.get('href'):
            # urljoin resolves relative and absolute hrefs against the current page.
            return urljoin(current_url, next_link['href'])
        return None

    def scrape_all_pages(self) -> List[Dict[str, str]]:
        all_items = []
        current_url = self.base_url
        page_num = 1
        while current_url:
            print(f"Scraping page {page_num}: {current_url}")
            soup = self.fetch_page(current_url)
            if not soup:
                break
            items = self.parse_page(soup)
            all_items.extend(items)
            print(f"Found {len(items)} items on page {page_num}")
            current_url = self.get_next_page(soup, current_url)
            page_num += 1
            time.sleep(0.5)  # Be polite: pause between page requests.
        return all_items

    def export_to_csv(self, data: List[Dict[str, str]]) -> None:
        if not data:
            print("No data to export")
            return
        with open(self.output_file, 'w', newline='', encoding='utf-8') as f:
            writer = csv.DictWriter(f, fieldnames=['name', 'price', 'rating'])
            writer.writeheader()
            writer.writerows(data)
        print(f"Exported {len(data)} items to {self.output_file}")


if __name__ == "__main__":
    scraper = WebScraper("http://books.toscrape.com/catalogue/page-1.html")
    all_data = scraper.scrape_all_pages()
    scraper.export_to_csv(all_data)
Output
Scraping page 1: http://books.toscrape.com/catalogue/page-1.html
Found 20 items on page 1
Scraping page 2: http://books.toscrape.com/catalogue/page-2.html
Found 20 items on page 2
...
Scraping page 50: http://books.toscrape.com/catalogue/page-50.html
Found 20 items on page 50
Exported 1000 items to scraped_data.csv
How it works
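The script defines a WebScraper class that holds a persistent requests.Session, so connections are reused across requests. fetch_page tries each URL up to 3 times, sleeping 1 second between attempts, and returns None if every attempt fails. parse_page uses BeautifulSoup's .select() to find product cards, then pulls the name, price, and rating out of each card with .get_text(strip=True), falling back to an empty string when an element is missing. get_next_page handles pagination: it looks for an <a class="next-page"> link and builds the next URL from its href. scrape_all_pages repeats these steps until no next link is found or a fetch fails, and export_to_csv then writes the collected items to a CSV file using csv.DictWriter. The CSS selectors (.product-item, .product-name, .product-price, .rating, a.next-page) are placeholders; replace them with selectors that match the markup of the site you are scraping.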
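The per-field extraction in parse_page repeats the same "select, check, get text" pattern three times. One way to factor it out is a small helper, sketched below. The helper name `text_or_default` and the sample markup are hypothetical and used only for illustration; they are not part of the original script.

```python
from bs4 import BeautifulSoup, Tag


def text_or_default(parent: Tag, selector: str, default: str = "") -> str:
    """Return the stripped text of the first match, or `default` if there is none."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node is not None else default


# Hypothetical markup, for illustration only. Note that there is no .rating element.
SAMPLE_HTML = """
<div class="product-item">
  <span class="product-name"> Widget </span>
  <span class="product-price">$9.99</span>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
card = soup.select_one(".product-item")

row = {
    "name": text_or_default(card, ".product-name"),
    "price": text_or_default(card, ".product-price"),
    "rating": text_or_default(card, ".rating"),  # missing element -> ""
}
print(row)  # {'name': 'Widget', 'price': '$9.99', 'rating': ''}
```

Each selector is evaluated once per field instead of twice, and a missing element yields the default rather than raising an AttributeError.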
Common mistakes
- Forgetting to set a User-Agent header on the session. Requests identifies itself as `python-requests/<version>` by default, and some sites block that or serve it different content.
- Calling `.get_text(strip=True)` on the result of `find` or `select_one` without checking for `None`. A missing element raises `AttributeError`, so guard the call with a conditional or a default value.
- Not adding a delay between requests (`time.sleep`), which may overload the server or trigger rate limiting.
- Assuming the next-page link is always relative. Resolve it against the current page URL so that both absolute and relative links work.
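The last point can be illustrated with `urllib.parse.urljoin` from the standard library, which resolves an href against the URL of the page it was found on. This is a minimal sketch; the function name `resolve_next` and the example hrefs are assumptions made for the demonstration.

```python
from urllib.parse import urljoin

# Example page URL; the hrefs below are made up for the demonstration.
current_url = "http://books.toscrape.com/catalogue/page-1.html"


def resolve_next(page_url: str, href: str) -> str:
    """Resolve a next-page href against the URL of the page it came from."""
    return urljoin(page_url, href)


relative = resolve_next(current_url, "page-2.html")
root_relative = resolve_next(current_url, "/catalogue/page-3.html")
absolute = resolve_next(current_url, "http://example.com/list?page=4")
naive = current_url + "page-2.html"  # plain string concatenation, for contrast

print(relative)       # http://books.toscrape.com/catalogue/page-2.html
print(root_relative)  # http://books.toscrape.com/catalogue/page-3.html
print(absolute)       # http://example.com/list?page=4
print(naive)          # http://books.toscrape.com/catalogue/page-1.htmlpage-2.html
```

Plain concatenation only works when the base URL ends exactly where the href should begin, while `urljoin` handles all three link forms.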
Variations
- Use `aiohttp` with `asyncio` for asynchronous requests to scrape faster across many pages.
- Replace CSV export with SQLite insertion using `sqlite3` for persistent storage with query capabilities.
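The SQLite variation might look like the sketch below. The function name `export_to_sqlite`, the table name `items`, and the all-TEXT schema are assumptions chosen to mirror the CSV columns; adjust them to your data.

```python
import sqlite3
from typing import Dict, List


def export_to_sqlite(data: List[Dict[str, str]], db_path: str = "scraped_data.db") -> int:
    """Insert scraped rows into an `items` table and return the table's row count."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # Commits on success, rolls back if an exception is raised.
            conn.execute(
                "CREATE TABLE IF NOT EXISTS items (name TEXT, price TEXT, rating TEXT)"
            )
            # Named placeholders are filled from each dict's keys.
            conn.executemany(
                "INSERT INTO items (name, price, rating) VALUES (:name, :price, :rating)",
                data,
            )
        return conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
    finally:
        conn.close()


# Made-up rows in the same shape that parse_page returns.
rows = [
    {"name": "Widget", "price": "$9.99", "rating": "4.5"},
    {"name": "Gadget", "price": "$19.99", "rating": ""},
]
count = export_to_sqlite(rows, ":memory:")  # In-memory database for the demo.
print(count)  # 2
```

Passing a file path instead of `":memory:"` keeps the data between runs, at which point repeated scrapes append to the same table unless you add a uniqueness constraint.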
Real-world use cases
- Collecting product listings and pricing data from e-commerce sites for competitive analysis.
- Archiving job postings from a multi-page job board into a spreadsheet for offline review.
- Monitoring changes in news headlines by scraping article titles across paginated archives daily.