How to Scrape Headlines from a News Website Using Beautiful Soup in Python
Scrape headline text from a news website using requests and Beautiful Soup with a CSS selector.
pip install requests beautifulsoup4
Python code
import requests
from bs4 import BeautifulSoup


def scrape_headlines(url: str, selector: str) -> list:
    """
    Scrape headlines from a news website using Beautiful Soup.

    Args:
        url: The URL of the news website.
        selector: CSS selector for headline elements.

    Returns:
        List of headline texts.
    """
    # A timeout keeps the call from hanging indefinitely on an unresponsive server.
    response = requests.get(url, timeout=10)
    # Raise an exception on 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    headline_elements = soup.select(selector)
    headlines = [element.get_text(strip=True) for element in headline_elements]
    return headlines


if __name__ == "__main__":
    # Example: scrape BBC News top stories.
    # The selector is site-specific; if nothing is printed, the markup has
    # probably changed and the selector needs updating.
    url = "https://www.bbc.com/news"
    selector = "h3.gs-c-promo-heading"
    headlines = scrape_headlines(url, selector)
    for i, headline in enumerate(headlines[:10], 1):
        print(f"{i}. {headline}")
Output (sample; live headlines will differ)
1. Ukraine war: Three years on
2. Trump's trade tariffs: What do they mean?
3. The tech billionaires shaping AI policy
4. Global temperatures hit record high
5. Inside the world's largest refugee camp
6. How to spot AI-generated images
7. The rise of electric cars in developing nations
8. New study reveals benefits of meditation
9. Why are bees disappearing?
10. The future of space exploration
How it works
requests.get fetches the page HTML, and raise_for_status raises an exception on HTTP error responses (4xx or 5xx), so the script stops instead of parsing an error page. BeautifulSoup(response.text, 'html.parser') parses the markup into a searchable tree. soup.select(selector) returns every element matching the CSS selector; for BBC News, h3.gs-c-promo-heading targets h3 elements that carry the class gs-c-promo-heading. A list comprehension extracts each element's text with get_text(strip=True), which trims leading and trailing whitespace from each piece of text. The result is a list of headline strings ready for display or further processing.
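To see the select and get_text steps without making a network request, here is a minimal sketch that parses an inline HTML string. The markup in SAMPLE_HTML is invented for illustration and is not real BBC output. It also shows a detail worth knowing: with strip=True and no separator, text from nested tags is joined with nothing in between.

```python
from bs4 import BeautifulSoup

# Invented sample markup for illustration only.
SAMPLE_HTML = """
<div>
  <h3 class="gs-c-promo-heading">  First headline </h3>
  <h3 class="other">Ignored</h3>
  <h3 class="gs-c-promo-heading"><span>Second</span> headline</h3>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Only the two h3 elements with the matching class are selected.
elements = soup.select("h3.gs-c-promo-heading")

# strip=True trims each text fragment, then joins fragments with "".
headlines = [el.get_text(strip=True) for el in elements]
print(headlines)  # ['First headline', 'Secondheadline']

# Passing a separator keeps a space between fragments from nested tags.
spaced = [el.get_text(" ", strip=True) for el in elements]
print(spaced)  # ['First headline', 'Second headline']
```

If a site wraps parts of its headlines in inner tags such as span, passing " " as the separator is probably the safer choice.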
Common mistakes
- Forgetting to install both requests and beautifulsoup4 via pip before importing.
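- Using a generic or outdated CSS selector that doesn't match the current site structure; soup.select then returns an empty list rather than raising an error.
- Not calling response.raise_for_status(), which lets 4xx and 5xx responses pass silently and leaves you parsing an error page.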
- Assuming the website allows scraping; always check robots.txt and terms of service.
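The robots.txt check in the last item can be sketched with the standard library's urllib.robotparser. This is one possible approach, not a complete compliance check: the can_scrape helper, the sample rules, and the "my-headline-bot" user agent are all hypothetical names made up for this example, and terms of service still need to be read by a person.

```python
from urllib.robotparser import RobotFileParser


def can_scrape(robots_lines, user_agent, url):
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # accepts an iterable of robots.txt lines
    return parser.can_fetch(user_agent, url)


# Invented robots.txt content for illustration only.
SAMPLE_ROBOTS = """User-agent: *
Disallow: /private/
""".splitlines()

allowed = can_scrape(SAMPLE_ROBOTS, "my-headline-bot", "https://example.com/news")
blocked = can_scrape(SAMPLE_ROBOTS, "my-headline-bot", "https://example.com/private/page")
print(allowed)  # True
print(blocked)  # False
```

Against a live site, you would instead call set_url with the site's robots.txt address followed by read() on the parser, then use can_fetch in the same way.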
Variations
- Use `soup.find_all('h2', class_='headline')` instead of a CSS selector for more explicit targeting.
- Add a `User-Agent` header to requests to avoid being blocked by some sites.
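Both variations can be sketched together as follows. The helper names, the USER_AGENT string, and the sample markup are assumptions made for this example. The session is only configured here, not used to fetch anything, so the snippet runs offline.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical identifier; replace it with a string describing your own client.
USER_AGENT = "headline-scraper-example/0.1"


def make_session(user_agent: str = USER_AGENT) -> requests.Session:
    """Create a session that sends the given User-Agent on every request."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


def headlines_with_find_all(html: str, tag: str = "h2", css_class: str = "headline") -> list:
    """Collect text from elements with the given tag name and class."""
    soup = BeautifulSoup(html, "html.parser")
    return [el.get_text(strip=True) for el in soup.find_all(tag, class_=css_class)]


# Invented sample markup for illustration only.
SAMPLE_HTML = """
<h2 class="headline">First story</h2>
<h2 class="teaser">Not a headline</h2>
<h2 class="headline featured">Second story</h2>
"""

session = make_session()
found = headlines_with_find_all(SAMPLE_HTML)
print(found)  # ['First story', 'Second story']
```

To fetch a real page, call session.get(url, timeout=10) and pass response.text to headlines_with_find_all. Note that class_='headline' also matches elements that carry additional classes, as the third h2 shows.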
Real-world use cases
- Monitoring competitor news or industry trends by automatically collecting headlines daily.
- Building a personal news aggregator that pulls top stories from multiple sources.
- Populating a dataset of news article titles for natural language processing or sentiment analysis.
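For the aggregator use case, one likely need is removing duplicate headlines that appear on several sources. The sketch below is one simple way to do that, assuming headlines have already been scraped into a dictionary keyed by source name. The merge_headlines function and the sample data are made up for illustration.

```python
def merge_headlines(by_source: dict) -> list:
    """Merge headlines from several sources, dropping case-insensitive duplicates.

    Returns (source, headline) pairs in first-seen order.
    """
    seen = set()
    merged = []
    for source, headlines in by_source.items():
        for headline in headlines:
            # Normalize whitespace and case so near-identical titles compare equal.
            key = " ".join(headline.split()).casefold()
            if key and key not in seen:
                seen.add(key)
                merged.append((source, headline))
    return merged


# Invented sample data for illustration only.
SAMPLE = {
    "site-a": ["Markets rally on rate cut", "Storm closes coastal roads"],
    "site-b": ["markets rally on  rate cut", "New museum opens downtown"],
}

merged = merge_headlines(SAMPLE)
for source, headline in merged:
    print(f"[{source}] {headline}")
```

Exact-match deduplication like this misses reworded versions of the same story; catching those would need fuzzy matching, which is beyond this sketch.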