Discover RSS Feeds From Any Website in Python
Scrape a website's HTML to automatically find all linked RSS or Atom feed URLs using requests, BeautifulSoup, and regex.
pip install requests beautifulsoup4
Python code
import requests
import re
from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup

def discover_rss_feeds(url):
    """Discover all RSS/Atom feeds linked from a given website."""
    try:
        headers = {'User-Agent': 'Mozilla/5.0 (compatible; RSSDiscovery/1.0)'}
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
    except requests.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return []

    soup = BeautifulSoup(response.text, 'html.parser')
    feeds = set()

    # Find <link> tags with RSS/Atom types
    for link in soup.find_all('link', type=re.compile(r'application/(rss|atom)\+xml', re.I)):
        href = link.get('href')
        if href:
            feeds.add(urljoin(url, href))

    # Find <a> tags linking to .rss or .xml, or containing /feed in the URL
    for a in soup.find_all('a', href=True):
        href = a['href']
        if re.search(r'\.(rss|xml)$', href, re.I) or '/feed' in href.lower():
            full_url = urljoin(url, href)
            if urlparse(full_url).netloc == urlparse(url).netloc:
                feeds.add(full_url)

    return sorted(feeds)

if __name__ == "__main__":
    # Example usage
    website = "https://news.ycombinator.com"
    feeds = discover_rss_feeds(website)
    if feeds:
        print(f"Found {len(feeds)} feed(s) on {website}:")
        for feed in feeds:
            print(f" {feed}")
    else:
        print(f"No feeds discovered on {website}")
Output
Found 2 feed(s) on https://news.ycombinator.com:
https://news.ycombinator.com/rss
https://news.ycombinator.com/atom.xml
How it works
This script fetches a webpage with a custom User-Agent header and parses it with BeautifulSoup. It collects feed URLs from <link> tags whose type attribute matches application/rss+xml or application/atom+xml, and from <a> tags whose href ends with .rss or .xml, or contains /feed. Relative URLs are resolved to absolute ones with urljoin. Links found in <a> tags are kept only if they stay on the same host as the page, which filters out external noise; feeds declared in <link> tags are kept regardless of host. The result is a sorted, deduplicated list of discovered feeds.
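The URL handling described above can be seen in isolation with a few lines of standard-library code. This is only an illustrative sketch: `page_url`, the sample hrefs, and the `same_host` helper are made up for the example and are not part of the script.

```python
from urllib.parse import urljoin, urlparse

page_url = "https://example.com/blog/post.html"  # hypothetical page

# Relative hrefs are resolved against the page URL.
relative = urljoin(page_url, "feed.xml")       # sibling of post.html
root_relative = urljoin(page_url, "/rss")      # resolved from the site root
absolute = urljoin(page_url, "https://feeds.example.org/all.xml")  # left as-is

def same_host(candidate, base):
    """Mirror the script's filter for <a> links: compare the netloc parts."""
    return urlparse(candidate).netloc == urlparse(base).netloc

print(relative)       # https://example.com/blog/feed.xml
print(root_relative)  # https://example.com/rss
print(absolute)       # https://feeds.example.org/all.xml
print(same_host(relative, page_url), same_host(absolute, page_url))  # True False
```

Note that the comparison is an exact match on the host part, so `www.example.com` and `example.com` count as different hosts.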
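The script catches request errors (timeouts, connection failures, and HTTP error statuses), prints a message, and returns an empty list, the same value it returns when a page has no feeds. A batch or scheduled scan can therefore continue past a failing site without extra error handling.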
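For batch scanning, one possible wrapper looks like the sketch below. The name `scan_sites` and its `delay` parameter are hypothetical; the discovery function is passed in as an argument so the loop can be tried without network access.

```python
import time

def scan_sites(sites, discover, delay=1.0):
    """Run a discovery function over several sites, pausing between requests.

    `discover` is any callable that takes a URL and returns a list of feed
    URLs, for example the discover_rss_feeds function from this page.
    """
    results = {}
    for index, site in enumerate(sites):
        if index:
            time.sleep(delay)  # small pause so the scan does not hammer servers
        results[site] = discover(site)
    return results
```

A call would look like `scan_sites(["https://example.com", "https://example.org"], discover_rss_feeds)`. Because the discovery function returns an empty list both for fetch errors and for pages without feeds, the result dictionary cannot tell those two cases apart.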
Common mistakes
- Forgetting to handle relative URLs with urljoin — leads to broken or incomplete feed links.
- Not filtering by same domain — picks up external feed links from widgets or embeds.
- Using an overly strict regex that misses feeds served with `.xml` or paths containing `/feed/`.
- Skipping a custom User-Agent — some sites block scripts with default Python user agents.
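Two of the pitfalls above, link matching and the same-domain filter, can be tightened with small helpers. The sketch below is one possible approach rather than part of the original script: `looks_like_feed_href` and `same_site` are hypothetical names, and treating `www.example.com` and `example.com` as the same site is an assumption that may not suit every case.

```python
import re
from urllib.parse import urlparse

# Hypothetical helpers, not part of the original script.
FEED_SUFFIX = re.compile(r"\.(rss|xml)$", re.I)        # path ends in .rss or .xml
FEED_SEGMENT = re.compile(r"(^|/)feed(/|$)", re.I)     # "feed" as a whole path segment

def looks_like_feed_href(href):
    """Guess whether an href points at a feed, judging by its path only."""
    path = urlparse(href).path  # drops any ?query and #fragment
    return bool(FEED_SUFFIX.search(path) or FEED_SEGMENT.search(path))

def same_site(url_a, url_b):
    """Compare hosts case-insensitively, ignoring a leading 'www.' (assumption)."""
    def host(url):
        name = (urlparse(url).hostname or "").lower()
        return name[4:] if name.startswith("www.") else name
    return host(url_a) == host(url_b)

print(looks_like_feed_href("/index.xml?format=rss"))  # True
print(looks_like_feed_href("/feedback"))              # False
print(same_site("https://www.example.com/a", "https://example.com/b"))  # True
```

Checking only the path means a query string no longer hides a `.xml` suffix, and matching `feed` as a whole segment avoids false positives such as `/feedback`.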
Variations
- Use `feedparser` to validate discovered URLs by attempting to parse them as actual feeds.
- Extend discovery to check common well-known paths like `/rss`, `/feed`, or `/atom.xml` even if not linked on the page.
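Both variations can be sketched with the standard library. The helper names below are hypothetical, and the root-element check is a lightweight stand-in for `feedparser`, not a replacement: it only confirms that the content is well-formed XML whose root looks like RSS or Atom, whereas `feedparser` is far more tolerant of malformed feeds.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Paths taken from the list above; real sites may use others.
WELL_KNOWN_PATHS = ("/rss", "/feed", "/atom.xml")

def candidate_feed_urls(site_url, paths=WELL_KNOWN_PATHS):
    """Build absolute URLs for common feed locations on a site."""
    return [urljoin(site_url, path) for path in paths]

def feed_kind(content):
    """Return 'rss', 'atom', or None based on the XML root element."""
    try:
        root = ET.fromstring(content)
    except ET.ParseError:
        return None  # not well-formed XML, so not treated as a feed
    tag = root.tag.rsplit("}", 1)[-1].lower()  # strip any XML namespace
    if tag in ("rss", "rdf"):  # RSS 2.0 uses <rss>; RSS 1.0 uses <rdf:RDF>
        return "rss"
    if tag == "feed":
        return "atom"
    return None

print(candidate_feed_urls("https://example.com/blog/"))
print(feed_kind('<rss version="2.0"><channel/></rss>'))  # rss
print(feed_kind("<html><body/></html>"))                 # None
```

In practice each candidate URL would be fetched (for example with `requests.get`) and the response body passed to `feed_kind`. Since `xml.etree.ElementTree` is not hardened against maliciously crafted XML, a package such as `defusedxml` may be a safer choice for untrusted sites.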
Real-world use cases
- Building a content aggregator that automatically subscribes to feeds from bookmarked websites.
- Monitoring competitor blogs or news sites by discovering and fetching their latest RSS feeds.
- Automating podcast feed discovery from a list of show homepage URLs to populate a directory.