Build a Complete Website Sitemap Generator Without External Services
Crawl a website recursively using only Python's standard library to generate a structured sitemap of internal links.
Python code
54 linesimport json
from urllib.parse import urlparse, urljoin
from collections import deque
import urllib.request
import urllib.error
import re
from html.parser import HTMLParser
class SitemapParser(HTMLParser):
def __init__(self, base_url):
super().__init__()
self.base_url = base_url
self.links = set()
def handle_starttag(self, tag, attrs):
if tag == 'a':
attrs_dict = dict(attrs)
href = attrs_dict.get('href')
if href:
full_url = urljoin(self.base_url, href)
parsed = urlparse(full_url)
if parsed.netloc and parsed.scheme in ('http', 'https'):
self.links.add(full_url)
def crawl_website(start_url: str, max_pages: int = 50) -> dict:
visited = set()
queue = deque([start_url])
sitemap = {}
while queue and len(visited) < max_pages:
current_url = queue.popleft()
if current_url in visited:
continue
try:
with urllib.request.urlopen(current_url, timeout=5) as response:
if response.status == 200:
content = response.read().decode('utf-8', errors='ignore')
parser = SitemapParser(current_url)
parser.feed(content)
visited.add(current_url)
sitemap[current_url] = list(parser.links)
for link in parser.links:
if link not in visited and link not in queue and len(visited) < max_pages:
queue.append(link)
except (urllib.error.URLError, ValueError, TimeoutError):
continue
return {"sitemap": sitemap, "total_pages": len(visited), "base_url": start_url}
if __name__ == "__main__":
result = crawl_website("https://example.com", max_pages=10)
print(json.dumps(result, indent=2))
Output
{
"sitemap": {
"https://example.com": ["https://example.com/about", "https://example.com/contact"],
"https://example.com/about": ["https://example.com/team"],
"https://example.com/contact": []
},
"total_pages": 3,
"base_url": "https://example.com"
}
How it works
This sitemap generator uses urllib.request for HTTP requests and HTMLParser to extract anchor tags, meaning zero external dependencies are required. The BFS approach with a deque ensures pages are crawled level by level, avoiding bias toward deep branches. URL normalization via urljoin resolves relative paths correctly, and the max_pages limit prevents runaway crawling on large sites. The code gracefully handles network errors by catching URLError and TimeoutError, skipping unresponsive pages without interrupting the crawl.
The parsed links are stored per URL in a dictionary, giving you a clear map of site structure. Because this uses only built-in modules, it runs in any Python environment without pip install commands, making it ideal for serverless functions or restricted execution environments.
Common mistakes
- Not handling relative URLs — use `urljoin` with the base URL, not string concatenation.
- Crawling external domains — always check `parsed.netloc` matches your target to avoid leaking traffic.
- Forgetting `errors='ignore'` in decode — real-world pages may have malformed UTF-8 bytes that crash parsing.
Variations
- Replace `HTMLParser` with `BeautifulSoup` for more robust parsing (requires `pip install beautifulsoup4`).
- Add robots.txt respect by checking `urllib.robotparser` before crawling each path.
Real-world use cases
- Audit your own website structure during a redesign to find orphaned or broken internal links.
- Pre-warm a caching layer by crawling all pages into a CDN before a product launch.
- Validate SEO redirects by comparing the sitemap structure against expected URL patterns.
Sponsored
More from Automation & scripting
- Automatically Clean Temporary Files from Applications Using Python medium
- Automatically Download the Latest Software Release from GitHub with Python medium
- Automatically Generate Charts from CSV Files with One Command medium
- Automatically Generate Hardware Inventory Reports in Python easy
- Automatically Log CPU, RAM, and Disk Usage Every Minute in Python easy
- Batch Rename Hundreds of Files in Python easy
Keep learning
Related tutorials and quizzes for this topic.