Build a Complete Website Sitemap Generator Without External Services

Crawl a website recursively using only Python's standard library to generate a structured sitemap of internal links.

Medium Python 3.9+ Jun 28, 2026 Automation & scripting 2 views 0 copies

sitemap web-crawler html-parser automation urllib

Python code

54 lines

Python 3.9+

import json
from urllib.parse import urlparse, urljoin
from collections import deque
import urllib.request
import urllib.error
import re
from html.parser import HTMLParser

class SitemapParser(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()
        
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            attrs_dict = dict(attrs)
            href = attrs_dict.get('href')
            if href:
                full_url = urljoin(self.base_url, href)
                parsed = urlparse(full_url)
                if parsed.netloc and parsed.scheme in ('http', 'https'):
                    self.links.add(full_url)

def crawl_website(start_url: str, max_pages: int = 50) -> dict:
    visited = set()
    queue = deque([start_url])
    sitemap = {}
    
    while queue and len(visited) < max_pages:
        current_url = queue.popleft()
        if current_url in visited:
            continue
            
        try:
            with urllib.request.urlopen(current_url, timeout=5) as response:
                if response.status == 200:
                    content = response.read().decode('utf-8', errors='ignore')
                    parser = SitemapParser(current_url)
                    parser.feed(content)
                    visited.add(current_url)
                    sitemap[current_url] = list(parser.links)
                    
                    for link in parser.links:
                        if link not in visited and link not in queue and len(visited) < max_pages:
                            queue.append(link)
        except (urllib.error.URLError, ValueError, TimeoutError):
            continue
    
    return {"sitemap": sitemap, "total_pages": len(visited), "base_url": start_url}

if __name__ == "__main__":
    result = crawl_website("https://example.com", max_pages=10)
    print(json.dumps(result, indent=2))

Output

stdout

{
  "sitemap": {
    "https://example.com": ["https://example.com/about", "https://example.com/contact"],
    "https://example.com/about": ["https://example.com/team"],
    "https://example.com/contact": []
  },
  "total_pages": 3,
  "base_url": "https://example.com"
}

How it works

This sitemap generator uses urllib.request for HTTP requests and HTMLParser to extract anchor tags, meaning zero external dependencies are required. The BFS approach with a deque ensures pages are crawled level by level, avoiding bias toward deep branches. URL normalization via urljoin resolves relative paths correctly, and the max_pages limit prevents runaway crawling on large sites. The code gracefully handles network errors by catching URLError and TimeoutError, skipping unresponsive pages without interrupting the crawl.

The parsed links are stored per URL in a dictionary, giving you a clear map of site structure. Because this uses only built-in modules, it runs in any Python environment without pip install commands, making it ideal for serverless functions or restricted execution environments.

Common mistakes

Not handling relative URLs — use `urljoin` with the base URL, not string concatenation.
Crawling external domains — always check `parsed.netloc` matches your target to avoid leaking traffic.
Forgetting `errors='ignore'` in decode — real-world pages may have malformed UTF-8 bytes that crash parsing.

Variations

Replace `HTMLParser` with `BeautifulSoup` for more robust parsing (requires `pip install beautifulsoup4`).
Add robots.txt respect by checking `urllib.robotparser` before crawling each path.

Real-world use cases

Audit your own website structure during a redesign to find orphaned or broken internal links.
Pre-warm a caching layer by crawling all pages into a CDN before a product launch.
Validate SEO redirects by comparing the sitemap structure against expected URL patterns.

Build a Complete Website Sitemap Generator Without External Services

Python code

Output

How it works

Common mistakes

Variations

Real-world use cases

More from Automation & scripting

Tutorials

Quizzes

Python code

Output

How it works

Common mistakes

Variations

Real-world use cases

More from Automation & scripting

Keep learning

Tutorials

Quizzes