Build a Complete Website Sitemap Generator Without External Services

Crawl a website breadth-first using only Python's standard library and build a structured map of its internal links.

Medium · Python 3.9+ · Jun 28, 2026 · Automation & scripting

Python code

import json
import urllib.error
import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

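class SitemapParser(HTMLParser):
    """Collects same-site links from the <a href> tags of one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.base_netloc = urlparse(base_url).netloc
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href')
        if not href:
            return
        # Resolve relative hrefs against the page URL and drop any #fragment,
        # so that /page and /page#top count as one URL.
        parsed = urlparse(urljoin(self.base_url, href))._replace(fragment='')
        # Keep only http(s) links on the same host as the page being parsed.
        if parsed.scheme in ('http', 'https') and parsed.netloc == self.base_netloc:
            self.links.add(parsed.geturl())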

def crawl_website(start_url: str, max_pages: int = 50) -> dict:
    visited = set()        # pages fetched and parsed successfully
    seen = {start_url}     # every URL ever queued, so nothing is fetched twice
    queue = deque([start_url])
    sitemap = {}

    while queue and len(visited) < max_pages:
        current_url = queue.popleft()

        try:
            with urllib.request.urlopen(current_url, timeout=5) as response:
                if response.status != 200:
                    continue
                content = response.read().decode('utf-8', errors='ignore')
        except (urllib.error.URLError, OSError, ValueError):
            # URLError (including HTTPError) is an OSError subclass and is listed
            # for clarity; OSError also covers socket timeouts raised while
            # reading the body, and ValueError covers malformed URLs.
            continue

        parser = SitemapParser(current_url)
        parser.feed(content)
        visited.add(current_url)
        sitemap[current_url] = sorted(parser.links)

        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)

    return {"sitemap": sitemap, "total_pages": len(visited), "base_url": start_url}

if __name__ == "__main__":
    result = crawl_website("https://example.com", max_pages=10)
    print(json.dumps(result, indent=2))

Output

stdout (illustrative; the actual links depend on the site being crawled)
{
  "sitemap": {
    "https://example.com": ["https://example.com/about", "https://example.com/contact"],
    "https://example.com/about": ["https://example.com/team"],
    "https://example.com/contact": []
  },
  "total_pages": 3,
  "base_url": "https://example.com"
}

How it works

This sitemap generator uses urllib.request for HTTP requests and HTMLParser to extract anchor tags, so it needs no external dependencies. The crawl is breadth-first: a deque serves as a FIFO queue, so pages are visited level by level, starting with those closest to the start URL, instead of following one deep branch first. urljoin resolves relative hrefs against the URL of the page they appear on, and the max_pages limit caps the number of pages fetched so the crawl cannot run away on a large site. Pages that fail to load, whether from a network error, an HTTP error status, or a timeout, are skipped without interrupting the crawl.
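
To see the level-by-level order without touching the network, the same queue logic can be run against an in-memory link graph. This is only a minimal sketch: `bfs_order` and the `SITE` dictionary are hypothetical stand-ins for the crawler and a real site, not part of the code above.

```python
from collections import deque

# Hypothetical site: each page maps to the pages it links to.
SITE = {
    "/": ["/about", "/blog"],
    "/about": ["/team"],
    "/blog": ["/blog/post-1", "/"],
    "/team": [],
    "/blog/post-1": [],
}

def bfs_order(graph, start, max_pages=50):
    """Return pages in the order a breadth-first crawl would visit them."""
    order = []
    seen = {start}            # everything that has ever been queued
    queue = deque([start])
    while queue and len(order) < max_pages:
        page = queue.popleft()        # FIFO: oldest queued page first
        order.append(page)
        for link in graph.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

print(bfs_order(SITE, "/"))
# ['/', '/about', '/blog', '/team', '/blog/post-1']
print(bfs_order(SITE, "/", max_pages=3))
# ['/', '/about', '/blog']
```

Both first-level pages come out before anything they link to, and the link from /blog back to / is ignored because / has already been queued.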

The parsed links are stored per URL in a dictionary, giving you a clear map of site structure. Because the script uses only built-in modules, it runs in any Python 3.9+ environment without pip install commands, which makes it a good fit for serverless functions or restricted execution environments.
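
If you need a sitemap.xml file rather than JSON, the dictionary can be serialized with the standard library's xml.etree.ElementTree. The sketch below assumes the result shape returned by crawl_website; the helper name `to_sitemap_xml` is hypothetical, and it emits only `<loc>` entries, without optional fields such as `<lastmod>`.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def to_sitemap_xml(result: dict) -> str:
    """Serialize the crawled page URLs as a sitemaps.org-style <urlset>."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page_url in sorted(result["sitemap"]):
        url_el = ET.SubElement(urlset, "url")
        ET.SubElement(url_el, "loc").text = page_url
    return ET.tostring(urlset, encoding="unicode")

# Sample data in the same shape crawl_website returns.
sample = {
    "sitemap": {
        "https://example.com": ["https://example.com/about"],
        "https://example.com/about": [],
    },
    "total_pages": 2,
    "base_url": "https://example.com",
}
print(to_sitemap_xml(sample))
```

Only the dictionary keys are written, because those are the pages that were actually fetched; link targets that were never crawled are left out.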

Common mistakes

  • Not handling relative URLs: resolve each `href` with `urljoin` against the page URL instead of concatenating strings.
  • Crawling external domains: compare `parsed.netloc` with the start URL's host before queueing a link, so the crawl stays on your own site.
  • Forgetting `errors='ignore'` in decode: real-world pages often contain bytes that are not valid UTF-8, and a strict decode raises `UnicodeDecodeError`.
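
The three mistakes above can be reproduced in a few lines. This is a small illustration rather than part of the crawler; `page_url`, `is_internal`, and the sample bytes are made up for the example.

```python
from urllib.parse import urljoin, urlparse

page_url = "https://example.com/blog/post-1"   # hypothetical page being parsed

# 1. Relative URLs: concatenation yields a broken URL, urljoin resolves it.
href = "../about"
concatenated = page_url + href
resolved = urljoin(page_url, href)
print(concatenated)   # https://example.com/blog/post-1../about
print(resolved)       # https://example.com/about

# 2. External domains: compare hosts before queueing a link.
def is_internal(link: str, start_url: str) -> bool:
    return urlparse(link).netloc == urlparse(start_url).netloc

print(is_internal("https://example.com/team", page_url))      # True
print(is_internal("https://other.example.org/", page_url))    # False

# 3. Malformed bytes: a strict decode raises, errors='ignore' drops the bad byte.
raw = b"caf\xe9 menu"   # a Latin-1 encoded e-acute is not valid UTF-8
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
lenient = raw.decode("utf-8", errors="ignore")
print(strict_failed, repr(lenient))   # True 'caf menu'
```

Note that `errors='ignore'` silently discards the offending byte, which is acceptable for link extraction but would lose characters if you also wanted the page text.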

Variations

  1. Replace `HTMLParser` with `BeautifulSoup` for more robust parsing (requires `pip install beautifulsoup4`).
  2. Respect robots.txt by loading it with `urllib.robotparser.RobotFileParser` and calling `can_fetch` on each URL before requesting it.
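
The robots.txt variation might look roughly like the sketch below. To keep it runnable offline, it parses a hypothetical robots.txt string; a real crawler would instead call `set_url` with the site's robots.txt address followed by `read`. The `allowed` helper and the "sitemap-bot" user agent are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here instead of a network fetch.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

def allowed(url: str, user_agent: str = "sitemap-bot") -> bool:
    """Return True if robots.txt permits this user agent to fetch the URL."""
    return rp.can_fetch(user_agent, url)

print(allowed("https://example.com/about"))            # True
print(allowed("https://example.com/private/report"))   # False
```

In the crawler, a check like this would sit just before the `urlopen` call, so disallowed URLs are skipped without being requested.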

Real-world use cases

  • Audit your own website structure during a redesign to find broken internal links and pages that nothing links to.
  • Pre-warm a cache or CDN before a product launch by requesting every page the crawl discovers.
  • Check URL structure after an SEO migration by comparing the crawled URLs against the patterns you expect.
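
As one way to approach the audit use case, the result dictionary can be compared against a list of pages you expect to exist. A crawler only finds pages that are linked from somewhere, so orphan detection needs that outside list. The sketch below assumes the crawl_website result shape; `audit_links` and `known_urls` (for example, exported from a CMS) are hypothetical.

```python
def audit_links(result: dict, known_urls: list) -> dict:
    """Compare crawl results with a list of pages that are expected to exist."""
    sitemap = result["sitemap"]
    crawled = set(sitemap)
    # Every link target found on a crawled page, ignoring self-links.
    linked = {
        link
        for page, links in sitemap.items()
        for link in links
        if link != page
    }
    return {
        # Known pages that no crawled page links to (the start page is exempt).
        "orphaned": sorted(set(known_urls) - linked - {result["base_url"]}),
        # Link targets that never made it into the sitemap: the request failed,
        # returned a non-200 status, or the crawl hit max_pages first.
        "not_crawled": sorted(linked - crawled),
    }

sample = {
    "sitemap": {
        "https://example.com": ["https://example.com/about", "https://example.com/contact"],
        "https://example.com/about": ["https://example.com/team"],
        "https://example.com/contact": [],
    },
    "total_pages": 3,
    "base_url": "https://example.com",
}
known = [
    "https://example.com",
    "https://example.com/about",
    "https://example.com/contact",
    "https://example.com/team",
    "https://example.com/old-pricing",
]
print(audit_links(sample, known))
```

Entries under "not_crawled" are candidates for broken links, not proof of them, since the page limit alone can leave valid pages unvisited.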
