How to Create a Link Graph Visualization for Any Website in Python
A Python script that crawls a website's internal links, builds a directed graph of parent-child URL relationships, and prints the graph to the console.
pip install requests beautifulsoup4
Python code
import sys
from collections import defaultdict, deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def get_links(url, max_links=20):
    """Return up to max_links internal links found on the page at url."""
    try:
        response = requests.get(url, timeout=5)
    except requests.RequestException as e:
        print(f"Error: {e}")
        return []

    soup = BeautifulSoup(response.text, 'html.parser')
    host = urlparse(url).netloc
    links = []  # a list keeps links in document order
    for a_tag in soup.find_all('a', href=True):
        # Resolve against the page URL so relative hrefs such as
        # "team.html" land in the right directory.
        full_url = urljoin(url, a_tag['href'])
        # Keep only same-host links, without duplicates.
        if urlparse(full_url).netloc == host and full_url not in links:
            links.append(full_url)
        if len(links) >= max_links:
            break
    return links


def build_graph(start_url, max_nodes=15):
    """Breadth-first crawl from start_url; returns {parent: [children]}."""
    graph = defaultdict(list)
    visited = set()
    to_visit = deque([start_url])
    while to_visit and len(visited) < max_nodes:
        current = to_visit.popleft()
        if current in visited:
            continue
        visited.add(current)
        print(f"Crawling: {current}")
        # Follow at most 5 links per page to keep the crawl small.
        for link in get_links(current)[:5]:
            # Links to already-visited pages are not recorded as edges.
            if link not in visited:
                graph[current].append(link)
                to_visit.append(link)
    return graph


def print_graph(graph):
    print("\nLink Graph (parent -> child):")
    for parent, children in graph.items():
        for child in children:
            print(f" {parent} -> {child}")


if __name__ == "__main__":
    start_url = sys.argv[1] if len(sys.argv) > 1 else "https://example.com"
    graph = build_graph(start_url)
    print_graph(graph)
Output (illustrative; the actual URLs depend on the site being crawled)
Crawling: https://example.com
Crawling: https://example.com/page1
Crawling: https://example.com/page2
Crawling: https://example.com/subpage

Link Graph (parent -> child):
 https://example.com -> https://example.com/page1
 https://example.com -> https://example.com/page2
 https://example.com/page1 -> https://example.com/subpage
How it works
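The script uses requests to fetch each page and BeautifulSoup to parse its HTML. get_links resolves every href to an absolute URL with urljoin and keeps only internal links, meaning those whose urlparse(...).netloc matches the netloc of the page being crawled. build_graph performs a breadth-first crawl from the start URL, tracking visited URLs in a set to avoid loops; it stops after max_nodes pages (15 by default) and follows at most 5 links per page. Each followed link is stored as a child of the page it was found on in the graph dictionary. Links to pages that were already visited are skipped, so the printed parent -> child pairs are a partial view of the site's navigational structure rather than a complete link map.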
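The internal-link check described above can be tried in isolation, without any network access. The following is a minimal sketch that uses only the standard library; the helper name internal_links and the sample hrefs are made up for illustration and are not part of the script.

```python
from urllib.parse import urljoin, urlparse


def internal_links(page_url, hrefs):
    """Resolve hrefs against page_url and keep only same-host URLs."""
    host = urlparse(page_url).netloc
    result = []
    for href in hrefs:
        full_url = urljoin(page_url, href)  # handles relative and absolute hrefs
        if urlparse(full_url).netloc == host and full_url not in result:
            result.append(full_url)
    return result


# Hypothetical hrefs as they might appear on a page under /docs/.
sample_hrefs = [
    "/about",                       # root-relative
    "team.html",                    # relative to the current directory
    "https://other.org/x",          # external, dropped
    "mailto:someone@example.com",   # no netloc, dropped
    "https://example.com/contact",  # absolute, same host
]

found = internal_links("https://example.com/docs/index.html", sample_hrefs)
for link in found:
    print(link)
# prints:
# https://example.com/about
# https://example.com/docs/team.html
# https://example.com/contact
```

Note that the comparison is on the exact netloc string, so under this check www.example.com and example.com count as different hosts.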
Common mistakes
- Not filtering internal links properly, causing external domains to pollute the graph
- Crawling so fast that the target server is overwhelmed or blocks the crawler (no rate limiting)
- Ignoring duplicate URLs or URLs with trailing slashes that resolve to the same page
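Two of the mistakes above, duplicate URLs and missing rate limiting, can be addressed with small helpers. The sketch below is one possible approach, not part of the original script: normalize_url and Throttle are hypothetical names, and treating a URL with and without a trailing slash as the same page is an assumption that holds for many sites but not all.

```python
import time
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Lowercase scheme and host, drop the fragment and any trailing slash."""
    parts = urlsplit(url)
    # Assumption: "/page/" and "/page" are the same resource on this site.
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self._last = None

    def wait(self):
        # Sleep only for the part of the interval that has not yet elapsed.
        if self._last is not None:
            remaining = self.min_interval - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


print(normalize_url("https://Example.com/page/#top"))  # https://example.com/page
```

In the crawler, normalize_url could be applied to each link before it is added to visited, and a Throttle instance's wait() could be called just before each requests.get.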
Variations
- Use NetworkX to visualize the graph as an image (draw_networkx)
- Export the graph to a JSON or CSV file for further analysis
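Both variations can be sketched on top of the dictionary that build_graph returns. The function names below are illustrative, and the sample graph is made-up data in the same shape. The export helpers use only the standard library; draw_graph assumes that networkx and matplotlib are installed (pip install networkx matplotlib) and imports them lazily so the export functions work without them.

```python
import csv
import io
import json


def graph_to_edges(graph):
    """Flatten {parent: [children]} into a list of (parent, child) pairs."""
    return [(parent, child) for parent, children in graph.items() for child in children]


def graph_to_json(graph):
    """Serialize the adjacency mapping as a JSON string."""
    return json.dumps({parent: list(children) for parent, children in graph.items()}, indent=2)


def edges_to_csv(edges):
    """Serialize edges as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["parent", "child"])
    writer.writerows(edges)
    return buffer.getvalue()


def draw_graph(edges, path="link_graph.png"):
    """Render the edges as a PNG (requires networkx and matplotlib)."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt
    import networkx as nx

    g = nx.DiGraph()
    g.add_edges_from(edges)
    pos = nx.spring_layout(g, seed=42)  # fixed seed for a repeatable layout
    fig, ax = plt.subplots(figsize=(10, 8))
    nx.draw_networkx(g, pos, ax=ax, node_size=300, font_size=6, arrows=True)
    ax.set_axis_off()
    fig.savefig(path, dpi=150)
    plt.close(fig)


# Made-up sample in the shape returned by build_graph.
sample = {
    "https://example.com": ["https://example.com/page1", "https://example.com/page2"],
    "https://example.com/page1": ["https://example.com/subpage"],
}
edges = graph_to_edges(sample)
print(edges_to_csv(edges))
```

To write files instead of strings, the JSON or CSV text can be passed to open(path, "w").write(...); calling draw_graph(edges) saves link_graph.png in the current directory.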
Real-world use cases
- Auditing a large website's internal linking structure for SEO improvements.
- Mapping a wiki or documentation site to understand content connectivity.
- Building a crawler that discovers broken links by checking if child pages return 404.
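The broken-link use case can be sketched as a pass over the finished graph. This is an illustration under stated assumptions rather than part of the original script: find_broken_links and fetch_status are hypothetical names, and fetch_status uses the standard library's urllib with a HEAD request instead of requests so that the example has no extra dependency. Some servers reject HEAD requests, in which case a GET would be needed.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=5):
    """Return the HTTP status code for url, or None if the request fails."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # 4xx and 5xx responses arrive as exceptions
    except OSError:
        return None  # DNS failure, refused connection, timeout, ...


def find_broken_links(graph, get_status=fetch_status):
    """Return (parent, child) pairs whose child URL responds with 404."""
    status_cache = {}  # check each child URL only once
    broken = []
    for parent, children in graph.items():
        for child in children:
            if child not in status_cache:
                status_cache[child] = get_status(child)
            if status_cache[child] == 404:
                broken.append((parent, child))
    return broken


# Offline demonstration with made-up status codes instead of real requests.
fake_statuses = {"https://example.com/page1": 200, "https://example.com/gone": 404}
sample = {"https://example.com": ["https://example.com/page1", "https://example.com/gone"]}
print(find_broken_links(sample, get_status=fake_statuses.get))
# prints [('https://example.com', 'https://example.com/gone')]
```

Passing the status function as a parameter keeps the check testable without network access; calling find_broken_links(graph) with the default performs real requests.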