Find All Redirects on a Website in Python
Crawl a website from a starting URL, follow links within the same domain, and detect every HTTP redirect (301, 302, 303, 307, 308) using requests with redirects disabled.
pip install requests
Python code
import requests
from urllib.parse import urljoin, urlparse
from collections import deque


def find_redirects(start_url, max_pages=50):
    visited = set()
    redirects = {}  # source URL -> resolved redirect target
    queue = deque([start_url])

    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        try:
            # Do not follow redirects, so 3xx responses can be inspected directly.
            response = requests.get(url, allow_redirects=False, timeout=10)
            visited.add(url)
            if response.status_code in (301, 302, 303, 307, 308):
                # Location may be relative; resolve it against the requested URL.
                redirect_url = response.headers.get('Location', '')
                redirect_url = urljoin(url, redirect_url)
                redirects[url] = redirect_url
                # Keep crawling only when the target stays on the starting host.
                if urlparse(redirect_url).netloc == urlparse(start_url).netloc:
                    queue.append(redirect_url)
            elif response.status_code == 200:
                # response.links holds entries parsed from the HTTP Link header,
                # not <a> tags in the page body.
                for link in response.links.values():
                    full_url = urljoin(url, link['url'])
                    if urlparse(full_url).netloc == urlparse(start_url).netloc:
                        queue.append(full_url)
        except requests.RequestException:
            # Skip URLs that time out or fail to connect.
            continue

    return redirects


if __name__ == "__main__":
    start = "https://httpbin.org/redirect-to?url=https%3A%2F%2Fexample.com"
    results = find_redirects(start, max_pages=10)
    for source, target in results.items():
        print(f"{source} -> {target}")
Output
https://httpbin.org/redirect-to?url=https%3A%2F%2Fexample.com -> https://example.com
How it works
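The find_redirects function runs a breadth-first crawl from start_url and stops once max_pages URLs have been fetched. Passing allow_redirects=False stops requests from following redirects automatically, so the code sees each 3xx response and can read its Location header. urljoin resolves relative Location values and links against the current URL, and a URL is queued only when its netloc matches that of start_url, which keeps the crawl on the starting host. Only HTTP 200 responses are scanned for further links, and the scan uses response.links, which contains links parsed from the HTTP Link response header rather than <a> tags in the page body, so a site that exposes links only in its HTML will not be crawled beyond the redirect chain of the start URL. Any other status, along with any request error, is silently skipped. The function returns a dict mapping each redirecting URL to its resolved target.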
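The redirect-handling step can be pulled out into a small pure function, which makes the Location resolution and the same-host check easy to try without any network access. This is a minimal sketch: `redirect_target` and `same_host` are hypothetical helper names, not part of requests, and the headers argument is a plain dict here (the real `response.headers` is case-insensitive).

```python
from urllib.parse import urljoin, urlparse

REDIRECT_CODES = (301, 302, 303, 307, 308)


def redirect_target(url, status_code, headers):
    """Return the absolute redirect target, or None if this is not a redirect."""
    if status_code not in REDIRECT_CODES:
        return None
    # Location may be relative, so resolve it against the URL that was requested.
    return urljoin(url, headers.get("Location", ""))


def same_host(url, start_url):
    """True when both URLs share the same network location (host and port)."""
    return urlparse(url).netloc == urlparse(start_url).netloc


target = redirect_target("https://example.com/old/page", 301, {"Location": "../new"})
print(target)                                      # https://example.com/new
print(same_host(target, "https://example.com/"))   # True
```

Note that comparing netloc treats `example.com` and `www.example.com` as different hosts, so a redirect between them is recorded but not followed.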
Common mistakes
- Forgetting to set `allow_redirects=False`, so requests follows redirects automatically and the crawler only sees the status code of the final response.
- Not using `urljoin` to resolve relative URLs found in `Location` headers or page links.
- Assuming `response.links` extracts links from the page body. It only parses the HTTP `Link` response header, so `<a>` and `<link>` tags in the HTML are not seen.
- Failing to limit crawling with `max_pages` or a visited set, leading to infinite loops or large uncontrolled crawls.
Variations
- Use `BeautifulSoup` to parse the response HTML for `<a>` tags and extract all links instead of relying on `response.links`.
- Add support for crawling links from `<iframe>`, `<frame>`, or `sitemap.xml` files.
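As a rough illustration of the first variation, the sketch below collects `<a href>` values from a page body. It uses the standard library's `html.parser` rather than `BeautifulSoup` to stay dependency-free; `AnchorCollector` and `extract_links` are hypothetical names introduced for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collect href values from <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Skip anchors with a missing or empty href.
            if name == "href" and value:
                self.hrefs.append(value)


def extract_links(base_url, html):
    """Return absolute URLs for every <a href> found in the HTML text."""
    parser = AnchorCollector()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.hrefs]


sample = '<a href="/about">About</a> <a href="https://example.org/x">X</a>'
print(extract_links("https://example.com/docs/", sample))
# ['https://example.com/about', 'https://example.org/x']
```

Inside `find_redirects`, the 200 branch could then iterate over `extract_links(url, response.text)` and apply the same netloc check before queueing each URL.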
Real-world use cases
- Auditing a website for broken or outdated redirect chains before a migration or SEO overhaul.
- Monitoring a site's redirect structure to ensure no accidental redirect loops are introduced during deployment.
- Enumerating all URL redirects in a web application for security penetration testing (e.g., open redirect detection).
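For audits like these, the dict returned by `find_redirects` can be post-processed offline. The following is a sketch under assumptions: `follow_chain` and `offsite_redirects` are hypothetical helpers, and the sample mapping is made-up data, not crawl output.

```python
from urllib.parse import urlparse


def follow_chain(redirects, source, max_hops=10):
    """Walk the mapping from source; return (chain, is_loop)."""
    chain = [source]
    seen = {source}
    current = source
    # Stop when the chain ends or after at most max_hops hops.
    while current in redirects and len(chain) <= max_hops:
        current = redirects[current]
        chain.append(current)
        if current in seen:
            return chain, True  # revisited a URL: redirect loop
        seen.add(current)
    return chain, False


def offsite_redirects(redirects):
    """Return the redirects whose target is on a different host than the source."""
    return {
        source: target
        for source, target in redirects.items()
        if urlparse(source).netloc != urlparse(target).netloc
    }


# Made-up sample data for illustration only.
sample = {
    "https://example.com/a": "https://example.com/b",
    "https://example.com/b": "https://example.com/c",
    "https://example.com/x": "https://example.com/y",
    "https://example.com/y": "https://example.com/x",
    "https://example.com/out": "https://other.org/",
}
print(follow_chain(sample, "https://example.com/a"))
print(follow_chain(sample, "https://example.com/x"))
print(offsite_redirects(sample))
```

Because `find_redirects` only queues same-host targets, chains that leave the starting host end at the first off-site URL. Those are the entries `offsite_redirects` reports, and they are a starting point for manual open-redirect review rather than proof of a vulnerability.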