Find Broken Image References Across a Website in Python
Crawl the internal pages of a website, collect every image source URL, then check each one with a HEAD request and report any that return an HTTP 4xx or 5xx status or fail to connect.
pip install requests beautifulsoup4
Python code
import requests
from urllib.parse import urldefrag, urljoin, urlparse
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor, as_completed


def find_image_urls(base_url, max_pages=50):
    """Crawl up to max_pages internal pages and yield same-host image URLs."""
    base_host = urlparse(base_url).netloc
    visited, to_visit = set(), {base_url}
    while to_visit and len(visited) < max_pages:
        url = to_visit.pop()
        if url in visited:
            continue
        visited.add(url)
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue  # unreachable page: skip it and keep crawling
        if 'text/html' not in resp.headers.get('Content-Type', ''):
            continue
        soup = BeautifulSoup(resp.text, 'html.parser')
        for tag in soup.find_all(['img', 'source']):
            # Prefer src; otherwise take the first srcset candidate
            # ("a.png 1x, b.png 2x" -> "a.png"). A tag with neither is skipped.
            src = tag.get('src')
            if not src:
                candidates = tag.get('srcset', '').split(',')[0].split()
                src = candidates[0] if candidates else None
            if src:
                full_url = urljoin(url, src)
                if urlparse(full_url).netloc == base_host:
                    yield full_url
        for a in soup.find_all('a', href=True):
            # Drop #fragments so the same page is not queued twice.
            full_url = urldefrag(urljoin(url, a['href'])).url
            if urlparse(full_url).netloc == base_host and full_url not in visited:
                to_visit.add(full_url)


def check_images(base_url):
    """HEAD every discovered image and return (url, status) for each failure."""
    image_urls = set(find_image_urls(base_url))
    broken = []
    with ThreadPoolExecutor(max_workers=10) as executor:
        # requests.head does not follow redirects by default, so ask for it;
        # otherwise a 301 that leads to a 404 would look healthy.
        futures = {
            executor.submit(requests.head, url, timeout=5, allow_redirects=True): url
            for url in image_urls
        }
        for future in as_completed(futures):
            url = futures[future]
            try:
                resp = future.result()
                if resp.status_code >= 400:
                    broken.append((url, resp.status_code))
            except requests.RequestException:
                broken.append((url, 'Connection error'))
    # Numeric status codes first, in ascending order; connection errors last.
    broken.sort(key=lambda x: x[1] if isinstance(x[1], int) else 999)
    return broken


if __name__ == '__main__':
    site = 'https://example.com'
    broken_images = check_images(site)
    for url, status in broken_images:
        print(f'{status} {url}')
Output
403 https://example.com/assets/forbidden.jpg
404 https://example.com/images/missing.png
Connection error https://example.com/img/unreachable.svg
How it works
The crawler parses each HTML page it visits with BeautifulSoup and collects the URLs in <img> and <source> tags, resolving relative paths against the page URL with urljoin. The max_pages=50 argument caps the crawl so it cannot run unbounded or put excessive load on the server. Image checks run in a ThreadPoolExecutor with max_workers=10, so up to ten HEAD requests are in flight at once instead of one after another. Any response with a status of 400 or higher, and any request that raises a requests exception, is recorded as broken. The report is sorted by status code in ascending order, with connection errors listed last.
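The script looks at only the first entry of a srcset attribute. As a rough sketch of how every candidate could be collected instead, the hypothetical helper `srcset_urls` below (not part of the original script) splits the attribute on commas and resolves each URL with urljoin. It assumes the URLs themselves contain no commas, which does not hold for every site.

```python
from urllib.parse import urljoin


def srcset_urls(srcset, page_url):
    """Return an absolute URL for each candidate in a srcset attribute value.

    Naive parsing: candidates are assumed to be separated by commas and the
    URL is assumed to be the first whitespace-separated token of each one.
    """
    urls = []
    for candidate in srcset.split(','):
        parts = candidate.split()
        if parts:  # skip empty candidates such as a trailing comma
            urls.append(urljoin(page_url, parts[0]))
    return urls


if __name__ == '__main__':
    for u in srcset_urls('a.png 1x, /img/b.png 2x', 'https://example.com/blog/post'):
        print(u)
```

Each returned URL could then be passed through the same filtering and HEAD check as the src values.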
Common mistakes
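- Forgetting to call `urljoin`: a relative path such as `img/logo.png` is not a valid URL on its own, so it cannot be requested until it is resolved against the URL of the page it appeared on.
- Not filtering image URLs to the same domain: the script would also check externally hosted images, which takes longer and can report failures on servers you do not control.
- Using `requests.get` instead of `requests.head` for image checks: downloading every image body is slow and unnecessary when only the status code matters.
- Omitting the `Content-Type` check: the crawler would try to parse PDFs, images, and other non-HTML responses as HTML, wasting time and possibly picking up garbage URLs.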
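To make the first two points concrete, here is a small sketch using only the standard library. The helper names `resolve` and `same_host` are illustrative and not taken from the script. It shows how urljoin resolves relative paths, and why comparing parsed hostnames is stricter than a plain string-prefix test, which a look-alike host can pass.

```python
from urllib.parse import urljoin, urlparse


def resolve(page_url, src):
    """Resolve an image path against the page it was found on."""
    return urljoin(page_url, src)


def same_host(url, base_url):
    """True only when both URLs have a hostname and the hostnames match."""
    host = urlparse(url).hostname
    return host is not None and host == urlparse(base_url).hostname


if __name__ == '__main__':
    page = 'https://example.com/blog/post/'
    print(resolve(page, '../img/a.png'))
    print(resolve(page, '/logo.png'))

    lookalike = 'https://example.com.evil.org/a.png'
    print(lookalike.startswith('https://example.com'))  # prefix test passes
    print(same_host(lookalike, 'https://example.com'))  # hostname test fails
```

Whether subdomains such as a CDN host should count as the same site is a policy decision this sketch does not make; it treats them as external.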
Variations
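- Replace `requests.head` with `requests.get(url, stream=True)` and read the body to confirm the image is not truncated, for example by comparing the number of bytes received with the `Content-Length` header.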
- Add a progress indicator using `tqdm` when checking many images.
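One possible shape for the first variation is sketched below. The function `received_full_body` is a hypothetical helper, not part of the script; it accepts any response object with requests-style `headers` and `iter_content`, and it deliberately gives no verdict when the server sends no usable `Content-Length` or uses `Content-Encoding`, because the decoded size would not match the header in that case.

```python
def received_full_body(resp):
    """Compare the bytes actually received with the declared Content-Length.

    Returns True or False when the comparison is meaningful, and None when
    it is not (missing or malformed Content-Length, or a compressed body).
    """
    declared = resp.headers.get('Content-Length')
    if declared is None or not declared.isdigit():
        return None
    if resp.headers.get('Content-Encoding'):
        return None  # iter_content yields decoded bytes, so sizes differ
    received = sum(len(chunk) for chunk in resp.iter_content(chunk_size=8192))
    return received == int(declared)
```

With requests this might be used as `with requests.get(url, stream=True, timeout=10) as resp: ok = received_full_body(resp)`. Depending on the installed versions, a body that ends early may raise a `requests.RequestException` instead of returning short, so the caller should treat that exception as a broken image too.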
Real-world use cases
- Scheduled CI job that catches broken images before they reach production after a site migration.
- Content audit script that scans a marketing website for missing product images before a campaign launch.
- Onboarding check for new clients — verify all uploaded assets appear correctly on their staging site.
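For the CI use case, the job needs a non-zero exit code when anything is broken. A minimal sketch, assuming the list of `(url, status)` tuples returned by `check_images`; the function name `report_and_exit_code` is made up for this example.

```python
def report_and_exit_code(broken):
    """Print one line per broken image and return a process exit code.

    `broken` is a list of (url, status) tuples, where status is an int
    status code or the string 'Connection error'.
    """
    for url, status in broken:
        print(f'{status} {url}')
    return 1 if broken else 0


if __name__ == '__main__':
    # Hard-coded sample data so the sketch runs without network access.
    sample = [('https://example.com/images/missing.png', 404)]
    raise SystemExit(report_and_exit_code(sample))
```

In the real script the last line would become `raise SystemExit(report_and_exit_code(check_images(site)))`, so most CI systems would mark the run as failed whenever the list is non-empty.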