Find Broken Image References Across a Website in Python

Crawl the internal pages of a website, collect every image source URL, then check each one with a HEAD request and report any that return an HTTP 4xx or 5xx status or fail to connect.

Medium | Python 3.9+ | Jun 28, 2026 | Automation & scripting

Requires third-party packages — install first
pip install requests beautifulsoup4

Python code

import requests
from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor, as_completed

def find_all_links(base_url, max_pages=50):
    """Crawl up to max_pages internal pages and yield the same-site image URLs found on them."""
    visited, to_visit = set(), {base_url}
    while to_visit and len(visited) < max_pages:
        url = to_visit.pop()
        if url in visited:
            continue
        visited.add(url)
        try:
            resp = requests.get(url, timeout=10)
            if 'text/html' not in resp.headers.get('Content-Type', ''):
                continue
            soup = BeautifulSoup(resp.text, 'html.parser')
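            for tag in soup.find_all(['img', 'source']):
                # Prefer src; otherwise take the URL of the first srcset candidate.
                src = tag.get('src')
                if not src:
                    candidates = tag.get('srcset', '').split(',')[0].split()
                    src = candidates[0] if candidates else None
                if src:
                    full_url = urljoin(url, src)
                    # Compare hosts rather than string prefixes, so look-alike domains do not match.
                    if urlparse(full_url).netloc == urlparse(base_url).netloc:
                        yield full_url
            for a in soup.find_all('a', href=True):
                # Drop the fragment so page#a and page#b count as one page.
                full_url = urljoin(url, a['href']).split('#')[0]
                if urlparse(full_url).netloc == urlparse(base_url).netloc and full_url not in visited:
                    to_visit.add(full_url)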
        except requests.RequestException:
            pass

def check_images(base_url):
    image_urls = set(find_all_links(base_url))
    broken = []
    with ThreadPoolExecutor(max_workers=10) as executor:
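        # requests.head does not follow redirects by default; enable it so the final target is checked.
        futures = {
            executor.submit(requests.head, url, timeout=5, allow_redirects=True): url
            for url in image_urls
        }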
        for future in as_completed(futures):
            url = futures[future]
            try:
                resp = future.result()
                if resp.status_code >= 400:
                    broken.append((url, resp.status_code))
            except requests.RequestException:
                broken.append((url, 'Connection error'))
    # Numeric status codes first, in ascending order; connection errors last.
    broken.sort(key=lambda x: x[1] if isinstance(x[1], int) else 999)
    return broken

if __name__ == '__main__':
    site = 'https://example.com'
    broken_images = check_images(site)
    for url, status in broken_images:
        print(f'{status} {url}')

Output

stdout
403 https://example.com/assets/forbidden.jpg
404 https://example.com/images/missing.png
Connection error https://example.com/img/unreachable.svg

How it works

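The crawler parses each HTML page it visits with BeautifulSoup and collects the URLs referenced by <img> and <source> tags. Each URL is made absolute with urljoin, using the page it was found on as the base, and only images on the same site are kept. To avoid overloading the server or crawling without bound, max_pages=50 caps the number of pages visited. Image checks are parallelised with ThreadPoolExecutor, so up to ten HEAD requests run concurrently instead of one at a time. Any response with a status of 400 or higher, or any request that fails outright, is recorded as broken. Results are sorted by status code, with connection errors last.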
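
The script reads only the first entry of a `srcset` attribute. As a rough sketch of how every candidate could be collected instead, the hypothetical helper `parse_srcset` below (not part of the original script) splits the attribute on commas and keeps the URL from each candidate. It assumes simple values; URLs that themselves contain commas, such as data URIs, are not handled.

```python
def parse_srcset(value):
    """Return every URL listed in a srcset attribute value, in order."""
    urls = []
    for candidate in value.split(','):
        parts = candidate.split()
        if parts:  # skip empty candidates, e.g. after a trailing comma
            urls.append(parts[0])  # the URL comes before the width/density descriptor
    return urls


print(parse_srcset('small.png 480w, large.png 1024w'))
# ['small.png', 'large.png']
```

Each returned URL would still need to go through urljoin before it can be checked.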

Common mistakes

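  • Forgetting to call `urljoin`: relative paths such as `img/logo.png` are not valid request URLs, and they must be resolved against the page that references them rather than the site root.
  • Not filtering image URLs to the same domain: the script would also check external images, which wastes time and can produce false positives when third-party hosts reject automated requests.
  • Using `requests.get` instead of `requests.head` for image checks: downloading every image is slow and unnecessary when only the status code matters.
  • Omitting the `Content-Type` check: non-HTML responses such as PDFs or images would be handed to the parser, which wastes time and can yield garbage URLs.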
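
The first two points can be illustrated with the standard library alone. This is a minimal sketch, not the script's own code: `resolve_image_url` is a hypothetical helper that resolves a `src` value against the page it appears on and rejects anything hosted elsewhere.

```python
from urllib.parse import urljoin, urlparse


def resolve_image_url(page_url, src, base_url):
    """Resolve src against the page it appears on; return None for off-site images."""
    full_url = urljoin(page_url, src)
    if urlparse(full_url).netloc != urlparse(base_url).netloc:
        return None
    return full_url


base = 'https://example.com'
print(resolve_image_url('https://example.com/blog/post', 'img/a.png', base))
# https://example.com/blog/img/a.png
print(resolve_image_url('https://example.com/blog/', '/img/a.png', base))
# https://example.com/img/a.png
print(resolve_image_url('https://example.com/', '//cdn.example.net/a.png', base))
# None
```

Comparing hosts instead of using a string prefix also keeps look-alike hosts such as `example.com.evil.org` out of the results.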

Variations

  1. Replace `requests.head` with `requests.get` and `stream=True`, then read the body and compare its size with the `Content-Length` header, to also verify that the image is not truncated.
  2. Add a progress indicator using `tqdm` when checking many images.
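
For the first variation, one possible way to detect truncation is to count the bytes actually received and compare them with the declared length. The helper name `is_complete` is an assumption made for this sketch. It also assumes the body is not content-encoded (for example gzip), since a decoded body would not match `Content-Length`.

```python
def is_complete(headers, chunks):
    """Compare the bytes received with the declared Content-Length.

    Returns True or False, or None when the server sent no Content-Length.
    """
    declared = headers.get('Content-Length')
    if declared is None:
        return None
    received = sum(len(chunk) for chunk in chunks)
    return received == int(declared)


print(is_complete({'Content-Length': '6'}, [b'abc', b'def']))
# True
print(is_complete({'Content-Length': '10'}, [b'abc']))
# False
```

With requests, it could be called as `is_complete(resp.headers, resp.iter_content(chunk_size=8192))` on a response obtained from `requests.get(url, stream=True, timeout=5)`.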

Real-world use cases

  • Scheduled CI job that catches broken images before they reach production after a site migration.
  • Content audit script that scans a marketing website for missing product images before a campaign launch.
  • Onboarding check for new clients — verify all uploaded assets appear correctly on their staging site.
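
For the CI use case, the job needs a non-zero exit status when anything is broken. A minimal sketch, assuming the `(url, status)` tuples returned by `check_images`; the `report` function and the sample data are hypothetical.

```python
def report(broken):
    """Print one line per broken image and return a process exit code."""
    for url, status in broken:
        print(f'{status} {url}')
    return 1 if broken else 0


# Placeholder data; in a real job this list would come from check_images(site).
sample = [('https://example.com/images/missing.png', 404)]
exit_code = report(sample)
print(exit_code)
# 1
```

In an actual CI script the returned value would be passed to `sys.exit` so that the pipeline step fails.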

Run locally

This sample needs third-party packages. Install the packages shown at the top, then run the script in your own Python environment.
