Find Broken Image References Across a Website in Python

Crawl a website's internal pages, collect every image source URL, then check each one with a HEAD request and report those that return an HTTP 4xx or 5xx status or fail to connect.

Medium Python 3.9+ Jun 28, 2026 Automation & scripting

web scraping crawling broken links beautifulsoup requests concurrency

Requires third-party packages — install first

pip install requests beautifulsoup4

Python code

import requests
from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor, as_completed

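def find_all_links(base_url, max_pages=50):
    """Crawl up to max_pages internal pages and yield same-site image URLs."""
    base_host = urlparse(base_url).netloc
    visited, to_visit = set(), {base_url}
    while to_visit and len(visited) < max_pages:
        url = to_visit.pop()
        if url in visited:
            continue
        visited.add(url)
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue  # unreachable page: skip it and keep crawling
        if 'text/html' not in resp.headers.get('Content-Type', ''):
            continue
        soup = BeautifulSoup(resp.text, 'html.parser')
        for tag in soup.find_all(['img', 'source']):
            # Prefer src; otherwise take the first srcset candidate.
            # A tag with neither attribute is skipped instead of raising IndexError.
            candidates = tag.get('srcset', '').split(',')[0].split()
            src = tag.get('src') or (candidates[0] if candidates else None)
            if src:
                full_url = urljoin(url, src)
                # Compare hosts rather than string prefixes, so that a host like
                # example.com.evil.test is not mistaken for example.com.
                if urlparse(full_url).netloc == base_host:
                    yield full_url
        for a in soup.find_all('a', href=True):
            full_url = urljoin(url, a['href'])
            if urlparse(full_url).netloc == base_host and full_url not in visited:
                to_visit.add(full_url)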

def check_images(base_url):
    image_urls = set(find_all_links(base_url))
    broken = []
    with ThreadPoolExecutor(max_workers=10) as executor:
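        # requests.head does not follow redirects by default; follow them so a
        # redirect that ends in a 4xx/5xx response is still reported.
        futures = {executor.submit(requests.head, url, timeout=5, allow_redirects=True): url
                   for url in image_urls}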
        for future in as_completed(futures):
            url = futures[future]
            try:
                resp = future.result()
                if resp.status_code >= 400:
                    broken.append((url, resp.status_code))
            except requests.RequestException:
                broken.append((url, 'Connection error'))
    broken.sort(key=lambda x: x[1] if isinstance(x[1], int) else 999)
    return broken

if __name__ == '__main__':
    site = 'https://example.com'
    broken_images = check_images(site)
    for url, status in broken_images:
        print(f'{status} {url}')

Output

stdout

403 https://example.com/assets/forbidden.jpg
404 https://example.com/images/missing.png
Connection error https://example.com/img/unreachable.svg

How it works

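The crawler starts at base_url and follows internal <a> links, parsing each HTML page with BeautifulSoup and collecting the URLs found in its <img> and <source> tags. urljoin resolves relative paths against the page they appear on, and only URLs on the same site as base_url are kept. max_pages=50 caps the crawl so that the script neither scans unboundedly nor overloads the server. The image checks run in a ThreadPoolExecutor with max_workers=10, so up to ten HEAD requests are in flight at once instead of one after another. Any response with a status code of 400 or above, and any request that raises a requests exception, is recorded as broken. The report is sorted by status code, with connection errors last.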
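The parallel check and the final sort can be tried without touching the network. The sketch below is an illustration only: check_all, fake_head and FAKE_STATUSES are hypothetical names that do not appear in the snippet, and fake_head stands in for the real requests.head call by returning a status code directly.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def check_all(urls, check, max_workers=10):
    """Run check(url) for every URL in parallel and return the broken ones.

    `check` is any callable that returns an integer status code or raises
    an exception (hypothetical interface chosen for this sketch).
    """
    broken = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(check, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                status = future.result()
            except Exception:
                broken.append((url, 'Connection error'))
                continue
            if status >= 400:
                broken.append((url, status))
    # Integer statuses sort ascending; string entries get the key 999 and go last.
    broken.sort(key=lambda item: item[1] if isinstance(item[1], int) else 999)
    return broken

# Hypothetical stand-in for the network call, so the sketch runs offline.
FAKE_STATUSES = {'/ok.png': 200, '/gone.png': 404, '/secret.png': 403}

def fake_head(url):
    if url not in FAKE_STATUSES:
        raise ConnectionError(url)
    return FAKE_STATUSES[url]

result = check_all(['/ok.png', '/gone.png', '/secret.png', '/down.png'], fake_head)
print(result)
# [('/secret.png', 403), ('/gone.png', 404), ('/down.png', 'Connection error')]
```

Because as_completed yields futures in completion order, the list is only deterministic after the sort.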

Common mistakes

Forgetting to call `urljoin`: a relative path such as `img/logo.png` is not a requestable URL until it is resolved against the page it appears on.
Not filtering image URLs to the same domain: the script would check external images, wasting time and risking false positives.
Using `requests.get` instead of `requests.head` for image checks: downloading every image body is slow and unnecessary when only the status code matters.
Omitting the `Content-Type` check: non-HTML responses such as PDFs or images would be downloaded and fed to the HTML parser, wasting time and possibly yielding bogus URLs.
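The first two points can be seen in isolation with the standard library alone. This is a minimal sketch, assuming that "same domain" means an identical host; resolve_internal is a hypothetical helper, not part of the snippet. It also shows why comparing hosts can be safer than a plain string-prefix test.

```python
from urllib.parse import urljoin, urlparse

def resolve_internal(page_url, src, base_url):
    """Resolve src against page_url; return it only if it is on base_url's host."""
    full_url = urljoin(page_url, src)
    if urlparse(full_url).netloc == urlparse(base_url).netloc:
        return full_url
    return None

base = 'https://example.com'
page = 'https://example.com/blog/post.html'

# Relative paths are resolved against the page, not the site root.
print(resolve_internal(page, 'img/logo.png', base))
# https://example.com/blog/img/logo.png
print(resolve_internal(page, '/assets/hero.jpg', base))
# https://example.com/assets/hero.jpg

# External hosts are dropped, including one that merely starts with the base string.
print(resolve_internal(page, 'https://cdn.example.net/pic.png', base))
# None
print(resolve_internal(page, 'https://example.com.evil.test/x.png', base))
# None
```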

Variations

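Replace `requests.head` with `requests.get` + `stream=True` and read the body, for example comparing the number of bytes received with the `Content-Length` header, to also verify that the image is not truncated.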
Add a progress indicator using `tqdm` when checking many images.
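One way the truncation check might look is sketched below. body_is_complete and FakeResponse are hypothetical names. The comparison assumes the server sends a Content-Length header and no Content-Encoding, because iter_content decodes compressed bodies and the byte counts would then not match. Depending on the installed requests and urllib3 versions, a short body may instead surface as an exception while reading, which the caller should also treat as broken.

```python
def body_is_complete(resp, chunk_size=8192):
    """Return True if the streamed body length matches the declared length.

    `resp` is expected to behave like a requests.Response opened with
    stream=True, i.e. to offer `headers` and `iter_content`.
    """
    declared = resp.headers.get('Content-Length')
    encoded = resp.headers.get('Content-Encoding')
    received = sum(len(chunk) for chunk in resp.iter_content(chunk_size=chunk_size))
    if declared is None or encoded:
        # No reliable length to compare against: only require a non-empty body.
        return received > 0
    return received == int(declared)

class FakeResponse:
    """Hypothetical stand-in for a streamed response, so the sketch runs offline."""
    def __init__(self, headers, body):
        self.headers = headers
        self._body = body

    def iter_content(self, chunk_size=1):
        for i in range(0, len(self._body), chunk_size):
            yield self._body[i:i + chunk_size]

complete = FakeResponse({'Content-Length': '10'}, b'x' * 10)
truncated = FakeResponse({'Content-Length': '10'}, b'x' * 6)
print(body_is_complete(complete))   # True
print(body_is_complete(truncated))  # False

# With the real library the call would look roughly like:
#   with requests.get(url, stream=True, timeout=10) as resp:
#       ok = resp.status_code < 400 and body_is_complete(resp)
```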

Real-world use cases

Scheduled CI job that catches broken images before they reach production after a site migration.
Content audit script that scans a marketing website for missing product images before a campaign launch.
Onboarding check for new clients — verify all uploaded assets appear correctly on their staging site.
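For the CI use case, the job needs a non-zero exit status when anything is broken. A minimal sketch, assuming the list of (url, status) tuples returned by check_images; the report function is a hypothetical addition, not part of the snippet.

```python
import sys

def report(broken, out=sys.stdout):
    """Print one line per broken image and return a process exit code."""
    for url, status in broken:
        print(f'{status} {url}', file=out)
    return 1 if broken else 0

# In a CI job the script would end with something like:
#   sys.exit(report(check_images(site)))
sample = [('https://example.com/images/missing.png', 404)]
exit_code = report(sample)
```

Most CI systems mark a step as failed when the process exits with a non-zero code, so no further integration should be needed.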
