Maintenance

Site is under maintenance — quizzes are still available.

Go to quizzes
Sponsored Reserved space — layout preview until AdSense is connected

Extract All Links from Any Website in Python

Scrape a webpage and extract all absolute HTTP/HTTPS links using requests and regex.

Medium Python 3.9+ Jun 27, 2026 Automation & scripting 1 views 0 copies

Requires third-party packages — install first
pip install requests

Python code

25 lines
Python 3.9+
import requests
import re
from urllib.parse import urljoin

def extract_links(url):
    try:
        response = requests.get(url)
        response.raise_for_status()
        html = response.text
        # Find all href attributes in anchor tags
        pattern = r'href=["\'](.*?)["\']'
        raw_links = re.findall(pattern, html, re.IGNORECASE)
        absolute_links = set()
        for link in raw_links:
            absolute = urljoin(url, link)
            if absolute.startswith('http://') or absolute.startswith('https://'):
                absolute_links.add(absolute)
        return sorted(absolute_links)
    except requests.exceptions.RequestException as e:
        return [f"Error: {e}"]

if __name__ == "__main__":
    links = extract_links("https://example.com")
    for link in links[:10]:  # Show first 10 links
        print(link)

Output

stdout
https://www.iana.org/domains/example

How it works

The requests.get() fetches the page HTML. A regex pattern href=["'](.*?)["'] captures every link inside an href attribute. urljoin() converts relative URLs (like /about) to absolute ones. The script filters only http:// or https:// schemes and deduplicates results with a set. Finally it returns sorted links; the demo prints the first 10.

Common mistakes

  • Not using `urljoin` — relative links don't work without base URL expansion.
  • Forgetting to handle `RequestException` — network errors crash the script.
  • Using a case-sensitive regex and missing uppercase `HREF`.

Variations

  1. Replace regex with `BeautifulSoup` and `soup.find_all('a')` for more robust HTML parsing.
  2. Fetch the page with `httpx` (async) for concurrent scraping of multiple pages.

Real-world use cases

  • Auditing internal links on a website to find broken or outdated URLs before launch.
  • Building a sitemap generator that crawls a domain and lists all reachable pages.
  • Monitoring competitor sites for new blog posts or product pages by extracting links daily.

Sponsored

Sponsored Reserved space — layout preview until AdSense is connected

Run locally

This sample needs third-party packages, so it cannot run in the browser IDE. Copy the code above, install the packages shown at the top, then run it in your own Python environment.

More from Automation & scripting

Related tutorials and quizzes for this topic.