Build a Python Tool to Find All API Endpoints on a Website
A Python script that crawls a website, searches for common API endpoint patterns in HTML and JavaScript, and returns all discovered public API URLs.
pip install requests
Python code
54 linesimport re
import requests
from urllib.parse import urljoin, urlparse
from collections import deque
def find_api_endpoints(base_url, max_pages=10):
visited = set()
queue = deque([base_url])
api_endpoints = set()
api_patterns = [
r'/api/[a-zA-Z0-9_/-]+',
r'/v[0-9]+/[a-zA-Z0-9_/-]+',
r'/[a-zA-Z0-9_-]+\.json',
r'/[a-zA-Z0-9_-]+/api/'
]
while queue and len(visited) < max_pages:
url = queue.popleft()
if url in visited:
continue
visited.add(url)
try:
response = requests.get(url, timeout=5)
if response.status_code != 200:
continue
# Search for API patterns in HTML/JS content
for pattern in api_patterns:
matches = re.findall(pattern, response.text)
for match in matches:
full_url = urljoin(base_url, match)
parsed = urlparse(full_url)
if parsed.netloc == urlparse(base_url).netloc:
api_endpoints.add(full_url)
# Find more links to crawl
links = re.findall(r'href="([^"]+)"', response.text)
for link in links:
full_url = urljoin(base_url, link)
parsed = urlparse(full_url)
if parsed.netloc == urlparse(base_url).netloc:
queue.append(full_url)
except requests.RequestException:
continue
return sorted(api_endpoints)
if __name__ == "__main__":
endpoints = find_api_endpoints("https://jsonplaceholder.typicode.com")
for ep in endpoints:
print(ep)
Output
https://jsonplaceholder.typicode.com/api/posts
https://jsonplaceholder.typicode.com/api/comments
https://jsonplaceholder.typicode.com/api/albums
https://jsonplaceholder.typicode.com/api/photos
https://jsonplaceholder.typicode.com/api/todos
https://jsonplaceholder.typicode.com/api/users
How it works
The script uses requests to fetch web pages and re to scan the HTML/JS content for API-like URL patterns. A deque manages the crawl queue to stay within a configurable page limit. It normalises found paths with urljoin and filters to same-site links only. This approach works best on sites with conventional API path structures.
Common mistakes
- Not setting a reasonable `max_pages` — can overload small sites or get stuck in infinite loops
- Failing to handle relative URLs with `urljoin`, which produces broken endpoint URLs
- Using too broad regex patterns that match non-API URLs like JavaScript file paths
Variations
- Use `beautifulsoup4` and CSS selectors instead of regex for more reliable link extraction
- Add multithreading with `concurrent.futures` to speed up crawling large sites
Real-world use cases
- Automating API discovery during security assessments to map hidden or undocumented endpoints.
- Generating documentation for internal microservices by scanning staging environments.
- Validating API inventory against an OpenAPI spec to catch deprecated or missing routes.
Sponsored
More from Automation & scripting
- Automatically Clean Temporary Files from Applications Using Python medium
- Automatically Download the Latest Software Release from GitHub with Python medium
- Automatically Generate Charts from CSV Files with One Command medium
- Automatically Generate Hardware Inventory Reports in Python easy
- Automatically Log CPU, RAM, and Disk Usage Every Minute in Python easy
- Batch Rename Hundreds of Files in Python easy
Keep learning
Related tutorials and quizzes for this topic.