Extract Schema.org Structured Data from Any Website in Python
A Python tool that fetches a webpage and extracts all JSON-LD structured data (Schema.org) embedded in <script> tags with type="application/ld+json".
pip install requests beautifulsoup4
Python code
import requests
from bs4 import BeautifulSoup
import json


def extract_schema_org(url):
    """Extract structured data (Schema.org) from a website."""
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        return {"error": f"Failed to fetch URL: {e}"}

    soup = BeautifulSoup(response.text, 'html.parser')
    schema_data = []

    # Find all script tags with type="application/ld+json"
    for script in soup.find_all('script', type='application/ld+json'):
        try:
            data = json.loads(script.string)
            schema_data.append(data)
        except (json.JSONDecodeError, TypeError):
            # Skip tags that are empty or contain malformed JSON
            continue

    if not schema_data:
        # Fall back to checking for other common Schema patterns (e.g., microdata)
        # For simplicity, return empty list if no JSON-LD found
        return {"message": "No Schema.org data found (JSON-LD format).", "data": []}

    return {"url": url, "schema_count": len(schema_data), "data": schema_data}


if __name__ == "__main__":
    test_url = "https://schema.org/docs/gs.html"  # Example page that may contain Schema.org data
    result = extract_schema_org(test_url)
    print(json.dumps(result, indent=2, default=str))
Output
{
  "url": "https://schema.org/docs/gs.html",
  "schema_count": 1,
  "data": [
    {
      "@context": "https://schema.org",
      "@type": "TechArticle",
      "name": "Getting Started with Schema.org"
    }
  ]
}
How it works
Many websites embed structured data as JSON-LD inside <script type="application/ld+json"> tags. The code fetches the page with requests, parses the HTML with BeautifulSoup, finds every such script tag, and parses its content with json.loads. Tags that are empty or contain malformed JSON are skipped silently. If no JSON-LD is found, the function returns a dictionary with a message and an empty data list; it could be extended to parse microdata or RDFa as well. Network and HTTP errors are caught and returned as a dictionary with an "error" key, so a failed request does not crash the script.
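The parsing step can be tried without any network access. The sketch below runs the same find-and-parse loop on an inline HTML string; the sample markup and the helper name parse_json_ld are invented for illustration and are not part of the original code.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical sample page; the markup is invented for illustration.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Article", "headline": "Hello"}
</script>
<script type="application/ld+json">{ not valid json }</script>
<script type="text/javascript">var x = 1;</script>
</head><body></body></html>
"""


def parse_json_ld(html):
    """Return the parsed content of every valid JSON-LD script tag in html."""
    soup = BeautifulSoup(html, "html.parser")
    blocks = []
    for script in soup.find_all("script", type="application/ld+json"):
        try:
            blocks.append(json.loads(script.string))
        except (json.JSONDecodeError, TypeError):
            # Malformed JSON or an empty tag: skip it, as the main function does.
            continue
    return blocks


blocks = parse_json_ld(SAMPLE_HTML)
print(json.dumps(blocks, indent=2))
```

Only the first script tag ends up in the result: the second is JSON-LD by type but fails to parse, and the third has a different type attribute, so find_all never returns it.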
Common mistakes
- Forgetting to install both requests and beautifulsoup4 (pip install requests beautifulsoup4)
- Assuming all schema data is JSON-LD; many sites use microdata or RDFa instead
- Not handling malformed JSON or nested structures (such as top-level arrays or @graph wrappers) within script tags
- Overlooking raw line breaks inside JSON string values, which json.loads rejects by default (whitespace around the JSON itself is harmless)
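Two of the mistakes above, nested structures and line breaks inside strings, can be handled with a small amount of extra code. This is a minimal sketch, not part of the original function: load_json_ld and flatten_items are hypothetical helper names, and the sample input is invented. It relies on the strict=False option of json.loads, which permits control characters such as literal newlines inside string values.

```python
import json


def load_json_ld(raw):
    """Parse one JSON-LD script body, tolerating raw newlines inside strings."""
    if raw is None:
        return None
    try:
        # strict=False lets the parser accept control characters (literal
        # newlines, tabs) inside string values, which strict mode rejects.
        return json.loads(raw, strict=False)
    except json.JSONDecodeError:
        return None


def flatten_items(data):
    """Yield each JSON-LD object from a dict, a list, or an @graph wrapper."""
    if isinstance(data, list):
        for entry in data:
            yield from flatten_items(entry)
    elif isinstance(data, dict):
        if "@graph" in data:
            # The wrapper itself is dropped; only its members are yielded.
            yield from flatten_items(data["@graph"])
        else:
            yield data


# Invented sample: an @graph wrapper whose second entry contains a raw newline.
raw = '{"@graph": [{"@type": "Organization"}, {"@type": "WebSite", "description": "two\nlines"}]}'
items = list(flatten_items(load_json_ld(raw)))
print([item["@type"] for item in items])
```

Note that this sketch discards any @context declared on the @graph wrapper, which may matter if the consumer needs to resolve prefixed terms.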
Variations
- Use httpx instead of requests for async support
- Extend to also parse microdata by finding HTML attributes like itemscope and itemprop
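The microdata variation could look roughly like the sketch below. It is a simplified reading of the microdata rules rather than a complete implementation: it ignores itemref and itemid, and it covers only a few of the element-specific value rules. The function names, the output shape, and the sample markup are all assumptions made for illustration.

```python
from bs4 import BeautifulSoup


def _prop_value(tag):
    """Pick a property value using a simplified subset of the microdata rules."""
    if tag.name == "meta":
        return tag.get("content", "")
    if tag.name in ("a", "link"):
        return tag.get("href", "")
    if tag.name == "img":
        return tag.get("src", "")
    if tag.name == "time" and tag.has_attr("datetime"):
        return tag["datetime"]
    return tag.get_text(strip=True)


def _parse_item(scope):
    """Build a dict for one itemscope element, recursing into nested items."""
    item = {"type": scope.get("itemtype", ""), "properties": {}}
    for prop in scope.find_all(attrs={"itemprop": True}):
        # Keep only properties whose nearest itemscope ancestor is this scope.
        if prop.find_parent(attrs={"itemscope": True}) is not scope:
            continue
        value = _parse_item(prop) if prop.has_attr("itemscope") else _prop_value(prop)
        # itemprop may hold several space-separated property names.
        for name in prop["itemprop"].split():
            item["properties"].setdefault(name, []).append(value)
    return item


def extract_microdata(html):
    """Return top-level microdata items found in html."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        _parse_item(scope)
        for scope in soup.find_all(attrs={"itemscope": True})
        if not scope.has_attr("itemprop")  # nested items are reached via their parent
    ]


# Invented sample markup for illustration.
SAMPLE = """
<div itemscope itemtype="https://schema.org/Product">
  <span itemprop="name">Example Widget</span>
  <div itemprop="offers" itemscope itemtype="https://schema.org/Offer">
    <meta itemprop="price" content="19.99">
    <link itemprop="availability" href="https://schema.org/InStock">
  </div>
</div>
"""
items = extract_microdata(SAMPLE)
print(items)
```

Each property maps to a list because microdata allows the same property name to appear more than once within an item.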
Real-world use cases
- SEO auditing: verify that your blog pages have proper Schema.org markup for rich snippets.
- Competitor analysis: extract structured data from competitor product pages to understand their markup patterns.
- Data ingestion: pull event listings or job postings from multiple sites that use JSON-LD schema.
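For auditing or ingestion, the extracted data usually needs to be filtered by type. The following is one possible sketch: iter_nodes and find_by_type are hypothetical helpers, and the result dictionary is invented sample data shaped like the return value of extract_schema_org. It accounts for @type being either a string or a list of strings.

```python
def iter_nodes(data):
    """Recursively yield every dict found anywhere inside parsed JSON-LD."""
    if isinstance(data, dict):
        yield data
        for value in data.values():
            yield from iter_nodes(value)
    elif isinstance(data, list):
        for entry in data:
            yield from iter_nodes(entry)


def find_by_type(data, wanted):
    """Return nodes whose @type equals wanted or is a list containing it."""
    matches = []
    for node in iter_nodes(data):
        node_type = node.get("@type")
        types = node_type if isinstance(node_type, list) else [node_type]
        if wanted in types:
            matches.append(node)
    return matches


# Invented sample shaped like the return value of extract_schema_org.
result = {
    "url": "https://example.com/jobs",
    "schema_count": 2,
    "data": [
        {"@type": "JobPosting", "title": "Data Engineer"},
        {"@graph": [
            {"@type": ["Event", "Thing"], "name": "Meetup"},
            {"@type": "JobPosting", "title": "Analyst"},
        ]},
    ],
}

jobs = find_by_type(result["data"], "JobPosting")
print([job["title"] for job in jobs])

# A simple audit check: does the page declare at least one Event?
has_event = bool(find_by_type(result["data"], "Event"))
print(has_event)
```

Because the search is recursive, it also finds typed objects nested inside other objects, such as an Offer inside a Product.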