Extract Schema.org Structured Data from Any Website in Python

A Python tool that fetches a webpage and extracts all JSON-LD structured data (Schema.org) embedded in <script> tags with type="application/ld+json".

Medium Python 3.9+ Jun 28, 2026 Data pipelines & processing

web-scraping structured-data schema-org json-ld requests beautifulsoup

Requires third-party packages; install them first:

pip install requests beautifulsoup4

Python code

import requests
from bs4 import BeautifulSoup
import json

def extract_schema_org(url):
    """Extract structured data (Schema.org) from a website."""
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        return {"error": f"Failed to fetch URL: {e}"}

    soup = BeautifulSoup(response.text, 'html.parser')
    schema_data = []

    # Find all script tags with type="application/ld+json"
    for script in soup.find_all('script', type='application/ld+json'):
        try:
            data = json.loads(script.string)
            schema_data.append(data)
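        except (json.JSONDecodeError, TypeError):
            # Skip blocks that are not valid JSON, or whose .string is None
            # (an empty tag, or one with more than one child node)
            continue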

    if not schema_data:
        # No parseable JSON-LD block was found. Microdata and RDFa are not
        # checked here, so the page may still carry Schema.org markup.
        return {
            "url": url,
            "schema_count": 0,
            "data": [],
            "message": "No Schema.org data found (JSON-LD format).",
        }

    return {"url": url, "schema_count": len(schema_data), "data": schema_data}

if __name__ == "__main__":
    test_url = "https://schema.org/docs/gs.html"  # Example page that may contain Schema.org data
    result = extract_schema_org(test_url)
    print(json.dumps(result, indent=2, default=str))

Output

stdout

{
  "url": "https://schema.org/docs/gs.html",
  "schema_count": 1,
  "data": [
    {
      "@context": "https://schema.org",
      "@type": "TechArticle",
      "name": "Getting Started with Schema.org"
    }
  ]
}

How it works

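Many websites embed structured data in JSON-LD format inside <script type="application/ld+json"> tags. The code fetches the page with requests, parses the HTML with BeautifulSoup, locates every such script tag, and parses its content with json.loads. Each successfully parsed block is appended to the result list; a block that is empty or not valid JSON is skipped silently. If no JSON-LD is found, the function returns an empty data list. It does not look at microdata or RDFa, although it could be extended to do so. A failed request (a timeout, a connection error, or an HTTP 4xx/5xx status raised by raise_for_status) is returned as an error dict instead of crashing the script.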
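
To see the parsing step in isolation, the sketch below runs the same find_all and json.loads logic on an inline HTML string instead of a fetched page. The sample markup and the helper name parse_json_ld are made up for illustration; they are not part of the original snippet.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical sample page: one valid JSON-LD block, one malformed block,
# and one ordinary JavaScript tag that must be ignored.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Article", "headline": "Hello"}
</script>
<script type="application/ld+json">{not valid json}</script>
<script type="text/javascript">var x = 1;</script>
</head><body></body></html>
"""

def parse_json_ld(html):
    """Return every JSON-LD block in the HTML that parses as JSON."""
    soup = BeautifulSoup(html, "html.parser")
    blocks = []
    for script in soup.find_all("script", type="application/ld+json"):
        try:
            blocks.append(json.loads(script.string))
        except (json.JSONDecodeError, TypeError):
            # Malformed or empty block: skip it, as the main function does
            continue
    return blocks

blocks = parse_json_ld(SAMPLE_HTML)
print(json.dumps(blocks, indent=2))
```

Only the first script tag ends up in the result: the second fails to parse and the third has a different type attribute.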

Common mistakes

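Forgetting to install both requests and beautifulsoup4 (pip install requests beautifulsoup4)
Assuming all schema data is JSON-LD: many sites use microdata or RDFa instead, which this code does not read
Assuming each block is a single flat object: a JSON-LD block can be a top-level array or can wrap several nodes in an @graph list, and malformed blocks are skipped silently rather than reported
Expecting raw line breaks or tabs inside JSON string values to parse: whitespace around the JSON is harmless, but an unescaped control character inside a string makes json.loads raise JSONDecodeError unless strict=False is passed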
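
The last two points can be handled with the standard library alone. The following is a minimal sketch, not part of the original snippet: load_json_ld and flatten_nodes are hypothetical helper names, and the sample string is invented. It uses the strict=False option of json.loads to tolerate raw control characters inside strings, and it unwraps top-level arrays and @graph lists into individual nodes.

```python
import json

def load_json_ld(raw):
    """Parse one JSON-LD script body, tolerating raw control characters."""
    # strict=False lets the decoder accept literal newlines and tabs
    # inside string values, which strict JSON forbids.
    return json.loads(raw, strict=False)

def flatten_nodes(doc):
    """Yield each node dict from a parsed JSON-LD document."""
    if isinstance(doc, list):
        # Top-level array: visit every entry
        for entry in doc:
            yield from flatten_nodes(entry)
    elif isinstance(doc, dict):
        graph = doc.get("@graph")
        if isinstance(graph, list):
            # Wrapper object: the real nodes live in the @graph list
            yield from flatten_nodes(graph)
        else:
            yield doc

# Invented sample: the "\n" below becomes a real newline inside the JSON
# string value, which the default strict parser would reject.
raw = (
    '{"@context": "https://schema.org", "@graph": ['
    '{"@type": "Organization", "name": "Acme\nCorp"}, '
    '{"@type": "WebSite", "name": "Acme"}]}'
)

nodes = list(flatten_nodes(load_json_ld(raw)))
types = [node["@type"] for node in nodes]
print(types)
```

Note that unwrapping @graph drops the wrapper object, so the @context it carried is not copied onto the individual nodes in this sketch.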

Variations

Use httpx instead of requests for async support
Extend to also parse microdata by finding HTML attributes like itemscope and itemprop
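
As a rough illustration of the microdata variation, the sketch below walks itemscope and itemprop attributes with BeautifulSoup. It is a simplification under stated assumptions rather than a full implementation of the HTML microdata algorithm: it ignores itemref and itemid, and the names extract_microdata, parse_item, property_value, and URL_ATTRS, as well as the sample markup, are hypothetical.

```python
from bs4 import BeautifulSoup

# Elements whose property value comes from a URL attribute, not the text
URL_ATTRS = {"a": "href", "link": "href", "img": "src",
             "audio": "src", "video": "src", "source": "src"}

def property_value(tag):
    """Return the value of one itemprop element (simplified rules)."""
    if tag.has_attr("itemscope"):
        return parse_item(tag)          # nested item
    if tag.name == "meta":
        return tag.get("content", "")
    if tag.name in URL_ATTRS:
        return tag.get(URL_ATTRS[tag.name], "")
    if tag.name == "time" and tag.has_attr("datetime"):
        return tag["datetime"]
    return tag.get_text(strip=True)

def parse_item(scope):
    """Collect the properties that belong directly to one itemscope."""
    item = {"@type": scope.get("itemtype", "")}
    for prop in scope.find_all(attrs={"itemprop": True}):
        # Skip properties owned by a nested itemscope
        if prop.find_parent(attrs={"itemscope": True}) is not scope:
            continue
        name, value = prop["itemprop"], property_value(prop)
        if name in item:
            # Repeated property: collect the values in a list
            existing = item[name]
            item[name] = existing + [value] if isinstance(existing, list) else [existing, value]
        else:
            item[name] = value
    return item

def extract_microdata(html):
    """Return all top-level microdata items found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return [parse_item(tag)
            for tag in soup.find_all(attrs={"itemscope": True})
            if not tag.has_attr("itemprop")]   # nested items are reached via their parent

SAMPLE = (
    '<div itemscope itemtype="https://schema.org/Product">'
    '<span itemprop="name">Widget</span>'
    '<div itemprop="offers" itemscope itemtype="https://schema.org/Offer">'
    '<meta itemprop="price" content="9.99"></div></div>'
)

items = extract_microdata(SAMPLE)
print(items)
```

The returned dicts use "@type" for the itemtype URL so they can sit alongside JSON-LD results, but that naming is a design choice of this sketch, not something microdata itself defines.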

Real-world use cases

SEO auditing: verify that your blog pages have proper Schema.org markup for rich snippets.
Competitor analysis: extract structured data from competitor product pages to understand their markup patterns.
Data ingestion: pull event listings or job postings from multiple sites that use JSON-LD schema.
