Extract Schema.org Structured Data from Any Website in Python

A Python tool that fetches a webpage and extracts all JSON-LD structured data (Schema.org) embedded in <script> tags with type="application/ld+json".

Medium | Python 3.9+ | Jun 28, 2026 | Data pipelines & processing

Requires third-party packages — install first
pip install requests beautifulsoup4

Python code

import requests
from bs4 import BeautifulSoup
import json

def extract_schema_org(url):
    """Extract structured data (Schema.org) from a website."""
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        return {"error": f"Failed to fetch URL: {e}"}

    soup = BeautifulSoup(response.text, 'html.parser')
    schema_data = []

    # Find all script tags with type="application/ld+json"
    for script in soup.find_all('script', type='application/ld+json'):
        try:
            data = json.loads(script.string)
            schema_data.append(data)
        except (json.JSONDecodeError, TypeError):
            continue

    if not schema_data:
        # No JSON-LD block could be parsed. Microdata and RDFa are not checked here,
        # so this only means the page has no usable JSON-LD.
        return {"message": "No Schema.org data found (JSON-LD format).", "data": []}

    return {"url": url, "schema_count": len(schema_data), "data": schema_data}

if __name__ == "__main__":
    test_url = "https://schema.org/docs/gs.html"  # Example page that may contain Schema.org data
    result = extract_schema_org(test_url)
    print(json.dumps(result, indent=2, default=str))

Output

stdout
{
  "url": "https://schema.org/docs/gs.html",
  "schema_count": 1,
  "data": [
    {
      "@context": "https://schema.org",
      "@type": "TechArticle",
      "name": "Getting Started with Schema.org"
    }
  ]
}

How it works

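Many websites embed structured data as JSON-LD inside <script type="application/ld+json"> tags. The code fetches the page with requests, parses the HTML with BeautifulSoup, finds every such script tag, and parses its contents with json.loads; blocks that are empty or not valid JSON are skipped. If no block parses, the function returns a dict with a message and an empty data list. It does not look at microdata or RDFa, though it could be extended to do so. If the request fails (network error, timeout, or an HTTP error status raised by raise_for_status), the function returns a dict with an "error" key instead of raising.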
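
Because the function returns three differently shaped dicts (fetch error, nothing found, data found), calling code has to branch on the shape. The sketch below shows one way to do that. The helper name summarize_result is hypothetical and not part of the original script; it only assumes the keys the script already returns.

```python
def summarize_result(result):
    """Turn one of the three result shapes into a one-line status string."""
    # Shape 1: the HTTP request failed and the script returned {"error": ...}.
    if "error" in result:
        return f"fetch failed: {result['error']}"
    # Shape 2: the page loaded but no JSON-LD block could be parsed.
    if not result.get("data"):
        return "no JSON-LD found"
    # Shape 3: at least one block parsed; list the top-level @type of each.
    types = []
    for block in result["data"]:
        if isinstance(block, dict):
            types.append(str(block.get("@type", "unknown")))
        else:
            # A block can also be a top-level JSON array of nodes.
            types.append(type(block).__name__)
    return f"{result['schema_count']} block(s): {', '.join(types)}"


if __name__ == "__main__":
    sample = {"url": "https://example.com", "schema_count": 1,
              "data": [{"@type": "Article"}]}
    print(summarize_result(sample))
```

Checking for the "error" key first keeps a failed fetch from being reported as a page without markup.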

Common mistakes

  • Forgetting to install both requests and beautifulsoup4 (pip install requests beautifulsoup4)
  • Assuming all schema data is JSON-LD — many sites use microdata or RDFa instead
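  • Not handling malformed JSON in a script tag, or payloads that are a top-level array or wrap their nodes in @graph
  • Assuming whitespace is the problem: json.loads accepts whitespace and line breaks around and between tokens, but a raw line break inside a string value is invalid JSON and that block is skipped; an empty script tag makes script.string None, which is why TypeError is caught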
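
The nesting issue in the list above can be handled after parsing. A parsed JSON-LD block may be a single object, a top-level array, or an object whose nodes sit under "@graph", and "@type" may be a string or a list. The following is a minimal sketch under those assumptions; iter_nodes and nodes_of_type are hypothetical helper names, and the sample payload is made up for illustration.

```python
import json


def iter_nodes(payload):
    """Yield every dict node from a parsed JSON-LD payload."""
    if isinstance(payload, list):
        # Top-level array: walk each element.
        for item in payload:
            yield from iter_nodes(item)
    elif isinstance(payload, dict):
        graph = payload.get("@graph")
        if isinstance(graph, list):
            # "@graph" wrapper: the real nodes are inside it.
            for item in graph:
                yield from iter_nodes(item)
        else:
            yield payload
    # Anything else (strings, numbers, None) is not a node and is ignored.


def nodes_of_type(payload, wanted):
    """Return the nodes whose @type equals `wanted` or contains it."""
    found = []
    for node in iter_nodes(payload):
        node_type = node.get("@type")
        types = node_type if isinstance(node_type, list) else [node_type]
        if wanted in types:
            found.append(node)
    return found


# Leading whitespace and line breaks between tokens are valid JSON.
raw = """
  {"@context": "https://schema.org",
   "@graph": [
     {"@type": "WebSite", "name": "Example"},
     {"@type": ["Article", "NewsArticle"], "headline": "Hello"}
   ]}
"""
payload = json.loads(raw)
articles = nodes_of_type(payload, "Article")

if __name__ == "__main__":
    print(json.dumps(articles, indent=2))
```

Each entry of the "data" list returned by extract_schema_org could be passed to iter_nodes in the same way.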

Variations

  1. Use httpx instead of requests for async support
  2. Extend to also parse microdata by finding HTML attributes like itemscope and itemprop
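
As a rough illustration of the microdata variation, the sketch below collects flat (itemprop, value) pairs from an HTML string. It deliberately swaps in the standard-library html.parser instead of BeautifulSoup so it runs without extra packages. The class name MicrodataSketch and its attribute table are hypothetical, the markup is assumed to use explicit closing tags, and itemscope nesting is ignored, so this is far from a full microdata parser.

```python
from html.parser import HTMLParser


class MicrodataSketch(HTMLParser):
    """Collect flat (itemprop, value) pairs; itemscope nesting is ignored."""

    # Elements that never have a closing tag.
    VOID = {"area", "base", "br", "col", "embed", "hr", "img",
            "input", "link", "meta", "source", "track", "wbr"}
    # Elements whose value comes from an attribute rather than their text.
    VALUE_ATTRS = {"meta": "content", "link": "href", "a": "href",
                   "img": "src", "time": "datetime"}

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._depth = 0   # nesting depth of currently open non-void elements
        self._open = []   # (depth, itemprop, text parts) awaiting their end tag

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        is_void = tag in self.VOID
        if not is_void:
            self._depth += 1
        prop = attr_map.get("itemprop")
        if prop is None:
            return
        value = attr_map.get(self.VALUE_ATTRS.get(tag, ""))
        if value is not None:
            self.pairs.append((prop, value))
        elif not is_void:
            # The value is the element's text, gathered until it closes.
            self._open.append((self._depth, prop, []))

    def handle_data(self, data):
        for _, _, parts in self._open:
            parts.append(data)

    def handle_endtag(self, tag):
        if tag in self.VOID:
            return
        if self._open and self._open[-1][0] == self._depth:
            _, prop, parts = self._open.pop()
            # Collapse runs of whitespace in the collected text.
            self.pairs.append((prop, " ".join("".join(parts).split())))
        self._depth -= 1


SAMPLE_HTML = """
<div itemscope itemtype="https://schema.org/Product">
  <h1 itemprop="name">Blue <b>Widget</b></h1>
  <meta itemprop="sku" content="W-42">
  <a itemprop="url" href="https://example.com/widget">Details</a>
</div>
"""

parser = MicrodataSketch()
parser.feed(SAMPLE_HTML)
parser.close()

if __name__ == "__main__":
    for prop, value in parser.pairs:
        print(f"{prop}: {value}")
```

With BeautifulSoup already installed, the same idea could be expressed by searching for tags that carry an itemprop attribute; the standard-library version is used here only to keep the sketch dependency-free.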

Real-world use cases

  • SEO auditing: verify that your blog pages have proper Schema.org markup for rich snippets.
  • Competitor analysis: extract structured data from competitor product pages to understand their markup patterns.
  • Data ingestion: pull event listings or job postings from multiple sites that use JSON-LD schema.
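
For the SEO auditing case, one simple check is whether each extracted node carries the properties you expect for its type. The sketch below assumes the nodes are plain dicts with a string "@type". The REQUIRED table is a made-up checklist for illustration, not an official list from Schema.org or any search engine, and audit_nodes is a hypothetical helper name.

```python
# Hypothetical checklist: adjust to the properties your own pages should have.
REQUIRED = {
    "Article": {"headline", "datePublished", "author"},
    "Product": {"name", "offers"},
}


def audit_nodes(nodes, required=REQUIRED):
    """Report which expected properties are missing from each known node type."""
    report = []
    for node in nodes:
        node_type = node.get("@type")
        # Only string types that appear in the checklist are audited.
        expected = required.get(node_type) if isinstance(node_type, str) else None
        if expected is None:
            continue
        missing = sorted(expected - set(node))
        report.append({"type": node_type, "missing": missing})
    return report


if __name__ == "__main__":
    nodes = [
        {"@type": "Article", "headline": "Hi"},
        {"@type": "Product", "name": "Widget", "offers": {"price": "9.99"}},
    ]
    for entry in audit_nodes(nodes):
        print(entry)
```

An empty "missing" list means every property in the checklist is present; it says nothing about whether the values themselves are valid.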

Run locally

This sample depends on third-party packages. Install the packages listed at the top, save the code to a file, and run it in your own Python environment.
