Extract Every Open Graph and Social Media Meta Tag from Web Pages in Python
A Python script that fetches a webpage and extracts all Open Graph, Twitter Card, Facebook, and Article meta tags using the standard library HTML parser.
Python code
from html.parser import HTMLParser
from urllib.request import urlopen

# Prefixes that identify Open Graph, Twitter Card, Facebook, and Article tags
SOCIAL_PREFIXES = ('og:', 'twitter:', 'fb:', 'article:')


class MetaExtractor(HTMLParser):
    """Collects (property, content) pairs from social media <meta> tags."""

    def __init__(self):
        super().__init__()
        self.meta_tags = []

    def handle_starttag(self, tag, attrs):
        if tag == 'meta':
            attrs_dict = dict(attrs)
            # Open Graph uses 'property'; Twitter Cards use 'name'
            prop = attrs_dict.get('property') or attrs_dict.get('name') or ''
            # Match the start of the value only, so 'blog:title' is not mistaken for 'og:'
            if prop.startswith(SOCIAL_PREFIXES):
                content = attrs_dict.get('content') or ''
                if content:
                    self.meta_tags.append((prop, content))

    def get_meta(self):
        return self.meta_tags


def extract_social_meta(url: str) -> list:
    try:
        with urlopen(url, timeout=5) as response:
            # Undecodable bytes are dropped instead of raising
            html = response.read().decode('utf-8', errors='ignore')
        extractor = MetaExtractor()
        extractor.feed(html)
        return extractor.get_meta()
    except Exception as e:
        print(f"Error fetching {url}: {e}")
        return []


if __name__ == "__main__":
    test_url = "https://example.com"
    print(f"Extracting social meta tags from: {test_url}")
    results = extract_social_meta(test_url)
    for prop, content in results:
        print(f"{prop}: {content}")
    if not results:
        print("No OpenGraph or social media meta tags found.")
Output
Extracting social meta tags from: https://example.com
No OpenGraph or social media meta tags found.
How it works
The code subclasses HTMLParser and overrides handle_starttag, which builds a dictionary from each <meta> tag's attributes. It reads the property attribute, falling back to name, and keeps the tag when that value carries one of the prefixes og:, twitter:, fb:, or article: and the content attribute is non-empty. urlopen is called with a 5-second timeout, and the response is decoded as UTF-8 with errors='ignore', so undecodable bytes are dropped rather than raising an exception. HTMLParser.feed can accept HTML in chunks, but this script reads the whole response into memory and passes it to feed in a single call.
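To see the parsing step without any network access, the following is a minimal sketch that runs a parser over an inline HTML string. The class name `SocialMetaParser`, the constant `SOCIAL_PREFIXES`, and the sample markup are illustrative assumptions, not part of the script above. The sample is fed in 16-character chunks to show that `feed` buffers tags split across calls.

```python
from html.parser import HTMLParser

# Assumed prefix list, mirroring the prefixes named in the explanation above
SOCIAL_PREFIXES = ("og:", "twitter:", "fb:", "article:")


class SocialMetaParser(HTMLParser):
    """Collects (property, content) pairs from social <meta> tags."""

    def __init__(self):
        super().__init__()
        self.meta_tags = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs_dict = dict(attrs)
        # 'property' is tried first, then 'name'; valueless attributes arrive as None
        prop = attrs_dict.get("property") or attrs_dict.get("name") or ""
        content = attrs_dict.get("content") or ""
        if prop.startswith(SOCIAL_PREFIXES) and content:
            self.meta_tags.append((prop, content))


# Hypothetical sample page: two social tags, one ordinary tag, one empty tag
SAMPLE_HTML = (
    '<html><head>'
    '<meta property="og:title" content="Hello">'
    '<meta name="twitter:card" content="summary">'
    '<meta name="description" content="Not a social tag">'
    '<meta property="og:image" content="">'
    '</head><body></body></html>'
)

parser = SocialMetaParser()
# Feed the document in small chunks; incomplete tags are buffered until complete
for start in range(0, len(SAMPLE_HTML), 16):
    parser.feed(SAMPLE_HTML[start:start + 16])
parser.close()

print(parser.meta_tags)
```

The ordinary `description` tag and the `og:image` tag with empty content are both skipped, so only the `og:title` and `twitter:card` pairs are collected.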
Common mistakes
- Forgetting that Open Graph uses the `property` attribute while Twitter Cards use `name`. The code checks both.
- Not setting a timeout on urlopen, which can cause the script to hang on slow or unresponsive servers.
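- Assuming HTML is always UTF-8. `errors='ignore'` keeps the script from crashing, but it silently drops bytes from pages served in other encodings.
- Assuming case handling is uniform. HTMLParser lowercases tag and attribute names, but attribute values are passed through unchanged, so a value such as `OG:Title` fails a case-sensitive prefix check.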
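Two of these pitfalls, encoding and case, can be handled with small helpers. The sketch below is one possible approach, not the script's own behavior. `decode_body` and `is_social_property` are hypothetical names. With `urlopen`, the declared charset could be read via `response.headers.get_content_charset()`, which returns `None` when the server does not declare one.

```python
from typing import Optional

SOCIAL_PREFIXES = ("og:", "twitter:", "fb:", "article:")


def decode_body(raw: bytes, declared_charset: Optional[str]) -> str:
    """Decode bytes with the declared charset, then UTF-8, then lossy UTF-8."""
    for encoding in (declared_charset, "utf-8"):
        if not encoding:
            continue
        try:
            return raw.decode(encoding)
        except (LookupError, UnicodeDecodeError):
            # Unknown codec name or bytes that do not fit: try the next option
            continue
    # Last resort: keep going, but mark undecodable bytes with U+FFFD
    return raw.decode("utf-8", errors="replace")


def is_social_property(prop: str) -> bool:
    """Case-insensitive prefix check for social meta tag names."""
    return prop.strip().lower().startswith(SOCIAL_PREFIXES)


print(decode_body("café".encode("latin-1"), "latin-1"))
print(is_social_property("OG:Title"))
print(is_social_property("blog:title"))
```

Using `errors="replace"` in the final fallback leaves a visible marker where bytes could not be decoded, whereas `errors='ignore'` removes them silently. The `blog:title` case returns `False` here but would pass a plain substring test for `og:`.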
Variations
- Use BeautifulSoup with `soup.find_all('meta')` and attribute selectors for more flexible extraction.
- Add support for JSON-LD structured data by extracting `<script type="application/ld+json">` blocks.
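The JSON-LD variation can also be sketched with the standard library alone. The class `JsonLdExtractor` and the sample markup below are illustrative assumptions. The sketch relies on HTMLParser delivering the contents of a `<script>` element through `handle_data`, and it skips blocks that are not valid JSON.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects parsed JSON-LD documents from <script> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            script_type = (dict(attrs).get("type") or "").strip().lower()
            if script_type == "application/ld+json":
                self._in_jsonld = True
                self._buffer = []

    def handle_data(self, data):
        # Script contents may arrive in several pieces, so accumulate them
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.documents.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # Skip malformed blocks instead of failing the whole page


# Hypothetical sample markup
SAMPLE = (
    '<script type="application/ld+json">'
    '{"@type": "Article", "headline": "Hello"}'
    '</script>'
    '<script>var ignored = 1;</script>'
)

extractor = JsonLdExtractor()
extractor.feed(SAMPLE)
extractor.close()
print(extractor.documents)
```

Ordinary scripts are ignored because only the `application/ld+json` type switches the collector on.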
Real-world use cases
- Building a link preview generator that fetches Open Graph tags to show rich snippets in chats or feeds.
- Scraping competitor websites programmatically to analyze their social media meta tag strategy.
- Validating that a new web page includes required Facebook and Twitter card meta tags before deployment.
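The validation use case could be sketched as a check over the list of (property, content) pairs that the extractor returns. The function `missing_social_tags` and the `REQUIRED_TAGS` set are assumptions for illustration. The right set of required tags depends on which platforms a page targets.

```python
# Assumed required set: the four basic Open Graph properties plus twitter:card
REQUIRED_TAGS = {"og:title", "og:type", "og:image", "og:url", "twitter:card"}


def missing_social_tags(meta_tags, required=REQUIRED_TAGS):
    """Return the required tag names that are absent, in sorted order."""
    present = {prop.lower() for prop, content in meta_tags if content}
    return sorted(set(required) - present)


# Hypothetical extractor output for a page that is missing three tags
found = [("og:title", "Hello"), ("twitter:card", "summary")]
missing = missing_social_tags(found)
if missing:
    print("Missing tags: " + ", ".join(missing))
else:
    print("All required tags present.")
```

In a deployment check, a non-empty result could be turned into a failing exit code.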