Extract Hyperlinks from Word Documents in Python
Parses a .docx file using only Python's standard library and extracts the display text and target URL of each hyperlink that is stored as a document relationship.
Python code
import zipfile
from pathlib import Path
import xml.etree.ElementTree as ET


def extract_hyperlinks_from_docx(filepath: str) -> list[dict]:
    """
    Extract all hyperlinks from a .docx file.
    Returns a list of dicts with 'text' and 'target' keys.
    """
    hyperlinks = []
    with zipfile.ZipFile(Path(filepath), 'r') as docx_zip:
        # Word stores hyperlinks in relationships files and document.xml
        try:
            relationships_xml = docx_zip.read('word/_rels/document.xml.rels')
        except KeyError:
            return hyperlinks
        rels_root = ET.fromstring(relationships_xml)
        ns_rels = {'r': 'http://schemas.openxmlformats.org/package/2006/relationships'}

        # Build mapping from relationship IDs to target URLs
        id_to_target = {}
        for rel in rels_root.findall('.//r:Relationship', ns_rels):
            rid = rel.get('Id')
            target = rel.get('Target')
            if rid and target:
                id_to_target[rid] = target

        # Parse document.xml for hyperlink elements
        try:
            doc_xml = docx_zip.read('word/document.xml')
        except KeyError:
            return hyperlinks
        doc_root = ET.fromstring(doc_xml)
        ns_doc = {
            'w': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main',
            'r': 'http://schemas.openxmlformats.org/officeDocument/2006/relationships'
        }
        for hyperlink in doc_root.iter('{http://schemas.openxmlformats.org/wordprocessingml/2006/main}hyperlink'):
            rid = hyperlink.get('{http://schemas.openxmlformats.org/officeDocument/2006/relationships}id')
            target = id_to_target.get(rid, '')

            # Collect all text inside the hyperlink
            text_parts = []
            for run in hyperlink.findall('.//w:t', ns_doc):
                if run.text:
                    text_parts.append(run.text)
            text = ''.join(text_parts)

            # Hyperlinks without a resolvable relationship target are skipped
            if target:
                hyperlinks.append({'text': text, 'target': target})
    return hyperlinks


if __name__ == "__main__":
    # Example usage
    import sys
    if len(sys.argv) > 1:
        results = extract_hyperlinks_from_docx(sys.argv[1])
        for link in results:
            print(f"Text: {link['text']!r} -> URL: {link['target']}")
    else:
        print("Please provide a .docx file path as argument")
Output
Text: 'Click here' -> URL: https://example.com
Text: 'Python docs' -> URL: https://docs.python.org
How it works
A .docx file is a ZIP archive containing XML files. Hyperlinks are stored across two of them: word/_rels/document.xml.rels maps relationship IDs to target URLs, and word/document.xml contains <w:hyperlink> elements that reference those IDs through their r:id attribute. The code first reads the relationships file to build an ID-to-target dictionary, then walks the main document for hyperlink elements, joining the <w:t> text elements inside each hyperlink's runs to recover the display text. Hyperlinks whose ID does not resolve to a target, such as links to bookmarks inside the same document, are skipped. Because only standard library modules are used (zipfile, pathlib, xml.etree.ElementTree), the script runs on any Python installation without installing extra packages.
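To make the two-file lookup concrete, the sketch below builds a tiny in-memory ZIP holding only those two parts and resolves the link from it. The XML snippets and the helper names `build_two_part_archive` and `resolve_links` are illustrative assumptions; the archive is not a complete Word document (it lacks [Content_Types].xml and other parts), so Word itself would not open it.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

PKG_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"

# Hand-written stand-ins for the two parts discussed above.
RELS_XML = (
    f'<Relationships xmlns="{PKG_NS}">'
    f'<Relationship Id="rId1" Type="{R_NS}/hyperlink" '
    'Target="https://example.com" TargetMode="External"/>'
    '</Relationships>'
)
DOC_XML = (
    f'<w:document xmlns:w="{W_NS}" xmlns:r="{R_NS}">'
    '<w:body><w:p><w:hyperlink r:id="rId1">'
    '<w:r><w:t>Click </w:t></w:r><w:r><w:t>here</w:t></w:r>'
    '</w:hyperlink></w:p></w:body></w:document>'
)


def build_two_part_archive() -> bytes:
    """Write the two XML parts into an in-memory ZIP archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/_rels/document.xml.rels", RELS_XML)
        archive.writestr("word/document.xml", DOC_XML)
    return buffer.getvalue()


def resolve_links(archive_bytes: bytes) -> list[tuple[str, str]]:
    """Return (display text, target) pairs by joining both parts on the ID."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        rels_root = ET.fromstring(archive.read("word/_rels/document.xml.rels"))
        doc_root = ET.fromstring(archive.read("word/document.xml"))

    # Step 1: relationship ID -> target URL
    targets = {
        rel.get("Id"): rel.get("Target")
        for rel in rels_root.iter(f"{{{PKG_NS}}}Relationship")
    }

    # Step 2: each <w:hyperlink> points at a relationship via r:id
    links = []
    for link in doc_root.iter(f"{{{W_NS}}}hyperlink"):
        text = "".join(t.text or "" for t in link.iter(f"{{{W_NS}}}t"))
        links.append((text, targets.get(link.get(f"{{{R_NS}}}id"), "")))
    return links


if __name__ == "__main__":
    print(resolve_links(build_two_part_archive()))
```

Note that the display text is split across two runs in the sample, which is why the text elements are joined rather than read from a single node.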
Common mistakes
- Forgetting that .docx is a ZIP file — trying to read it as raw XML fails
- Ignoring the relationships file and expecting the URL inside the hyperlink element directly
- Missing namespace prefixes in XML parsing, causing `findall` to return nothing
- Assuming every hyperlink has visible text — some wrap images or contain only empty runs
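Two of these pitfalls are easy to reproduce in isolation. The following is a minimal sketch on a hand-written XML fragment (an assumption for illustration, not taken from a real document): the unqualified search finds nothing, the namespace-qualified search finds both hyperlinks, and the hyperlink that wraps a drawing yields an empty text string.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Illustrative fragment: one text hyperlink, one hyperlink wrapping a drawing.
SNIPPET = (
    f'<w:body xmlns:w="{W_NS}">'
    '<w:p>'
    '<w:hyperlink><w:r><w:t>Docs</w:t></w:r></w:hyperlink>'
    '<w:hyperlink><w:r><w:drawing/></w:r></w:hyperlink>'
    '</w:p>'
    '</w:body>'
)

root = ET.fromstring(SNIPPET)

# Without a namespace the tag name does not match, so nothing is found.
unqualified = root.findall(".//hyperlink")

# With the prefix mapped to the WordprocessingML namespace, both are found.
qualified = root.findall(".//w:hyperlink", {"w": W_NS})

# A hyperlink with no <w:t> descendants produces an empty string.
texts = [
    "".join(t.text or "" for t in link.iter(f"{{{W_NS}}}t"))
    for link in qualified
]

if __name__ == "__main__":
    print(len(unqualified), len(qualified), texts)
```

Callers that need a label for every link may want to substitute a placeholder when the text comes back empty.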
Variations
- Use `lxml` for more robust XML parsing with XPath support on very large documents
- Process multiple .docx files in a loop to batch extract hyperlinks from a folder
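The batch variation could look roughly like the sketch below. The function name `extract_from_folder` and its `extractor` parameter are hypothetical; the extractor is passed in as a callable so that the recipe's `extract_hyperlinks_from_docx` (or any function with the same signature) can be plugged in. Skipping names that start with `~$` assumes the folder may contain the temporary lock files Word creates while a document is open.

```python
import zipfile
from pathlib import Path
from typing import Callable


def extract_from_folder(
    folder: str,
    extractor: Callable[[str], list[dict]],
) -> dict[str, list[dict]]:
    """Run `extractor` on every .docx file directly inside `folder`.

    Returns a dict mapping file name to the list the extractor produced.
    """
    results: dict[str, list[dict]] = {}
    for path in sorted(Path(folder).glob("*.docx")):
        # Skip Word's temporary lock files, e.g. "~$report.docx".
        if path.name.startswith("~$"):
            continue
        try:
            results[path.name] = extractor(str(path))
        except zipfile.BadZipFile:
            # Not a readable ZIP archive (corrupt or mislabeled); leave it out.
            continue
    return results
```

For example, `extract_from_folder("contracts", extract_hyperlinks_from_docx)` would return one entry per readable document, where "contracts" is a placeholder folder name.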
Real-world use cases
- Auditing external links in legal documents before publishing or sharing with clients.
- Migrating content from legacy Word files into a web CMS by extracting embedded URLs.
- Security scanning a batch of .docx files for suspicious or broken hyperlinks in an enterprise environment.