Extract Hyperlinks from Word Documents in Python

Parses a .docx file using only Python's standard library and extracts the display text and target URL of each relationship-based hyperlink in the main document body.

Medium Python 3.9+ Jun 28, 2026 Files & data

docx hyperlinks xml zipfile file-parsing word

Python code

import zipfile
from pathlib import Path
import xml.etree.ElementTree as ET

def extract_hyperlinks_from_docx(filepath: str) -> list[dict]:
    """
    Extract hyperlinks from the main body of a .docx file.
    Returns a list of dicts with 'text' and 'target' keys.
    Links without a relationship target (e.g. internal bookmarks) are skipped.
    """
    hyperlinks = []
    with zipfile.ZipFile(Path(filepath), 'r') as docx_zip:
        # Word stores hyperlinks in relationships files and document.xml
        try:
            relationships_xml = docx_zip.read('word/_rels/document.xml.rels')
        except KeyError:
            return hyperlinks
        
        rels_root = ET.fromstring(relationships_xml)
        ns_rels = {'r': 'http://schemas.openxmlformats.org/package/2006/relationships'}
        
        # Build mapping from relationship IDs to target URLs
        id_to_target = {}
        for rel in rels_root.findall('.//r:Relationship', ns_rels):
            rid = rel.get('Id')
            target = rel.get('Target')
            if rid and target:
                id_to_target[rid] = target
        
        # Parse document.xml for hyperlink elements
        try:
            doc_xml = docx_zip.read('word/document.xml')
        except KeyError:
            return hyperlinks
        
        doc_root = ET.fromstring(doc_xml)
        ns_doc = {
            'w': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main',
            'r': 'http://schemas.openxmlformats.org/officeDocument/2006/relationships'
        }
        
        for hyperlink in doc_root.iter('{http://schemas.openxmlformats.org/wordprocessingml/2006/main}hyperlink'):
            rid = hyperlink.get('{http://schemas.openxmlformats.org/officeDocument/2006/relationships}id')
            target = id_to_target.get(rid, '')
            # Collect the text of every <w:t> element inside the hyperlink
            text_parts = []
            for text_node in hyperlink.findall('.//w:t', ns_doc):
                if text_node.text:
                    text_parts.append(text_node.text)
            text = ''.join(text_parts)
            # Skip links with no resolved target, e.g. internal bookmark anchors
            if target:
                hyperlinks.append({'text': text, 'target': target})
    
    return hyperlinks

if __name__ == "__main__":
    # Example usage
    import sys
    if len(sys.argv) > 1:
        results = extract_hyperlinks_from_docx(sys.argv[1])
        for link in results:
            print(f"Text: {link['text']!r} -> URL: {link['target']}")
    else:
        print("Please provide a .docx file path as argument")

Output

stdout

Text: 'Click here' -> URL: https://example.com
Text: 'Python docs' -> URL: https://docs.python.org

How it works

A .docx file is a ZIP archive of XML parts. Hyperlink data is split across two of them: word/_rels/document.xml.rels maps relationship IDs to target URLs, and word/document.xml contains <w:hyperlink> elements that reference those IDs through their r:id attribute. The code first reads the relationships file to build an ID-to-target dictionary, then walks the main document for hyperlink elements and joins the text of every <w:t> element nested in their runs to form the display text. Hyperlinks without a resolvable target, such as internal bookmark links, are skipped. Because the script uses only standard library modules (zipfile, xml.etree.ElementTree), it needs no third-party packages and runs on any Python 3.9+ installation.
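To make the two-part layout concrete, the following sketch builds a stripped-down archive in memory and resolves its single link. The archive is an assumption for illustration only: it contains just the two parts discussed above and omits everything else a real .docx needs (such as [Content_Types].xml), so Word would not open it. The helper names `build_minimal_archive` and `resolve_links` are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG = "http://schemas.openxmlformats.org/package/2006/relationships"

# Relationships part: maps the ID "rId1" to the actual URL.
RELS_XML = f"""<?xml version="1.0" encoding="UTF-8"?>
<Relationships xmlns="{PKG}">
  <Relationship Id="rId1" Type="{R}/hyperlink"
                Target="https://example.com" TargetMode="External"/>
</Relationships>"""

# Document part: the hyperlink only carries the ID, never the URL itself.
DOC_XML = f"""<?xml version="1.0" encoding="UTF-8"?>
<w:document xmlns:w="{W}" xmlns:r="{R}">
  <w:body><w:p>
    <w:hyperlink r:id="rId1">
      <w:r><w:t>Click </w:t></w:r>
      <w:r><w:t>here</w:t></w:r>
    </w:hyperlink>
  </w:p></w:body>
</w:document>"""


def build_minimal_archive() -> bytes:
    """Hypothetical helper: zip the two parts into an in-memory archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/_rels/document.xml.rels", RELS_XML)
        zf.writestr("word/document.xml", DOC_XML)
    return buf.getvalue()


def resolve_links(archive: bytes) -> list[tuple[str, str]]:
    """Join each hyperlink's text and look up its target by relationship ID."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        rels = ET.fromstring(zf.read("word/_rels/document.xml.rels"))
        doc = ET.fromstring(zf.read("word/document.xml"))
    targets = {
        rel.get("Id"): rel.get("Target")
        for rel in rels.iter(f"{{{PKG}}}Relationship")
    }
    links = []
    for link in doc.iter(f"{{{W}}}hyperlink"):
        text = "".join(t.text or "" for t in link.iter(f"{{{W}}}t"))
        links.append((text, targets.get(link.get(f"{{{R}}}id"), "")))
    return links


print(resolve_links(build_minimal_archive()))
# [('Click here', 'https://example.com')]
```

The display text is split across two runs here on purpose: Word often breaks a link's text into several runs, which is why the text of all <w:t> elements is joined rather than reading only the first one.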

Common mistakes

Forgetting that .docx is a ZIP archive: parsing the file directly as XML fails
Ignoring the relationships file and expecting the URL inside the hyperlink element directly
Omitting the XML namespace when searching, which makes `findall` return an empty list
Assuming every hyperlink has visible text: some wrap images or empty runs
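The namespace and empty-text mistakes are easy to reproduce on small fragments. The sketch below uses two hand-written XML snippets (assumptions for illustration, not taken from a real document) to compare a search without a namespace against the two forms that work, and to show that an image-only link yields an empty string.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Illustrative fragment: one paragraph containing one hyperlink.
SAMPLE = (
    f'<w:p xmlns:w="{W}">'
    "<w:hyperlink><w:r><w:t>Docs</w:t></w:r></w:hyperlink>"
    "</w:p>"
)
root = ET.fromstring(SAMPLE)

# Mistake: a bare tag name never matches a namespaced element.
no_namespace = root.findall(".//hyperlink")

# Working form 1: a prefix plus an explicit prefix-to-URI mapping.
with_prefix_map = root.findall(".//w:hyperlink", {"w": W})

# Working form 2: the fully qualified {uri}tag notation.
qualified = root.findall(f".//{{{W}}}hyperlink")

print(len(no_namespace), len(with_prefix_map), len(qualified))
# 0 1 1

# A hyperlink that wraps only a drawing has no <w:t> descendants,
# so its joined display text is the empty string.
IMAGE_ONLY = f'<w:hyperlink xmlns:w="{W}"><w:r><w:drawing/></w:r></w:hyperlink>'
image_link = ET.fromstring(IMAGE_ONLY)
image_text = "".join(t.text or "" for t in image_link.iter(f"{{{W}}}t"))
print(repr(image_text))
# ''
```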

Variations

Use `lxml` (a third-party package) for full XPath 1.0 support and faster parsing of very large documents
Process multiple .docx files in a loop to batch extract hyperlinks from a folder
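The batch variation could look roughly like the sketch below. To stay self-contained it takes the extraction function as a parameter instead of importing it; `batch_extract` and the `Extractor` alias are hypothetical names, and returning an empty list for unreadable files is one possible policy rather than the only sensible one.

```python
import zipfile
from pathlib import Path
from typing import Callable

# Any function that maps a file path to a list of link dicts.
Extractor = Callable[[str], list[dict]]


def batch_extract(folder: str, extractor: Extractor) -> dict[str, list[dict]]:
    """Run `extractor` on every .docx directly inside `folder`."""
    results: dict[str, list[dict]] = {}
    for path in sorted(Path(folder).glob("*.docx")):
        # Word creates "~$name.docx" owner files while a document is open;
        # they are not real documents, so skip them.
        if path.name.startswith("~$"):
            continue
        try:
            results[path.name] = extractor(str(path))
        except zipfile.BadZipFile:
            # Corrupt or mislabeled file: record it with no links.
            results[path.name] = []
    return results


# Usage, assuming the extract_hyperlinks_from_docx function from the
# listing above and a hypothetical "contracts" folder:
#   all_links = batch_extract("contracts", extract_hyperlinks_from_docx)
```

Use `Path(folder).rglob("*.docx")` instead of `glob` if subfolders should be included as well.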

Real-world use cases

Auditing external links in legal documents before publishing or sharing with clients.
Migrating content from legacy Word files into a web CMS by extracting embedded URLs.
Security scanning a batch of .docx files for suspicious or broken hyperlinks in an enterprise environment.
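For auditing or scanning, a useful first step is often to group the extracted links by host. The sketch below assumes link dicts in the 'text'/'target' shape returned by the listing above; `summarize_hosts` is a hypothetical helper, and the sample links are made up for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def summarize_hosts(links: list[dict]) -> Counter:
    """Count links per host; non-web targets are grouped by scheme."""
    hosts: Counter = Counter()
    for link in links:
        parts = urlsplit(link["target"])
        if parts.scheme in ("http", "https") and parts.hostname:
            hosts[parts.hostname] += 1
        else:
            # e.g. mailto: links, or relative targets with no scheme
            hosts[f"({parts.scheme or 'relative'})"] += 1
    return hosts


sample_links = [
    {"text": "Home", "target": "https://example.com/"},
    {"text": "Docs", "target": "https://example.com/docs"},
    {"text": "Mail us", "target": "mailto:info@example.com"},
]
print(summarize_hosts(sample_links))
# Counter({'example.com': 2, '(mailto)': 1})
```

A reviewer can then compare the resulting hosts against an allow-list or check each one for reachability, depending on the audit's goal.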
