How to Detect PII in Documents Using Python

Use regex patterns to automatically detect emails, phone numbers, SSNs, and credit card numbers in text documents.

Easy | Python 3.9+ | Jun 28, 2026 | Strings & text

Python code

import re
from typing import List, Dict

def detect_pii(text: str) -> Dict[str, List[str]]:
    patterns = {
        "email": r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}",
        "phone": r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}",  # US-style, no word boundaries
        "ssn": r"\b\d{3}-\d{2}-\d{4}\b",
        "credit_card": r"\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b"  # 16-digit cards only
    }
    # Collect matches per label; labels with no matches are left out of the result
    found: Dict[str, List[str]] = {}
    for label, pattern in patterns.items():
        matches = re.findall(pattern, text)
        if matches:
            found[label] = matches
    return found

if __name__ == "__main__":
    sample = """
    John's email is john.doe@example.com and his SSN is 123-45-6789.
    Call him at (555) 123-4567 or 555-987-6543. 
    Credit card: 4111 1111 1111 1111.
    """
    result = detect_pii(sample)
    for pii_type, values in result.items():
        print(f"{pii_type}: {values}")

Output

email: ['john.doe@example.com']
phone: ['(555) 123-4567', '555-987-6543']
ssn: ['123-45-6789']
credit_card: ['4111 1111 1111 1111']

How it works

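The function maps each PII label to a regex pattern and calls re.findall with each one. Because none of the patterns contain capturing groups, re.findall returns every non-overlapping match as a full string. The results are stored in a dictionary keyed by label. The SSN and credit card patterns are wrapped in word boundaries (\b) so they do not match inside longer runs of digits; the email and phone patterns are not. Labels with no matches are left out of the result, which keeps the output clean.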
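
If you also need to know where each match sits in the text, one possible variation is to precompile the patterns and use finditer instead of findall. This is only a sketch: the names PII_PATTERNS and find_pii_spans are hypothetical, and only two of the patterns are included to keep it short.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical module-level table: each pattern is compiled once at import time.
PII_PATTERNS: Dict[str, re.Pattern] = {
    "email": re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def find_pii_spans(text: str) -> Dict[str, List[Tuple[int, int, str]]]:
    """Return (start, end, matched_text) tuples for every label that matched."""
    found: Dict[str, List[Tuple[int, int, str]]] = {}
    for label, pattern in PII_PATTERNS.items():
        # finditer yields match objects, which carry positions as well as text
        spans = [(m.start(), m.end(), m.group()) for m in pattern.finditer(text)]
        if spans:
            found[label] = spans
    return found

if __name__ == "__main__":
    sample = "Mail a@b.co, SSN 123-45-6789."
    for label, spans in find_pii_spans(sample).items():
        print(label, spans)
```

The start and end offsets can be sliced straight out of the original string, which is handy when highlighting or replacing matches later.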

Common mistakes

  • Forgetting word boundaries around SSN and credit card patterns, causing partial matches within longer numbers.
  • Assuming all phone numbers follow the same format, which misses international numbers and alternative delimiters.
  • Not accounting for variations in credit card spacing (dash vs space vs no separator).
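
The three mistakes above can be reproduced with a few one-liners. This is a small illustration, not a test suite: SSN_LOOSE is a hypothetical boundary-free variant added for comparison, and the sample strings are made up.

```python
import re

# Patterns copied from the snippet, plus a boundary-free SSN variant for comparison.
SSN_STRICT = r"\b\d{3}-\d{2}-\d{4}\b"
SSN_LOOSE = r"\d{3}-\d{2}-\d{4}"  # hypothetical "mistake" version without \b
PHONE = r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}"
CREDIT_CARD = r"\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b"

# 1. Word boundaries: a made-up order number that merely contains an SSN-like run.
order_id = "Order 9123-45-67890"
loose_hits = re.findall(SSN_LOOSE, order_id)    # ['123-45-6789'], a false positive
strict_hits = re.findall(SSN_STRICT, order_id)  # []

# 2. Phone formats: the US-style pattern misses an international number, and
#    because it has no boundaries it matches the first 10 digits of a card number.
intl_hits = re.findall(PHONE, "+44 20 7946 0958")      # []
card_as_phone = re.findall(PHONE, "4111111111111111")  # ['4111111111']

# 3. Card separators: space, dash, and no separator are all accepted.
card_variants = ["4111 1111 1111 1111", "4111-1111-1111-1111", "4111111111111111"]
card_hits = [bool(re.fullmatch(CREDIT_CARD, c)) for c in card_variants]

if __name__ == "__main__":
    print(loose_hits, strict_hits)
    print(intl_hits, card_as_phone)
    print(card_hits)
```

The second case shows that the phone pattern in the snippet would benefit from its own boundary checks if the input can contain long digit runs.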

Variations

  1. Use a dedicated validation library, such as `validators` for emails or `phonenumbers` for phone numbers, for more robust checks than regex alone.
  2. Extend patterns to include IP addresses, passport numbers, or user-defined custom types.
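
As a rough sketch of the second variation, the function could accept extra patterns from the caller and merge them with the built-in ones. Everything named here (BASE_PATTERNS, detect_pii_extended, the employee_id format) is hypothetical, and the IPv4 pattern is deliberately simple: it does not check that each octet is in the range 0 to 255.

```python
import re
from typing import Dict, List, Optional

# Subset of the built-in patterns from the snippet, kept short for the example.
BASE_PATTERNS: Dict[str, str] = {
    "email": r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}",
    "ssn": r"\b\d{3}-\d{2}-\d{4}\b",
}

def detect_pii_extended(
    text: str, extra_patterns: Optional[Dict[str, str]] = None
) -> Dict[str, List[str]]:
    """Detect PII using the base patterns plus any caller-supplied ones."""
    # Caller-supplied labels override base labels with the same name.
    patterns = {**BASE_PATTERNS, **(extra_patterns or {})}
    found: Dict[str, List[str]] = {}
    for label, pattern in patterns.items():
        # group(0) always gives the full match, even if a custom pattern
        # contains capturing groups (re.findall would return the groups instead).
        matches = [m.group(0) for m in re.finditer(pattern, text)]
        if matches:
            found[label] = matches
    return found

if __name__ == "__main__":
    extras = {
        "ipv4": r"\b(?:\d{1,3}\.){3}\d{1,3}\b",  # simplified, no octet range check
        "employee_id": r"\bEMP-\d{5}\b",          # made-up custom format
    }
    sample = "Host 192.168.0.1 belongs to EMP-00421 (jane@corp.example)"
    print(detect_pii_extended(sample, extras))
```

Using finditer with group(0) rather than findall is a defensive choice here, since user-defined patterns may include groups that would otherwise change the shape of the results.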

Real-world use cases

  • Scanning uploaded documents in a web app to flag or redact sensitive data before storage.
  • Preprocessing customer support chats to mask PII before logging for analysis.
  • Automating compliance checks in email or file archives for regulatory requirements like GDPR or HIPAA.
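
For the redaction and masking use cases, the same patterns can be passed to re.sub instead of re.findall. The following is a minimal sketch under a few assumptions: the names REDACTION_PATTERNS and redact_pii are hypothetical, the placeholder format is arbitrary, and the patterns are applied in a fixed order so that the boundary-free phone pattern runs last and cannot consume part of a card number.

```python
import re
from typing import Dict

# Same patterns as the snippet; dict order is the order of application.
REDACTION_PATTERNS: Dict[str, str] = {
    "email": r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}",
    "credit_card": r"\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b",
    "ssn": r"\b\d{3}-\d{2}-\d{4}\b",
    "phone": r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}",  # last: it has no word boundaries
}

def redact_pii(text: str) -> str:
    """Replace each detected value with a placeholder such as [EMAIL]."""
    for label, pattern in REDACTION_PATTERNS.items():
        # Placeholders contain no digits or '@', so later patterns skip them.
        text = re.sub(pattern, f"[{label.upper()}]", text)
    return text

if __name__ == "__main__":
    message = (
        "Reach john.doe@example.com or 555-987-6543; "
        "SSN 123-45-6789, card 4111 1111 1111 1111."
    )
    print(redact_pii(message))
```

Regex-based masking like this will miss anything the patterns do not cover, so for compliance work it is best treated as a first pass rather than a guarantee.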
