How to Detect PII in Documents Using Python
Use regex patterns to automatically detect emails, phone numbers, SSNs, and credit card numbers in text documents.
Python code
import re
from typing import List, Dict


def detect_pii(text: str) -> Dict[str, List[str]]:
    """Return every regex match found in text, grouped by PII type."""
    patterns = {
        "email": r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}",
        # US-style 10-digit numbers with optional parentheses and separators
        "phone": r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}",
        "ssn": r"\b\d{3}-\d{2}-\d{4}\b",
        # 16 digits in groups of four, separated by dashes, spaces, or nothing
        "credit_card": r"\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b"
    }
    found = {}
    for label, pattern in patterns.items():
        matches = re.findall(pattern, text)
        # Only record PII types that actually matched
        if matches:
            found[label] = matches
    return found


if __name__ == "__main__":
    sample = """
    John's email is john.doe@example.com and his SSN is 123-45-6789.
    Call him at (555) 123-4567 or 555-987-6543.
    Credit card: 4111 1111 1111 1111.
    """
    result = detect_pii(sample)
    for pii_type, values in result.items():
        print(f"{pii_type}: {values}")
Output
email: ['john.doe@example.com']
phone: ['(555) 123-4567', '555-987-6543']
ssn: ['123-45-6789']
credit_card: ['4111 1111 1111 1111']
How it works
The function loops over a dictionary of regex patterns, one per PII type, and calls re.findall on each, which returns every non-overlapping match as a list of strings. The results are stored in a dictionary keyed by the label. The SSN and credit card patterns are wrapped in word boundaries (\b) so they do not match inside longer digit runs; the email and phone patterns are not anchored this way. The function only returns types that found matches, keeping the output clean.
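As a variation on the loop described above, the following sketch precompiles each pattern with re.compile and uses finditer so that every match also carries its start and end offsets, which helps if matches need to be highlighted or redacted later. The names COMPILED_PATTERNS and detect_pii_with_spans are illustrative assumptions and not part of the original function; only two of the four patterns are shown to keep it short.

```python
import re
from typing import Dict, List, Tuple

# Compiled once at import time. These names are illustrative, not from the original code.
COMPILED_PATTERNS = {
    "email": re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def detect_pii_with_spans(text: str) -> Dict[str, List[Tuple[str, int, int]]]:
    """Return (matched_text, start, end) tuples grouped by PII type."""
    found = {}
    for label, pattern in COMPILED_PATTERNS.items():
        # finditer yields match objects, so offsets are available alongside the text
        hits = [(m.group(), m.start(), m.end()) for m in pattern.finditer(text)]
        if hits:
            found[label] = hits
    return found


sample_text = "Mail a@b.io, SSN 123-45-6789."
spans = detect_pii_with_spans(sample_text)
for label, hits in spans.items():
    for value, start, end in hits:
        print(f"{label}: {value!r} at [{start}:{end}]")
```

Slicing the input with the reported offsets (`sample_text[start:end]`) gives back the matched string, so the offsets can be passed straight to a redaction or highlighting step.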
Common mistakes
- Forgetting word boundaries around SSN and credit card patterns, causing partial matches within longer numbers.
- Assuming all phone numbers follow the same format, which misses international numbers and alternative delimiters.
- Not accounting for variations in credit card spacing (dash vs space vs no separator).
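The mistakes above can be reproduced with a few short checks. This is a minimal sketch: the sample strings are made up, and the digit lookarounds on the phone pattern are one possible guard that this example assumes, not something the original function does.

```python
import re

# 1. Missing word boundaries: the SSN pattern matches inside a longer number.
long_number = "Order 9123-45-67890 shipped"
loose_ssn = re.findall(r"\d{3}-\d{2}-\d{4}", long_number)
strict_ssn = re.findall(r"\b\d{3}-\d{2}-\d{4}\b", long_number)
print("loose SSN: ", loose_ssn)    # partial match inside the order number
print("strict SSN:", strict_ssn)   # no match

# 2. An unanchored phone pattern picks up the first ten digits of a card number
#    that has no separators.
card_no_sep = "4111111111111111"
phone = r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}"
guarded_phone = r"(?<!\d)" + phone + r"(?!\d)"  # assumption: reject digits on either side
unguarded_hits = re.findall(phone, card_no_sep)
guarded_hits = re.findall(guarded_phone, card_no_sep)
print("unguarded phone:", unguarded_hits)
print("guarded phone:  ", guarded_hits)

# 3. Credit card spacing: optional separators cover dash, space, and no separator.
card = r"\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b"
variants = ["4111-1111-1111-1111", "4111 1111 1111 1111", "4111111111111111"]
card_hits = [bool(re.fullmatch(card, v)) for v in variants]
print("card variants matched:", card_hits)
```

The guarded phone pattern still matches ordinary inputs such as "(555) 123-4567", because the lookarounds only reject a digit immediately before or after the number.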
Variations
- Use the `validators` library for more robust email validation, and a dedicated library such as `phonenumbers` for phone numbers.
- Extend patterns to include IP addresses, passport numbers, or user-defined custom types.
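One way to extend the pattern set is sketched below. It adds an IPv4 pattern and lets the caller pass extra patterns by label. The function name detect_pii_extended, the extra_patterns parameter, and the EMP-12345 style employee ID are hypothetical and used only for illustration. Because the IPv4 regex is deliberately loose, candidates are checked with the standard library ipaddress module.

```python
import ipaddress
import re
from typing import Dict, List, Optional

BASE_PATTERNS = {
    "email": r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}",
    # Loose on purpose: also matches out-of-range octets such as 999.1.1.1
    "ipv4": r"\b(?:\d{1,3}\.){3}\d{1,3}\b",
}


def is_valid_ipv4(candidate: str) -> bool:
    """Check a regex candidate with the standard library parser."""
    try:
        ipaddress.IPv4Address(candidate)
        return True
    except ValueError:
        return False


def detect_pii_extended(
    text: str, extra_patterns: Optional[Dict[str, str]] = None
) -> Dict[str, List[str]]:
    """Like detect_pii, but with an IPv4 pattern and caller-supplied patterns."""
    # Custom patterns should avoid capturing groups; re.findall would then
    # return the group contents instead of the full match.
    patterns = {**BASE_PATTERNS, **(extra_patterns or {})}
    found = {}
    for label, pattern in patterns.items():
        matches = re.findall(pattern, text)
        if label == "ipv4":
            matches = [m for m in matches if is_valid_ipv4(m)]
        if matches:
            found[label] = matches
    return found


log_line = "Login from 192.168.1.10 by EMP-00421; bogus 999.1.1.1"
custom = {"employee_id": r"\bEMP-\d{5}\b"}  # hypothetical custom type
extended = detect_pii_extended(log_line, custom)
print(extended)
```

Keeping the regex loose and validating afterwards keeps the pattern readable; the same two-step approach could be applied to other types where a regex alone is too permissive.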
Real-world use cases
- Scanning uploaded documents in a web app to flag or redact sensitive data before storage.
- Preprocessing customer support chats to mask PII before logging for analysis.
- Automating compliance checks in email or file archives for regulatory requirements like GDPR or HIPAA.
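For the flag-or-redact and masking use cases, a minimal sketch using re.sub with the same four patterns might look like this. The redact_pii name and the bracketed placeholder format are assumptions made for this example. The patterns are applied in a fixed order so that credit card numbers are masked before the looser phone pattern runs.

```python
import re

# Ordered so the credit card pattern runs before the looser phone pattern.
REDACTION_PATTERNS = [
    ("EMAIL", r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"),
    ("CREDIT_CARD", r"\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b"),
    ("SSN", r"\b\d{3}-\d{2}-\d{4}\b"),
    ("PHONE", r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}"),
]


def redact_pii(text: str) -> str:
    """Replace each match with a bracketed placeholder naming its type."""
    for label, pattern in REDACTION_PATTERNS:
        text = re.sub(pattern, f"[{label}]", text)
    return text


message = (
    "Email john.doe@example.com, SSN 123-45-6789, "
    "phone (555) 123-4567, card 4111 1111 1111 1111."
)
redacted = redact_pii(message)
print(redacted)
```

Regex-based masking of this kind is a first pass rather than a guarantee: anything the patterns miss, such as international phone formats, passes through unchanged.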