Find Duplicate Web Pages by Content Similarity in Python

Compute SHA-256 hashes of file contents to detect and report byte-identical HTML pages, or any other duplicate files, in a directory tree.

Medium Python 3.9+ Jun 28, 2026 Files & data

duplicate-detection hashing sha256 file-processing os-walk

Python code

import hashlib
import os
from collections import defaultdict

def get_file_hash(filepath):
    """Compute SHA-256 hash of file contents."""
    sha256 = hashlib.sha256()
    with open(filepath, 'rb') as f:
        for chunk in iter(lambda: f.read(4096), b''):
            sha256.update(chunk)
    return sha256.hexdigest()

def find_duplicate_files(directory):
    """Find duplicate files in directory based on content hash."""
    hash_map = defaultdict(list)
    
    for root, _dirs, files in os.walk(directory):
        for filename in files:
            filepath = os.path.join(root, filename)
            try:
                file_hash = get_file_hash(filepath)
                hash_map[file_hash].append(filepath)
            except OSError:
                # Skip unreadable files; IOError is an alias of OSError in Python 3
                continue
    
    duplicates = {h: paths for h, paths in hash_map.items() if len(paths) > 1}
    return duplicates

if __name__ == "__main__":
    test_dir = "test_pages"
    os.makedirs(test_dir, exist_ok=True)
    
    # Create test files with identical content
    with open(os.path.join(test_dir, "page1.html"), 'w') as f:
        f.write("<html><body>Hello World</body></html>")
    with open(os.path.join(test_dir, "page2.html"), 'w') as f:
        f.write("<html><body>Hello World</body></html>")
    with open(os.path.join(test_dir, "unique.html"), 'w') as f:
        f.write("<html><body>Different content</body></html>")
    
    duplicates = find_duplicate_files(test_dir)
    
    print(f"Found {len(duplicates)} sets of duplicate files:")
    for hash_val, paths in duplicates.items():
        print(f"\nHash: {hash_val[:16]}...")
        for path in paths:
            print(f"  {path}")
    
    # Cleanup
    import shutil
    shutil.rmtree(test_dir)

Output

Found 1 sets of duplicate files:

Hash: 1b4f0e9851971998...
  test_pages/page1.html
  test_pages/page2.html

How it works

The script uses SHA-256 via hashlib to fingerprint each file's content: byte-identical files always produce the same digest, and an accidental collision between different files is negligibly unlikely. Reading in 4096-byte chunks keeps memory use constant, so large files are never loaded whole. os.walk recursively traverses all subdirectories, and a defaultdict(list) groups file paths by digest. Finally, only groups containing more than one path are reported as duplicates.
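To make the chunked-reading point concrete, here is a minimal sketch showing that feeding data to SHA-256 piece by piece gives the same digest as hashing it in one call. The `chunked_sha256` name and the in-memory payload are illustrative assumptions, not part of the original script.

```python
import hashlib
import io

def chunked_sha256(stream, chunk_size=4096):
    """Hash a binary stream chunk by chunk, mirroring get_file_hash."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()

# Illustrative payload of 11,000 bytes, so it spans several 4096-byte chunks.
payload = b"<p>row</p>\n" * 1000

one_shot = hashlib.sha256(payload).hexdigest()
streamed = chunked_sha256(io.BytesIO(payload))

print(one_shot == streamed)  # True: chunking changes memory use, not the digest
```

Because the digest does not depend on how the input is split, the chunk size can be tuned for I/O throughput without affecting which files are reported as duplicates.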

Common mistakes

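Forgetting to open files in binary mode ('rb') when hashing. In text mode, read() returns str, which hashlib's update() rejects with a TypeError, and decoding non-text files can raise UnicodeDecodeError.
Assuming pages that look identical will share a hash. SHA-256 matches only byte-identical files, so a difference in whitespace, line endings, or embedded metadata yields a different digest.
Not catching OSError (IOError is an alias for it in Python 3), which lets a single unreadable or permission-denied file crash the whole scan.
Reading each file into memory in full, or comparing every pair of files directly, instead of streaming each file through a hash in chunks.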
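The point about byte-level differences can be illustrated with a short sketch. Two pages that would render the same hash differently when their whitespace differs; collapsing whitespace before hashing is one possible workaround. The `normalized_hash` helper is hypothetical and deliberately crude: it only collapses runs of whitespace and does not parse HTML.

```python
import hashlib
import re

def raw_hash(data: bytes) -> str:
    """SHA-256 of the bytes exactly as given."""
    return hashlib.sha256(data).hexdigest()

def normalized_hash(data: bytes) -> str:
    """Hypothetical helper: collapse whitespace runs, then hash."""
    collapsed = re.sub(rb"\s+", b" ", data).strip()
    return hashlib.sha256(collapsed).hexdigest()

# Illustrative pages that differ only in spacing and a trailing newline.
page_a = b"<html><body>Hello World</body></html>"
page_b = b"<html><body>Hello   World</body></html>\n"

print(raw_hash(page_a) == raw_hash(page_b))                # False
print(normalized_hash(page_a) == normalized_hash(page_b))  # True
```

Normalizing trades exactness for recall: it may also merge files whose whitespace differences are meaningful (inside a pre element, for example), so it suits near-duplicate reporting better than automatic deletion.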

Variations

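Use `filecmp.cmp(a, b, shallow=False)` for a byte-by-byte comparison on a smaller set of files instead of hashing all files. The default `shallow=True` treats files as equal when their `os.stat()` signatures (type, size, modification time) match, without reading their contents.
Replace SHA-256 with MD5 for faster hashing when collision resistance is not critical, for example when nobody has an incentive to craft colliding files.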
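Both variations can be sketched together. The example below assumes a hypothetical `file_digest_hex` helper that takes the algorithm name as a parameter (via `hashlib.new`), and then confirms a hash match with `filecmp.cmp(..., shallow=False)`. The file names and contents are made up for the demonstration.

```python
import filecmp
import hashlib
import os
import tempfile

def file_digest_hex(path, algorithm="sha256", chunk_size=65536):
    """Hypothetical helper: stream a file through any hashlib algorithm."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    # Illustrative files: two identical, one different.
    samples = [("a.html", b"same"), ("b.html", b"same"), ("c.html", b"other")]
    paths = {}
    for name, body in samples:
        paths[name] = os.path.join(tmp, name)
        with open(paths[name], "wb") as f:
            f.write(body)

    # MD5 as a faster fingerprint, acceptable only for non-adversarial input.
    md5_match = (file_digest_hex(paths["a.html"], "md5")
                 == file_digest_hex(paths["b.html"], "md5"))

    # shallow=False forces a comparison of the actual file contents.
    bytes_match = filecmp.cmp(paths["a.html"], paths["b.html"], shallow=False)
    bytes_differ = not filecmp.cmp(paths["a.html"], paths["c.html"], shallow=False)

print(md5_match, bytes_match, bytes_differ)  # True True True
```

Using a fast hash to shortlist candidates and a byte-by-byte comparison to confirm them is one way to get speed without relying on the weaker hash alone.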

Real-world use cases

Clean up a website backup by removing identical HTML files that were accidentally duplicated during archiving.
Identify duplicate user-uploaded images or documents in a storage bucket to reduce costs.
Detect copied web pages in a crawled dataset before feeding them into a search index or AI model.
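For the backup clean-up case, a cautious first step is to print a plan before deleting anything. The sketch below assumes input in the shape that find_duplicate_files returns (a dict mapping each hash to a list of paths); the `plan_cleanup` helper, the "keep the alphabetically first path" rule, and the sample paths are all illustrative assumptions.

```python
def plan_cleanup(duplicates):
    """Hypothetical helper: per duplicate group, pick one path to keep."""
    plan = []
    for paths in duplicates.values():
        keep, *extras = sorted(paths)  # keep the alphabetically first path
        plan.append((keep, extras))
    return plan

# Illustrative input in the {hash: [paths]} shape; the hash key is a placeholder.
sample = {
    "abc123": [
        "backup/page2.html",
        "backup/page1.html",
        "backup/old/page1.html",
    ],
}

for keep, extras in plan_cleanup(sample):
    print(f"keep   {keep}")
    for path in extras:
        print(f"remove {path}  (dry run, nothing deleted)")
```

Reviewing the printed plan before calling os.remove on the extra paths guards against losing the only copy that other pages actually link to.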
