Find Duplicate Web Pages by Content Similarity in Python

Compute SHA-256 hashes of file contents to detect and report byte-for-byte duplicate HTML pages, or any other files, in a directory.

Medium · Python 3.9+ · Jun 28, 2026 · Files & data

Python code

import hashlib
import os
from collections import defaultdict

def get_file_hash(filepath):
    """Compute SHA-256 hash of file contents."""
    sha256 = hashlib.sha256()
    with open(filepath, 'rb') as f:
        for chunk in iter(lambda: f.read(4096), b''):
            sha256.update(chunk)
    return sha256.hexdigest()

def find_duplicate_files(directory):
    """Find duplicate files in directory based on content hash."""
    hash_map = defaultdict(list)
    
    for root, _dirs, files in os.walk(directory):
        for filename in files:
            filepath = os.path.join(root, filename)
            try:
                file_hash = get_file_hash(filepath)
                hash_map[file_hash].append(filepath)
            except OSError:
                # Skip unreadable files; IOError is an alias of OSError in Python 3.
                continue
    
    duplicates = {h: paths for h, paths in hash_map.items() if len(paths) > 1}
    return duplicates

if __name__ == "__main__":
    test_dir = "test_pages"
    os.makedirs(test_dir, exist_ok=True)
    
    # Create test files with identical content
    with open(os.path.join(test_dir, "page1.html"), 'w') as f:
        f.write("<html><body>Hello World</body></html>")
    with open(os.path.join(test_dir, "page2.html"), 'w') as f:
        f.write("<html><body>Hello World</body></html>")
    with open(os.path.join(test_dir, "unique.html"), 'w') as f:
        f.write("<html><body>Different content</body></html>")
    
    duplicates = find_duplicate_files(test_dir)
    
    print(f"Found {len(duplicates)} sets of duplicate files:")
    for hash_val, paths in duplicates.items():
        print(f"\nHash: {hash_val[:16]}...")
        for path in paths:
            print(f"  {path}")
    
    # Cleanup
    import shutil
    shutil.rmtree(test_dir)

Output

stdout
Found 1 sets of duplicate files:

Hash: <first 16 hex digits of the shared SHA-256 digest>...
  test_pages/page1.html
  test_pages/page2.html

How it works

The script uses SHA-256 hashing via hashlib to fingerprint each file's content. Two files with the same digest can be treated as byte-identical, since an accidental SHA-256 collision is astronomically unlikely. Reading files in 4096-byte chunks keeps memory use constant even for large files, because no file is ever loaded whole. os.walk recursively traverses the directory and all of its subdirectories, and a defaultdict(list) groups file paths by their hash. Finally, only groups containing more than one path are reported as duplicates.
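
As a small check of the chunked-reading idea, the sketch below hashes the same bytes twice: once streamed from disk in 4096-byte chunks and once in a single call. The helper name hash_in_chunks, the file name sample.bin, and the 10,000-byte payload are illustrative assumptions, not part of the original script.

```python
import hashlib
import os
import tempfile

def hash_in_chunks(filepath, chunk_size=4096):
    """Stream a file through SHA-256 one chunk at a time (hypothetical helper)."""
    sha256 = hashlib.sha256()
    with open(filepath, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            sha256.update(chunk)
    return sha256.hexdigest()

# Arbitrary payload chosen to span several 4096-byte chunks.
payload = b"x" * 10_000

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.bin")
    with open(path, "wb") as f:
        f.write(payload)
    streamed = hash_in_chunks(path)

# Hashing everything at once gives the same digest as feeding it in pieces.
one_shot = hashlib.sha256(payload).hexdigest()
print(streamed == one_shot)  # True
```

Feeding update() in pieces is equivalent to hashing the concatenated bytes, so the chunk size affects only memory use and speed, never the result.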

Common mistakes

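  • Forgetting to open files in binary mode ('rb') when hashing. In text mode, read() returns str, which hashlib rejects with a TypeError, and decoding can fail outright on non-text files.
  • Assuming that pages which look identical will share a hash. SHA-256 compares bytes, so differences in whitespace, line endings, or embedded metadata produce different hashes even when the pages render the same.
  • Not catching OSError (IOError is an alias for it in Python 3), which lets a single permission-denied or unreadable file crash the whole scan.
  • Reading each file fully into memory, or comparing full contents pairwise, instead of streaming it through the hash in chunks. This wastes memory on large files.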
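
To illustrate the byte-level point above, here is a minimal sketch that hashes HTML after normalizing whitespace, so that pages differing only in indentation or line breaks land in the same group. The helper normalized_hash and its two normalization rules are assumptions made for illustration; stripping whitespace between tags can change how inline elements render, so treat this as a rough heuristic rather than a true similarity measure.

```python
import hashlib
import re

def normalized_hash(html_text):
    """Hash HTML text after whitespace normalization (hypothetical helper)."""
    # Drop whitespace that sits only between two tags, e.g. "</p>\n  <p>".
    no_gaps = re.sub(r">\s+<", "><", html_text)
    # Collapse any remaining whitespace runs to a single space and trim the ends.
    collapsed = re.sub(r"\s+", " ", no_gaps).strip()
    return hashlib.sha256(collapsed.encode("utf-8")).hexdigest()

page_a = "<html><body>Hello World</body></html>"
page_b = "<html>\n  <body>Hello   World</body>\n</html>\n"

# The raw bytes differ, so the plain hashes differ.
raw_match = (hashlib.sha256(page_a.encode("utf-8")).hexdigest()
             == hashlib.sha256(page_b.encode("utf-8")).hexdigest())

# After normalization both pages reduce to the same string.
normalized_match = normalized_hash(page_a) == normalized_hash(page_b)

print(raw_match, normalized_match)  # False True
```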

Variations

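  1. Use `filecmp.cmp` with `shallow=False` for a byte-by-byte comparison on a smaller set of files instead of hashing all files. With the default `shallow=True`, files whose `os.stat` signatures match are reported equal without their contents being read.
  2. Replace SHA-256 with MD5 for potentially faster hashing when collision resistance is not critical, that is, when nobody could have crafted the files to collide deliberately.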
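
The first variation could be sketched as follows, assuming a hypothetical helper confirm_identical that checks every file in a candidate group against the first one. The file names and contents are made up for the demonstration.

```python
import filecmp
import os
import tempfile

def confirm_identical(paths):
    """Return True if every file matches the first byte for byte (hypothetical helper)."""
    first = paths[0]
    # shallow=False forces a content comparison rather than trusting os.stat() data.
    return all(filecmp.cmp(first, other, shallow=False) for other in paths[1:])

with tempfile.TemporaryDirectory() as tmp:
    # Two identical files and one with the same size but different bytes.
    contents = {
        "a.html": "<html><body>Hello World</body></html>",
        "b.html": "<html><body>Hello World</body></html>",
        "c.html": "<html><body>Hello Wurld</body></html>",
    }
    paths = {}
    for name, text in contents.items():
        paths[name] = os.path.join(tmp, name)
        with open(paths[name], "w", encoding="utf-8") as f:
            f.write(text)

    same = confirm_identical([paths["a.html"], paths["b.html"]])
    different = confirm_identical([paths["a.html"], paths["c.html"]])

print(same, different)  # True False
```

A direct comparison like this suits small groups, for example as a final confirmation on files that already share a hash; comparing every pair in a large directory grows quadratically, which is why the main script hashes instead.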

Real-world use cases

  • Clean up a website backup by removing identical HTML files that were accidentally duplicated during archiving.
  • Identify duplicate user-uploaded images or documents in a storage bucket to reduce costs.
  • Detect copied web pages in a crawled dataset before feeding them into a search index or AI model.
