Find Duplicate Web Pages by Content Similarity in Python
Compute SHA-256 hashes of file contents to detect and report duplicate HTML pages or any files in a directory.
Python code
51 linesimport hashlib
import os
from collections import defaultdict
def get_file_hash(filepath):
"""Compute SHA-256 hash of file contents."""
sha256 = hashlib.sha256()
with open(filepath, 'rb') as f:
for chunk in iter(lambda: f.read(4096), b''):
sha256.update(chunk)
return sha256.hexdigest()
def find_duplicate_files(directory):
"""Find duplicate files in directory based on content hash."""
hash_map = defaultdict(list)
for root, dirs, files in os.walk(directory):
for filename in files:
filepath = os.path.join(root, filename)
try:
file_hash = get_file_hash(filepath)
hash_map[file_hash].append(filepath)
except (IOError, OSError):
continue
duplicates = {h: paths for h, paths in hash_map.items() if len(paths) > 1}
return duplicates
if __name__ == "__main__":
test_dir = "test_pages"
os.makedirs(test_dir, exist_ok=True)
# Create test files with identical content
with open(os.path.join(test_dir, "page1.html"), 'w') as f:
f.write("<html><body>Hello World</body></html>")
with open(os.path.join(test_dir, "page2.html"), 'w') as f:
f.write("<html><body>Hello World</body></html>")
with open(os.path.join(test_dir, "unique.html"), 'w') as f:
f.write("<html><body>Different content</body></html>")
duplicates = find_duplicate_files(test_dir)
print(f"Found {len(duplicates)} sets of duplicate files:")
for hash_val, paths in duplicates.items():
print(f"\nHash: {hash_val[:16]}...")
for path in paths:
print(f" {path}")
# Cleanup
import shutil
shutil.rmtree(test_dir)
Output
Found 1 sets of duplicate files:
Hash: 1b4f0e9851971998...
test_pages/page1.html
test_pages/page2.html
How it works
The script uses SHA-256 hashing via hashlib to create a unique fingerprint of each file's content. By reading files in chunks of 4096 bytes, it handles large files efficiently without loading them entirely into memory. os.walk recursively traverses all subdirectories, and a defaultdict(list) groups file paths by their hash. Finally, only hash groups with more than one file path are flagged as duplicates.
Common mistakes
- Forgetting to read in binary mode ('rb') when hashing, which can cause encoding issues on non-text files.
- Assuming files with the same hash are visually identical, ignoring byte-level differences like whitespace or metadata.
- Not catching IOError or OSError, which may crash the script on permission-denied files.
- Comparing full file content instead of using a streaming hash for memory efficiency.
Variations
- Use `filecmp.cmp` for a byte-by-byte comparison on a smaller set of files instead of hashing all files.
- Replace SHA-256 with MD5 for faster hashing when collision resistance is not critical.
Real-world use cases
- Clean up a website backup by removing identical HTML files that were accidentally duplicated during archiving.
- Identify duplicate user-uploaded images or documents in a storage bucket to reduce costs.
- Detect copied web pages in a crawled dataset before feeding them into a search index or AI model.
Sponsored
More from Files & data
- Audit File Permissions Across a Project in Python easy
- Automatically Detect Corrupted Files Using SHA-256 Checksums in Python easy
- Automatically Highlight Data Validation Errors Inside Excel Files in Python easy
- Build a Command-Line To-Do List Application with Data Persistence in Python easy
- Build a Personal Work Hours Tracker in Python medium
- Build a Python Script That Detects and Deletes Empty Files Across Folders easy
Keep learning
Related tutorials and quizzes for this topic.