Find and Delete Duplicate Files Using Hashing in Python
Walk a directory tree, compute SHA256 hashes for every file, and delete duplicates that share the same hash.
Python code
45 linesimport hashlib
import os
from pathlib import Path
def file_hash(path, block_size=65536):
"""Return SHA256 hash of file content."""
hasher = hashlib.sha256()
with open(path, 'rb') as f:
while chunk := f.read(block_size):
hasher.update(chunk)
return hasher.hexdigest()
def find_and_delete_duplicates(directory):
"""Find and delete duplicate files in directory (recursively)."""
hashes = {}
duplicates = []
for filepath in Path(directory).rglob('*'):
if filepath.is_file():
h = file_hash(filepath)
if h in hashes:
duplicates.append(filepath)
print(f"Deleting duplicate: {filepath}")
filepath.unlink()
else:
hashes[h] = filepath
return duplicates
if __name__ == "__main__":
# Example: create test files in a temporary directory
import tempfile
import time
with tempfile.TemporaryDirectory() as tmpdir:
# Create two identical files
f1 = Path(tmpdir) / "file1.txt"
f2 = Path(tmpdir) / "file2.txt"
f1.write_text("Hello World")
time.sleep(0.1) # Ensure different timestamps
f2.write_text("Hello World")
# Create a unique file
f3 = Path(tmpdir) / "unique.txt"
f3.write_text("Unique content")
dups = find_and_delete_duplicates(tmpdir)
print(f"Found and deleted {len(dups)} duplicate(s)")
remaining = list(Path(tmpdir).rglob('*'))
print(f"Remaining files: {[p.name for p in remaining]}")
Output
Deleting duplicate: /tmp/tmpXXXXXX/file2.txt
Found and deleted 1 duplicate(s)
Remaining files: ['file1.txt', 'unique.txt']
How it works
The file_hash function reads the file in fixed-size chunks so it can handle very large files without loading them entirely into memory. SHA256 is used because it is fast and collision-resistant for this use case. By storing the first occurrence of each hash in a dictionary, any subsequent file with the same hash is identified as a duplicate and immediately deleted with unlink(). The script walks the directory recursively using pathlib.Path.rglob('*'), which returns all entries including those in subdirectories.
Common mistakes
- Using a weak or slow hash (e.g. MD5 is fast but not recommended for security, SHA512 is overkill).
- Forgetting to skip directories (calling `file_hash` on a directory raises an error).
- Deleting files without user confirmation in a production script — this example does not prompt.
- Assuming hash alone guarantees identity: two different files can rarely collide, so some use cases may require byte-for-byte comparison after a hash match.
Variations
- Use `os.walk` instead of `pathlib` for older Python versions.
- Keep a list of duplicate file paths instead of deleting immediately, then present them for manual review.
Real-world use cases
- Cleaning up downloaded media libraries (photos, music) where the same file was saved multiple times.
- Removing duplicate logs or backup files that accumulate in application data directories.
- Running as a scheduled maintenance script on shared network drives to reclaim storage space.
Sponsored
More from Automation & scripting
- Batch Rename Hundreds of Files in Python easy
- Build a Command-Line Password Generator in Python easy
- Build a Complete Web Scraper with Requests and BeautifulSoup in Python medium
- Build a Network Ping Monitor in Python medium
- Create a Local Search Engine to Instantly Find Files on Your Computer in Python medium
- Create a Simple HTTP File Server in Python easy
Keep learning
Related tutorials and quizzes for this topic.