Maintenance

Site is under maintenance — quizzes are still available.

Go to quizzes
Sponsored Reserved space — layout preview until AdSense is connected

Find and Delete Duplicate Files Using Hashing in Python

Walk a directory tree, compute SHA256 hashes for every file, and delete duplicates that share the same hash.

Medium Python 3.8+ Jun 27, 2026 Automation & scripting 1 views 0 copies

Python code

45 lines
Python 3.8+
import hashlib
import os
from pathlib import Path

def file_hash(path, block_size=65536):
    """Return SHA256 hash of file content."""
    hasher = hashlib.sha256()
    with open(path, 'rb') as f:
        while chunk := f.read(block_size):
            hasher.update(chunk)
    return hasher.hexdigest()

def find_and_delete_duplicates(directory):
    """Find and delete duplicate files in directory (recursively)."""
    hashes = {}
    duplicates = []
    for filepath in Path(directory).rglob('*'):
        if filepath.is_file():
            h = file_hash(filepath)
            if h in hashes:
                duplicates.append(filepath)
                print(f"Deleting duplicate: {filepath}")
                filepath.unlink()
            else:
                hashes[h] = filepath
    return duplicates

if __name__ == "__main__":
    # Example: create test files in a temporary directory
    import tempfile
    import time
    with tempfile.TemporaryDirectory() as tmpdir:
        # Create two identical files
        f1 = Path(tmpdir) / "file1.txt"
        f2 = Path(tmpdir) / "file2.txt"
        f1.write_text("Hello World")
        time.sleep(0.1)  # Ensure different timestamps
        f2.write_text("Hello World")
        # Create a unique file
        f3 = Path(tmpdir) / "unique.txt"
        f3.write_text("Unique content")
        dups = find_and_delete_duplicates(tmpdir)
        print(f"Found and deleted {len(dups)} duplicate(s)")
        remaining = list(Path(tmpdir).rglob('*'))
        print(f"Remaining files: {[p.name for p in remaining]}")

Output

stdout
Deleting duplicate: /tmp/tmpXXXXXX/file2.txt
Found and deleted 1 duplicate(s)
Remaining files: ['file1.txt', 'unique.txt']

How it works

The file_hash function reads the file in fixed-size chunks so it can handle very large files without loading them entirely into memory. SHA256 is used because it is fast and collision-resistant for this use case. By storing the first occurrence of each hash in a dictionary, any subsequent file with the same hash is identified as a duplicate and immediately deleted with unlink(). The script walks the directory recursively using pathlib.Path.rglob('*'), which returns all entries including those in subdirectories.

Common mistakes

  • Using a weak or slow hash (e.g. MD5 is fast but not recommended for security, SHA512 is overkill).
  • Forgetting to skip directories (calling `file_hash` on a directory raises an error).
  • Deleting files without user confirmation in a production script — this example does not prompt.
  • Assuming hash alone guarantees identity: two different files can rarely collide, so some use cases may require byte-for-byte comparison after a hash match.

Variations

  1. Use `os.walk` instead of `pathlib` for older Python versions.
  2. Keep a list of duplicate file paths instead of deleting immediately, then present them for manual review.

Real-world use cases

  • Cleaning up downloaded media libraries (photos, music) where the same file was saved multiple times.
  • Removing duplicate logs or backup files that accumulate in application data directories.
  • Running as a scheduled maintenance script on shared network drives to reclaim storage space.

Sponsored

Sponsored Reserved space — layout preview until AdSense is connected

Run this sample

Open the browser IDE to tweak the example and see results without installing anything.

Open editor

More from Automation & scripting

Related tutorials and quizzes for this topic.