Find and Delete Duplicate Files Using Hashing in Python

Walk a directory tree, compute SHA256 hashes for every file, and delete duplicates that share the same hash.

Medium Python 3.8+ Jun 27, 2026 Automation & scripting 1 views 0 copies

deduplication files hashing sha256 pathlib automation

Python code

45 lines

Python 3.8+

import hashlib
import os
from pathlib import Path

def file_hash(path, block_size=65536):
    """Return SHA256 hash of file content."""
    hasher = hashlib.sha256()
    with open(path, 'rb') as f:
        while chunk := f.read(block_size):
            hasher.update(chunk)
    return hasher.hexdigest()

def find_and_delete_duplicates(directory):
    """Find and delete duplicate files in directory (recursively)."""
    hashes = {}
    duplicates = []
    for filepath in Path(directory).rglob('*'):
        if filepath.is_file():
            h = file_hash(filepath)
            if h in hashes:
                duplicates.append(filepath)
                print(f"Deleting duplicate: {filepath}")
                filepath.unlink()
            else:
                hashes[h] = filepath
    return duplicates

if __name__ == "__main__":
    # Example: create test files in a temporary directory
    import tempfile
    import time
    with tempfile.TemporaryDirectory() as tmpdir:
        # Create two identical files
        f1 = Path(tmpdir) / "file1.txt"
        f2 = Path(tmpdir) / "file2.txt"
        f1.write_text("Hello World")
        time.sleep(0.1)  # Ensure different timestamps
        f2.write_text("Hello World")
        # Create a unique file
        f3 = Path(tmpdir) / "unique.txt"
        f3.write_text("Unique content")
        dups = find_and_delete_duplicates(tmpdir)
        print(f"Found and deleted {len(dups)} duplicate(s)")
        remaining = list(Path(tmpdir).rglob('*'))
        print(f"Remaining files: {[p.name for p in remaining]}")

Output

stdout

Deleting duplicate: /tmp/tmpXXXXXX/file2.txt
Found and deleted 1 duplicate(s)
Remaining files: ['file1.txt', 'unique.txt']

How it works

The file_hash function reads the file in fixed-size chunks so it can handle very large files without loading them entirely into memory. SHA256 is used because it is fast and collision-resistant for this use case. By storing the first occurrence of each hash in a dictionary, any subsequent file with the same hash is identified as a duplicate and immediately deleted with unlink(). The script walks the directory recursively using pathlib.Path.rglob('*'), which returns all entries including those in subdirectories.

Common mistakes

Using a weak or slow hash (e.g. MD5 is fast but not recommended for security, SHA512 is overkill).
Forgetting to skip directories (calling `file_hash` on a directory raises an error).
Deleting files without user confirmation in a production script — this example does not prompt.
Assuming hash alone guarantees identity: two different files can rarely collide, so some use cases may require byte-for-byte comparison after a hash match.

Variations

Use `os.walk` instead of `pathlib` for older Python versions.
Keep a list of duplicate file paths instead of deleting immediately, then present them for manual review.

Real-world use cases

Cleaning up downloaded media libraries (photos, music) where the same file was saved multiple times.
Removing duplicate logs or backup files that accumulate in application data directories.
Running as a scheduled maintenance script on shared network drives to reclaim storage space.

Find and Delete Duplicate Files Using Hashing in Python

Python code

Output

How it works

Common mistakes

Variations

Real-world use cases

More from Automation & scripting

Tutorials

Quizzes

Python code

Output

How it works

Common mistakes

Variations

Real-world use cases

More from Automation & scripting

Keep learning

Tutorials

Quizzes