Find Orphan Files Not Referenced Anywhere in Python
Scan a project directory for files whose names never appear in the content of other files, identifying potentially unused resources.
Python code
from pathlib import Path


def find_orphan_files(root_dir: str, extensions: set = None, ignore_patterns: list = None):
    """Find files not referenced by any other file in the project."""
    if extensions is None:
        extensions = {'.txt', '.md', '.py', '.html', '.css', '.js', '.json', '.yaml', '.yml'}
    if ignore_patterns is None:
        ignore_patterns = ['.git', '__pycache__', '.DS_Store']
    root = Path(root_dir)

    # Pass 1: collect candidate files, skipping anything under an ignored directory.
    all_files = []
    for filepath in root.rglob('*'):
        if not filepath.is_file() or filepath.suffix not in extensions:
            continue
        # Test only the parts below root, so the root's own parent directories never match.
        parts = filepath.relative_to(root).parts
        if any(part.startswith(pattern.rstrip('*')) for pattern in ignore_patterns for part in parts):
            continue
        all_files.append(filepath)

    # Pass 2: read each file once and record which *other* files it mentions by name.
    references = set()
    for filepath in all_files:
        try:
            content = filepath.read_text(encoding='utf-8', errors='ignore')
        except OSError:
            continue  # Unreadable file: skip it rather than abort the scan
        for ref_file in all_files:
            if ref_file != filepath and ref_file.name in content:
                references.add(ref_file)

    return [f for f in all_files if f not in references]


if __name__ == "__main__":
    example_dir = "."  # Current directory
    orphans = find_orphan_files(example_dir)
    print(f"Found {len(orphans)} orphan file(s):")
    for orphan in orphans:
        print(f" {orphan}")
Output
Found 2 orphan file(s):
 unused_config.old.json
 readme_backup.md
How it works
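The function walks the directory tree and collects files with the given extensions. It then reads each file's text and uses a plain substring check to see whether any other file's name appears in it. Files whose names are never found are returned as orphans. The approach is intentionally lightweight: it matches bare filenames only, so it cannot tell apart two files that share a name in different directories, a short name can match inside a longer one, and references that omit the extension (such as a Python import of utils for utils.py) go unnoticed. Only files with the listed extensions are scanned, so a reference from any other file type is missed as well. For larger projects, consider a more thorough parser that respects imports or includes.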
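To illustrate what "respecting imports" could look like for Python sources, here is a minimal sketch using the standard-library ast module. It is not part of the snippet above; the helper name imported_modules is hypothetical, and the sketch assumes you would compare the returned names with file stems (utils for utils.py) yourself. Relative imports are deliberately skipped.

```python
import ast


def imported_modules(source: str) -> set:
    """Return the top-level module names imported by a piece of Python source."""
    names = set()
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return names  # Not valid Python: report no imports
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                names.add(alias.name.split('.')[0])
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 means an absolute import; relative imports are ignored here
            names.add(node.module.split('.')[0])
    return names


print(sorted(imported_modules("import utils\nfrom helpers.io import load\n")))
# prints ['helpers', 'utils']
```

A file such as utils.py could then be treated as referenced whenever its stem shows up in this set for some other file, which the plain substring check on "utils.py" would miss.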
Common mistakes
- Comparing a file against its own contents: a file that mentions its own name would then never be reported as an orphan.
- Using a set for all_files and losing ordering, so the order of the report changes between runs.
- Forgetting to ignore directories like .git or __pycache__, which adds noise: their files are reported as orphans, and their contents can count as references that hide real ones.
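The ignore check itself is easy to get subtly wrong. A prefix test such as startswith('.git') also skips a directory named .github, and testing the absolute path means a parent directory's name can hide the whole project. The sketch below is one possible stricter check, assuming you want exact matches on path components below the root; is_ignored and its parameters are hypothetical names, not part of the function above.

```python
from pathlib import Path


def is_ignored(path: Path, root: Path, ignore_names: set) -> bool:
    """True if any component of path below root is exactly one of ignore_names."""
    relative_parts = path.relative_to(root).parts
    return any(part in ignore_names for part in relative_parts)


root = Path("project")
ignore = {".git", "__pycache__"}
print(is_ignored(root / ".git" / "config", root, ignore))            # True
print(is_ignored(root / ".github" / "workflow.yml", root, ignore))   # False
```

Whether .github should be skipped is a project decision; the point is only that an exact match makes the choice explicit.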
Variations
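- Switch to checking root-relative paths (for example ref_file.relative_to(root).as_posix()) instead of bare names, to tell apart files that share a name in different directories. This only helps when references are written relative to the project root.
- Use a more robust regex to match only complete filenames (e.g., r'\b' + re.escape(ref_file.name) + r'\b') to reduce partial matches.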
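One caveat with \b is that it treats "." as a boundary, so old.json would still match inside unused_config.old.json. A possible alternative, sketched below under the assumption that filenames consist of word characters, dots, and hyphens, uses lookarounds instead. The helper name name_pattern is hypothetical.

```python
import re


def name_pattern(filename: str) -> "re.Pattern":
    """Compile a pattern that matches filename only when it stands on its own."""
    before = r"(?<![\w.\-])"   # no filename character directly before the match
    after = r"(?!\.?[\w\-])"   # no filename character after it, even behind a dot
    return re.compile(before + re.escape(filename) + after)


pattern = name_pattern("old.json")
print(bool(pattern.search('load("old.json")')))         # True
print(bool(pattern.search("unused_config.old.json")))   # False
print(bool(pattern.search("See old.json.")))            # True: sentence-ending dot is fine
print(bool(pattern.search("old.json.bak")))             # False
```

In the main function this would replace the ref_file.name in content test with name_pattern(ref_file.name).search(content); compiling each pattern once up front avoids repeating that work for every file.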
Real-world use cases
- Clean up stale assets in a static site or documentation project.
- Prepare a pull request that removes unused configuration files before a deployment.
- Audit a legacy codebase for leftover test fixtures or data files that are no longer imported.