Build a Python Utility That Detects Duplicate Records Across Multiple Excel Sheets
A Python utility that uses pandas to find overlapping records across different Excel sheets based on specified key columns.
pip install pandas openpyxl
Python code
44 linesimport pandas as pd
from pathlib import Path
def find_duplicate_records_across_sheets(file_path: str, key_columns: list, sheet_names: list) -> dict:
"""
Detect duplicate records across multiple Excel sheets based on specified key columns.
Args:
file_path: Path to the Excel file
key_columns: List of column names to check for duplicates
sheet_names: List of sheet names to compare
Returns:
Dictionary with duplicate records per sheet pair
"""
xls = pd.ExcelFile(file_path)
duplicates = {}
for i, sheet1 in enumerate(sheet_names):
df1 = pd.read_excel(xls, sheet_name=sheet1)
for j, sheet2 in enumerate(sheet_names):
if i >= j:
continue
df2 = pd.read_excel(xls, sheet_name=sheet2)
merged = df1.merge(df2, on=key_columns, how='inner', suffixes=(f'_{sheet1}', f'_{sheet2}'))
if not merged.empty:
duplicates[f'{sheet1}_vs_{sheet2}'] = merged
return duplicates
if __name__ == "__main__":
sample_file = Path("sample_data.xlsx")
if sample_file.exists():
result = find_duplicate_records_across_sheets(
"sample_data.xlsx",
key_columns=["ID"],
sheet_names=["Sheet1", "Sheet2", "Sheet3"]
)
for pair, records in result.items():
print(f"Duplicate records in {pair}:")
print(records.to_string(index=False))
print()
else:
print("Create sample_data.xlsx with multiple sheets for testing.")
Output
Duplicate records in Sheet1_vs_Sheet2:
ID Name_Sheet1 Age_Sheet1 Name_Sheet2 Age_Sheet2
1 Alice 30 Alice 30
2 Bob 25 Bob 25
Duplicate records in Sheet1_vs_Sheet3:
ID Name_Sheet1 Age_Sheet1 Name_Sheet3 Age_Sheet3
3 Charlie 35 Charlie 35
How it works
The function loads an Excel file once with pd.ExcelFile for efficiency, then iterates over unique sheet pairs (using i >= j to avoid duplicates and self-comparison). For each pair, an inner merge on the specified key columns identifies records that exist in both sheets. The suffixes f'_{sheet1}' and f'_{sheet2}' prevent column name collisions when sheets share column names. The result is a dictionary of DataFrames, one per sheet pair, containing the full duplicate records.
Common mistakes
- Forgetting to install `openpyxl` or `xlrd` as the Excel engine
- Not excluding self-comparisons (Sheet1 vs Sheet1) leading to redundant output
- Assuming exact column name matches across sheets without handling case or whitespace
- Using a single key column when data requires composite keys for uniqueness
Variations
- Use `df1[df1.duplicated(subset=key_columns, keep=False)]` to find duplicates within a single sheet instead of across sheets
- Output duplicates to a consolidated CSV file with `pd.concat(result.values()).to_csv('duplicates.csv', index=False)`
Real-world use cases
- Reconciling monthly sales data where customer records may appear across multiple regional sheets.
- Auditing employee databases from different departments stored in separate Excel tabs.
- Merging inventory lists from multiple warehouses before running a stock redistribution report.
Sponsored
More from Data pipelines & processing
Keep learning
Related tutorials and quizzes for this topic.