Maintenance

Site is under maintenance — quizzes are still available.

Go to quizzes
Sponsored Reserved space — layout preview until AdSense is connected

Build a Python Utility That Detects Duplicate Records Across Multiple Excel Sheets

A Python utility that uses pandas to find overlapping records across different Excel sheets based on specified key columns.

Medium Python 3.9+ Jun 28, 2026 Data pipelines & processing 2 views 0 copies

Requires third-party packages — install first
pip install pandas openpyxl

Python code

44 lines
Python 3.9+
import pandas as pd
from pathlib import Path

def find_duplicate_records_across_sheets(file_path: str, key_columns: list, sheet_names: list) -> dict:
    """
    Detect duplicate records across multiple Excel sheets based on specified key columns.
    
    Args:
        file_path: Path to the Excel file
        key_columns: List of column names to check for duplicates
        sheet_names: List of sheet names to compare
    
    Returns:
        Dictionary with duplicate records per sheet pair
    """
    xls = pd.ExcelFile(file_path)
    duplicates = {}
    
    for i, sheet1 in enumerate(sheet_names):
        df1 = pd.read_excel(xls, sheet_name=sheet1)
        for j, sheet2 in enumerate(sheet_names):
            if i >= j:
                continue
            df2 = pd.read_excel(xls, sheet_name=sheet2)
            merged = df1.merge(df2, on=key_columns, how='inner', suffixes=(f'_{sheet1}', f'_{sheet2}'))
            if not merged.empty:
                duplicates[f'{sheet1}_vs_{sheet2}'] = merged
    
    return duplicates

if __name__ == "__main__":
    sample_file = Path("sample_data.xlsx")
    if sample_file.exists():
        result = find_duplicate_records_across_sheets(
            "sample_data.xlsx", 
            key_columns=["ID"], 
            sheet_names=["Sheet1", "Sheet2", "Sheet3"]
        )
        for pair, records in result.items():
            print(f"Duplicate records in {pair}:")
            print(records.to_string(index=False))
            print()
    else:
        print("Create sample_data.xlsx with multiple sheets for testing.")

Output

stdout
Duplicate records in Sheet1_vs_Sheet2:
ID  Name_Sheet1  Age_Sheet1 Name_Sheet2  Age_Sheet2
1   Alice        30        Alice        30
2   Bob          25        Bob          25

Duplicate records in Sheet1_vs_Sheet3:
ID  Name_Sheet1  Age_Sheet1 Name_Sheet3  Age_Sheet3
3   Charlie      35        Charlie      35

How it works

The function loads an Excel file once with pd.ExcelFile for efficiency, then iterates over unique sheet pairs (using i >= j to avoid duplicates and self-comparison). For each pair, an inner merge on the specified key columns identifies records that exist in both sheets. The suffixes f'_{sheet1}' and f'_{sheet2}' prevent column name collisions when sheets share column names. The result is a dictionary of DataFrames, one per sheet pair, containing the full duplicate records.

Common mistakes

  • Forgetting to install `openpyxl` or `xlrd` as the Excel engine
  • Not excluding self-comparisons (Sheet1 vs Sheet1) leading to redundant output
  • Assuming exact column name matches across sheets without handling case or whitespace
  • Using a single key column when data requires composite keys for uniqueness

Variations

  1. Use `df1[df1.duplicated(subset=key_columns, keep=False)]` to find duplicates within a single sheet instead of across sheets
  2. Output duplicates to a consolidated CSV file with `pd.concat(result.values()).to_csv('duplicates.csv', index=False)`

Real-world use cases

  • Reconciling monthly sales data where customer records may appear across multiple regional sheets.
  • Auditing employee databases from different departments stored in separate Excel tabs.
  • Merging inventory lists from multiple warehouses before running a stock redistribution report.

Sponsored

Sponsored Reserved space — layout preview until AdSense is connected

Run locally

This sample needs third-party packages, so it cannot run in the browser IDE. Copy the code above, install the packages shown at the top, then run it in your own Python environment.

More from Data pipelines & processing

Related tutorials and quizzes for this topic.