How to Find Missing Values in Large Datasets in Python

Analyze missing values across multiple large pandas DataFrames with counts and percentages.

Medium Python 3.9+ Jun 28, 2026 Data pipelines & processing

pandas missing-data data-cleaning data-quality

Requires third-party packages — install first

pip install pandas numpy

Python code

import pandas as pd
import numpy as np

def find_missing_values_summary(datasets):
    """Analyze missing values across multiple datasets (dict of name: DataFrame)."""
    summary = {}
    for name, df in datasets.items():
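        # isnull() builds a boolean mask; sum() counts the True cells per column
        missing_count = df.isnull().sum()
        total_rows = len(df)
        # With zero rows this is 0/0, which pandas reports as NaN rather than raising
        missing_pct = (missing_count / total_rows * 100).round(2)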
        
        summary[name] = pd.DataFrame({
            'Column': df.columns,
            'Missing_Count': missing_count.values,
            'Missing_Percentage': missing_pct.values
        })
    return summary

if __name__ == "__main__":
    # Simulate large datasets with missing values
    np.random.seed(42)
    size = 10000
    df1 = pd.DataFrame({'A': np.random.choice([1, np.nan], size=size, p=[0.9, 0.1]),
                        'B': np.random.choice([2, np.nan], size=size, p=[0.8, 0.2])})
    df2 = pd.DataFrame({'X': np.random.choice([10, np.nan], size=size, p=[0.95, 0.05]),
                        'Y': np.random.choice([20, np.nan], size=size, p=[0.85, 0.15])})
    
    datasets = {'sales': df1, 'customers': df2}
    result = find_missing_values_summary(datasets)
    for name, df in result.items():
        print(f"Dataset: {name}")
        print(df.to_string(index=False))
        print()

Output

stdout

Dataset: sales
Column  Missing_Count  Missing_Percentage
     A            996               9.96
     B           2019              20.19

Dataset: customers
Column  Missing_Count  Missing_Percentage
     X            528               5.28
     Y           1474              14.74

How it works

The solution calls df.isnull() to build a boolean mask of missing cells, then .sum() to count the True values in each column. Dividing those counts by the total number of rows and multiplying by 100 converts them to percentages. Looping over a dictionary of named DataFrames produces one summary table per dataset, which makes it easy to compare missingness across related datasets.
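
To make the mask-then-sum step concrete, here is a minimal sketch on a tiny frame. The column names and values are made up for illustration and are not part of the example above.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row frame: "a" has one gap, "b" has two (NaN and None).
df = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0],
                   "b": [np.nan, "x", None, "y"]})

mask = df.isnull()            # boolean DataFrame, True where a value is missing
missing_count = mask.sum()    # True counts as 1, so this is a per-column count
missing_pct = (missing_count / len(df) * 100).round(2)

print(missing_count.to_dict())
print(missing_pct.to_dict())
```

Both NaN and None are treated as missing, which is why column "b" reports two gaps.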

Common mistakes

Forgetting to set the random seed when simulating data, so the missing-value counts change on every run
Assuming missing values are always NaN when they could be empty strings or sentinel values, which `isnull()` does not flag
Calling `df.isnull().sum().sum()` by mistake, which returns a single total for the whole DataFrame instead of per-column counts
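
The last two mistakes can be illustrated with a short sketch. The placeholder values in SENTINELS ("", "N/A", -999) and the sample data are assumptions chosen for the example; real datasets will use their own conventions.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data where gaps were encoded as placeholders, not NaN.
raw = pd.DataFrame({
    "city": ["Oslo", "", "N/A", "Lima"],
    "age": [34, -999, 51, -999],
})

# Assumed placeholder values; adjust these per dataset.
SENTINELS = ["", "N/A", -999]

# Before cleaning, isnull() sees nothing missing.
print(raw.isnull().sum().to_dict())

# Convert the placeholders to NaN so isnull() can detect them.
cleaned = raw.replace(SENTINELS, np.nan)

per_column = cleaned.isnull().sum()     # one count per column
grand_total = int(per_column.sum())     # a single number for the whole frame

print(per_column.to_dict())
print(grand_total)
```

The per-column Series is what a quality report usually needs; the grand total hides which column is responsible.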

Variations

Use `df.isna().sum()` instead; `isna()` and `isnull()` are aliases and return the same result
Call `df.info()` to see non-null counts and dtypes at a glance for a single DataFrame
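
A brief sketch of both variations follows. The frame is hypothetical, and capturing the `info()` report in a string buffer is only one way to use it; calling `df.info()` directly prints the same report.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical frame with gaps in both columns.
df = pd.DataFrame({"x": [10.0, np.nan, 10.0, 10.0],
                   "y": [np.nan, np.nan, 20.0, 20.0]})

# isna() and isnull() produce identical per-column counts.
same = df.isna().sum().equals(df.isnull().sum())
print(same)

# info() writes to stdout by default; pass buf= to capture the report as text.
buffer = io.StringIO()
df.info(buf=buffer)
report = buffer.getvalue()
print(report)
```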

Real-world use cases

Running a data quality report before merging source tables from different departments.
Checking for missing fields in customer records before training a machine learning model.
Automating missing-value checks in a daily ETL pipeline to alert on data degradation.
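
For the ETL case, one possible shape for an automated check is sketched below. The function name `columns_over_threshold`, the 25% threshold, and the sample batch are all hypothetical; it uses `isnull().mean()`, which for a non-empty DataFrame gives the same fraction as dividing the count by the number of rows.

```python
import pandas as pd

def columns_over_threshold(df, max_missing_pct):
    """Return {column: missing %} for columns whose missing share exceeds the threshold."""
    # The mean of a boolean mask is the fraction of True values per column.
    missing_pct = df.isnull().mean() * 100
    flagged = missing_pct[missing_pct > max_missing_pct]
    return flagged.round(2).to_dict()

# Hypothetical daily batch: half of the email values are missing.
batch = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "email": ["a@x.io", None, None, "d@x.io"],
})

alerts = columns_over_threshold(batch, max_missing_pct=25.0)
if alerts:
    # A real pipeline would route this to its own alerting channel.
    print(f"Missing-value alert: {alerts}")
```

Returning a plain dictionary keeps the check easy to log or to compare against the previous day's result.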
