How to Find Missing Values in Large Datasets in Python

Analyze missing values across multiple large pandas DataFrames with counts and percentages.

Medium · Python 3.9+ · Jun 28, 2026 · Data pipelines & processing

Requires third-party packages. Install them first:
pip install pandas numpy

Python code

import pandas as pd
import numpy as np

def find_missing_values_summary(datasets):
    """Analyze missing values across multiple datasets (dict of name: DataFrame)."""
    summary = {}
    for name, df in datasets.items():
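        # Per-column count of missing cells (NaN, None, NaT)
        missing_count = df.isnull().sum()
        total_rows = len(df)
        # Percentage of rows missing per column; NaN for an empty DataFrame (0 / 0)
        missing_pct = (missing_count / total_rows * 100).round(2)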
        
        summary[name] = pd.DataFrame({
            'Column': df.columns,
            'Missing_Count': missing_count.values,
            'Missing_Percentage': missing_pct.values
        })
    return summary

if __name__ == "__main__":
    # Simulate large datasets with missing values
    np.random.seed(42)
    size = 10000
    df1 = pd.DataFrame({'A': np.random.choice([1, np.nan], size=size, p=[0.9, 0.1]),
                        'B': np.random.choice([2, np.nan], size=size, p=[0.8, 0.2])})
    df2 = pd.DataFrame({'X': np.random.choice([10, np.nan], size=size, p=[0.95, 0.05]),
                        'Y': np.random.choice([20, np.nan], size=size, p=[0.85, 0.15])})
    
    datasets = {'sales': df1, 'customers': df2}
    result = find_missing_values_summary(datasets)
    for name, df in result.items():
        print(f"Dataset: {name}")
        print(df.to_string(index=False))
        print()

Output

Dataset: sales
Column  Missing_Count  Missing_Percentage
     A            996               9.96
     B           2019              20.19

Dataset: customers
Column  Missing_Count  Missing_Percentage
     X            528               5.28
     Y           1474              14.74

How it works

The function calls df.isnull().sum() to count the missing values in each column. Dividing those counts by the number of rows and multiplying by 100 turns them into percentages, rounded to two decimals. Looping over a dictionary of named DataFrames produces one summary table per dataset, which makes it easy to compare missingness across related datasets.
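
To make the arithmetic concrete, here is a minimal sketch on a hand-built four-row DataFrame where the gaps are easy to count by eye. The column names and values are made up for illustration and are not part of the example above.

```python
import pandas as pd
import numpy as np

# Hypothetical toy data with known gaps
df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, np.nan],  # 2 of 4 rows missing
    "B": [np.nan, 2.0, 3.0, 4.0],     # 1 of 4 rows missing
})

# Same two steps as find_missing_values_summary
missing_count = df.isnull().sum()
missing_pct = (missing_count / len(df) * 100).round(2)

# A: 2 missing (50.0 %), B: 1 missing (25.0 %)
print(missing_count)
print(missing_pct)
```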

Common mistakes

  • Forgetting to set the random seed when simulating data, so the counts change on every run
  • Assuming missing values are always NaN when they could be empty strings or sentinel values
  • Calling `df.isnull().sum().sum()` when you want per-column counts; the second `.sum()` collapses them into a single total
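
The placeholder and single-total pitfalls above can be sketched as follows. This assumes, purely for illustration, that an empty string and -999 stand for "no value" in your data; check your own sources for the placeholders they actually use.

```python
import pandas as pd
import numpy as np

# Hypothetical data: "" and -999 are assumed placeholders for "no value"
df = pd.DataFrame({
    "name": ["Ann", "", "Cy", None],
    "age": [34, -999, 28, 41],
})

# isnull() only sees real missing markers, so just the None is counted
naive = df.isnull().sum()

# Map the assumed placeholders to NaN, then count again
cleaned = df.replace(["", -999], np.nan)
per_column = cleaned.isnull().sum()   # one count per column
grand_total = int(per_column.sum())   # single number, same as .sum().sum()

print(naive)
print(per_column)
print(grand_total)
```

Here the naive count finds one missing value in total, while the cleaned count finds two in name and one in age.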

Variations

  1. Use `df.isna().sum()` as an alias for the same behavior
  2. Call `df.info()` to see non-null counts and dtypes at a glance for a single DataFrame
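
Both variations can be tried on a small frame. The sketch below uses made-up data; it captures the `df.info()` report through the `buf` parameter only so the text can be inspected, since `info()` prints to stdout by default.

```python
import io
import pandas as pd
import numpy as np

# Hypothetical three-row frame with one gap per column
df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": ["x", None, "z"]})

# isna() and isnull() are two names for the same method
same = df.isna().sum().equals(df.isnull().sum())
print(same)

# Capture the info() report (non-null counts and dtypes) as text
buffer = io.StringIO()
df.info(buf=buffer)
report = buffer.getvalue()
print(report)
```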

Real-world use cases

  • Running a data quality report before merging source tables from different departments.
  • Checking for missing fields in customer records before training a machine learning model.
  • Automating missing-value checks in a daily ETL pipeline to alert on data degradation.
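
For the ETL use case, one possible shape for an automated check is a function that returns the columns whose missing percentage exceeds a limit. This is a sketch, not a prescribed pipeline step: the name `columns_over_threshold`, the 10 % default, and the sample data are all hypothetical. It uses `df.isnull().mean()`, which gives the missing fraction per column directly instead of dividing the counts by the row count.

```python
import pandas as pd

def columns_over_threshold(df, max_missing_pct=10.0):
    """Return {column: missing %} for columns above max_missing_pct."""
    pct = df.isnull().mean() * 100  # mean of booleans = fraction missing
    flagged = pct[pct > max_missing_pct].round(2)
    return flagged.to_dict()

# Hypothetical daily extract
df = pd.DataFrame({
    "email": ["a@x.io", None, None, "d@x.io"],  # 2 of 4 missing
    "age": [31, 42, 27, 55],                    # complete
})

alerts = columns_over_threshold(df)
if alerts:
    # A real pipeline would log or notify here; print stands in for that
    print(f"Missing-value alert: {alerts}")
```

Returning a plain dictionary keeps the check easy to log or to compare against the previous day's result.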

