How to Find Missing Values in Large Datasets in Python
Analyze missing values across multiple large pandas DataFrames with counts and percentages.
Requires third-party packages — install first
pip install pandas numpy
Python code
import pandas as pd
import numpy as np


def find_missing_values_summary(datasets):
    """Analyze missing values across multiple datasets (dict of name: DataFrame)."""
    summary = {}
    for name, df in datasets.items():
        # Per-column count of missing entries (NaN, None, NaT)
        missing_count = df.isnull().sum()
        total_rows = len(df)
        # Convert counts to percentages of all rows, rounded to 2 decimals
        missing_pct = (missing_count / total_rows * 100).round(2)
        summary[name] = pd.DataFrame({
            'Column': df.columns,
            'Missing_Count': missing_count.values,
            'Missing_Percentage': missing_pct.values
        })
    return summary


if __name__ == "__main__":
    # Simulate large datasets with missing values
    np.random.seed(42)
    size = 10000
    df1 = pd.DataFrame({'A': np.random.choice([1, np.nan], size=size, p=[0.9, 0.1]),
                        'B': np.random.choice([2, np.nan], size=size, p=[0.8, 0.2])})
    df2 = pd.DataFrame({'X': np.random.choice([10, np.nan], size=size, p=[0.95, 0.05]),
                        'Y': np.random.choice([20, np.nan], size=size, p=[0.85, 0.15])})
    datasets = {'sales': df1, 'customers': df2}
    result = find_missing_values_summary(datasets)
    # Print one summary table per dataset
    for name, df in result.items():
        print(f"Dataset: {name}")
        print(df.to_string(index=False))
        print()
Output
Dataset: sales
Column  Missing_Count  Missing_Percentage
     A            996                9.96
     B           2019               20.19

Dataset: customers
Column  Missing_Count  Missing_Percentage
     X            528                5.28
     Y           1474               14.74
How it works
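The solution calls `df.isnull().sum()` to count the missing values (NaN, None, and NaT) in each column. Dividing those counts by the total number of rows and multiplying by 100 converts them to percentages, which are then rounded to two decimals. Looping over a dictionary of named DataFrames applies the same logic to every dataset and returns one summary table per name, which makes it easy to compare missingness across related datasets.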
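To make the arithmetic concrete, here is a minimal sketch on a four-row frame where the gaps can be counted by hand. The column names and values are invented purely for illustration.

```python
import pandas as pd
import numpy as np

# Tiny illustrative frame (hypothetical data) with gaps that are easy to count.
df = pd.DataFrame({
    "price": [10.0, np.nan, 12.5, np.nan],      # 2 of 4 rows missing
    "region": ["east", "west", None, "north"],  # 1 of 4 rows missing
})

# isnull() marks each cell True/False; sum() adds up the True values per column.
missing_count = df.isnull().sum()

# Share of rows that are missing, expressed as a percentage.
missing_pct = (missing_count / len(df) * 100).round(2)

print(missing_count)  # price: 2, region: 1
print(missing_pct)    # price: 50.0, region: 25.0
```

Both results are Series indexed by column name, which is why the main function can place them side by side in a single summary table.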
Common mistakes
- Forgetting to set the random seed when simulating data, which makes the reported counts change on every run
- Assuming missing values are always NaN when they could be empty strings or sentinel values
- Using `df.isnull().sum().sum()` by mistake, which returns a single total for the whole DataFrame rather than per-column counts
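The last two mistakes can be illustrated with a short sketch. The sentinel list below (`""`, `"N/A"`, `-999`) is an assumption chosen for the example; real sources may use different placeholders, so treat it as something to adapt rather than a standard.

```python
import pandas as pd

# Hypothetical frame where gaps are encoded in several different ways.
df = pd.DataFrame({
    "email": ["a@example.com", "", "N/A", None],
    "age": [34, -999, 51, 28],
})

# Assumed placeholder values; adjust to whatever your sources actually use.
SENTINELS = ["", "N/A", -999]

# isnull() alone only sees the real None, not the placeholders.
raw_count = df.isnull().sum()

# Treat a cell as missing if it is null OR equals one of the sentinels.
is_missing = df.isnull() | df.isin(SENTINELS)

per_column = is_missing.sum()         # Series: one count per column
grand_total = is_missing.sum().sum()  # single integer for the whole frame

print(raw_count)
print(per_column)
print(grand_total)
```

Building a boolean mask with `isin` leaves the original data untouched, which can be useful when the placeholders need to be reported before they are cleaned.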
Variations
- Use `df.isna().sum()` as an alias for the same behavior
- Call `df.info()` to see non-null counts and dtypes at a glance for a single DataFrame
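A brief sketch of these variations follows, using a small made-up frame. It also shows one further shortcut that the list above does not mention: taking the mean of the boolean mask gives the fraction of missing rows directly.

```python
import pandas as pd
import numpy as np

# Small illustrative frame (hypothetical data).
df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, np.nan],
    "B": ["x", "y", None, "z"],
})

# isna() and isnull() are aliases, so the per-column counts are identical.
same = df.isna().sum().equals(df.isnull().sum())
print(same)

# Extra shortcut: the mean of a True/False mask is the fraction of True values.
pct = (df.isna().mean() * 100).round(2)
print(pct)  # A: 50.0, B: 25.0

# info() prints non-null counts and dtypes for one DataFrame.
df.info()
```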
Real-world use cases
- Running a data quality report before merging source tables from different departments.
- Checking for missing fields in customer records before training a machine learning model.
- Automating missing-value checks in a daily ETL pipeline to alert on data degradation.
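As a rough sketch of the ETL use case, a pipeline step might flag any column whose missing percentage exceeds a limit. The helper name `columns_over_threshold`, the 25% limit, and the sample batch are all hypothetical, and the alerting itself is reduced to a `print` call.

```python
import pandas as pd
import numpy as np


def columns_over_threshold(df, max_missing_pct=10.0):
    """Return {column: missing %} for columns above the limit (hypothetical helper)."""
    pct = (df.isnull().sum() / len(df) * 100).round(2)
    flagged = pct[pct > max_missing_pct]
    return flagged.to_dict()


# Hypothetical daily batch with two incomplete columns.
batch = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5],
    "amount": [9.5, np.nan, np.nan, 4.0, 7.25],  # 2 of 5 rows missing
    "customer": ["a", "b", "c", None, "e"],      # 1 of 5 rows missing
})

problems = columns_over_threshold(batch, max_missing_pct=25.0)
if problems:
    # A real pipeline would send this to its own alerting channel instead.
    print(f"Missing-value alert: {problems}")
```

Here only `amount` is flagged, because `customer` sits at 20%, below the assumed 25% limit.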