Detect Outliers in CSV Data Using Z-Score in Python
Read a CSV file and detect outliers in a numeric column by computing z-scores, flagging those exceeding a given threshold — no machine learning required.
Python code
import csv
import statistics


def detect_outliers(csv_path, column_name, threshold=2.0):
    """Detect outliers in a numeric column using the z-score method.

    Returns (row_index, value, z_score) tuples; row_index is the 0-based
    position of the data row in the file, not counting the header.
    """
    rows = []  # (row_index, value) for every cell that parses as a number
    with open(csv_path, 'r', newline='') as f:
        reader = csv.DictReader(f)
        # fieldnames is None for an empty file, so check it before using `in`
        if not reader.fieldnames or column_name not in reader.fieldnames:
            raise ValueError(f"Column '{column_name}' not found")
        for i, row in enumerate(reader):
            try:
                rows.append((i, float(row[column_name])))
            except (ValueError, TypeError):
                continue  # skip blank or non-numeric cells

    if len(rows) < 2:
        return []  # the sample standard deviation needs at least two values

    values = [val for _, val in rows]
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    if stdev == 0:
        return []  # all values are identical, so no z-score is defined

    outliers = []
    for i, val in rows:
        z_score = (val - mean) / stdev
        if abs(z_score) > threshold:
            outliers.append((i, val, z_score))
    return outliers


if __name__ == "__main__":
    # Example: create sample CSV data
    sample_data = "value\n10\n12\n11\n13\n100\n9\n11\n12\n10\n200\n14\n"
    with open('sample.csv', 'w') as f:
        f.write(sample_data)

    result = detect_outliers('sample.csv', 'value')
    print("Outliers detected (index, value, z-score):")
    for idx, val, z in result:
        print(f" Row {idx+1}: {val} (z={z:.2f})")
Output
Outliers detected (index, value, z-score):
 Row 10: 200.0 (z=2.71)
How it works
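The function reads the CSV file with `csv.DictReader` so the column can be accessed by name. Each cell is converted with `float()`; blank or non-numeric cells raise an exception that the `try`/`except` catches, so those rows are skipped. A z-score measures how many sample standard deviations a value lies from the mean: z = (value - mean) / stdev. Values with `abs(z_score) > threshold` are flagged; the default here is 2.0, and 3 is a common stricter choice. Because the mean and standard deviation are themselves pulled by extreme values, a very large outlier can mask a smaller one, especially in small samples. Relying only on the standard library avoids external dependencies and keeps the method simple and transparent, which suits quick data screening.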
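To make the z-score step concrete, here is a minimal sketch that computes a z-score for every value in a small in-memory list. The helper name `z_scores` and the `readings` data are illustrative assumptions, not part of the script above.

```python
import statistics


def z_scores(values):
    """Return the z-score of each value, using the sample standard deviation."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [(v - mean) / stdev for v in values]


# Hypothetical readings: nine values near 11 and one far away
readings = [10, 12, 11, 13, 9, 11, 12, 10, 14, 60]

for value, z in zip(readings, z_scores(readings)):
    flag = "outlier" if abs(z) > 2.0 else ""
    print(f"{value:>4}  z={z:+.2f}  {flag}")
```

With this data only 60 crosses the 2.0 threshold; every other value stays within half a standard deviation of the mean.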
Common mistakes
- Forgetting to handle non-numeric or missing data, which crashes the script.
- Using `statistics.pstdev` (population formula, divides by n) when the data is a sample; `statistics.stdev` divides by n - 1 and is the appropriate choice here.
- Setting the threshold too low (e.g., 1.5) and flagging normal variation as outliers.
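The difference between the two standard deviation functions is easiest to see on a tiny list. The sketch below uses made-up values and shows that the population formula yields a smaller spread, and therefore larger z-scores, than the sample formula.

```python
import statistics

values = [10, 12, 11, 13, 9]  # illustrative data; the mean is 11

sample_sd = statistics.stdev(values)       # divides by n - 1
population_sd = statistics.pstdev(values)  # divides by n

# The same value gets a larger z-score under the population formula
mean = statistics.mean(values)
z_sample = (13 - mean) / sample_sd
z_population = (13 - mean) / population_sd

print(f"stdev={sample_sd:.3f}  pstdev={population_sd:.3f}")
print(f"z(13): sample={z_sample:.3f}  population={z_population:.3f}")
```

The gap shrinks as the number of values grows, so the choice matters most for short columns.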
Variations
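- Use `pandas` with `scipy.stats.zscore` for vectorized outlier detection on large datasets. Note that `scipy.stats.zscore` defaults to `ddof=0` (population standard deviation); pass `ddof=1` to match `statistics.stdev`.
- Apply the IQR method: flag values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.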
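The IQR variation can be sketched with the standard library alone. The function name `iqr_outliers` and the multiplier parameter `k` are assumptions made for illustration. Quartile conventions differ between tools; this version relies on `statistics.quantiles` with its default `method='exclusive'`, so other libraries may place the fences slightly differently.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return (index, value) pairs that fall outside the IQR fences."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # three quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [(i, v) for i, v in enumerate(values) if v < lower or v > upper]


# Same numbers as the sample CSV in the script
values = [10, 12, 11, 13, 100, 9, 11, 12, 10, 200, 14]
print(iqr_outliers(values))  # [(4, 100), (9, 200)]
```

Because quartiles are barely affected by a few extreme values, the fences here are 4 and 20, and both 100 and 200 fall outside them.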
Real-world use cases
- Flag abnormal sensor readings in an IoT data pipeline before alerting operators.
- Identify anomalous transaction amounts in financial log reviews to reduce fraud investigation scope.
- Quickly spot data entry errors in survey results before running statistical analysis.