Identify Noisy (Outlier) Values in an Age Dataset
Below is a Python program that uses both the Z-score and IQR methods to detect outliers:
#!/usr/bin/env python3
"""
Identify noisy (outlier) values in an age dataset using:
1. Z-score method
2. IQR (Interquartile Range) method
"""
import numpy as np
from scipy import stats


def detect_outliers_zscore(data, threshold=3.0):
    """
    Returns a list of values with absolute Z-score > threshold.
    """
    # Filter out non-positive ages first (clearly invalid)
    clean = np.array([x for x in data if x > 0], dtype=float)
    if len(clean) < 2:
        return []
    z_scores = np.abs(stats.zscore(clean))
    return clean[z_scores > threshold].tolist()


def detect_outliers_iqr(data, k=1.5):
    """
    Returns a list of values outside the [Q1 - k*IQR, Q3 + k*IQR] range.
    """
    clean = sorted(x for x in data if x > 0)
    if len(clean) < 4:
        return []  # need at least 4 points to define IQR
    q1 = np.percentile(clean, 25)
    q3 = np.percentile(clean, 75)
    iqr = q3 - q1
    lower_bound = q1 - k * iqr
    upper_bound = q3 + k * iqr
    return [x for x in clean if x < lower_bound or x > upper_bound]


def main():
    # Example age dataset
    age_data = [18, 20, 22, 21, 19, 20, 21, 300, 20, 19, -5, 23, 22, 20]
    print("Original ages:", age_data)

    z_outliers = detect_outliers_zscore(age_data)
    print("\nNoisy values by Z-score method (|Z| > 3):")
    print(z_outliers or "None detected")

    iqr_outliers = detect_outliers_iqr(age_data)
    print("\nNoisy values by IQR method (k=1.5):")
    print(iqr_outliers or "None detected")


if __name__ == "__main__":
    main()
How it works
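- Cleaning
  We first strip out any non-positive ages (e.g., -5, 0), since those are clearly invalid. They are dropped before either method runs, so they are not reported as outliers.
- Z-score Method
  - Computes each age's Z-score: $z = (x - \mu) / \sigma$, where $\mu$ is the mean and $\sigma$ the standard deviation of the cleaned ages.
  - Flags anything with $|z| > 3$ as an outlier.
- IQR Method
  - Finds the 1st quartile (Q1) and 3rd quartile (Q3).
  - Computes the IQR = Q3 − Q1.
  - Anything outside $[Q1 - 1.5 \times IQR,\ Q3 + 1.5 \times IQR]$ is an outlier.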
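To make the two rules concrete, here is a minimal sketch that recomputes the intermediate quantities for the example ages using only the standard library. The helper name `summarize` is hypothetical and not part of the program above. It assumes a population standard deviation (`statistics.pstdev`), matching the default `ddof=0` of `scipy.stats.zscore`, and the `"inclusive"` quantile method, chosen here to mirror the linear interpolation that `np.percentile` uses by default.

```python
import statistics


def summarize(ages, k=1.5):
    """Return the intermediate quantities behind the Z-score and IQR rules."""
    # Same cleaning step as the program: keep strictly positive ages only.
    clean = sorted(x for x in ages if x > 0)

    mean = statistics.fmean(clean)
    std = statistics.pstdev(clean)  # population std (divide by n)
    # Largest |z| in the sample; assumes std > 0 (ages are not all identical).
    max_abs_z = max(abs(x - mean) / std for x in clean)

    # "inclusive" interpolates linearly between the sorted data points.
    q1, _, q3 = statistics.quantiles(clean, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "mean": mean,
        "std": std,
        "max_abs_z": max_abs_z,
        "q1": q1,
        "q3": q3,
        "lower": q1 - k * iqr,
        "upper": q3 + k * iqr,
    }


ages = [18, 20, 22, 21, 19, 20, 21, 300, 20, 19, -5, 23, 22, 20]
for name, value in summarize(ages).items():
    print(f"{name}: {value:.2f}")
```

For the example data the quartiles are 20 and 22, so the IQR fences fall at 17 and 25, and the largest |z| (belonging to 300) is about 3.46, just above the threshold of 3.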
Sample Output
Running the program on the example dataset should print:

Original ages: [18, 20, 22, 21, 19, 20, 21, 300, 20, 19, -5, 23, 22, 20]

Noisy values by Z-score method (|Z| > 3):
[300.0]

Noisy values by IQR method (k=1.5):
[300]
Feel free to tweak the thresholds (threshold in Z-score, k in IQR) or integrate this into a larger data-cleaning pipeline!
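One case where tweaking the Z-score threshold matters is a small sample. With a population standard deviation, no value in a sample of n points can have |z| larger than sqrt(n - 1), so with 10 or fewer cleaned ages a threshold of 3 can never fire. The sketch below illustrates this with a standard-library reimplementation; `zscore_outliers` is a hypothetical stand-in for the NumPy/SciPy function above, not a replacement for it.

```python
import math
import statistics


def zscore_outliers(ages, threshold=3.0):
    """Flag ages whose |z| exceeds threshold (population std, like ddof=0)."""
    clean = [x for x in ages if x > 0]
    if len(clean) < 2:
        return []
    mean = statistics.fmean(clean)
    std = statistics.pstdev(clean)
    if std == 0:
        return []  # all ages identical: nothing can be an outlier
    return [x for x in clean if abs(x - mean) / std > threshold]


small = [18, 20, 22, 21, 19, 20, 21, 300, 20, 19]  # only 10 ages

# Upper bound on |z| for this sample size.
print(math.sqrt(len(small) - 1))             # 3.0
print(zscore_outliers(small, threshold=3.0))  # [] : 300 is missed
print(zscore_outliers(small, threshold=2.5))  # [300]
```

The IQR rule does not share this limitation, which is one reason to run both methods on short datasets.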