In data analysis, detecting noisy or outlier values is an important step, because outliers can distort summary statistics such as the mean and lead to incorrect conclusions. This article shows how to identify outliers in a small age dataset using two popular methods: the Z-score and the interquartile range (IQR).
The Python program is short and practical, and is aimed at students and beginners in data analysis.
Python Program
#!/usr/bin/env python3
"""Detect outliers in age data using the Z-score and IQR methods."""
import numpy as np
from scipy import stats


def detect_outliers_zscore(data, threshold=3.0):
    """Return the values whose absolute Z-score exceeds the threshold."""
    # Drop invalid (negative) ages first; an age of 0 is kept as valid.
    clean = np.array([x for x in data if x >= 0], dtype=float)
    if len(clean) < 2:
        return []
    # stats.zscore uses the population standard deviation (ddof=0) by default.
    z_scores = np.abs(stats.zscore(clean))
    return clean[z_scores > threshold].tolist()


def detect_outliers_iqr(data, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # Drop invalid (negative) ages first; an age of 0 is kept as valid.
    clean = sorted(x for x in data if x >= 0)
    if len(clean) < 4:
        return []
    q1 = np.percentile(clean, 25)
    q3 = np.percentile(clean, 75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return [x for x in clean if x < lower or x > upper]


def main():
    age_data = [18, 20, 22, 21, 19, 20, 21, 300, 20, 19, -5, 23, 22, 20]
    print("Original ages:", age_data)
    print("\nZ-score Outliers:")
    print(detect_outliers_zscore(age_data))
    print("\nIQR Outliers:")
    print(detect_outliers_iqr(age_data))


if __name__ == "__main__":
    main()
Explanation
1. Data Cleaning
Before detecting outliers, the program removes invalid values such as negative ages. An impossible value would otherwise distort the mean, the standard deviation, and the quartiles that both methods depend on.
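The cleaning step can also be sketched on its own. The helper name `split_valid_ages` is hypothetical and not part of the program above, and treating 0 as a valid age is an assumption. Returning the rejected values as well makes it possible to report what was dropped instead of discarding it silently.

```python
def split_valid_ages(data):
    """Split raw ages into (valid, rejected). Hypothetical helper for illustration."""
    valid = [x for x in data if x >= 0]    # assumption: 0 counts as a valid age
    rejected = [x for x in data if x < 0]  # negative ages are impossible
    return valid, rejected


age_data = [18, 20, 22, 21, 19, 20, 21, 300, 20, 19, -5, 23, 22, 20]
valid, rejected = split_valid_ages(age_data)

print("Kept:", len(valid), "values")
print("Rejected:", rejected)
```

Note that 300 passes this check: it is not negative, so it is left for the outlier detection step to catch.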
2. Z-Score Method
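The Z-score measures how far a value is from the mean, in units of standard deviation. The formula is:
Z = (x − μ) / σ
where μ is the mean and σ is the standard deviation of the cleaned data. If the absolute value of Z is greater than 3, the value x is considered an outlier.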
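To make the formula concrete, here is a minimal sketch that computes the Z-scores by hand with NumPy for the article's valid ages. It assumes the population standard deviation (`ddof=0`), which is the default for both `numpy.std` and `scipy.stats.zscore`.

```python
import numpy as np

# Valid ages from the article's dataset (the negative value already removed)
ages = np.array([18, 20, 22, 21, 19, 20, 21, 300, 20, 19, 23, 22, 20], dtype=float)

mu = ages.mean()
sigma = ages.std()        # population standard deviation (ddof=0)
z = (ages - mu) / sigma   # Z = (x - mu) / sigma, applied element-wise

for age, score in zip(ages, z):
    flag = "outlier" if abs(score) > 3.0 else ""
    print(f"{age:6.0f}  z = {score:6.2f}  {flag}")
```

Only 300 exceeds the threshold, with a Z-score of about 3.46. One caveat: the extreme value also inflates μ and σ, which limits how large its own Z-score can get. With the population standard deviation, |Z| can never exceed √(n − 1) for n values, so on a dataset of 10 or fewer values a threshold of 3 flags nothing.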
3. IQR Method
The IQR method uses quartiles:
- Q1 = 25th percentile
- Q3 = 75th percentile
- IQR = Q3 − Q1
Any value outside the range:
[Q1 − 1.5 × IQR, Q3 + 1.5 × IQR]
is treated as an outlier.
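The quartiles and the resulting range can be worked through for the article's valid ages as follows. This is a sketch that relies on the default linear interpolation of `numpy.percentile`; other quartile conventions can give slightly different bounds.

```python
import numpy as np

# Valid ages from the article's dataset (the negative value already removed)
ages = [18, 20, 22, 21, 19, 20, 21, 300, 20, 19, 23, 22, 20]

q1, q3 = np.percentile(ages, [25, 75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = [x for x in ages if x < lower or x > upper]

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
print(f"Range: [{lower}, {upper}]")
print("Outliers:", outliers)
```

Here Q1 is 20 and Q3 is 22, so the IQR is 2 and the accepted range is [17, 25]. The value 300 barely moves the quartiles, which is why this method is not thrown off by the very value it is trying to detect.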
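Sample Output
Original ages: [18, 20, 22, 21, 19, 20, 21, 300, 20, 19, -5, 23, 22, 20]

Z-score Outliers:
[300.0]

IQR Outliers:
[300]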
Conclusion
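Both the Z-score and IQR methods are effective for detecting noisy values in datasets. The Z-score works well for approximately normally distributed data, while the IQR is more robust for skewed data because quartiles are far less sensitive to extreme values than the mean and standard deviation.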
You can adjust the threshold values to fit your dataset or integrate this logic into larger data-cleaning pipelines.
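As an illustration of how the multiplier changes the result, the sketch below runs a compact IQR check with two values of `k`. The helper `iqr_outliers` is a hypothetical stand-in for the article's function, and the sample is made up: it is the article's valid ages plus one extra borderline value (27).

```python
import numpy as np


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]. Illustrative helper."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < lower or x > upper]


# Made-up sample: the article's valid ages plus a borderline value (27)
ages = [18, 20, 22, 21, 19, 20, 21, 300, 20, 19, 23, 22, 20, 27]

for k in (1.5, 3.0):
    print(f"k = {k}: {iqr_outliers(ages, k)}")
```

With k = 1.5 both 27 and 300 fall outside the range [17, 25]. With k = 3.0 the range widens to [14, 28] and only 300 is flagged. A larger multiplier is the more conservative choice when borderline values are plausible, as they are for ages.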
Frequently Asked Questions (FAQs)
What is an outlier in data?
An outlier is a value that is significantly different from other values in a dataset. It may indicate errors or unusual observations.
Which method is better: Z-score or IQR?
Neither is better in every case. The Z-score works well for approximately normally distributed data with enough observations, while the IQR is better for skewed or small datasets and is more robust against extreme values.
Why remove negative values in age data?
Negative ages are invalid in real-world datasets, so they are removed before analysis to improve accuracy.
Can I change the threshold values?
Yes, you can adjust the Z-score threshold (default 3) and IQR multiplier (default 1.5) depending on your dataset.
