Identify Noisy (Outlier) Values in Python using Z-Score and IQR Methods

In data analysis, detecting noisy or outlier values is an essential early step. A single extreme value can distort summary statistics such as the mean and standard deviation and lead to incorrect conclusions. In this article, we will learn how to identify outliers in an age dataset using two popular methods: Z-score and IQR (Interquartile Range).

This Python program is simple, practical, and useful for students as well as beginners in data analysis.


Python Program

#!/usr/bin/env python3

import numpy as np
from scipy import stats

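def detect_outliers_zscore(data, threshold=3.0):
    """Return the positive values whose |z-score| exceeds threshold."""
    # Keep only positive values; zero and negative ages are discarded.
    clean = np.array([x for x in data if x > 0], dtype=float)
    if len(clean) < 2:
        return []
    # stats.zscore uses the population standard deviation (ddof=0) by default.
    z_scores = np.abs(stats.zscore(clean))
    return clean[z_scores > threshold].tolist()

def detect_outliers_iqr(data, k=1.5):
    """Return the positive values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    clean = sorted(x for x in data if x > 0)
    if len(clean) < 4:
        return []
    q1 = np.percentile(clean, 25)
    q3 = np.percentile(clean, 75)
    iqr = q3 - q1
    # Values strictly beyond these fences are reported as outliers.
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return [x for x in clean if x < lower or x > upper]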

def main():
    age_data = [18, 20, 22, 21, 19, 20, 21, 300, 20, 19, -5, 23, 22, 20]

    print("Original ages:", age_data)

    print("\nZ-score Outliers:")
    print(detect_outliers_zscore(age_data))

    print("\nIQR Outliers:")
    print(detect_outliers_iqr(age_data))

if __name__ == "__main__":
    main()

Explanation

1. Data Cleaning

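Before detecting outliers, the program keeps only positive values (x > 0). This drops impossible entries such as the negative age -5, so they cannot distort the statistics computed afterwards. Note that the same filter also discards an age of 0; use x >= 0 if your data records infants as 0. Implausibly large values such as 300 pass this filter and are left for the outlier tests to catch.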
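The cleaning step can be sketched on its own as follows. The helper name split_valid_ages is hypothetical and not part of the program above; it applies the same x > 0 rule but also reports what was thrown away, which can be handy when auditing a dataset.

```python
def split_valid_ages(ages):
    """Hypothetical helper: separate positive ages from rejected entries."""
    valid = [a for a in ages if a > 0]
    rejected = [a for a in ages if a <= 0]
    return valid, rejected

age_data = [18, 20, 22, 21, 19, 20, 21, 300, 20, 19, -5, 23, 22, 20]
valid, rejected = split_valid_ages(age_data)

print("Kept:", valid)          # 13 values, still including 300
print("Rejected:", rejected)   # [-5]
```

The value 300 survives this step because the filter only checks the sign, not whether the age is plausible.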

2. Z-Score Method

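The Z-score measures how many standard deviations a value lies from the mean. The formula is:

Z = (x − μ) / σ

Here μ is the mean and σ is the standard deviation of the cleaned data (scipy's stats.zscore uses the population standard deviation by default). If the absolute value of Z is greater than the threshold, 3 by default, the value is considered an outlier.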
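To make the formula concrete, here is a small sketch that computes the Z-scores by hand with NumPy for the cleaned age list (the original data with -5 removed). It assumes the population standard deviation, which is what ages.std() returns by default.

```python
import numpy as np

# Cleaned ages from the example (the invalid -5 has already been removed).
ages = np.array([18, 20, 22, 21, 19, 20, 21, 300, 20, 19, 23, 22, 20], dtype=float)

mu = ages.mean()
sigma = ages.std()          # population standard deviation (ddof=0)
z = (ages - mu) / sigma     # Z = (x - mu) / sigma, element by element

z_of_300 = z[ages == 300][0]
flagged = ages[np.abs(z) > 3.0].tolist()

print(f"mean = {mu:.2f}, std = {sigma:.2f}")   # mean = 41.92, std = 74.51
print(f"Z-score of 300 = {z_of_300:.2f}")      # about 3.46
print("Flagged:", flagged)                     # [300.0]
```

Notice how much the single value 300 raises the mean (to about 42) and the standard deviation (to about 75), even though every other age is between 18 and 23.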

3. IQR Method

The IQR method uses quartiles:

  • Q1 = 25th percentile
  • Q3 = 75th percentile
  • IQR = Q3 − Q1

Any value outside the range:

[Q1 − 1.5 × IQR, Q3 + 1.5 × IQR]

is treated as an outlier.
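The steps above can be traced for the cleaned age list with a short sketch. It relies on np.percentile with its default linear interpolation, as the program does; other quartile conventions can give slightly different fences.

```python
import numpy as np

# Cleaned ages from the example (the invalid -5 has already been removed).
ages = [18, 20, 22, 21, 19, 20, 21, 300, 20, 19, 23, 22, 20]

q1, q3 = np.percentile(ages, [25, 75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = [a for a in ages if a < lower or a > upper]

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")   # Q1 = 20.0, Q3 = 22.0, IQR = 2.0
print(f"Range: [{lower}, {upper}]")           # Range: [17.0, 25.0]
print("Outliers:", outliers)                  # Outliers: [300]
```

Unlike the mean and standard deviation, the quartiles are not pulled upward by 300, so the accepted range stays close to the bulk of the data.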


Sample Output

Original ages: [18, 20, 22, 21, 19, 20, 21, 300, 20, 19, -5, 23, 22, 20]

Z-score Outliers:
[300.0]

IQR Outliers:
[300]

Conclusion

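Both Z-score and IQR methods are effective for detecting noisy values in datasets. Z-score works well for roughly normally distributed data, but the mean and standard deviation it relies on are themselves affected by extreme values. IQR is based on quartiles, which makes it more robust for skewed data.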

You can adjust the threshold values to fit your dataset or integrate this logic into larger data-cleaning pipelines.
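As a rough illustration of how much the threshold matters, the sketch below re-runs the Z-score test on the cleaned age list with three cutoffs. The cutoffs 2.0 and 3.5 are arbitrary values chosen for the demonstration, not recommendations.

```python
import numpy as np

# Cleaned ages from the example (the invalid -5 has already been removed).
ages = np.array([18, 20, 22, 21, 19, 20, 21, 300, 20, 19, 23, 22, 20], dtype=float)
z = np.abs((ages - ages.mean()) / ages.std())

# Map each candidate threshold to the values it would flag.
results = {t: ages[z > t].tolist() for t in (2.0, 3.0, 3.5)}

for t, flagged in results.items():
    print(f"threshold = {t}: {flagged}")
# threshold = 2.0: [300.0]
# threshold = 3.0: [300.0]
# threshold = 3.5: []
```

The Z-score of 300 is about 3.46 here, so raising the cutoff only slightly, to 3.5, is enough for the value to go undetected.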


Frequently Asked Questions (FAQs)

What is an outlier in data?

An outlier is a value that is significantly different from other values in a dataset. It may indicate errors or unusual observations.

Which method is better: Z-score or IQR?

Z-score works well for normally distributed data, while IQR is better for skewed datasets and is more robust against extreme values.
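One way to see the difference is to run both tests on a small sample. The sketch below uses only the first ten ages from the example, a shortened list chosen for illustration. With the population standard deviation, no Z-score in a sample of n values can exceed the square root of n − 1, so with ten values a threshold of 3 can never be crossed, no matter how extreme the value is.

```python
import numpy as np

# First ten ages from the example list (n = 10), including the suspicious 300.
ages = np.array([18, 20, 22, 21, 19, 20, 21, 300, 20, 19], dtype=float)

# Z-score test with the population standard deviation.
z = np.abs((ages - ages.mean()) / ages.std())
z_flagged = ages[z > 3.0].tolist()

# IQR test with the usual multiplier of 1.5.
q1, q3 = np.percentile(ages, [25, 75])
iqr = q3 - q1
iqr_flagged = ages[(ages < q1 - 1.5 * iqr) | (ages > q3 + 1.5 * iqr)].tolist()

print(f"largest Z-score = {z.max():.4f}")   # just under 3
print("Z-score flags:", z_flagged)          # []
print("IQR flags:", iqr_flagged)            # [300.0]
```

Here the value 300 inflates the standard deviation so much that it hides itself from the Z-score test, while the quartile-based range still catches it.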

Why remove negative values in age data?

A negative age cannot occur in reality, so such values are removed before analysis to keep them from distorting the mean, standard deviation, and quartiles.

Can I change the threshold values?

Yes. Pass a different threshold to detect_outliers_zscore (default 3.0) or a different k to detect_outliers_iqr (default 1.5), depending on your dataset. Lower values flag more points; higher values flag fewer.

