Python Program to Identify Noisy Values in Age Data

Identify Noisy (Outlier) Values in an Age Dataset

Below is a Python program that uses both the Z-score and IQR methods to detect outliers:


#!/usr/bin/env python3
"""
Identify noisy (outlier) values in an age dataset using:
 1. Z-score method
 2. IQR (Interquartile Range) method
"""

import numpy as np
from scipy import stats

def detect_outliers_zscore(data, threshold=3.0):
    """
    Returns a list of values with absolute Z-score > threshold.
    """
    # Filter out non-positive ages first (clearly invalid)
    clean = np.array([x for x in data if x > 0], dtype=float)
    if len(clean) < 2:
        return []
    z_scores = np.abs(stats.zscore(clean))
    return clean[z_scores > threshold].tolist()

def detect_outliers_iqr(data, k=1.5):
    """
    Returns a list of values outside the [Q1 - k*IQR, Q3 + k*IQR] range.
    """
    clean = sorted(x for x in data if x > 0)
    if len(clean) < 4:
        return []  # need at least 4 points to define IQR
    q1 = np.percentile(clean, 25)
    q3 = np.percentile(clean, 75)
    iqr = q3 - q1
    lower_bound = q1 - k * iqr
    upper_bound = q3 + k * iqr
    return [x for x in clean if x < lower_bound or x > upper_bound]

def main():
    # Example age dataset
    age_data = [18, 20, 22, 21, 19, 20, 21, 300, 20, 19, -5, 23, 22, 20]

    print("Original ages:", age_data)

    z_outliers = detect_outliers_zscore(age_data)
    print("\nNoisy values by Z-score method (|Z| > 3):")
    print(z_outliers or "None detected")

    iqr_outliers = detect_outliers_iqr(age_data)
    print("\nNoisy values by IQR method (k=1.5):")
    print(iqr_outliers or "None detected")

if __name__ == "__main__":
    main()
  

How it works

  1. Cleaning
    We first strip out any non-positive ages (e.g., -5, 0) since those are clearly invalid.

  2. Z-score Method

    • Computes each age’s Z-score:

      $Z = \frac{x - \mu}{\sigma}$
    • Flags anything with $|Z| > 3$ as an outlier.

  3. IQR Method

    • Finds the 1st quartile (Q1) and 3rd quartile (Q3).

    • Computes the IQR = Q3 − Q1.

    • Anything outside $[Q1 - 1.5\,\mathrm{IQR},\; Q3 + 1.5\,\mathrm{IQR}]$ is an outlier.
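
To make the three steps above concrete, here is a minimal sketch that recomputes the intermediate numbers for the example dataset by hand, using only NumPy. The variable names (`clean`, `z_300`, `lower`, `upper`) are illustrative and not part of the program above. It assumes the population standard deviation (ddof=0), which is the default of `scipy.stats.zscore`, and NumPy's default linear-interpolation percentiles.

```python
import numpy as np

ages = [18, 20, 22, 21, 19, 20, 21, 300, 20, 19, -5, 23, 22, 20]

# Step 1: drop non-positive ages, mirroring the cleaning step.
clean = np.array([a for a in ages if a > 0], dtype=float)

# Step 2: Z-scores using the population standard deviation (ddof=0).
mu = clean.mean()
sigma = clean.std()  # ddof=0 by default
z = (clean - mu) / sigma
z_300 = z[clean == 300][0]
print(f"mean={mu:.2f}, std={sigma:.2f}, z(300)={z_300:.2f}")

# Step 3: IQR fences with k = 1.5.
q1, q3 = np.percentile(clean, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(f"Q1={q1}, Q3={q3}, IQR={iqr}, fences=[{lower}, {upper}]")
```

For this dataset the Z-score of 300 comes out at about 3.46, just above the threshold of 3, and the IQR fences are 17 and 25, so both methods flag 300 and nothing else.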


Sample Output

Original ages: [18, 20, 22, 21, 19, 20, 21, 300, 20, 19, -5, 23, 22, 20]

Noisy values by Z-score method (|Z| > 3):
[300.0]

Noisy values by IQR method (k=1.5):
[300]

Feel free to tweak the thresholds (threshold in Z-score, k in IQR) or integrate this into a larger data-cleaning pipeline!
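
One case where tweaking the Z-score threshold matters is a small sample. With the population standard deviation, no value in a sample of n points can have |Z| greater than sqrt(n − 1), so a threshold of 3 can never fire when n ≤ 10. The sketch below illustrates this on a shortened, made-up version of the example data; `zscore_outliers` is a hypothetical helper written for this illustration, not a function from the program above.

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Hypothetical helper: values whose |Z| (population std) exceeds threshold."""
    arr = np.asarray(values, dtype=float)
    sigma = arr.std()  # ddof=0
    if sigma == 0:
        return []  # all values identical, nothing to flag
    z = np.abs((arr - arr.mean()) / sigma)
    return arr[z > threshold].tolist()

# Assumed toy data: 8 ages with one obvious outlier.
small = [18, 20, 22, 21, 19, 20, 21, 300]

max_z = np.sqrt(len(small) - 1)                       # upper bound on |Z| for n = 8
default_hits = zscore_outliers(small)                 # threshold=3.0
relaxed_hits = zscore_outliers(small, threshold=2.5)  # lowered threshold

print(f"max possible |Z| for n={len(small)}: {max_z:.3f}")
print("threshold 3.0:", default_hits)
print("threshold 2.5:", relaxed_hits)
```

Here the bound is about 2.646, so the default threshold misses 300 while a threshold of 2.5 catches it. The IQR method does not have this limitation, which is one reason to run both.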
