How to deal with outliers in a dataset using Python?

In the first example, we are going to use the Z-score method to identify and deal with outliers in a dataset using Python.
import numpy as np
from scipy import stats

# Create a sample dataset with outliers
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Calculate the Z-scores for the data points
z_scores = np.abs(stats.zscore(data))

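# Set a threshold for identifying outliers. A cutoff of 3 is common for large
# samples, but with only 10 points |Z| can never exceed sqrt(10 - 1) = 3, and
# the value 100 scores about 2.99, so a cutoff of 3 would flag nothing here.
threshold = 2.5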

# Find the indices of the outliers
outlier_indices = np.where(z_scores > threshold)

# Remove the outliers from the dataset
cleaned_data = np.delete(data, outlier_indices)

print("Original data:", data)
print("Cleaned data:", cleaned_data)

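As a sanity check on the Z-score step, the same scores can be computed by hand. The sketch below assumes the population standard deviation (ddof=0), which is the default for both np.std and scipy.stats.zscore; the names z_manual, max_z and z_upper_bound are illustrative only. It also prints the largest score any sample of this size could reach, which is worth comparing against whatever cutoff is chosen.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Z-score by hand: distance from the mean in units of standard deviation.
# np.std uses the population standard deviation (ddof=0) by default,
# matching the default of scipy.stats.zscore.
mean = data.mean()
std = data.std()
z_manual = np.abs((data - mean) / std)

# Largest |Z| in this sample, and the largest value that any sample of
# this size can produce with a population-std Z-score: sqrt(n - 1).
max_z = z_manual.max()
z_upper_bound = np.sqrt(len(data) - 1)

print("Largest |Z|:", round(float(max_z), 3))
print("Upper bound for n =", len(data), ":", float(z_upper_bound))
```

For this dataset the largest score belongs to the value 100 and falls just under 3, which is also the upper bound for ten points. On small samples a strict cutoff of 3 can therefore never be exceeded, so the threshold has to be chosen with the sample size in mind.
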
In this example, we import the necessary libraries, create a sample dataset containing an outlier, compute the absolute Z-score of each data point, choose a threshold, find the indices of the points whose Z-score exceeds it, and remove those points from the dataset.

In the second example, we are going to use the IQR (Interquartile Range) method to identify and deal with outliers in a dataset using Python.
import numpy as np

# Create a sample dataset with outliers
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Calculate the first and third quartiles
Q1 = np.percentile(data, 25)
Q3 = np.percentile(data, 75)

# Calculate the IQR
IQR = Q3 - Q1

# Set a threshold for identifying outliers (e.g., 1.5 times IQR)
threshold = 1.5 * IQR

# Find the indices of the outliers
outlier_indices = np.where((data < Q1 - threshold) | (data > Q3 + threshold))

# Remove the outliers from the dataset
cleaned_data = np.delete(data, outlier_indices)

print("Original data:", data)
print("Cleaned data:", cleaned_data)

In this example, we create a sample dataset containing an outlier, calculate the first and third quartiles (Q1 and Q3), compute the IQR as Q3 - Q1, flag every point below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR as an outlier, and remove the flagged points from the dataset. For this data, Q1 = 3.25 and Q3 = 7.75, so the bounds are -3.5 and 14.5 and only the value 100 is removed.
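
The IQR steps above can also be wrapped in a small helper so the same rule can be applied to other arrays. This is only a sketch: the function name iqr_outlier_mask and its parameter k are hypothetical, not part of NumPy, and it returns a boolean mask rather than indices so the outliers and the cleaned data can both be selected from one result.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask that is True where a value lies outside the IQR bounds."""
    values = np.asarray(values, dtype=float)
    # First and third quartiles (NumPy's default linear interpolation)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return (values < lower) | (values > upper)

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
mask = iqr_outlier_mask(data)

outliers = data[mask]        # points outside the bounds
cleaned_data = data[~mask]   # everything else

print("Outliers:", outliers)
print("Cleaned data:", cleaned_data)
```

Keeping the mask makes it easy to inspect the flagged points before dropping them, which is often safer than deleting them straight away.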
