How to deal with outliers in a dataset using Python?
In the first example, we are going to use the Z-score method to identify and deal with outliers in a dataset using Python.
```python
import numpy as np
from scipy import stats

# Create a sample dataset with outliers
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Calculate the absolute Z-scores for the data points
z_scores = np.abs(stats.zscore(data))

# Set a threshold for identifying outliers. A cutoff of 3 is common for
# large samples, but with only 10 points no |Z| can exceed 3 (here 100
# scores about 2.99), so a lower cutoff is used.
threshold = 2.5

# Find the indices of the outliers
outlier_indices = np.where(z_scores > threshold)[0]

# Remove the outliers from the dataset
cleaned_data = np.delete(data, outlier_indices)

print("Original data:", data)
print("Cleaned data:", cleaned_data)
```
In this example, we first import the necessary libraries, create a sample dataset with outliers, calculate the Z-scores for the data points, set a threshold for identifying outliers, find the indices of the outliers, and finally remove the outliers from the dataset.

In the second example, we are going to use the IQR (Interquartile Range) method to identify and deal with outliers in a dataset using Python.
```python
import numpy as np

# Create a sample dataset with outliers
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Calculate the first and third quartiles
Q1 = np.percentile(data, 25)
Q3 = np.percentile(data, 75)

# Calculate the IQR
IQR = Q3 - Q1

# Set a threshold for identifying outliers (e.g., 1.5 times IQR)
threshold = 1.5 * IQR

# Find the indices of the outliers
outlier_indices = np.where((data < Q1 - threshold) | (data > Q3 + threshold))[0]

# Remove the outliers from the dataset
cleaned_data = np.delete(data, outlier_indices)

print("Original data:", data)
print("Cleaned data:", cleaned_data)
```
In this example, we create a sample dataset with outliers, calculate the first and third quartiles (Q1 and Q3), and take their difference as the IQR. We then set the threshold to 1.5 times the IQR, treat any value below Q1 minus the threshold or above Q3 plus the threshold as an outlier, find the indices of those values, and remove them from the dataset.
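Both methods can also be written as small helper functions that return a boolean mask, which makes it easy to compare them on the same data. The sketch below is only an illustration: the function names `zscore_outlier_mask` and `iqr_outlier_mask` are hypothetical and not part of NumPy or SciPy, and the Z-score cutoff of 2.5 is an assumption chosen for this 10-point sample. With the population standard deviation (the default used by `scipy.stats.zscore`), no absolute Z-score in a sample of 10 values can exceed 3.

```python
import numpy as np


def zscore_outlier_mask(values, threshold=3.0):
    """Return True where the absolute Z-score exceeds the threshold."""
    values = np.asarray(values, dtype=float)
    std = values.std()  # population std (ddof=0), as in scipy.stats.zscore
    if std == 0:
        # All values are identical, so nothing can be an outlier
        return np.zeros(values.shape, dtype=bool)
    z_scores = np.abs((values - values.mean()) / std)
    return z_scores > threshold


def iqr_outlier_mask(values, k=1.5):
    """Return True where a value lies more than k * IQR outside Q1 or Q3."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Assumed cutoff of 2.5: a cutoff of 3 would flag nothing in this small sample
z_mask = zscore_outlier_mask(data, threshold=2.5)
iqr_mask = iqr_outlier_mask(data)

print("Z-score outliers:", data[z_mask])
print("IQR outliers:", data[iqr_mask])
print("Cleaned data (IQR):", data[~iqr_mask])
```

Working with a mask rather than indices means the outliers can be inspected with `data[mask]` before they are dropped with `data[~mask]`.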