Outliers: A Must-Know Concept in the Data Industry

Outliers are data points that deviate significantly from the rest of the data. They can occur due to various reasons such as measurement errors, natural variations in the data, or rare events. Outliers can have a significant impact on statistical analyses and can lead to incorrect conclusions if not handled appropriately.

In this blog post, we will discuss the types of outliers, how to handle them, and the interquartile range (IQR) method for detecting outliers.

Types of Outliers

There are two main types of outliers:

  1. Univariate Outliers: Univariate outliers occur when a data point is significantly different from the other data points in the same variable. For example, if we have a dataset of student grades, a univariate outlier might be a student who scored much higher or lower than the rest of the students.

  2. Multivariate Outliers: A multivariate outlier is a data point whose combination of values across two or more variables is unusual, even if each value looks ordinary on its own. For example, if we have a dataset of student grades and their ages, a multivariate outlier might be a student whose grade is unremarkable for the class as a whole but much higher or lower than the grades of the other students in their age group.
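
To make the difference between the two types concrete, here is a minimal sketch. The data (hours studied and exam scores for 11 students) is hypothetical, and using the Mahalanobis distance to score multivariate outliers is an assumption of this example, one common choice among several. The last student is not extreme in either column, but the combination of many hours and a low score is unusual.

```python
import numpy as np

# Hypothetical data: hours studied and exam score for 11 students.
# The last student (9 hours, score 20) is the unusual combination.
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 9], dtype=float)
score = np.array([12, 19, 31, 42, 48, 61, 69, 82, 88, 101, 20], dtype=float)
X = np.column_stack([hours, score])

# Univariate view: z-score of each value within its own column.
z_scores = (X - X.mean(axis=0)) / X.std(axis=0)

# Multivariate view: Mahalanobis distance of each row from the mean,
# which takes the correlation between the two columns into account.
centered = X - X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
distances = np.sqrt(np.einsum("ij,jk,ik->i", centered, inv_cov, centered))

print("z-scores of last student:", np.round(z_scores[-1], 2))
print("index of largest Mahalanobis distance:", int(np.argmax(distances)))
```

Looking at each column separately, the last student stays well within two standard deviations, yet that row has the largest Mahalanobis distance in the dataset.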

Handling Outliers

There are several methods for handling outliers, including:

  1. Removing Outliers: One way to handle outliers is to remove them from the dataset. However, this method can lead to loss of information and can also bias the results if the outliers are not random. Therefore, it is important to carefully consider the reasons for the outliers and the impact of their removal on the analysis.

  2. Transforming Data: Another method for handling outliers is to transform the data using mathematical functions such as logarithmic or square root transformations. These compress large values, which reduces the impact of high outliers and can bring right-skewed data closer to a normal distribution. Note that the logarithm requires strictly positive values and the square root requires non-negative values.

  3. Robust Methods: Robust methods are statistical methods that are less affected by outliers. For example, the median is a robust measure of central tendency that is less affected by outliers than the mean. Similarly, the interquartile range (IQR) is a robust measure of spread that can be used to detect outliers.
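
As a rough illustration of the last two methods, the sketch below uses a small hypothetical array with one extreme value. It compares the mean with the median and shows how a log transform shrinks the gap between the extreme value and the rest. The numbers are made up for this example.

```python
import numpy as np

# Hypothetical values; the last one is far from the rest.
values = np.array([30, 32, 35, 38, 40, 42, 45, 1000], dtype=float)

# Robust vs. non-robust centre: the mean is pulled up by the
# extreme value, while the median stays with the bulk of the data.
mean_value = values.mean()
median_value = np.median(values)
print("mean:  ", mean_value)
print("median:", median_value)

# A log transform compresses the gap between the extreme value and the rest.
log_values = np.log(values)
ratio_before = values.max() / np.median(values)
ratio_after = log_values.max() / np.median(log_values)
print("max / median before log:", round(ratio_before, 2))
print("max / median after log: ", round(ratio_after, 2))
```

The transform does not remove the extreme value; it only makes it less dominant, so it is still worth checking why the value is there.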

Interquartile Range (IQR) Method

The IQR method is a simple and robust method for detecting outliers. It involves calculating the IQR, which is the difference between the 75th percentile (Q3) and the 25th percentile (Q1) of the data: IQR = Q3 - Q1. Any data point that falls below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR is considered an outlier.

Let's implement the IQR method in Python.

We will use the NumPy library both to generate a random dataset and to calculate the quartiles and the IQR.

import numpy as np

# Generate a random dataset
data = np.random.normal(size=1000)

# Calculate the IQR
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Detect outliers
lower_bound = q1 - 1.5*iqr
upper_bound = q3 + 1.5*iqr
outliers = [x for x in data if x < lower_bound or x > upper_bound]

# Print the number of outliers
print(f"Number of outliers: {len(outliers)}")

In this code, we first generate a random dataset of 1000 normally distributed data points using the numpy library. We then calculate the IQR using the percentile function from numpy. Finally, we detect outliers using the lower and upper bounds and count the number of outliers.

Now let's see how we can remove the outliers using the IQR method.

import numpy as np

# Generate a sample dataset
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100, 101])

# Calculate the interquartile range (IQR)
q1 = np.percentile(data, 25)
q3 = np.percentile(data, 75)
iqr = q3 - q1

# Calculate the lower and upper bounds
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Remove the outliers
data = data[(data >= lower_bound) & (data <= upper_bound)]

print(data)

In this code, we first create a sample dataset with 12 values, including two outliers. We then calculate the interquartile range (IQR) using the percentile function from NumPy. Here Q1 is 3.75 and Q3 is 9.25, so the IQR is 5.5, the lower bound is -4.5, and the upper bound is 17.5. We then remove the outliers by selecting only the values within the bounds using Boolean indexing.

Output:

 [ 1  2  3  4  5  6  7  8  9 10]

In this output, we can see that the outliers with values of 100 and 101 have been removed, and only the non-outlier values remain.
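
One caveat when applying this rule: it assumes that extreme values are a small share of the data. If more than about a quarter of the values sit in one tail, Q3 (or Q1) falls among them, the IQR becomes very wide, and nothing gets flagged. The sketch below shows this with a hypothetical array in which 4 of 12 values are extreme.

```python
import numpy as np

# Hypothetical array: 4 of the 12 values (one third) are extreme.
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 90, 95, 99, 100])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Q3 lands among the extreme values, so the upper bound
# ends up above every value in the array.
flagged = data[(data < lower_bound) | (data > upper_bound)]

print("Q3:", q3, "upper bound:", upper_bound)
print("flagged values:", flagged)
```

It is worth printing the bounds and looking at them before trusting the filtered result.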

Conclusion

Outliers can have a significant impact on the results of statistical analyses and machine learning models, so it is important to identify and handle them appropriately. There are several methods for identifying and handling outliers, including removing outliers, imputing outliers, transforming data, and using robust statistical methods. The choice of method depends on the nature of the data and the specific analysis or modeling task. Handling outliers with care makes our results more accurate and reliable. I hope you got value out of this article. Subscribe to the newsletter to get more blogs like this.

Thanks :)