In this article, I will demonstrate how to detect and remove outliers from your dataset. What is an outlier? In statistics, an outlier is a data point that differs significantly from other observations. Outlier detection applies only to continuous features; it doesn't make much sense for categorical features. Before performing any outlier analysis, make sure there are no missing values in your dataset.
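As a quick sketch of that pre-check, you can count missing values per column with pandas. The toy frame below is hypothetical and just stands in for your own dataset:

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
df = pd.DataFrame({"Age": [22.0, None, 38.0],
                   "Fare": [7.25, 71.28, 8.05]})

# Count missing values per column before any outlier analysis
missing = df.isnull().sum()
print(missing)
```

If any count is non-zero, impute or drop those values first.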

Matplotlib provides several ways to visualize outliers in a dataset, such as the box-plot and the scatter-plot (1-dimensional and 2-dimensional).

**BOX-PLOT:**

import matplotlib.pyplot as plt

%matplotlib inline

plt.boxplot(df.Age)

**Scatter-Plot(1-Dimensional & 2-Dimensional)**

import matplotlib.pyplot as plt

import numpy as np

%matplotlib inline

#1-dimensional scatter-plot

plt.scatter(df.Age, np.zeros_like(df.Age))

#2-dimensional scatter-plot

plt.scatter(df.Age, df.Survived)

The methods I will use in this article are (i) **IQR (Inter-Quartile Range)**, (ii) **Z-Score** and (iii) **DBSCAN**.

The dataset I am going to use for this article is the famous **Titanic Dataset**. I have made some modifications to it, such as creating new features and removing irrelevant ones. There are two continuous features in the dataset: Age and Fare.

Before performing any outlier analysis I am going to train a RandomForestClassifier algorithm on the dataset to check the accuracy of the model with the outliers present. After performing each outlier technique, we are going to check the accuracy of the model.

#Loading the libraries

from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import train_test_split

from sklearn.metrics import confusion_matrix, accuracy_score

#Split the data into train and test

X_train, X_test, y_train, y_test = train_test_split(df1.iloc[:, df1.columns != 'Survived'], df1.Survived, test_size=0.3, random_state=0)

#Train the model

model_rf = RandomForestClassifier(n_estimators=100)

model_rf.fit(X_train, y_train)

#Check the accuracy of the model

pred = model_rf.predict(X_test)

accuracy_score(y_test, pred)

#The base model gave an accuracy score of 0.8208

Let's get started!!

**IQR(Inter-Quartile-Range)**

The **IQR** describes the middle 50% of values when ordered from lowest to highest. To find the **interquartile range** (**IQR**), first find the medians (middle values) of the lower and upper halves of the **data**. These values are **quartile** 1 (Q1) and **quartile** 3 (Q3). The **IQR** is the difference between Q3 and Q1.

After calculating the IQR value, we will then calculate the lengths of the minimum and maximum whiskers. Any value that is outside the range of the whiskers is said to be an outlier.
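As a quick self-contained illustration with made-up numbers (not the Titanic data), the whisker calculation works like this:

```python
import numpy as np

# Made-up sample with one obvious extreme value
values = np.array([10, 12, 13, 14, 15, 16, 18, 95])

q25, q75 = np.percentile(values, [25, 75])
iqr = q75 - q25
minimum = q25 - 1.5 * iqr   # lower whisker
maximum = q75 + 1.5 * iqr   # upper whisker

# Anything outside the whiskers is flagged as an outlier
outliers = values[(values < minimum) | (values > maximum)]
print(outliers)   # only 95 falls outside the whiskers
```

The same logic is applied below to the Age and Fare columns of the dataset.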

out = ['Age', 'Fare']

for i in out:

    q25, q75 = np.percentile(df[i], [25, 75])

    iqr = q75 - q25

    minimum = q25 - (iqr * 1.5)

    maximum = q75 + (iqr * 1.5)

    #assigning nan to the outliers (.loc avoids chained-indexing pitfalls)

    df.loc[df[i] < minimum, i] = np.nan

    df.loc[df[i] > maximum, i] = np.nan

#Accuracy: 0.7985

What??? The accuracy actually dropped! That is because IQR is a very aggressive method: it sometimes treats normal values as outliers, which reduces the accuracy of the model. Let's perform IQR again, but this time replacing 1.5 with 3, so that only the extreme outliers are removed from the dataset.

out = ['Age', 'Fare']

for i in out:

    q25, q75 = np.percentile(df[i], [25, 75])

    iqr = q75 - q25

    minimum = q25 - (iqr * 3)

    maximum = q75 + (iqr * 3)

    #assigning nan to the outliers

    df.loc[df[i] < minimum, i] = np.nan

    df.loc[df[i] > maximum, i] = np.nan

#imputing nan values

df1['Age'] = df1.Age.fillna(df1.Age.mean())

df1['Fare'] = df1.Fare.fillna(df1.Fare.mean())

#Accuracy: 0.8320

**Z-Score:**

Z-scores can quantify the unusualness of an observation when your data follow the normal distribution. A Z-score is the number of standard deviations above or below the mean that a value falls. For example, a Z-score of 2 indicates that an observation is two standard deviations above the average, while a Z-score of -2 signifies it is two standard deviations below the mean. A Z-score of zero represents a value that equals the mean.

To calculate the Z-score for an observation, take the raw measurement, subtract the mean, and divide by the standard deviation. Mathematically: z = (x - μ) / σ.

The further away an observation's Z-score is from zero, the more unusual it is. A standard cut-off for flagging outliers is a Z-score of **±3** or further from zero.
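A minimal self-contained sketch of that cut-off, on a made-up sample rather than the Titanic data:

```python
import numpy as np

# Toy sample: 29 values at 50 plus one far-away value
data = np.append(np.full(29, 50.0), 100.0)

mean_y = np.mean(data)
stdev_y = np.std(data)
z_scores = (data - mean_y) / stdev_y

# Flag observations more than 3 standard deviations from the mean
flagged = data[np.abs(z_scores) > 3]
print(flagged)   # only the value 100.0 is flagged
```

The function below applies the same idea to each continuous feature of the dataset.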

def outliers_z_score(data):

    threshold = 3

    outliers = []   #collect outliers locally so repeated calls don't mix results

    mean_y = np.mean(data)

    stdev_y = np.std(data)

    for i in data:

        z_score = (i - mean_y) / stdev_y

        if np.abs(z_score) > threshold:

            outliers.append(i)

    return outliers

#Age feature

a = outliers_z_score(df.Age)

for i in a:

    df.loc[df.Age == i, 'Age'] = np.nan

#Fare feature

b = outliers_z_score(df.Fare)

for i in b:

    df.loc[df.Fare == i, 'Fare'] = np.nan

#imputing nan values

df1['Age'] = df1.Age.fillna(df1.Age.mean())

df1['Fare'] = df1.Fare.fillna(df1.Fare.mean())

#Accuracy: 0.84328

One drawback of the Z-score is that it doesn't work well when the number of observations is very small, say 12 observations (which rarely happens in the real world). In that case, we can use the **Modified Z-Score** method.
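The article does not give code for the Modified Z-Score, but one common formulation (an assumption on my part, not the author's implementation) replaces the mean and standard deviation with the median and the median absolute deviation (MAD), scaled by the constant 0.6745, with 3.5 as the usual cut-off:

```python
import numpy as np

def modified_z_scores(data):
    # Median and MAD replace mean and std, which makes the score
    # robust when the sample is small
    data = np.asarray(data, dtype=float)
    median_y = np.median(data)
    mad = np.median(np.abs(data - median_y))
    return 0.6745 * (data - median_y) / mad

# Toy 12-observation sample with one extreme value
sample = np.array([10, 11, 12, 11, 10, 12, 11, 10, 11, 12, 11, 60])
scores = modified_z_scores(sample)
outliers = sample[np.abs(scores) > 3.5]   # 3.5 is the usual cut-off
print(outliers)   # only 60 is flagged
```

Because the median and MAD are barely affected by a single extreme value, the method stays reliable even on a dozen observations.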

**DBSCAN**

DBSCAN is a clustering method that is used in machine learning to separate clusters of high density from clusters of low density. Given that DBSCAN is a density-based clustering algorithm, it does a great job of seeking areas in the data that have a high density of observations, versus areas of the data that are not very dense with observations. DBSCAN can sort data into clusters of varying shapes as well, another strong advantage. **DBSCAN assigns -1 value to the outliers in the dataset.**
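Before applying it to the Titanic features, here is a minimal self-contained sketch on made-up 2-D points showing the -1 labelling (the `eps` and `min_samples` values here are arbitrary choices for this toy data):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight synthetic clusters plus one far-away point
X = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.1],
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.1],
              [50.0, 50.0]])            # the lone outlier

labels = DBSCAN(eps=0.5, min_samples=2).fit(X).labels_
print(labels)   # the isolated point gets label -1
```

The dense groups each receive a cluster label, while the isolated point falls in no eps-neighbourhood and is marked -1.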

from sklearn.cluster import DBSCAN

clustering = DBSCAN(eps=3, min_samples=2).fit(df1.iloc[:, 0:2])

clustering.labels_

#Play around with the eps and min_samples values to get the best results

#creating a new column in the dataset

df1['outlier'] = clustering.labels_

#assigning nan to the rows labelled -1 in Age and Fare

df1.loc[df1.outlier == -1, 'Age'] = np.nan

df1.loc[df1.outlier == -1, 'Fare'] = np.nan

#imputing nan values

df1['Age'] = df1.Age.fillna(df1.Age.mean())

df1['Fare'] = df1.Fare.fillna(df1.Fare.mean())

#Accuracy: 0.83208

So, as you can see, removing outliers from the dataset can boost the accuracy of the model. Personally, I like using the IQR and Z-Score methods, as they are reliable and efficient. I hope you found this helpful in some way.

Some articles that you should go through before performing outlier analysis:

Data Types in ML: https://medium.com/@nitin9809/data-types-for-ml-beginners-b94fc9d88ed

Missing Value Analysis and Imputation: https://medium.com/@nitin9809/missing-value-analysis-and-treatment-for-beginners-2fc382caeb54

GITHUB: https://github.com/nitin689