In this article, I will demonstrate how to detect and remove outliers from your dataset. What is an outlier? In statistics, an outlier is a data point that differs significantly from other observations. Outlier detection applies to continuous features; it doesn't make much sense on categorical features. Before performing any outlier analysis, make sure there are no missing values in your dataset.
Matplotlib offers several ways to visualize outliers in a dataset, such as the box plot and the scatter plot (1-dimensional and 2-dimensional).
BOX-PLOT:
import matplotlib.pyplot as plt
%matplotlib inline
plt.boxplot(df.Age)
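Scatter-Plot (1-Dimensional & 2-Dimensional)
#1-Dimensional: Age plotted against a constant zero baseline
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(df.Age,np.zeros_like(df.Age))
#2-Dimensional: Age plotted against Survived
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(df.Age,df.Survived)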
The methods I will use in this article are (i) IQR (Interquartile Range), (ii) Z-Score and (iii) DBSCAN.
The dataset I am going to use for this article is the famous Titanic dataset. I have made some modifications to it, such as creating new features and removing some irrelevant ones. There are two continuous features in the dataset: Age and Fare.
Before performing any outlier analysis, I am going to train a RandomForestClassifier on the dataset to check the accuracy of the model with the outliers present. After applying each outlier technique, we will check the accuracy of the model again.
#Loading the libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix,accuracy_score
#Split the data into train and test
X_train,X_test,y_train,y_test=train_test_split(df1.iloc[:,df1.columns!='Survived'],df1.Survived,test_size=0.3,random_state=0)
#train the model
model_rf=RandomForestClassifier(n_estimators=100)
model_rf.fit(X_train,y_train)
pred=model_rf.predict(X_test)
#check the accuracy of the model
accuracy_score(y_test,pred)
#The base model gave an accuracy score of 0.8208
Let's get started!!
IQR(Inter-Quartile-Range)
The IQR describes the middle 50% of values when ordered from lowest to highest. To find the interquartile range (IQR), first find the median (middle value) of the lower half and of the upper half of the data. These values are quartile 1 (Q1) and quartile 3 (Q3). The IQR is the difference between Q3 and Q1.
After calculating the IQR, we calculate the lower and upper whisker limits: Q1 - 1.5 × IQR and Q3 + 1.5 × IQR. Any value that falls outside this range is treated as an outlier.
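To make the arithmetic concrete, here is a minimal sketch of the whisker calculation on a small made-up array. The `ages` values are illustrative only and are not taken from the Titanic data.

```python
import numpy as np

# Toy ages (illustrative data, not from the Titanic dataset);
# 95 is deliberately far away from the rest.
ages = np.array([22, 25, 27, 29, 30, 31, 33, 35, 38, 95])

# Q1 and Q3 are the 25th and 75th percentiles.
q1, q3 = np.percentile(ages, [25, 75])
iqr = q3 - q1

# Whisker limits with the usual 1.5 multiplier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Anything outside [lower, upper] is flagged as an outlier.
outliers = ages[(ages < lower) | (ages > upper)]
print(q1, q3, iqr, lower, upper, outliers)
```

With NumPy's default linear interpolation, Q1 is 27.5 and Q3 is 34.5, so the limits are 17.0 and 45.0 and only the value 95 is flagged.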
out=['Age','Fare']
#assumption: the outliers are masked in df1, the DataFrame the model is trained on
for i in out:
    q25,q75=np.percentile(df1[i],[25,75])
    iqr=q75-q25
    minimum=q25-(iqr*1.5)
    maximum=q75+(iqr*1.5)
    #assigning nan to the outliers
    df1.loc[(df1[i]<minimum)|(df1[i]>maximum),i]=np.nan
#Accuracy: 0.7985
What? The accuracy actually dropped! That is because IQR with the usual 1.5 multiplier is quite aggressive and sometimes treats normal values as outliers, which can reduce the accuracy of the model. Let's perform IQR again, but this time we will replace 1.5 with 3 so that only the extreme outliers are removed from the dataset.
out=['Age','Fare']
#assumption: the outliers are masked in df1, the DataFrame the model is trained on
for i in out:
    q25,q75=np.percentile(df1[i],[25,75])
    iqr=q75-q25
    minimum=q25-(iqr*3)
    maximum=q75+(iqr*3)
    #assigning nan to the outliers
    df1.loc[(df1[i]<minimum)|(df1[i]>maximum),i]=np.nan
#imputing nan values
df1['Age']=df1.Age.fillna(df1.Age.mean())
df1['Fare']=df1.Fare.fillna(df1.Fare.mean())
#Accuracy: 0.8320
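To see how the multiplier changes what gets flagged, here is a small sketch on a made-up, right-skewed sample. The helper name `iqr_outlier_mask` and the `fares` values are my own illustrative choices, not part of the Titanic notebook.

```python
import numpy as np

def iqr_outlier_mask(values, k):
    """Return a boolean mask that is True where a value lies outside
    [Q1 - k*IQR, Q3 + k*IQR]. (Hypothetical helper for illustration.)"""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Illustrative right-skewed sample (not the real Fare column).
fares = np.array([7, 8, 8, 9, 10, 13, 15, 26, 30, 52, 80, 512])

mild = iqr_outlier_mask(fares, k=1.5)     # the usual whisker rule
extreme = iqr_outlier_mask(fares, k=3.0)  # only the extreme outliers

print("k=1.5 flags:", fares[mild])
print("k=3.0 flags:", fares[extreme])
```

On this sample the 1.5 rule flags both 80 and 512, while the multiplier of 3 flags only 512, which is the behaviour the second experiment above relies on.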
Z-Score:
Z-scores can quantify how unusual an observation is when your data follow the normal distribution. A Z-score is the number of standard deviations above or below the mean that a value falls. For example, a Z-score of 2 indicates that an observation is two standard deviations above the mean, while a Z-score of -2 signifies it is two standard deviations below the mean. A Z-score of zero represents a value that equals the mean.
To calculate the Z-score for an observation, take the raw measurement, subtract the mean, and divide by the standard deviation. Mathematically, the formula for that process is the following:
$$Z = \frac{X - \mu}{\sigma}$$
where X is the raw measurement, \mu is the mean and \sigma is the standard deviation.
The further away an observation's Z-score is from zero, the more unusual it is. A standard cut-off for finding outliers is a Z-score of +/-3 or further from zero.
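Before applying this to the Titanic features, here is a minimal vectorized sketch of the same calculation on made-up data. The helper name `z_scores` and the sample values are assumptions for illustration only.

```python
import numpy as np

def z_scores(values):
    """Z-score of every value: (x - mean) / standard deviation.
    (Hypothetical helper for illustration.)"""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

# Illustrative sample: twenty ages close to 30 plus one extreme value.
ages = np.array([28, 29, 30, 31, 32] * 4 + [95])

z = z_scores(ages)
threshold = 3

# Keep the values whose Z-score is more than 3 standard deviations from the mean.
flagged = ages[np.abs(z) > threshold]
print(flagged)
```

Only the value 95 ends up beyond the +/-3 cut-off here; the other ages all have Z-scores well inside it.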
def outliers_z_score(data):
    #a fresh list on every call, so Age outliers do not leak into the Fare result
    outliers=[]
    threshold = 3
    mean_y = np.mean(data)
    stdev_y = np.std(data)
    for i in data:
        z_score=(i-mean_y)/stdev_y
        if np.abs(z_score)>threshold:
            outliers.append(i)
    return outliers
#assumption: the outliers are masked in df1, the DataFrame the model is trained on
#age feature
a=outliers_z_score(df1.Age)
df1.loc[df1.Age.isin(a),'Age']=np.nan
#Fare feature
b=outliers_z_score(df1.Fare)
df1.loc[df1.Fare.isin(b),'Fare']=np.nan
#imputing nan values
df1['Age']=df1.Age.fillna(df1.Age.mean())
df1['Fare']=df1.Fare.fillna(df1.Fare.mean())
#Accuracy: 0.84328
One drawback of the Z-score is that it doesn't work well when the number of observations is very small, say 12 observations (this rarely happens in real-world data). In that case, we can use the Modified Z-Score method.
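As a rough sketch of the Modified Z-Score: it replaces the mean with the median and the standard deviation with the median absolute deviation (MAD), using the commonly cited form 0.6745 * (x - median) / MAD with a cut-off of 3.5. The helper name `modified_z_scores` and the 12 sample values below are assumptions for illustration, not part of the Titanic workflow.

```python
import numpy as np

def modified_z_scores(values):
    """Modified Z-score: 0.6745 * (x - median) / MAD.
    (Hypothetical helper for illustration.)"""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    # MAD = median of the absolute deviations from the median.
    mad = np.median(np.abs(values - median))
    if mad == 0:
        raise ValueError("MAD is zero; the modified Z-score is undefined")
    return 0.6745 * (values - median) / mad

# Illustrative small sample of 12 observations with one extreme value.
sample = np.array([21, 23, 24, 25, 26, 27, 28, 29, 30, 31, 33, 90])

scores = modified_z_scores(sample)
flagged = sample[np.abs(scores) > 3.5]  # commonly used cut-off
print(flagged)
```

Because the median and MAD are barely affected by the extreme value itself, the score for 90 stands out clearly even with only 12 observations.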
DBSCAN
DBSCAN is a density-based clustering method used in machine learning to separate regions of high density from regions of low density. Because it is density-based, it does a great job of finding areas of the data with a high density of observations, as opposed to areas that are sparsely populated. DBSCAN can also sort data into clusters of varying shapes, another strong advantage. Points that do not belong to any cluster are treated as noise, and scikit-learn's DBSCAN gives them the label -1; these are the outliers.
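Here is a small sketch of how the -1 label shows up, using made-up (Age, Fare)-like points rather than the real dataset. The `points` array is purely illustrative; `eps=3` and `min_samples=2` simply mirror the values used in this article.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Illustrative points: two dense groups and one isolated point.
points = np.array([
    [22, 7.0], [23, 7.5], [24, 8.0], [25, 8.5],       # first dense group
    [40, 30.0], [41, 30.5], [42, 31.0], [41, 29.5],   # second dense group
    [70, 500.0],                                       # isolated point
])

# eps is the neighbourhood radius; min_samples is the number of points
# (including the point itself) needed to form a dense region.
labels = DBSCAN(eps=3, min_samples=2).fit(points).labels_

# Noise points carry the label -1.
outliers = points[labels == -1]
print(labels)
print(outliers)
```

The isolated point has no neighbour within `eps`, so it is labelled -1, while the two dense groups each get their own cluster label.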
from sklearn.cluster import DBSCAN
clustering = DBSCAN(eps=3, min_samples=2).fit(df1.iloc[:,0:2])
clustering.labels_
#Play around with the eps and min_samples values to get the best results
#creating a new column in the dataset
df1['outlier']=clustering.labels_
#assigning nan's to -1 denoted values in Age and Fare
df1.loc[df1.outlier==-1,['Age','Fare']]=np.nan
#imputing nan values
df1['Age']=df1.Age.fillna(df1.Age.mean())
df1['Fare']=df1.Fare.fillna(df1.Fare.mean())
#Accuracy:0.83208
As you can see, removing outliers from the dataset can boost the accuracy of the model. Personally, I like using the IQR and Z-Score methods as they are reliable and efficient. I hope you found this helpful in some way.
Some articles that you should go through before performing outlier analysis:
Data Types in ML: https://medium.com/@nitin9809/data-types-for-ml-beginners-b94fc9d88ed
Missing Value Analysis and Imputation: https://medium.com/@nitin9809/missing-value-analysis-and-treatment-for-beginners-2fc382caeb54
GITHUB: https://github.com/nitin689