Tuesday, 4 August 2015

Detecting outliers using clustering algorithm

Existence of outliers in a dataset poses many difficulties in data modeling. For example, in regression analysis, a group of outlying data points can seriously distort the slope of the regression line to produce a bad fit. Similarly, in k-means clustering, existence of outliers can drag the centroid from its natural point to another point which might not represent the cluster characteristics properly. Almost all statistical data analysis techniques get affected by these outlying points. In group comparison using t-test, presence of one outlier can cause type-II error which is even more critical than type-I error. Thus effective detection of outliers is an important part of data analysis. In some cases those outliers need special treatments and in other cases, they can be safely removed. In this blog, I am only going to discuss detection of outliers because treatments vary from case to case. 

Outliers can be univariate or multivariate. Box and Whisker plot is particularly useful in detecting univariate outliers. Whereas for multivariate cases, other techniques are to be used. One of the such methods of detecting outliers is using Mahalanobis distance. Extracting Local Outlier Factor is another way of detecting multivariate outliers. It works on the basis of nearest-neighborhood concept. Density based clustering algorithm (DBSCAN) can also be used for detecting such outliers. DBSCAN is a special clustering algorithm which can detect irregular shaped clusters and is quite robust to noises and outliers. The algorithm detects clusters on the basis of density reachability concept which resembles nearest-neighborhood concept. DBSCAN needs two parameters to be declared in the beginning, i.e. MinPts and epsilon. However, deciding the optimum value is a problem which needs to be solved first. In this blog, I am going to use the same wholesale dataset which I used in my earlier blog on cluster analysis. The dataset can be downloaded from here.  

The dataset contains outliers and it contains 6 variables. Hence, visualizing the data requires dimension reduction technique. A simple PCA (Principal Component Analysis) is good enough for this job. The code is given below.

> pca=princomp(wholesale[,-c(1,2)],scores = T)

> pca
Call:
princomp(x = wholesale[, -c(1, 2)], scores = T)

Standard deviations:
   Comp.1    Comp.2    Comp.3    Comp.4    Comp.5    Comp.6 
12830.468 12046.640  5008.277  3970.892  2319.592  1482.779 

 6  variables and  440 observations.

> summary(pca)
Importance of components:
                             Comp.1       Comp.2       Comp.3       Comp.4       Comp.5       Comp.6
Standard deviation     1.283047e+04 1.204664e+04 5.008277e+03 3.970892e+03 2.319592e+03 1.482779e+03
Proportion of Variance 4.596136e-01 4.051723e-01 7.003008e-02 4.402344e-02 1.502212e-02 6.138475e-03
Cumulative Proportion  4.596136e-01 8.647859e-01 9.348160e-01 9.788394e-01 9.938615e-01 1.000000e+00


The first two variables are not taken into considerations as they are categorical in nature. As per the output, first two components explain 86.4% of total variance and first three components explains 93.5% of total variance. Thus for visualization purpose, if the first two components are taken, the plot will give good approximation of the data distribution. In the above code, score= is kept true so that component scores are made available for further processing. Based on the scores, a 2D scatter plot can now be drawn easily. The code is given below.

library(dplyr)
library(ggplot2)
pca.scores=as.data.frame(pca$scores)
pca.scores %>% select(.,Comp.1,Comp.2) %>%
  qplot(Comp.1,Comp.2,data=.)

As per the plot given below, there is quite a large number of outliers.


The outliers are required to be identified and if required they are needed to be removed. We will see how DBSCAN clustering technique can be used to identify these outliers.

DBSCAN with R

DBSCAN is made available in R via fpc library. As I mentioned earlier, the two important parameters in DBSCAN are MinPts and epsilon. Theory on DBSCAN will not be discussed here as detail information can be found in many other sites. I shall, rather, try to find out how to decide optimum values of MinPts and epsilon. To do this, we need a distance matrix. The following codes will generate a few plots which are helpful in deciding (approximate) optimum values of there two parameters.

# Scaled distance matrix
dist.Mat=dist(scale(wholesale[,-c(1,2)]),method = "euclidean")
hist=hist(dist.Mat,breaks = 100,col = "red",main = "Histogram of Distance Matrix",xlab="Distance")

> head(hist$breaks,n=15)
 [1] 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4 2.6 2.8

> head(hist$counts,n=15)
 [1]   95 1115 2920 4617 5583 6810 7371 7300 7041 6147 5566 4998 4782 4033 3614


It is possible to get a histogram of these distance values. The plot is shown below.

Due to presence of outliers, the histogram is also skewed. The peak occurs at around 1.5 value. It means, as the inter point distance is increased, more and more data points are becoming neighbors till the distance reaches 1.5 and afterward, the density starts decreasing. In this case, we essentially see a single mode but in other cases, this diagram may look multi-modal. In those situations, the first maxima can be chosen as the epsilon value. In this case, as shown in the output above, we can choose epsilon value as 1.5 or 1.4. The next thing is to decide MinPts value. To get an approximate MinPts value the following codes are run.

mat=as.matrix(dist.Mat)
mat.long=reshape2::melt(mat)
mat.long$flag=ifelse(mat.long$value>0.0 & mat.long$value<=1.5,1,0)
library(dplyr)
DF=summarise(group_by(mat.long,Var1),count=sum(flag))
hist(DF$count,col="red",main="Histogram of # of points within 1.5", xlab="# of points within 1.5)



The histogram is shown below.


As per the histogram, from 150 onwards there is, a kind of, steady increase and thereafter decrease in the frequencies. Thus, in this case, MinPts value can be fixed at 150.

Once we have got the approximate values of MinPts and epsilon, we can run DBSCAN on the basis of these values. Clustering the wholesale data using DBSCAN is quite easy and is shown in the codes below.


dbscan.fit=dbscan(wholesale[,-c(1,2)],eps = 1.5,MinPts = 150,scale = T,method = "raw")
pca.scores %>% select(.,Comp.1,Comp.2) %>%
  qplot(Comp.1,Comp.2,data=.)+geom_point(col=dbscan.fit$cluster+2)


The red points are outliers in the plot given below.


As it can be seen that most of the outliers are correctly identified. A few red data points are lying within the green data points and that is due to third component which is not shown. Those red points are outliers in vertical direction and hence their projections are lying within the green points.

Thus, density based clustering can be effectively used for outlier detection.

This is just one of the many methods.

Comments are welcome..!!


2 comments:

  1. Thanks for sharing this pretty post to our knowledge, SAS is a program that assists to retrieve, managing and uploading the data & simply it’s an integration system of software for performing these actions, thanks for taking your time to discuss about this topic.
    Regards,

    SAS Training in Chennai|SAS Training Institutes in Chennai

    ReplyDelete
  2. Nice post!! Thanks for sharing. Happy to read your Blog. If you want to know about Google Mesh Setup you can visit here.

    ReplyDelete

EM Algorithm and its usage (Part 2) EM algorithm is discussed in the previous post related to the tossing of coins. The same algorithm is q...