Sunday, 5 October 2014

Cluster Analysis with R

Introduction

In the previous two posts I used RapidMiner and SAS to cluster a dataset. In today's post I shall explain how cluster analysis can be done using R. There is always a concern about the optimum number of clusters, and almost every statistical package offers some index or other to determine it. Thanks to extensive community support, R has many packages that make it a very strong analytics tool, and as far as clustering is concerned, R can calculate 30 different indices to determine the optimum number of clusters. In this post, I shall demonstrate some of R's clustering capabilities on the same wholesale dataset.

Clustering with R

R performs every analysis through functions, and those functions come from different packages. I shall begin with the kmeans() function. Before that, two variables, Channel and Region, are removed from the analysis because they are nominal in nature. Strictly speaking, there is no single algorithm called "K-Means": it is either the MacQueen algorithm, the Lloyd-Forgy algorithm or the Hartigan-Wong algorithm. In most cases the MacQueen algorithm is used, but R uses the Hartigan-Wong algorithm by default, and it has been found that Hartigan-Wong performs better than the other two in most situations. The Hartigan-Wong algorithm also performs better with more than one random start, hence the 'nstart' option is assigned a value of 5. The code is given below.
set.seed(123)                             # fix the random seed so the random starts are reproducible
wholesale=read.csv(file.choose())         # point to the downloaded .csv file
WSdata=wholesale[,-c(1,2)]                # drop the nominal Channel and Region columns
cmodel=kmeans(WSdata,centers=3,nstart=5)  # 3 clusters, 5 random starts
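As an aside, the underlying algorithm can be chosen explicitly through the 'algorithm' argument of kmeans(); the values below are the ones documented for base R's kmeans(). A minimal sketch comparing the three on this data:
# Fit the same data with each algorithm kmeans() supports ("Forgy" is an alias for "Lloyd")
fitHW=kmeans(WSdata,centers=3,nstart=5,algorithm="Hartigan-Wong")  # the R default
fitLL=kmeans(WSdata,centers=3,nstart=5,algorithm="Lloyd")
fitMQ=kmeans(WSdata,centers=3,nstart=5,algorithm="MacQueen")
# Lower total within-cluster sum of squares means a tighter clustering
c(fitHW$tot.withinss,fitLL$tot.withinss,fitMQ$tot.withinss)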
The kmeans() function also calculates the Within Sum of Squares (WSS), and this can be used to determine the optimum number of clusters: change the 'centers' value and plot WSS against the number of clusters. The elbow point of the plot suggests the optimum value. To generate the plot, WSS values are calculated for runs with 1 to 10 cluster centers. The code is given below.
wss=(nrow(WSdata)-1)*sum(apply(WSdata,2,var))  # for k=1, WSS equals the total sum of squares
for(i in 2:10){
  wss[i]=sum(kmeans(WSdata,centers=i,nstart=5)$withinss)  # total WSS for k clusters
}
plot(1:10,wss,xlab="Number of Clusters",ylab="Within Sum of Squares",type="b")

#The code is adapted from statmethods.net.
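As a side note, the object returned by kmeans() also exposes the total within-cluster sum of squares directly as its tot.withinss component, so the loop can be written more compactly. An equivalent sketch:
# Same WSS curve, built with sapply() and the tot.withinss component
wss=c((nrow(WSdata)-1)*sum(apply(WSdata,2,var)),
      sapply(2:10,function(i) kmeans(WSdata,centers=i,nstart=5)$tot.withinss))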
However, the plot generated is rather smooth with no clear elbow. The plot is given below.
Hence, some other method is required to decide the optimum number of clusters. R has several packages for measuring cluster validity, among them "clusterSim", "cclust", "clv", "clValid" and "NbClust". Of these, the "NbClust" package can calculate all 30 cluster-validity indices while varying the number of cluster centers, and when different indices suggest different optimal values, voting can be used to settle on the optimum. The following one-line call gives the optimum number of clusters for the dataset.
> library(NbClust)
> nClust=NbClust(WSdata,distance="euclidean",method="kmeans",min.nc=2,max.nc=10)
*** : The Hubert index is a graphical method of determining the number of clusters.
                In the plot of Hubert index, we seek a significant knee that corresponds to a 
                significant increase of the value of the measure i.e the significant peak in Hubert
                index second differences plot. 
 
*** : The D index is a graphical method of determining the number of clusters. 
                In the plot of D index, we seek a significant knee (the significant peak in Dindex
                second differences plot) that corresponds to a significant increase of the value of
                the measure. 
 
All 440 observations were used. 
 
******************************************************************* 
* Among all indices:                                                
* 1 proposed 2 as the best number of clusters 
* 11 proposed 3 as the best number of clusters 
* 2 proposed 4 as the best number of clusters 
* 1 proposed 5 as the best number of clusters 
* 1 proposed 7 as the best number of clusters 
* 4 proposed 8 as the best number of clusters 
* 3 proposed 10 as the best number of clusters 

                   ***** Conclusion *****                            
 
* According to the majority rule, the best number of clusters is  3 
 
 
******************************************************************* 
Intentionally, I have not reproduced the individual index values; they would have made this section very crowded. Readers can view them simply by typing 'nClust' and hitting the "Enter" key, or by inspecting the components of the result one at a time, as sketched below.
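A brief sketch (these component names are the ones documented for NbClust):
nClust$All.index        # the index values for each number of clusters tried
nClust$Best.nc          # the number of clusters proposed by each index
nClust$Best.partition   # cluster memberships under the winning partition
The cluster centers and the cluster sizes are given below.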
> fit=kmeans(WSdata,3,nstart=5)
> fit$centers
     Fresh      Milk   Grocery   Frozen Detergents_Paper Delicassen
1  8253.47  3824.603  5280.455 2572.661         1773.058   1137.497
2  8000.04 18511.420 27573.900 1996.680        12407.360   2252.020
3 35941.40  6044.450  6288.617 6713.967         1039.667   3049.467
> fit$size
[1] 330  50  60
Thus, the first cluster has 330 customers, the second cluster has 50 customers and the third cluster has 60 customers. We have seen that RapidMiner's X-Means clustering suggested 4 clusters, SAS suggested 5 clusters and R suggests 3 clusters. So, which one should be trusted? It must be understood that K-Means clustering is affected by outliers (or extreme values), and this dataset does contain extreme values. Hence, it would be a better option to isolate those extreme cases and then run the K-Means clustering again; a rough sketch of that idea follows. Outlier detection will be dealt with separately in another post, as this post is meant only to show how to run cluster analysis using R.
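For the curious, here is a minimal sketch of one simple screening approach: flag observations lying more than three standard deviations from the mean on any variable, then re-fit. The 3-SD cutoff is purely illustrative, not a recommendation:
# Standardize each variable and flag rows with any |z| > 3 (illustrative cutoff)
z=scale(WSdata)
extreme=apply(abs(z)>3,1,any)
sum(extreme)                                        # how many customers are flagged
fit2=kmeans(WSdata[!extreme,],centers=3,nstart=5)   # re-run K-Means without them
fit2$size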

Feel free to add comments/suggestions.

