Introduction
In the previous two posts I used Rapidminer and SAS to cluster a dataset. In today's post I shall explain how cluster analysis can be done using R. There is always a concern about the optimum number of clusters, and almost every statistical package offers some index or other to determine it. Thanks to extensive community support, R has many packages, which makes it a very strong analytics tool today. As far as clustering is concerned, R is capable of calculating 30 different indices to determine the optimum number of clusters. In this post, I shall demonstrate some of R's clustering capabilities on the same wholesale dataset.
Clustering with R
R performs the desired analysis through functions, and those functions are parts of different packages. I shall start with the kmeans() function. Before that, two variables, Channel and Region, are removed from the analysis because they are nominal in nature. Strictly speaking, there is no single algorithm called K-Means: it is either the MacQueen algorithm, the Lloyd and Forgy algorithm or the Hartigan-Wong algorithm. In most cases the MacQueen algorithm is used, but R uses the Hartigan-Wong algorithm by default, which has been found to perform better than the other two in most situations. The Hartigan-Wong algorithm performs better with more than one random start, so the 'nstart' argument is set to 5. The code is given below.
wholesale=read.csv(file.choose()) #point to the downloaded .csv file
WSdata=wholesale[,-c(1,2)]        #drop the nominal Channel and Region columns
cmodel=kmeans(WSdata,centers=3,nstart=5) #Hartigan-Wong by default, 5 random starts
The function kmeans() also returns the Within Sum of Squares (WSS), which can be used to determine the optimum number of clusters: vary the 'centers' value and plot WSS against the number of clusters. The elbow point of that plot suggests the optimum value. To generate the plot, WSS values are calculated for 10 runs with 1 to 10 cluster centers. The code is given below.
wss=(nrow(WSdata)-1)*sum(apply(WSdata,2,var)) #WSS for a single cluster
for(i in 2:10){
  wss[i]=sum(kmeans(WSdata,centers=i,nstart=5)$withinss)
}
plot(1:10,wss,xlab="Number of Clusters",ylab="Within Sum of Squares",type="b")
#The code is borrowed from statmethods.net.
However, the plot generated is rather smooth with no clear elbow. The plot is given below.
Hence, some other method is required to decide the optimum number of clusters. R has 5 packages to measure cluster validity: "clusterSim", "cclust", "clv", "clValid" and "NbClust". Out of these 5 packages, "NbClust" is capable of calculating all 30 indices of cluster validity while varying the number of cluster centers to decide the optimum number of clusters. If different indices suggest different optimal values, voting can be used to determine the optimum cluster number. The following one-line code gives the optimum number of clusters for the dataset.
> library(NbClust) #load the package before calling NbClust()
> nClust=NbClust(WSdata,distance="euclidean",method="kmeans",min.nc=2,max.nc=10)
*** : The Hubert index is a graphical method of determining the number of clusters.
In the plot of Hubert index, we seek a significant knee that corresponds to a
significant increase of the value of the measure i.e the significant peak in Hubert
index second differences plot.
*** : The D index is a graphical method of determining the number of clusters.
In the plot of D index, we seek a significant knee (the significant peak in Dindex
second differences plot) that corresponds to a significant increase of the value of
the measure.
All 440 observations were used.
*******************************************************************
* Among all indices:
* 1 proposed 2 as the best number of clusters
* 11 proposed 3 as the best number of clusters
* 2 proposed 4 as the best number of clusters
* 1 proposed 5 as the best number of clusters
* 1 proposed 7 as the best number of clusters
* 4 proposed 8 as the best number of clusters
* 3 proposed 10 as the best number of clusters
***** Conclusion *****
* According to the majority rule, the best number of clusters is 3
*******************************************************************
Intentionally, I have not given the individual index values; otherwise, the area would have become very crowded. Readers can view them by typing 'nClust' and then pressing the "Enter" key. The cluster centers and the cluster sizes are given below.
> fit=kmeans(WSdata,3,nstart=5)
> fit$centers
Fresh Milk Grocery Frozen Detergents_Paper Delicassen
1 8253.47 3824.603 5280.455 2572.661 1773.058 1137.497
2 8000.04 18511.420 27573.900 1996.680 12407.360 2252.020
3 35941.40 6044.450 6288.617 6713.967 1039.667 3049.467
> fit$size
[1] 330 50 60
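To see which cluster each customer falls into, the assignments stored in fit$cluster can be attached back to the data. A minimal sketch follows; the copy WSclust and the column name Cluster are my own choices, not part of the original analysis:

```r
WSclust=WSdata                              #copy so the original data stays untouched
WSclust$Cluster=fit$cluster                 #append cluster membership to each row
aggregate(.~Cluster,data=WSclust,FUN=mean)  #per-cluster means, mirrors fit$centers
table(WSclust$Cluster)                      #cluster sizes, mirrors fit$size
```

This is handy for profiling the clusters against the original spending columns.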
Thus, the first cluster has 330 customers, the second cluster has 50 customers and the third cluster has 60 customers. We have seen that Rapidminer's X-Means clustering suggested 4 clusters, SAS suggested 5 clusters and R suggests 3 clusters. So, which one is to be considered? It is to be understood that K-Means clustering is affected by outliers (or extreme values), and this dataset contains extreme values. Hence, it would be a better option to isolate those extreme cases and then run the K-Means clustering again. Outlier detection will be dealt with separately in another post; this post is meant only to show how to run cluster analysis using R.
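As a rough preview of that idea, one simple (and admittedly crude) way to isolate extreme cases is to flag rows with any value beyond three standard deviations from its column mean and re-run kmeans() on the rest. This is only a sketch under that assumption, not the outlier-detection method I will cover later:

```r
z=scale(WSdata)                          #standardise each column (mean 0, SD 1)
extreme=apply(abs(z),1,max)>3            #TRUE if any value lies beyond 3 SDs
trimmed=WSdata[!extreme,]                #drop the flagged customers
fit2=kmeans(trimmed,centers=3,nstart=5)  #re-run K-Means on the trimmed data
```

The 3-SD cutoff is arbitrary; a proper treatment of outliers deserves its own discussion.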
Feel free to add comments/suggestions.