Tuesday, 28 October 2014

Factor Analysis

Introduction

Factor analysis is widely used in behavioral research and has many applications in psychology and consumer research. It is also a dimension reduction tool that combines two or more correlated variables into a single variable. These newly derived variables are called factors or components. Essentially, they are unobserved, or latent, variables. Some examples of latent variables are Satisfaction, Value for Money and Depression. These variables cannot be (and should not be) measured using a single-item rating scale (e.g. on a scale of 1 to 10) because each person applies different evaluation criteria in arriving at a final score. To ensure that every respondent thinks along the same lines while arriving at the score, respondents are asked multiple questions (also called items) that everybody understands in the same way. Factor analysis assumes that some underlying factors exist in the dataset, and those factors are extracted by analyzing the correlation or covariance matrix. Several extraction methods are available. For exploratory studies, principal component analysis (PCA) is used most frequently. PCA is closely related to factor analysis, even though the two are conceptually different.

Factor Analysis with R

In this demonstration I shall use the Turkiye Student Evaluation dataset which can be found here. This dataset contains 28 variables which are marked using a Likert scale. The descriptions of the variables can be seen at the same link. A closer look reveals that the respondents have given their responses on two different aspects. I will show that these two aspects can also be extracted using factor analysis. CRAN has a package called 'psych' which covers a wide range of statistical analyses used in psychology. Since factor analysis is a tool in this domain, the package provides many functions related to it. Hence, I am going to use this library for the analysis.

Before running exploratory factor analysis, it is a good idea to run two preliminary tests. The first is the Kaiser-Meyer-Olkin (KMO) test of sampling adequacy and the second is Bartlett's test of sphericity. Bartlett's test checks the null hypothesis that the correlation matrix is an identity matrix, i.e. that there are no correlations among the variables. Factor analysis is advisable if the KMO value is above 0.5 (some authors prefer 0.6) and the p-value of Bartlett's test is less than 0.05. However, one thing should be kept in mind about Bartlett's test. It is very sensitive to non-normality of the data and tends to give significant results even when there is no significant correlation (if the data are non-normal). Hence, if the number of variables is small, a visual inspection of the correlation matrix and the associated p-value matrix can be helpful. The 'psych' package has the corr.test() function, which calculates the correlation matrix along with the p-value of each correlation coefficient. To run Bartlett's test of sphericity in R, I shall use the cortest.bartlett() function from the 'psych' library. The entire dataset is imported into R as a data frame named 'studentSurvey'.

The KMO and Bartlett tests are run with the code given below (the outputs are also shown).

Subhasis >KMO(studentSurvey[,c(6:33)])
Kaiser-Meyer-Olkin factor adequacy
Call: KMO(r = studentSurvey[, c(6:33)])
Overall MSA =  0.99
MSA for each item = 
  Q1   Q2   Q3   Q4   Q5   Q6   Q7   Q8   Q9  Q10  Q11  Q12  Q13  Q14  Q15  Q16  Q17  Q18  Q19  Q20  Q21  Q22  Q23  Q24 
0.98 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.98 0.99 0.99 0.98 0.99 0.99 0.99 0.99 0.99 0.99 0.98 0.98 0.99 0.99 
 Q25  Q26  Q27  Q28 
0.99 0.99 0.99 0.99 
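
To make the overall MSA figure less of a black box, here is a minimal base-R sketch of how the KMO measure can be computed from a correlation matrix. It is only an illustration on simulated data, not the 'psych' implementation; the helper name kmo_overall and the toy data are assumptions made for this example.

```r
# Overall KMO (MSA): squared correlations relative to squared correlations
# plus squared partial (anti-image) correlations, off-diagonal entries only.
kmo_overall <- function(R) {
  Rinv <- solve(R)       # inverse of the correlation matrix
  Q <- cov2cor(Rinv)     # rescaled inverse: partial correlations up to sign
  diag(Q) <- 0           # keep off-diagonal entries only
  diag(R) <- 0
  sum(R^2) / (sum(R^2) + sum(Q^2))
}

# Toy data (hypothetical): five items driven by one common factor
set.seed(42)
f <- rnorm(300)
toy <- sapply(1:5, function(i) f + rnorm(300))

msa <- kmo_overall(cor(toy))
print(round(msa, 2))
```

When the items share a lot of common variance, the partial correlations are small and the measure moves towards 1, which is the pattern seen in the survey data above.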
Subhasis >cortest.bartlett(cor(studentSurvey[,c(6:33)]),n=5820)
$chisq
[1] 297891.4

$p.value
[1] 0

$df
[1] 378
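
For readers who want to see where the chi-square and the degrees of freedom come from, the following base-R sketch computes Bartlett's statistic directly from a correlation matrix. The function name bartlett_sphericity and the simulated data are assumptions for illustration; for real work the cortest.bartlett() call above is the one to use.

```r
# Bartlett's test of sphericity from a correlation matrix R and sample size n:
#   chisq = -(n - 1 - (2p + 5) / 6) * log(det(R)),  df = p * (p - 1) / 2
bartlett_sphericity <- function(R, n) {
  p <- ncol(R)
  # log-determinant computed directly to avoid underflow when det(R) is tiny
  log_det <- as.numeric(determinant(R, logarithm = TRUE)$modulus)
  chisq <- -(n - 1 - (2 * p + 5) / 6) * log_det
  df <- p * (p - 1) / 2
  list(chisq = chisq, df = df,
       p.value = pchisq(chisq, df, lower.tail = FALSE))
}

# Toy data (hypothetical): four variables, two of them strongly correlated
set.seed(1)
X <- matrix(rnorm(200 * 4), ncol = 4)
X[, 2] <- X[, 1] + rnorm(200, sd = 0.5)
res <- bartlett_sphericity(cor(X), n = 200)
print(res)

# Degrees of freedom for 28 items, as in the survey data
df28 <- 28 * (28 - 1) / 2
print(df28)
```

The last line gives 378, the same degrees of freedom reported in the output above.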
In the Bartlett test, the correlation matrix was supplied along with the sample size of 5820 (the total number of respondents). Both the KMO and Bartlett tests suggest that factor analysis is applicable to the dataset, so I can proceed. R can run factor analysis using different methods such as "Maximum Likelihood", "Minimum Residual", "Principal Axis", "Weighted Least Squares" and "Generalized Weighted Least Squares". For the current study, I shall use the most frequently used procedure, i.e. the "Principal Axis" method. Principal axis factoring is very similar to principal component analysis, except that the diagonal of the correlation matrix is replaced by prior communality estimates, typically the squared multiple correlation of each variable with all the others.

A common problem in factor analysis is deciding how many factors to retain, and there are different ways of addressing it. The number of factors can be chosen using a minimum eigenvalue criterion, a minimum variance explained criterion, a scree plot or parallel analysis. A scree plot is particularly helpful when a distinct elbow can be seen in it. If the plot is quite smooth, parallel analysis can be helpful. In this analysis I have run parallel analysis as shown in the code below.
Subhasis >fa.parallel(studentSurvey[,c(6:33)],fm="pa")
Parallel analysis suggests that the number of factors =  5  and the number of components =  2

As per the analysis, there are 5 factors but only 2 components. The fa.parallel() function also draws a scree plot by default. The scree plot shows a sharp elbow at n=2 (i.e. 2 factors). Hence, factor analysis can be run twice (with n=2 and n=5 respectively) to get a better picture. The code and output for the 2-factor model are given below.
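
The idea behind parallel analysis can be sketched in a few lines of base R: eigenvalues of the observed correlation matrix are compared with eigenvalues obtained from random data of the same size, and only the leading eigenvalues that beat the random benchmark are retained. This is a simplified sketch for components only, on simulated data; the function name parallel_components, the 95th percentile cut-off and the toy data are my assumptions, and fa.parallel() does considerably more than this.

```r
# Count the leading eigenvalues that exceed a random-data benchmark
parallel_components <- function(X, n_sim = 50, probs = 0.95) {
  n <- nrow(X)
  p <- ncol(X)
  observed <- eigen(cor(X), symmetric = TRUE, only.values = TRUE)$values
  # Eigenvalues of correlation matrices of pure-noise data (p rows, n_sim columns)
  simulated <- replicate(n_sim, {
    noise <- matrix(rnorm(n * p), nrow = n, ncol = p)
    eigen(cor(noise), symmetric = TRUE, only.values = TRUE)$values
  })
  threshold <- apply(simulated, 1, quantile, probs = probs)
  exceeds <- observed > threshold
  # Stop at the first eigenvalue that fails to beat the benchmark
  if (all(exceeds)) p else which(!exceeds)[1] - 1
}

# Toy data (hypothetical): six items, three per latent factor
set.seed(10)
f1 <- rnorm(500)
f2 <- rnorm(500)
toy <- cbind(sapply(1:3, function(i) f1 + rnorm(500)),
             sapply(1:3, function(i) f2 + rnorm(500)))

n_comp <- parallel_components(toy)
print(n_comp)
```

With two clearly separated blocks of items, this sketch should retain two components.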
Subhasis >fit1=fa(studentSurvey[,c(6:33)],fm="pa",nfactors=2,rotate="varimax")
Subhasis >fit1
Factor Analysis using method =  pa
Call: fa(r = studentSurvey[, c(6:33)], nfactors = 2, rotate = "varimax", 
    fm = "pa")
Standardized loadings (pattern matrix) based upon correlation matrix
     PA1  PA2   h2    u2 com
Q1  0.36 0.80 0.76 0.238 1.4
Q2  0.47 0.79 0.85 0.152 1.6
Q3  0.55 0.70 0.80 0.200 1.9
Q4  0.46 0.79 0.83 0.171 1.6
Q5  0.50 0.80 0.88 0.119 1.7
Q6  0.49 0.77 0.84 0.157 1.7
Q7  0.46 0.82 0.88 0.121 1.6
Q8  0.45 0.81 0.86 0.135 1.6
Q9  0.54 0.70 0.78 0.215 1.9
Q10 0.52 0.79 0.90 0.101 1.7
Q11 0.56 0.69 0.78 0.217 1.9
Q12 0.48 0.75 0.80 0.200 1.7
Q13 0.75 0.56 0.88 0.122 1.9
Q14 0.79 0.52 0.90 0.101 1.7
Q15 0.79 0.52 0.89 0.108 1.7
Q16 0.70 0.61 0.87 0.130 2.0
Q17 0.82 0.41 0.83 0.166 1.5
Q18 0.76 0.54 0.87 0.127 1.8
Q19 0.79 0.52 0.89 0.108 1.7
Q20 0.82 0.49 0.90 0.096 1.6
Q21 0.83 0.46 0.91 0.092 1.6
Q22 0.84 0.46 0.91 0.089 1.6
Q23 0.75 0.57 0.89 0.111 1.9
Q24 0.71 0.60 0.86 0.141 1.9
Q25 0.82 0.47 0.90 0.102 1.6
Q26 0.75 0.55 0.86 0.142 1.8
Q27 0.69 0.57 0.81 0.194 1.9
Q28 0.81 0.46 0.86 0.137 1.6

                        PA1   PA2
SS loadings           12.54 11.47
Proportion Var         0.45  0.41
Cumulative Var         0.45  0.86
Proportion Explained   0.52  0.48
Cumulative Proportion  0.52  1.00

Mean item complexity =  1.7
Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are  378  and the objective function was  51.28 with Chi Square of  297891.4
The degrees of freedom for the model are 323  and the objective function was  2.99 

The root mean square of the residuals (RMSR) is  0.01 
The df corrected root mean square of the residuals is  0.02 

The harmonic number of observations is  5820 with the empirical chi square  946.46  with prob <  1.7e-62 
The total number of observations was  5820  with MLE Chi Square =  17339.96  with prob <  0 

Tucker Lewis Index of factoring reliability =  0.933
RMSEA index =  0.095  and the 90 % confidence intervals are  0.094 0.096
BIC =  14539.85
Fit based upon off diagonal values = 1
Measures of factor score adequacy             
                                                PA1  PA2
Correlation of scores with factors             0.97 0.97
Multiple R square of scores with factors       0.95 0.94
Minimum correlation of possible factor scores  0.90 0.87
As per the output shown above, the 2 factors together explain 86% of the total variance. For most practical purposes, a model in which two factors explain 86% of the total variance can be considered a very good model. Based on the factor loadings, Q1 to Q12 load mainly on factor 2 (PA2) and Q13 to Q28 load mainly on factor 1 (PA1). With an orthogonal rotation such as varimax, a factor loading is simply the correlation of each question (or item) with the respective factor. If 'nfactors' is set to 5 in the above analysis, a five-factor model is produced. However, the loadings of all the items on the 3rd, 4th and 5th factors turn out to be very low (readers can verify this for themselves). One more quantity should be examined in factor analysis: communality. The communality of an item is the proportion of its variance that is explained by the extracted factors. If the communality of a variable is low (less than 0.4), that variable can be dropped from the analysis. The communality scores are given below.
Subhasis >fit1$communality
       Q1        Q2        Q3        Q4        Q5        Q6        Q7        Q8 
0.7623041 0.8476720 0.8002198 0.8289699 0.8805149 0.8426552 0.8794662 0.8649415 
       Q9       Q10       Q11       Q12       Q13       Q14       Q15       Q16 
0.7849237 0.8987801 0.7828362 0.8003019 0.8780396 0.8987789 0.8923352 0.8700809 
      Q17       Q18       Q19       Q20       Q21       Q22       Q23       Q24 
0.8337139 0.8732333 0.8918975 0.9040908 0.9081967 0.9105020 0.8891574 0.8591331 
      Q25       Q26       Q27       Q28 
0.8983235 0.8575586 0.8059180 0.8634799
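
As a quick check on what these numbers mean, the communality of an item is the sum of its squared loadings, and the proportion of variance explained by a factor is its SS loading divided by the number of items. The sketch below redoes that arithmetic for the first three items using the rounded loadings printed in the output above, so small rounding differences from fit1$communality are expected.

```r
# Rounded loadings of Q1-Q3, copied from the printed fa() output
loadings <- matrix(c(0.36, 0.80,
                     0.47, 0.79,
                     0.55, 0.70),
                   ncol = 2, byrow = TRUE,
                   dimnames = list(c("Q1", "Q2", "Q3"), c("PA1", "PA2")))

# Communality = row sum of squared loadings
h2 <- rowSums(loadings^2)
reported <- c(Q1 = 0.7623041, Q2 = 0.8476720, Q3 = 0.8002198)
print(round(h2, 2))
print(round(abs(h2 - reported), 3))   # gap caused by rounding of the loadings

# Proportion of variance = SS loadings / number of items (28)
ss_loadings <- c(PA1 = 12.54, PA2 = 11.47)
prop_var <- ss_loadings / 28
print(round(prop_var, 2))
print(round(sum(prop_var), 2))
```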

Since the communality of every variable is above 0.75, all the variables are well represented by the two factors and are worth retaining in this analysis. The next step is to identify the factors properly. This is often tricky because exploratory factor analysis can group variables without any meaningful logical explanation. In this case, however, identifying the factors is comparatively easy. The first factor relates to the instructor and the second factor relates to the course content. Hence, I can call the first factor "Instructor Evaluation" and the second factor "Course Evaluation".

SAS has PROC FACTOR, which performs a similar analysis and produces report-ready output. The code is given below.
proc factor data=subhasis.turkiyestudentdata 
   method=principal
   nfactors=5
   plots=scree
   rotate=varimax;
var Q1--Q28;
run;
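The number of factors extracted by this code is 5 so that readers can compare the 2-factor model with the 5-factor model. Note that METHOD=PRINCIPAL without a PRIORS= option extracts principal components; adding PRIORS=SMC to the PROC FACTOR statement requests principal factor extraction, which is closer to the "pa" method used in R above. The output is not shown in this post; readers are requested to run the code to see the results. The scree plot also appears in this output.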

This is all about factor analysis in a very compact format.

Comments are welcome.

Sunday, 5 October 2014

Cluster Analysis with R

Introduction

In the previous two posts I used Rapidminer and SAS to cluster a dataset. In today's post I shall explain how cluster analysis can be done using R. There is always a concern about the optimum number of clusters, and almost every statistical package offers one index or another to determine it. Thanks to extensive community support, R has many packages, which makes it a very strong analytics tool today. As far as clustering is concerned, R can calculate 30 different indices to determine the optimum number of clusters. In this post, I shall demonstrate some of the capabilities of R by clustering the same wholesale dataset.

Clustering with R

R uses functions to perform the desired analysis, and those functions belong to different libraries. I shall start with the kmeans() function. Before that, two variables, channel and region, are removed from the analysis because they are nominal. Strictly speaking, K-Means is not a single algorithm. The name covers the MacQueen algorithm, the Lloyd (also credited to Forgy) algorithm and the Hartigan-Wong algorithm. In most cases the MacQueen algorithm is used, but R uses the Hartigan-Wong algorithm by default. The Hartigan-Wong algorithm has been found to perform better than the other two in most situations, and it performs better with more than one random start. Hence, the 'nstart' option is set to 5. The code is given below.
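wholesale=read.csv(file.choose()) #point to the downloaded .csv file
WSdata=wholesale[,-c(1,2)] #drop the first two columns (channel and region), which are nominal
cmodel=kmeans(WSdata,centers=3,nstart=5) #Hartigan-Wong by default, best of 5 random starts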
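
The choice of algorithm is exposed through the 'algorithm' argument of kmeans(). The short sketch below fits the same simulated data with three of the available options and compares the total within-cluster sum of squares. The toy data and the iter.max value are assumptions for illustration, and a seed is set so that the random starts are repeatable.

```r
# Toy data (hypothetical): three blobs in two dimensions
set.seed(123)
toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 5), ncol = 2),
             matrix(rnorm(100, mean = 10), ncol = 2))

algos <- c("Hartigan-Wong", "Lloyd", "MacQueen")
fits <- lapply(algos, function(a) {
  kmeans(toy, centers = 3, nstart = 5, iter.max = 50, algorithm = a)
})
names(fits) <- algos

# Lower total within-cluster sum of squares means tighter clusters
tot_wss <- sapply(fits, function(f) f$tot.withinss)
print(round(tot_wss, 1))
```

On easy data like this the algorithms usually agree; differences tend to show up on data with overlapping or unevenly sized clusters.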
The kmeans() function also calculates the within-cluster sum of squares (WSS). This can be used to determine the optimum number of clusters by varying the 'centers' value and plotting WSS against the number of clusters. The elbow point of the plot suggests the optimum value. To generate the plot, WSS values are calculated for 1 to 10 clusters. The code is given below.
wss=(nrow(WSdata)-1)*sum(apply(WSdata,2,var))
for(i in 2:10){
  wss[i]=sum(kmeans(WSdata,centers=i,nstart=5)$withinss)
}
plot(1:10,wss,xlab="Number of Clusters",ylab="Within Sum of Squares",type="b")

#The code is borrowed from statmethods.net and it can be seen here.
However, the plot generated is rather smooth with no clear elbow. The plot is given below.
Hence, some other method is required to decide the optimum number of clusters. Several R libraries measure cluster validity, including "clusterSim", "cclust", "clv", "clValid" and "NbClust". Of these, "NbClust" can calculate all 30 cluster validity indices while varying the number of cluster centers. If different indices suggest different optimal values, voting can be used to determine the optimum number of clusters. The following one-line code gives the optimum number of clusters for the dataset.
> nClust=NbClust(WSdata,distance="euclidean",method="kmeans",min.nc=2,max.nc=10)
*** : The Hubert index is a graphical method of determining the number of clusters.
                In the plot of Hubert index, we seek a significant knee that corresponds to a 
                significant increase of the value of the measure i.e the significant peak in Hubert
                index second differences plot. 
 
*** : The D index is a graphical method of determining the number of clusters. 
                In the plot of D index, we seek a significant knee (the significant peak in Dindex
                second differences plot) that corresponds to a significant increase of the value of
                the measure. 
 
All 440 observations were used. 
 
******************************************************************* 
* Among all indices:                                                
* 1 proposed 2 as the best number of clusters 
* 11 proposed 3 as the best number of clusters 
* 2 proposed 4 as the best number of clusters 
* 1 proposed 5 as the best number of clusters 
* 1 proposed 7 as the best number of clusters 
* 4 proposed 8 as the best number of clusters 
* 3 proposed 10 as the best number of clusters 

                   ***** Conclusion *****                            
 
* According to the majority rule, the best number of clusters is  3 
 
 
******************************************************************* 
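
The majority rule in this summary is nothing more than a vote count. As a small worked example, the tally printed above can be typed in by hand and the winner picked with which.max(); the vector name votes is just a label chosen for this sketch.

```r
# Votes per candidate number of clusters, copied from the NbClust summary
votes <- c("2" = 1, "3" = 11, "4" = 2, "5" = 1, "7" = 1, "8" = 4, "10" = 3)

best_k <- as.integer(names(votes)[which.max(votes)])
print(best_k)       # number of clusters with the most votes
print(sum(votes))   # number of indices that voted in this summary
```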
I have intentionally not shown the index values, as the output would have become very crowded. Readers can view them simply by typing 'nClust' and pressing the "Enter" key. The cluster centers and the cluster sizes are given below.
> fit=kmeans(WSdata,3,nstart=5)
> fit$centers
     Fresh      Milk   Grocery   Frozen Detergents_Paper Delicassen
1  8253.47  3824.603  5280.455 2572.661         1773.058   1137.497
2  8000.04 18511.420 27573.900 1996.680        12407.360   2252.020
3 35941.40  6044.450  6288.617 6713.967         1039.667   3049.467
> fit$size
[1] 330  50  60
Thus, the first cluster has 330 customers, the second cluster has 50 customers and the third cluster has 60 customers. We have seen that Rapidminer's X-Means clustering suggested 4 clusters, SAS suggested 5 clusters and R suggests 3 clusters. So, which one should be considered? It should be understood that K-Means clustering is affected by outliers (or extreme values), and this dataset contains extreme values. Hence, a better option would be to isolate those extreme cases and then run the K-Means clustering. Outlier detection will be dealt with separately in another post. This post is meant only to show how to run cluster analysis using R.

Feel free to add comments/suggestions.

Sunday, 28 September 2014

Cluster Analysis with SAS

Introduction

In my previous post I described how to do cluster analysis using Rapidminer. A closer look at the modeling techniques available in Rapidminer shows how capable this software is as far as data mining is concerned. Moreover, it is menu driven, and users are not required to write any code to do an analysis. However, advanced users can write code in one of the operators to do even more sophisticated analysis. It is worth exploring the capabilities of Rapidminer, and I shall give many demonstrations of it in future posts. R and SAS are also very capable of analyzing data, and R is superior to Rapidminer mainly because of the active community supporting R in many ways. In graphics capabilities particularly, R is definitely a winner. In this post, I shall show how clustering can be done using SAS.

Cluster Analysis using SAS

I am using SAS University Edition, which is free software for educational purposes with limited features. The good part is that it comes fully equipped with the Base SAS and SAS/STAT modules, which are essential for statistical analyses. It is also possible to use R from within SAS through the SAS/IML module, although I am not sure whether the same can be achieved in the University Edition. Hence, I shall use the built-in functionality of SAS/STAT for cluster analysis. SAS can do cluster analysis using 3 different procedures: PROC CLUSTER, PROC FASTCLUS and PROC VARCLUS. PROC CLUSTER performs hierarchical clustering, PROC FASTCLUS performs K-Means clustering and PROC VARCLUS is a special type of clustering in which (by default) Principal Component Analysis (PCA) is used to cluster variables. It is argued that the VARCLUS algorithm is often better than simple PCA for dimension reduction because each dimension is easier to interpret. I am not going to deal with this type of clustering for now. PROC CLUSTER and PROC FASTCLUS are going to be used for the demonstration.

For this demonstration, I am using the same wholesale data that I used in my previous post. SAS University Edition doesn't have a dedicated procedure for optimizing the number of clusters, so deciding on the optimum number rests with the analyst. Hierarchical clustering can show the possible number of clusters based on certain criteria, but once the number of cases goes beyond about 200, the output becomes cumbersome. Hence, a two-step method can be adopted. In the first step, generate (say) 100 clusters using K-Means clustering. In the second step, run a hierarchical clustering on the output dataset to decide the optimum number of clusters. This approach, even though it sounds logical, is different from other approaches such as AIC and BIC, so its output will also differ from theirs. Moreover, K-Means clustering depends on the initial seed from which pseudo random numbers are generated. SAS has a mechanism to fix this seed so that the output does not change if the algorithm is run multiple times. However, if the seed is not fixed in R and Rapidminer, simple K-Means clustering will produce different results each time the algorithm is run. Readers can verify this by manually setting different seeds in simple K-Means clustering while working with Rapidminer.
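
For readers who prefer to try the two-step idea in R before looking at the SAS code, here is a rough sketch on simulated data. It is not a translation of the SAS procedures: the toy data, the choice of 30 preliminary clusters and 3 final clusters, and Ward's method via hclust(method = "ward.D2") are all assumptions made for this illustration. Fixing the seed with set.seed() makes the K-Means step repeatable.

```r
# Toy data (hypothetical): 600 points in three blobs
set.seed(2014)
toy <- rbind(matrix(rnorm(400, mean = 0), ncol = 2),
             matrix(rnorm(400, mean = 6), ncol = 2),
             matrix(rnorm(400, mean = 12), ncol = 2))

# Step 1: K-Means with many clusters to compress the data into centroids
pre <- kmeans(toy, centers = 30, nstart = 5, iter.max = 50)

# Step 2: hierarchical clustering of the 30 centroids
tree <- hclust(dist(pre$centers), method = "ward.D2")
groups <- cutree(tree, k = 3)   # one final label per preliminary centroid

# Map every observation to the final cluster of its preliminary centroid
final <- groups[pre$cluster]
print(table(final))
```

Plotting the tree with plot(tree) gives a dendrogram of only 30 leaves, which is far easier to read than one built on every observation.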

To run cluster analysis using SAS, I shall first run PROC FASTCLUS to generate 30 clusters using the code given below. The OUTSEED= option captures the centroid values of the individual clusters, and the same centroid values are used in the subsequent hierarchical clustering (PROC CLUSTER). The MAXC= option specifies 30 clusters. In PROC CLUSTER, the CCC and PSEUDO options are specified so that, apart from the dendrogram, a few more important graphs are generated to help identify the proper number of clusters.

proc fastclus data=subhasis.wholesale(drop=region channel)
  outseed=clustmeans
  out=newdata 
  maxc=30;
run;

proc cluster data=clustmeans(drop=_crit_ cluster _RMSSTD_ _freq_ _radius_ _NEAR_ _GAP_) 
  ccc
  method=ward 
  pseudo;
run;

PROC CLUSTER gives a detailed output with a dendrogram and 3 different graphs. The graphs are shown below.



The Pseudo-F graph shows a peak at 3 clusters, after which its value remains almost constant. The Pseudo-T square graph shows two possible numbers of clusters, 3 and 5 (moving from right to left, there are sudden jumps in value after 3 and 5 clusters). The dendrogram also suggests 3 and 5 clusters. Hence, I can now run cluster analysis using PROC FASTCLUS with the MAXC= option set to 5 and 3 respectively. The outputs are shown below. The first one is for MAXC=5 and the second one is for MAXC=3. Since the CCC value of the 3-cluster solution is below 2, I am going to accept the 5-cluster solution as more appropriate. The centroid values of each cluster are given along with the number of cases in each cluster (as frequency). Looking at these values, the clusters can be characterized (based on their centroid values). It should be kept in mind that cluster analysis is affected by the variability within each variable. A better option is to standardize the variables using PROC STDIZE before running the same analysis. In my next post I shall show how to run cluster analysis using R.
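
To illustrate the point about standardization, the sketch below shows the same idea in R using scale(), which centers each variable to mean 0 and rescales it to standard deviation 1. The two toy variables are hypothetical and are deliberately put on very different scales; I am assuming z-score standardization here, which is one of several methods a procedure like PROC STDIZE can apply.

```r
# Toy data (hypothetical): one variable in the thousands, one in single digits
set.seed(7)
toy <- data.frame(big = rnorm(100, mean = 10000, sd = 5000),
                  small = rnorm(100, mean = 10, sd = 2))

# Without scaling, Euclidean distances are dominated by 'big'
print(round(apply(toy, 2, sd), 1))

# z-score standardization: every variable gets mean 0 and sd 1
scaled <- scale(toy)
print(round(apply(scaled, 2, sd), 1))

fit <- kmeans(scaled, centers = 3, nstart = 5)
print(fit$size)
```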

Results: Cluster Analysis.sas

The SAS System
The FASTCLUS Procedure
Replace=FULL Radius=0 Maxclusters=5 Maxiter=1000 Converge=0

Initial Seeds
Cluster        Fresh        Milk     Grocery      Frozen  Detergents_Paper  Delicassen
      1  112151.0000  29627.0000  18148.0000  16745.0000         4948.0000   8550.0000
      2   32717.0000  16784.0000  13626.0000  60869.0000         1272.0000   5609.0000
      3    8565.0000   4980.0000  67298.0000    131.0000        38102.0000   1215.0000
      4   22925.0000  73498.0000  32114.0000    987.0000        20070.0000    903.0000
      5     190.0000    727.0000   2012.0000    245.0000          184.0000    127.0000

Minimum Distance Between Initial Seeds = 71813.81

Iteration History
                                    Relative Change in Cluster Seeds
Iteration  Criterion         1         2         3         4         5
        1     8376.5    0.4524    0.3885    0.3772    0.3175    0.1843
        2     6010.8    0.2631    0.2443    0.1317         0   0.00489
        3     5733.4    0.1768         0    0.0805         0   0.00867
        4     5487.4    0.1057         0    0.1180    0.1083   0.00901
        5     5253.4    0.0796         0    0.1429    0.1609    0.0103
        6     4956.7    0.0437         0    0.0663    0.1385    0.0110
        7     4829.0    0.0343         0    0.0350         0   0.00789
        8     4780.8    0.0179         0    0.0168         0   0.00366
        9     4769.4    0.0112         0    0.0142         0   0.00402
       10     4760.7   0.00359         0    0.0188    0.1509   0.00189
       11     4729.1         0         0    0.0114         0   0.00304
       12     4724.5         0         0   0.00701         0   0.00202
       13     4722.6         0         0   0.00522         0   0.00153
       14     4721.9         0         0   0.00186         0  0.000540
       15     4721.8         0         0         0         0         0

Convergence criterion is satisfied.

Criterion Based on Final Seeds = 4721.8

Cluster Summary
                                               Maximum Distance                                     Distance Between
Cluster  Frequency  RMS Std Deviation  from Seed to Observation  Radius Exceeded  Nearest Cluster  Cluster Centroids
      1         58             7234.9                   82064.3                                 5            27453.6
      2          2            16287.3                   28210.4                                 1            56957.4
      3         84             5125.8                   35772.7                                 5            19336.8
      4          5            15387.3                   43353.9                                 3            60396.8
      5        291             3440.8                   32497.8                                 3            19336.8

Statistics for Variables
Variable          Total STD  Within STD  R-Square  RSQ/(1-RSQ)
Fresh                 12647        8024  0.601115     1.506989
Milk                   7380        4817  0.577847     1.368810
Grocery                9503        4857  0.741208     2.864102
Frozen                 4855        3620  0.449000     0.814883
Detergents_Paper       4768        2489  0.729897     2.702298
Delicassen             2820        2197  0.398673     0.662990
OVER-ALL               7735        4749  0.626511     1.677456

Pseudo F Statistic = 182.42

Approximate Expected Over-All R-Squared = 0.59202

Cubic Clustering Criterion = 4.249
WARNING: The two values above are invalid for correlated variables.

Cluster Means
Cluster        Fresh         Milk      Grocery       Frozen  Detergents_Paper   Delicassen
      1  35981.37931   5205.74138   5922.77586   5266.12069        1049.46552   2231.31034
      2  34782.00000  30367.00000  16898.00000  48701.50000         755.50000  26776.00000
      3   5176.25000  12308.75000  19113.21429   1655.05952        8426.45238   1980.71429
      4  25603.00000  43460.60000  61472.20000   2636.00000       29974.20000   2708.80000
      5   8800.09278   3218.04811   4152.48797   2737.48110        1195.13402   1058.59450

Cluster Standard Deviations
Cluster        Fresh         Milk      Grocery       Frozen  Detergents_Paper   Delicassen
      1  15493.47236   4856.00030   4209.33955   5031.80137        1320.21701   2378.04033
      2   2920.35101  19209.26282   4627.30678  17207.44352         730.44130  29934.65847
      3   5327.61431   6741.08773   7354.66199   1769.85611        4452.44466   2602.16641
      4  14578.72606  25164.55689  21876.69411   3100.38570        9032.28303   2243.61855
      5   6190.24468   2676.09417   3121.02561   3554.05652        1466.72710   1015.19034


Results: Cluster Analysis.sas

The SAS System
The FASTCLUS Procedure
Replace=FULL Radius=0 Maxclusters=3 Maxiter=1000 Converge=0

Initial Seeds
Cluster        Fresh        Milk     Grocery      Frozen  Detergents_Paper  Delicassen
      1  112151.0000  29627.0000  18148.0000  16745.0000         4948.0000   8550.0000
      2     680.0000   1610.0000    223.0000    862.0000           96.0000    379.0000
      3   16117.0000  46197.0000  92780.0000   1026.0000        40827.0000   2944.0000

Minimum Distance Between Initial Seeds = 111618.6

Iteration History
                      Relative Change in Cluster Seeds
Iteration  Criterion         1         2         3
        1     8812.7    0.2911    0.1232    0.3101
        2     6419.3    0.1693   0.00386    0.0869
        3     6167.3    0.1340   0.00687    0.0614
        4     5894.0    0.0720   0.00683    0.0635
        5     5742.2    0.0458   0.00595    0.0443
        6     5664.8    0.0275   0.00408    0.0355
        7     5609.2    0.0215   0.00405    0.0359
        8     5571.1   0.00680   0.00210    0.0225
        9     5560.8   0.00237  0.000908    0.0130
       10     5555.0   0.00240   0.00194    0.0243
       11     5540.3   0.00219   0.00134    0.0136
       12     5535.6         0  0.000907   0.00827
       13     5534.0         0  0.000890   0.00787
       14     5531.9         0  0.000932   0.00769
       15     5528.4         0   0.00178    0.0131
       16     5520.3         0   0.00139   0.00903
       17     5517.8   0.00219  0.000957   0.00568
       18     5516.7   0.00240  0.000836   0.00302
       19     5516.2         0         0         0

Convergence criterion is satisfied.

Criterion Based on Final Seeds = 5516.2

Cluster Summary
                                               Maximum Distance                                     Distance Between
Cluster  Frequency  RMS Std Deviation  from Seed to Observation  Radius Exceeded  Nearest Cluster  Cluster Centroids
      1         60             8531.3                   81552.7                                 2            28176.4
      2        330             3778.6                   32828.5                                 1            28176.4
      3         50             9473.0                   76767.9                                 2            28765.0

Statistics for Variables
Variable          Total STD  Within STD  R-Square  RSQ/(1-RSQ)
Fresh                 12647        8340  0.567123     1.310122
Milk                   7380        5769  0.391861     0.644362
Grocery                9503        6395  0.549164     1.218100
Frozen                 4855        4640  0.090461     0.099458
Detergents_Paper       4768        3326  0.515664     1.064684
Delicassen             2820        2738  0.061701     0.065758
OVER-ALL               7735        5535  0.490263     0.961797

Pseudo F Statistic = 210.15

Approximate Expected Over-All R-Squared = 0.47789

Cubic Clustering Criterion = 1.217
WARNING: The two values above are invalid for correlated variables.

Cluster Means
Cluster        Fresh         Milk      Grocery       Frozen  Detergents_Paper   Delicassen
      1  35941.40000   6044.45000   6288.61667   6713.96667        1039.66667   3049.46667
      2   8253.46970   3824.60303   5280.45455   2572.66061        1773.05758   1137.49697
      3   8000.04000  18511.42000  27573.90000   1996.68000       12407.36000   2252.02000

Cluster Standard Deviations
Cluster        Fresh         Milk      Grocery       Frozen  Detergents_Paper   Delicassen
      1  15234.89953   7055.55417   4629.03408   9555.16491        1302.21502   6355.49128
      2   6194.18203   3191.95806   4370.72957   3404.70887        2185.47863   1280.03870
      3   9124.63123  12977.91274  14515.78198   2069.22587        8033.07822   2686.83738
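
As a sanity check on the two listings, the Pseudo F statistic can be recomputed from the over-all R-Square, the number of observations (440) and the number of clusters, using F = (R²/(k-1)) / ((1-R²)/(n-k)). The small R sketch below does this arithmetic with the values printed above; the helper name pseudo_f is my own.

```r
# Pseudo F from the over-all R-square, sample size n and number of clusters k
pseudo_f <- function(rsq, n, k) {
  (rsq / (k - 1)) / ((1 - rsq) / (n - k))
}

f5 <- pseudo_f(0.626511, n = 440, k = 5)   # 5-cluster solution
f3 <- pseudo_f(0.490263, n = 440, k = 3)   # 3-cluster solution
print(round(c(k5 = f5, k3 = f3), 2))
```

The results should agree with the 182.42 and 210.15 reported by PROC FASTCLUS, up to rounding of the printed R-Square values.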

