Tuesday, 28 October 2014

Factor Analysis

Introduction

Factor analysis is a popular technique in behavioral research and has many applications in psychology and consumer research. It is a dimension reduction tool that combines two or more correlated variables into a single variable. These newly derived variables are called factors or components; essentially, they are unobserved or latent variables. Some examples of latent variables are Satisfaction, Value for Money and Depression. Such variables cannot (and should not) be measured using a single-item rating scale (e.g. on a scale of 1 to 10) because each person applies different evaluation criteria when arriving at a final score. To ensure that every respondent thinks in the same way while evaluating, respondents are asked multiple questions (also called items) which everybody understands in the same way. In factor analysis, it is assumed that some underlying factors exist in the dataset, and those factors are extracted by analyzing the correlation or covariance matrix. There are several methods available for this analysis; for exploratory studies, principal component analysis (PCA) is used most frequently. Principal component analysis is very closely related to factor analysis even though the two are not the same from a conceptual point of view.
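To make the idea concrete, here is a tiny simulated sketch (the data below are artificial, not from any real survey): two items driven by the same latent score collapse into a single component.

set.seed(1)
satisfaction=rnorm(200) #unobserved (latent) score
item1=satisfaction+rnorm(200,sd=0.5) #two observed ratings driven by the latent score
item2=satisfaction+rnorm(200,sd=0.5)
pc=prcomp(cbind(item1,item2),scale.=TRUE) #principal component analysis of the two items
summary(pc) #the first component captures most of the total variance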

Factor Analysis with R

In this demonstration I shall use the Turkiye Student Evaluation dataset, which can be found here. This dataset contains 28 variables which are marked on a Likert scale. The descriptions of the variables can be seen at the same link. A closer look reveals that the respondents have given their responses on two different aspects, and I will show that these two aspects can be extracted using factor analysis. CRAN has a package called 'psych' which is very comprehensive for statistical analyses in psychology. Since factor analysis is also a tool in this domain, many functions related to factor analysis can be found there, so I am going to use this library for the analysis. Before running (exploratory) factor analysis it is a good idea to run two related tests. The first is the Kaiser-Meyer-Olkin (KMO) test of sampling adequacy and the second is Bartlett's test of sphericity. Bartlett's test checks the hypothesis that the correlation matrix is essentially an identity matrix, i.e. that there are no correlations among the variables. Factor analysis is advisable if the KMO value is above 0.5 (or above 0.6) and Bartlett's test p-value is less than 0.05. However, one thing should be kept in mind about the Bartlett test: it is very sensitive to non-normality of the data and tends to give significant results even when there is no meaningful correlation. Hence, if the number of variables is small, a visual inspection of the correlation matrix and the associated p-value matrix can be helpful. The 'psych' package has a corr.test() function which calculates the correlation matrix along with the p-value of each correlation coefficient. To get a proper Bartlett test of sphericity in R, I shall use the cortest.bartlett() function from the 'psych' library. The entire dataset is imported into R as a data frame named 'studentSurvey'; a minimal sketch of this set-up is given below, followed by the KMO and Bartlett tests and their outputs.
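The sketch assumes the downloaded file has been saved as 'turkiye-student-evaluation.csv' (the file name is illustrative); the corr.test() call is the optional visual check of the correlation and p-value matrices mentioned above.

library(psych)
studentSurvey=read.csv("turkiye-student-evaluation.csv") #illustrative file name; point to the downloaded file
ct=corr.test(studentSurvey[,c(6:33)]) #correlations and p-values for Q1 to Q28 (columns 6 to 33)
round(ct$r[1:5,1:5],2) #a corner of the correlation matrix
round(ct$p[1:5,1:5],3) #corresponding p-values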

Subhasis >KMO(studentSurvey[,c(6:33)])
Kaiser-Meyer-Olkin factor adequacy
Call: KMO(r = studentSurvey[, c(6:33)])
Overall MSA =  0.99
MSA for each item = 
  Q1   Q2   Q3   Q4   Q5   Q6   Q7   Q8   Q9  Q10  Q11  Q12  Q13  Q14  Q15  Q16  Q17  Q18  Q19  Q20  Q21  Q22  Q23  Q24 
0.98 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.98 0.99 0.99 0.98 0.99 0.99 0.99 0.99 0.99 0.99 0.98 0.98 0.99 0.99 
 Q25  Q26  Q27  Q28 
0.99 0.99 0.99 0.99 
Subhasis >cortest.bartlett(cor(studentSurvey[,c(6:33)]),n=5820)
$chisq
[1] 297891.4

$p.value
[1] 0

$df
[1] 378
In the Bartlett test, the correlation matrix was supplied along with the sample size of 5820 (the total number of respondents). Both the KMO and Bartlett tests suggest that factor analysis is applicable to this dataset, so I can proceed. R can run factor analysis using different methods such as "Maximum Likelihood", "Minimum Residual", "Principal Axis", "Weighted Least Squares", "Generalized Weighted Least Squares" etc. For the current study, I shall use the most frequently used procedure, i.e. the "Principal Axis" method. The principal axis method is very similar to principal component analysis, except that the diagonal of the correlation matrix is replaced by prior communality estimates (the squared multiple correlations of each variable with the others). A common problem in factor analysis is deciding how many factors to retain, and there are different ways of addressing the issue. Factors can be retained using a minimum eigenvalue criterion, a minimum variance explained criterion, a scree plot or parallel analysis. The scree plot is particularly helpful when a distinct elbow can be seen in it; if the plot is quite smooth, parallel analysis can help. In this analysis I have run parallel analysis as shown in the code below.
Subhasis >fa.parallel(studentSurvey[,c(6:33)],fm="pa")
Parallel analysis suggests that the number of factors =  5  and the number of components =  2

As per the analysis, there are 5 factors but only 2 components. The fa.parallel() function also produces a scree plot by default, and the scree plot shows a sharp elbow at n=2 (i.e. 2 factors). Hence, factor analysis can be done twice (with n=2 and n=5 respectively) to get a better picture. The code and output are given below.
Subhasis >fit1=fa(studentSurvey[,c(6:33)],fm="pa",nfactors=2,rotate="varimax")
Subhasis >fit1
Factor Analysis using method =  pa
Call: fa(r = studentSurvey[, c(6:33)], nfactors = 2, rotate = "varimax", 
    fm = "pa")
Standardized loadings (pattern matrix) based upon correlation matrix
     PA1  PA2   h2    u2 com
Q1  0.36 0.80 0.76 0.238 1.4
Q2  0.47 0.79 0.85 0.152 1.6
Q3  0.55 0.70 0.80 0.200 1.9
Q4  0.46 0.79 0.83 0.171 1.6
Q5  0.50 0.80 0.88 0.119 1.7
Q6  0.49 0.77 0.84 0.157 1.7
Q7  0.46 0.82 0.88 0.121 1.6
Q8  0.45 0.81 0.86 0.135 1.6
Q9  0.54 0.70 0.78 0.215 1.9
Q10 0.52 0.79 0.90 0.101 1.7
Q11 0.56 0.69 0.78 0.217 1.9
Q12 0.48 0.75 0.80 0.200 1.7
Q13 0.75 0.56 0.88 0.122 1.9
Q14 0.79 0.52 0.90 0.101 1.7
Q15 0.79 0.52 0.89 0.108 1.7
Q16 0.70 0.61 0.87 0.130 2.0
Q17 0.82 0.41 0.83 0.166 1.5
Q18 0.76 0.54 0.87 0.127 1.8
Q19 0.79 0.52 0.89 0.108 1.7
Q20 0.82 0.49 0.90 0.096 1.6
Q21 0.83 0.46 0.91 0.092 1.6
Q22 0.84 0.46 0.91 0.089 1.6
Q23 0.75 0.57 0.89 0.111 1.9
Q24 0.71 0.60 0.86 0.141 1.9
Q25 0.82 0.47 0.90 0.102 1.6
Q26 0.75 0.55 0.86 0.142 1.8
Q27 0.69 0.57 0.81 0.194 1.9
Q28 0.81 0.46 0.86 0.137 1.6

                        PA1   PA2
SS loadings           12.54 11.47
Proportion Var         0.45  0.41
Cumulative Var         0.45  0.86
Proportion Explained   0.52  0.48
Cumulative Proportion  0.52  1.00

Mean item complexity =  1.7
Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are  378  and the objective function was  51.28 with Chi Square of  297891.4
The degrees of freedom for the model are 323  and the objective function was  2.99 

The root mean square of the residuals (RMSR) is  0.01 
The df corrected root mean square of the residuals is  0.02 

The harmonic number of observations is  5820 with the empirical chi square  946.46  with prob <  1.7e-62 
The total number of observations was  5820  with MLE Chi Square =  17339.96  with prob <  0 

Tucker Lewis Index of factoring reliability =  0.933
RMSEA index =  0.095  and the 90 % confidence intervals are  0.094 0.096
BIC =  14539.85
Fit based upon off diagonal values = 1
Measures of factor score adequacy             
                                                PA1  PA2
Correlation of scores with factors             0.97 0.97
Multiple R square of scores with factors       0.95 0.94
Minimum correlation of possible factor scores  0.90 0.87
As per the output shown above, the cumulative variance explained by these 2 factors is 86%. For almost all statistical purposes, if two factors can explain 86% of the total variance, the factor model can be considered a very good model. Based on the factor loadings, it can be said that Q1 to Q12 point towards factor 2 and Q13 to Q28 point towards factor 1. Factor loadings are nothing but the correlations of each question (or item) with the respective factors. In the above analysis, if 'nfactors' is set to 5, a five-factor model will come up; however, the loadings of all the items on the 3rd, 4th and 5th factors turn out to be very low (readers can check this for themselves; a sketch of the call is given after the factor identification below). One more thing to examine in factor analysis is the communality. The communality value shows how much of an individual item's variance is explained by the extracted factors (as a proportion). If, for any variable, the communality is low (less than 0.4), that variable can be dropped from the analysis. The communality scores are given below.
Subhasis >fit1$communality
       Q1        Q2        Q3        Q4        Q5        Q6        Q7        Q8 
0.7623041 0.8476720 0.8002198 0.8289699 0.8805149 0.8426552 0.8794662 0.8649415 
       Q9       Q10       Q11       Q12       Q13       Q14       Q15       Q16 
0.7849237 0.8987801 0.7828362 0.8003019 0.8780396 0.8987789 0.8923352 0.8700809 
      Q17       Q18       Q19       Q20       Q21       Q22       Q23       Q24 
0.8337139 0.8732333 0.8918975 0.9040908 0.9081967 0.9105020 0.8891574 0.8591331 
      Q25       Q26       Q27       Q28 
0.8983235 0.8575586 0.8059180 0.8634799

Since the communality score of every variable is above 0.75, it can be said that all the variables are quite important in this analysis. The next thing to do is to identify the factors properly. This is somewhat tricky in many situations because exploratory factor analysis often combines variables without any meaningful logical explanation. But in this case, identifying the factors is comparatively easy: the first factor relates to the instructor and the second factor relates to the course content. Hence, I can say that the first factor is "Instructor Evaluation" and the second factor is "Course Evaluation".
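As mentioned earlier, the five-factor solution can be obtained for comparison with a call of the following form (a minimal sketch using the same studentSurvey data frame); only the loadings on the first two factors come out as substantial.

fit2=fa(studentSurvey[,c(6:33)],fm="pa",nfactors=5,rotate="varimax")
print(fit2$loadings,cutoff=0.4) #suppress small loadings to make the comparison easier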

SAS has PROC FACTOR, which does the same analysis and produces report-ready output. The code is given below.
proc factor data=subhasis.turkiyestudentdata 
   method=principal
   nfactors=5
   plots=scree
   rotate=varimax;
var Q1--Q28;
run;
The number of factors extracted by this code is 5 so that readers can compare the two-factor model with the five-factor model. The output is not shown in this post; readers are requested to run the code themselves. The scree plot can also be seen in that output.

This is all about factor analysis in a very compact format.

Comments are welcome.

Sunday, 5 October 2014

Cluster Analysis with R

Introduction

In the previous two posts I used Rapidminer and SAS to cluster a dataset. In today's post I shall explain how cluster analysis can be done using R. There is always a concern about the optimum number of clusters, and almost every statistical software package offers some index or other to determine it. R, thanks to extensive community support, has many packages which make it very strong analytics software. As far as clustering is concerned, R is capable of calculating 30 different indices to determine the optimum number of clusters. In this post, I shall demonstrate some of the capabilities of R by clustering the same wholesale dataset.

Clustering with R

R performs analyses through functions, and those functions are parts of different libraries. I shall start with the kmeans() function. But before that, two variables are removed from the analysis, i.e. Channel and Region (because they are nominal in nature). Strictly speaking, there is no single "K-Means" algorithm: it is either the MacQueen algorithm, the Lloyd-Forgy algorithm or the Hartigan-Wong algorithm. In most cases the MacQueen algorithm is used, but R uses the Hartigan-Wong algorithm by default. It has been found that the Hartigan-Wong algorithm performs better than the other two in most situations, and it performs better still with more than one random start. Hence, the 'nstart' option is set to 5. The code is given below.
wholesale=read.csv(file.choose()) #point to the downloaded .csv file
WSdata=wholesale[,-c(1,2)] #drop Channel and Region (nominal variables)
cmodel=kmeans(WSdata,centers=3,nstart=5) #Hartigan-Wong K-Means with 3 clusters and 5 random starts
The kmeans() function also calculates the within-cluster sum of squares (WSS), which can be used to determine the optimum number of clusters by varying the 'centers' value and plotting WSS against the number of clusters. The elbow point of this plot suggests the optimum value. To generate the plot, WSS values are calculated for 10 runs with different numbers of cluster centers. The code is given below.
wss=(nrow(WSdata)-1)*sum(apply(WSdata,2,var)) #WSS for a single cluster (total sum of squares)
for(i in 2:10){
  wss[i]=sum(kmeans(WSdata,centers=i,nstart=5)$withinss) #total WSS for i clusters
}
plot(1:10,wss,xlab="Number of Clusters",ylab="Within Sum of Squares",type="b")

#The code is borrowed from statmethods.net and it can be seen here.
However, the plot generated is rather smooth with no clear elbow. The plot is given below.
Hence, some other method is required to decide the optimum number of clusters. R has several libraries for measuring cluster validity, such as "clusterSim", "cclust", "clv", "clValid" and "NbClust". Out of these, the "NbClust" library can calculate all 30 indices of cluster validity while varying the number of cluster centers, and voting can then be used to determine the optimum cluster number if different indices suggest different values. The following one-line code gives the optimum number of clusters for the dataset.
> library(NbClust)
> nClust=NbClust(WSdata,distance="euclidean",method="kmeans",min.nc=2,max.nc=10)
*** : The Hubert index is a graphical method of determining the number of clusters.
                In the plot of Hubert index, we seek a significant knee that corresponds to a 
                significant increase of the value of the measure i.e the significant peak in Hubert
                index second differences plot. 
 
*** : The D index is a graphical method of determining the number of clusters. 
                In the plot of D index, we seek a significant knee (the significant peak in Dindex
                second differences plot) that corresponds to a significant increase of the value of
                the measure. 
 
All 440 observations were used. 
 
******************************************************************* 
* Among all indices:                                                
* 1 proposed 2 as the best number of clusters 
* 11 proposed 3 as the best number of clusters 
* 2 proposed 4 as the best number of clusters 
* 1 proposed 5 as the best number of clusters 
* 1 proposed 7 as the best number of clusters 
* 4 proposed 8 as the best number of clusters 
* 3 proposed 10 as the best number of clusters 

                   ***** Conclusion *****                            
 
* According to the majority rule, the best number of clusters is  3 
 
 
******************************************************************* 
I have intentionally not shown the individual index values; otherwise the output would have become very crowded. Readers can view them simply by typing 'nClust' and pressing Enter. The cluster centers and the cluster sizes are given below.
> fit=kmeans(WSdata,3,nstart=5)
> fit$centers
     Fresh      Milk   Grocery   Frozen Detergents_Paper Delicassen
1  8253.47  3824.603  5280.455 2572.661         1773.058   1137.497
2  8000.04 18511.420 27573.900 1996.680        12407.360   2252.020
3 35941.40  6044.450  6288.617 6713.967         1039.667   3049.467
> fit$size
[1] 330  50  60
Thus, the first cluster has 330 customers, the second has 50 customers and the third has 60 customers. We have seen that Rapidminer's X-Means clustering suggested 4 clusters, SAS suggested 5 clusters and R suggests 3 clusters. So, which one should be considered? It has to be understood that K-Means clustering is affected by outliers (or extreme values), and this dataset contains extreme values. Hence, it would be better to isolate those extreme cases and then run the K-Means clustering. Outlier detection will be dealt with separately in another post; this post is meant only to show how to run cluster analysis using R.
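As an illustration only (a proper treatment of outlier detection is deferred to that later post), one simple way to set aside extreme cases before re-running K-Means is to drop observations that exceed, say, the 99th percentile on any variable.

caps=apply(WSdata,2,quantile,probs=0.99) #99th percentile of each variable
extreme=apply(WSdata,1,function(row) any(row>caps)) #flag rows exceeding any cap
WStrimmed=WSdata[!extreme,]
fitTrimmed=kmeans(WStrimmed,centers=3,nstart=5) #re-run K-Means on the trimmed data
fitTrimmed$size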

Feel free to add comments/suggestions.

Sunday, 28 September 2014

Cluster Analysis with SAS

Introduction

In my previous post I described how to do cluster analysis using Rapidminer. A closer look at the various modeling techniques available in Rapidminer will show you how capable this software is as far as data mining is concerned. Moreover, it is menu driven and users are not required to write any code to perform an analysis, although advanced users can write code in one of the operators to do even more sophisticated analyses. It is worth exploring the capabilities of Rapidminer and I shall give many demonstrations of it in future posts. R and SAS are also very capable of analyzing data, and R is superior to Rapidminer mainly due to the active community supporting it in many ways; in graphics capabilities particularly, R is definitely a winner. In this post, I shall show how clustering can be done using SAS.

Cluster Analysis using SAS

I am using SAS University Edition, which is free software for educational purposes with limited features. The good part is that it is fully equipped with the Base SAS and SAS/STAT modules, which are essential for statistical analyses. It is also possible to use R from within SAS through the SAS/IML module, though I am not sure whether that can be done in the University Edition. Hence, I shall use the built-in functionality of SAS/STAT for the cluster analysis. SAS can do cluster analysis using three different procedures, i.e. PROC CLUSTER, PROC FASTCLUS and PROC VARCLUS. PROC CLUSTER performs hierarchical clustering, PROC FASTCLUS performs K-Means clustering, and PROC VARCLUS is a special type of clustering in which (by default) principal component analysis is used to cluster variables. It is argued that the VARCLUS algorithm is often better than simple PCA for dimension reduction as well as for the interpretability of each dimension. I am not going to deal with that type of clustering for now; PROC CLUSTER and PROC FASTCLUS will be used for the demonstration.

For this demonstration, I am using the same wholesale data that I used in my previous post. SAS University Edition doesn't have a dedicated procedure for optimizing the number of clusters; deciding on the optimum number rests with the analyst. Hierarchical clustering can suggest the possible number of clusters based on certain criteria, but once the number of cases grows beyond about 200 the output becomes cumbersome. Hence, a two-step method can be adopted: in the first step, generate (say) 100 clusters using K-Means, and then run hierarchical clustering on the output dataset to decide the optimum number of clusters. This approach, even though it sounds logical, is different from approaches such as AIC and BIC, and the results will therefore also differ from those approaches. Moreover, K-Means clustering depends on an initial seed from which pseudo-random numbers are generated. SAS has a mechanism to fix this seed so that the output does not change if the algorithm is run multiple times. If the seed is not fixed in R or Rapidminer, however, simple K-Means clustering will produce different results each time the algorithm is run. Readers can verify this by manually setting different seeds for simple K-Means clustering while working with Rapidminer; the R equivalent is sketched below.
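For reference, fixing the seed in R is a one-line affair (a minimal sketch, assuming the wholesale data are already loaded as WSdata, as in the R post above; the seed value itself is arbitrary):

set.seed(123) #any fixed seed makes the K-Means result reproducible across runs
kmeans(WSdata,centers=3,nstart=5)$centers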

To run the cluster analysis in SAS, I shall first run PROC FASTCLUS to generate 30 clusters using the code given below. The OUTSEED= option captures the centroid values of the individual clusters, and these same centroid values are then used in the subsequent hierarchical clustering (PROC CLUSTER). The MAXC= option specifies 30 clusters, and the CCC and PSEUDO options are included so that, apart from the dendrogram, a few more important graphs are generated for proper identification of the number of clusters.

proc fastclus data=subhasis.wholesale(drop=region channel)
  outseed=clustmeans
  out=newdata 
  maxc=30;
run;

proc cluster data=clustmeans(drop=_crit_ cluster _RMSSTD_ _freq_ _radius_ _NEAR_ _GAP_) 
  ccc
  method=ward 
  pseudo;
run;

PROC CLUSTER gives a detailed output with a dendrogram and three different graphs. The graphs are shown below.



The pseudo-F graph shows a peak at 3 clusters, after which its value remains almost constant. The pseudo-T-squared graph shows that there are two possible numbers of clusters, i.e. 3 and 5 (moving from right to left, there are sudden jumps in value after 3 and 5 clusters). The dendrogram also suggests 3 and 5 clusters. Hence, I can now run cluster analysis using PROC FASTCLUS with the MAXC= option set to 5 and to 3. The outputs are shown below; the first one is for MAXC=5 and the second one for MAXC=3. Since the CCC value of the 3-cluster solution is below 2, I am going to accept the 5-cluster solution as more appropriate. The centroid values of each cluster are given along with the number of cases in each cluster (as the frequency), and looking at these values the clusters can be identified by their specific characteristics. It should be kept in mind that cluster analysis is affected by the variability existing within each variable; a better option is to standardize the variables using PROC STDIZE before running the same analysis. In my next post I shall show how to run cluster analysis using R.

The FASTCLUS Procedure
Replace=FULL Radius=0 Maxclusters=5 Maxiter=1000 Converge=0

Initial Seeds
Cluster Fresh Milk Grocery Frozen Detergents_Paper Delicassen
1 112151.0000 29627.0000 18148.0000 16745.0000 4948.0000 8550.0000
2 32717.0000 16784.0000 13626.0000 60869.0000 1272.0000 5609.0000
3 8565.0000 4980.0000 67298.0000 131.0000 38102.0000 1215.0000
4 22925.0000 73498.0000 32114.0000 987.0000 20070.0000 903.0000
5 190.0000 727.0000 2012.0000 245.0000 184.0000 127.0000

Minimum Distance Between Initial Seeds = 71813.81

Iteration History
Iteration  Criterion  Relative Change in Cluster Seeds (clusters 1-5)
1 8376.5 0.4524 0.3885 0.3772 0.3175 0.1843
2 6010.8 0.2631 0.2443 0.1317 0 0.00489
3 5733.4 0.1768 0 0.0805 0 0.00867
4 5487.4 0.1057 0 0.1180 0.1083 0.00901
5 5253.4 0.0796 0 0.1429 0.1609 0.0103
6 4956.7 0.0437 0 0.0663 0.1385 0.0110
7 4829.0 0.0343 0 0.0350 0 0.00789
8 4780.8 0.0179 0 0.0168 0 0.00366
9 4769.4 0.0112 0 0.0142 0 0.00402
10 4760.7 0.00359 0 0.0188 0.1509 0.00189
11 4729.1 0 0 0.0114 0 0.00304
12 4724.5 0 0 0.00701 0 0.00202
13 4722.6 0 0 0.00522 0 0.00153
14 4721.9 0 0 0.00186 0 0.000540
15 4721.8 0 0 0 0 0

Convergence Status

Convergence criterion is satisfied.

Criterion Based on Final Seeds = 4721.8

Cluster Summary
Cluster  Frequency  RMS Std Deviation  Max Distance from Seed to Observation  Nearest Cluster  Distance Between Cluster Centroids
1        58         7234.9             82064.3                                5                27453.6
2        2          16287.3            28210.4                                1                56957.4
3        84         5125.8             35772.7                                5                19336.8
4        5          15387.3            43353.9                                3                60396.8
5        291        3440.8             32497.8                                3                19336.8

Statistics for Variables
Variable Total STD Within STD R-Square RSQ/(1-RSQ)
Fresh 12647 8024 0.601115 1.506989
Milk 7380 4817 0.577847 1.368810
Grocery 9503 4857 0.741208 2.864102
Frozen 4855 3620 0.449000 0.814883
Detergents_Paper 4768 2489 0.729897 2.702298
Delicassen 2820 2197 0.398673 0.662990
OVER-ALL 7735 4749 0.626511 1.677456

Pseudo F Statistic = 182.42

Approximate Expected Over-All R-Squared = 0.59202

Cubic Clustering Criterion = 4.249
WARNING: The two values above are invalid for correlated variables.

Cluster Means
Cluster Fresh Milk Grocery Frozen Detergents_Paper Delicassen
1 35981.37931 5205.74138 5922.77586 5266.12069 1049.46552 2231.31034
2 34782.00000 30367.00000 16898.00000 48701.50000 755.50000 26776.00000
3 5176.25000 12308.75000 19113.21429 1655.05952 8426.45238 1980.71429
4 25603.00000 43460.60000 61472.20000 2636.00000 29974.20000 2708.80000
5 8800.09278 3218.04811 4152.48797 2737.48110 1195.13402 1058.59450

Cluster Standard Deviations
Cluster Fresh Milk Grocery Frozen Detergents_Paper Delicassen
1 15493.47236 4856.00030 4209.33955 5031.80137 1320.21701 2378.04033
2 2920.35101 19209.26282 4627.30678 17207.44352 730.44130 29934.65847
3 5327.61431 6741.08773 7354.66199 1769.85611 4452.44466 2602.16641
4 14578.72606 25164.55689 21876.69411 3100.38570 9032.28303 2243.61855
5 6190.24468 2676.09417 3121.02561 3554.05652 1466.72710 1015.19034


The FASTCLUS Procedure
Replace=FULL Radius=0 Maxclusters=3 Maxiter=1000 Converge=0

Initial Seeds
Cluster Fresh Milk Grocery Frozen Detergents_Paper Delicassen
1 112151.0000 29627.0000 18148.0000 16745.0000 4948.0000 8550.0000
2 680.0000 1610.0000 223.0000 862.0000 96.0000 379.0000
3 16117.0000 46197.0000 92780.0000 1026.0000 40827.0000 2944.0000

Minimum Distance Between Initial Seeds = 111618.6

Iteration History
Iteration  Criterion  Relative Change in Cluster Seeds (clusters 1-3)
1 8812.7 0.2911 0.1232 0.3101
2 6419.3 0.1693 0.00386 0.0869
3 6167.3 0.1340 0.00687 0.0614
4 5894.0 0.0720 0.00683 0.0635
5 5742.2 0.0458 0.00595 0.0443
6 5664.8 0.0275 0.00408 0.0355
7 5609.2 0.0215 0.00405 0.0359
8 5571.1 0.00680 0.00210 0.0225
9 5560.8 0.00237 0.000908 0.0130
10 5555.0 0.00240 0.00194 0.0243
11 5540.3 0.00219 0.00134 0.0136
12 5535.6 0 0.000907 0.00827
13 5534.0 0 0.000890 0.00787
14 5531.9 0 0.000932 0.00769
15 5528.4 0 0.00178 0.0131
16 5520.3 0 0.00139 0.00903
17 5517.8 0.00219 0.000957 0.00568
18 5516.7 0.00240 0.000836 0.00302
19 5516.2 0 0 0

Convergence Status

Convergence criterion is satisfied.

Criterion Based on Final Seeds = 5516.2

Cluster Summary
Cluster  Frequency  RMS Std Deviation  Max Distance from Seed to Observation  Nearest Cluster  Distance Between Cluster Centroids
1        60         8531.3             81552.7                                2                28176.4
2        330        3778.6             32828.5                                1                28176.4
3        50         9473.0             76767.9                                2                28765.0

Statistics for Variables
Variable Total STD Within STD R-Square RSQ/(1-RSQ)
Fresh 12647 8340 0.567123 1.310122
Milk 7380 5769 0.391861 0.644362
Grocery 9503 6395 0.549164 1.218100
Frozen 4855 4640 0.090461 0.099458
Detergents_Paper 4768 3326 0.515664 1.064684
Delicassen 2820 2738 0.061701 0.065758
OVER-ALL 7735 5535 0.490263 0.961797

Pseudo F Statistic = 210.15

Approximate Expected Over-All R-Squared = 0.47789

Cubic Clustering Criterion = 1.217
WARNING: The two values above are invalid for correlated variables.

Cluster Means
Cluster Fresh Milk Grocery Frozen Detergents_Paper Delicassen
1 35941.40000 6044.45000 6288.61667 6713.96667 1039.66667 3049.46667
2 8253.46970 3824.60303 5280.45455 2572.66061 1773.05758 1137.49697
3 8000.04000 18511.42000 27573.90000 1996.68000 12407.36000 2252.02000

Cluster Standard Deviations
Cluster Fresh Milk Grocery Frozen Detergents_Paper Delicassen
1 15234.89953 7055.55417 4629.03408 9555.16491 1302.21502 6355.49128
2 6194.18203 3191.95806 4370.72957 3404.70887 2185.47863 1280.03870
3 9124.63123 12977.91274 14515.78198 2069.22587 8033.07822 2686.83738

