Sunday, 28 September 2014

Cluster Analysis with SAS

Introduction

In my previous post I described how to do cluster analysis using Rapidminer. A closer look at the modeling techniques available in Rapidminer shows how capable the software is as far as data mining is concerned. Moreover, it is menu driven, so users are not required to write any code to do an analysis. However, advanced users can write code in one of the operators to do even more sophisticated analysis. It is worth exploring the capabilities of Rapidminer, and I shall give many demonstrations of it in future posts. R and SAS are also very capable at analyzing data, and R is superior to Rapidminer mainly because of the active community that supports R in many ways. In graphics capabilities in particular, R is definitely the winner. In this post, I shall show how clustering can be done using SAS.

Cluster Analysis using SAS

I am using SAS University Edition, which is free software for educational purposes with limited features. The good part is that it is fully equipped with the Base SAS and SAS/STAT modules, which are essential for doing statistical analyses. It is also possible to use R from within SAS using the SAS/IML module; however, I am not sure whether the same can be achieved in the University Edition. Hence, I shall use the built-in functionality of SAS/STAT for cluster analysis. SAS can do cluster analysis using three different procedures: PROC CLUSTER, PROC FASTCLUS and PROC VARCLUS. PROC CLUSTER performs hierarchical clustering, PROC FASTCLUS performs K-Means clustering, and PROC VARCLUS is a special type of clustering where (by default) Principal Component Analysis (PCA) is used to cluster variables rather than observations. It is argued that the VARCLUS algorithm is often better than simple PCA for dimension reduction because each dimension is easier to interpret. I am not going to deal with this type of clustering for now. PROC CLUSTER and PROC FASTCLUS are going to be used for demonstration.

For this demonstration, I am using the same wholesale data that I used in my previous post. SAS University Edition doesn't have a dedicated procedure for optimizing the number of clusters, so deciding on the optimum number of clusters rests with the analyst. Hierarchical clustering can suggest the possible number of clusters based on certain criteria, but once the number of cases goes beyond about 200, the output becomes cumbersome. Hence, a two-step method can be adopted. In the first step, generate (say) 100 clusters using K-Means clustering; in the second step, run a hierarchical clustering on the output dataset of cluster centroids to decide the optimum number of clusters. This approach, even though it sounds logical, is different from other approaches such as AIC and BIC, and because of this its outputs are also going to vary when compared to those approaches. Moreover, K-Means clustering depends on the initial seeds, which are often chosen using pseudo-random numbers. SAS has a mechanism to fix the random seed so that the output does not change if the algorithm is run multiple times. However, if the seed is not fixed in R and Rapidminer, simple K-Means clustering can produce different results each time the algorithm is run. Readers can verify this by manually setting different seeds in simple K-Means clustering while working with Rapidminer.
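As an illustration of fixing the seed, here is a minimal sketch. As far as I know, PROC FASTCLUS picks its initial seeds from the data in a deterministic way by default and draws them at random only when REPLACE=RANDOM is requested; in that case the RANDOM= option fixes the starting value of the random number generator. The seed value 12345 and the output dataset name clus_random are arbitrary choices of mine for illustration, not part of the analysis in this post.

```sas
/* Sketch: K-Means with randomly chosen initial seeds that are reproducible. */
/* replace=random asks for a random sample of observations as initial seeds; */
/* random=12345 (an assumed, arbitrary value) fixes the generator so that    */
/* rerunning the step gives the same clusters.                               */
proc fastclus data=subhasis.wholesale(drop=region channel)
  maxclusters=5
  replace=random
  random=12345
  out=clus_random;   /* hypothetical output dataset name */
run;
```

Changing the RANDOM= value and rerunning is a quick way to see how much the solution depends on the starting seeds.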

To run the cluster analysis in SAS, I first run PROC FASTCLUS to generate 30 clusters using the code given below. The OUTSEED= option captures the centroid values of the individual clusters, and the same centroid values are then used in the subsequent hierarchical clustering (PROC CLUSTER). The MAXC= option specifies 30 clusters. In PROC CLUSTER, the CCC and PSEUDO options are specified so that, apart from the dendrogram, a few more important graphs are generated to help identify the proper number of clusters.

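/* Step 1: K-Means with 30 clusters; OUTSEED= saves the cluster centroids */
proc fastclus data=subhasis.wholesale(drop=region channel)
  outseed=clustmeans
  out=newdata
  maxc=30;
run;

/* Step 2: Ward's hierarchical clustering on the 30 centroids; the summary
   variables that FASTCLUS adds to the OUTSEED= dataset are dropped */
proc cluster data=clustmeans(drop=_crit_ cluster _RMSSTD_ _freq_ _radius_ _NEAR_ _GAP_)
  ccc
  method=ward
  pseudo;
run;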

PROC CLUSTER gives a detailed output with a dendrogram and 3 different graphs. The graphs are shown below.



The Pseudo-F graph shows a peak at 3 clusters, after which its value remains almost constant. The Pseudo-T square graph suggests two possible numbers of clusters, 3 and 5 (moving from right to left, there are sudden jumps in value after 3 and 5 clusters). The dendrogram also suggests 3 and 5 clusters. Hence, I can now run the cluster analysis using PROC FASTCLUS with the MAXC= option set to 3 and 5 respectively. The outputs are shown below; the first one is for MAXC=5 and the second one is for MAXC=3. Since the CCC value of the 3-cluster solution (1.217) is below 2, while that of the 5-cluster solution (4.249) is above it, I am going to accept the 5-cluster solution as more appropriate. The centroid values of each cluster are given along with the number of cases in each cluster (as frequency). Looking at these values, the clusters can be identified by their specific characteristics (based on centroid values). It is to be kept in mind that cluster analysis is affected by the variability within each variable, so variables with larger variances dominate the distance calculations. A better option is to standardize the variables using PROC STDIZE before running the same analysis. In my next post I shall show how to run cluster analysis using R.
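The final runs and the standardization step mentioned above can be sketched as follows. The MAXITER=1000 and CONVERGE=0 settings are taken from the header of the outputs below; the output dataset names (clus5, clus3, wholesale_std, clus5_std) are hypothetical names I am using for illustration, and the variable names in the VAR statement are read off the output tables.

```sas
/* Five-cluster solution; MAXITER= and CONVERGE= mirror the output header */
proc fastclus data=subhasis.wholesale(drop=region channel)
  maxclusters=5 maxiter=1000 converge=0
  out=clus5;
run;

/* Three-cluster solution with the same settings */
proc fastclus data=subhasis.wholesale(drop=region channel)
  maxclusters=3 maxiter=1000 converge=0
  out=clus3;
run;

/* Optional: standardize each variable to mean 0 and standard deviation 1 */
/* so that no single high-variance variable dominates the distances.      */
proc stdize data=subhasis.wholesale(drop=region channel)
  method=std
  out=wholesale_std;
  var Fresh Milk Grocery Frozen Detergents_Paper Delicassen;
run;

/* Rerun the five-cluster solution on the standardized data */
proc fastclus data=wholesale_std
  maxclusters=5 maxiter=1000 converge=0
  out=clus5_std;
run;
```

The outputs in this post come from the unstandardized data, so the standardized run should be expected to give different clusters.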

Results: Cluster Analysis.sas

The SAS System
The FASTCLUS Procedure
Replace=FULL Radius=0 Maxclusters=5 Maxiter=1000 Converge=0

Initial Seeds

Cluster Fresh Milk Grocery Frozen Detergents_Paper Delicassen
1 112151.0000 29627.0000 18148.0000 16745.0000 4948.0000 8550.0000
2 32717.0000 16784.0000 13626.0000 60869.0000 1272.0000 5609.0000
3 8565.0000 4980.0000 67298.0000 131.0000 38102.0000 1215.0000
4 22925.0000 73498.0000 32114.0000 987.0000 20070.0000 903.0000
5 190.0000 727.0000 2012.0000 245.0000 184.0000 127.0000

Minimum Distance

Minimum Distance Between Initial Seeds = 71813.81

Iteration History

Relative Change in Cluster Seeds (by cluster)
Iteration Criterion 1 2 3 4 5
1 8376.5 0.4524 0.3885 0.3772 0.3175 0.1843
2 6010.8 0.2631 0.2443 0.1317 0 0.00489
3 5733.4 0.1768 0 0.0805 0 0.00867
4 5487.4 0.1057 0 0.1180 0.1083 0.00901
5 5253.4 0.0796 0 0.1429 0.1609 0.0103
6 4956.7 0.0437 0 0.0663 0.1385 0.0110
7 4829.0 0.0343 0 0.0350 0 0.00789
8 4780.8 0.0179 0 0.0168 0 0.00366
9 4769.4 0.0112 0 0.0142 0 0.00402
10 4760.7 0.00359 0 0.0188 0.1509 0.00189
11 4729.1 0 0 0.0114 0 0.00304
12 4724.5 0 0 0.00701 0 0.00202
13 4722.6 0 0 0.00522 0 0.00153
14 4721.9 0 0 0.00186 0 0.000540
15 4721.8 0 0 0 0 0

Convergence Status

Convergence criterion is satisfied.

Criterion

Criterion Based on Final Seeds = 4721.8

Cluster Summary

                               Maximum Distance                     Distance Between
                      RMS Std         from Seed    Radius  Nearest           Cluster
Cluster  Frequency  Deviation    to Observation  Exceeded  Cluster         Centroids
      1         58     7234.9           82064.3                  5           27453.6
      2          2    16287.3           28210.4                  1           56957.4
      3         84     5125.8           35772.7                  5           19336.8
      4          5    15387.3           43353.9                  3           60396.8
      5        291     3440.8           32497.8                  3           19336.8

Statistics for Variables

Variable Total STD Within STD R-Square RSQ/(1-RSQ)
Fresh 12647 8024 0.601115 1.506989
Milk 7380 4817 0.577847 1.368810
Grocery 9503 4857 0.741208 2.864102
Frozen 4855 3620 0.449000 0.814883
Detergents_Paper 4768 2489 0.729897 2.702298
Delicassen 2820 2197 0.398673 0.662990
OVER-ALL 7735 4749 0.626511 1.677456

Pseudo F Statistic

Pseudo F Statistic = 182.42

Approximate Expected Over-All R-Squared

Approximate Expected Over-All R-Squared = 0.59202

Cubic Clustering Criterion

Cubic Clustering Criterion = 4.249
WARNING: The two values above are invalid for correlated variables.

Cluster Means

Cluster Fresh Milk Grocery Frozen Detergents_Paper Delicassen
1 35981.37931 5205.74138 5922.77586 5266.12069 1049.46552 2231.31034
2 34782.00000 30367.00000 16898.00000 48701.50000 755.50000 26776.00000
3 5176.25000 12308.75000 19113.21429 1655.05952 8426.45238 1980.71429
4 25603.00000 43460.60000 61472.20000 2636.00000 29974.20000 2708.80000
5 8800.09278 3218.04811 4152.48797 2737.48110 1195.13402 1058.59450

Cluster Standard Deviations

Cluster Fresh Milk Grocery Frozen Detergents_Paper Delicassen
1 15493.47236 4856.00030 4209.33955 5031.80137 1320.21701 2378.04033
2 2920.35101 19209.26282 4627.30678 17207.44352 730.44130 29934.65847
3 5327.61431 6741.08773 7354.66199 1769.85611 4452.44466 2602.16641
4 14578.72606 25164.55689 21876.69411 3100.38570 9032.28303 2243.61855
5 6190.24468 2676.09417 3121.02561 3554.05652 1466.72710 1015.19034


Results: Cluster Analysis.sas

The SAS System
The FASTCLUS Procedure
Replace=FULL Radius=0 Maxclusters=3 Maxiter=1000 Converge=0

Initial Seeds

Cluster Fresh Milk Grocery Frozen Detergents_Paper Delicassen
1 112151.0000 29627.0000 18148.0000 16745.0000 4948.0000 8550.0000
2 680.0000 1610.0000 223.0000 862.0000 96.0000 379.0000
3 16117.0000 46197.0000 92780.0000 1026.0000 40827.0000 2944.0000

Minimum Distance

Minimum Distance Between Initial Seeds = 111618.6

Iteration History

Relative Change in Cluster Seeds (by cluster)
Iteration Criterion 1 2 3
1 8812.7 0.2911 0.1232 0.3101
2 6419.3 0.1693 0.00386 0.0869
3 6167.3 0.1340 0.00687 0.0614
4 5894.0 0.0720 0.00683 0.0635
5 5742.2 0.0458 0.00595 0.0443
6 5664.8 0.0275 0.00408 0.0355
7 5609.2 0.0215 0.00405 0.0359
8 5571.1 0.00680 0.00210 0.0225
9 5560.8 0.00237 0.000908 0.0130
10 5555.0 0.00240 0.00194 0.0243
11 5540.3 0.00219 0.00134 0.0136
12 5535.6 0 0.000907 0.00827
13 5534.0 0 0.000890 0.00787
14 5531.9 0 0.000932 0.00769
15 5528.4 0 0.00178 0.0131
16 5520.3 0 0.00139 0.00903
17 5517.8 0.00219 0.000957 0.00568
18 5516.7 0.00240 0.000836 0.00302
19 5516.2 0 0 0

Convergence Status

Convergence criterion is satisfied.

Criterion

Criterion Based on Final Seeds = 5516.2

Cluster Summary

                               Maximum Distance                     Distance Between
                      RMS Std         from Seed    Radius  Nearest           Cluster
Cluster  Frequency  Deviation    to Observation  Exceeded  Cluster         Centroids
      1         60     8531.3           81552.7                  2           28176.4
      2        330     3778.6           32828.5                  1           28176.4
      3         50     9473.0           76767.9                  2           28765.0

Statistics for Variables

Variable Total STD Within STD R-Square RSQ/(1-RSQ)
Fresh 12647 8340 0.567123 1.310122
Milk 7380 5769 0.391861 0.644362
Grocery 9503 6395 0.549164 1.218100
Frozen 4855 4640 0.090461 0.099458
Detergents_Paper 4768 3326 0.515664 1.064684
Delicassen 2820 2738 0.061701 0.065758
OVER-ALL 7735 5535 0.490263 0.961797

Pseudo F Statistic

Pseudo F Statistic = 210.15

Approximate Expected Over-All R-Squared

Approximate Expected Over-All R-Squared = 0.47789

Cubic Clustering Criterion

Cubic Clustering Criterion = 1.217
WARNING: The two values above are invalid for correlated variables.

Cluster Means

Cluster Fresh Milk Grocery Frozen Detergents_Paper Delicassen
1 35941.40000 6044.45000 6288.61667 6713.96667 1039.66667 3049.46667
2 8253.46970 3824.60303 5280.45455 2572.66061 1773.05758 1137.49697
3 8000.04000 18511.42000 27573.90000 1996.68000 12407.36000 2252.02000

Cluster Standard Deviations

Cluster Fresh Milk Grocery Frozen Detergents_Paper Delicassen
1 15234.89953 7055.55417 4629.03408 9555.16491 1302.21502 6355.49128
2 6194.18203 3191.95806 4370.72957 3404.70887 2185.47863 1280.03870
3 9124.63123 12977.91274 14515.78198 2069.22587 8033.07822 2686.83738
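
The Pseudo F Statistic reported in the two outputs can be reproduced from the over-all R-Square. The sketch below assumes the usual definition, pseudo F = (R-Square / (k - 1)) / ((1 - R-Square) / (n - k)), where k is the number of clusters and n = 440 is the number of cases (the sum of the cluster frequencies in either Cluster Summary). The dataset and variable names are my own, chosen for illustration.

```sas
/* Recompute the pseudo F statistic from the OVER-ALL R-Square values */
data pseudo_f_check;
  input k n rsq;
  /* between-cluster share per degree of freedom divided by the
     within-cluster share per degree of freedom */
  pseudo_f = (rsq / (k - 1)) / ((1 - rsq) / (n - k));
  datalines;
5 440 0.626511
3 440 0.490263
;
run;

proc print data=pseudo_f_check noobs;
  format pseudo_f 8.2;
run;
```

The two values should come out close to the 182.42 and 210.15 printed by PROC FASTCLUS, which is a quick check that the tables have been read correctly.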


Maximum Likelihood Estimation (MLE): An important statistical tool for parameter estimation

Parameter estimation is critical for learning patterns within the data. Before the advancements in computation power, researchers used to do...