Introduction
In my previous post I described how to do cluster analysis using Rapidminer. A closer look at the modeling techniques available in Rapidminer shows how capable the software is as far as data mining is concerned. Moreover, it is menu driven, so users are not required to write any code to run an analysis. However, advanced users can write code in one of the operators to do even more sophisticated analysis. It is worth exploring the capabilities of Rapidminer, and I shall give many demonstrations in future posts. R and SAS are also very capable of analyzing data, and R is superior to Rapidminer mainly because of the active community supporting R in many ways. In graphics capabilities in particular, R is definitely the winner. In this post, I shall show how clustering can be done using SAS.
Cluster Analysis using SAS
I am using SAS University Edition, which is free software for educational purposes with limited features. The good part is that it is fully equipped with the Base SAS and SAS/STAT modules, which are essential for statistical analysis. It is also possible to use R from within SAS through the SAS/IML module, but I am not sure whether the same can be achieved in the University Edition. Hence, I shall use the built-in functionality of SAS/STAT for cluster analysis. SAS can do cluster analysis using three different procedures: PROC CLUSTER, PROC FASTCLUS and PROC VARCLUS. PROC CLUSTER does hierarchical clustering, PROC FASTCLUS does K-Means clustering, and PROC VARCLUS is a special type of clustering that clusters variables, by default using Principal Component Analysis (PCA). It is argued that the VARCLUS algorithm is often better than simple PCA for dimension reduction because each dimension stays interpretable. I am not going to deal with that procedure for now; PROC CLUSTER and PROC FASTCLUS will be used for the demonstration.
For this demonstration, I am using the same wholesale data that I used in my previous post. SAS University Edition doesn't have a dedicated procedure for optimizing the number of clusters, so deciding on the optimum number rests with the analyst. Hierarchical clustering can suggest the possible number of clusters based on certain criteria, but once the number of cases goes beyond about 200, the output becomes cumbersome. Hence, a two-step method can be adopted. In the first step, generate (say) 100 clusters using K-Means clustering; in the second step, run hierarchical clustering on the output dataset of cluster centroids to decide the optimum number of clusters. This approach, even though it sounds logical, is different from approaches based on criteria such as AIC and BIC, so its results will also vary when compared with those approaches. Moreover, K-Means clustering depends on the initial seed from which pseudo-random numbers are generated. SAS has a mechanism to fix this seed so that the output does not change if the algorithm is run multiple times. However, if the seed is not fixed in R or Rapidminer, simple K-Means clustering can produce different results each time the algorithm is run. Readers can verify this by manually setting different seeds in simple K-Means clustering while working with Rapidminer.
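As a minimal sketch of what fixing the seed could look like in PROC FASTCLUS, assuming the wholesale dataset used in this post: REPLACE=RANDOM picks the initial cluster seeds pseudo-randomly and RANDOM= fixes the generator seed, so two runs with the same value should give the same clusters. The seed value 12345 and the dataset names run_a and run_b are arbitrary choices made for illustration only.

```sas
/* Sketch: fix the random seed so repeated K-Means runs agree.          */
/* The seed 12345 and the dataset names run_a / run_b are illustrative. */
proc fastclus data=subhasis.wholesale(drop=region channel)
    maxc=5
    replace=random   /* choose the initial seeds pseudo-randomly */
    random=12345     /* fixed seed for the random number generator */
    out=run_a
    noprint;
run;

proc fastclus data=subhasis.wholesale(drop=region channel)
    maxc=5
    replace=random
    random=12345     /* same seed as the first run */
    out=run_b
    noprint;
run;

/* Compare the two output datasets, including the CLUSTER assignments */
proc compare base=run_a compare=run_b;
run;
```

Changing the RANDOM= value in one of the two runs is a quick way to see how much the solution depends on the starting seeds.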
To run cluster analysis using SAS, I shall first run PROC FASTCLUS to generate 30 clusters using the code given below. The OUTSEED= option saves the centroid values of the individual clusters to a dataset, and the same centroid values are then used in the subsequent hierarchical clustering (PROC CLUSTER). The MAXC= option sets the maximum number of clusters to 30. In PROC CLUSTER, the CCC and PSEUDO options are specified so that, apart from the dendrogram, a few more important graphs are generated to help identify the proper number of clusters.
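/* Step 1: K-Means with up to 30 clusters; OUTSEED= saves the cluster centroids */
proc fastclus data=subhasis.wholesale(drop=region channel)
    outseed=clustmeans
    out=newdata
    maxc=30;
run;
/* Step 2: Ward's hierarchical clustering on the saved centroids */
/* CCC and PSEUDO request the CCC, pseudo-F and pseudo-T-square statistics */
proc cluster data=clustmeans(drop=_crit_ cluster _RMSSTD_ _freq_ _radius_ _NEAR_ _GAP_)
    ccc
    method=ward
    pseudo;
run;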
PROC CLUSTER gives a detailed output with a dendrogram and three different graphs. The graphs are shown below.
The pseudo-F graph shows a peak at 3 clusters, after which its value remains almost constant. The pseudo-T-square graph suggests two possible numbers of clusters, 3 and 5 (moving from right to left, there are sudden jumps in the value after 3 and 5 clusters). The dendrogram also suggests 3 and 5 clusters. Hence, I can now run cluster analysis using PROC FASTCLUS with the MAXC= option set to 5 and 3 respectively. The outputs are shown below: the first one is for MAXC=5 and the second one is for MAXC=3. Since the CCC value of the 3-cluster solution is below 2, I am going to accept the 5-cluster solution as more appropriate. The centroid values of each cluster are given along with the number of cases in each cluster (as frequency). Looking at the centroid values, each cluster can be identified by its specific characteristics. Keep in mind that cluster analysis is affected by the variability within each variable, because variables with larger variances dominate the distance calculation. A better option is to standardize the variables using PROC STDIZE before running the same analysis. In my next post I shall show how to run cluster analysis using R.
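The standardization suggested above could look something like the following sketch. It assumes the same subhasis.wholesale dataset; the output dataset names wholesale_std, clus5 and clus3 are made up for illustration. METHOD=STD rescales each variable to mean 0 and standard deviation 1, and the MAXITER= and CONVERGE= values mirror the settings reported in the FASTCLUS output in this post. On standardized data the suggested number of clusters may differ from the 3 and 5 found here, so the two-step check would need to be repeated.

```sas
/* Standardize every numeric variable to mean 0 and standard deviation 1 */
proc stdize data=subhasis.wholesale(drop=region channel)
    out=wholesale_std
    method=std;
run;

/* K-Means on the standardized data with 5 clusters */
proc fastclus data=wholesale_std out=clus5
    maxc=5 maxiter=1000 converge=0;
run;

/* K-Means on the standardized data with 3 clusters */
proc fastclus data=wholesale_std out=clus3
    maxc=3 maxiter=1000 converge=0;
run;
```

Note that the cluster means reported for these runs would be in standardized units rather than the original units of the data.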
Results: Cluster Analysis.sas
The SAS System
The FASTCLUS Procedure
Replace=FULL Radius=0 Maxclusters=5 Maxiter=1000 Converge=0
Initial Seeds

| Cluster | Variable 1 | Variable 2 | Variable 3 | Variable 4 | Variable 5 | Variable 6 |
|---|---|---|---|---|---|---|
| 1 | 112151.0000 | 29627.0000 | 18148.0000 | 16745.0000 | 4948.0000 | 8550.0000 |
| 2 | 32717.0000 | 16784.0000 | 13626.0000 | 60869.0000 | 1272.0000 | 5609.0000 |
| 3 | 8565.0000 | 4980.0000 | 67298.0000 | 131.0000 | 38102.0000 | 1215.0000 |
| 4 | 22925.0000 | 73498.0000 | 32114.0000 | 987.0000 | 20070.0000 | 903.0000 |
| 5 | 190.0000 | 727.0000 | 2012.0000 | 245.0000 | 184.0000 | 127.0000 |
Minimum Distance
Iteration History

| Iteration | Criterion | Rel. Change Seed 1 | Rel. Change Seed 2 | Rel. Change Seed 3 | Rel. Change Seed 4 | Rel. Change Seed 5 |
|---|---|---|---|---|---|---|
| 1 | 8376.5 | 0.4524 | 0.3885 | 0.3772 | 0.3175 | 0.1843 |
| 2 | 6010.8 | 0.2631 | 0.2443 | 0.1317 | 0 | 0.00489 |
| 3 | 5733.4 | 0.1768 | 0 | 0.0805 | 0 | 0.00867 |
| 4 | 5487.4 | 0.1057 | 0 | 0.1180 | 0.1083 | 0.00901 |
| 5 | 5253.4 | 0.0796 | 0 | 0.1429 | 0.1609 | 0.0103 |
| 6 | 4956.7 | 0.0437 | 0 | 0.0663 | 0.1385 | 0.0110 |
| 7 | 4829.0 | 0.0343 | 0 | 0.0350 | 0 | 0.00789 |
| 8 | 4780.8 | 0.0179 | 0 | 0.0168 | 0 | 0.00366 |
| 9 | 4769.4 | 0.0112 | 0 | 0.0142 | 0 | 0.00402 |
| 10 | 4760.7 | 0.00359 | 0 | 0.0188 | 0.1509 | 0.00189 |
| 11 | 4729.1 | 0 | 0 | 0.0114 | 0 | 0.00304 |
| 12 | 4724.5 | 0 | 0 | 0.00701 | 0 | 0.00202 |
| 13 | 4722.6 | 0 | 0 | 0.00522 | 0 | 0.00153 |
| 14 | 4721.9 | 0 | 0 | 0.00186 | 0 | 0.000540 |
| 15 | 4721.8 | 0 | 0 | 0 | 0 | 0 |
Convergence Status
Convergence criterion is satisfied.
Criterion
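Cluster Summary

| Cluster | Frequency | RMS Std Deviation | Maximum Distance from Seed to Observation | Radius Exceeded | Nearest Cluster | Distance Between Cluster Centroids |
|---|---|---|---|---|---|---|
| 1 | 58 | 7234.9 | 82064.3 | | 5 | 27453.6 |
| 2 | 2 | 16287.3 | 28210.4 | | 1 | 56957.4 |
| 3 | 84 | 5125.8 | 35772.7 | | 5 | 19336.8 |
| 4 | 5 | 15387.3 | 43353.9 | | 3 | 60396.8 |
| 5 | 291 | 3440.8 | 32497.8 | | 3 | 19336.8 |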
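Statistics for Variables

| Variable | Total STD | Within STD | R-Square | RSQ/(1-RSQ) |
|---|---|---|---|---|
| Variable 1 | 12647 | 8024 | 0.601115 | 1.506989 |
| Variable 2 | 7380 | 4817 | 0.577847 | 1.368810 |
| Variable 3 | 9503 | 4857 | 0.741208 | 2.864102 |
| Variable 4 | 4855 | 3620 | 0.449000 | 0.814883 |
| Variable 5 | 4768 | 2489 | 0.729897 | 2.702298 |
| Variable 6 | 2820 | 2197 | 0.398673 | 0.662990 |
| OVER-ALL | 7735 | 4749 | 0.626511 | 1.677456 |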
Pseudo F Statistic
Approximate Expected Over-All R-Squared
Cubic Clustering Criterion
WARNING: The two values above are invalid for correlated variables.
Cluster Means

| Cluster | Variable 1 | Variable 2 | Variable 3 | Variable 4 | Variable 5 | Variable 6 |
|---|---|---|---|---|---|---|
| 1 | 35981.37931 | 5205.74138 | 5922.77586 | 5266.12069 | 1049.46552 | 2231.31034 |
| 2 | 34782.00000 | 30367.00000 | 16898.00000 | 48701.50000 | 755.50000 | 26776.00000 |
| 3 | 5176.25000 | 12308.75000 | 19113.21429 | 1655.05952 | 8426.45238 | 1980.71429 |
| 4 | 25603.00000 | 43460.60000 | 61472.20000 | 2636.00000 | 29974.20000 | 2708.80000 |
| 5 | 8800.09278 | 3218.04811 | 4152.48797 | 2737.48110 | 1195.13402 | 1058.59450 |
Cluster Standard Deviations

| Cluster | Variable 1 | Variable 2 | Variable 3 | Variable 4 | Variable 5 | Variable 6 |
|---|---|---|---|---|---|---|
| 1 | 15493.47236 | 4856.00030 | 4209.33955 | 5031.80137 | 1320.21701 | 2378.04033 |
| 2 | 2920.35101 | 19209.26282 | 4627.30678 | 17207.44352 | 730.44130 | 29934.65847 |
| 3 | 5327.61431 | 6741.08773 | 7354.66199 | 1769.85611 | 4452.44466 | 2602.16641 |
| 4 | 14578.72606 | 25164.55689 | 21876.69411 | 3100.38570 | 9032.28303 | 2243.61855 |
| 5 | 6190.24468 | 2676.09417 | 3121.02561 | 3554.05652 | 1466.72710 | 1015.19034 |
The FASTCLUS Procedure
Replace=FULL Radius=0 Maxclusters=3 Maxiter=1000 Converge=0
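Initial Seeds

| Cluster | Variable 1 | Variable 2 | Variable 3 | Variable 4 | Variable 5 | Variable 6 |
|---|---|---|---|---|---|---|
| 1 | 112151.0000 | 29627.0000 | 18148.0000 | 16745.0000 | 4948.0000 | 8550.0000 |
| 2 | 680.0000 | 1610.0000 | 223.0000 | 862.0000 | 96.0000 | 379.0000 |
| 3 | 16117.0000 | 46197.0000 | 92780.0000 | 1026.0000 | 40827.0000 | 2944.0000 |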
Minimum Distance
Iteration History

| Iteration | Criterion | Rel. Change Seed 1 | Rel. Change Seed 2 | Rel. Change Seed 3 |
|---|---|---|---|---|
| 1 | 8812.7 | 0.2911 | 0.1232 | 0.3101 |
| 2 | 6419.3 | 0.1693 | 0.00386 | 0.0869 |
| 3 | 6167.3 | 0.1340 | 0.00687 | 0.0614 |
| 4 | 5894.0 | 0.0720 | 0.00683 | 0.0635 |
| 5 | 5742.2 | 0.0458 | 0.00595 | 0.0443 |
| 6 | 5664.8 | 0.0275 | 0.00408 | 0.0355 |
| 7 | 5609.2 | 0.0215 | 0.00405 | 0.0359 |
| 8 | 5571.1 | 0.00680 | 0.00210 | 0.0225 |
| 9 | 5560.8 | 0.00237 | 0.000908 | 0.0130 |
| 10 | 5555.0 | 0.00240 | 0.00194 | 0.0243 |
| 11 | 5540.3 | 0.00219 | 0.00134 | 0.0136 |
| 12 | 5535.6 | 0 | 0.000907 | 0.00827 |
| 13 | 5534.0 | 0 | 0.000890 | 0.00787 |
| 14 | 5531.9 | 0 | 0.000932 | 0.00769 |
| 15 | 5528.4 | 0 | 0.00178 | 0.0131 |
| 16 | 5520.3 | 0 | 0.00139 | 0.00903 |
| 17 | 5517.8 | 0.00219 | 0.000957 | 0.00568 |
| 18 | 5516.7 | 0.00240 | 0.000836 | 0.00302 |
| 19 | 5516.2 | 0 | 0 | 0 |
Convergence Status
Convergence criterion is satisfied.
Criterion
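Cluster Summary

| Cluster | Frequency | RMS Std Deviation | Maximum Distance from Seed to Observation | Radius Exceeded | Nearest Cluster | Distance Between Cluster Centroids |
|---|---|---|---|---|---|---|
| 1 | 60 | 8531.3 | 81552.7 | | 2 | 28176.4 |
| 2 | 330 | 3778.6 | 32828.5 | | 1 | 28176.4 |
| 3 | 50 | 9473.0 | 76767.9 | | 2 | 28765.0 |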
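Statistics for Variables

| Variable | Total STD | Within STD | R-Square | RSQ/(1-RSQ) |
|---|---|---|---|---|
| Variable 1 | 12647 | 8340 | 0.567123 | 1.310122 |
| Variable 2 | 7380 | 5769 | 0.391861 | 0.644362 |
| Variable 3 | 9503 | 6395 | 0.549164 | 1.218100 |
| Variable 4 | 4855 | 4640 | 0.090461 | 0.099458 |
| Variable 5 | 4768 | 3326 | 0.515664 | 1.064684 |
| Variable 6 | 2820 | 2738 | 0.061701 | 0.065758 |
| OVER-ALL | 7735 | 5535 | 0.490263 | 0.961797 |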
Pseudo F Statistic
Approximate Expected Over-All R-Squared
Cubic Clustering Criterion
WARNING: The two values above are invalid for correlated variables.
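Cluster Means

| Cluster | Variable 1 | Variable 2 | Variable 3 | Variable 4 | Variable 5 | Variable 6 |
|---|---|---|---|---|---|---|
| 1 | 35941.40000 | 6044.45000 | 6288.61667 | 6713.96667 | 1039.66667 | 3049.46667 |
| 2 | 8253.46970 | 3824.60303 | 5280.45455 | 2572.66061 | 1773.05758 | 1137.49697 |
| 3 | 8000.04000 | 18511.42000 | 27573.90000 | 1996.68000 | 12407.36000 | 2252.02000 |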
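Cluster Standard Deviations

| Cluster | Variable 1 | Variable 2 | Variable 3 | Variable 4 | Variable 5 | Variable 6 |
|---|---|---|---|---|---|---|
| 1 | 15234.89953 | 7055.55417 | 4629.03408 | 9555.16491 | 1302.21502 | 6355.49128 |
| 2 | 6194.18203 | 3191.95806 | 4370.72957 | 3404.70887 | 2185.47863 | 1280.03870 |
| 3 | 9124.63123 | 12977.91274 | 14515.78198 | 2069.22587 | 8033.07822 | 2686.83738 |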
I am not getting the above charts by running the SAS code that you mentioned in the post. Do I need to add anything else? Please let me know.
I use SAS University Edition, where these graphs are part of the default output.
You have to run PROC TREE on the output data from PROC CLUSTER to get the plot.
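Following up on the comments above, here is a minimal sketch of the PROC TREE route. The OUTTREE= option of PROC CLUSTER saves the cluster hierarchy to a dataset, and PROC TREE draws the dendrogram from it and can cut the tree at a chosen number of clusters. The dataset names tree and treeclus are assumptions made for illustration, and ODS GRAPHICS ON is included in case the criterion plots are not produced by default in a given installation.

```sas
/* Turn on ODS Graphics in case the plots are not produced by default */
ods graphics on;

/* Same hierarchical step as in the post, but saving the tree with OUTTREE= */
proc cluster data=clustmeans(drop=_crit_ cluster _RMSSTD_ _freq_ _radius_ _NEAR_ _GAP_)
    method=ward
    ccc
    pseudo
    outtree=tree;
run;

/* Draw the dendrogram and cut the tree into 5 clusters */
proc tree data=tree nclusters=5 out=treeclus;
run;
```

The OUT= dataset from PROC TREE lists which of the preliminary FASTCLUS centroids fall into each of the 5 final clusters.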