Introduction

In my previous post I described how to do cluster analysis using RapidMiner. A closer look at the modeling techniques available in RapidMiner shows how capable the software is for data mining. It is menu driven, so users are not required to write any code to run an analysis, although advanced users can write code in one of the operators to do more sophisticated work. RapidMiner's capabilities are worth exploring, and I shall give more demonstrations of it in future posts. R and SAS are also very capable tools for analyzing data. I consider R superior to RapidMiner, mainly because of the active community supporting R in many ways, and in graphics capabilities in particular R is definitely the winner. In this post, I shall show how clustering can be done using SAS.

Cluster Analysis using SAS

I am using SAS University Edition, which is free software for educational purposes with limited features. The good part is that it comes with the Base SAS and SAS/STAT modules, which are essential for statistical analysis. It is also possible to call R from within SAS using the SAS/IML module, but I am not sure whether this works in the University Edition. Hence, I shall use the built-in functionality of SAS/STAT for cluster analysis.

SAS can do cluster analysis using three different procedures: PROC CLUSTER, PROC FASTCLUS and PROC VARCLUS. PROC CLUSTER does hierarchical clustering, PROC FASTCLUS does K-Means clustering, and PROC VARCLUS is a special type of clustering that groups variables rather than observations, by default using Principal Component Analysis (PCA). It is argued that the VARCLUS algorithm is often better than simple PCA for dimension reduction because each dimension remains interpretable. I am not going to deal with variable clustering for now; PROC CLUSTER and PROC FASTCLUS are used for this demonstration.

For this demonstration, I am using the same wholesale data that I used in my previous post. SAS University Edition doesn't have a dedicated procedure for optimizing the number of clusters, so deciding on the optimum number rests with the analyst. Hierarchical clustering can suggest the possible number of clusters based on certain criteria, but once the number of cases goes beyond about 200, the output becomes cumbersome. Hence, a two-step method can be adopted. In the first step, generate (say) 100 clusters using K-Means clustering; in the second step, run a hierarchical clustering on the output dataset of cluster centroids to decide the optimum number of clusters. This approach, even though it sounds logical, is different from approaches based on criteria such as AIC and BIC, so its results are also going to vary from theirs. Moreover, K-Means clustering depends on the initial seed from which pseudo-random numbers are generated. SAS has a mechanism to fix this seed so that the output does not change when the algorithm is run multiple times. If the seed is not fixed in R or RapidMiner, however, simple K-Means clustering will produce different results each time the algorithm is run. Readers can verify this by manually setting different seeds in simple K-Means clustering in RapidMiner.
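As a minimal sketch of fixing the seed in SAS: to my understanding, PROC FASTCLUS picks its initial seeds from the data by a deterministic rule by default, and pseudo-random seed selection comes into play only when REPLACE=RANDOM is requested, in which case RANDOM= fixes the starting value. The seed value 12345 and the output dataset name seeded are illustrative choices of mine; the input dataset is the one used later in this post.

```sas
/* Sketch: K-Means with randomly chosen initial seeds that are reproducible.
   12345 and "seeded" are arbitrary illustrative choices. */
proc fastclus data=subhasis.wholesale(drop=region channel)
  maxc=5
  replace=random   /* choose the initial seeds pseudo-randomly from the data */
  random=12345     /* fixed starting value, so reruns pick the same seeds */
  out=seeded;
run;
```

Changing the RANDOM= value and rerunning is a simple way to see how much a K-Means solution depends on its starting seeds.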

To run the cluster analysis in SAS, I shall first run PROC FASTCLUS to generate 30 clusters using the code given below. The OUTSEED= option captures the centroid values of the individual clusters, and the same centroid values are then used in the subsequent hierarchical clustering (PROC CLUSTER). The MAXC= option specifies 30 clusters. The CCC and PSEUDO options are specified so that, apart from the dendrogram, a few more important graphs are generated for identifying the proper number of clusters.

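/* Step 1: K-Means with 30 clusters; REGION and CHANNEL are excluded */
proc fastclus data=subhasis.wholesale(drop=region channel)
  outseed=clustmeans   /* centroid (mean) values of each cluster */
  out=newdata          /* input data plus cluster membership */
  maxc=30;
run;

/* Step 2: Ward's hierarchical clustering of the 30 centroids.
   The summary columns are dropped so that only the six spending
   variables are clustered. */
proc cluster data=clustmeans(drop=_crit_ cluster _RMSSTD_ _freq_ _radius_ _NEAR_ _GAP_)
  ccc
  method=ward
  pseudo;
run;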

PROC CLUSTER gives a detailed output with a dendrogram and three different graphs. The graphs are shown below.

The Pseudo-F graph shows a peak at 3 clusters, after which its value remains almost constant. The Pseudo-T-square graph suggests two possible numbers of clusters, 3 and 5 (moving from right to left, there are sudden jumps in value after 3 and 5 clusters). The dendrogram also suggests 3 and 5 clusters. Hence, I can now run the cluster analysis using PROC FASTCLUS with the MAXC= option set to 3 and 5 respectively. The outputs are shown below: the first is for MAXC=5 and the second for MAXC=3. Since the CCC value of the 3-cluster solution (1.217) is below 2, while that of the 5-cluster solution (4.249) is above it, I accept the 5-cluster solution as more appropriate. The centroid values of each cluster are given along with the number of cases in each cluster (as frequency). Looking at these values, the clusters can be identified by their specific characteristics. Keep in mind that cluster analysis is affected by the variability within each variable, so a better option is to standardize the variables using PROC STDIZE before running the same analysis. In my next post I shall show how to run cluster analysis using R.
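The two final runs and the suggested standardization might look like the sketch below. The MAXITER= and CONVERGE= values are taken from the option line printed at the top of the outputs that follow. The dataset names clus5, clus3, wholesale_std and clus5_std are hypothetical, and METHOD=STD (zero mean, unit standard deviation) is just one of the standardization choices PROC STDIZE offers. The standardized run would not reproduce the outputs shown in this post, which are based on the raw values.

```sas
/* Final K-Means runs on the raw data, matching the outputs in this post */
proc fastclus data=subhasis.wholesale(drop=region channel)
  out=clus5 maxc=5 maxiter=1000 converge=0;
run;

proc fastclus data=subhasis.wholesale(drop=region channel)
  out=clus3 maxc=3 maxiter=1000 converge=0;
run;

/* Suggested refinement: standardize every numeric variable first.
   wholesale_std and clus5_std are hypothetical dataset names. */
proc stdize data=subhasis.wholesale(drop=region channel)
  method=std out=wholesale_std;
run;

/* The suitable number of clusters may differ after standardization;
   5 is reused here only for illustration. */
proc fastclus data=wholesale_std
  out=clus5_std maxc=5 maxiter=1000 converge=0;
run;
```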

Results: Cluster Analysis.sas

The SAS System

The FASTCLUS Procedure

Replace=FULL Radius=0 Maxclusters=5 Maxiter=1000 Converge=0

Initial Seeds
Cluster	Fresh	Milk	Grocery	Frozen	Detergents_Paper	Delicassen
1	112151.0000	29627.0000	18148.0000	16745.0000	4948.0000	8550.0000
2	32717.0000	16784.0000	13626.0000	60869.0000	1272.0000	5609.0000
3	8565.0000	4980.0000	67298.0000	131.0000	38102.0000	1215.0000
4	22925.0000	73498.0000	32114.0000	987.0000	20070.0000	903.0000
5	190.0000	727.0000	2012.0000	245.0000	184.0000	127.0000

Minimum Distance

Minimum Distance Between Initial Seeds =	71813.81

Iteration History (columns 1 to 5: relative change in cluster seeds)
Iteration	Criterion	1	2	3	4	5
1	8376.5	0.4524	0.3885	0.3772	0.3175	0.1843
2	6010.8	0.2631	0.2443	0.1317	0	0.00489
3	5733.4	0.1768	0	0.0805	0	0.00867
4	5487.4	0.1057	0	0.1180	0.1083	0.00901
5	5253.4	0.0796	0	0.1429	0.1609	0.0103
6	4956.7	0.0437	0	0.0663	0.1385	0.0110
7	4829.0	0.0343	0	0.0350	0	0.00789
8	4780.8	0.0179	0	0.0168	0	0.00366
9	4769.4	0.0112	0	0.0142	0	0.00402
10	4760.7	0.00359	0	0.0188	0.1509	0.00189
11	4729.1	0	0	0.0114	0	0.00304
12	4724.5	0	0	0.00701	0	0.00202
13	4722.6	0	0	0.00522	0	0.00153
14	4721.9	0	0	0.00186	0	0.000540
15	4721.8	0	0	0	0	0

Convergence Status

Convergence criterion is satisfied.

Criterion

Criterion Based on Final Seeds =	4721.8

Cluster Summary
Cluster	Frequency	RMS Std Deviation	Maximum Distance from Seed to Observation	Radius Exceeded	Nearest Cluster	Distance Between Cluster Centroids
1	58	7234.9	82064.3		5	27453.6
2	2	16287.3	28210.4		1	56957.4
3	84	5125.8	35772.7		5	19336.8
4	5	15387.3	43353.9		3	60396.8
5	291	3440.8	32497.8		3	19336.8

Statistics for Variables
Variable	Total STD	Within STD	R-Square	RSQ/(1-RSQ)
Fresh	12647	8024	0.601115	1.506989
Milk	7380	4817	0.577847	1.368810
Grocery	9503	4857	0.741208	2.864102
Frozen	4855	3620	0.449000	0.814883
Detergents_Paper	4768	2489	0.729897	2.702298
Delicassen	2820	2197	0.398673	0.662990
OVER-ALL	7735	4749	0.626511	1.677456

Pseudo F Statistic

Pseudo F Statistic =	182.42

Approximate Expected Over-All R-Squared

Approximate Expected Over-All R-Squared =	0.59202

Cubic Clustering Criterion

Cubic Clustering Criterion =	4.249

WARNING: The two values above are invalid for correlated variables.

Cluster Means
Cluster	Fresh	Milk	Grocery	Frozen	Detergents_Paper	Delicassen
1	35981.37931	5205.74138	5922.77586	5266.12069	1049.46552	2231.31034
2	34782.00000	30367.00000	16898.00000	48701.50000	755.50000	26776.00000
3	5176.25000	12308.75000	19113.21429	1655.05952	8426.45238	1980.71429
4	25603.00000	43460.60000	61472.20000	2636.00000	29974.20000	2708.80000
5	8800.09278	3218.04811	4152.48797	2737.48110	1195.13402	1058.59450

Cluster Standard Deviations
Cluster	Fresh	Milk	Grocery	Frozen	Detergents_Paper	Delicassen
1	15493.47236	4856.00030	4209.33955	5031.80137	1320.21701	2378.04033
2	2920.35101	19209.26282	4627.30678	17207.44352	730.44130	29934.65847
3	5327.61431	6741.08773	7354.66199	1769.85611	4452.44466	2602.16641
4	14578.72606	25164.55689	21876.69411	3100.38570	9032.28303	2243.61855
5	6190.24468	2676.09417	3121.02561	3554.05652	1466.72710	1015.19034

Results: Cluster Analysis.sas

The SAS System

The FASTCLUS Procedure

Replace=FULL Radius=0 Maxclusters=3 Maxiter=1000 Converge=0

Initial Seeds
Cluster	Fresh	Milk	Grocery	Frozen	Detergents_Paper	Delicassen
1	112151.0000	29627.0000	18148.0000	16745.0000	4948.0000	8550.0000
2	680.0000	1610.0000	223.0000	862.0000	96.0000	379.0000
3	16117.0000	46197.0000	92780.0000	1026.0000	40827.0000	2944.0000

Minimum Distance

Minimum Distance Between Initial Seeds =	111618.6

Iteration History (columns 1 to 3: relative change in cluster seeds)
Iteration	Criterion	1	2	3
1	8812.7	0.2911	0.1232	0.3101
2	6419.3	0.1693	0.00386	0.0869
3	6167.3	0.1340	0.00687	0.0614
4	5894.0	0.0720	0.00683	0.0635
5	5742.2	0.0458	0.00595	0.0443
6	5664.8	0.0275	0.00408	0.0355
7	5609.2	0.0215	0.00405	0.0359
8	5571.1	0.00680	0.00210	0.0225
9	5560.8	0.00237	0.000908	0.0130
10	5555.0	0.00240	0.00194	0.0243
11	5540.3	0.00219	0.00134	0.0136
12	5535.6	0	0.000907	0.00827
13	5534.0	0	0.000890	0.00787
14	5531.9	0	0.000932	0.00769
15	5528.4	0	0.00178	0.0131
16	5520.3	0	0.00139	0.00903
17	5517.8	0.00219	0.000957	0.00568
18	5516.7	0.00240	0.000836	0.00302
19	5516.2	0	0	0

Convergence Status

Convergence criterion is satisfied.

Criterion

Criterion Based on Final Seeds =	5516.2

Cluster Summary
Cluster	Frequency	RMS Std Deviation	Maximum Distance from Seed to Observation	Radius Exceeded	Nearest Cluster	Distance Between Cluster Centroids
1	60	8531.3	81552.7		2	28176.4
2	330	3778.6	32828.5		1	28176.4
3	50	9473.0	76767.9		2	28765.0

Statistics for Variables
Variable	Total STD	Within STD	R-Square	RSQ/(1-RSQ)
Fresh	12647	8340	0.567123	1.310122
Milk	7380	5769	0.391861	0.644362
Grocery	9503	6395	0.549164	1.218100
Frozen	4855	4640	0.090461	0.099458
Detergents_Paper	4768	3326	0.515664	1.064684
Delicassen	2820	2738	0.061701	0.065758
OVER-ALL	7735	5535	0.490263	0.961797

Pseudo F Statistic

Pseudo F Statistic =	210.15

Approximate Expected Over-All R-Squared

Approximate Expected Over-All R-Squared =	0.47789

Cubic Clustering Criterion

Cubic Clustering Criterion =	1.217

WARNING: The two values above are invalid for correlated variables.

Cluster Means
Cluster	Fresh	Milk	Grocery	Frozen	Detergents_Paper	Delicassen
1	35941.40000	6044.45000	6288.61667	6713.96667	1039.66667	3049.46667
2	8253.46970	3824.60303	5280.45455	2572.66061	1773.05758	1137.49697
3	8000.04000	18511.42000	27573.90000	1996.68000	12407.36000	2252.02000

Cluster Standard Deviations
Cluster	Fresh	Milk	Grocery	Frozen	Detergents_Paper	Delicassen
1	15234.89953	7055.55417	4629.03408	9555.16491	1302.21502	6355.49128
2	6194.18203	3191.95806	4370.72957	3404.70887	2185.47863	1280.03870
3	9124.63123	12977.91274	14515.78198	2069.22587	8033.07822	2686.83738
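
As a quick check on the two outputs above, the Pseudo F statistic can be reproduced from the OVER-ALL R-Square, the number of clusters c and the number of cases n. The formula below is the usual pseudo F definition and is an addition here, not part of the SAS output; n = 440 is obtained by summing the Frequency column of either Cluster Summary table.

```latex
\[
F_{\text{pseudo}} = \frac{R^2/(c-1)}{(1-R^2)/(n-c)}
\]
\[
c=5:\quad \frac{0.626511/4}{0.373489/435} \approx 182.42,
\qquad
c=3:\quad \frac{0.490263/2}{0.509737/437} \approx 210.15
\]
```

Both values agree with the Pseudo F Statistic lines printed in the outputs.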

Sunday, 28 September 2014

Cluster Analysis with SAS
