Tuesday, 28 October 2014

Factor Analysis

Introduction

Factor analysis is widely used in behavioral research and has many applications in psychology and consumer research. It is also a dimension reduction tool that combines two or more correlated variables into a single variable. These newly derived variables are called factors or components. Essentially, they are unobserved, or latent, variables. Some examples of latent variables are Satisfaction, Value for Money and Depression. Such variables cannot (and should not) be measured with a single-item rating scale (e.g. a scale of 1 to 10), because each person applies different evaluation criteria in arriving at a final score. To ensure that every respondent thinks along the same lines while arriving at that score, they are asked multiple questions (also called items) that everybody understands in the same way.

Factor analysis assumes that some underlying factors exist in the dataset and extracts them by analyzing the correlation or covariance matrix. Several methods are available for this. For exploratory studies, principal component analysis (PCA) is used most frequently. PCA is very closely related to factor analysis, even though the two are not the same conceptually.
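
To make the idea of a latent variable concrete, here is a small base R simulation. It is only an illustrative sketch: the data are simulated, and the variable names, loadings (0.8, 0.7, 0.9) and noise levels are assumptions of mine, not values from any survey. One unobserved score drives three observed items, and that shared source is what makes the items correlate.

```r
# Simulated illustration (not survey data): one latent variable
# drives three observed items, so the items end up correlated.
set.seed(1)
n <- 1000
satisfaction <- rnorm(n)   # the unobserved (latent) variable

# Each item = assumed loading * latent score + item-specific noise
item1 <- 0.8 * satisfaction + rnorm(n, sd = 0.6)
item2 <- 0.7 * satisfaction + rnorm(n, sd = 0.7)
item3 <- 0.9 * satisfaction + rnorm(n, sd = 0.4)
items <- data.frame(item1, item2, item3)

# The pairwise correlations are high because of the shared latent variable
item_cor <- cor(items)
print(round(item_cor, 2))
```

Factor analysis works in the opposite direction: it starts from a correlation matrix like this one and tries to recover the latent variable behind it.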

Factor Analysis with R

In this demonstration I shall use the Turkiye Student Evaluation dataset. It contains 28 variables (Q1 to Q28) that are scored on a Likert scale; the descriptions of the variables are available with the dataset. A closer look reveals that the questions cover two different aspects. I will show that factor analysis can extract these two aspects as well.

CRAN has a package called 'psych', which is very comprehensive for statistical analysis in psychology. Since factor analysis is a tool in this domain, the package provides many functions related to it. Hence, I am going to use this library for the analysis.

Before running an exploratory factor analysis, it is a good idea to run two related tests. The first is the Kaiser-Meyer-Olkin (KMO) test of sampling adequacy and the second is Bartlett's test of sphericity. Bartlett's test checks the hypothesis that the correlation matrix is essentially an identity matrix, i.e. that there are no correlations among the variables. Factor analysis is advisable if the KMO value is above 0.5 (or above 0.6) and the p-value of Bartlett's test is less than 0.05. However, one thing should be kept in mind about Bartlett's test: it is very sensitive to non-normality, and with non-normal data it tends to give significant results even when there is no significant correlation. Hence, if the number of variables is small, a visual inspection of the correlation matrix and the associated p-value matrix can be helpful. The 'psych' package has the corr.test() function, which calculates the correlation matrix along with the p-value of each correlation coefficient. To get a proper Bartlett test of sphericity in R, I shall use the cortest.bartlett() function from the same package.

The entire dataset is imported into R as a data frame named 'studentSurvey'; the 28 items are in columns 6 to 33. The KMO and Bartlett tests are run with the code given below (the outputs are also shown).

Subhasis >KMO(studentSurvey[,c(6:33)])
Kaiser-Meyer-Olkin factor adequacy
Call: KMO(r = studentSurvey[, c(6:33)])
Overall MSA =  0.99
MSA for each item = 
  Q1   Q2   Q3   Q4   Q5   Q6   Q7   Q8   Q9  Q10  Q11  Q12  Q13  Q14  Q15  Q16  Q17  Q18  Q19  Q20  Q21  Q22  Q23  Q24 
0.98 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.98 0.99 0.99 0.98 0.99 0.99 0.99 0.99 0.99 0.99 0.98 0.98 0.99 0.99 
 Q25  Q26  Q27  Q28 
0.99 0.99 0.99 0.99 
Subhasis >cortest.bartlett(cor(studentSurvey[,c(6:33)]),n=5820)
$chisq
[1] 297891.4

$p.value
[1] 0

$df
[1] 378
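
For readers who want to see what these two statistics measure, both can be computed by hand in base R. This is a minimal sketch on simulated stand-in data (six items driven by one common factor, with loadings I have assumed), not the survey data. It uses the textbook formulas and is not the code inside the 'psych' functions.

```r
# Simulated stand-in data: six items driven by one common factor
set.seed(42)
n <- 500
f <- rnorm(n)
x <- sapply(1:6, function(j) 0.7 * f + rnorm(n, sd = 0.7))
R <- cor(x)
p <- ncol(R)

# Bartlett's test of sphericity: H0 says R is an identity matrix
chisq   <- -(n - 1 - (2 * p + 5) / 6) * log(det(R))
df      <- p * (p - 1) / 2
p_value <- pchisq(chisq, df, lower.tail = FALSE)

# KMO: squared correlations relative to squared correlations plus
# squared partial correlations (off-diagonal elements only)
R_inv   <- solve(R)
partial <- -R_inv / sqrt(outer(diag(R_inv), diag(R_inv)))
off     <- row(R) != col(R)
kmo     <- sum(R[off]^2) / (sum(R[off]^2) + sum(partial[off]^2))

print(c(chisq = chisq, df = df, p.value = p_value, KMO = kmo))
```

The degrees of freedom are p(p-1)/2, which for the 28 survey items gives 28 * 27 / 2 = 378, the value reported in the output above.
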
In the Bartlett test, the correlation matrix was supplied along with the sample size of 5820 (the total number of respondents). Both the KMO and Bartlett tests suggest that factor analysis is applicable to this dataset, so I can proceed with it.

R can run factor analysis with different methods such as "Maximum Likelihood", "Minimum Residual", "Principal Axis", "Weighted Least Square" and "Generalized Weighted Least Square". For the current study, I shall use the most frequently used procedure, the "Principal Axis" method. The principal axis method is very similar to principal component analysis, except that the diagonal of the correlation matrix is replaced by prior communality estimates, typically the squared multiple correlation of each variable with all the others.

A common problem in factor analysis is deciding the number of factors to retain, and there are different ways of addressing it. The number of factors can be chosen using a minimum eigenvalue criterion, a minimum variance explained criterion, a scree plot or parallel analysis. A scree plot is particularly helpful when a distinct elbow can be seen in it. If the plot is quite smooth, parallel analysis can be helpful. In this analysis I have run parallel analysis as shown in the code below.
Subhasis >fa.parallel(studentSurvey[,c(6:33)],fm="pa")
Parallel analysis suggests that the number of factors =  5  and the number of components =  2
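
The idea behind parallel analysis can be sketched in a few lines of base R. This is a simplified version for components only, run on simulated two-factor data whose loadings I have assumed; it is not the algorithm used inside fa.parallel(). The eigenvalues of the observed correlation matrix are compared with the eigenvalues obtained from random data of the same shape, and only the leading components that beat the random ones are kept.

```r
# Simulated data: two independent factors, four items each
set.seed(7)
n  <- 500
f1 <- rnorm(n)
f2 <- rnorm(n)
x  <- cbind(sapply(1:4, function(j) 0.8 * f1 + rnorm(n, sd = 0.6)),
            sapply(1:4, function(j) 0.8 * f2 + rnorm(n, sd = 0.6)))

# Eigenvalues of the observed correlation matrix
observed <- eigen(cor(x), symmetric = TRUE, only.values = TRUE)$values

# Eigenvalues from random normal data of the same dimensions
n_iter <- 50
random <- replicate(n_iter, {
  noise <- matrix(rnorm(n * ncol(x)), nrow = n)
  eigen(cor(noise), symmetric = TRUE, only.values = TRUE)$values
})
# 95th percentile of the random eigenvalues at each position
threshold <- apply(random, 1, quantile, probs = 0.95)

# Keep the leading components whose eigenvalue exceeds the threshold
exceeds <- observed > threshold
n_components <- if (all(exceeds)) length(exceeds) else which(!exceeds)[1] - 1
print(n_components)
```

Plotting `observed` and `threshold` against the component number gives the kind of scree plot with a random-data reference line that parallel analysis is usually read from.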

According to the parallel analysis, there are 5 factors but only 2 components. The fa.parallel() function also draws a scree plot by default. The scree plot shows a sharp elbow at n=2 (i.e. 2 factors). Hence, factor analysis can be run twice (with n=2 and n=5 respectively) to get a better picture. The code and output for the two-factor model are given below.
Subhasis >fit1=fa(studentSurvey[,c(6:33)],fm="pa",nfactors=2,rotate="varimax")
Subhasis >fit1
Factor Analysis using method =  pa
Call: fa(r = studentSurvey[, c(6:33)], nfactors = 2, rotate = "varimax", 
    fm = "pa")
Standardized loadings (pattern matrix) based upon correlation matrix
     PA1  PA2   h2    u2 com
Q1  0.36 0.80 0.76 0.238 1.4
Q2  0.47 0.79 0.85 0.152 1.6
Q3  0.55 0.70 0.80 0.200 1.9
Q4  0.46 0.79 0.83 0.171 1.6
Q5  0.50 0.80 0.88 0.119 1.7
Q6  0.49 0.77 0.84 0.157 1.7
Q7  0.46 0.82 0.88 0.121 1.6
Q8  0.45 0.81 0.86 0.135 1.6
Q9  0.54 0.70 0.78 0.215 1.9
Q10 0.52 0.79 0.90 0.101 1.7
Q11 0.56 0.69 0.78 0.217 1.9
Q12 0.48 0.75 0.80 0.200 1.7
Q13 0.75 0.56 0.88 0.122 1.9
Q14 0.79 0.52 0.90 0.101 1.7
Q15 0.79 0.52 0.89 0.108 1.7
Q16 0.70 0.61 0.87 0.130 2.0
Q17 0.82 0.41 0.83 0.166 1.5
Q18 0.76 0.54 0.87 0.127 1.8
Q19 0.79 0.52 0.89 0.108 1.7
Q20 0.82 0.49 0.90 0.096 1.6
Q21 0.83 0.46 0.91 0.092 1.6
Q22 0.84 0.46 0.91 0.089 1.6
Q23 0.75 0.57 0.89 0.111 1.9
Q24 0.71 0.60 0.86 0.141 1.9
Q25 0.82 0.47 0.90 0.102 1.6
Q26 0.75 0.55 0.86 0.142 1.8
Q27 0.69 0.57 0.81 0.194 1.9
Q28 0.81 0.46 0.86 0.137 1.6

                        PA1   PA2
SS loadings           12.54 11.47
Proportion Var         0.45  0.41
Cumulative Var         0.45  0.86
Proportion Explained   0.52  0.48
Cumulative Proportion  0.52  1.00

Mean item complexity =  1.7
Test of the hypothesis that 2 factors are sufficient.

The degrees of freedom for the null model are  378  and the objective function was  51.28 with Chi Square of  297891.4
The degrees of freedom for the model are 323  and the objective function was  2.99 

The root mean square of the residuals (RMSR) is  0.01 
The df corrected root mean square of the residuals is  0.02 

The harmonic number of observations is  5820 with the empirical chi square  946.46  with prob <  1.7e-62 
The total number of observations was  5820  with MLE Chi Square =  17339.96  with prob <  0 

Tucker Lewis Index of factoring reliability =  0.933
RMSEA index =  0.095  and the 90 % confidence intervals are  0.094 0.096
BIC =  14539.85
Fit based upon off diagonal values = 1
Measures of factor score adequacy             
                                                PA1  PA2
Correlation of scores with factors             0.97 0.97
Multiple R square of scores with factors       0.95 0.94
Minimum correlation of possible factor scores  0.90 0.87
As per the output shown above, these 2 factors explain 86% of the total variance cumulatively. For almost all statistical purposes, a factor model in which two factors explain 86% of the total variance can be considered a very good model. Based on the factor loadings, Q1 to Q12 point towards factor 2 (PA2) and Q13 to Q28 point towards factor 1 (PA1). With an orthogonal rotation such as varimax, a factor loading is simply the correlation of a question (or item) with the respective factor. If 'nfactors' is set to 5 in the above analysis, a five-factor model is produced. However, the loadings of all the items on the 3rd, 4th and 5th factors turn out to be very low (readers can check this for themselves). One more thing to look at in factor analysis is communality. The communality of an item is the proportion of its variance that is explained by the extracted factors. If the communality of any variable is low (less than 0.4), that variable can be dropped from the analysis. The communality scores are given below.
Subhasis >fit1$communality
       Q1        Q2        Q3        Q4        Q5        Q6        Q7        Q8 
0.7623041 0.8476720 0.8002198 0.8289699 0.8805149 0.8426552 0.8794662 0.8649415 
       Q9       Q10       Q11       Q12       Q13       Q14       Q15       Q16 
0.7849237 0.8987801 0.7828362 0.8003019 0.8780396 0.8987789 0.8923352 0.8700809 
      Q17       Q18       Q19       Q20       Q21       Q22       Q23       Q24 
0.8337139 0.8732333 0.8918975 0.9040908 0.9081967 0.9105020 0.8891574 0.8591331 
      Q25       Q26       Q27       Q28 
0.8983235 0.8575586 0.8059180 0.8634799
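
The link between loadings, communality and variance explained is plain arithmetic, which the short base R sketch below checks. It uses only rounded numbers copied from the two-factor output above (three example items and the two "SS loadings" values), so the results agree with the reported figures only up to rounding.

```r
# Rounded loadings copied from the two-factor output above
loadings <- rbind(Q1  = c(PA1 = 0.36, PA2 = 0.80),
                  Q13 = c(PA1 = 0.75, PA2 = 0.56),
                  Q22 = c(PA1 = 0.84, PA2 = 0.46))

# Communality (h2) = sum of squared loadings across the factors
h2 <- rowSums(loadings^2)
# Uniqueness (u2) = 1 - communality
u2 <- 1 - h2
print(round(cbind(h2, u2), 2))

# Proportion of variance per factor = SS loadings / number of items
ss_loadings <- c(PA1 = 12.54, PA2 = 11.47)
prop_var <- ss_loadings / 28
print(round(prop_var, 2))
print(round(sum(prop_var), 2))   # cumulative variance explained
```

For Q1, for example, 0.36^2 + 0.80^2 = 0.7696, which is the reported communality of 0.76 up to the rounding of the loadings.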

Since the communality score of every variable is above 0.75, all the variables are quite important in this analysis. The next step is to identify the factors properly. This is often tricky, because exploratory factor analysis can combine variables without any meaningful logical explanation. In this case, however, identifying the factors is comparatively easy. The first factor concerns the instructor and the second concerns the course content. Hence, I can call the first factor "Instructor Evaluation" and the second factor "Course Evaluation".

SAS has PROC FACTOR, which does the same kind of analysis and produces report-ready output. The code is given below.
proc factor data=subhasis.turkiyestudentdata 
   method=principal
   nfactors=5
   plots=scree
   rotate=varimax;
var Q1--Q28;
run;
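This code extracts 5 factors so that readers can compare the two-factor model with the five-factor model. Note that method=principal without a priors= option performs a principal component extraction; adding priors=smc gives the principal factor method, which is closer to the R analysis above. The output is not shown in this post; readers are encouraged to run the code to see the results. The scree plot is also part of the output.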

That is factor analysis in a very compact form.

Comments are welcome.
