Introduction
Regression analysis is not new, but it is still one of the most frequently used data mining tools in applied statistics. Various theories and concepts are available in the literature on how to use this modeling technique in predictive analytics. This blog post will try to cover various aspects of regression analysis without going into the technicalities of the tool. I shall demonstrate regression analysis on a dataset found here (a description of the data can be seen here) and will try to explain the steps involved in developing models. Both R and SAS University Edition will be used for the analysis (though no comparative study will be made between R and SAS), and the outputs will be analyzed systematically. Regression analysis is based on statistical theory and is a parametric method. Hence, before moving ahead, it is important that we take a closer look at the assumptions on which this modeling technique is based.
Assumptions
There are 4 critical assumptions on which this modeling technique is based. These assumptions are:
Relationship between dependent and independent variables is linear: This is an important assumption (and quite a restrictive one). However, it is to be understood that linearity in regression analysis essentially means linearity with respect to the parameters (β) and not the variables. To make this point clear, both of the equations shown below are linear.
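For illustration (in generic notation; the exact equations in the original figures may differ), two such equations, both linear in the parameters, are:

```latex
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon
Y = \beta_0 + \beta_1 X_1^2 + \beta_2 X_2^2 + \varepsilon
```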
The squared terms can be considered as new variables after the square transformations, but the equation still remains linear in the parameters. This is not the case for the equation shown below.
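A common form consistent with the description that follows (a reconstruction on my part, with Y denoting the linear combination above) is the logistic curve:

```latex
P = \frac{e^{Y}}{1 + e^{Y}}
```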
P stands for probability and Y is as shown above. This is a non-linear equation with respect to the parameters (β) because there is no way P can be expressed as a linear combination of β and X. However, log(P/(1-P)) is linear, and P/(1-P) is called the odds, which is used in logistic regression and will be discussed in other posts. Typically, non-linearity is detected using standardized residual plots, which are available in almost all statistical packages.
Independent variables must be uncorrelated: This is another important assumption in linear regression. Correlated variables tend to inflate the R-sq value and can render a model potentially unfit for deployment. This phenomenon is called multicollinearity. Hence, care must be taken while choosing variables. One way of detecting multicollinearity is inspecting the correlation matrix, but with a large number of independent variables this method becomes somewhat cumbersome (and annoying!). A better method is examining the Variance Inflation Factor (VIF) of each IV: the higher the VIF, the higher the multicollinearity. Most statistical packages provide this option for diagnosing multicollinearity.
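As a minimal sketch of this diagnostic in R (using the built-in mtcars data purely for illustration; vif() from the car package is one common implementation):

```r
library(car)                                    # provides vif()

fit <- lm(mpg ~ disp + hp + wt, data = mtcars)  # illustrative model on built-in data
vif(fit)                                        # one VIF per predictor; values above 5-10 flag multicollinearity
```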
Homoscedasticity: Homoscedasticity means equal variance of the errors across all levels of the independent variables. If this assumption is not met, the situation becomes heteroscedastic. A slight deviation from homoscedasticity does not influence the outcome much, but as heteroscedasticity increases the model becomes weaker, since the chance of committing a Type I error increases, distorting the actual findings. In SAS, PROC REG provides the HCCMETHOD= option for heteroscedasticity-consistent standard errors, while R has the Breusch-Pagan test in the "lmtest" package, which tests for homoscedasticity.
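A minimal sketch of the Breusch-Pagan test in R, again on an illustrative model:

```r
library(lmtest)                                 # provides bptest()

fit <- lm(mpg ~ disp + hp + wt, data = mtcars)  # illustrative model on built-in data
bptest(fit)                                     # H0: errors are homoscedastic; a small p-value suggests heteroscedasticity
```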
Errors are normally distributed: In regression analysis (linear regression) it is assumed that both the dependent variable (DV) and the independent variables (IVs) are numeric in nature and normally distributed. However, the analysis can also be done when the data are not normally distributed, provided the residuals (or error terms) are normally distributed. If the data are highly skewed (or high in kurtosis), suitable transformations are required before feeding the data into the model. A dataset may contain outliers, and they can be identified through various methods; however, removal of outliers is not always a good option. Generally, normality of residuals is assessed through P-P or Q-Q plots. Both R and SAS are capable of testing normality of variables separately, and I will show how it is done.
Modeling with R
Data is imported into the R environment using the read.table() function. This is not an easy job, as the dataset contains delimiters within the data values. As a workaround, those delimiters are replaced by "-" and then read.table() is used (a sketch of this step follows the variable list below). The variables are given the following names:
Name, Miles, Speed, Hours, Population, TotOpCost, RevPerTonMile, LoadFactor, Capacity, Asset, Investment, AdjAsset.
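A sketch of this import step (the file name is a placeholder; the actual script is linked at the end of the post):

```r
# Read the cleaned file (delimiters inside values already replaced by "-")
airline <- read.table("airline_data.txt", header = FALSE, stringsAsFactors = FALSE)

names(airline) <- c("Name", "Miles", "Speed", "Hours", "Population", "TotOpCost",
                    "RevPerTonMile", "LoadFactor", "Capacity", "Asset",
                    "Investment", "AdjAsset")
```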
A new variable is created by taking the ratio of TotOpCost to RevPerTonMile, and it will act as the dependent variable. As suggested in the description of the data (link given above), this new DV will be regressed on 7 independent variables. But before going ahead, it is a good idea to look at the scatter plots of the numeric data. (Click on the images to get a larger view.)
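A sketch of how the new DV and the scatter plot matrix could be produced (the name CostPerRTM is a placeholder of my own):

```r
# Dependent variable: total operating cost per revenue ton-mile
airline$CostPerRTM <- airline$TotOpCost / airline$RevPerTonMile

# Scatter plot matrix of the numeric columns
pairs(airline[, sapply(airline, is.numeric)])
```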
It can be seen that, due to one data point, the graphs are very skewed. This data point is an outlier. An outlier can be a legitimate point, and hence deleting it without studying it is not a good option. In regression analysis there is a concept of a leverage point attached to each data point, and a leverage point can be either good or bad. If the outlier is a bad leverage point, the analysis can be done after removing that point. We will see more about leverage points at a later stage of the analysis. But for the time being, we will see how the variables are distributed using Q-Q plots.
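A sketch of how such Q-Q plots can be drawn in R (the panel layout is an assumption):

```r
num_vars <- airline[, sapply(airline, is.numeric)]

par(mfrow = c(3, 4))                # one panel per numeric variable
for (v in names(num_vars)) {
  qqnorm(num_vars[[v]], main = v)   # sample quantiles vs. theoretical normal quantiles
  qqline(num_vars[[v]])             # reference line through the quartiles
}
par(mfrow = c(1, 1))
```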
As can be seen, not a single variable is normally distributed. Since the data are very skewed, a log transformation of the data can be a good option. The figure below shows the Q-Q plots after log transformation.
The data are still not normal even after the log transformation. However, there are improvements in the distributions of all the variables except LoadFactor (its distribution got more skewed). Hence, for building the model, all variables except LoadFactor are log transformed. The linear model which is built is as shown below.
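A sketch of the corresponding lm() call, assuming the seven IVs are Miles, Speed, Hours, Population, Capacity and AdjAsset (log transformed) plus LoadFactor (left untransformed):

```r
fit1 <- lm(log(CostPerRTM) ~ log(Miles) + log(Speed) + log(Hours) + log(Population) +
             log(Capacity) + log(AdjAsset) + LoadFactor,
           data = airline)
summary(fit1)
```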
However, before running the analysis, it is also useful to see the correlation matrix to understand how the dependent variable is related to the independent variables. After the suitable transformations, the correlation matrix is generated and the associated p-values are calculated. The outputs are shown below. Variable names are not changed, but the variables are already log transformed (view the code).
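One way to obtain both the correlation matrix and the associated p-values is rcorr() from the Hmisc package; a sketch under the same assumptions as above (the post's own code may differ):

```r
library(Hmisc)                      # provides rcorr()

logged <- data.frame(CostPerRTM = log(airline$CostPerRTM),
                     log(airline[, c("Miles", "Speed", "Hours", "Population",
                                     "Capacity", "AdjAsset")]),
                     LoadFactor = airline$LoadFactor)

rcorr(as.matrix(logged))            # correlation coefficients and associated p-values
```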
Very high levels of correlation are seen between the dependent variable and the independent variables. The associated p-values are also much lower than 0.05, suggesting that these correlations are not occurring by chance. However, a closer look reveals that the IVs are not independent of each other. Correlated variables tend to inflate the R-sq value, and that is why the VIF of each IV needs to be examined after running the regression analysis. The model output and VIF values of the initial model are shown below.
The R-sq value is very high, showing that the model explains around 98% of the variance of the dependent variable. The associated p-value is also very low, suggesting that the R-sq value is significantly different from zero. However, the VIF scores suggest that log(AdjAsset) is heavily correlated with the other IVs, and so is the case with log(Miles) and log(Speed). Usually VIF scores above 10 are not acceptable, and more conservatively a score above 5 is also treated as crossing the threshold. Moreover, the p-value associated with the IV log(AdjAsset) is well above 0.05, suggesting that this variable can be dropped without loss of predictive power. Hence, the model is rerun after discarding the log(AdjAsset) variable. The output is shown below.
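Continuing the lm() sketch above, the refit and the new VIF scores could be obtained with update() and vif():

```r
library(car)                                  # provides vif()

fit2 <- update(fit1, . ~ . - log(AdjAsset))   # same model with log(AdjAsset) dropped
summary(fit2)
vif(fit2)                                     # re-check multicollinearity
```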
There is not much of a drop in the R-sq value even after dropping the variable. But the VIF score and p-value associated with the estimate of log(Miles) suggest that this variable can also be dropped without much deterioration in the predictive power of the model. The same is true for the IV log(Speed). Hence, the analysis is rerun after discarding both variables. The output is shown below.
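Continuing the sketch, the refit without log(Miles) and log(Speed):

```r
fit3 <- update(fit2, . ~ . - log(Miles) - log(Speed))  # drop both variables
summary(fit3)
vif(fit3)                                              # re-check the VIF scores
```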
The output is interesting: the R-sq value is still very high, all the parameter estimates are significant at the 5% level, and the VIF score of each IV is below 5. However, the analysis is still not over. Normality and homoscedasticity of the residuals are yet to be verified. R produces four regression model diagnostic plots, as shown below.
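These are the standard plots produced by calling plot() on a fitted lm object; a minimal sketch:

```r
par(mfrow = c(2, 2))   # arrange the four diagnostic plots in one window
plot(fit3)             # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(mfrow = c(1, 1))
```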
To test homoscedasticity of the residuals, bptest() (the Breusch-Pagan test) is run, and to test normality of the residuals, ad.test() (the Anderson-Darling test) is run. Both tests fail to reject the null hypothesis, and hence we can consider the last model as the final fit. The output of the two tests is shown below.
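A sketch of the two tests on the final model's residuals (ad.test() here is taken from the nortest package):

```r
library(lmtest)                   # bptest(): Breusch-Pagan test
library(nortest)                  # ad.test(): Anderson-Darling normality test

bptest(fit3)                      # H0: residuals are homoscedastic
ad.test(residuals(fit3))          # H0: residuals are normally distributed
```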
Case no. 30 is an outlier, but as per the diagnostic plots it is a good leverage point, as it is not distorting the results. Hence, removal of that point is not necessary, and the final model is as shown below.
Codes and outputs
SAS is equally capable of doing a similar analysis. As the process has already been explained above, in this section I am going to provide only the outputs. Readers are requested to click on the links to view the output.
R and SAS Codes
Please feel free to put your comments/suggestions.