Friday, 12 September 2014

Binary Logistic Regression

Introduction

The linear regression model is fitted with the Ordinary Least Squares (OLS) technique, which, under the classical assumptions, yields the Best Linear Unbiased Estimator (BLUE) of the model parameters. That is why OLS is the preferred technique for fitting such models. However, OLS has its own limitations. In real-world scenarios there are several situations where the underlying relationship between the dependent and independent variables is not linear at all. In marketing, there is a concept called the saturation level of product penetration. In 1969 Frank Bass published a paper describing a model (known as the Bass Diffusion Model) which became very popular because of its ability to predict the future saturation level of a new product. That model is essentially nonlinear, and to make it usable a linear approximation was proposed that uses OLS for parameter estimation. Similarly, the Keynesian goods market model is also nonlinear, and those nonlinearities have a prominent impact on market dynamics. As more and more complexity is introduced into a model, a simple linear relationship becomes overly restrictive. The linear regression model is good for forecasting and inference, but it fails completely when the dependent variable is categorical in nature. In that case the problem at hand becomes a classification problem, and a logit or probit function is used to handle it; the same regression analysis then becomes logistic regression.

Basic Concepts

For simplicity, I shall discuss 'Binary Logistic Regression'; 'Multinomial Logistic Regression' is just an extension of the same concept. Imagine that the dependent variable Y can take only two values, i.e., 0 and 1. This {0,1} binary nature is very prominent in the marketing (churn prediction) and banking (defaulter prediction) sectors. Linear regression cannot be used directly because, in that case, the fitted Y can take any real value, which is meaningless (Y can have only two discrete values). An alternative is to model the probability of occurrence of the event, because probability is numeric in nature. However, there is another problem: with simple linear regression, even the fitted probability values can go below 0 or above 1, which is unrealistic. Moreover, the error terms will never be homoscedastic with this approach. To rectify this problem, the ratio of the probability of the event to that of the nonevent is considered; this is called the odds. The odds can take any value between 0 and infinity and are on a ratio scale. The odds can be modelled as an exponential function of a linear combination of the predictors, as shown below.

 \(\frac{P}{1-P} = e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dots + \beta_n x_n}\)

where P and (1-P) are the probabilities of the event and the nonevent, \(x_1\), \(x_2\), \(x_3\), etc. are the independent variables, and \(\beta_0\), \(\beta_1\), \(\beta_2\), etc. are the coefficients (or parameters). Taking logarithms on both sides makes the right-hand side of the equation a linear combination of the parameters and independent variables, and \(\log(P/(1-P))\) is still numeric in nature. Thus, a nonlinear equation is converted into a linear one through the log transformation, as shown below.

\(\log\left(\frac{P}{1-P}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dots + \beta_n x_n\)

\(\log(P/(1-P))\) is called the link function (the logit), and the parameters are estimated using the maximum likelihood method. Logistic regression can handle both numeric and categorical independent variables, which makes it more versatile than its competitor, discriminant analysis. Although the nonlinear model is transformed into a linear one, interpretation of the output is not as straightforward as in the case of the linear regression model. There are quite a few things to understand before finalizing the model; those details are given in this link. From here I shall move on to the model building exercise.
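To make the link function concrete, here is a small illustrative sketch in R (the probability values are made up purely for demonstration): the logit transform maps probabilities in (0,1) onto the whole real line, and the inverse transform (the logistic function) maps them back.

# Illustrative only: the logit link and its inverse (the logistic function)
p=c(0.1,0.25,0.5,0.75,0.9)            # A few example probabilities (made up for illustration)
logit=function(p) log(p/(1-p))        # Maps (0,1) onto the whole real line
invlogit=function(z) 1/(1+exp(-z))    # Maps any real number back into (0,1)
z=logit(p)
round(z,3)                            # Unbounded values, suitable for a linear predictor
round(invlogit(z),3)                  # Recovers the original probabilities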

Binary Logistic Regression with R

To perform binary logistic regression I am going to use the 'obese' dataset, which can be downloaded from this link. The dataset contains 4690 cases and 8 attributes. The last attribute is 'obese', which is binary in nature. Obesity may be considered a disease, and it can be influenced by many factors. Some of those factors are included in the dataset, and our objective is to see whether obesity can be predicted on the basis of the other available attributes. For the current analysis I downloaded the .dta (Stata) file, which can easily be read into R using the 'foreign' library. An important aspect of logistic regression is that the parameters are estimated using the maximum likelihood method, and this method fails if there is complete separation of the data with respect to any independent variable. This dataset has one such variable, which prevents parameter estimation using MLE; I came to know about this only after running glm() to develop the model. Body mass index (the attribute 'bmi' in the dataset) is responsible for the complete separation in this case. The data frame was given the name 'obese', and the tapply() function produced the following output. Clearly, the maximum of 'bmi' in group 0 is less than the minimum of 'bmi' in group 1; thus, complete separation of the dataset has taken place with respect to the attribute 'bmi'.
Subhasis >with(obese,tapply(bmi,obese,min))
   0    1 
16.2 30.0 
Subhasis >with(obese,tapply(bmi,obese,max))
   0    1 
29.9 57.6
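
The same group-wise comparison can be repeated for every numeric attribute to check whether any other variable separates the two classes. A minimal sketch, assuming the data frame has already been loaded as 'obese' with the 0/1 response in the column of the same name, is given below.

# Rough check for complete separation: for every numeric predictor, compare
# the range of values in group 0 with the range in group 1
numVars=names(obese)[sapply(obese,is.numeric)]
numVars=setdiff(numVars,"obese")                   # Drop the 0/1 response itself
for(v in numVars){
  rng=tapply(obese[[v]],obese$obese,range,na.rm=TRUE)
  overlap=min(rng[["0"]][2],rng[["1"]][2]) >= max(rng[["0"]][1],rng[["1"]][1])
  cat(v,if(overlap) ": ranges overlap\n" else ": possible complete separation\n")
}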

Because of this separation, the attribute 'bmi' is, for the time being, not considered for building the model. Classification modelling needs a training set and a test set to judge the accuracy of the model built. Technically, the training and test sets are independent of each other: the training set is used to build the model whereas the test set is used for validating it. In a linear regression model we get an $R^2$ value, but that value has no meaning in logistic regression; rather, a confusion matrix (or classification matrix) is more useful in this situation. R has a library called 'caret' (Classification and Regression Training) which can do several things as far as model building and model validation are concerned. But before using this library, a simple model is built using the following code.
library(foreign)                       # Needed for read.dta()
obese=read.dta(file.choose())          # Choose the file location for data loading
obese=obese[,c(-6,-7)]                 # Variables 6 and 7, i.e. 'bmi' and 'id', are removed
obese=na.omit(obese)                   # Cases with missing values are removed
obese$obese=as.factor(obese$obese)     # The dependent variable 'obese' is converted to a factor
glmModel0=glm(obese~.,data=obese,family=binomial(logit))    # Logistic regression is fitted with glm() using the 'logit' link

The output of the fitted model is shown below. Clearly, not all the predictor variables are significant. Moreover, the Hosmer-Lemeshow test also indicates a poor fit: the goodness-of-fit hypothesis is rejected at the 5% significance level (p-value 0.004023). Hence, the model is modified by removing the non-significant attributes.
Subhasis >glmModel0

Call:  glm(formula = obese ~ ., family = binomial, data = obese)

Coefficients:
(Intercept)     sexWomen          sbp          dbp          scl          age  
  -7.487025     0.321417     0.002344     0.047856     0.002164     0.009684  

Degrees of Freedom: 4657 Total (i.e. Null);  4652 Residual
Null Deviance:     3563 
Residual Deviance: 3271  AIC: 3283

Subhasis >summary(glmModel0)

Call:
glm(formula = obese ~ ., family = binomial, data = obese)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.8129  -0.5389  -0.4274  -0.3306   2.5966  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -7.487025   0.397276 -18.846  < 2e-16 ***
sexWomen     0.321417   0.095026   3.382 0.000718 ***
sbp          0.002344   0.003179   0.737 0.460842    
dbp          0.047856   0.005721   8.366  < 2e-16 ***
scl          0.002164   0.001023   2.116 0.034329 *  
age          0.009684   0.006141   1.577 0.114819    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3563.1  on 4657  degrees of freedom
Residual deviance: 3270.5  on 4652  degrees of freedom
AIC: 3282.5

Number of Fisher Scoring iterations: 5

Subhasis >hoslem.test(glmModel0$y,fitted(glmModel0),g=10)

 Hosmer and Lemeshow goodness of fit (GOF) test

data:  glmModel0$y, fitted(glmModel0)
X-squared = 22.5298, df = 8, p-value = 0.004023
hoslem.test() is available in the 'ResourceSelection' library and is a convenient function for judging goodness of fit in logistic regression. In the next model, only the three predictor variables found to be significant in the above output are used.
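The reduced model can be refitted with a single glm() call; a minimal sketch (the formula matches the call reported in the output below) is:

library(ResourceSelection)                                          # Provides hoslem.test()
glmModel1=glm(obese~sex+dbp+scl,data=obese,family=binomial(logit))  # Reduced model with the three significant predictors
hoslem.test(glmModel1$y,fitted(glmModel1),g=10)                     # Hosmer-Lemeshow goodness-of-fit test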
Subhasis >glmModel1

Call:  glm(formula = obese ~ sex + dbp + scl, family = binomial(logit), 
    data = obese)

Coefficients:
(Intercept)     sexWomen          dbp          scl  
  -7.248784     0.349211     0.052787     0.002593  

Degrees of Freedom: 4657 Total (i.e. Null);  4654 Residual
Null Deviance:     3563 
Residual Deviance: 3275  AIC: 3283

Subhasis >summary(glmModel1)

Call:
glm(formula = obese ~ sex + dbp + scl, family = binomial(logit), 
    data = obese)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.8246  -0.5404  -0.4273  -0.3303   2.5966  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -7.2487845  0.3629444 -19.972  < 2e-16 ***
sexWomen     0.3492108  0.0935725   3.732  0.00019 ***
dbp          0.0527871  0.0034276  15.400  < 2e-16 ***
scl          0.0025926  0.0009951   2.605  0.00918 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3563.1  on 4657  degrees of freedom
Residual deviance: 3274.9  on 4654  degrees of freedom
AIC: 3282.9

Number of Fisher Scoring iterations: 5

Subhasis >hoslem.test(glmModel1$y,fitted(glmModel1),g=10)

 Hosmer and Lemeshow goodness of fit (GOF) test

data:  glmModel1$y, fitted(glmModel1)
X-squared = 14.4191, df = 8, p-value = 0.07148

As per the above output, the Hosmer-Lemeshow test no longer rejects the model at the 5% significance level (p-value 0.07148), so a better-fitting model has been found. All the predictor variables are significant at the 1% significance level. A likelihood ratio test of the model can be run using the lrtest() function from the 'lmtest' library: it tests the fitted model against a reduced model (having only an intercept) and reports a chi-square value with the associated p-value. The same function can be used to compare two nested models as well; a short sketch is given after the odds-ratio output below. The odds ratios for the predictor variables and their confidence intervals, obtained by exponentiating the coefficients, are shown below for the fitted model.
Subhasis >exp(coef(glmModel1))
 (Intercept)     sexWomen          dbp          scl 
0.0007110382 1.4179480726 1.0542051431 1.0025959306 

Subhasis >exp(confint.default(glmModel1))
                   2.5 %      97.5 %
(Intercept) 0.0003491034 0.001448211
sexWomen    1.1803513099 1.703371462
dbp         1.0471466961 1.061311169
scl         1.0006424864 1.004553188
As per the above outputs, if 'dbp' increases by one unit (keeping the other variables constant), the odds of being obese increase by about 5.4%. The output also suggests that women are more likely to be obese than men.
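
As mentioned above, the likelihood ratio tests can be run as follows (a minimal sketch, assuming the 'lmtest' package is installed):

library(lmtest)                # Provides lrtest()
lrtest(glmModel1)              # Likelihood ratio test against the intercept-only model
lrtest(glmModel0,glmModel1)    # The same function compares two nested models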

The above analysis was done on the entire dataset. In most cases, however, models are developed and then tested for validity and accuracy, which requires splitting the dataset into training and test sets. I shall demonstrate how this can be achieved with the 'caret' package; a confusion matrix will then be generated to assess the model's accuracy. The following code does the splitting and runs 10-fold cross-validation (repeated 3 times) to get the final model along with the ROC statistic.
library(caret)                         # Provides createDataPartition(), trainControl(), train(), confusionMatrix()

obese=read.dta(file.choose())          # Reload the raw dataset (as in the first code block)
obese=obese[,c(-6,-7)]                 # Remove 'bmi' and 'id' again
obese=na.omit(obese)                   # Remove cases with missing values
obese$obese=ifelse(obese$obese==0,"false","true")   # Recode 0/1 as 'false'/'true': classProbs=T needs valid level names
obese=obese[(order(obese$obese)),]
obese$obese=as.factor(obese$obese)

train=createDataPartition(y=obese$obese,p=0.75,list=F)   # 75% of the rows go to the training set
trainingData=obese[train,]
testData=obese[-train,]

control=trainControl(method="repeatedcv",      # 10-fold cross-validation repeated 3 times
                     number=10,
                     repeats=3,
                     classProbs=T,
                     summaryFunction=twoClassSummary)    # Reports ROC, sensitivity and specificity

modelfit=train(obese~sex+dbp+scl,
               data=trainingData,
               method="glm",                   # Plain logistic regression inside caret
               metric="ROC",
               trControl=control)

modelfit
summary(modelfit)
The output is shown below. The ROC value shows that about 70% of the area lies under the curve; if a classification model did no better than a random guess, the value would lie close to 50%, so the trained model is doing better than random guessing. Look at the coefficients: they differ from those of the earlier model 'glmModel1', but the differences are not large, so the estimates look stable.
Subhasis >modelfit
Generalized Linear Model 

3494 samples
   5 predictors
   2 classes: 'false', 'true' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 

Summary of sample sizes: 3144, 3144, 3145, 3145, 3144, 3144, ... 

Resampling results

  ROC    Sens   Spec    ROC SD  Sens SD  Spec SD
  0.708  0.993  0.0276  0.0404  0.00414  0.0201 

 
Subhasis >summary(modelfit)

Call:
NULL

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.8343  -0.5412  -0.4289  -0.3291   2.6013  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -7.614933   0.453816 -16.780  < 2e-16 ***
sex          0.362030   0.107951   3.354 0.000798 ***
dbp          0.053046   0.003978  13.335  < 2e-16 ***
scl          0.002472   0.001138   2.171 0.029895 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2672.5  on 3493  degrees of freedom
Residual deviance: 2457.2  on 3490  degrees of freedom
AIC: 2465.2

Number of Fisher Scoring iterations: 5
So far so good! But is the model really that good? To get the answer, we need to use it to make predictions on the test data, and then build a confusion matrix to assess the model's validity. The predictions and the confusion matrix, along with other statistics, are shown below. Even though the accuracy is above 85%, the model fails miserably at predicting obesity (i.e. obese='true') from the predictor variables.
Subhasis >glmPredict=predict(modelfit,testData)

Subhasis >confusionMatrix(glmPredict,testData$obese)

Confusion Matrix and Statistics

          Reference
Prediction false true
     false  1003  143
     true     12    6
                                         
               Accuracy : 0.8668         
                 95% CI : (0.846, 0.8858)
    No Information Rate : 0.872          
    P-Value [Acc > NIR] : 0.7182         
                                         
                  Kappa : 0.0455         
 Mcnemar's Test P-Value : <2e-16         
                                         
            Sensitivity : 0.98818        
            Specificity : 0.04027        
         Pos Pred Value : 0.87522        
         Neg Pred Value : 0.33333        
             Prevalence : 0.87199        
         Detection Rate : 0.86168        
   Detection Prevalence : 0.98454        
      Balanced Accuracy : 0.51422        
                                         
       'Positive' Class : false


Kappa is more appropriate than raw accuracy when the class levels are unbalanced (skewed class frequencies), and the Kappa value in the above output is very low; a quick recomputation of it by hand is shown below. Thus, binary logistic regression is probably not a good classification model for the dataset in hand. A somewhat better result may be expected from Naive Bayesian classification modelling, but that will be dealt with in another post.
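
The recomputation uses the counts from the confusion matrix above and compares the observed agreement with the agreement expected from the marginal frequencies alone:

# Cohen's Kappa recomputed from the confusion matrix reported above
cm=matrix(c(1003,12,143,6),nrow=2,
          dimnames=list(Prediction=c("false","true"),Reference=c("false","true")))
n=sum(cm)
observedAcc=sum(diag(cm))/n                       # Observed accuracy, about 0.867
expectedAcc=sum(rowSums(cm)*colSums(cm))/n^2      # Accuracy expected by chance alone, about 0.860
(observedAcc-expectedAcc)/(1-expectedAcc)         # Kappa, about 0.045 -- barely better than chance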

If you have any suggestions, do post your comments.


