
Thursday, 10 April 2025

Let us understand Logistic Regression (Part 1)

Introduction

Logistic regression, as the name suggests, deals with 1) the logit function and 2) regression. To many people, this algorithm's name is rather confusing because the name contains the term regression, whereas the algorithm actually does classification. In this post, I will provide some details related to this algorithm and explain how the name came into existence.

Logistic regression is essentially binomial in nature, i.e., it can be used for binary classification only. However, with some smart tweaks, the same algorithm can also be used for multi-class classification. I am not going to discuss multi-class classification in this post; it will be discussed in another post. So let us start with the basics. Below you can see the fitted regression line for two separate cases, i.e., (a) a regression task and (b) a classification task. In part (a), there is no problem, but in the case of part (b) (the classification case), there are quite a few problems.


Problem 1: Assumption of normality of residuals

    For linear regression, an important assumption is the normality of the residuals. Without this assumption, the standard errors, confidence intervals, and hypothesis tests on the estimated parameters become unreliable. For the classification task, the target variable takes only two values, 0 or 1. Hence, the residuals are bound to deviate from a normal distribution.

Problem 2: Assumption of homoskedasticity of error variance

    Another important assumption is the homoskedasticity of the error variance across the levels of the predictor variables. If this assumption does not hold, the model cannot be generalised across different levels of the input variables, and the estimated confidence intervals will be incorrect. As the target variable takes only two values, the error variance equals $p(1-p)$, which changes with the predicted probability, so the errors are guaranteed to be heteroskedastic in nature.

Problem 3: Predictions are unbounded

    A linear regression line is unbounded and hence its predictions are not restricted to lie between 0 and 1. This problem is, however, solvable by adopting a clipping mechanism: if any predicted value goes beyond 1, it is brought back to 1, and if any value goes below 0, it is clipped to 0. This, however, does not solve the problems associated with the assumptions of linear regression.
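As a tiny illustration of the clipping idea, here is a minimal Python sketch (the prediction values are made up for the example):

import numpy as np

# Hypothetical raw predictions from a linear regression fit on a binary target.
raw_predictions = np.array([-0.3, 0.2, 0.75, 1.4])

# Clip everything below 0 up to 0 and everything above 1 down to 1.
clipped = np.clip(raw_predictions, 0.0, 1.0)
print(clipped)  # [0.   0.2  0.75 1.  ]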

A typical error analysis plot for classification data is given below for a better understanding. The residuals look roughly normally distributed in the middle, but they deviate at both ends. The error variance is also very different from white noise (the homoskedastic case). Hence, it is clear that linear regression is highly unsuitable for classification tasks.
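To reproduce this kind of error analysis yourself, here is a minimal sketch with simulated data (the data-generating process and variable names are made up for illustration):

import numpy as np

rng = np.random.default_rng(0)

# Simulate a single feature and a binary target (made-up data-generating process).
n = 500
x = rng.normal(size=n)
p_true = 1.0 / (1.0 + np.exp(-(0.5 + 2.0 * x)))
y = rng.binomial(1, p_true)

# Fit ordinary linear regression (least squares) on the binary target.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta
residuals = y - y_hat

# Predictions escape the [0, 1] range, and the residual spread changes with x
# (heteroskedasticity), illustrating Problems 1-3.
print("min/max prediction:", y_hat.min(), y_hat.max())
print("residual variance (low x vs high x):",
      residuals[x < 0].var(), residuals[x >= 0].var())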

Finding a solution

The probability of success (corresponding to target value 1) can range from 0 to 1, and there can be, theoretically, an infinite number of probabilities between these two values. This is good for a regression task. However, the probability is bounded between 0 and 1. As stated in Problem 3, regression outcomes are unbounded, and hence there is a need to map this bounded probability to some other entity which is unbounded. One option is to look at a ratio, i.e., $\frac{P(\text{success})}{P(\text{failure})} = \frac{P(\text{success})}{1 - P(\text{success})}$. So, as $P(\text{success}) \to 1$, $\frac{P(\text{success})}{P(\text{failure})} \to \infty$, and as $P(\text{success}) \to 0$, $\frac{P(\text{success})}{P(\text{failure})} \to 0$. This is a partial solution because the ratio is unbounded only on one side: it can never go below 0. The solution is to take the logarithm of this ratio, so that when $P(\text{success}) < 0.5$, $\log\left(\frac{P(\text{success})}{P(\text{failure})}\right) < 0$, and this value can stretch from $-\infty$ (when $P(\text{success}) = 0$) to $+\infty$ (when $P(\text{success}) = 1$). Not only this, the new entity is purely continuous in nature within $(-\infty, \infty)$. This new entity is the log-odds (also called the logit), and the entity $\frac{P(\text{success})}{P(\text{failure})}$ is called the odds. So, suppose the log-odds is modelled as a linear combination of the features. What we have in our hand is a model: $\log\left(\frac{P(\text{success})}{P(\text{failure})}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_m X_m$

The equation looks like a linear regression model which tries to predict the log-odds using a linear combination of the features. This is why the model is named logistic regression.
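To make the probability-to-log-odds mapping concrete, here is a minimal Python sketch (the probability values are made up for illustration):

import numpy as np

# Made-up probabilities of success, ranging from near 0 to near 1.
p = np.array([0.01, 0.25, 0.5, 0.75, 0.99])

odds = p / (1 - p)          # bounded below by 0, unbounded above
log_odds = np.log(odds)     # unbounded in both directions

print(odds)      # approx [0.01, 0.33, 1.0, 3.0, 99.0]
print(log_odds)  # approx [-4.6, -1.1, 0.0, 1.1, 4.6]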

The next challenge is to estimate the $\beta$ values. The problem is, $P(\text{success})$ is not available in the dataset. The target values cannot be considered as $P(\text{success})$ because in that case, $\log\left(\frac{P(\text{success})}{P(\text{failure})}\right)$ would be either $-\infty$ or $+\infty$. This is where some more calculations are required, and depending on the method chosen, the $\beta$ values are estimated. In this post, I shall discuss the process of Maximum Likelihood Estimation (MLE). The reader can get a detailed coverage of this process in my post on MLE.

Maximum Likelihood Estimation of β

Since $P(\text{success})$ is unknown, a different route is required to find the $\beta$ values. Let us assume $Z = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_m X_m$. This would mean $\log\left(\frac{P(\text{success})}{P(\text{failure})}\right) = Z$. For simplicity, let us denote $P(\text{success}) = p$. Then, $\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_m X_m = Z.$

After a simple calculation (exponentiate both sides to get $\frac{p}{1-p} = e^{Z}$ and solve for $p$), the expression of $p$ becomes: $p = \frac{e^{Z}}{1 + e^{Z}}$. The function $\frac{e^{Z}}{1 + e^{Z}}$ is the logistic (sigmoid) function, and thus the probability of success is modelled using the logistic function. The next thing is to use the concept of MLE, and for this, an assumption about the distribution is needed. To use MLE, it is assumed that the outcomes of the target variable follow a Bernoulli distribution. If $y_i$ is the outcome of the $i$th data point, $y_i \in \{0, 1\}$. Thus, the probability mass function associated with the $i$th data point is: $P(y_i \mid X_i; \beta_0, \beta_1, ..., \beta_m) = p_i^{y_i} (1 - p_i)^{1 - y_i}$
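A minimal sketch of these two ingredients in Python (the function names are my own):

import numpy as np

def sigmoid(z):
    # Logistic (sigmoid) function: 1 / (1 + e^{-z}), equivalent to e^z / (1 + e^z).
    # Maps any real Z to a probability in (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def bernoulli_pmf(y, p):
    # P(y | p) = p^y * (1 - p)^(1 - y) for y in {0, 1}.
    return p**y * (1 - p)**(1 - y)

z = np.array([-3.0, 0.0, 3.0])
p = sigmoid(z)
print(p)                    # approx [0.047, 0.5, 0.953]
print(bernoulli_pmf(1, p))  # probability of observing y = 1 at each z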
Assuming that the data points are i.i.d., the likelihood function is defined as: $L = \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i}$
For practical purposes and mathematical convenience, we work with the log of this likelihood function. The log-likelihood function is given as: $l = \log(L) = \sum_{i=1}^{n} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]$
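As a quick sanity check, the log-likelihood can be computed directly; here is a minimal sketch (the target and probability arrays are made up for illustration):

import numpy as np

def log_likelihood(y, p, eps=1e-12):
    # l = sum over i of [ y_i*log(p_i) + (1 - y_i)*log(1 - p_i) ]
    # eps guards against log(0) when a predicted probability hits exactly 0 or 1.
    p = np.clip(p, eps, 1 - eps)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 1, 1])          # made-up binary targets
p = np.array([0.9, 0.2, 0.7, 0.6])  # made-up predicted probabilities
print(log_likelihood(y, p))         # approx -1.2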
The negative of the expression within the square brackets of the log-likelihood above is popularly known as the binary cross-entropy, a loss function which can be used for parameter optimisation via different iterative optimisation methods such as Gradient Descent, L-BFGS or Newton's Method (just to name a few!). Clearly, $l$ is a function of $p$, which in turn is a function of the $\beta$s. Hence, to optimise with respect to (say) $\beta_k$ (the parameter associated with the $k$th variable), we need to find the expression for $\frac{\partial l}{\partial \beta_k}$ and equate it to 0. Mathematically, $\frac{\partial l}{\partial \beta_k} = \sum_{i=1}^{n} y_i \frac{\partial \log(p_i)}{\partial \beta_k} + \sum_{i=1}^{n} (1 - y_i) \frac{\partial \log(1 - p_i)}{\partial \beta_k} = 0$
After putting the value of $p_i = \frac{e^{Z_i}}{1 + e^{Z_i}}$ in the above expression and doing some further calculations, it will be seen that $\frac{\partial l}{\partial \beta_k} = \sum_{i=1}^{n} \left[ y_i x_{ik} - p_i x_{ik} \right] = 0$
$x_{ik}$ is the value of the $k$th variable for the $i$th data point (with $x_{i0} = 1$ for the intercept). There will be $m + 1$ such equations, where $m$ is the number of features in the dataset. This final equation is not a closed-form equation for $\beta_k$, and hence we need to apply numerical methods to estimate the values of $\beta_k$. The Newton–Raphson method is quite popular for this purpose, although other methods exist as well. Another important thing to note here is that the Hessian matrix needs to be negative definite to ensure that the likelihood is maximised at the estimated values of $\beta_k$. If we differentiate $\frac{\partial l}{\partial \beta_k}$ again with respect to $\beta_r$ (the $\beta$ associated with the $r$th variable), $\frac{\partial^2 l}{\partial \beta_k \partial \beta_r} = -\sum_{i=1}^{n} x_{ik} \, p_i (1 - p_i) \, x_{ir}$
In matrix form, the Hessian matrix takes the form $H = -X^{T} \Sigma X$
where $\Sigma$ is a diagonal matrix with diagonal elements $p_i(1 - p_i) \in (0, 0.25]$. This matrix is positive definite. Hence, as long as the matrix $X$ is not rank deficient, $X^{T} \Sigma X$ is positive definite and $H = -X^{T} \Sigma X$ is negative definite. Thus, the parameters estimated using MLE for logistic regression are guaranteed to maximise the likelihood of observing the data. The graph below shows the fitted logistic (sigmoid) curve after fitting the data using logistic regression.
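To tie the pieces together, here is a minimal Newton–Raphson sketch that uses the gradient $X^{T}(y - p)$ and Hessian $-X^{T}\Sigma X$ derived above (simulated data, made-up names, and no safeguards such as step-size control or regularisation):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_newton(X, y, n_iter=25):
    # X is expected to already contain a leading column of 1s for the intercept.
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)                       # p_i = e^{Z_i} / (1 + e^{Z_i})
        gradient = X.T @ (y - p)                    # dl/dbeta_k = sum_i (y_i - p_i) x_ik
        Sigma = np.diag(p * (1 - p))                # diagonal matrix of p_i (1 - p_i)
        hessian = -X.T @ Sigma @ X                  # H = -X^T Sigma X (negative definite)
        beta -= np.linalg.solve(hessian, gradient)  # Newton step: beta - H^{-1} gradient
    return beta

# Simulated example with made-up true coefficients.
rng = np.random.default_rng(42)
n = 1000
x1 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1])
y = rng.binomial(1, sigmoid(-0.5 + 1.5 * x1))
print(fit_logistic_newton(X, y))  # estimates should be roughly close to [-0.5, 1.5]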

Conclusion

In this post I have discussed why logistic regression got its name and the workings of MLE for estimating the parameters of this model. Since $Z$ is a linear combination of the predictor variables, and the predictor variables are not raised to any power other than 1, the decision boundary of logistic regression is linear. Logistic regression therefore falls under the family of linear models. However, with the adaptation of kernels, it is possible to make logistic regression capture nonlinearity. I shall discuss that in a separate post.


