
Thursday, 10 April 2025

Let us understand Logistic Regression (Part 1)

Introduction

Logistic regression, as the name suggests, deals with 1) the logit function and 2) regression. To many people, this algorithm's name is rather confusing because the name contains the term regression, whereas the algorithm actually does classification. In this post, I will provide some details related to this algorithm and explain how the name came into existence.

Logistic regression is essentially binomial in nature, i.e., it can be used for binary classification only. However, with some smart tweaks, the same algorithm can also be used for multi-class classification. I am not going to discuss multi-class classification in this post; it will be discussed in another post. So let us start with the basics. Below you can see the fitted regression line for two separate cases, i.e., (a) a regression task and (b) a classification task. In part (a), there is no problem, but in the case of part (b) (the classification case), there are quite a few problems.


Problem 1: Assumption of normality of residuals

    For linear regression, an important assumption is the normality of the residuals. Without this assumption, the standard errors, confidence intervals, and hypothesis tests on the estimated parameters become unreliable. For the classification task, the target variable takes only two values, 0 or 1. Hence, the residuals are bound to deviate from a normal distribution.

Problem 2: Assumption of homoskedasticity of error variance

    Another important assumption is the homoskedasticity of the error variance across the levels of the predictor variables. If this assumption does not hold, the model cannot be generalised across different levels of the input variables, and the estimated confidence intervals will be incorrect. As the target variable takes only two values, the error variance equals $p(1-p)$, which changes with the predicted probability, so the errors are guaranteed to be heteroskedastic in nature.

Problem 3: Predictions are unbounded

    A linear regression line is unbounded and hence its predictions are not restricted to lie between 0 and 1. This problem is, however, solvable by adopting a clipping mechanism: if any predicted value goes beyond 1, it is brought back to 1, and if any value goes below 0, it is clipped to 0. This, however, does not solve the problems associated with the assumptions of linear regression.
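As a tiny illustration of the clipping idea, here is a minimal Python sketch (the prediction values are made up for the example):

import numpy as np

# Hypothetical raw predictions from a linear regression fit on a binary target.
raw_predictions = np.array([-0.3, 0.2, 0.75, 1.4])

# Clip everything below 0 up to 0 and everything above 1 down to 1.
clipped = np.clip(raw_predictions, 0.0, 1.0)
print(clipped)  # [0.   0.2  0.75 1.  ]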

A typical error analysis plot for classification data is given below for a better understanding. The residuals look roughly normally distributed in the middle, but they deviate at both ends. The error variance is also very different from white noise (the homoskedastic case). Hence, it is clear that linear regression is highly unsuitable for classification tasks.
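To reproduce this kind of error analysis yourself, here is a minimal sketch with simulated data (the data-generating process and variable names are made up for illustration):

import numpy as np

rng = np.random.default_rng(0)

# Simulate a single feature and a binary target (made-up data-generating process).
n = 500
x = rng.normal(size=n)
p_true = 1.0 / (1.0 + np.exp(-(0.5 + 2.0 * x)))
y = rng.binomial(1, p_true)

# Fit ordinary linear regression (least squares) on the binary target.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta
residuals = y - y_hat

# Predictions escape the [0, 1] range, and the residual spread changes with x
# (heteroskedasticity), illustrating Problems 1-3.
print("min/max prediction:", y_hat.min(), y_hat.max())
print("residual variance (low x vs high x):",
      residuals[x < 0].var(), residuals[x >= 0].var())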

Finding a solution

The probability of success (corresponding to target value 1) can range from 0 to 1, and there can be, theoretically, an infinite number of probabilities between these two values. This is good for a regression task. However, the probability is bounded between 0 and 1. As stated in Problem 3, regression outcomes are unbounded, and hence there is a need to map this bounded probability to some other entity which is unbounded. One option is to look at a ratio, i.e., $\frac{P(\text{success})}{P(\text{failure})} = \frac{P(\text{success})}{1 - P(\text{success})}$. So, as $P(\text{success}) \to 1$, $\frac{P(\text{success})}{P(\text{failure})} \to \infty$, and as $P(\text{success}) \to 0$, $\frac{P(\text{success})}{P(\text{failure})} \to 0$. This is a partial solution because the ratio is unbounded only on one side: it can never go below 0. The solution is to take the logarithm of this ratio, so that when $P(\text{success}) < 0.5$, $\log\left(\frac{P(\text{success})}{P(\text{failure})}\right) < 0$, and this value can stretch from $-\infty$ (when $P(\text{success}) = 0$) to $+\infty$ (when $P(\text{success}) = 1$). Not only this, the new entity is purely continuous in nature within $(-\infty, \infty)$. This new entity is the log-odds (also called the logit), and the entity $\frac{P(\text{success})}{P(\text{failure})}$ is called the odds. So, suppose the log-odds is modelled as a linear combination of the features. What we have in our hand is a model: $\log\left(\frac{P(\text{success})}{P(\text{failure})}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_m X_m$

The equation looks like a linear regression model which tries to predict the log-odds using a linear combination of the features. This is why the model is named logistic regression.
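To make the probability-to-log-odds mapping concrete, here is a minimal Python sketch (the probability values are made up for illustration):

import numpy as np

# Made-up probabilities of success, ranging from near 0 to near 1.
p = np.array([0.01, 0.25, 0.5, 0.75, 0.99])

odds = p / (1 - p)          # bounded below by 0, unbounded above
log_odds = np.log(odds)     # unbounded in both directions

print(odds)      # approx [0.01, 0.33, 1.0, 3.0, 99.0]
print(log_odds)  # approx [-4.6, -1.1, 0.0, 1.1, 4.6]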

The next challenge is to estimate the $\beta$ values. The problem is, $P(\text{success})$ is not available in the dataset. The target values cannot be considered as $P(\text{success})$ because in that case, $\log\left(\frac{P(\text{success})}{P(\text{failure})}\right)$ would be either $-\infty$ or $+\infty$. This is where some more calculations are required, and depending on the method chosen, the $\beta$ values are estimated. In this post, I shall discuss the process of Maximum Likelihood Estimation (MLE). The reader can get a detailed coverage of this process in my post on MLE.

Maximum Likelihood Estimation of β

Since $P(\text{success})$ is unknown, a different route is required to find the $\beta$ values. Let us assume $Z = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_m X_m$. This would mean $\log\left(\frac{P(\text{success})}{P(\text{failure})}\right) = Z$. For simplicity, let us denote $P(\text{success}) = p$. Then, $\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_m X_m = Z.$

After a simple calculation (exponentiate both sides to get $\frac{p}{1-p} = e^{Z}$ and solve for $p$), the expression of $p$ becomes: $p = \frac{e^{Z}}{1 + e^{Z}}$. The function $\frac{e^{Z}}{1 + e^{Z}}$ is the logistic (sigmoid) function, and thus the probability of success is modelled using the logistic function. The next thing is to use the concept of MLE, and for this, an assumption about the distribution is needed. To use MLE, it is assumed that the outcomes of the target variable follow a Bernoulli distribution. If $y_i$ is the outcome of the $i$th data point, $y_i \in \{0, 1\}$. Thus, the probability mass function associated with the $i$th data point is: $P(y_i \mid X_i; \beta_0, \beta_1, ..., \beta_m) = p_i^{y_i} (1 - p_i)^{1 - y_i}$
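A minimal sketch of these two ingredients in Python (the function names are my own):

import numpy as np

def sigmoid(z):
    # Logistic (sigmoid) function: 1 / (1 + e^{-z}), equivalent to e^z / (1 + e^z).
    # Maps any real Z to a probability in (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def bernoulli_pmf(y, p):
    # P(y | p) = p^y * (1 - p)^(1 - y) for y in {0, 1}.
    return p**y * (1 - p)**(1 - y)

z = np.array([-3.0, 0.0, 3.0])
p = sigmoid(z)
print(p)                    # approx [0.047, 0.5, 0.953]
print(bernoulli_pmf(1, p))  # probability of observing y = 1 at each z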
Assuming that the data points are i.i.d., the likelihood function is defined as: $L = \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i}$
For practical purposes and mathematical convenience, we work with the log of this likelihood function. The log-likelihood function is given as: $l = \log(L) = \sum_{i=1}^{n} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]$
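As a quick sanity check, the log-likelihood can be computed directly; here is a minimal sketch (the target and probability arrays are made up for illustration):

import numpy as np

def log_likelihood(y, p, eps=1e-12):
    # l = sum over i of [ y_i*log(p_i) + (1 - y_i)*log(1 - p_i) ]
    # eps guards against log(0) when a predicted probability hits exactly 0 or 1.
    p = np.clip(p, eps, 1 - eps)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 1, 1])          # made-up binary targets
p = np.array([0.9, 0.2, 0.7, 0.6])  # made-up predicted probabilities
print(log_likelihood(y, p))         # approx -1.2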
The negative of the expression within the square brackets of the log-likelihood above is popularly known as the binary cross-entropy, a loss function which can be used for parameter optimisation via different iterative optimisation methods such as Gradient Descent, L-BFGS or Newton's Method (just to name a few!). Clearly, $l$ is a function of $p$, which in turn is a function of the $\beta$s. Hence, to optimise with respect to (say) $\beta_k$ (the parameter associated with the $k$th variable), we need to find the expression for $\frac{\partial l}{\partial \beta_k}$ and equate it to 0. Mathematically, $\frac{\partial l}{\partial \beta_k} = \sum_{i=1}^{n} y_i \frac{\partial \log(p_i)}{\partial \beta_k} + \sum_{i=1}^{n} (1 - y_i) \frac{\partial \log(1 - p_i)}{\partial \beta_k} = 0$
After putting the value of $p_i = \frac{e^{Z_i}}{1 + e^{Z_i}}$ in the above expression and doing some further calculations, it will be seen that $\frac{\partial l}{\partial \beta_k} = \sum_{i=1}^{n} \left[ y_i x_{ik} - p_i x_{ik} \right] = 0$
$x_{ik}$ is the value of the $k$th variable for the $i$th data point (with $x_{i0} = 1$ for the intercept). There will be $m + 1$ such equations, where $m$ is the number of features in the dataset. This final equation is not a closed-form equation for $\beta_k$, and hence we need to apply numerical methods to estimate the values of $\beta_k$. The Newton–Raphson method is quite popular for this purpose, although other methods exist as well. Another important thing to note here is that the Hessian matrix needs to be negative definite to ensure that the likelihood is maximised at the estimated values of $\beta_k$. If we differentiate $\frac{\partial l}{\partial \beta_k}$ again with respect to $\beta_r$ (the $\beta$ associated with the $r$th variable), $\frac{\partial^2 l}{\partial \beta_k \partial \beta_r} = -\sum_{i=1}^{n} x_{ik} \, p_i (1 - p_i) \, x_{ir}$
In matrix form, the Hessian matrix takes the form $H = -X^{T} \Sigma X$
where $\Sigma$ is a diagonal matrix with diagonal elements $p_i(1 - p_i) \in (0, 0.25]$. This matrix is positive definite. Hence, as long as the matrix $X$ is not rank deficient, $X^{T} \Sigma X$ is positive definite and $H = -X^{T} \Sigma X$ is negative definite. Thus, the parameters estimated using MLE for logistic regression are guaranteed to maximise the likelihood of observing the data. The graph below shows the fitted logistic (sigmoid) curve after fitting the data using logistic regression.
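To tie the pieces together, here is a minimal Newton–Raphson sketch that uses the gradient $X^{T}(y - p)$ and Hessian $-X^{T}\Sigma X$ derived above (simulated data, made-up names, and no safeguards such as step-size control or regularisation):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_newton(X, y, n_iter=25):
    # X is expected to already contain a leading column of 1s for the intercept.
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)                       # p_i = e^{Z_i} / (1 + e^{Z_i})
        gradient = X.T @ (y - p)                    # dl/dbeta_k = sum_i (y_i - p_i) x_ik
        Sigma = np.diag(p * (1 - p))                # diagonal matrix of p_i (1 - p_i)
        hessian = -X.T @ Sigma @ X                  # H = -X^T Sigma X (negative definite)
        beta -= np.linalg.solve(hessian, gradient)  # Newton step: beta - H^{-1} gradient
    return beta

# Simulated example with made-up true coefficients.
rng = np.random.default_rng(42)
n = 1000
x1 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1])
y = rng.binomial(1, sigmoid(-0.5 + 1.5 * x1))
print(fit_logistic_newton(X, y))  # estimates should be roughly close to [-0.5, 1.5]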

Conclusion

In this post I have discussed why logistic regression got its name and the workings of MLE for estimating the parameters of this model. Since $Z$ is a linear combination of the predictor variables, and the predictor variables are not raised to any power other than 1, the decision boundary of logistic regression is linear. Logistic regression therefore falls under the family of linear models. However, with the adaptation of kernels, it is possible to make logistic regression capture nonlinearity. I shall discuss that in a separate post.


