
Friday, 11 April 2025

Let us understand Logistic Regression (Part 2)

This is the second blog post on Logistic Regression. The first post on this topic is available here. The first post discussed the formulation of Logistic Regression and the estimation of its parameters using the concept of Maximum Likelihood Estimation. In this post I shall discuss some other aspects associated with Logistic Regression. 

Handling Multi-class problems


Logistic regression is, by default, a binary classifier. However, with some modifications, it can also be adapted to multiclass classification problems. The most commonly used methods are:
  • One versus Rest method
  • Multinomial Logistic Regression (with deviance loss)

One versus Rest method


This method is intuitive in nature. As the name suggests, one of the classes is treated as the positive class and all the remaining classes are grouped into the negative class, which makes the problem binomial in nature. Hence, if there are $K$ classes, $K$ logistic regression models are built. When an input is given, it is passed through all the models, and each model predicts the probability of its corresponding class. Finally, the class with the highest probability among the $K$ probabilities is chosen. Mathematically, for the model corresponding to class $k$, the target is recoded as
$$y^{(k)}=\begin{cases}1 & \text{if } y=k\\ 0 & \text{otherwise}\end{cases}$$
Correspondingly, the associated probability is calculated as per the equation given below.
$$P(y=k|\mathbf{x})=\sigma(\mathbf{w}_k^T\mathbf{x}+b_k)=\frac{1}{1+e^{-(\mathbf{w}_k^T\mathbf{x}+b_k)}}$$ where $\mathbf{x}$ is the input, $\mathbf{w}_k$ is the weight vector associated with the $k^{th}$ model and $b_k$ is the corresponding bias (or intercept). The final prediction is the class with the highest probability: $$\hat{y}=\arg\max_k P(y=k|\mathbf{x})$$
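To make the mechanics concrete, here is a minimal sketch of the One-versus-Rest idea built by hand (the toy dataset and helper names such as ovr_models are assumptions made only for illustration): one binary logistic regression is fitted per class on the recoded labels, and the final prediction is the argmax over the predicted positive-class probabilities.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy multiclass dataset (assumed only for illustration)
X, y = make_classification(n_samples=500, n_features=5, n_informative=5,
                           n_redundant=0, n_classes=3, n_clusters_per_class=1,
                           random_state=0)
K = len(np.unique(y))

# One binary logistic regression per class: class k versus the rest
ovr_models = []
for k in range(K):
    y_k = (y == k).astype(int)          # recode: 1 if y == k, else 0
    model = LogisticRegression(max_iter=1000)
    model.fit(X, y_k)
    ovr_models.append(model)

# For a given input, each model gives P(y = k | x); pick the largest
probs = np.column_stack([m.predict_proba(X)[:, 1] for m in ovr_models])
y_pred = np.argmax(probs, axis=1)
print("Training accuracy of the hand-built OvR:", np.mean(y_pred == y))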

Multinomial Logistic Regression (with deviance loss)


Another method of performing multiclass classification with logistic regression is to apply a numerical optimisation algorithm to a single loss function. In this approach, $K$ separate models are not built for $K$ classes. Instead, all classes are modelled jointly using the softmax function. Suppose $W\in \mathbb{R}^{K\times d}$ is the weight matrix ($d$ being the dimension of the data) and $\mathbf{b} \in \mathbb{R}^K$ is the bias vector. In that case, a score is computed for every class as given below: $$\mathbf{z}=W\mathbf{x}+\mathbf{b}$$ If $\mathbf{w}_k$ denotes the parameters associated with the $k^{th}$ class, the score $z_k$ will be $z_k=\mathbf{w}_k^T\mathbf{x}+b_k$. These scores are converted to probabilities using the softmax function as shown below.
$$P(y=k|\mathbf{x})=\frac{e^{\mathbf{w}_k^T\mathbf{x}+b_k}}{\sum_{j=1}^{K}e^{\mathbf{w}_j^T\mathbf{x}+b_j}}$$ $P(y=k|\mathbf{x})$ is the probability that the input $\mathbf{x}$ belongs to class $k$. With these probabilities, the negative log loss is calculated as given below. $$L(W,\mathbf{b}|X)=-\frac{1}{n}\sum_{i=1}^{n}\log P(y_i|\mathbf{x}_i)=-\frac{1}{n}\sum_{i=1}^{n}\log\left(\frac{e^{z_{y_i}}}{\sum_{k=1}^{K}e^{z_k}}\right)=-\frac{1}{n}\sum_{i=1}^{n}\left[z_{y_i}-\log\left(\sum_{k=1}^{K}e^{z_k}\right)\right]$$ where $n$ is the number of data points and $z_k=\mathbf{w}_k^T\mathbf{x}_i+b_k$. Once the loss function is defined, numerical methods such as L-BFGS, Genetic Algorithms, Conjugate Gradient or Newton's Method can be employed to estimate the parameters. Sometimes it is necessary to regularise the parameters so that the model generalises better. Hence, the loss function is modified to accommodate a regularisation term ($L_2$ regularisation) as given below. $$L(W,\mathbf{b}|X)=\frac{1}{n}\sum_{i=1}^{n}L(W,\mathbf{b}|\mathbf{x}_i)+\frac{\lambda}{2}||W||^2$$ where $\lambda$ is the regularisation parameter. Afterwards, a numerical method can be applied to this loss function to estimate the regularised parameters. Because it is a multiclass classification task, the feature importance for a particular class is obtained by looking at the corresponding weight vector. A minimal NumPy sketch of these formulae is given next; after that, the complete Python example below generates a synthetic multiclass classification dataset, fits both One-versus-Rest and multinomial logistic regression, and plots the feature importances for each class as heatmaps.
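The following sketch evaluates the softmax probabilities and the regularised negative log-likelihood exactly as written above (the weight matrix W, bias b, inputs and labels are random toy values assumed purely for illustration, not taken from any fitted model):

import numpy as np

rng = np.random.default_rng(0)
n, d, K = 8, 5, 4                     # samples, features, classes (toy sizes)
X = rng.normal(size=(n, d))           # inputs
y = rng.integers(0, K, size=n)        # integer class labels
W = rng.normal(size=(K, d))           # weight matrix, one row per class
b = rng.normal(size=K)                # bias vector
lam = 0.1                             # regularisation strength lambda

# Scores z = W x + b for every sample (shape: n x K)
Z = X @ W.T + b

# Softmax probabilities P(y = k | x), with the usual max-shift for numerical stability
Z_shift = Z - Z.max(axis=1, keepdims=True)
P = np.exp(Z_shift) / np.exp(Z_shift).sum(axis=1, keepdims=True)

# Negative log-likelihood: average of -log P(y_i | x_i)
nll = -np.mean(np.log(P[np.arange(n), y]))

# L2-regularised loss: NLL + (lambda / 2) * ||W||^2
loss = nll + 0.5 * lam * np.sum(W ** 2)
print("NLL:", nll, "regularised loss:", loss)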

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=5, n_informative=5,
                           n_redundant=0, n_classes=4, n_clusters_per_class=1,
                           random_state=42)

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit One-vs-Rest Logistic Regression
ovr_model = LogisticRegression(multi_class='ovr', solver='lbfgs', max_iter=1000)
ovr_model.fit(X_scaled, y)

# Fit Softmax (Multinomial) Logistic Regression
softmax_model = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=1000)
softmax_model.fit(X_scaled, y)

# Plot heatmap of feature importance for OvR
plt.figure(figsize=(10, 4))
sns.heatmap(ovr_model.coef_, annot=True, cmap='coolwarm', center=0,
            xticklabels=[f'Feature {i}' for i in range(X.shape[1])],
            yticklabels=[f'Class {i}' for i in range(ovr_model.coef_.shape[0])])
plt.title("Feature Importance - One-vs-Rest Logistic Regression")
plt.xlabel("Features")
plt.ylabel("Classes")
plt.tight_layout()
plt.show()

# Plot heatmap of feature importance for Softmax
plt.figure(figsize=(10, 4))
sns.heatmap(softmax_model.coef_, annot=True, cmap='coolwarm', center=0,
            xticklabels=[f'Feature {i}' for i in range(X.shape[1])],
            yticklabels=[f'Class {i}' for i in range(softmax_model.coef_.shape[0])])
plt.title("Feature Importance - Softmax (Multinomial) Logistic Regression")
plt.xlabel("Features")
plt.ylabel("Classes")
plt.tight_layout()
plt.show()

The corresponding heat maps are shown below. 

As per the above plots, if feature 0 and feature 4 are increased by one unit, the log odds for class 2 increase by 0.92 and 0.77, respectively, when softmax logistic regression is used. The other coefficients can be interpreted in a similar way for the different classes, for both the OvR and the softmax models.
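As a small follow-up to the example above (this snippet assumes softmax_model from the earlier code block is still in scope), the coefficients can also be tabulated and exponentiated: $e^{w_{kj}}$ is the factor by which a one-unit increase in feature $j$ multiplies the unnormalised class score $e^{\mathbf{w}_k^T\mathbf{x}+b_k}$.

import numpy as np
import pandas as pd

# Coefficients of the softmax model as a labelled table
coef_table = pd.DataFrame(
    softmax_model.coef_,
    index=[f"Class {i}" for i in range(softmax_model.coef_.shape[0])],
    columns=[f"Feature {i}" for i in range(softmax_model.coef_.shape[1])],
)
print(coef_table.round(2))

# e^(coefficient): multiplicative effect of a one-unit feature increase
# on the unnormalised class score exp(w_k^T x + b_k)
print(np.exp(coef_table.loc["Class 2", ["Feature 0", "Feature 4"]]).round(2))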

Conclusion

This post discussed the theoretical aspects of multiclass classification using logistic regression. Python code with feature importance plots is also provided for better understanding. In the next post, I shall discuss the metrics used to measure the model's performance.

Thursday, 10 April 2025

Let us understand Logistic Regression (Part 1)

Introduction

Logistic regression, as the name suggests, deals with 1) the logit function and 2) regression. To many people, this algorithm's name is rather confusing because the name contains the term regression, whereas the algorithm actually does classification. In this post, I will provide some details related to this algorithm and explain how the name came into existence.

Logistic regression is essentially binomial in nature, i.e., it can be used for binary classification only. However, with some smart tweaks, the same algorithm can also be used for multi-class classification. I am not going to discuss the multi-class classification in this post. It will be discussed in another post. So let us start with the basics. Below you can see the regression line for two separate cases, i.e., regression and classification. In part (a), there is no problem, but in the case of part (b) (the classification case), there are quite a few problems. 


Problem 1: Assumption of normality of residuals

    For linear regression, an important assumption is the normality of the residuals. If this assumption is violated, the standard errors, confidence intervals and hypothesis tests for the estimated parameters become unreliable. For a classification task, the target variable has only 2 values, i.e., either 0 or 1. Hence, the residuals are bound to deviate from the normal distribution.

Problem 2: Assumption of homoskedasticity of error variance

    Another important assumption is the homoskedasticity of the error variance across the levels of the predictor variables. If this assumption does not hold, the model cannot be generalised across different levels of the input variables, and the estimated confidence intervals will be incorrect. Since the target variable takes only two values, the errors are guaranteed to be heteroskedastic in nature.

Problem 3: Predictions are unbounded

    A linear regression line is unbounded, whereas a probability must lie between 0 and 1. This problem is, however, solvable by adopting a clipping mechanism: any prediction above 1 is brought back to 1, and any prediction below 0 is clipped to 0. Clipping, however, does not solve the problems associated with the other assumptions of linear regression.

A typical error analysis plot for the classification data is given below for a better understanding. The residuals look roughly normally distributed, but they deviate at both ends. The error variance is also very different from white noise (the homoskedastic case). Hence, linear regression is clearly unsuitable for classification tasks.
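A sketch along these lines can reproduce such an error analysis plot (the dataset and plotting choices below are assumptions for illustration, not the ones used for the original figure): linear regression is fitted to a binary target, and the residual histogram and residual-versus-fitted plots are drawn.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LinearRegression

# Binary classification data fitted, wrongly, with plain linear regression
X, y = make_classification(n_samples=500, n_features=1, n_informative=1,
                           n_redundant=0, n_clusters_per_class=1, random_state=0)
lin = LinearRegression().fit(X, y)
residuals = y - lin.predict(X)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(residuals, bins=30)                     # distribution of residuals
axes[0].set_title("Residual histogram")
axes[1].scatter(lin.predict(X), residuals, s=10)     # residuals vs fitted values
axes[1].axhline(0, color='red', linestyle='--')
axes[1].set_title("Residuals vs fitted values")
plt.tight_layout()
plt.show()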

Finding solution

The probability of success (corresponding to target value 1) can range from 0 to 1, and theoretically there are infinitely many values between these two extremes. This is good for a regression task. However, the probability is bounded between 0 and 1, and, as stated in Problem 3, regression outcomes are unbounded. Hence there is a need to map this bounded probability to some other quantity which is unbounded. One option is to look at a ratio, i.e., $\frac{P(success)}{P(failure)}=\frac{P(success)}{1-P(success)}$. As $P(success) \to 1$, $\frac{P(success)}{P(failure)} \to \infty$. This is only a partial solution because this ratio can never go below 0. The full solution is to take the logarithm of this ratio, so that when $P(success) < 0.5$, $\log\left(\frac{P(success)}{P(failure)}\right) < 0$, and the value stretches from $-\infty$ (when $P(success) = 0$) to $\infty$ (when $P(success) = 1$). Moreover, this new quantity is purely continuous within $(-\infty, \infty)$. It is the log-odds, and the quantity $\frac{P(success)}{P(failure)}$ is called the odds. So, suppose the log-odds is modelled as a linear combination of the features. What we then have in our hand is the model: $$\log\left(\frac{P(success)}{P(failure)}\right)=\beta_0+\beta_1X_1+\beta_2X_2+...+\beta_mX_m$$ The equation looks like a linear regression model that tries to predict the log-odds using a linear combination of the features. This is why the model is named logistic regression.
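A quick numerical illustration of how the odds and the log-odds stretch the bounded probability scale (the probability values below are chosen arbitrarily for illustration):

import numpy as np

p = np.array([0.01, 0.25, 0.50, 0.75, 0.99])   # probabilities of success
odds = p / (1 - p)                              # bounded below by 0, unbounded above
log_odds = np.log(odds)                         # unbounded in both directions

for pi, o, lo in zip(p, odds, log_odds):
    print(f"p = {pi:.2f}  odds = {o:8.4f}  log-odds = {lo:8.4f}")
# p near 0 gives a large negative log-odds, p near 1 gives a large positive log-odds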

The next challenge is to estimate the $\beta$ values. The problem is, $P(success)$ is not available in the dataset. The target values cannot be considered as $P(success)$ because in that case, $log\left (\frac{P(success)}{P(failure)}\right )$ would be either $\infty$ or $-\infty$. This is where some more calculations are required, and depending on the method chosen, $\beta$ values are estimated. In this post, I shall discuss the process of Maximum Likelihood Estimation (MLE). The reader can get a detailed coverage of this process in my post on MLE.

Maximum Likelihood Estimation of $\beta$

Since $P(success)$ is unknown, a different route is required to find the $\beta$ values. Let us assume $Z=\beta_0 + \beta_1X_1 + \beta_2X_2 + ... +\beta_mX_m$. This would mean $\log\left(\frac{P(success)}{P(failure)}\right)=Z$. For simplicity, let $P(success) = p$. Then, $$\log\left(\frac{p}{1-p}\right)=\beta_0+\beta_1X_1+\beta_2X_2+...+\beta_mX_m=Z$$ After a simple rearrangement, the expression for $p$ becomes $p=\frac{e^Z}{1 + e^Z}$. This is the logistic (sigmoid) function, the inverse of the logit, and thus the probability of success is modelled by the logistic function.

The next step is to use the concept of MLE, and for this an assumption about the distribution is needed. To use MLE, it is assumed that the outcomes of the target variable follow a Bernoulli distribution. If $y_i$ is the outcome of the $i^{th}$ data point, then $y_i \in \{0,1\}$. Thus, the probability mass function associated with the $i^{th}$ data point is $$P(y_i|\beta_0,\beta_1,...,\beta_m)=p_i^{y_i}(1-p_i)^{1-y_i}$$ Assuming that the data points are all i.i.d., the likelihood function is defined as $$L=\prod_{i=1}^{n}p_i^{y_i}(1-p_i)^{1-y_i}$$ For practical purposes and mathematical convenience, we work with the logarithm of this likelihood function. The log-likelihood function is $$l=\log(L)=\sum_{i=1}^{n}\left[y_i\log(p_i)+(1-y_i)\log(1-p_i)\right]$$ The negative of the expression within the square brackets is the binary cross entropy, a popular loss function that can be minimised using different iterative optimisation methods such as Gradient Descent, L-BFGS or Newton's Method (just to name a few!).

Clearly, $l$ is a function of $p$, which in turn is a function of the $\beta$s. Hence, to optimise (say) $\beta_k$ (the parameter associated with the $k^{th}$ variable), we need the expression for $\frac{\partial l}{\partial \beta_k}$ and we equate it to $0$. Mathematically, $$\frac{\partial l}{\partial \beta_k}=\sum_{i=1}^{n}y_i\frac{\partial \log(p_i)}{\partial \beta_k}+\sum_{i=1}^{n}(1-y_i)\frac{\partial \log(1-p_i)}{\partial \beta_k}=0$$ After substituting $p_i = \frac{e^{z_i}}{1+e^{z_i}}$ into the above expression and doing some further calculation, it turns out that $$\frac{\partial l}{\partial \beta_k}=\sum_{i=1}^{n}\left[y_i x_{ik}-p_i x_{ik}\right]=0$$ where $x_{ik}$ is the $k^{th}$ variable of the $i^{th}$ data point. There are $m+1$ such equations, where $m$ is the number of features in the dataset. This final equation has no closed-form solution for $\beta_k$, and hence numerical methods are needed to estimate its value. The Newton-Raphson method is quite popular for finding the values of $\beta_k$, though other methods exist for this purpose as well.

Another important point is that the Hessian matrix needs to be negative definite to ensure that the likelihood is maximised at the estimated values of $\beta_k$. Differentiating $\frac{\partial l}{\partial \beta_k}$ again with respect to $\beta_{r}$ (the $\beta$ associated with the $r^{th}$ variable) gives $$\frac{\partial^2 l}{\partial \beta_k\,\partial \beta_r}=-\sum_{i=1}^{n}x_{ik}\,p_i(1-p_i)\,x_{ir}$$ In matrix form, the Hessian takes the form $$H=-X^T\Sigma X$$ where $\Sigma$ is a diagonal matrix with diagonal elements $p_i(1-p_i) \in (0, 0.25]$. The matrix $X^T\Sigma X$ is positive definite as long as $X$ is not rank deficient, and hence $H=-X^T\Sigma X$ is negative definite. Thus, the parameters estimated using MLE for logistic regression do maximise the likelihood of observing the data. The graph below shows the fitted logistic (sigmoid) curve after fitting the data with logistic regression.
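To make the estimation procedure concrete, here is a minimal sketch of Newton-Raphson for logistic regression written directly from the gradient and Hessian derived above (the helper name fit_logistic_newton and the toy dataset are assumptions made purely for illustration, not part of the original post):

import numpy as np
from sklearn.datasets import make_classification

def fit_logistic_newton(X, y, n_iter=25, tol=1e-8):
    """Estimate beta by maximising the Bernoulli log-likelihood with Newton-Raphson."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend a column of 1s for beta_0
    beta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        z = Xb @ beta
        p = 1.0 / (1.0 + np.exp(-z))                # p_i = e^z / (1 + e^z)
        grad = Xb.T @ (y - p)                       # dl/dbeta_k = sum_i (y_i - p_i) x_ik
        sigma = p * (1 - p)                         # diagonal elements of Sigma
        H = -(Xb * sigma[:, None]).T @ Xb           # Hessian H = -X^T Sigma X
        step = np.linalg.solve(H, grad)             # Newton step: solve H s = grad
        beta = beta - step                          # beta_new = beta - H^{-1} grad
        if np.linalg.norm(step) < tol:
            break
    return beta

# Toy binary dataset to exercise the sketch (assumed for illustration)
X, y = make_classification(n_samples=400, n_features=3, n_informative=3,
                           n_redundant=0, random_state=1)
print("Estimated [beta_0, beta_1, beta_2, beta_3]:",
      np.round(fit_logistic_newton(X, y), 3))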

Conclusion

In this post I have discussed why logistic regression got its name and how MLE is used to estimate the parameters of this model. Since $Z$ is a linear combination of the predictor variables, and the predictor variables are not raised to any power other than 1, the decision boundary of logistic regression is linear; logistic regression therefore falls under the family of linear models. However, with the adoption of kernels, it is possible to make logistic regression capture nonlinearity. I shall discuss that in a separate post.

