
Friday, 11 April 2025

Let us understand Logistic Regression (Part 2)

This is the second blog post on Logistic Regression. The first post on this topic is available here. The first post discussed the formulation of Logistic Regression and the estimation of its parameters using the concept of Maximum Likelihood Estimation. In this post I shall discuss some other aspects associated with Logistic Regression. 

Handling Multi-class problems


Logistic regression is, by default, a binary classifier. With some modifications, however, it can also be adapted to multiclass classification problems. The most commonly used methods are:
  • One versus Rest method
  • Multinomial Logistic Regression (with deviance loss)

One versus Rest method


This method is quite intuitive. As the name suggests, one of the classes is treated as the positive class and all the remaining classes are grouped into the negative class, which makes the problem binary. Hence, if there are K classes, K logistic regression models are built. When an input is given, it is passed through all the models, and each model predicts the probability of occurrence of its corresponding class. Finally, the class with the highest probability among the K probabilities is chosen. Mathematically, if the true class of the $i$th data point is $k$, then the label used by the $k$th model is
$$y^{(k)} = \begin{cases} 1 & \text{if } y = k \\ 0 & \text{otherwise} \end{cases}$$
Correspondingly, the associated probability is calculated as per the equation given below.
$$P(y = k \mid x) = \sigma(w_k^T x + b_k) = \frac{1}{1 + e^{-(w_k^T x + b_k)}}$$
where $x$ is the input, $w_k$ is the weight vector associated with the $k$th model and $b_k$ is the bias (or intercept). The final class is the one with the highest predicted probability:
$$\hat{y} = \arg\max_k P(y = k \mid x)$$
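To make the mechanics concrete, here is a minimal sketch of the One-versus-Rest scheme built by hand with scikit-learn's binary LogisticRegression. The dataset sizes and variable names (X, y, K) are chosen purely for illustration and are not part of the example used later in this post.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Illustrative data: 3 classes, 4 features (sizes chosen arbitrarily)
X, y = make_classification(n_samples=600, n_features=4, n_informative=4,
                           n_redundant=0, n_classes=3, n_clusters_per_class=1,
                           random_state=0)
K = len(np.unique(y))

# Train one binary model per class: class k versus the rest
models = []
for k in range(K):
    y_k = (y == k).astype(int)              # 1 if y = k, 0 otherwise
    models.append(LogisticRegression().fit(X, y_k))

# Collect P(y = k | x) from every model and pick the class with the highest probability
probs = np.column_stack([m.predict_proba(X)[:, 1] for m in models])
y_hat = probs.argmax(axis=1)
print("Training accuracy of the hand-built OvR classifier:", (y_hat == y).mean())

This is essentially what scikit-learn does internally when the one-versus-rest strategy is selected.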

Multinomial Logistic Regression (with deviance loss)


Another method of performing multiclass classification using logistic regression is to apply numerical optimisation algorithms to a loss function. In this approach, K separate models are not built for the K classes. Instead, all classes are modelled jointly using the softmax function. Suppose $W \in \mathbb{R}^{K \times d}$ is the weight matrix ($d$ being the dimension of the data) and $b \in \mathbb{R}^{K}$ is the bias vector. Then a score is computed for every class as
$$z = Wx + b$$
If $w_k$ denotes the parameters associated with the $k$th class, the score for that class is $z_k = w_k^T x + b_k$. This score is converted to a probability using the softmax function as shown below.
$$P(y = k \mid x) = \frac{e^{w_k^T x + b_k}}{\sum_{j=1}^{K} e^{w_j^T x + b_j}}$$
$P(y = k \mid x)$ is the probability that the input $x$ belongs to class $k$. With these probabilities, the negative log loss is calculated as given below.
$$L(W, b \mid X) = -\frac{1}{n}\sum_{i=1}^{n} \log P(y_i \mid x_i) = -\frac{1}{n}\sum_{i=1}^{n} \log\left(\frac{e^{z_{y_i}}}{\sum_{k=1}^{K} e^{z_k}}\right) = -\frac{1}{n}\sum_{i=1}^{n} \left[ z_{y_i} - \log\left(\sum_{k=1}^{K} e^{z_k}\right) \right]$$
where $n$ is the number of data points and $z_k = w_k^T x_i + b_k$. Once the loss function is defined, numerical methods such as L-BFGS, Genetic Algorithms, Conjugate Gradient or Newton's method can easily be employed to estimate the parameters. Sometimes the parameters need to be regularised to make the model more generalisable. Hence, the loss function is modified to accommodate a regularisation term (L2 regularisation) as given below.
$$L(W, b \mid X) = \frac{1}{n}\sum_{i=1}^{n} L(W, b \mid x_i) + \frac{\lambda}{2}\lVert W \rVert^2$$
where $\lambda$ is the regularisation parameter. Afterwards, a numerical method can be applied to this loss function to estimate the regularised parameters. Because it is a multiclass classification task, the feature importance for a particular class is obtained from the corresponding weight vector. The Python code below generates a synthetic multiclass classification dataset, applies both One-versus-Rest and multinomial logistic regression, and plots the feature importance for each class as heatmaps.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=5, n_informative=5,
                           n_redundant=0, n_classes=4, n_clusters_per_class=1,
                           random_state=42)

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit One-vs-Rest Logistic Regression
ovr_model = LogisticRegression(multi_class='ovr', solver='lbfgs', max_iter=1000)
ovr_model.fit(X_scaled, y)

# Fit Softmax (Multinomial) Logistic Regression
softmax_model = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=1000)
softmax_model.fit(X_scaled, y)

# Plot heatmap of feature importance for OvR
plt.figure(figsize=(10, 4))
sns.heatmap(ovr_model.coef_, annot=True, cmap='coolwarm', center=0,
            xticklabels=[f'Feature {i}' for i in range(X.shape[1])],
            yticklabels=[f'Class {i}' for i in range(ovr_model.coef_.shape[0])])
plt.title("Feature Importance - One-vs-Rest Logistic Regression")
plt.xlabel("Features")
plt.ylabel("Classes")
plt.tight_layout()
plt.show()

# Plot heatmap of feature importance for Softmax
plt.figure(figsize=(10, 4))
sns.heatmap(softmax_model.coef_, annot=True, cmap='coolwarm', center=0,
            xticklabels=[f'Feature {i}' for i in range(X.shape[1])],
            yticklabels=[f'Class {i}' for i in range(softmax_model.coef_.shape[0])])
plt.title("Feature Importance - Softmax (Multinomial) Logistic Regression")
plt.xlabel("Features")
plt.ylabel("Classes")
plt.tight_layout()
plt.show()
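As a quick numerical cross-check of the loss defined above, the lines below (continuing the same script, so they reuse X_scaled, y and softmax_model) recompute the softmax probabilities and the L2-regularised negative log loss by hand. The value λ = 1 is an assumption for illustration only and is not meant to reproduce scikit-learn's internal objective exactly.

# Cross-check: recompute the softmax probabilities and the regularised loss by hand
W = softmax_model.coef_                     # weight matrix, shape (K, d)
b = softmax_model.intercept_                # bias vector, shape (K,)

Z = X_scaled @ W.T + b                      # scores z_k for every sample
Z = Z - Z.max(axis=1, keepdims=True)        # subtract the row maximum for numerical stability
P = np.exp(Z) / np.exp(Z).sum(axis=1, keepdims=True)   # softmax probabilities

n = X_scaled.shape[0]
lam = 1.0                                   # assumed regularisation strength (illustration only)
nll = -np.log(P[np.arange(n), y]).mean()    # negative log loss
loss = nll + (lam / 2) * np.sum(W ** 2)     # L2-regularised loss as defined above
print("Negative log loss  :", round(nll, 4))
print("L2-regularised loss:", round(loss, 4))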

The corresponding heat maps are shown below. 

As per the above plots, with the softmax logistic regression model, increasing feature 0 or feature 4 by one unit increases the log odds of class 2 by 0.92 and 0.77, respectively. The other coefficients can be interpreted in the same way for the different classes, for both the OvR and the softmax models.
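This interpretation can be read directly off the fitted coefficient matrices. The short snippet below (reusing ovr_model and softmax_model from the script above) simply prints the class-2 weight vectors; the exact values depend on the synthetic dataset generated earlier, so they may differ from the numbers quoted above.

# coef_[2, j] is the weight of feature j for class 2; a one-unit increase in
# feature j changes the class-2 score (log odds) by this amount
for name, model in [("OvR", ovr_model), ("Softmax", softmax_model)]:
    print(f"{name:8s} class-2 coefficients:", np.round(model.coef_[2], 2))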

Conclusion

This post discussed the theoretical aspects of multiclass classification using logistic regression, along with Python code and feature importance plots for better understanding. In the next post, I shall discuss the metrics used to measure the model's performance.
