
Friday, 11 April 2025

Let us understand Logistic Regression (Part 2)

This is the second blog post on Logistic Regression. The first post on this topic is available here. The first post discussed the formulation of Logistic Regression and the estimation of its parameters using the concept of Maximum Likelihood Estimation. In this post I shall discuss some other aspects associated with Logistic Regression. 

Handling Multi-class problems


Logistic regression is, by default, a binary classifier. However, with some modifications, it can also be adapted to multiclass classification problems. The most commonly used methods are:
  • One versus Rest method
  • Multinomial Logistic Regression (with deviance loss)

One versus Rest method


This method is intuitive in nature. As the name suggests, one of the classes is treated as the positive class and all the remaining classes are grouped into the negative class, which makes the problem binomial in nature. Hence, if there are $K$ classes, $K$ logistic regression models are built. When an input is given, it is passed through all the models, and each model predicts the probability of its corresponding class. Finally, the class with the highest probability among the $K$ probabilities is chosen. Mathematically, for the model corresponding to class $k$, the target is recoded as
$$y^{(k)}=\begin{cases}1 & \text{if } y=k\\ 0 & \text{otherwise}\end{cases}$$
Correspondingly, the associated probability is calculated as per the equation given below.
$$P(y=k|\mathbf{x})=\sigma(\mathbf{w}_k^T\mathbf{x}+b_k)=\frac{1}{1+e^{-(\mathbf{w}_k^T\mathbf{x}+b_k)}}$$ where $\mathbf{x}$ is the input, $\mathbf{w}_k$ is the weight vector associated with the $k^{th}$ model and $b_k$ is the corresponding bias (or intercept). The final prediction is the class with the highest probability: $$\hat{y}=\arg\max_k P(y=k|\mathbf{x})$$
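To make the mechanics concrete, here is a minimal sketch of the One-versus-Rest idea built by hand (the toy dataset and helper names such as ovr_models are assumptions made only for illustration): one binary logistic regression is fitted per class on the recoded labels, and the final prediction is the argmax over the predicted positive-class probabilities.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy multiclass dataset (assumed only for illustration)
X, y = make_classification(n_samples=500, n_features=5, n_informative=5,
                           n_redundant=0, n_classes=3, n_clusters_per_class=1,
                           random_state=0)
K = len(np.unique(y))

# One binary logistic regression per class: class k versus the rest
ovr_models = []
for k in range(K):
    y_k = (y == k).astype(int)          # recode: 1 if y == k, else 0
    model = LogisticRegression(max_iter=1000)
    model.fit(X, y_k)
    ovr_models.append(model)

# For a given input, each model gives P(y = k | x); pick the largest
probs = np.column_stack([m.predict_proba(X)[:, 1] for m in ovr_models])
y_pred = np.argmax(probs, axis=1)
print("Training accuracy of the hand-built OvR:", np.mean(y_pred == y))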

Multinomial Logistic Regression (with deviance loss)


Another method of performing multiclass classification with logistic regression is to apply a numerical optimisation algorithm to a single loss function. In this approach, $K$ separate models are not built for $K$ classes. Instead, all classes are modelled jointly using the softmax function. Suppose $W\in \mathbb{R}^{K\times d}$ is the weight matrix ($d$ being the dimension of the data) and $\mathbf{b} \in \mathbb{R}^K$ is the bias vector. In that case, a score is computed for every class as given below: $$\mathbf{z}=W\mathbf{x}+\mathbf{b}$$ If $\mathbf{w}_k$ denotes the parameters associated with the $k^{th}$ class, the score $z_k$ will be $z_k=\mathbf{w}_k^T\mathbf{x}+b_k$. These scores are converted to probabilities using the softmax function as shown below.
$$P(y=k|\mathbf{x})=\frac{e^{\mathbf{w}_k^T\mathbf{x}+b_k}}{\sum_{j=1}^{K}e^{\mathbf{w}_j^T\mathbf{x}+b_j}}$$ $P(y=k|\mathbf{x})$ is the probability that the input $\mathbf{x}$ belongs to class $k$. With these probabilities, the negative log loss is calculated as given below. $$L(W,\mathbf{b}|X)=-\frac{1}{n}\sum_{i=1}^{n}\log P(y_i|\mathbf{x}_i)=-\frac{1}{n}\sum_{i=1}^{n}\log\left(\frac{e^{z_{y_i}}}{\sum_{k=1}^{K}e^{z_k}}\right)=-\frac{1}{n}\sum_{i=1}^{n}\left[z_{y_i}-\log\left(\sum_{k=1}^{K}e^{z_k}\right)\right]$$ where $n$ is the number of data points and $z_k=\mathbf{w}_k^T\mathbf{x}_i+b_k$. Once the loss function is defined, numerical methods such as L-BFGS, Genetic Algorithms, Conjugate Gradient or Newton's Method can be employed to estimate the parameters. Sometimes it is necessary to regularise the parameters so that the model generalises better. Hence, the loss function is modified to accommodate a regularisation term ($L_2$ regularisation) as given below. $$L(W,\mathbf{b}|X)=\frac{1}{n}\sum_{i=1}^{n}L(W,\mathbf{b}|\mathbf{x}_i)+\frac{\lambda}{2}||W||^2$$ where $\lambda$ is the regularisation parameter. Afterwards, a numerical method can be applied to this loss function to estimate the regularised parameters. Because it is a multiclass classification task, the feature importance for a particular class is obtained by looking at the corresponding weight vector. A minimal NumPy sketch of these formulae is given next; after that, the complete Python example below generates a synthetic multiclass classification dataset, fits both One-versus-Rest and multinomial logistic regression, and plots the feature importances for each class as heatmaps.
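The following sketch evaluates the softmax probabilities and the regularised negative log-likelihood exactly as written above (the weight matrix W, bias b, inputs and labels are random toy values assumed purely for illustration, not taken from any fitted model):

import numpy as np

rng = np.random.default_rng(0)
n, d, K = 8, 5, 4                     # samples, features, classes (toy sizes)
X = rng.normal(size=(n, d))           # inputs
y = rng.integers(0, K, size=n)        # integer class labels
W = rng.normal(size=(K, d))           # weight matrix, one row per class
b = rng.normal(size=K)                # bias vector
lam = 0.1                             # regularisation strength lambda

# Scores z = W x + b for every sample (shape: n x K)
Z = X @ W.T + b

# Softmax probabilities P(y = k | x), with the usual max-shift for numerical stability
Z_shift = Z - Z.max(axis=1, keepdims=True)
P = np.exp(Z_shift) / np.exp(Z_shift).sum(axis=1, keepdims=True)

# Negative log-likelihood: average of -log P(y_i | x_i)
nll = -np.mean(np.log(P[np.arange(n), y]))

# L2-regularised loss: NLL + (lambda / 2) * ||W||^2
loss = nll + 0.5 * lam * np.sum(W ** 2)
print("NLL:", nll, "regularised loss:", loss)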

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=5, n_informative=5,
                           n_redundant=0, n_classes=4, n_clusters_per_class=1,
                           random_state=42)

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit One-vs-Rest Logistic Regression
ovr_model = LogisticRegression(multi_class='ovr', solver='lbfgs', max_iter=1000)
ovr_model.fit(X_scaled, y)

# Fit Softmax (Multinomial) Logistic Regression
softmax_model = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=1000)
softmax_model.fit(X_scaled, y)

# Plot heatmap of feature importance for OvR
plt.figure(figsize=(10, 4))
sns.heatmap(ovr_model.coef_, annot=True, cmap='coolwarm', center=0,
            xticklabels=[f'Feature {i}' for i in range(X.shape[1])],
            yticklabels=[f'Class {i}' for i in range(ovr_model.coef_.shape[0])])
plt.title("Feature Importance - One-vs-Rest Logistic Regression")
plt.xlabel("Features")
plt.ylabel("Classes")
plt.tight_layout()
plt.show()

# Plot heatmap of feature importance for Softmax
plt.figure(figsize=(10, 4))
sns.heatmap(softmax_model.coef_, annot=True, cmap='coolwarm', center=0,
            xticklabels=[f'Feature {i}' for i in range(X.shape[1])],
            yticklabels=[f'Class {i}' for i in range(softmax_model.coef_.shape[0])])
plt.title("Feature Importance - Softmax (Multinomial) Logistic Regression")
plt.xlabel("Features")
plt.ylabel("Classes")
plt.tight_layout()
plt.show()

The corresponding heat maps are shown below. 

As per the above plots, if feature 0 and feature 4 are increased by one unit, the log odds for class 2 increase by 0.92 and 0.77, respectively, when softmax logistic regression is used. The other coefficients can be interpreted in a similar way for the different classes, for both the OvR and the softmax models.
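As a small follow-up to the example above (this snippet assumes softmax_model from the earlier code block is still in scope), the coefficients can also be tabulated and exponentiated: $e^{w_{kj}}$ is the factor by which a one-unit increase in feature $j$ multiplies the unnormalised class score $e^{\mathbf{w}_k^T\mathbf{x}+b_k}$.

import numpy as np
import pandas as pd

# Coefficients of the softmax model as a labelled table
coef_table = pd.DataFrame(
    softmax_model.coef_,
    index=[f"Class {i}" for i in range(softmax_model.coef_.shape[0])],
    columns=[f"Feature {i}" for i in range(softmax_model.coef_.shape[1])],
)
print(coef_table.round(2))

# e^(coefficient): multiplicative effect of a one-unit feature increase
# on the unnormalised class score exp(w_k^T x + b_k)
print(np.exp(coef_table.loc["Class 2", ["Feature 0", "Feature 4"]]).round(2))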

Conclusion

This post discussed the theoretical aspects of multiclass classification using logistic regression. Python code with feature importance plots is also provided for better understanding. In the next post, I shall discuss the metrics used to measure the model's performance.

Thursday, 10 April 2025

Let us understand Logistic Regression (Part 1)

Introduction

Logistic regression, as the name suggests, deals with 1) the logit function and 2) regression. To many people, this algorithm's name is rather confusing because the name contains the term regression, whereas the algorithm actually does classification. In this post, I will provide some details related to this algorithm and explain how the name came into existence.

Logistic regression is essentially binomial in nature, i.e., it can be used for binary classification only. However, with some smart tweaks, the same algorithm can also be used for multi-class classification. I am not going to discuss the multi-class classification in this post. It will be discussed in another post. So let us start with the basics. Below you can see the regression line for two separate cases, i.e., regression and classification. In part (a), there is no problem, but in the case of part (b) (the classification case), there are quite a few problems. 


Problem 1: Assumption of normality of residuals

    For linear regression, an important assumption is the normality of the residuals. If this assumption is violated, the standard errors, confidence intervals and hypothesis tests for the estimated parameters become unreliable. For a classification task, the target variable has only 2 values, i.e., either 0 or 1. Hence, the residuals are bound to deviate from the normal distribution.

Problem 2: Assumption of homoskedasticity of error variance

    Another important assumption is the homoskedasticity of the error variance across the levels of the predictor variables. If this assumption does not hold, the model cannot be generalised across different levels of the input variables, and the estimated confidence intervals will be incorrect. Since the target variable takes only two values, the errors are guaranteed to be heteroskedastic in nature.

Problem 3: Predictions are unbounded

    A linear regression line is unbounded, whereas a probability must lie between 0 and 1. This problem is, however, solvable by adopting a clipping mechanism: any prediction above 1 is brought back to 1, and any prediction below 0 is clipped to 0. Clipping, however, does not solve the problems associated with the other assumptions of linear regression.

A typical error analysis plot for the classification data is given below for a better understanding. The residuals look roughly normally distributed, but they deviate at both ends. The error variance is also very different from white noise (the homoskedastic case). Hence, linear regression is clearly unsuitable for classification tasks.
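A sketch along these lines can reproduce such an error analysis plot (the dataset and plotting choices below are assumptions for illustration, not the ones used for the original figure): linear regression is fitted to a binary target, and the residual histogram and residual-versus-fitted plots are drawn.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LinearRegression

# Binary classification data fitted, wrongly, with plain linear regression
X, y = make_classification(n_samples=500, n_features=1, n_informative=1,
                           n_redundant=0, n_clusters_per_class=1, random_state=0)
lin = LinearRegression().fit(X, y)
residuals = y - lin.predict(X)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(residuals, bins=30)                     # distribution of residuals
axes[0].set_title("Residual histogram")
axes[1].scatter(lin.predict(X), residuals, s=10)     # residuals vs fitted values
axes[1].axhline(0, color='red', linestyle='--')
axes[1].set_title("Residuals vs fitted values")
plt.tight_layout()
plt.show()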

Finding solution

The probability of success (corresponding to target value 1) can range from 0 to 1, and theoretically there are infinitely many values between these two extremes. This is good for a regression task. However, the probability is bounded between 0 and 1, and, as stated in Problem 3, regression outcomes are unbounded. Hence there is a need to map this bounded probability to some other quantity which is unbounded. One option is to look at a ratio, i.e., $\frac{P(success)}{P(failure)}=\frac{P(success)}{1-P(success)}$. As $P(success) \to 1$, $\frac{P(success)}{P(failure)} \to \infty$. This is only a partial solution because this ratio can never go below 0. The full solution is to take the logarithm of this ratio, so that when $P(success) < 0.5$, $\log\left(\frac{P(success)}{P(failure)}\right) < 0$, and the value stretches from $-\infty$ (when $P(success) = 0$) to $\infty$ (when $P(success) = 1$). Moreover, this new quantity is purely continuous within $(-\infty, \infty)$. It is the log-odds, and the quantity $\frac{P(success)}{P(failure)}$ is called the odds. So, suppose the log-odds is modelled as a linear combination of the features. What we then have in our hand is the model: $$\log\left(\frac{P(success)}{P(failure)}\right)=\beta_0+\beta_1X_1+\beta_2X_2+...+\beta_mX_m$$ The equation looks like a linear regression model that tries to predict the log-odds using a linear combination of the features. This is why the model is named logistic regression.
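A quick numerical illustration of how the odds and the log-odds stretch the bounded probability scale (the probability values below are chosen arbitrarily for illustration):

import numpy as np

p = np.array([0.01, 0.25, 0.50, 0.75, 0.99])   # probabilities of success
odds = p / (1 - p)                              # bounded below by 0, unbounded above
log_odds = np.log(odds)                         # unbounded in both directions

for pi, o, lo in zip(p, odds, log_odds):
    print(f"p = {pi:.2f}  odds = {o:8.4f}  log-odds = {lo:8.4f}")
# p near 0 gives a large negative log-odds, p near 1 gives a large positive log-odds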

The next challenge is to estimate the $\beta$ values. The problem is, $P(success)$ is not available in the dataset. The target values cannot be considered as $P(success)$ because in that case, $log\left (\frac{P(success)}{P(failure)}\right )$ would be either $\infty$ or $-\infty$. This is where some more calculations are required, and depending on the method chosen, $\beta$ values are estimated. In this post, I shall discuss the process of Maximum Likelihood Estimation (MLE). The reader can get a detailed coverage of this process in my post on MLE.

Maximum Likelihood Estimation of $\beta$

Since $P(success)$ is unknown, a different route is required to find the $\beta$ values. Let us assume $Z=\beta_0 + \beta_1X_1 + \beta_2X_2 + ... +\beta_mX_m$. This would mean $\log\left(\frac{P(success)}{P(failure)}\right)=Z$. For simplicity, let $P(success) = p$. Then, $$\log\left(\frac{p}{1-p}\right)=\beta_0+\beta_1X_1+\beta_2X_2+...+\beta_mX_m=Z$$ After a simple rearrangement, the expression for $p$ becomes $p=\frac{e^Z}{1 + e^Z}$. This is the logistic (sigmoid) function, the inverse of the logit, and thus the probability of success is modelled by the logistic function.

The next step is to use the concept of MLE, and for this an assumption about the distribution is needed. To use MLE, it is assumed that the outcomes of the target variable follow a Bernoulli distribution. If $y_i$ is the outcome of the $i^{th}$ data point, then $y_i \in \{0,1\}$. Thus, the probability mass function associated with the $i^{th}$ data point is $$P(y_i|\beta_0,\beta_1,...,\beta_m)=p_i^{y_i}(1-p_i)^{1-y_i}$$ Assuming that the data points are all i.i.d., the likelihood function is defined as $$L=\prod_{i=1}^{n}p_i^{y_i}(1-p_i)^{1-y_i}$$ For practical purposes and mathematical convenience, we work with the logarithm of this likelihood function. The log-likelihood function is $$l=\log(L)=\sum_{i=1}^{n}\left[y_i\log(p_i)+(1-y_i)\log(1-p_i)\right]$$ The negative of the expression within the square brackets is the binary cross entropy, a popular loss function that can be minimised using different iterative optimisation methods such as Gradient Descent, L-BFGS or Newton's Method (just to name a few!).

Clearly, $l$ is a function of $p$, which in turn is a function of the $\beta$s. Hence, to optimise (say) $\beta_k$ (the parameter associated with the $k^{th}$ variable), we need the expression for $\frac{\partial l}{\partial \beta_k}$ and we equate it to $0$. Mathematically, $$\frac{\partial l}{\partial \beta_k}=\sum_{i=1}^{n}y_i\frac{\partial \log(p_i)}{\partial \beta_k}+\sum_{i=1}^{n}(1-y_i)\frac{\partial \log(1-p_i)}{\partial \beta_k}=0$$ After substituting $p_i = \frac{e^{z_i}}{1+e^{z_i}}$ into the above expression and doing some further calculation, it turns out that $$\frac{\partial l}{\partial \beta_k}=\sum_{i=1}^{n}\left[y_i x_{ik}-p_i x_{ik}\right]=0$$ where $x_{ik}$ is the $k^{th}$ variable of the $i^{th}$ data point. There are $m+1$ such equations, where $m$ is the number of features in the dataset. This final equation has no closed-form solution for $\beta_k$, and hence numerical methods are needed to estimate its value. The Newton-Raphson method is quite popular for finding the values of $\beta_k$, though other methods exist for this purpose as well.

Another important point is that the Hessian matrix needs to be negative definite to ensure that the likelihood is maximised at the estimated values of $\beta_k$. Differentiating $\frac{\partial l}{\partial \beta_k}$ again with respect to $\beta_{r}$ (the $\beta$ associated with the $r^{th}$ variable) gives $$\frac{\partial^2 l}{\partial \beta_k\,\partial \beta_r}=-\sum_{i=1}^{n}x_{ik}\,p_i(1-p_i)\,x_{ir}$$ In matrix form, the Hessian takes the form $$H=-X^T\Sigma X$$ where $\Sigma$ is a diagonal matrix with diagonal elements $p_i(1-p_i) \in (0, 0.25]$. The matrix $X^T\Sigma X$ is positive definite as long as $X$ is not rank deficient, and hence $H=-X^T\Sigma X$ is negative definite. Thus, the parameters estimated using MLE for logistic regression do maximise the likelihood of observing the data. The graph below shows the fitted logistic (sigmoid) curve after fitting the data with logistic regression.
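To make the estimation procedure concrete, here is a minimal sketch of Newton-Raphson for logistic regression written directly from the gradient and Hessian derived above (the helper name fit_logistic_newton and the toy dataset are assumptions made purely for illustration, not part of the original post):

import numpy as np
from sklearn.datasets import make_classification

def fit_logistic_newton(X, y, n_iter=25, tol=1e-8):
    """Estimate beta by maximising the Bernoulli log-likelihood with Newton-Raphson."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend a column of 1s for beta_0
    beta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        z = Xb @ beta
        p = 1.0 / (1.0 + np.exp(-z))                # p_i = e^z / (1 + e^z)
        grad = Xb.T @ (y - p)                       # dl/dbeta_k = sum_i (y_i - p_i) x_ik
        sigma = p * (1 - p)                         # diagonal elements of Sigma
        H = -(Xb * sigma[:, None]).T @ Xb           # Hessian H = -X^T Sigma X
        step = np.linalg.solve(H, grad)             # Newton step: solve H s = grad
        beta = beta - step                          # beta_new = beta - H^{-1} grad
        if np.linalg.norm(step) < tol:
            break
    return beta

# Toy binary dataset to exercise the sketch (assumed for illustration)
X, y = make_classification(n_samples=400, n_features=3, n_informative=3,
                           n_redundant=0, random_state=1)
print("Estimated [beta_0, beta_1, beta_2, beta_3]:",
      np.round(fit_logistic_newton(X, y), 3))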

Conclusion

In this post I have discussed why logistic regression got its name and how MLE is used to estimate the parameters of this model. Since $Z$ is a linear combination of the predictor variables, and the predictor variables are not raised to any power other than 1, the decision boundary of logistic regression is linear; logistic regression therefore falls under the family of linear models. However, with the adoption of kernels, it is possible to make logistic regression capture nonlinearity. I shall discuss that in a separate post.

