Monday, 27 May 2024

Maximum Likelihood Estimation (MLE): An important statistical tool for parameter estimation

Parameter estimation is critical for learning patterns within the data. Before the advancements in computation power, researchers used to do the calculations manually, and that is where iterative methods were never encouraged due to manual labour and hence the possibility of errors in calculation. During that time, those methods were preferred where the number of iterations was less. Maximum Likelihood Estimate (MLE) was one of the most preferred methods in this area and there were several algorithms whose parameters were estimated using MLE. In fact, this method of parameter estimation is still quite popular. In this blog post, I shall discuss about the math behind this wonderful method. 

Before using MLE in the estimation process, it is important to make an assumption on the distribution of either the data points or more importantly, the error terms. For example, to use MLE in estimating the parameters of a regression model, it is assumed that the error terms are normally distributed and, in logistic regression, it is assumed that the target variable is binomially distributed with some probability of success. Without such assumptions, it is impossible to use MLE. To use MLE, it is important to arrive at the expression of likelihood which is nothing but the joint probability of observing the data. To understand it mathematically, $$L=P(X_1, X_2, X_3,...,X_n)$$ where $[X_1, X_2, X_3, ..., X_n] \in P(X_i|\theta) \forall i={1,2,3,...,n}$. Here, $\theta$ is the parameter set of the probability function $f(.)$. For simplicity, in many algorithms, it is assumed that $X_i$s are all independently and identically distributed. That way, the expression of likelihood changes to a rather simpler expression because of the concept of independent events, i.e., $$L=P(X_1, X_2, X_3,..., X_n)=\prod_{i=1}^nP(X_i|\theta)$$ Here $\prod_{i=1}^n(.)$ is the product of individual elements within $()$. Once the expression of likelihood is correctly identified, the likelihood function becomes the function of the unknown parameters $\theta$ because all $X_i$s are already observed and that's why, they are constants within the likelihood function. Hence, even if likelihood points towards the probability of occurrence of events together, the likelihood function is not actually a probability function. Since likelihood is a function of parameters, maximum likelihood estimate focuses on estimating the parameters of the likelihood function such that the same function could attain the maximum value. In other words, the parameter set estimated through MLE will maximize the joint probability of occurrence of the observed instances. Mathematically, $$\hat \theta^{MLE} = \underset{\theta}{\operatorname {argmax}}\prod_{i=1}^nP(X_i|\theta) $$ So, if there are $k$ number of parameters to be estimated, by applying the rule of calculus, partial derivatives of likelihood function $L$ is equated to zero for all values of $\theta_j, \forall j={1,2,3,...,k}$. That is $$\frac {\partial L}{\partial \theta_j}=0, \forall j={1,2,3,...,k}$$ This process is consistent with the concept of finding maxima (or minima) of a function concerning the underlying parameter. Thus, for $k$ parameters, there would be $k$ equations and hence a unique solution can be found out. Sometimes, the equations are not closed-form equations (i.e., the parameter is a function of itself as in $x = sin(x+y)$). In such situations, numerical methods are required to be adopted (e.g., logistic regression) to estimate the parameters. Let us now use MLE in some cases.

Case 1

A coin is tossed 30 times and it is observed that heads came up 21 times. What is the probability of getting a head when it is tossed the next time?

Solution:

The solution is rather easy $p=\frac {21}{30}=0.70$. But this expression $p=\frac {\#\ of\ heads}{\#\ of\ trials}$ itself is a typical MLE of the probability of success. Let us try to derive it. Let $x_1, x_2, x_3,...,x_30$ are the 30 trials. We assume that the trials are independent of each other and identically distributed (which is quite fair in this situation!). Since there are only two outcomes, the entire process can be assumed to follow a binomial distribution. The probability of observing 21 heads out of 30 trials can be expressed as $$P(heads=21|p, 30)={30\choose 21}p^{21}(1-p)^{(30-21)}$$ here $p$ is the required probability of heads to be estimated. In the case of a discrete distribution like a binomial distribution, this expression is nothing but a probability mass function ($pmf$). It captures all the possible ways 21 heads can come up within 30 trials. If we take only one instance out of all the possible cases, the combination operator ${30 \choose 21}$ will go away and only $p^{21}(1-p)^{(30-21)}$ will be left out. Since 30 trials are run only once, we can assume that the likelihood function is $$L=p^{21}(1-p)^{9}$$ Once we obtain the likelihood function, then we can proceed with MLE as shown below: 

$$ \begin{align} L &= p^{21}(1-p)^{9}\\ \frac {\partial L}{\partial p}&= 21p^{(21-1)}(1-p)^9 + p^{21}(1-p)^{9-1}(-1)=0\\ &\implies 21p^{20}(1-p)^9 = p^{21}(1-p)^8 \\ &\implies 21(1-p)=p\\ &\implies p = \frac{21}{30}\end{align}$$ So, the result obtained earlier and with MLE are same. 

Case 2

Estimate the parameters of a linear regression model using MLE

Solution:

Let us assume that there is a dataset $D$ with $n$ data points in a $m$ dimensional space. Hence, $D_i=(X_i,y_i),\ \forall i = {1,2,3,...,n}$ and $X_i \in R^m$. This is a typical case of multiple regression where there are more than one predictor variable and one target variable. The linear regression model can be written as $$\hat y_i=\beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \beta_3x_{i3} + ... + \beta_mx_{im}$$ According to the above equation $\hat y_i$ is a linear combination of predictor variables (The linearity assumption). It is to be noted that $X_i = \{x_{i1}, x_{i2}, x_{i3}, ..., x_{im}\}$ where $x_{ij}$ is the $i^{th}$ value of the $j^{th}$ column. $\beta_0$ is the intercept. Thus, there are $m+1$ parameters to be estimated for this model. To apply, MLE, we need to make assumption about the distribution of an entity. Here comes another important assumption of linear regression, the errors are normally distributed. If $e_i$ is the error in prediction, $e_i = y_i - \hat y_i$ and as per the second assumption, $e_i \sim N(0, \sigma^2)$. Since error terms are normally distributed, another important assumption comes in implicitly, that is, error variance is homoskedastic in nature, meaning that the error terms are having uniform variations across the spectrum of the predictor variables. Or simply speaking, if error terms are plotted with respect to the target variable or any predictor variable, the scatter plot should look like a white noise without any pattern. The below image is helpful in understanding the meaning of homoskedasticity of error variance. 

Source: https://www.originlab.com/doc/Origin-Help/Residual-Plot-Analysis
With these three assumptions, it is possible to estimate the parameters of the linear regression model. For this we note that if $x \sim N(\mu, \sigma^2)$, $$P(x|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi \sigma^2}}exp \left( - \frac {\parallel x-\mu \parallel^2}{2\sigma^2}\right )$$ where $\mu$ represents the mean and $\sigma^2$ represents the variance of x respectively. Hence, if $e_i$ follows a normal distribution with 0 mean and $\sigma^2$ variance, $$\begin{align}P(e_i|0, \sigma^2) &= \frac{1}{\sqrt{2\pi \sigma^2}}exp \left( - \frac {\parallel e_i-0 \parallel^2}{2\sigma^2}\right ) \\ &= \frac{1}{\sqrt{2\pi \sigma^2}}exp \left( - \frac {e_i^2}{2\sigma^2}\right )\end{align}$$ Since there are $n$ data points within the dataset D, likelihood function of error is given by $$L = \prod_{i=1}^nP(e_i|0,\sigma^2)$$ The above expression is possible only if individual $e_i$ values are uncorrelated with each other. This is the fourth important assumption in linear regression, i.e., error terms are uncorrelated with each other (or absence of autocorrelation of error terms). Since $log$ is a monotonous function, if $log(f(x|\theta))$ attains maximum at some $\theta^*$, the function $f(x|\theta)$ would also attain maximum at the same $\theta^*$. Since likelihood deals with product of probabilities, we take $log$ of likelihood to convert product of probabilities into addition of log probabilities. Compared to multiplication, addition is easier to deal with while using calculus. Hence, we apply MLE on the log likelihood rather than the likelihood function. If $l$ is denoted as the log likelihood, 
$$\begin{align} l &= \sum_{i=1}^nln(P(e_i|0,\sigma^2)) \\ &= \sum_{i=1}^n ln \left [\frac{1}{\sqrt{2\pi \sigma^2}}exp \left( - \frac {e_i^2}{2\sigma^2}\right ) \right ] \\ &= \sum_{i=1}^n ln \left [\frac{1}{\sqrt{2\pi \sigma^2}}\right ] + \sum_{i=1}^n ln \left [exp \left( - \frac {e_i^2}{2\sigma^2}\right ) \right ] \\ &= Const + \sum_{i=1}^n \left [-\frac{e_i^2}{2\sigma^2} \right ] \\ &= Const - \frac{1}{2\sigma^2} \sum_{i=1}^n e_i^2 \end{align}$$ As per the above expression, the only way to maximize $l$ is to minimize $ \sum_{i=1}^n e_i^2$ because error variance $\sigma^2$ is constant. Now $ \sum_{i=1}^n e_i^2$ is nothing but the sum of square of error terms. This means that log likelihood is going to be maximized if the sum of square of error (popularly denoted as SSE) is minimized. Now, $$\begin{align}\sum_{i=1}^n e_i^2&=\sum_{i=1}^n (y_i-\hat y_i)^2 \\ &= \sum_{i=1}^n (y_i-X_i\beta )^2 \\ &= (y-X\beta)^T(y-X\beta) \ \ \ \ \ \ \ in\ matrix\ format\ where\ \beta=\{\beta_0, \beta_1, \beta_2,...\beta_m\} \\ &=(y^T-(X\beta)^T)(y-X\beta) \\ &= (y^T - \beta^TX^T)(y-X\beta) \\ &= y^Ty - y^TX\beta -\beta^TX^Ty + \beta^TX^TX\beta \end{align}$$ The above function is a function of parameters $\beta$ and hence, to minimize the SSE, we need to differentiate the above expression with respect to $\beta$. This is where matrix calculus will come into play and we write $$\begin{align}\frac{\partial}{\partial \beta}\sum_{i=1}^n e_i^2 &=  \frac{\partial}{\partial \beta}(y^Ty - y^TX\beta -\beta^TX^Ty + \beta^TX^TX\beta) \\ &= 0 - X^Ty - X^Ty + 2X^TX\beta\ \ \ \ because\ X^TX\ is\ square\ symmetric \\ &= -2X^Ty + 2X^TX\beta \end{align}$$For minima, $\frac{\partial}{\partial \beta}\sum_{i=1}^n e_i^2=0$ for all $\beta_j$ values where $j=\{0,1,2,3,...,m\}$. Thus we get, $$-2X^Ty + 2X^TX\beta=0$$. After rearrangements, we get $$X^TX\beta = X^Ty$$ $$\implies \beta = (X^TX)^{-1}X^Ty$$The term $(X^TX)^{-1}X^T$ is called the pseudo inverse. The result that has been got is same as that of Ordinary Least Square (OLS) method. Another important assumption is hidden in this expression. If the columns are having multi-collinearity, $det(X^TX)$ would be very small and the inverse will have very high (or even infinite) values. That would make the model unreliable and would render rather useless for practical purposes. Hence, one more critical assumption come into play, the variables are uncorrelated with each other. Thus, linear regression comes with 5 very critical assumptions:
  1. The model comprises of linear combination of the predictor variables
  2. The error terms of the model are normally distributed
  3. The error terms have constant variance across the levels of the predictor variables (error variance is homoskedastic in nature)
  4. Error terms do not have any auto-correlation with themselves
  5. Variables are uncorrelated with each other (no multicollinearity)

With MLE of parameters, the likelihood value can also be found out and with this value two more metrics can be looked at to compare model's performance. These are Akaike Information Criteria (AIC) and Bayesian Information Criteria(BIC). $$AIC= 2\beta -2ln(Likelihood)$$ $$BIC=\beta ln(n) - 2ln(Likelihood)$$. These two criteria play important roles in deciding the best statistical model out of different models with different number of parameters to estimate. BIC is however more stringent than AIC. A classic use of AIC or BIC is in stepwise regression model generation where the important variables are kept within the model and least important variables are often discarded. 

In this post, I have tried to make the readers understand the process of MLE and how it is used in parameter estimation.  I hope you will find it useful to clarify the concepts.

No comments:

Post a Comment

EM Algorithm and its usage (Part 2) EM algorithm is discussed in the previous post related to the tossing of coins. The same algorithm is q...