Parameter estimation is critical for learning patterns within data. Before the advances in computational power, researchers performed calculations by hand, so iterative methods were discouraged: the manual labour was heavy and the chance of arithmetic errors high. Methods requiring fewer iterations were therefore preferred. Maximum Likelihood Estimation (MLE) was one of the most popular methods of this kind, and the parameters of several algorithms were estimated using MLE. In fact, this method of parameter estimation is still quite popular. In this blog post, I shall discuss the math behind this wonderful method.
Before using MLE in the estimation process, it is important to make an assumption about the distribution of either the data points or, more importantly, the error terms. For example, to use MLE in estimating the parameters of a linear regression model, it is assumed that the error terms are normally distributed, and in logistic regression it is assumed that the target variable is binomially distributed with some probability of success. Without such assumptions, MLE cannot be used.

To use MLE, we first write down the expression of the likelihood, which is nothing but the joint probability of observing the data: $$L=P(X_1, X_2, X_3,...,X_n)$$ where $X_i \sim P(X_i|\theta),\ \forall i \in \{1,2,3,...,n\}$. Here, $\theta$ is the parameter set of the probability function $P(\cdot)$. For simplicity, many algorithms assume that the $X_i$s are independent and identically distributed. Because the joint probability of independent events factorizes, the likelihood then reduces to a simpler expression, $$L=P(X_1, X_2, X_3,..., X_n)=\prod_{i=1}^nP(X_i|\theta)$$ where $\prod_{i=1}^n(.)$ denotes the product of the individual terms.

Once the expression of the likelihood is correctly identified, the likelihood becomes a function of the unknown parameters $\theta$ alone, because all the $X_i$s are already observed and therefore act as constants within it. Hence, even though the likelihood is built from the probability of the observed events occurring together, the likelihood function is not itself a probability function. Since the likelihood is a function of the parameters, maximum likelihood estimation chooses the parameter values for which this function attains its maximum. In other words, the parameter set estimated through MLE maximizes the joint probability of occurrence of the observed instances. Mathematically, $$\hat \theta^{MLE} = \underset{\theta}{\operatorname {argmax}}\prod_{i=1}^nP(X_i|\theta)$$

So, if there are $k$ parameters to be estimated, by applying the rules of calculus, the partial derivatives of the likelihood function $L$ are equated to zero for every $\theta_j$: $$\frac {\partial L}{\partial \theta_j}=0,\ \forall j \in \{1,2,3,...,k\}$$ This is the usual procedure for finding the maximum (or minimum) of a function with respect to its parameters. Thus, for $k$ parameters there are $k$ equations, from which the estimates can be solved. Sometimes these equations do not have a closed-form solution (i.e., a parameter cannot be isolated on one side of its own equation, as in $x = \sin(x+y)$). In such situations, numerical methods have to be adopted to estimate the parameters (logistic regression is a typical example). Let us now use MLE in some cases.
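Before moving to the worked cases, here is a minimal Python sketch of the same idea done numerically: a negative log-likelihood is written down and handed to an optimizer. The normal model, the simulated sample, and the use of `scipy.optimize.minimize` are assumptions made purely for illustration (maximizing the log-likelihood gives the same argmax as maximizing the likelihood, since the logarithm is monotonic).

```python
# Minimal sketch: MLE by numerical optimization (illustrative only).
# Assumes an i.i.d. sample from a normal distribution with unknown mean and sd.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(42)
x = rng.normal(loc=5.0, scale=2.0, size=500)      # the "observed" data

def neg_log_likelihood(params):
    mu, log_sigma = params                        # optimize log(sigma) so sigma stays positive
    sigma = np.exp(log_sigma)
    return -np.sum(norm.logpdf(x, loc=mu, scale=sigma))

result = minimize(neg_log_likelihood, x0=[0.0, 0.0])
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)   # close to the sample mean and the (n-denominator) sample sd
```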
Case 1
A coin is tossed 30 times and it is observed that heads came up 21 times. What is the probability of getting a head when it is tossed the next time?
Solution:
The solution is rather easy: $p=\frac {21}{30}=0.70$. But this expression, $p=\frac {\#\ of\ heads}{\#\ of\ trials}$, is itself the typical MLE of the probability of success. Let us try to derive it. Let $x_1, x_2, x_3,...,x_{30}$ be the 30 trials. We assume that the trials are independent of each other and identically distributed (which is quite fair in this situation!). Since there are only two outcomes, the number of heads can be assumed to follow a binomial distribution. The probability of observing 21 heads out of 30 trials can be expressed as $$P(heads=21|p, 30)={30\choose 21}p^{21}(1-p)^{(30-21)}$$ where $p$ is the required probability of heads to be estimated. For a discrete distribution like the binomial, this expression is nothing but the probability mass function ($pmf$). The combination operator ${30 \choose 21}$ counts all the possible ways 21 heads can come up within 30 trials. If we consider only the particular sequence that was actually observed, this operator goes away and only $p^{21}(1-p)^{(30-21)}$ is left; equivalently, the combination term does not depend on $p$, so dropping it does not change where the likelihood attains its maximum. The likelihood function can therefore be taken as $$L=p^{21}(1-p)^{9}$$ Once we obtain the likelihood function, we can proceed with MLE as shown below:
$$ \begin{align} L &= p^{21}(1-p)^{9}\\ \frac {\partial L}{\partial p}&= 21p^{20}(1-p)^9 - 9p^{21}(1-p)^{8}=0\\ &\implies 21p^{20}(1-p)^9 = 9p^{21}(1-p)^8 \\ &\implies 21(1-p)=9p\\ &\implies p = \frac{21}{30}=0.70\end{align}$$ So, the result obtained earlier and the one obtained through MLE are the same.
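As a quick sanity check, the same estimate can be recovered numerically by scanning the log-likelihood over a grid of candidate values of $p$; the small Python sketch below is only illustrative.

```python
# Numerical check of the coin-toss MLE (illustrative sketch).
import numpy as np

heads, trials = 21, 30
p_grid = np.linspace(0.001, 0.999, 9999)          # candidate values of p
log_lik = heads * np.log(p_grid) + (trials - heads) * np.log(1 - p_grid)
p_hat = p_grid[np.argmax(log_lik)]
print(p_hat)                                      # ~0.70, matching 21/30
```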
Case 2
Estimate the parameters of a linear regression model using MLE
Solution:
Let us assume that there is a dataset $D$ with $n$ data points in an $m$-dimensional space. Hence, $D_i=(X_i,y_i),\ \forall i \in \{1,2,3,...,n\}$ and $X_i \in R^m$. This is a typical case of multiple regression, where there is more than one predictor variable and one target variable. The linear regression model can be written as $$\hat y_i=\beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \beta_3x_{i3} + ... + \beta_mx_{im}$$ According to the above equation, $\hat y_i$ is a linear combination of the predictor variables (the linearity assumption). Note that $X_i = \{x_{i1}, x_{i2}, x_{i3}, ..., x_{im}\}$, where $x_{ij}$ is the $i^{th}$ value of the $j^{th}$ column, and $\beta_0$ is the intercept. Thus, there are $m+1$ coefficients ($\beta_0$ through $\beta_m$) to be estimated for this model. To apply MLE, we need to make an assumption about the distribution of some quantity, and here comes another important assumption of linear regression: the errors are normally distributed. If $e_i$ is the error in prediction, then $e_i = y_i - \hat y_i$ and, as per this assumption, $e_i \sim N(0, \sigma^2)$. Because a single variance $\sigma^2$ is used for every error term, another important assumption comes in implicitly: the error variance is homoskedastic, meaning that the error terms have uniform variance across the range of the predictor variables. Simply speaking, if the error terms are plotted against the target variable or any predictor variable, the scatter plot should look like white noise without any pattern. The image below is helpful in understanding the meaning of homoskedasticity of the error variance.
Source: https://www.originlab.com/doc/Origin-Help/Residual-Plot-Analysis

To summarize, the key assumptions of the linear regression model are:
- The model is a linear combination of the predictor variables
- The error terms of the model are normally distributed
- The error terms have constant variance across the levels of the predictor variables (error variance is homoskedastic in nature)
- The error terms are not autocorrelated with each other
- The predictor variables are not correlated with each other (no multicollinearity)
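With these assumptions in place, the MLE of the regression coefficients can be sketched as follows (this is the standard derivation under the normal-error assumption). Since the errors are independent and $e_i = y_i - \hat y_i \sim N(0, \sigma^2)$, the likelihood of the observed data is $$L=\prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(y_i-\hat y_i)^2}{2\sigma^2}\right)$$ Taking the logarithm (which does not change the location of the maximum), $$\ln L = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n(y_i-\hat y_i)^2$$ For any fixed $\sigma^2$, maximizing $\ln L$ with respect to the $\beta_j$s is equivalent to minimizing $\sum_{i=1}^n(y_i-\hat y_i)^2$. In other words, under the normality assumption the maximum likelihood estimates of the coefficients coincide with the familiar least-squares estimates, and setting the partial derivatives $\frac{\partial \ln L}{\partial \beta_j}$ to zero yields the usual normal equations.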
Once the parameters are estimated through MLE, the maximized value of the likelihood is also available, and with it two more metrics can be computed to compare models: the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). If $k$ is the number of estimated parameters and $n$ is the number of data points, $$AIC= 2k -2\ln(Likelihood)$$ $$BIC=k\,\ln(n) - 2\ln(Likelihood)$$ These two criteria play important roles in selecting the best statistical model among candidate models with different numbers of parameters; in both cases, lower values are better. BIC is more stringent than AIC because its penalty on the number of parameters grows with $\ln(n)$. A classic use of AIC or BIC is in stepwise regression, where the important variables are kept in the model and the least important variables are discarded.
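As an illustration, the sketch below fits a simple linear regression by least squares (which, as shown above, coincides with the MLE under normal errors), computes the maximized log-likelihood, and then evaluates AIC and BIC. The simulated data and the convention of counting the error variance as an estimated parameter are assumptions made for this example.

```python
# Illustrative sketch: AIC and BIC for a linear regression fitted by MLE / least squares.
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + one predictor
y = 1.5 + 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)         # MLE of the coefficients
residuals = y - X @ beta_hat
sigma2_hat = np.mean(residuals ** 2)                      # MLE of the error variance

# Maximized log-likelihood under the normal-error model
log_lik = -0.5 * n * (np.log(2 * np.pi * sigma2_hat) + 1)

k = X.shape[1] + 1                                        # coefficients + error variance
aic = 2 * k - 2 * log_lik
bic = k * np.log(n) - 2 * log_lik
print(aic, bic)                                           # lower values indicate a better model
```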
In this post, I have tried to explain the process of MLE and how it is used in parameter estimation. I hope you find it useful for clarifying the concepts.