
Monday, 27 May 2024

Maximum Likelihood Estimation (MLE): An important statistical tool for parameter estimation

Parameter estimation is critical for learning patterns within data. Before the advances in computing power, researchers had to carry out calculations manually, so iterative methods were discouraged because of the labour involved and the resulting risk of calculation errors. Methods requiring fewer iterations were therefore preferred. Maximum Likelihood Estimation (MLE) was one of the most popular choices in this setting, and the parameters of several algorithms were estimated using it. In fact, this method of parameter estimation remains quite popular today. In this blog post, I shall discuss the math behind this wonderful method.

Before using MLE in the estimation process, it is important to make an assumption about the distribution of either the data points or, more importantly, the error terms. For example, to use MLE to estimate the parameters of a regression model, the error terms are assumed to be normally distributed, and in logistic regression, the target variable is assumed to follow a binomial distribution with some probability of success. Without such assumptions, MLE cannot be used. To use MLE, it is necessary to write down the expression of the likelihood, which is nothing but the joint probability of observing the data. Mathematically, $L = P(X_1, X_2, X_3, \ldots, X_n)$, where each $X_i \sim P(X_i \mid \theta)$, $i = 1, 2, 3, \ldots, n$. Here, $\theta$ is the parameter set of the probability function $P(\cdot)$. For simplicity, many algorithms assume that the $X_i$'s are independently and identically distributed. In that case, by the rule for independent events, the likelihood simplifies to $$L = P(X_1, X_2, X_3, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \theta),$$ where $\prod_{i=1}^{n}(\cdot)$ denotes the product of the individual terms. Once the expression of the likelihood is correctly identified, the likelihood function becomes a function of the unknown parameters $\theta$, because all the $X_i$'s have already been observed and therefore act as constants within it. Hence, even though the likelihood is built from the joint probability of the observed events, the likelihood function is not itself a probability function. Since the likelihood is a function of the parameters, maximum likelihood estimation chooses the parameters so that this function attains its maximum value. In other words, the parameter set estimated through MLE maximizes the joint probability of occurrence of the observed instances. Mathematically, $$\hat{\theta}_{MLE} = \arg\max_{\theta} \prod_{i=1}^{n} P(X_i \mid \theta).$$ So, if there are $k$ parameters to be estimated, by the rules of calculus the partial derivatives of the likelihood function $L$ are equated to zero for all $\theta_j$, $j = 1, 2, 3, \ldots, k$, that is, $$\frac{\partial L}{\partial \theta_j} = 0, \quad j = 1, 2, 3, \ldots, k.$$ This is simply the standard procedure for finding the maxima (or minima) of a function with respect to its parameters. Thus, for $k$ parameters there are $k$ equations, from which a unique solution can be found. Sometimes the equations do not have a closed form (i.e., the parameter is a function of itself, as in $x = \sin(x + y)$). In such situations, numerical methods must be adopted to estimate the parameters (e.g., in logistic regression). Let us now use MLE in some cases.
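To see the idea in action, here is a minimal sketch (an illustrative example of my own, assuming a normal model and synthetic data) that maximizes the log-likelihood numerically with SciPy and compares the result with the well-known closed-form MLEs, the sample mean and the (biased) sample standard deviation:

```python
# Minimal numerical MLE sketch (illustrative, with synthetic data):
# fit the mean and standard deviation of a normal model by maximizing
# the log-likelihood directly, then compare with the closed-form MLEs.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(42)
x = rng.normal(loc=5.0, scale=2.0, size=500)      # observed data X_1, ..., X_n

def neg_log_likelihood(theta):
    mu, log_sigma = theta                         # optimize log(sigma) so sigma stays positive
    sigma = np.exp(log_sigma)
    # l(theta) = sum_i ln P(X_i | theta) for the normal density
    log_lik = np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2))
    return -log_lik                               # minimize the negative log-likelihood

res = minimize(neg_log_likelihood, x0=[0.0, 0.0])
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])

print(mu_hat, sigma_hat)                          # numerical MLE
print(x.mean(), x.std(ddof=0))                    # closed-form MLE for comparison
```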

Case 1

A coin is tossed 30 times and it is observed that heads came up 21 times. What is the probability of getting a head when it is tossed the next time?

Solution:

The solution is rather easy: $p = \frac{21}{30} = 0.70$. But this expression, $p = \frac{\#\text{ of heads}}{\#\text{ of trials}}$, is itself the typical MLE of the probability of success. Let us try to derive it. Let $x_1, x_2, x_3, \ldots, x_{30}$ be the 30 trials. We assume that the trials are independent of each other and identically distributed (which is quite fair in this situation!). Since there are only two outcomes, the entire process can be assumed to follow a binomial distribution. The probability of observing 21 heads out of 30 trials can be expressed as $$P(\text{heads} = 21 \mid p, 30) = \binom{30}{21} p^{21} (1-p)^{30-21},$$ where $p$ is the required probability of heads to be estimated. In the case of a discrete distribution like the binomial, this expression is nothing but the probability mass function (pmf). It captures all the possible ways 21 heads can come up within 30 trials. If we take only one instance out of all the possible cases, the combination operator $\binom{30}{21}$ goes away and only $p^{21}(1-p)^{30-21}$ is left. Since the set of 30 trials is run only once, we can take the likelihood function to be $$L = p^{21}(1-p)^{9}.$$ Once we obtain the likelihood function, we can proceed with MLE as shown below:

$$L = p^{21}(1-p)^{9}$$ $$\frac{\partial L}{\partial p} = 21\, p^{20} (1-p)^{9} + p^{21} \cdot 9 (1-p)^{8} \cdot (-1) = 0$$ $$\Rightarrow 21\, p^{20} (1-p)^{9} = 9\, p^{21} (1-p)^{8}$$ $$\Rightarrow 21 (1-p) = 9p$$ $$\Rightarrow p = \frac{21}{30}$$ So, the result obtained earlier and the one obtained with MLE are the same.
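As a quick numerical cross-check of the derivation above (an illustrative sketch, not part of the original solution), the same answer comes out of maximizing the log of $L = p^{21}(1-p)^{9}$ directly:

```python
# Numerical check of the coin-toss MLE: maximize ln L = 21 ln p + 9 ln(1 - p).
import numpy as np
from scipy.optimize import minimize_scalar

heads, tails = 21, 9

def neg_log_likelihood(p):
    return -(heads * np.log(p) + tails * np.log(1 - p))

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x)   # ~0.70, matching p = 21/30
```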

Case 2

Estimate the parameters of a linear regression model using MLE

Solution:

Let us assume that there is a dataset $D$ with $n$ data points in an $m$-dimensional space. Hence, $D_i = (X_i, y_i)$, $i = 1, 2, 3, \ldots, n$, and $X_i \in \mathbb{R}^m$. This is a typical case of multiple regression, where there is more than one predictor variable and one target variable. The linear regression model can be written as $$\hat{y}_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \ldots + \beta_m x_{im}.$$ According to the above equation, $\hat{y}_i$ is a linear combination of the predictor variables (the linearity assumption). Note that $X_i = \{x_{i1}, x_{i2}, x_{i3}, \ldots, x_{im}\}$, where $x_{ij}$ is the $i$th value of the $j$th column, and $\beta_0$ is the intercept. Thus, there are $m+1$ parameters to be estimated for this model. To apply MLE, we need to make an assumption about the distribution of some quantity. Here comes another important assumption of linear regression: the errors are normally distributed. If $e_i$ is the error in prediction, $e_i = y_i - \hat{y}_i$, and as per this second assumption, $e_i \sim N(0, \sigma^2)$. Since the error terms are normally distributed, another important assumption comes in implicitly: the error variance is homoskedastic, meaning that the error terms show uniform variation across the range of the predictor variables. Simply speaking, if the error terms are plotted against the target variable or any predictor variable, the scatter plot should look like white noise without any pattern. The image below is helpful in understanding the meaning of homoskedasticity of the error variance.

Source: https://www.originlab.com/doc/Origin-Help/Residual-Plot-Analysis
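For readers who want to reproduce such a plot, here is a small sketch with synthetic, homoskedastic data (an illustrative example of my own; it plots residuals against the fitted values, one common choice):

```python
# Residual plot sketch: for homoskedastic errors, the scatter of residuals
# should look like patternless noise of roughly constant spread around zero.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 3.0 + 2.0 * x + rng.normal(0, 1.5, size=200)   # constant error variance

slope, intercept = np.polyfit(x, y, 1)             # simple linear fit
fitted = intercept + slope * x
residuals = y - fitted

plt.scatter(fitted, residuals, s=10)
plt.axhline(0, color="red")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("No visible pattern suggests homoskedastic errors")
plt.show()
```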
With these three assumptions, it is possible to estimate the parameters of the linear regression model. For this, we note that if $x \sim N(\mu, \sigma^2)$, then $$P(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),$$ where $\mu$ and $\sigma^2$ represent the mean and the variance of $x$ respectively. Hence, if $e_i$ follows a normal distribution with mean $0$ and variance $\sigma^2$, $$P(e_i \mid 0, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(e_i - 0)^2}{2\sigma^2}\right) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{e_i^2}{2\sigma^2}\right).$$ Since there are $n$ data points within the dataset $D$, the likelihood function of the errors is given by $$L = \prod_{i=1}^{n} P(e_i \mid 0, \sigma^2).$$ The above expression is valid only if the individual $e_i$ values are uncorrelated with each other. This is the fourth important assumption of linear regression: the error terms are uncorrelated with each other (absence of autocorrelation of the error terms). Since the logarithm is a monotonic function, if $\log(f(x \mid \theta))$ attains its maximum at some $\theta$, the function $f(x \mid \theta)$ also attains its maximum at the same $\theta$. Since the likelihood involves a product of probabilities, we take the log of the likelihood to convert the product into a sum of log probabilities; compared with multiplication, addition is much easier to handle with calculus. Hence, we apply MLE to the log likelihood rather than the likelihood function. If $l$ denotes the log likelihood,
$$l = \sum_{i=1}^{n} \ln\left(P(e_i \mid 0, \sigma^2)\right) = \sum_{i=1}^{n} \ln\left[\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{e_i^2}{2\sigma^2}\right)\right] = \sum_{i=1}^{n} \ln\left[\frac{1}{\sqrt{2\pi\sigma^2}}\right] + \sum_{i=1}^{n} \ln\left[\exp\left(-\frac{e_i^2}{2\sigma^2}\right)\right] = \text{Const} + \sum_{i=1}^{n} \left[-\frac{e_i^2}{2\sigma^2}\right] = \text{Const} - \frac{1}{2\sigma^2} \sum_{i=1}^{n} e_i^2.$$ As per the above expression, the only way to maximize $l$ is to minimize $\sum_{i=1}^{n} e_i^2$, because the error variance $\sigma^2$ is constant. Now $\sum_{i=1}^{n} e_i^2$ is nothing but the sum of squares of the error terms. This means that the log likelihood is maximized when the sum of squared errors (popularly denoted as SSE) is minimized. Now, $$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - X_i\beta)^2 = (y - X\beta)^T (y - X\beta) \quad \text{(in matrix form, where } \beta = \{\beta_0, \beta_1, \beta_2, \ldots, \beta_m\}\text{)}$$ $$= \left(y^T - (X\beta)^T\right)(y - X\beta) = \left(y^T - \beta^T X^T\right)(y - X\beta) = y^T y - y^T X\beta - \beta^T X^T y + \beta^T X^T X \beta.$$ The above expression is a function of the parameters $\beta$, and hence, to minimize the SSE, we need to differentiate it with respect to $\beta$. This is where matrix calculus comes into play: $$\frac{\partial}{\partial \beta} \sum_{i=1}^{n} e_i^2 = \frac{\partial}{\partial \beta}\left(y^T y - y^T X\beta - \beta^T X^T y + \beta^T X^T X \beta\right) = 0 - X^T y - X^T y + 2 X^T X \beta \quad \text{(because } X^T X \text{ is square and symmetric)} = -2 X^T y + 2 X^T X \beta.$$ For a minimum, $\frac{\partial}{\partial \beta} \sum_{i=1}^{n} e_i^2 = 0$ for all $\beta_j$, $j = \{0, 1, 2, 3, \ldots, m\}$. Thus we get $-2 X^T y + 2 X^T X \beta = 0$, and after rearrangement, $$X^T X \beta = X^T y \quad \Rightarrow \quad \beta = (X^T X)^{-1} X^T y.$$ The term $(X^T X)^{-1} X^T$ is called the pseudo-inverse. The result obtained here is the same as that of the Ordinary Least Squares (OLS) method; a small numerical sketch is given after the list below. Another important assumption is hidden in this expression: if the columns exhibit multicollinearity, $\det(X^T X)$ would be very small and the inverse would have very large (or even infinite) values. That would make the model unreliable and render it rather useless for practical purposes. Hence, one more critical assumption comes into play: the predictor variables are uncorrelated with each other. Thus, linear regression comes with 5 very critical assumptions:
  1. The model comprises a linear combination of the predictor variables
  2. The error terms of the model are normally distributed
  3. The error terms have constant variance across the levels of the predictor variables (error variance is homoskedastic in nature)
  4. The error terms are not autocorrelated with each other
  5. The predictor variables are uncorrelated with each other (no multicollinearity)
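Here is the numerical sketch referred to above (an illustrative example with synthetic data and names of my own choosing): it computes $\beta = (X^T X)^{-1} X^T y$ directly and cross-checks the result against NumPy's least-squares routine.

```python
# Closed-form MLE/OLS solution beta = (X^T X)^{-1} X^T y on synthetic data.
import numpy as np

rng = np.random.default_rng(1)
n, m = 200, 3
X = rng.normal(size=(n, m))
true_coefs = np.array([1.5, -2.0, 0.7])            # beta_1, ..., beta_m
true_intercept = 3.0                               # beta_0
y = true_intercept + X @ true_coefs + rng.normal(0, 0.5, size=n)

X_design = np.hstack([np.ones((n, 1)), X])         # prepend a column of ones for beta_0

# Solve (X^T X) beta = X^T y; solve() is numerically safer than forming the inverse.
beta_hat = np.linalg.solve(X_design.T @ X_design, X_design.T @ y)

# Cross-check with the built-in least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print(beta_hat)      # [beta_0, beta_1, beta_2, beta_3]
print(beta_lstsq)    # should agree with beta_hat
```

When $X^T X$ is nearly singular because of multicollinearity, this direct solve becomes numerically unstable, which is exactly why the fifth assumption above matters in practice.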

Once the MLE of the parameters is available, the maximized likelihood value can also be computed, and from it two more metrics can be derived to compare model performance: the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC), $$AIC = 2k - 2\ln(\text{Likelihood}), \qquad BIC = k\ln(n) - 2\ln(\text{Likelihood}),$$ where $k$ is the number of estimated parameters and $n$ is the number of data points. These two criteria play important roles in selecting the best statistical model among candidates with different numbers of parameters; BIC penalizes model complexity more stringently than AIC. A classic use of AIC or BIC is in stepwise regression, where the important variables are kept in the model and the least important variables are discarded.
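As a rough sketch of how these criteria can be computed for a Gaussian linear model (an illustrative helper of my own; the log-likelihood plugs in the MLE of the error variance, $\hat{\sigma}^2 = SSE/n$):

```python
# Compute AIC and BIC for a Gaussian linear model from its fitted values.
import numpy as np

def aic_bic(y, y_hat, k):
    """k = number of estimated parameters (coefficients + intercept + error variance)."""
    n = len(y)
    sse = np.sum((y - y_hat) ** 2)
    sigma2 = sse / n                                        # MLE of the error variance
    log_lik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)   # maximized log-likelihood
    aic = 2 * k - 2 * log_lik
    bic = k * np.log(n) - 2 * log_lik
    return aic, bic

# Example use with the beta_hat from the previous sketch:
# aic, bic = aic_bic(y, X_design @ beta_hat, k=m + 2)
```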

In this post, I have tried to help readers understand the process of MLE and how it is used in parameter estimation. I hope you find it useful for clarifying the concepts.

