Monday, 27 May 2024

Maximum Likelihood Estimation (MLE): An important statistical tool for parameter estimation

Parameter estimation is critical for learning patterns within the data. Before the advancements in computation power, researchers had to do the calculations manually, so iterative methods were discouraged because of the manual labour involved and the resulting possibility of calculation errors. During that time, methods requiring fewer iterations were preferred. Maximum Likelihood Estimation (MLE) was one of the most preferred methods in this area, and the parameters of several algorithms were estimated using MLE. In fact, this method of parameter estimation is still quite popular. In this blog post, I shall discuss the math behind this wonderful method. 

Before using MLE in the estimation process, it is important to make an assumption about the distribution of either the data points or, more importantly, the error terms. For example, to use MLE in estimating the parameters of a regression model, it is assumed that the error terms are normally distributed, and in logistic regression it is assumed that the target variable is binomially distributed with some probability of success. Without such assumptions, it is impossible to use MLE. To use MLE, it is important to arrive at the expression of the likelihood, which is nothing but the joint probability of observing the data. Mathematically, $$L=P(X_1, X_2, X_3,...,X_n)$$ where $X_i \sim P(X_i|\theta)\ \forall i \in \{1,2,3,...,n\}$. Here, $\theta$ is the parameter set of the probability function $P(\cdot)$. For simplicity, in many algorithms it is assumed that the $X_i$s are all independently and identically distributed. In that case, the expression of the likelihood reduces to a rather simpler form because of the concept of independent events, i.e., $$L=P(X_1, X_2, X_3,..., X_n)=\prod_{i=1}^nP(X_i|\theta)$$ Here $\prod_{i=1}^n(.)$ denotes the product of the individual terms.

Once the expression of the likelihood is correctly identified, the likelihood becomes a function of the unknown parameters $\theta$, because all the $X_i$s are already observed and are therefore constants within the likelihood function. Hence, even though the likelihood refers to the probability of the events occurring together, the likelihood function is not actually a probability function. Since the likelihood is a function of the parameters, maximum likelihood estimation focuses on estimating the parameters such that the likelihood function attains its maximum value. In other words, the parameter set estimated through MLE maximizes the joint probability of occurrence of the observed instances. Mathematically, $$\hat \theta^{MLE} = \underset{\theta}{\operatorname {argmax}}\prod_{i=1}^nP(X_i|\theta) $$

So, if there are $k$ parameters to be estimated, by applying the rules of calculus the partial derivatives of the likelihood function $L$ are equated to zero for all $\theta_j,\ \forall j \in \{1,2,3,...,k\}$. That is, $$\frac {\partial L}{\partial \theta_j}=0, \forall j \in \{1,2,3,...,k\}$$ This is consistent with the usual way of finding the maximum (or minimum) of a function with respect to its parameters. Thus, for $k$ parameters there are $k$ equations, which can be solved for the parameter estimates. Sometimes, the resulting equations do not have a closed-form solution (i.e., a parameter appears as a function of itself, as in $x = \sin(x+y)$). In such situations, numerical methods have to be adopted to estimate the parameters (logistic regression is a typical example). Let us now use MLE in some cases.
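Before moving to the cases, the general recipe can be illustrated numerically. The sketch below is a minimal example written for this post: it draws a normally distributed sample (the data, seed, and starting values are arbitrary assumptions) and estimates the mean and standard deviation by minimizing the negative log-likelihood with scipy. In practice the log likelihood is used for numerical stability, a point we return to in Case 2.

import numpy as np
from scipy.optimize import minimize

np.random.seed(42)
sample = np.random.normal(loc=5.0, scale=2.0, size=500)   # synthetic data (assumed)

def neg_log_likelihood(params, x):
    mu, log_sigma = params            # optimize log(sigma) to keep sigma positive
    sigma = np.exp(log_sigma)
    # negative of the sum of log N(x | mu, sigma^2)
    return 0.5 * np.sum(np.log(2 * np.pi * sigma**2) + (x - mu)**2 / sigma**2)

result = minimize(neg_log_likelihood, x0=[0.0, 0.0], args=(sample,))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)              # should be close to 5 and 2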

Case 1

A coin is tossed 30 times and it is observed that heads came up 21 times. What is the probability of getting a head when it is tossed the next time?

Solution:

The solution is rather easy: $p=\frac {21}{30}=0.70$. But this expression, $p=\frac {\#\ of\ heads}{\#\ of\ trials}$, is itself a typical MLE of the probability of success. Let us try to derive it. Let $x_1, x_2, x_3,...,x_{30}$ be the 30 trials. We assume that the trials are independent of each other and identically distributed (which is quite fair in this situation!). Since there are only two outcomes, the entire process can be assumed to follow a binomial distribution. The probability of observing 21 heads out of 30 trials can be expressed as $$P(heads=21|p, 30)={30\choose 21}p^{21}(1-p)^{(30-21)}$$ where $p$ is the required probability of heads to be estimated. In the case of a discrete distribution like the binomial distribution, this expression is nothing but a probability mass function ($pmf$). It accounts for all the possible ways 21 heads can come up within 30 trials. The combination operator ${30 \choose 21}$ does not involve $p$; it is a constant that does not affect where the maximum occurs and can therefore be dropped, leaving only $p^{21}(1-p)^{(30-21)}$. Hence, we can take the likelihood function to be $$L=p^{21}(1-p)^{9}$$ Once we obtain the likelihood function, we can proceed with MLE as shown below: 

$$ \begin{align} L &= p^{21}(1-p)^{9}\\ \frac {\partial L}{\partial p}&= 21p^{20}(1-p)^9 - 9p^{21}(1-p)^{8}=0\\ &\implies 21p^{20}(1-p)^9 = 9p^{21}(1-p)^8 \\ &\implies 21(1-p)=9p\\ &\implies p = \frac{21}{30}\end{align}$$ So, the result obtained earlier and the one obtained with MLE are the same. 
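The same estimate can be cross-checked numerically. The short sketch below (a quick verification written for this post, not part of the derivation) evaluates the log-likelihood $21\ln p + 9\ln(1-p)$ over a grid of candidate values of $p$ and picks the maximizer.

import numpy as np

p_grid = np.linspace(0.001, 0.999, 999)
log_likelihood = 21 * np.log(p_grid) + 9 * np.log(1 - p_grid)
p_hat = p_grid[np.argmax(log_likelihood)]
print(p_hat)   # approximately 0.70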

Case 2

Estimate the parameters of a linear regression model using MLE

Solution:

Let us assume that there is a dataset $D$ with $n$ data points in an $m$-dimensional space. Hence, $D_i=(X_i,y_i),\ \forall i = \{1,2,3,...,n\}$ and $X_i \in R^m$. This is a typical case of multiple regression where there is more than one predictor variable and one target variable. The linear regression model can be written as $$\hat y_i=\beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \beta_3x_{i3} + ... + \beta_mx_{im}$$ According to the above equation, $\hat y_i$ is a linear combination of the predictor variables (the linearity assumption). It is to be noted that $X_i = \{x_{i1}, x_{i2}, x_{i3}, ..., x_{im}\}$ where $x_{ij}$ is the $i^{th}$ value of the $j^{th}$ column, and $\beta_0$ is the intercept. Thus, there are $m+1$ parameters to be estimated for this model. To apply MLE, we need to make an assumption about the distribution of some quantity. Here comes another important assumption of linear regression: the errors are normally distributed. If $e_i$ is the error in prediction, $e_i = y_i - \hat y_i$ and, as per the second assumption, $e_i \sim N(0, \sigma^2)$. Since the error terms are normally distributed, another important assumption comes in implicitly: the error variance is homoskedastic, meaning that the error terms have uniform variance across the spectrum of the predictor variables. Simply speaking, if the error terms are plotted against the target variable or any predictor variable, the scatter plot should look like white noise without any pattern. The image below is helpful in understanding the meaning of homoskedasticity of the error variance. 

Source: https://www.originlab.com/doc/Origin-Help/Residual-Plot-Analysis
With these three assumptions, it is possible to estimate the parameters of the linear regression model. For this, we note that if $x \sim N(\mu, \sigma^2)$, $$P(x|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi \sigma^2}}exp \left( - \frac {(x-\mu)^2}{2\sigma^2}\right )$$ where $\mu$ is the mean and $\sigma^2$ is the variance of $x$. Hence, if $e_i$ follows a normal distribution with mean 0 and variance $\sigma^2$, $$\begin{align}P(e_i|0, \sigma^2) &= \frac{1}{\sqrt{2\pi \sigma^2}}exp \left( - \frac {(e_i-0)^2}{2\sigma^2}\right ) \\ &= \frac{1}{\sqrt{2\pi \sigma^2}}exp \left( - \frac {e_i^2}{2\sigma^2}\right )\end{align}$$ Since there are $n$ data points within the dataset $D$, the likelihood function of the errors is given by $$L = \prod_{i=1}^nP(e_i|0,\sigma^2)$$ The above expression is valid only if the individual $e_i$ values are uncorrelated with each other. This is the fourth important assumption in linear regression, i.e., the error terms are uncorrelated with each other (absence of autocorrelation of the error terms). Since $\log$ is a monotonically increasing function, if $\log(f(x|\theta))$ attains its maximum at some $\theta^*$, the function $f(x|\theta)$ also attains its maximum at the same $\theta^*$. Since the likelihood is a product of probabilities, we take the $\log$ of the likelihood to convert the product of probabilities into a sum of log probabilities; compared to multiplication, addition is much easier to handle with calculus. Hence, we apply MLE to the log likelihood rather than the likelihood function. If $l$ denotes the log likelihood, 
$$\begin{align} l &= \sum_{i=1}^n ln(P(e_i|0,\sigma^2)) \\ &= \sum_{i=1}^n ln \left [\frac{1}{\sqrt{2\pi \sigma^2}}exp \left( - \frac {e_i^2}{2\sigma^2}\right ) \right ] \\ &= \sum_{i=1}^n ln \left [\frac{1}{\sqrt{2\pi \sigma^2}}\right ] + \sum_{i=1}^n ln \left [exp \left( - \frac {e_i^2}{2\sigma^2}\right ) \right ] \\ &= Const + \sum_{i=1}^n \left [-\frac{e_i^2}{2\sigma^2} \right ] \\ &= Const - \frac{1}{2\sigma^2} \sum_{i=1}^n e_i^2 \end{align}$$ As per the above expression, the only way to maximize $l$ is to minimize $ \sum_{i=1}^n e_i^2$ because the error variance $\sigma^2$ is constant. Now, $ \sum_{i=1}^n e_i^2$ is nothing but the sum of squared error terms. This means that the log likelihood is maximized if the sum of squared errors (popularly denoted as SSE) is minimized. Now, $$\begin{align}\sum_{i=1}^n e_i^2&=\sum_{i=1}^n (y_i-\hat y_i)^2 \\ &= \sum_{i=1}^n (y_i-X_i\beta )^2 \\ &= (y-X\beta)^T(y-X\beta) \ \ \ \ \ \ \ in\ matrix\ format,\ where\ \beta=\{\beta_0, \beta_1, \beta_2,...\beta_m\} \\ &=(y^T-(X\beta)^T)(y-X\beta) \\ &= (y^T - \beta^TX^T)(y-X\beta) \\ &= y^Ty - y^TX\beta -\beta^TX^Ty + \beta^TX^TX\beta \end{align}$$ The above expression is a function of the parameters $\beta$ and hence, to minimize the SSE, we need to differentiate it with respect to $\beta$. This is where matrix calculus comes into play, and we write $$\begin{align}\frac{\partial}{\partial \beta}\sum_{i=1}^n e_i^2 &=  \frac{\partial}{\partial \beta}(y^Ty - y^TX\beta -\beta^TX^Ty + \beta^TX^TX\beta) \\ &= 0 - X^Ty - X^Ty + 2X^TX\beta\ \ \ \ because\ X^TX\ is\ square\ symmetric \\ &= -2X^Ty + 2X^TX\beta \end{align}$$ For a minimum, $\frac{\partial}{\partial \beta}\sum_{i=1}^n e_i^2=0$ for all $\beta_j$ values, where $j=\{0,1,2,3,...,m\}$. Thus we get $$-2X^Ty + 2X^TX\beta=0$$ After rearrangement, $$X^TX\beta = X^Ty$$ $$\implies \beta = (X^TX)^{-1}X^Ty$$ The term $(X^TX)^{-1}X^T$ is called the pseudo-inverse of $X$. This result is the same as that obtained by the Ordinary Least Squares (OLS) method. (A small numpy check of this closed-form solution is given just after the list of assumptions below.) Another important assumption is hidden in this expression: if the columns exhibit multicollinearity, $det(X^TX)$ would be very small and the inverse would have very large (or even infinite) entries. That would make the model unreliable and render it rather useless for practical purposes. Hence, one more critical assumption comes into play: the predictor variables are uncorrelated with each other. Thus, linear regression comes with 5 very critical assumptions:
  1. The model is a linear combination of the predictor variables
  2. The error terms of the model are normally distributed
  3. The error terms have constant variance across the levels of the predictor variables (error variance is homoskedastic in nature)
  4. The error terms are not autocorrelated
  5. The predictor variables are uncorrelated with each other (no multicollinearity)
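As mentioned above, the closed-form estimate $\beta = (X^TX)^{-1}X^Ty$ is easy to verify numerically. The sketch below is a minimal illustration on synthetic data (the coefficients and noise level are arbitrary assumptions): it builds a design matrix with an intercept column, solves the normal equations, and compares the result with numpy's least-squares solver.

import numpy as np

np.random.seed(0)
n, m = 200, 3
X_raw = np.random.normal(size=(n, m))
true_beta = np.array([2.0, 1.5, -3.0, 0.5])          # intercept + 3 slopes (assumed)
X = np.c_[np.ones(n), X_raw]                          # add the intercept column
y = X @ true_beta + np.random.normal(scale=1.0, size=n)

beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y           # normal equations
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)    # library solver for comparison
print(beta_hat)
print(beta_lstsq)                                     # the two estimates should match closely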

With the MLE of the parameters, the likelihood value can also be computed, and from it two more metrics can be derived to compare model performance: the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). If $k$ is the number of estimated parameters, $n$ the number of observations and $L$ the maximized likelihood, $$AIC= 2k -2\ln(L)$$ $$BIC=k \ln(n) - 2\ln(L)$$ These two criteria play important roles in selecting the best statistical model among models with different numbers of parameters to estimate; BIC penalizes model complexity more heavily than AIC. A classic use of AIC or BIC is in stepwise regression, where the important variables are kept within the model and the least important variables are discarded. 
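Continuing the regression sketch above (reusing X, y, n and beta_hat from it), the maximized Gaussian log-likelihood can be computed from the residuals, after which AIC and BIC follow directly from their definitions. The count k below (coefficients plus the error variance) is one common convention; packages differ slightly, so treat this as an illustrative calculation rather than the formula of any particular library.

resid = y - X @ beta_hat
sigma2_hat = np.mean(resid**2)                                # MLE of the error variance
log_lik = -0.5 * n * (np.log(2 * np.pi * sigma2_hat) + 1)     # maximized Gaussian log-likelihood
k = X.shape[1] + 1                                            # regression coefficients + error variance
AIC = 2 * k - 2 * log_lik
BIC = k * np.log(n) - 2 * log_lik
print(AIC, BIC)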

In this post, I have tried to explain the process of MLE and how it is used in parameter estimation. I hope it helps in clarifying the concepts.

Friday, 17 May 2024

Understanding the math behind Principal Component Analysis (PCA)

 Let us consider a dataset D in an \(m\)-dimensional feature space having \(n\) samples. Mathematically, \(D=\{x_i,y_i\}\) where \(x_i \in R^m\). PCA is unsupervised in nature and hence it does not involve \(y_i\). It is quite possible that the features have some level of linear correlation and hence, from a purely mathematical standpoint, the features are not orthogonal to each other. Some of the features may have a very high level of multicollinearity, whereas others may remain reasonably uncorrelated with each other. In PCA, we consider linear correlation only (i.e., the Pearson correlation coefficient). PCA can be considered as the process of extracting new features which are aligned along the direction of the stretch (or spread) of the data points. Let us try to understand this with code and diagrams. For better understanding, let us focus on a 2D space. The code below creates the data points and the plot shows the scatter.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

np.random.seed(20)
x = np.random.normal(10, 2, size=10000)           # predictor-like variable X
y = 4*x + np.random.normal(2, 4, size=10000)      # Y is linearly related to X plus noise

z = np.c_[x, y]                                   # combined 2D dataset used later in this post

sns.scatterplot(x=x,y=y)
plt.xlabel("X")
plt.ylabel("Y")
plt.show()
It is quite evident that the relationship between X and Y is linear. X and Y are orthogonal axes in this diagram and, from a linear regression point of view, this scatter is desirable when Y is the target variable and X is the predictor variable. But if both X and Y are predictor variables, then this scatter is going to create trouble because of multicollinearity of the features. In pattern recognition, it is quite common to transfer the data from an existing (or input) feature space to a new (or mapped) feature space; this is popularly known as feature engineering. So here comes the question: "What is a meaningful way of shifting the data from the current input space to a new feature space?" For instance, have a look at the following new feature spaces.
Out of the four different feature spaces, which one is the most meaningful? The answer is probably the first one, because its features try to align themselves along the direction of the spread of the data. But can it be optimized even more? The answer is yes: we apply PCA.

When we want to transfer data points from one feature space to another with a linear mapping (i.e., by using a linear function), we need a transformation matrix. Let us assume that U is the required transformation matrix such that \[ U=\left [\matrix {|&|&|&&&| \\ u_1 & u_2 & u_3 & . & . & u_m \\ |&|&|&&&| } \right ] \] where \(u_i\) is a column vector of unit length representing the \(i^{th}\) axis of the new feature space. If a dot product is taken between \(X_i=[x_{i1}, x_{i2}, x_{i3},...,x_{im}]\) and \(U\), the resultant outcome is the set of coordinates of the \(i^{th}\) data point in the new feature space. The data point remains in the same location; only its coordinates change due to the feature space mapping. Since the pattern within the data remains the same under such a linear feature space mapping, it also remains unchanged if the origin is shifted to any arbitrary location. Hence, it is convenient to shift the origin to the centroid of the dataset. In the above image too, the origin of the new feature space is located at the centroid of the dataset. Let \(X_c\) be the centered dataset such that the origin is located at the centroid of the dataset.
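As a small illustration of this mapping (using the 2D array z generated earlier in this post; the two unit vectors are an arbitrary choice made only for the demonstration), the centering and projection can be written as:

# Center the data so that the new origin sits at the centroid
z_centered = z - z.mean(axis=0)

# Two orthogonal unit vectors defining a (hypothetical) new feature space
u1 = np.array([1.0, 4.0]) / np.linalg.norm([1.0, 4.0])
u2 = np.array([-4.0, 1.0]) / np.linalg.norm([-4.0, 1.0])
U = np.c_[u1, u2]

# Coordinates of the points in the new feature space
z_new = z_centered @ U
print(z_new[:5])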

Now we can focus on the objective of PCA. Our objective is to align the new features along the stretch (or spread) of the dataset. That is, along the \(j^{th}\) new dimension, the variance explained should be maximum. Let \([a_{1j}, a_{2j}, a_{3j},...,a_{nj}] = X_cU_j\) be the projections of the data points onto the \(j^{th}\) dimension of the new feature space, and let \(\sigma_j^2\) be the variance along that dimension. It is to be noted that, as the origin was shifted to the centroid of the dataset, \[\mu_j = \frac{1}{n}\sum_{i=1}^n a_{ij} =0 \]

Now, we can calculate the variance of \(a_{ij},\ \forall i \in \{1,2,3,...,n\}\) as: 

$$ \begin{align} \sigma_j^2 & = \frac{1}{n}\sum_{i=1}^n (a_{ij}-\mu_j)^2 \\ & = \frac{1}{n}\sum_{i=1}^n (a_{ij})^2 \\ & = \frac{1}{n} (X_cU_j)^T(X_cU_j) \\ & = \frac{1}{n}U_j^TX_c^TX_cU_j \\ & = U_j^T\left (\frac{X_c^TX_c}{n}\right )U_j \\ & = U_j^T \Sigma U_j \end{align} $$

Note that when the original data is centered, $\frac{X_c^TX_c}{n}$ is the covariance matrix, and the covariance matrix is a square symmetric matrix. The objective is to maximize $\sigma_j^2$ subject to $U_j^TU_j=1$, because $U_j$ is a unit vector. This is a constrained optimization problem which can be solved using the method of Lagrange multipliers: $$L=U_j^T\Sigma U_j - \alpha_j(U_j^TU_j-1)$$ where $\alpha_j$ is the Lagrange multiplier. The objective is to find the $U_j$ that maximizes $\sigma_j^2$ and hence we set $\frac{\partial L}{\partial U_j}=0$. This leads to $$2\Sigma U_j - 2\alpha_j U_j = 0$$ From here, it is clear that $$\Sigma U_j = \alpha_j U_j$$ The above equation is the well-known eigen decomposition of a square matrix. This means that the principal component $U_j$ is essentially an eigenvector of the covariance matrix. Since the covariance matrix is a square symmetric matrix of size $(m \times m)$, it has $m$ real eigenvalues and a corresponding set of $m$ mutually orthogonal eigenvectors. Putting the value of $\Sigma U_j$ in the expression of $\sigma_j^2$, we see that $$\sigma_j^2=U_j^T\Sigma U_j = U_j^T\alpha_j U_j=\alpha_j U_j^TU_j = \alpha_j$$ Thus, the eigenvalue represents the amount of variance explained by the respective principal component along the $j^{th}$ dimension. An added benefit of this analysis is that the principal components are perfectly orthogonal to each other, because a square symmetric matrix always admits a set of mutually orthogonal eigenvectors. 
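The chain of reasoning above can be checked numerically: the variances of the projections onto the eigenvectors of the covariance matrix should equal the corresponding eigenvalues. A short sketch, again reusing the array z created earlier in this post:

z_centered = z - z.mean(axis=0)
cov_mat = (z_centered.T @ z_centered) / z.shape[0]   # same as np.cov(z, rowvar=False, bias=True)
eig_val, eig_vec = np.linalg.eigh(cov_mat)           # eigh: eigen decomposition for symmetric matrices

projections = z_centered @ eig_vec                   # coordinates along the principal axes
print(projections.var(axis=0))                       # should match eig_val
print(eig_val)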

Covariance is not bounded, and a variable with higher variance tends to dominate the other variables (or features). Hence, instead of the covariance matrix, the correlation matrix is often preferred, which is essentially the covariance matrix of the standardized data. 

The final principal components for the sample data are shown in the figure along with the codes.

# find correlation coefficient of z
cor_mat = np.corrcoef(z, rowvar=False)
eig_val, eig_vec = np.linalg.eig(cor_mat)

x_mean = x.mean()
y_mean = y.mean()

plt.figure(figsize=(6,6))
sns.scatterplot(x=x,y=y)
plt.quiver([x_mean, x_mean], [y_mean, y_mean],  eig_vec[0,:], eig_vec[1,:],
           [3,10], scale=5, cmap='flag')
plt.xlabel("X")
plt.ylabel("Y")
plt.show()

 

PCA as a method of dimension reduction

 

After the eigen decomposition of the correlation matrix, we also get another interesting outcome: $$\sum_{j=1}^m\sigma_j^2 = \sum_{j=1}^m\alpha_j = trace(Corr\ Matrix)=m$$ After PCA, the eigenvalues are sorted in descending order and the eigenvectors are rearranged according to the sorted eigenvalues. Then we choose the minimum number $r$ of principal components such that $$\frac{1}{m}\sum_{j=1}^r\alpha_j \ge \epsilon$$ so that the variance retained is at least a fraction $\epsilon$ of the total variance. Since $r < m$, the $m$-dimensional dataset is represented with $r$ mutually orthogonal dimensions, with some loss of information. Apart from this method, there is the Kaiser criterion, which can be used to extract a smaller number of dimensions: only those eigenvectors are retained whose corresponding eigenvalues are greater than 1. The third method is simply the analyst's judgment, where the analyst chooses the number of principal components. 
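The sketch below illustrates both data-driven rules on the correlation matrix of the sample data from this post; the 95% variance threshold is an arbitrary choice made only for the demonstration.

cor_mat = np.corrcoef(z, rowvar=False)
eig_val, eig_vec = np.linalg.eigh(cor_mat)

# Sort eigenvalues (and eigenvectors) in descending order
order = np.argsort(eig_val)[::-1]
eig_val, eig_vec = eig_val[order], eig_vec[:, order]

explained = np.cumsum(eig_val) / eig_val.sum()       # cumulative fraction of variance retained
r = int(np.argmax(explained >= 0.95)) + 1            # smallest r retaining at least 95% of the variance
kaiser_r = int(np.sum(eig_val > 1))                  # Kaiser criterion: eigenvalues greater than 1
print(r, kaiser_r, explained)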

Implicit Assumptions


Since PCA deals with the covariance matrix, it is associated with the implicit assumptions:
  1. The variables should be normally distributed (so that mean and variance make sense)
  2. Principal components are the linear combination of the input variables
I hope that the readers will find this post helpful in understanding the concept.

Comments are welcome!






Thursday, 16 May 2024

Important Probability Distributions (Part 2)


In the previous post we saw the binomial and Poisson distributions, which are discrete in nature. Similarly, there are continuous probability distributions, and one of the most important is the normal distribution. The normal distribution for one-dimensional data has two parameters, the mean \(\mu\) and the standard deviation \(\sigma\). The probability density function (pdf) is defined as: \[f(x|\mu,\sigma)=\frac{1}{\sqrt{2\pi\sigma^2}} e^{[-\frac{(x-\mu)^2}{2\sigma^2}]}\] For the standard normal distribution, \(\mu\) = 0 and \(\sigma\) = 1, which leads to: \[ f(x|0,1) = \frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}} \] However, for the multivariate scenario, the expression becomes more complex. For the multivariate normal distribution, we need the covariance matrix \(\Sigma\) and the mean vector \(\mu\). The pdf changes to: \[ f(x|\mu, \Sigma) = \frac{1}{{(2\pi)^{d/2}}|\Sigma|^{\frac{1}{2}}}exp[-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)]\] where \(d\) is the dimension of the data. The normal distribution has applications in hypothesis testing in statistical inference. More importantly, when the population standard deviation is known, this distribution can be used effectively to determine whether the means of two samples are statistically significantly different. However, in most situations the population standard deviation is not known. Hence, we need another distribution that behaves like the normal distribution when the sample size is large and yet can be used when the sample size is small (say 10) and, more importantly, when the population standard deviation is unknown. The Student's t distribution has all the characteristics required to deal with these three situations, and hence it is preferred for statistical hypothesis tests. The density of the Student's t distribution is given by: \[ \frac{\Gamma \left(\frac{\nu +1}{2}\right)}{\sqrt{\pi \nu}\Gamma(\frac{\nu}{2})}\left(1+\frac{x^2}{\nu}\right)^{-(\frac{\nu+1}{2})}\] where \(\nu\) is the degrees of freedom and \(\Gamma[.]\) is the Gamma function. The above function is applicable for one-dimensional data. For the multivariate situation, the expression changes to: \[ \frac{\Gamma \left(\frac{\nu +p}{2}\right)}{(\pi \nu)^{\frac{p}{2}}\Gamma(\frac{\nu}{2})}\left(1+\frac{(x-\mu)^T\Sigma^{-1}(x-\mu)}{\nu}\right)^{-(\frac{\nu+p}{2})}\] where \(p\) is the dimension of the data and \(\nu\) is the degrees of freedom. If the multivariate forms of both distributions are observed carefully, they both contain the term \((x-\mu)^T\Sigma^{-1}(x-\mu)\), which is essentially the squared Mahalanobis distance. The characteristics of this distance are that it is unitless and scale-invariant. Hence, if the data under consideration follow a multivariate normal distribution, the Mahalanobis distance can be used to detect the presence of multivariate outliers. 
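Since the squared Mahalanobis distance $(x-\mu)^T\Sigma^{-1}(x-\mu)$ appears in both densities, it is worth seeing how it is computed. The sketch below is a minimal, self-contained example (the mean and covariance of the simulated data are arbitrary assumptions): it computes the distance directly with numpy and cross-checks the value against scipy.

import numpy as np
from scipy.spatial.distance import mahalanobis

np.random.seed(0)
data = np.random.multivariate_normal(mean=[0, 0], cov=[[2, 0.5], [0.5, 1]], size=500)

mu = data.mean(axis=0)
cov = np.cov(data, rowvar=False)
cov_inv = np.linalg.inv(cov)

x = data[0]
d2 = (x - mu) @ cov_inv @ (x - mu)                 # squared Mahalanobis distance
print(np.sqrt(d2), mahalanobis(x, mu, cov_inv))    # the two values should agree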

To understand the use, a synthetic dataset is created with only two variables so that the spread can be shown diagrammatically. The code for generating and viewing the data is shown below.

import numpy as np
import seaborn as sns
import pandas as pd

n_inliers = 100
n_outliers = 20
n_features = 2

# Generate covariance matrix for inlier data and outlier data
inliers_cov_mat = np.array([[2,0],[0,1]])
outliers_cov_mat = np.array([[5,0],[0,2]])

np.random.seed(1) # Fix the seed for consistent output

# generate inlier and outliers data points
inlier_data = np.random.normal(size=(n_inliers, n_features)) @ inliers_cov_mat
outlier_data = np.random.normal(size=(n_outliers, n_features)) @ outliers_cov_mat

# Join the data points
final_data = pd.DataFrame(np.vstack([inlier_data, outlier_data]), 
                          columns=['x','y'])
final_data['labels'] = np.repeat(['inliers','outliers'],[100,20]) # Labels only for identification

# Generate the scatter plot for visualization
sns.scatterplot(data=final_data, x='x', y='y', hue='labels')

The image below shows the scatter plot with normal (or inlier) data points and outliers. Note that not all the marked outliers are actually outliers; some of them lie well within the region of the inliers. Still, several genuine outliers exist within the dataset. The objective here is to identify them using a statistical outlier detection method, i.e., Mahalanobis distance measures. This method of outlier detection is multivariate in nature and hence more appropriate while analyzing multivariate data.

The covariance matrix plays an important role in calculating the Mahalanobis distance, and the covariance estimate itself is influenced by the presence of outliers. That is why multiple covariance matrices are estimated from samples of the data and the covariance matrix with the lowest determinant is chosen: the smaller the determinant, the more closely packed the data points, i.e., the more plausible it is that they are inliers. The Python package scikit-learn has the MinCovDet (MCD) class for this purpose. It tries to find \(\frac{n_{samples}+n_{features}+1}{2}\) samples which yield the lowest determinant, leading to a pure subset of samples from the existing dataset. We apply this method on the above-mentioned dataset to get a robust estimate of the covariance matrix that can be used subsequently to detect outliers. The code snippet is given below:
 
from sklearn.covariance import MinCovDet

mcd = MinCovDet(random_state=123)
mcd.fit(X=final_data.loc[:,['x','y']])
print(f"The robust covariance matrix is: \n\n{mcd.covariance_}")

# OUTPUT
# ======
# The robust covariance matrix is: 

# [[2.55655394 0.05976092]
#  [0.05976092 0.9295315 ]]

It can be seen that the covariance matrix is similar to the covariance matrix created for the inliers. The 'mcd' object also contains the Mahalanobis distances of the data points in the input dataset. These distances are calculated with respect to the location (mean vector) estimated from the data points used in calculating the robust covariance matrix. The distances are one-dimensional in nature, and hence the outliers can be extracted using boxplot or IQR-based methods. The following code snippet shows the final outlier detection process.

def extract_univariate_outlier_IQR(array_1D):
    # Flag values lying beyond 1.5 * IQR from the quartiles (1 = outlier, 0 = inlier)
    percentile_values = np.percentile(array_1D, q=[25, 75])
    iqr = percentile_values[1] - percentile_values[0]
    lower_cutoff = percentile_values[0] - 1.5*iqr
    upper_cutoff = percentile_values[1] + 1.5*iqr
    out = np.where((array_1D > upper_cutoff) | (array_1D < lower_cutoff), 1, 0)
    return out
    

outliers = extract_univariate_outlier_IQR(mcd.dist_)

# Generate the scatter plot for visualization
sns.scatterplot(data=final_data, x='x', y='y', hue=outliers)



The above plot clearly shows the outliers. It is to be noted that this process is applicable to unimodal distributions. For multimodal distributions, this process would be erroneous. In such scenarios, other methods such as Local Outlier Factor, Isolation Forest, or even one-class SVM novelty detection can be used.

In this post, I have tried to connect the use of the normal distribution to one important analysis in machine learning, i.e., outlier detection. I hope this is helpful.  

Friday, 22 July 2022

Important Probability Distributions

Statistics rests heavily on the assumptions of probability distributions. We have all heard about linear regression and have probably built some linear regression model or other, rather mechanically. But it is to be understood that this statistical model also relies on underlying assumptions about the distribution of the errors. Technically speaking, there are two types of distributions, i.e., discrete and continuous. Outcomes of coin tosses, the arrival of customers, and winning a bet after losing 'n' times in succession are typical examples of discrete outcomes, and hence they can be modeled with discrete probability distributions. Other events, such as the time before the failure of a bulb, the inter-arrival time between two customers, or choosing one set of distributions over another, are examples of events that can be modeled using continuous distributions. The figure below shows a few commonly used discrete and continuous distributions.



In this blog, we would restrict ourselves to only a few such distributions which are encountered more frequently in data analytics.

Binomial Distribution

The binomial distribution deals with only 2 outcomes, i.e., success and failure. It is up to us to decide when an event is considered a success. For example, when tossing a coin, we may consider "Head" as a success. But this choice is rather arbitrary; we may also choose "Tail" as a success. It is, however, customary to consider "Head" as success. Each distribution is defined by means of some parameters. For the binomial distribution, the parameters are the number of trials (denoted by N) and the probability of success (denoted by p). So, if there are x successes out of N trials, the probability of getting exactly x successes is given by: \[\begin{equation} P(x|N,p)={N \choose x} p^x(1-p)^{N-x} \end{equation}\]Here ${N \choose x}$ denotes the combination. This combination term is important because, if there are N trials, x successes can happen in ${N \choose x}$ ways, and each such combination has the associated probability $p^x(1-p)^{N-x}$. The probability obtained in the case of a discrete event is called the probability mass function (pmf). The pmf is defined for a particular point (exactly x successes in this case) of a discrete probability distribution. Two important properties of a pmf are \[\begin{align}\sum_{x_i} P(x_i) = 1 \\  P(x_i) \ge 0 \end{align}\]The binomial distribution is used in the parameter estimation of the logistic regression model, where the objective is to estimate the probability of success given some extra information. Logistic regression will be discussed separately in another blog. 
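For quick checks, the pmf is readily available in scipy. The short snippet below (with N, x and p chosen arbitrarily for illustration) evaluates it both through scipy and directly from the expression above.

from math import comb
from scipy.stats import binom

# P(x = 3 | N = 10, p = 0.6), evaluated with scipy and from the formula
N, x, p = 10, 3, 0.6
print(binom.pmf(x, N, p))
print(comb(N, x) * p**x * (1 - p)**(N - x))   # same value from the expression above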

Example

Unguided missiles are not accurate to hit a target from a distance. During a war, a battery of unguided missiles is fired on a bunker. The bunker will be completely destroyed if four missiles hit it. If the probability of hitting the target for a missile is 60%, how many missiles are to be fired to destroy the bunker with an above 90% probability?

Solution

For complete destruction, a minimum of 4 hits is necessary. That means the required probability is \[1-P(No\ hit (x=0)|N,0.6)-P(One\ hit (x=1)|N,0.6)-P(Two\ hits (x=2)|N,0.6)-\\P(Three\ hits (x=3)|N,0.6)\]Here N is unknown and the required minimum probability is 0.9. Hence, we need to solve the following equation (treating N as continuous and then taking the next integer): \[0.9=1-P(x=0|N,0.6)-P(x=1|N,0.6)-P(x=2|N,0.6)-P(x=3|N,0.6)\]\[0.1=0.4^N+N(0.6)(0.4)^{N-1}+\frac{N(N-1)}{2}(0.6^2)(0.4)^{N-2}+\frac{N(N-1)(N-2)}{6}(0.6^3)(0.4)^{N-3}\]The above equation cannot be solved easily by hand, so we need a numerical method. We can use the scipy package in Python to solve it.

Python Code
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import fsolve

p=0.6
func = lambda N: 0.1 - ((1-p)**N + N*p*(1-p)**(N-1) + N*(N-1)/2*((p)**2)*(1-p)**(N-2) + N*(N-1)*(N-2)/6*((p)**3)*(1-p)**(N-3))

initial_guess = 4
N_required = int(np.ceil(fsolve(func, initial_guess, maxfev=1000)[0]))  # round up to the next whole missile
print(N_required)

The solution is that at least 9 missiles are to be fired to be 90% sure of destroying the bunker. 

In logistic regression, when we use MLE to estimate model parameters, we use this distribution to model the likelihood of the occurrences of the target variable. The next important discrete distribution is the Poisson Distribution.

Poisson Distribution

The Poisson distribution is another discrete distribution, and it basically deals with count data. It is used in Poisson regression and in some other statistical tests. Unlike the binomial distribution, the Poisson distribution gives the probability of a given number of events occurring within a fixed interval of time, assuming a constant mean rate and independence from the time since the last event. That sentence is long and (probably) confusing, so let us understand it in a better way. Imagine a road crossing and a fixed time frame, say 9 AM to 10 AM in the morning (i.e., a fixed time interval). Assume that, on average, 10 vehicles (i.e., the mean rate) cross the crossing during that time. When a vehicle crosses the crossing, it does not affect the occurrence of another vehicle crossing the same crossing (i.e., independence from the last event). Now, on a given day, the number can be 5, 15, 25, 30 or even 50. While 5 and 15 will have higher probabilities of occurrence, 25, 30 and 50 will have smaller probabilities (but not zero!). This can be modeled with a Poisson distribution with mean rate 10. Mathematically, the Poisson distribution is defined as \[P(x;\lambda)=\frac{\lambda^x e^{-\lambda}}{x!}\]Here, $\lambda$ is the mean rate and $x$ is the number of occurrences of the event. It is to be understood that $P(x;\lambda)$ is the probability mass function.

Example

An ENT specialist, on average, sees 10 patients in a day. What would be his expected earning per day if the number of patients varies from 15 to 20 (following a uniform distribution) and he charges INR 1000 per patient for the visit? Note: 15-20 patients is a special case which does not disturb the mean rate.
 

Solution

This problem can be solved by assuming that the number of patient visits follows a Poisson distribution with mean rate 10 (i.e., $\lambda = 10$). The number of patients varies from 15 to 20 (i.e., $x = 15,16,17,18,19,20$). We need to calculate the revenue associated with each number of patients and multiply it by the associated Poisson probability to find the expected earning. The expected earning $E$ is \[E=\sum_{x=15}^{20}P(x;\lambda)(1000x)\]It is important to note that, before evaluating the final expected value, the probabilities are renormalized so that they sum exactly to one. To solve this problem, it is better to use either R or Python code. I show the Python code here.
 
 Python Code
import numpy as np
from scipy.stats import poisson

prob = np.array([poisson.pmf(x, 10) for x in range(15, 21)])
prob_final = prob / prob.sum()                     # renormalize so that the probabilities sum to one
E = sum(1000 * np.arange(15, 21) * prob_final)     # expected earning
print(E)

The final result is INR 16133. In a separate post, I shall elaborate on Poisson regression with mathematical details.



Monday, 28 June 2021

Missing value Imputation

In data analysis, missing values create a lot of trouble. Several algorithms are simply incapable of handling missing values properly and fail at run time. Take the case of KMeans clustering: the presence of missing values does not allow the centroids to be estimated properly, which leads either to total failure or to unreliable (and potentially biased) centroid estimates. Similarly, in linear regression, if missing values are encountered, the entire row containing the missing value is typically removed before estimating the model parameters, and that leads to unreliable parameter estimates. Random Forest, a more accurate machine learning method, is also incapable of handling missing values unless they are replaced through some means. XGBoost, LightGBM, etc. can handle missing values because they have built-in provisions: they put data points containing missing values in either the left-hand or the right-hand bucket of a split, whichever minimizes the loss. This is possible because they use a loss function which is quite different from that of a normal decision tree. But, in general, missing values are nothing but a menace to data analytics. Hence, quite a few methods are available to deal with this problem.

Simple methods

Sometimes, dealing with missing values is as simple as dropping the rows containing them. This is called list-wise deletion. This method is preferred when the deleted rows make up less than 5% of the total number of rows; it is a rule of thumb of sorts. However, if the percentage is more than 5%, then the missing values are imputed. It is to be kept in mind that missing values are sometimes legitimate, and they must NOT be imputed. This is commonly seen in the responses of large-scale surveys where funneling questions are asked, such as "If the answer to Q4 is Yes then go to Q7". In such situations, Q5 and Q6 are bound to have missing values. An analyst must not try to impute these missing values because that would distort the responses significantly. Hence, care must be taken before imputing these types of missing values. 

Before handling missing values, we need to look at what types of analyses are required. There are some calculations (bi-variate analyses) where only two variables (or vectors) are involved at a time; common examples are correlation analysis and pairwise distance calculation. Since only two variables are involved at a time, such analyses can use pair-wise deletion (i.e., list-wise deletion applied to the pair of variables) instead of list-wise deletion and retain a significant amount of information which would otherwise have been lost. Hence, any algorithm which uses a correlation matrix (e.g., PCA) or a distance matrix (e.g., MDS) can benefit from such a method of missing value treatment. However, it becomes difficult to generalize the outcomes of these processes because, for unseen data with missing values, the correlation matrix or the distance matrix could be significantly different. That is why any such deletion method must be administered carefully. 

Usually, missing values in numeric variables are imputed using simple techniques such as imputing by the series mean or the series median. The mean is a characteristic point of a distribution, and when the distribution of the variable is not significantly different from a normal distribution, mean-based imputation is preferred. But if the distribution is skewed, the median is a better choice because the median is robust compared to the mean; the mean is highly affected by the presence of outliers. For discrete (categorical) variables, the simplest method is to impute the missing values with the mode (the most frequent value). 
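These simple strategies are essentially one-liners in pandas. The sketch below uses a tiny made-up frame (the column names and values are assumptions for the illustration) and fills a numeric column with its median and a categorical column with its mode.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":  [25, np.nan, 40, 35, np.nan, 50],
    "city": ["A", "B", np.nan, "A", "A", np.nan],
})

df["age"] = df["age"].fillna(df["age"].median())       # median: robust to skew and outliers
df["city"] = df["city"].fillna(df["city"].mode()[0])   # most frequent category
print(df)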

Simple methods work, and they are easy to implement. However, when imputing with simple methods, the relationships among the variables are completely ignored, which can also induce some bias. That is why model-based methods are also worth exploring.

Model based methods

There are many model-based approaches available. In this post, I shall discuss a method which imputes missing values iteratively. The process is explained below.

Step 1: Impute missing values with either series mean or median for numeric variables and series mode for discrete data. Keep track of the row indices of the missing values for each variable.

Step 2: Arrange the variables in ascending order of the number of missing values

Step 3: Start imputation with the variable having the lowest number of missing values. Make it the dependent (target) variable and make all other variables the predictor variables.

Step 4: Convert the imputed values to missing values once again for the dependent variable only. 

Step 5: Train a statistical model or a machine learning algorithm to predict the dependent variable from the rows where it is not missing. Use the model to predict its missing values. Use a regression model for numeric variables and a classification model for discrete variables.

Step 6: Go to the next variable and follow step 2 to step 5

Step 7: When all the variables are imputed, one iteration (or epoch) is over. Repeat the entire process for a prefixed number of iterations (or epochs). Sometimes, to make sure that the predicted values are not too far from actually observed values, a predictive mean matching step is also added after the prediction step. However, in this blog, we will not use it (mainly to reduce the complexity).

Python code

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from collections import Counter
from lightgbm import LGBMRegressor,LGBMClassifier
from tqdm import tqdm, tqdm_notebook 

 

class LGBM_Impute():
    def __init__(self, dataset, target_var=None, param_lgbm_reg = None, param_lgbm_cls = None, n_epoch=5):
        self.dataset = dataset
        self.target = target_var
        self.param_lgbm_cls = param_lgbm_cls
        self.param_lgbm_reg = param_lgbm_reg
        self.n_epoch = n_epoch
        
        self.data_mask = self.dataset.isnull()
        
    
    def label_encode(self, x):
        # Map each non-missing category to an integer code
        unique_labels = sorted(pd.Series(x).dropna().unique().tolist())
        return {k: v for k, v in zip(unique_labels, range(len(unique_labels)))}
    
    def impute(self):
#         var_types = [self.dataset[x].dtype for x in self.dataset.columns]
        cat_vars_index = [index for index in range(self.dataset.shape[1]) if self.dataset.iloc[:,index].dtype == 'object']
        num_vars_index = [index for index in range(self.dataset.shape[1]) if self.dataset.iloc[:,index].dtype != 'object']
        
        self.data_encoded = self.dataset.copy()
        
        original_var_order = self.dataset.columns
        
        # Process the columns in ascending order of missing-value counts (Step 2)
        n_missing = self.data_mask.sum(axis=0)
        col_order = list(np.argsort(n_missing.values))
        
        self.encode_list = []   
        for col_index in cat_vars_index:
            label_map = self.label_encode(self.data_encoded.iloc[:,col_index])
            self.data_encoded.iloc[:,col_index] = self.data_encoded.iloc[:,col_index].map(label_map)
            self.encode_list.append((col_index, label_map))            
            
        self.key_value_pair = {k:v for k, v in self.encode_list}
#         self.value_key_pair = {v:k for k, v in self.encode_list}
        
        for col in range(self.data_encoded.shape[1]):
            if col in cat_vars_index:
                # Initial fill for categorical columns: the most frequent (mode) value
                C = Counter(self.data_encoded.iloc[:, col].dropna())
                self.data_encoded.iloc[:, col] = self.data_encoded.iloc[:, col].fillna(C.most_common(1)[0][0])
            else:
                # Initial fill for numeric columns: the series median
                M = self.data_encoded.iloc[:, col].median()
                self.data_encoded.iloc[:, col] = self.data_encoded.iloc[:, col].fillna(M)
                
        
        if self.param_lgbm_reg is None:
            reg = LGBMRegressor()
        else:
            reg = LGBMRegressor(**self.param_lgbm_reg)

        if self.param_lgbm_cls is None:
            cls = LGBMClassifier()
        else:
            cls = LGBMClassifier(**self.param_lgbm_cls)
            
                
        for epoch in tqdm(range(self.n_epoch)):
            for col in col_order:   # ascending order of missing-value counts (Step 2)
                if self.data_mask.iloc[:,col].sum() == 0:
                    continue
                else:
                    df_train = self.data_encoded[~self.data_mask.iloc[:,col]].copy()
                    df_test = self.data_encoded[self.data_mask.iloc[:,col]].copy()
                    
                    df_test.drop(self.data_encoded.columns[col], axis=1, inplace=True)
#                     print(df_test.shape)
                    
                    X = df_train.drop(self.data_encoded.columns[col], axis=1)
                    Y = df_train.iloc[:,col]
                    
                    if col in cat_vars_index:
                        cls.fit(X,Y)
                        pred = cls.predict(df_test)
                        df_test[self.data_encoded.columns[col]] = pred
                        self.data_encoded = pd.concat([df_train, df_test])
                    
                    else:
                        reg.fit(X,Y)
                        pred = reg.predict(df_test)
                        df_test[self.data_encoded.columns[col]] = pred
                        self.data_encoded = pd.concat([df_train, df_test])
#             print("End of epoch {}".format(epoch))
        
        for col in range(self.data_encoded.shape[1]):
            if col in cat_vars_index:
                value_key_pair = {v:k for k, v in self.key_value_pair[col].items()}
                self.data_encoded.iloc[:,col] = self.data_encoded.iloc[:,col].map(value_key_pair)
                
        final_imputed_data = self.data_encoded.sort_index()
        
        return final_imputed_data[original_var_order] 
 

The performance of imputation is very difficult to gauge due to the non-availability of the correct values. Readers can try to evaluate the performance by artificially creating missing values and using the routine to impute them. Thereafter, the imputed values can be compared with the actual values to evaluate its performance. LightGBM is a flexible ML algorithm and hence the choice of hyperparameters will have an impact on the final quality of the imputation. 

I hope that you will find this blog informative. 

Comments are welcome..

Sunday, 2 August 2020

Understanding gradient descent

In machine learning, numerical optimization plays a critical role. Take the case of simple logistic regression. In logistic regression, model parameters are obtained (primarily) using the MLE concept. But parameter estimation is not as simple as in the case of linear regression, because at one step of the estimation process the equations for the parameters are not in closed form. In such cases, we are bound to use numerical methods to get plausible solutions. Similarly, when we try to minimize a loss function in an ML algorithm, in most cases the loss function turns out to be non-convex. In those situations too, we need to adopt numerical optimization techniques. In the literature, we find several numerical optimization techniques with relative merits and demerits. But when speedy execution is needed, the most common methods are gradient-based. If one looks carefully, the entire set of deep learning architectures relies mainly on some form of gradient-based optimization. Hence, in this post, two basic but important gradient-based optimization techniques are discussed. These techniques are:

  • Steepest Descent with variable step size
  • Gradient Descent with, mostly, constant step size

To understand the methods rather easily, we will look at a 3D surface which is essentially convex in nature. Let us assume that the 3D surface is created by the function:
$$ f(x,y) = 2x^2 + 5y^2 - 1000 $$
This is a very simple convex function with a global minimum at x=y=0. The function, if plotted, looks like a parabolic surface as shown below.


In the case of gradient descent, we first need to calculate the gradient of the function at the current point. The gradient is, essentially, a vector which is perpendicular to the contour (level curve) passing through the point where it is calculated. It points in the direction of movement which increases the function value fastest in the local neighbourhood. Let us first find the gradient vector for the given surface. 
$$\frac{\partial f(x,y)}{\partial x} = 4x$$
$$\frac{\partial f(x,y)}{\partial y} = 10y$$
$$\nabla f = [4x,\ 10y]_{x=x^{(c)},\ y=y^{(c)}}$$
The same surface can be shown as a contour plot and when the gradient is calculated, it points in the outward direction as shown in the figure below. The gradient is calculated at [5.05, 5.45]

If we move along the direction of the arrow, the value of the function will increase. But we need to decrease the value. Hence, instead of moving along the direction of the gradient, we move in the opposite direction; that is why the term gradient descent is used. So, from the initial starting point, we now know along which direction we should move to lower the value of the function. Look at the figure below:

Steepest Descent
As we move along the line in the opposite direction of the gradient, we can keep calculating the gradients and find the respective gradient vectors at those points. It would be seen that the angle between the original gradient vector and the other gradient vectors, as calculated along the line, would increase continuously. Have a look at the movement of gradient vectors as we move along the line.


From vector algebra, we know that two (nonzero) orthogonal vectors are linearly independent of each other. This is where the line search comes in. Along this line, it is possible to find a particular point where the gradient vector becomes exactly perpendicular to the previous gradient vector. Why is this point important? Because up to this point the function value keeps decreasing along the line; beyond it, the amount of reduction diminishes. Line search is an iterative process, because finding the exact location of the point where the gradient is exactly perpendicular to the previous gradient is computationally expensive. Hence, approximate methods are used. In 1969, Philip Wolfe suggested two conditions for choosing the step length.
$$f(x_i+\alpha_ip_i)\leq f(x_i)+c_1\alpha_i \nabla f_i^Tp_i$$
$$\nabla f(x_i+\alpha_i p_i)^Tp_i\geq c_2\nabla f_i^Tp_i$$
In the above conditions, $0 < c_1 < c_2 < 1$, and the hyperparameter $c_1$ is chosen to be quite small. $p_i$ is the direction of the step and $\alpha_i$ is the step length. If the loss minimization is done by Newton's method, these conditions serve well. Goldstein proposed another condition which can be used in line search. The condition is:
$$f(x_i)+(1-c)\alpha_i \nabla f_i^Tp_i \leq f(x_i+\alpha_ip_i)\leq f(x_i)+c\alpha_i \nabla f_i^Tp_i$$

Thus, in steepest descent, both the gradient and the step size are required to be calculated. If the function to be minimized is a simple function, as in the present case, steepest descent will quickly reach the minimum within a few iterations. The solution can be seen in the figure below.

But in the case of functions having several local minima, this method becomes slow and follows a very zigzag path before reaching the solution. That is why we usually keep the step size constant and small (we call it the learning rate) so that the descent becomes much smoother. The equation of gradient descent is given below:
$$\theta^n = \theta^o - \eta \nabla L$$
where $\eta$ is the learning rate and $\nabla L$ is the gradient at the current location. In this problem, when $\eta$ is kept low (at 0.02), gradient descent reaches the minimum in a very smooth manner, as shown in the figure.
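A bare-bones sketch of this update rule for the surface used in this post is given below; the learning rate (0.02) and the starting point match the values mentioned above, while the iteration count is an arbitrary choice.

import numpy as np

def grad(point):
    x, y = point
    return np.array([4 * x, 10 * y])      # gradient of f(x, y) = 2x^2 + 5y^2 - 1000

eta = 0.02                                # learning rate
theta = np.array([5.05, 5.45])            # starting point used in the figures above

for i in range(500):
    theta = theta - eta * grad(theta)     # theta_new = theta_old - eta * gradient

print(theta)                              # converges towards the minimum at (0, 0)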

In deep learning, or in any complex machine learning algorithm that uses gradient descent (e.g., GBM, LightGBM, XGBoost, etc.), steepest descent is usually avoided because the error surface is very uneven and, in many situations, ill-conditioned. In those situations, gradient descent with a (mostly) static learning rate is used. Particularly in deep learning, different modified optimizers are used which make heavy use of gradient information (e.g., RMSprop, Adam, Adamax, etc.). 

I hope you will find this post informative.
