Monday, 27 May 2024

Maximum Likelihood Estimation (MLE): An important statistical tool for parameter estimation

Parameter estimation is critical for learning patterns within data. Before the advancement of computing power, researchers carried out calculations by hand, so heavily iterative methods were discouraged: the manual labour involved increased the chance of calculation errors. Methods that required fewer iterations were therefore preferred. Maximum Likelihood Estimation (MLE) was one of the most widely used methods of this kind, and the parameters of several algorithms were estimated using it. In fact, this method of parameter estimation is still quite popular. In this blog post, I shall discuss the math behind this wonderful method.

Before using MLE in the estimation process, it is important to make an assumption about the distribution of either the data points or, more importantly, the error terms. For example, to use MLE to estimate the parameters of a regression model, it is assumed that the error terms are normally distributed, and in logistic regression it is assumed that the target variable is binomially distributed with some probability of success. Without such assumptions, it is impossible to use MLE.

To use MLE, it is important to arrive at the expression of the likelihood, which is nothing but the joint probability of observing the data. Mathematically, $$L=P(X_1, X_2, X_3,...,X_n)$$ where $X_i \sim P(X_i|\theta)\ \forall i \in \{1,2,3,...,n\}$. Here, $\theta$ is the parameter set of the probability function $P(.)$. For simplicity, in many algorithms it is assumed that the $X_i$s are independently and identically distributed. The expression of the likelihood then simplifies, by the rule for independent events, to $$L=P(X_1, X_2, X_3,..., X_n)=\prod_{i=1}^nP(X_i|\theta)$$ Here $\prod_{i=1}^n(.)$ denotes the product of the individual terms. Once the expression of the likelihood is correctly identified, the likelihood becomes a function of the unknown parameters $\theta$ alone, because all the $X_i$s are already observed and therefore act as constants inside it. Hence, even though the likelihood is built from the joint probability of the observed events, the likelihood function itself is not a probability function of $\theta$.

Since the likelihood is a function of the parameters, maximum likelihood estimation chooses the parameters so that this function attains its maximum value. In other words, the parameter set estimated through MLE maximizes the joint probability of occurrence of the observed instances. Mathematically, $$\hat \theta^{MLE} = \underset{\theta}{\operatorname {argmax}}\prod_{i=1}^nP(X_i|\theta) $$ So, if there are $k$ parameters to be estimated, by applying the rules of calculus the partial derivative of the likelihood function $L$ is equated to zero for every $\theta_j,\ \forall j \in \{1,2,3,...,k\}$. That is, $$\frac {\partial L}{\partial \theta_j}=0,\ \forall j \in \{1,2,3,...,k\}$$ This is simply the standard condition for finding the maximum (or minimum) of a function with respect to its parameters. Thus, for $k$ parameters there are $k$ equations, from which the estimates can be obtained. Sometimes the equations do not have a closed-form solution (i.e., a parameter remains a function of itself, as in $x = \sin(x+y)$). In such situations, numerical methods have to be adopted to estimate the parameters (e.g., in logistic regression). Let us now use MLE in some cases.
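
To make the idea concrete, here is a minimal numerical sketch (my own illustration, not tied to any particular algorithm): the negative log-likelihood of a normal sample is minimized with scipy.optimize.minimize, and the result is compared with the closed-form MLE (the sample mean and the biased sample standard deviation). All variable names are illustrative.

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(42)
data = rng.normal(loc=5.0, scale=2.0, size=1000)   # synthetic observations

def neg_log_likelihood(params, x):
    mu, sigma = params
    if sigma <= 0:                      # keep sigma in the valid region
        return np.inf
    return -np.sum(norm.logpdf(x, loc=mu, scale=sigma))

result = minimize(neg_log_likelihood, x0=[0.0, 1.0], args=(data,), method="Nelder-Mead")
mu_hat, sigma_hat = result.x

print("Numerical MLE :", mu_hat, sigma_hat)
print("Closed form   :", data.mean(), data.std())   # np.std uses ddof=0, i.e., the MLE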

Case 1

A coin is tossed 30 times and it is observed that heads came up 21 times. What is the probability of getting a head when it is tossed the next time?

Solution:

The solution is rather easy: $p=\frac {21}{30}=0.70$. But this expression, $p=\frac {\#\ of\ heads}{\#\ of\ trials}$, is itself the typical MLE of the probability of success. Let us try to derive it. Let $x_1, x_2, x_3,...,x_{30}$ be the 30 trials. We assume that the trials are independent of each other and identically distributed (which is quite fair in this situation!). Since there are only two outcomes, the number of heads can be assumed to follow a binomial distribution. The probability of observing 21 heads out of 30 trials can be expressed as $$P(heads=21|p, 30)={30\choose 21}p^{21}(1-p)^{(30-21)}$$ where $p$ is the required probability of heads to be estimated. In the case of a discrete distribution like the binomial distribution, this expression is nothing but a probability mass function ($pmf$). It accounts for all the possible ways 21 heads can come up within 30 trials. If we consider only the one particular sequence that was actually observed, the combination operator ${30 \choose 21}$ goes away and only $p^{21}(1-p)^{(30-21)}$ remains (the constant ${30 \choose 21}$ does not depend on $p$, so dropping it does not change the location of the maximum). Since the 30 trials are run only once, we can take the likelihood function as $$L=p^{21}(1-p)^{9}$$ Once we obtain the likelihood function, we can proceed with MLE as shown below:

$$ \begin{align} L &= p^{21}(1-p)^{9}\\ \frac {\partial L}{\partial p}&= 21p^{20}(1-p)^9 + p^{21}\cdot 9(1-p)^{8}(-1)=0\\ &\implies 21p^{20}(1-p)^9 = 9p^{21}(1-p)^8 \\ &\implies 21(1-p)=9p\\ &\implies p = \frac{21}{30}\end{align}$$ So, the result obtained earlier and the one obtained with MLE are the same.
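
As a quick numerical sanity check (not part of the derivation itself), the likelihood $L=p^{21}(1-p)^{9}$ can be evaluated on a fine grid of $p$ values; the maximum should sit at $p \approx 0.7$:

import numpy as np

p_grid = np.linspace(0.001, 0.999, 9999)
likelihood = p_grid**21 * (1 - p_grid)**9

p_mle = p_grid[np.argmax(likelihood)]
print(f"MLE of p ~ {p_mle:.3f}")   # ~ 0.700, matching 21/30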

Case 2

Estimate the parameters of a linear regression model using MLE

Solution:

Let us assume that there is a dataset $D$ with $n$ data points in an $m$-dimensional space. Hence, $D_i=(X_i,y_i),\ \forall i \in \{1,2,3,...,n\}$ and $X_i \in R^m$. This is a typical case of multiple regression, where there is more than one predictor variable and one target variable. The linear regression model can be written as $$\hat y_i=\beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \beta_3x_{i3} + ... + \beta_mx_{im}$$ According to the above equation, $\hat y_i$ is a linear combination of the predictor variables (the linearity assumption). It is to be noted that $X_i = \{x_{i1}, x_{i2}, x_{i3}, ..., x_{im}\}$, where $x_{ij}$ is the value of the $j^{th}$ column for the $i^{th}$ observation, and $\beta_0$ is the intercept. Thus, there are $m+1$ parameters to be estimated for this model. To apply MLE, we need to make an assumption about the distribution of some quantity. Here comes another important assumption of linear regression: the errors are normally distributed. If $e_i$ is the error in prediction, $e_i = y_i - \hat y_i$, and as per this second assumption, $e_i \sim N(0, \sigma^2)$. Because the error terms share a single variance $\sigma^2$, another important assumption comes in implicitly: the error variance is homoskedastic, meaning that the error terms have uniform variation across the range of the predictor variables. Simply put, if the error terms are plotted against the fitted values or any predictor variable, the scatter plot should look like white noise, without any pattern. The image below is helpful in understanding the meaning of homoskedasticity of the error variance.

Source: https://www.originlab.com/doc/Origin-Help/Residual-Plot-Analysis
With these three assumptions, it is possible to estimate the parameters of the linear regression model. For this, we note that if $x \sim N(\mu, \sigma^2)$, $$P(x|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi \sigma^2}}\exp \left( - \frac {(x-\mu)^2}{2\sigma^2}\right )$$ where $\mu$ is the mean and $\sigma^2$ is the variance of $x$. Hence, if $e_i$ follows a normal distribution with mean 0 and variance $\sigma^2$, $$\begin{align}P(e_i|0, \sigma^2) &= \frac{1}{\sqrt{2\pi \sigma^2}}\exp \left( - \frac {(e_i-0)^2}{2\sigma^2}\right ) \\ &= \frac{1}{\sqrt{2\pi \sigma^2}}\exp \left( - \frac {e_i^2}{2\sigma^2}\right )\end{align}$$ Since there are $n$ data points within the dataset $D$, the likelihood function of the errors is given by $$L = \prod_{i=1}^nP(e_i|0,\sigma^2)$$ The above expression is valid only if the individual $e_i$ values are independent of each other. This is the fourth important assumption of linear regression, i.e., the error terms are uncorrelated with each other (absence of autocorrelation of the error terms). Since $\log$ is a monotonic function, if $\log(f(x|\theta))$ attains its maximum at some $\theta^*$, the function $f(x|\theta)$ also attains its maximum at the same $\theta^*$. Since the likelihood is a product of probabilities, we take the $\log$ of the likelihood to convert the product into a sum of log probabilities; compared to a product, a sum is much easier to handle with calculus. Hence, we apply MLE to the log-likelihood rather than the likelihood function. If $l$ denotes the log-likelihood,
$$\begin{align} l &= \sum_{i=1}^n\ln(P(e_i|0,\sigma^2)) \\ &= \sum_{i=1}^n \ln \left [\frac{1}{\sqrt{2\pi \sigma^2}}\exp \left( - \frac {e_i^2}{2\sigma^2}\right ) \right ] \\ &= \sum_{i=1}^n \ln \left [\frac{1}{\sqrt{2\pi \sigma^2}}\right ] + \sum_{i=1}^n \ln \left [\exp \left( - \frac {e_i^2}{2\sigma^2}\right ) \right ] \\ &= Const + \sum_{i=1}^n \left [-\frac{e_i^2}{2\sigma^2} \right ] \\ &= Const - \frac{1}{2\sigma^2} \sum_{i=1}^n e_i^2 \end{align}$$ As per the above expression, the only way to maximize $l$ is to minimize $\sum_{i=1}^n e_i^2$, because the error variance $\sigma^2$ is a constant. Now, $\sum_{i=1}^n e_i^2$ is nothing but the sum of squares of the error terms. This means that the log-likelihood is maximized when the sum of squared errors (popularly denoted as SSE) is minimized. Now, $$\begin{align}\sum_{i=1}^n e_i^2&=\sum_{i=1}^n (y_i-\hat y_i)^2 \\ &= \sum_{i=1}^n (y_i-X_i\beta )^2 \\ &= (y-X\beta)^T(y-X\beta) \ \ \ \ \ \ \ in\ matrix\ form,\ where\ \beta=\{\beta_0, \beta_1, \beta_2,...,\beta_m\} \\ &=(y^T-(X\beta)^T)(y-X\beta) \\ &= (y^T - \beta^TX^T)(y-X\beta) \\ &= y^Ty - y^TX\beta -\beta^TX^Ty + \beta^TX^TX\beta \end{align}$$ The above expression is a function of the parameters $\beta$ and hence, to minimize the SSE, we differentiate it with respect to $\beta$. This is where matrix calculus comes into play, and we write $$\begin{align}\frac{\partial}{\partial \beta}\sum_{i=1}^n e_i^2 &=  \frac{\partial}{\partial \beta}(y^Ty - y^TX\beta -\beta^TX^Ty + \beta^TX^TX\beta) \\ &= 0 - X^Ty - X^Ty + 2X^TX\beta\ \ \ \ because\ X^TX\ is\ square\ symmetric \\ &= -2X^Ty + 2X^TX\beta \end{align}$$ For the minimum, $\frac{\partial}{\partial \beta}\sum_{i=1}^n e_i^2=0$ for all $\beta_j$, $j=\{0,1,2,3,...,m\}$. Thus we get $$-2X^Ty + 2X^TX\beta=0$$ After rearrangement, we get $$X^TX\beta = X^Ty$$ $$\implies \beta = (X^TX)^{-1}X^Ty$$ The term $(X^TX)^{-1}X^T$ is called the pseudo-inverse of $X$. This is exactly the same result as that of the Ordinary Least Squares (OLS) method (a small numerical sketch of this closed-form solution is given after the list of assumptions below). One more important assumption is hidden in this expression: if the columns of $X$ are multicollinear, $det(X^TX)$ would be very small and the inverse would have very large (or even infinite) entries. That would make the model unreliable and render it rather useless for practical purposes. Hence, one more critical assumption comes into play: the predictor variables are uncorrelated with each other. Thus, linear regression comes with 5 very critical assumptions:
  1. The model is a linear combination of the predictor variables
  2. The error terms of the model are normally distributed
  3. The error terms have constant variance across the levels of the predictor variables (the error variance is homoskedastic)
  4. The error terms are not autocorrelated
  5. The predictor variables are uncorrelated with each other (no multicollinearity)
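
As a small illustration of the closed-form result $\beta = (X^TX)^{-1}X^Ty$ derived above, the following sketch fits a regression on synthetic data using only the normal equations. The data, coefficients and variable names are made up purely for demonstration.

import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 3
X_raw = rng.normal(size=(n, m))
true_beta = np.array([2.0, -1.0, 0.5, 3.0])        # intercept followed by 3 slopes
X = np.column_stack([np.ones(n), X_raw])           # prepend a column of ones for beta_0
y = X @ true_beta + rng.normal(scale=0.5, size=n)  # normally distributed errors

# Solve the normal equations X^T X beta = X^T y directly;
# np.linalg.solve is preferred over an explicit inverse for numerical stability
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)   # should be close to true_beta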

With the MLE of the parameters, the maximized likelihood value can also be computed, and with this value two more metrics can be used to compare model performance: the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). $$AIC= 2k -2\ln(\hat L)$$ $$BIC=k \ln(n) - 2\ln(\hat L)$$ where $k$ is the number of estimated parameters, $n$ is the number of observations and $\hat L$ is the maximized likelihood. These two criteria play important roles in selecting the best statistical model from among models with different numbers of parameters; BIC penalizes model complexity more heavily than AIC. A classic use of AIC or BIC is in stepwise regression, where the important variables are kept within the model and the least important variables are discarded.
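
Continuing the synthetic-regression sketch above, AIC and BIC can be computed from the residuals. This assumes Gaussian errors and one common convention for counting parameters (including $\sigma^2$ in $k$), so treat it as a sketch rather than the only valid definition.

import numpy as np

# X, y, beta_hat come from the normal-equation sketch above.
# For normally distributed errors, the maximized log-likelihood can be written
# in terms of SSE alone: ll = -n/2 * (ln(2*pi*SSE/n) + 1).
residuals = y - X @ beta_hat
n = len(y)
sse = np.sum(residuals**2)
log_lik = -0.5 * n * (np.log(2 * np.pi * sse / n) + 1)

k = X.shape[1] + 1                           # m + 1 coefficients plus sigma^2 (one convention)
aic = 2 * k - 2 * log_lik
bic = k * np.log(n) - 2 * log_lik
print(f"AIC = {aic:.2f}, BIC = {bic:.2f}")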

In this post, I have tried to explain the process of MLE and how it is used in parameter estimation. I hope you find it useful for clarifying the concepts.

Friday, 17 May 2024

Understanding the math behind Principal Component Analysis (PCA)

Let us consider a dataset \(D\) in an \(m\)-dimensional feature space having \(n\) samples. Mathematically, \(D=\{x_i,y_i\}\) where \(x_i \in R^m\). PCA is unsupervised in nature and hence it does not involve \(y_i\). It is quite possible that the features have some level of linear correlation, and hence, from a purely mathematical point of view, the features are not orthogonal to each other. Some of the features may have a very high level of multicollinearity, whereas others may be reasonably uncorrelated with each other. PCA considers linear correlation only (i.e., the Pearson correlation coefficient). PCA can be viewed as the process of extracting new features that are aligned along the direction of the stretch (or spread) of the data points. Let us try to understand this with code and diagrams. For a better understanding, let us focus on a 2D space. The code below creates the data points, and the plot shows the scatter.

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

np.random.seed(20)

# Two strongly (linearly) related variables: y is roughly 4x plus Gaussian noise
x = np.random.normal(10, 2, size=10000)
y = 4*x + np.random.normal(2, 4, size=10000)

# Stack the two variables into a (10000, 2) array for later use
z = np.c_[x, y]

sns.scatterplot(x=x, y=y)
plt.xlabel("X")
plt.ylabel("Y")
plt.show()
It is quite evident that the relationship between X and Y is linear. X and Y are orthogonal axes in this diagram, and from a linear regression point of view this scatter is desirable when Y is the target variable and X is the predictor variable. But if both X and Y are predictor variables, this scatter is going to create trouble because of multicollinearity between the features. In pattern recognition, it is quite common to map the data from the existing (or input) feature space to a new (or mapped) feature space; this is popularly known as feature engineering. So here comes the question: "What is a meaningful way of moving the data from the current input space to a new feature space?" For instance, have a look at the following new feature spaces.
Out of the four different feature spaces, which one is the most meaningful? The answer is: probably the first one. This is because its features try to align themselves along the direction of the spread of the data. But can it be optimized even more? The answer is yes: we apply PCA.

When we want to map data points from one feature space to another with a linear mapping (i.e., by using a linear function), we need a transformation matrix. Let us assume that \(U\) is the required transformation matrix such that \[ U=\begin{bmatrix} |&|&|&&&| \\ u_1 & u_2 & u_3 & . & . & u_m \\ |&|&|&&&| \end{bmatrix} \] where \(u_i\) is a column vector of unit length. It is also to be understood that \(u_i\) is the vector representing the \(i^{th}\) axis of the new feature space. If the dot product is taken between \(X_i=[x_{i1}, x_{i2}, x_{i3},...,x_{im}]\) and \(U\), the result gives the coordinates of the \(i^{th}\) data point in the new feature space. The data point remains in the same location; only its coordinates change because of the feature space mapping. Since the pattern within the data remains the same under such a linear mapping, it also remains unchanged if the origin is shifted to any arbitrary location. Hence, it is convenient to shift the origin to the centroid of the dataset. In the image above, the same is done for the new features, i.e., the origin of the new feature space is located at the centroid of the dataset. Let \(X_c\) be the centered dataset, so that the origin is located at the centroid of the dataset.

Now we can focus on the objective of PCA. Our objective is to align the new features along the stretch (or spread) of the dataset. That is, along the \(j^{th}\) new dimension, the variance explained should be maximum. Let \([a_{1j}, a_{2j}, a_{3j},...,a_{nj}] = X_c U_j\) be the projections of the data points onto the \(j^{th}\) dimension of the new feature space, and let \(\sigma_j^2\) be the variance along that dimension. It is to be noted that, as the origin was shifted to the centroid of the dataset, \[\mu_j = \frac{1}{n}\sum_{i=1}^n a_{ij} =0 \]

Now, we can calculate the variance of \(a_{ij}\ \forall i \in \{1,2,3,...,n\}\) as:

$$ \begin{align} \sigma_j^2 & = \frac{1}{n}\sum_{i=1}^n (a_{ij}-\mu_j)^2 \\ & = \frac{1}{n}\sum_{i=1}^n (a_{ij})^2 \\ & = \frac{1}{n} (X_cU_j)^T(X_cU_j) \\ & = \frac{1}{n}U_j^TX_c^TX_cU_j \\ & = U_j^T\left (\frac{X_c^TX_c}{n}\right )U_j \\ & = U_j^T \Sigma U_j \end{align} $$

Note that, since the data is centered, $\frac{X_c^TX_c}{n}$ is the covariance matrix, and the covariance matrix is a square symmetric matrix. The objective is to maximize $\sigma_j^2$ subject to $U_j^TU_j=1$, because $U_j$ is a unit vector. This is a constrained optimization problem that can be solved with the method of Lagrange multipliers: $$L=U_j^T\Sigma U_j - \alpha_j(U_j^TU_j-1)$$ where $\alpha_j$ is the Lagrange multiplier. The objective is to find $U_j$ that maximizes $\sigma_j^2$, and hence we equate $\frac{\partial L}{\partial U_j}=0$. This leads to $$2\Sigma U_j - 2\alpha_j U_j = 0$$ From here, it is clear that $$\Sigma U_j = \alpha_j U_j$$ The above equation is well known as the eigenvalue equation of a square matrix. This means that the principal component $U_j$ is essentially an eigenvector of the covariance matrix. Since the covariance matrix is a square symmetric matrix of size $(m \times m)$, there are exactly $m$ real eigenvalues with $m$ mutually orthogonal eigenvectors. Substituting $\Sigma U_j$ into the expression for $\sigma_j^2$, we see that $$\sigma_j^2=U_j^T\Sigma U_j = U_j^T\alpha_j U_j=\alpha_j U_j^TU_j = \alpha_j$$ Thus, the eigenvalue represents the amount of variance explained by the respective principal component along the $j^{th}$ dimension. An added benefit of this analysis is that the principal components are perfectly orthogonal to each other, because the eigenvectors of a square symmetric matrix are orthogonal to each other.
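
As a quick numerical check of this result (using the array z created in the first code block above), the eigenvalues of the covariance matrix should equal the variances of the data projected onto the corresponding eigenvectors:

import numpy as np

# Center the data and form the (biased) covariance matrix X_c^T X_c / n
z_centered = z - z.mean(axis=0)
cov_mat = (z_centered.T @ z_centered) / len(z_centered)

eig_val, eig_vec = np.linalg.eigh(cov_mat)   # eigh: for symmetric matrices
projections = z_centered @ eig_vec           # coordinates in the new feature space

print("Eigenvalues          :", eig_val)
print("Projection variances :", projections.var(axis=0))   # should match the eigenvalues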

Covariance is not bounded, and a variable with higher variance tends to dominate the other variables (or features). Hence, instead of the covariance matrix, the correlation matrix is often preferred, which is essentially the covariance matrix of the standardized data.
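
A small check of this equivalence (again using z from the earlier block): the covariance matrix of the standardized data is numerically the same as the correlation matrix of the original data.

import numpy as np

# Standardize each column to zero mean and unit variance
z_std = (z - z.mean(axis=0)) / z.std(axis=0)

print(np.cov(z_std, rowvar=False, bias=True))   # ~ equal to the correlation matrix below
print(np.corrcoef(z, rowvar=False))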

The final principal components for the sample data are shown in the figure along with the codes.

# Correlation matrix of z and its eigen decomposition
cor_mat = np.corrcoef(z, rowvar=False)
eig_val, eig_vec = np.linalg.eig(cor_mat)

x_mean = x.mean()
y_mean = y.mean()

# Draw the two eigenvectors (principal directions) anchored at the centroid
plt.figure(figsize=(6,6))
sns.scatterplot(x=x, y=y)
plt.quiver([x_mean, x_mean], [y_mean, y_mean],  eig_vec[0,:], eig_vec[1,:],
           [3,10], scale=5, cmap='flag')
plt.xlabel("X")
plt.ylabel("Y")
plt.show()

 

PCA as a method of dimension reduction

 

After the eigen decomposition of the correlation matrix, we also get another interesting outcome: $$\sum_{j=1}^m\sigma_j^2 = \sum_{j=1}^m\alpha_j = trace(Corr\ Matrix)=m$$ After PCA, the eigenvalues are sorted in descending order and the eigenvectors are rearranged according to the sorted eigenvalues. Then we choose $r$ principal components, with the minimum value of $r$ such that $$\frac{1}{m}\sum_{j=1}^r\alpha_j \ge \epsilon$$ so that the variance retained is at least a fraction $\epsilon$ of the total variance. Since $r < m$, the $m$-dimensional dataset is then represented with $r$ mutually orthogonal dimensions, with some loss of information. Apart from this method, there is the Kaiser criterion, which can be used to extract a smaller number of dimensions: only those eigenvectors are retained whose corresponding eigenvalues are greater than 1. The third method is the analyst's choice, where the analyst directly chooses the number of principal components.
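
A short sketch of these selection rules on the sample data z (the 95% threshold is an arbitrary illustrative choice):

import numpy as np

# Eigenvalues of the correlation matrix, sorted in descending order
eig_val = np.linalg.eigvalsh(np.corrcoef(z, rowvar=False))[::-1]
explained_ratio = np.cumsum(eig_val) / eig_val.sum()

# Smallest r whose cumulative explained-variance ratio crosses the threshold
r = np.argmax(explained_ratio >= 0.95) + 1
print("Components by 95% variance rule:", r)
print("Components by Kaiser criterion :", np.sum(eig_val > 1))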

Implicit Assumptions


Since PCA deals with the covariance matrix, it is associated with the implicit assumptions:
  1. The variables should be normally distributed (so that mean and variance make sense)
  2. Principal components are the linear combination of the input variables
I hope that the readers will find this post helpful in understanding the concept.

Comments are welcome!






Thursday, 16 May 2024

Important Probability Distributions (Part 2)


In the previous post we saw the binomial and Poisson distributions, which are discrete in nature. Similarly, there are continuous probability distributions, and one of the most important is the normal distribution. The normal distribution for one-dimensional data has two parameters, the mean \(\mu\) and the standard deviation \(\sigma\). Its probability density function (pdf) is defined as: \[f(x|\mu,\sigma)=\frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}\] For the standard normal distribution, \(\mu = 0\) and \(\sigma = 1\), which leads to: \[ f(x|0,1) = \frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}} \] In the multivariate scenario, however, the expression becomes more complex. For the multivariate normal distribution, we need the covariance matrix \(\Sigma\) and the mean vector \(\mu\). The pdf changes to: \[ f(x|\mu, \Sigma) = \frac{1}{{(2\pi)^{d/2}}|\Sigma|^{\frac{1}{2}}}\exp\left[-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right]\] where \(d\) is the dimension of the data. The normal distribution has applications in hypothesis testing in statistical inference. More importantly, when the population standard deviation is known, it can be used to determine whether the means of two samples are statistically significantly different. However, in most situations the population standard deviation is not known. Hence, we need another distribution that behaves like the normal distribution when the sample size is large, yet can be used when the sample size is small (say 10) and, more importantly, when the population standard deviation is unknown. Student's t-distribution has all of these characteristics, and hence it is generally preferred for statistical hypothesis tests. Its density is given by: \[ \frac{\Gamma \left(\frac{\nu +1}{2}\right)}{\sqrt{\pi \nu}\,\Gamma(\frac{\nu}{2})}\left(1+\frac{x^2}{\nu}\right)^{-\frac{\nu+1}{2}}\] where \(\nu\) is the degrees of freedom and \(\Gamma(.)\) is the Gamma function. The above function applies to one-dimensional data. For the multivariate situation, the expression changes to: \[ \frac{\Gamma \left(\frac{\nu +p}{2}\right)}{(\pi \nu)^{\frac{p}{2}}\Gamma(\frac{\nu}{2})|\Sigma|^{\frac{1}{2}}}\left(1+\frac{(x-\mu)^T\Sigma^{-1}(x-\mu)}{\nu}\right)^{-\frac{\nu+p}{2}}\] where \(p\) is the dimension of the data and \(\nu\) is the degrees of freedom. If the multivariate forms of both distributions are observed carefully, they both contain the term \((x-\mu)^T\Sigma^{-1}(x-\mu)\), which is essentially the squared Mahalanobis distance. The characteristic of this distance is that it is unitless and scale-invariant. Hence, if the data under consideration follows a multivariate normal distribution, the Mahalanobis distance can be used to detect multivariate outliers.
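
As a small sanity check of the multivariate normal density written above, the formula can be evaluated manually and compared with scipy.stats.multivariate_normal for an arbitrary two-dimensional example (the numbers below are made up purely for illustration):

import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
x = np.array([0.5, 2.5])
d = len(mu)

diff = x - mu
maha_sq = diff @ np.linalg.inv(Sigma) @ diff            # squared Mahalanobis distance
pdf_manual = np.exp(-0.5 * maha_sq) / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma)))

print(pdf_manual)
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))   # should match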

To understand this use, a synthetic dataset is created with only two variables so that the spread can be shown diagrammatically. The code for generating and viewing the data is shown below.

import numpy as np
import seaborn as sns
import pandas as pd

n_inliers = 100
n_outliers = 20
n_features = 2

# Transformation matrices used to stretch the inlier and outlier data
# (the resulting data covariance for each matrix A is A @ A.T)
inliers_cov_mat = np.array([[2,0],[0,1]])
outliers_cov_mat = np.array([[5,0],[0,2]])

np.random.seed(1) # Fix the seed for consistent output

# generate inlier and outliers data points
inlier_data = np.random.normal(size=(n_inliers, n_features)) @ inliers_cov_mat
outlier_data = np.random.normal(size=(n_outliers, n_features)) @ outliers_cov_mat

# Join the data points
final_data = pd.DataFrame(np.vstack([inlier_data, outlier_data]), 
                          columns=['x','y'])
final_data['labels'] = np.repeat(['inliers','outliers'],[100,20]) # Labels only for identification

# Generate the scatter plot for visualization
sns.scatterplot(data=final_data, x='x', y='y', hue='labels')

The image below shows the scatter plot with normal (or inlier) data points and outliers. Note that not all the points labelled as outliers are actually outliers; some of them lie well within the region of the inliers. Still, there are several genuine outliers within the dataset. The objective here is to identify them using a statistical outlier detection method, i.e., the Mahalanobis distance. This method of outlier detection is multivariate in nature and hence more appropriate when analyzing multivariate data.

The covariance matrix plays an important role in calculating the Mahalanobis distance, and the covariance estimate itself is influenced by the presence of outliers. That is why multiple covariance matrices are estimated from subsamples of the data, and the covariance matrix with the lowest determinant is chosen. The smaller the determinant, the more closely packed the data points, i.e., the more plausible it is that they are inliers. The Python package scikit-learn provides the MinCovDet (MCD) class for this purpose. It tries to find the \(\frac{n_{samples}+n_{features}+1}{2}\) samples whose covariance matrix has the lowest determinant, leading to a "pure" subset of samples from the existing dataset. We apply this method to the above-mentioned dataset to get a robust estimate of the covariance matrix, which can subsequently be used to detect outliers. The code snippet is given below:
 
from sklearn.covariance import MinCovDet

mcd = MinCovDet(random_state=123)
mcd.fit(X=final_data.loc[:,['x','y']])
print(f"The robust covariance matrix is: \n\n{mcd.covariance_}")

# OUTPUT
# ======
# The robust covariance matrix is: 

# [[2.55655394 0.05976092]
#  [0.05976092 0.9295315 ]]

It can be seen that the robust covariance matrix is broadly in line with the covariance structure of the inliers. The 'mcd' object also contains the Mahalanobis distances of the data points in the input dataset; these are calculated with respect to the location (mean vector) estimated from the points used in computing the robust covariance matrix. The distances are one-dimensional, and hence the outliers can be extracted with a boxplot or IQR based method. The following code snippet shows the final outlier detection process.

def extract_univariate_outlier_IQR(array_1D):
    percentile_values = np.percentile(array_1D, q=[25,75])
    iqr = percentile_values[1] - percentile_values[0]
    lower_cutoff = percentile_values[0] - 1.5*iqr
    upper_cutoff = percentile_values[1] + 1.5*iqr
    out = np.where((array_1D > upper_cutoff) | (array_1D < lower_cutoff), 1, 0 )
    return out
    

outliers = extract_univariate_outlier_IQR(mcd.dist_)

# Generate the scatter plot for visualization
sns.scatterplot(data=final_data, x='x', y='y', hue=outliers)



The above plot clearly shows the outliers. It is to be noted that this process is applicable to unimodal distributions; for multimodal distributions it would be erroneous. In such scenarios, other methods such as the Local Outlier Factor, Isolation Forest, or even one-class SVM novelty detection can be used, as sketched below.
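
For completeness, here is a hedged sketch of one of those alternatives, the Local Outlier Factor from scikit-learn, applied to the same synthetic data; n_neighbors and contamination are illustrative choices, not tuned values.

from sklearn.neighbors import LocalOutlierFactor

# Fit LOF on the two numeric columns; fit_predict returns -1 for outliers, 1 for inliers
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.1)
lof_labels = lof.fit_predict(final_data.loc[:, ['x', 'y']])

# Visualize the LOF decision alongside the original scatter
sns.scatterplot(data=final_data, x='x', y='y', hue=(lof_labels == -1))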

In this post, I have tried to connect the use of the normal distribution to one important task in machine learning, i.e., outlier detection. I hope this is helpful.
