Thursday, 26 December 2024

EM Algorithm and its usage (Part 2)

EM algorithm is discussed in the previous post related to the tossing of coins. The same algorithm is quite helpful in many other Machine Learning (ML) applications. EM algorithm can be used in missing value imputation, topic modelling (variational EM), anomaly detection, clustering etc. In this post, I am going to discuss how the EM algorithm can be used in clustering.

EM Clustering (Gaussian Mixture Model)

We all know that clustering is an unsupervised machine learning method. Hard clustering methods such as KMeans, Hierarchical and DBSCAN etc., attach a single data point to a single cluster only. EM clustering, on the other hand, associates each data point to all clusters with varying degrees of association. That is why, it is also called a soft clustering technique. EM clustering is essentially a Gaussian Mixture Model which assumes that each cluster is generated from some Gaussian. Thus, EM clustering actually generates a model with likelihood scores. To understand the details of this process, we need to understand the math behind it. 

Gaussian Mixture

As mentioned above, EM Clustering assumes that clusters are essentially samples drawn from some Gaussian and, if there are K numbers of clusters available, the distribution of the data is essentially a mixture of K Gaussian. This is where the concept of multivariate normal distribution comes into play. The generalized equation of a d-dimensional normal distribution is: $$f_i(x|\mu_i, \Sigma_i)=\frac{1}{(2\pi)^{d/2}|\Sigma_i|^{d/2}}exp\left[-\frac{(x-\mu_i)^T\Sigma_i^{-1}(x-\mu_i)}{2}\right]$$ where $\mu_i$ and $\Sigma_i$ are the mean vector and covariance matrix respectively. As per the mixture model, the probability density function (pdf) of $x$ is the summation of pdf of all the K normals. That is, $$f(x) = \Sigma_{i=1}^Kf_i(x)P(C_i)=\Sigma_{i=1}^Kf_i(x|\mu_i, \Sigma_i)P(C_i)$$ $P(C_i)$ is called the mixture parameter which is a prior probability (probability of occurrence of the $i^{th}$ cluster itself) and it must satisfy the condition $$\Sigma_{i=1}^KP(C_i)=1$$ Thus, the number of parameters to estimate is a set $$\theta = \{\mu_1, \Sigma_1, P(C_1), ... \mu_K, \Sigma_K, P(C_K)\}$$ It is quite easy to understand that the number of parameters to estimate is quite large compared to (say) KMeans Clustering algorithm. 

Maximum Likelihood Estimation

If the dataset is denoted by $D$ having $n$ number of data points, the likelihood of $\theta$ is given by $$L = \Pi_{j=1}^nf(x_j)$$ It is to be kept in mind that the assumption of i.i.d of $x_j$ is important for this expression to be valid. Usually, for clustering tasks, we assume that the data points are following i.i.d condition. For mathematical convenience, we work with log-likelihood function and not likelihood function. Log being monotonic in nature, optimum result with log function would result in the optimum value for the original function also. Thus, optimizing log-likelihood is equivalent to optimizing likelihood. The log-likelihood function is given by $$ln(L)=\Sigma_{j=1}^nln(f(x_j)=\Sigma_{j=1}^nln\left(\Sigma_{i=1}^Kf_i(x_j|\mu_i, \Sigma_i)P(C_i)\right)$$ 

Expectation Maximization to rescue

The above equation shows that optimizing the parameters using MLE directly is a prohibitively difficult task. In fact, MLE in such cases becomes computationally intractable when it is to optimize the number of clusters and the best set of parameters. Hence, to solve this problem, the number of clusters is fixed and the parameters are estimated iteratively. The main objective of the clustering process then turns out to be finding out the conditional probability of cluster membership when the data is given. Mathematically, $P(C_i|x_j)$ is required to be found out. Since this is a conditional probability, Baye's Theorem can be used easily. Applying Baye's theorem, $$P(C_i|x_j)=\frac{P(x_j|C_i)P(C_i)}{\Sigma_{l=1}^KP(x_j|C_l)P(C_l)}$$ If pdf is known, the probability is found out by calculating the area under the curve. If a small interval $\epsilon>0$ is considered centered at $x_j$, we can calculate the conditional probability $p(x_j|C_i)$ as $$p(x_j|C_i) \approx 2\epsilon .f_i(x_j|\mu_i, \Sigma_i)$$ Based on the above equation, the posterior probability $P(C_i|x_j)$ becomes $$P(C_i|x_j)=\frac{f_i(x_j|\mu_i, \Sigma_i)P(C_i)}{\Sigma_{l=1}^Kf_l(x_j|\mu_l, \Sigma_l)P(C_l)}=w_{ij}$$ $P(C_i|x_j)$ is considered as the weight or contribution of the point $x_j$ to cluster $C_i$. With $w_{ij}$ extracted, the Expectation Step in EM algorithm is over. In the Maximization step, the same $w_{ij}$ values are used to reestimate the parameters as shown below. $$\mu_i=\frac{\Sigma_{j=1}^nw_{ij}x_j}{\Sigma_{j=1}^nw_{ij}}$$ $$\Sigma_i=\frac{\Sigma_{j=1}^nw_{ij}(x-\mu_i)(x-\mu_i)^T}{\Sigma_{j=1}^nw_{ij}}$$ $$P(C_i)=\frac{\Sigma_{j=1}^nw_{ij}}{n}$$

The parameters estimated in the Maximization step are fed back to the Expectation step and the process continues till convergence. With more features, the number of parameters to estimate becomes larger and the process becomes slower. However, EM clustering process gives more insight into the clusters than KMeans clustering process. One such information is the uncertainty measure. Since each data point is associated with each cluster with varying degrees of association (given by the probabilities), uncertainty is defined as $$uncertainty=1- max(probability)$$. This uncertainty is quite helpful to identify the data points which are lying near the boundaries of two clusters. The higher the uncertainty, the higher the chance that with a small push, the data point can be pushed into its neighbouring cluster. This aspect may be quite helpful in customer segmentation analysis with clustering.

EM Clustering with Python

This is the practical part of EM clustering using Python. For the said task, the "Wholesale customer data.csv" is downloaded from this Kaggle webpage. It is a small dataset (440 rows) with 8 features whose description is given below.

  • Channel: Channel used to purchase the goods (categorical) 
  • Region: The region of the buyer (categorical) 
  • Fresh: Amount spent to buy fresh items 
  • Milk: Amount spent to buy milk items 
  • Grocery: Amount spent to buy grocery items 
  • Frozen: Amount spent to buy frozen foods 
  • Detergent Papers: Amount spent on detergent papers 
  • Delicassen: Amount spent on delicatessen.

The first two features are not used in this analysis as they are categorical in nature and their values are nominal. The remaining six features are used for clustering. The features are standardized with scaling and are made ready for clustering. The Python codes are given below:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler
import numpy as np

data = pd.read_csv('Wholesale customers data.csv')
data.head()


Since Gaussian Mixture Model (GMM) actually builds a model with probabilities associated with each data point, it is possible to extract the likelihood scores for every model and hence the Bayesian Information Criteria (BIC) scores can be calculated for each model. This BIC score can be used effectively to arrive at the optimum number of cluster. The model with lowest BIC score is the best model and the number of components in the model is the optimum number of clusters. The python code and the associated plot shows that the optimum number of clusters is 5

data_scaled = StandardScaler().fit_transform(data.drop(['Channel', 'Region'], axis=1))

# Optimize best cluster solution with BIC Scores
bic = []
for i in range(1, 21):
    gmm = GaussianMixture(n_components=i, n_init=5, random_state=42)
    gmm.fit(data_scaled)
    bic.append(gmm.bic(data_scaled))

plt.plot(range(1, 21), bic)
plt.xlabel('Number of components')
plt.ylabel('BIC')
plt.show()


For visualization purpose, the contour plots of the clusters are created along with uncertainty measures of data. As mentioned above uncertainty shows the level of uncertainty associated with a data point when it associated with a cluster having highest probability of association. The higher the uncertainty, the higher the possibility of the data point getting miss-clustered. This is important for customer segmentation. The below code and the plot shows the same. TSNE is used to plot the high dimensional data into 2 dimensions.

best_gmm = GaussianMixture(n_components=5, n_init=5, random_state=42)
best_gmm.fit(data_scaled)

clusters = best_gmm.predict(data_scaled)

cluster_prob = best_gmm.predict_proba(data_scaled)
uncertainty = 1 - np.max(cluster_prob, axis=1)

tsne = TSNE(n_components=2, random_state=42)
tsne_data = tsne.fit_transform(data.loc[:,'Fresh':'Delicassen'])

# Countour plot creation
xx, yy = np.meshgrid(np.linspace(-20, 30, 100), np.linspace(-20, 30, 100))
grid_data = np.array([xx.ravel(), yy.ravel()]).T
gmm_tsne = GaussianMixture(n_components=5, n_init=5, random_state=42)
gmm_tsne.fit(tsne_data)
zz = gmm_tsne.score_samples(grid_data).reshape(xx.shape)

plt.contourf(xx, yy, zz, alpha=0.3, cmap='tab20', levels=100)

# Plot scatter plot on the contour plot
plt.scatter(tsne_data[:, 0], tsne_data[:, 1], c=clusters, cmap='tab20', 
            s=uncertainty*100)
plt.xlabel('t-SNE 1')
plt.ylabel('t-SNE 2')
plt.show()


In the above plot, the larger the size of the bubble, larger the uncertainty. Data points with higher uncertainty need to be analyzed separately.

Conclusion

EM clustering is essentially a Gaussian Mixture Model which models each cluster as a sample from a specific Gaussian with its own set of parameters. KMeans clustering is, in fact, a special case of EM clustering. EM clustering gives many important insights about the clusters which KMeans clustering does not give directly. However, EM clustering is slower compared to KMeans clustering and it is not so scalable as KMeans clustering. But, a careful marketing analyst would, probably, want to see various aspects of the clusters formed before deciding next marketing moves and in such situations, EM clustering can be quite helpful.

Thursday, 19 December 2024

EM Algorithm and its usage (Part 1)

Maximum likelihood estimate is a very powerful and convenient method of estimating parameters of a model. A detail of the same is given in an earlier article. This method is used in several statistical models to estimate the model parameters in an unbiased way. However, a key requirement of this method is to have complete information related to the data, e.g., the distribution to follow and other aspects related to the distribution. But, in real life cases, it is not unusual to have datasets for which some of the information are hidden and yet it is required to estimate parameters of a model to extract the pattern. A simple scenario is mentioned below. 

Two coins problem:

Assume that there are two coins each of which is biased either towards head or tail. Let us assume that coin A is biased towards head and coin B is biased towards tail. At random, one coin is chosen and it is tossed for 10 times. This experiment is done 5 times. The below image shows how easy it is to calculate the level of biasness when all information are known, that is which coin was chosen and how many heads and tails are observed. 


The estimates are nothing but Maximum Likelihood Estimates based on the available data and information. The same problem becomes quite complicated if one information is kept hidden, i.e., which coin was chosen. In such a situation, working with MLE is not advisable as MLE is bad at dealing with such situation. However, with Expectation Maximization (EM) algorithm, it is possible to estimate the parameters in an iterative way. As the name suggests, EM algorithm has an expectation step and a maximization step. The primary objective of the expectation step is calculating probabilities associated with the data points so that using the probabilities, an expression of likelihood can be found out. The likelihood function is then maximized by optimizing the parameters. The parameters then take part is recalculating the probabilities for each data point and the likelihood value is again maximized by optimizing the parameters. Thus, in an iterative manner the parameters keep updating until a breaking point (or threshold limit) is reached.  


The problem of two coins can be solved with EM algorithm when the critical information is mission, i.e., which coin was chosen for tossing. For applying the algorithm, an initialization is needed which is mostly done in a random manner. Afterward, the iterative process commences. The process is shown in the diagram below.


In the above diagram, how probabilities are calculated is not mentioned explicitly. This is mentioned in the below diagram for the first data point where there were 5 heads and 5 tails. $\theta_A^{(0)}=0.6$ and $\theta_B^{(0)}=0.5$. 

After 10 iterations, the values converged to 0.8 for coin A and 0.52 for coin B. Obviously, the number of experiments was less; hence, the iterative algorithm shows both coins as biased towards heads. The same experiment can be replicated with a Python routine. The code is given below:

import numpy as np
import scipy
import scipy

# Generate Data
'''
    Each coin is tossed 100 times and the experiment is done 120 times, 60 times for each coin
'''
coinA = np.random.binomial(n=1, p=0.8, size=(60,100)) # probability of head = 0.8
coinB = np.random.binomial(n=1, p=0.35, size=(60,100)) # probability of head = 0.35
data = np.vstack([coinA, coinB])

index = np.array(range(data.shape[0]))
np.random.shuffle(index)
data = data[index]

# Create frequency table for heads and tails counts
n_heads = data.sum(axis=1)
n_tails = data.shape[1] - n_heads
freq_table = np.hstack((n_heads.reshape(-1,1), n_tails.reshape(-1,1)))

# EM Algorithm starts
# --------------------

# Initialization
thetaA = 0.6
thetaB = 0.3
n_iter = 20

for it in range(n_iter):

    # E-Step starts:
    p_headsA = np.array([scipy.stats.binom.pmf(r, data.shape[1], thetaA) for r in freq_table[:,0]])
    p_headsB = np.array([scipy.stats.binom.pmf(r, data.shape[1], thetaB) for r in freq_table[:,0]])

    P_A = p_headsA/(p_headsA + p_headsB)
    P_B = p_headsB/(p_headsA + p_headsB)

    expected_count_A = P_A.reshape(-1,1) * freq_table
    expected_count_B = P_B.reshape(-1,1) * freq_table

    # E-Step ends

    # M-Step starts
    expected_total_A = expected_count_A.sum(axis=0)
    expected_total_B = expected_count_B.sum(axis=0)

    thetaA = expected_total_A[0]/ expected_total_A.sum()
    thetaB = expected_total_B[0]/ expected_total_B.sum()

    # M-Step ends

print(f"Probability of head in coin A: {thetaA}")
print(f"Probability of head in coin A: {thetaB}")

It will be seen that the values are reaching closer to the input parameters for creating the dataset. This is the EM algorithm in Python for solving the two coin problems.

I hope that the explanations given are understandable.

Comments are welcom.

Monday, 27 May 2024

Maximum Likelihood Estimation (MLE): An important statistical tool for parameter estimation

Parameter estimation is critical for learning patterns within the data. Before the advancements in computation power, researchers used to do the calculations manually, and that is where iterative methods were never encouraged due to manual labour and hence the possibility of errors in calculation. During that time, those methods were preferred where the number of iterations was less. Maximum Likelihood Estimate (MLE) was one of the most preferred methods in this area and there were several algorithms whose parameters were estimated using MLE. In fact, this method of parameter estimation is still quite popular. In this blog post, I shall discuss about the math behind this wonderful method. 

Before using MLE in the estimation process, it is important to make an assumption on the distribution of either the data points or more importantly, the error terms. For example, to use MLE in estimating the parameters of a regression model, it is assumed that the error terms are normally distributed and, in logistic regression, it is assumed that the target variable is binomially distributed with some probability of success. Without such assumptions, it is impossible to use MLE. To use MLE, it is important to arrive at the expression of likelihood which is nothing but the joint probability of observing the data. To understand it mathematically, $$L=P(X_1, X_2, X_3,...,X_n)$$ where $[X_1, X_2, X_3, ..., X_n] \in P(X_i|\theta) \forall i={1,2,3,...,n}$. Here, $\theta$ is the parameter set of the probability function $f(.)$. For simplicity, in many algorithms, it is assumed that $X_i$s are all independently and identically distributed. That way, the expression of likelihood changes to a rather simpler expression because of the concept of independent events, i.e., $$L=P(X_1, X_2, X_3,..., X_n)=\prod_{i=1}^nP(X_i|\theta)$$ Here $\prod_{i=1}^n(.)$ is the product of individual elements within $()$. Once the expression of likelihood is correctly identified, the likelihood function becomes the function of the unknown parameters $\theta$ because all $X_i$s are already observed and that's why, they are constants within the likelihood function. Hence, even if likelihood points towards the probability of occurrence of events together, the likelihood function is not actually a probability function. Since likelihood is a function of parameters, maximum likelihood estimate focuses on estimating the parameters of the likelihood function such that the same function could attain the maximum value. In other words, the parameter set estimated through MLE will maximize the joint probability of occurrence of the observed instances. Mathematically, $$\hat \theta^{MLE} = \underset{\theta}{\operatorname {argmax}}\prod_{i=1}^nP(X_i|\theta) $$ So, if there are $k$ number of parameters to be estimated, by applying the rule of calculus, partial derivatives of likelihood function $L$ is equated to zero for all values of $\theta_j, \forall j={1,2,3,...,k}$. That is $$\frac {\partial L}{\partial \theta_j}=0, \forall j={1,2,3,...,k}$$ This process is consistent with the concept of finding maxima (or minima) of a function concerning the underlying parameter. Thus, for $k$ parameters, there would be $k$ equations and hence a unique solution can be found out. Sometimes, the equations are not closed-form equations (i.e., the parameter is a function of itself as in $x = sin(x+y)$). In such situations, numerical methods are required to be adopted (e.g., logistic regression) to estimate the parameters. Let us now use MLE in some cases.

Case 1

A coin is tossed 30 times and it is observed that heads came up 21 times. What is the probability of getting a head when it is tossed the next time?

Solution:

The solution is rather easy $p=\frac {21}{30}=0.70$. But this expression $p=\frac {\#\ of\ heads}{\#\ of\ trials}$ itself is a typical MLE of the probability of success. Let us try to derive it. Let $x_1, x_2, x_3,...,x_30$ are the 30 trials. We assume that the trials are independent of each other and identically distributed (which is quite fair in this situation!). Since there are only two outcomes, the entire process can be assumed to follow a binomial distribution. The probability of observing 21 heads out of 30 trials can be expressed as $$P(heads=21|p, 30)={30\choose 21}p^{21}(1-p)^{(30-21)}$$ here $p$ is the required probability of heads to be estimated. In the case of a discrete distribution like a binomial distribution, this expression is nothing but a probability mass function ($pmf$). It captures all the possible ways 21 heads can come up within 30 trials. If we take only one instance out of all the possible cases, the combination operator ${30 \choose 21}$ will go away and only $p^{21}(1-p)^{(30-21)}$ will be left out. Since 30 trials are run only once, we can assume that the likelihood function is $$L=p^{21}(1-p)^{9}$$ Once we obtain the likelihood function, then we can proceed with MLE as shown below: 

$$ \begin{align} L &= p^{21}(1-p)^{9}\\ \frac {\partial L}{\partial p}&= 21p^{(21-1)}(1-p)^9 + p^{21}(1-p)^{9-1}(-1)=0\\ &\implies 21p^{20}(1-p)^9 = p^{21}(1-p)^8 \\ &\implies 21(1-p)=p\\ &\implies p = \frac{21}{30}\end{align}$$ So, the result obtained earlier and with MLE are same. 

Case 2

Estimate the parameters of a linear regression model using MLE

Solution:

Let us assume that there is a dataset $D$ with $n$ data points in a $m$ dimensional space. Hence, $D_i=(X_i,y_i),\ \forall i = {1,2,3,...,n}$ and $X_i \in R^m$. This is a typical case of multiple regression where there are more than one predictor variable and one target variable. The linear regression model can be written as $$\hat y_i=\beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \beta_3x_{i3} + ... + \beta_mx_{im}$$ According to the above equation $\hat y_i$ is a linear combination of predictor variables (The linearity assumption). It is to be noted that $X_i = \{x_{i1}, x_{i2}, x_{i3}, ..., x_{im}\}$ where $x_{ij}$ is the $i^{th}$ value of the $j^{th}$ column. $\beta_0$ is the intercept. Thus, there are $m+1$ parameters to be estimated for this model. To apply, MLE, we need to make assumption about the distribution of an entity. Here comes another important assumption of linear regression, the errors are normally distributed. If $e_i$ is the error in prediction, $e_i = y_i - \hat y_i$ and as per the second assumption, $e_i \sim N(0, \sigma^2)$. Since error terms are normally distributed, another important assumption comes in implicitly, that is, error variance is homoskedastic in nature, meaning that the error terms are having uniform variations across the spectrum of the predictor variables. Or simply speaking, if error terms are plotted with respect to the target variable or any predictor variable, the scatter plot should look like a white noise without any pattern. The below image is helpful in understanding the meaning of homoskedasticity of error variance. 

Source: https://www.originlab.com/doc/Origin-Help/Residual-Plot-Analysis
With these three assumptions, it is possible to estimate the parameters of the linear regression model. For this we note that if $x \sim N(\mu, \sigma^2)$, $$P(x|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi \sigma^2}}exp \left( - \frac {\parallel x-\mu \parallel^2}{2\sigma^2}\right )$$ where $\mu$ represents the mean and $\sigma^2$ represents the variance of x respectively. Hence, if $e_i$ follows a normal distribution with 0 mean and $\sigma^2$ variance, $$\begin{align}P(e_i|0, \sigma^2) &= \frac{1}{\sqrt{2\pi \sigma^2}}exp \left( - \frac {\parallel e_i-0 \parallel^2}{2\sigma^2}\right ) \\ &= \frac{1}{\sqrt{2\pi \sigma^2}}exp \left( - \frac {e_i^2}{2\sigma^2}\right )\end{align}$$ Since there are $n$ data points within the dataset D, likelihood function of error is given by $$L = \prod_{i=1}^nP(e_i|0,\sigma^2)$$ The above expression is possible only if individual $e_i$ values are uncorrelated with each other. This is the fourth important assumption in linear regression, i.e., error terms are uncorrelated with each other (or absence of autocorrelation of error terms). Since $log$ is a monotonous function, if $log(f(x|\theta))$ attains maximum at some $\theta^*$, the function $f(x|\theta)$ would also attain maximum at the same $\theta^*$. Since likelihood deals with product of probabilities, we take $log$ of likelihood to convert product of probabilities into addition of log probabilities. Compared to multiplication, addition is easier to deal with while using calculus. Hence, we apply MLE on the log likelihood rather than the likelihood function. If $l$ is denoted as the log likelihood, 
$$\begin{align} l &= \sum_{i=1}^nln(P(e_i|0,\sigma^2)) \\ &= \sum_{i=1}^n ln \left [\frac{1}{\sqrt{2\pi \sigma^2}}exp \left( - \frac {e_i^2}{2\sigma^2}\right ) \right ] \\ &= \sum_{i=1}^n ln \left [\frac{1}{\sqrt{2\pi \sigma^2}}\right ] + \sum_{i=1}^n ln \left [exp \left( - \frac {e_i^2}{2\sigma^2}\right ) \right ] \\ &= Const + \sum_{i=1}^n \left [-\frac{e_i^2}{2\sigma^2} \right ] \\ &= Const - \frac{1}{2\sigma^2} \sum_{i=1}^n e_i^2 \end{align}$$ As per the above expression, the only way to maximize $l$ is to minimize $ \sum_{i=1}^n e_i^2$ because error variance $\sigma^2$ is constant. Now $ \sum_{i=1}^n e_i^2$ is nothing but the sum of square of error terms. This means that log likelihood is going to be maximized if the sum of square of error (popularly denoted as SSE) is minimized. Now, $$\begin{align}\sum_{i=1}^n e_i^2&=\sum_{i=1}^n (y_i-\hat y_i)^2 \\ &= \sum_{i=1}^n (y_i-X_i\beta )^2 \\ &= (y-X\beta)^T(y-X\beta) \ \ \ \ \ \ \ in\ matrix\ format\ where\ \beta=\{\beta_0, \beta_1, \beta_2,...\beta_m\} \\ &=(y^T-(X\beta)^T)(y-X\beta) \\ &= (y^T - \beta^TX^T)(y-X\beta) \\ &= y^Ty - y^TX\beta -\beta^TX^Ty + \beta^TX^TX\beta \end{align}$$ The above function is a function of parameters $\beta$ and hence, to minimize the SSE, we need to differentiate the above expression with respect to $\beta$. This is where matrix calculus will come into play and we write $$\begin{align}\frac{\partial}{\partial \beta}\sum_{i=1}^n e_i^2 &=  \frac{\partial}{\partial \beta}(y^Ty - y^TX\beta -\beta^TX^Ty + \beta^TX^TX\beta) \\ &= 0 - X^Ty - X^Ty + 2X^TX\beta\ \ \ \ because\ X^TX\ is\ square\ symmetric \\ &= -2X^Ty + 2X^TX\beta \end{align}$$For minima, $\frac{\partial}{\partial \beta}\sum_{i=1}^n e_i^2=0$ for all $\beta_j$ values where $j=\{0,1,2,3,...,m\}$. Thus we get, $$-2X^Ty + 2X^TX\beta=0$$. After rearrangements, we get $$X^TX\beta = X^Ty$$ $$\implies \beta = (X^TX)^{-1}X^Ty$$The term $(X^TX)^{-1}X^T$ is called the pseudo inverse. The result that has been got is same as that of Ordinary Least Square (OLS) method. Another important assumption is hidden in this expression. If the columns are having multi-collinearity, $det(X^TX)$ would be very small and the inverse will have very high (or even infinite) values. That would make the model unreliable and would render rather useless for practical purposes. Hence, one more critical assumption come into play, the variables are uncorrelated with each other. Thus, linear regression comes with 5 very critical assumptions:
  1. The model comprises of linear combination of the predictor variables
  2. The error terms of the model are normally distributed
  3. The error terms have constant variance across the levels of the predictor variables (error variance is homoskedastic in nature)
  4. Error terms do not have any auto-correlation with themselves
  5. Variables are uncorrelated with each other (no multicollinearity)

With MLE of parameters, the likelihood value can also be found out and with this value two more metrics can be looked at to compare model's performance. These are Akaike Information Criteria (AIC) and Bayesian Information Criteria(BIC). $$AIC= 2\beta -2ln(Likelihood)$$ $$BIC=\beta ln(n) - 2ln(Likelihood)$$. These two criteria play important roles in deciding the best statistical model out of different models with different number of parameters to estimate. BIC is however more stringent than AIC. A classic use of AIC or BIC is in stepwise regression model generation where the important variables are kept within the model and least important variables are often discarded. 

In this post, I have tried to make the readers understand the process of MLE and how it is used in parameter estimation.  I hope you will find it useful to clarify the concepts.

Friday, 17 May 2024

Understanding the math behind Principal Component Analysis (PCA)

 Let us consider a dataset D in an \(m\) dimensional feature space having \(n\) samples. Mathematically, \(D=\{x_i,y_i\}\) where \(x_i \in R^m\). PCA is unsupervised in nature and hence it does not involve \(y_i\). It is quite possible that the features are having some level of linear correlation and hence, from pure mathematical consideration, the features are not orthogonal to each other. Some of the features may have very high level of multicollinearity whereas the other features may stay reasonably uncorrelated with each other. In PCA, we must consider linear correlation only (i.e., Pearson Correlation Coefficient). PCA can be considered as the process of extracting new features which are aligned along the direction of the stretch (or spread) of the data points. Let us try to understand the same with codes and diagrams. For better understanding, let us focus on a 2D space. The code below creates the datapoints and the plot shows the scatter.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

np.random.seed(20)
x = np.random.normal(10,2, size=10000)
y = 4*x + np.random.normal(2,4, size=10000)

z = np.c_[x,y]

sns.scatterplot(x=x,y=y)
plt.xlabel("X")
plt.ylabel("Y")
plt.show()
It is quite evident that the relationship between X and Y is linear. X and Y are orthogonal in this diagram and from a linear regression point of view, this scatter is desirable when Y is the target variable and X is the predictor variable. But, if both X and Y are predictor variables, then this scatter is going to create trouble due to the concept of multicollinearity of features. In the pattern recognition concept, it is quite common to transfer the data from an existing (or input) feature space to a new (or mapped) feature space. It is popularly known as feature engineering. So, here comes the question, "What is the meaningful way of shifting the data from the current input space to a new feature space?". For instance, have a look at the following new feature spaces.
Out of the four different feature spaces, which one is most meaningful? The answer is, probably the first one. This is because of the fact that the features are trying to align them along the direction of the spread of the data. But, can it be optimized even more? The answer is yes. We apply PCA.

When we want to transfer data points from one feature space to another feature space with linear mapping (i.e., by using a linear function), we need a transformation matrix. Let us assume that U is the required transformation matrix such that, \[ U=\left [\matrix {|&|&|&&&| \\ u_i & u_2 & u_3 & . & . & u_m \\ |&|&|&&&| } \right ] \] where \(u_i\) is a column vector with unit length. It is also to be understood that \(u_i\) is a vector representing the \(i^{th}\) axis of the new feature space. If a simple dot product is taken between \(X_i=[x_{i1}, x_{i2}, x_{i3},...,x_{im}]\) and \(U\), the resultant outcome will be the projection of the \(i^{th}\) data point in the input space to the \(i^{th}\) data point in the new feature space. The data point will remain in the same location, only the coordinates will change due to feature space mapping. Since the pattern within the data will remain the same irrespective of such linear feature space mapping, the pattern will also remain unchanged with shifting of origin to any arbitrary location linearly. Hence, it is convenient to shift the origin to the centroid of the dataset. In the above image also, the same is done for the new features, i.e., the origin of the new feature space is located at the centroid of the dataset. Let \(X_c\) be the centered dataset such that the origin is located at the centroid of the dataset.

Now we can focus on the objective of PCA. Our objective is to align the new features along the stretch (or spread) of the dataset. That is, along a \(j^{th}\) dimension, the variance explained should be maximum. Let \([a_{j1}, a_{j2}, a_{j3},...,a_{jn}] = <X_c, U_j>\) are the projections of the data point in the input space to the \(j^{th}\) dimension of the new feature space. Let \(\sigma_j^2\) is the variance along the \(j^{th}\) dimension in the new feature space. It is to be noted that as the origin was shifted to the centroid of the dataset, \[\mu_j = \frac{1}{n}\sum_{i=1}^n a_{ji} =0 \]

Now, we can calculate the variance of \([a_{ij}]\ \forall i \in [i,2,3,...,n]\) as: 

$$ \begin{align} \sigma_j^2 & = \frac{1}{n}\sum_{i=1}^n (a_{ij}-\mu_j)^2 \\ & = \frac{1}{n}\sum_{i=1}^n (a_{ij})^2 \\ & = \frac{1}{n} (X_cU_j)^T(X_cU_j) \\ & = \frac{1}{n}U_j^TX_c^TX_cU_j \\ & = U_j^T\left (\frac{X_c^TX_c}{n}\right )U_j \\ & = U_j^T \Sigma U_j \end{align} $$

Note that when the original data is centered, $\frac{X_c^TX_c}{n}$ is the covariance matrix and the covariance matrix is a square symmetric matrix. The objective function is to maximize $\sigma_j^2$ subjected to $U_j^TU_j=1$ because $U_j$ is a unit vector. This is a constrained optimization problem which can be solved by using LaGrange's multiplier method. $$L=U_j^T\Sigma U_j - \alpha_j(U_j^TU_j-1)$$ where $\alpha$ is the Lagrange's Multiplier. The objective is to find $U_j$ that maximizes $\sigma_j^2$ and hence, we need to equate $\frac{\partial L}{\partial U_j}=0$. This leads to $$2\Sigma U_j - 2\alpha_j U_j = 0$$ From here, it is clear that $$\Sigma U_j = \alpha_j U_j$$ The above equation is well known as the eigen decomposition of a square matrix. This means that the principal component $U_j$ is essentially the eigenvector of the covariance matrix. Since the covariance matrix is a square symmetric matrix of size $(m \times m)$, there will be exactly $m$ eigenvectors with $m$ eigenvalues. Putting the value of $\Sigma U_j$ is the expression of $\sigma_j^2$, we see that $$\sigma_j^2=U_j^T\Sigma U_j = U_j^T\alpha U_j=\alpha_j U_j^TU_j = \alpha_j$$ Thus, the eigenvalue represents the amount of variance explained by the respective principal component along the $j^{th}$ dimension. An added benefit from this analysis is that the principal components will be perfectly orthogonal to each other because eigenvectors of a square symmetric matrix are always orthogonal to each other. 

Covariance is not bounded and the variable which has higher variance always tries to dominate over other variables (or features). Hence, instead of covariance, a correlation matrix is preferred which is essentially the covariance matrix of the standardized data. 

The final principal components for the sample data are shown in the figure along with the codes.

# find correlation coefficient of z
cor_mat = np.corrcoef(z, rowvar=False)
eig_val, eig_vec = np.linalg.eig(cor_mat)

x_mean = x.mean()
y_mean = y.mean()

plt.figure(figsize=(6,6))
sns.scatterplot(x=x,y=y)
plt.quiver([x_mean, x_mean], [y_mean, y_mean],  eig_vec[0,:], eig_vec[1,:],
           [3,10], scale=5, cmap='flag')
plt.xlabel("X")
plt.ylabel("Y")
plt.show()

 

PCA as a method of dimension reduction

 

After eigen decomposition of the correlation matrix, we also get another interesting outcome. $$\Sigma_{j=1}^m\sigma_j^2 = \Sigma_{j=1}^m\alpha_j = trace(Corr Matrix)=m$$ After PCA, the eigenvalues are sorted in a descending manner and the eigenvectors are also rearranged with respect the sorted eigenvalues. Then we choose $r$ number of principal components such that $$\frac{1}{m}\Sigma_{j=1}^r\alpha_j \ge \epsilon$$ with the minimum value of $r$ so that the total variance retained is at least $\epsilon$ percentage of the total variance. Since $r < m$, with r number of mutually independent dimensions $m$ dimensional dataset is represented with some loss of information. Apart from this method, there is a criterion called the Kaiser criteria that can be used to extract a lesser number of dimensions. As per the Kaiser criteria, only those eigenvectors are extracted for which the corresponding eigenvalues are more than 1. The third method is the choice of the analyst where the analyst chooses the best number of principal components. 

Implicit Assumptions


Since PCA deals with the covariance matrix, it is associated with the implicit assumptions:
  1. The variables should be normally distributed (so that mean and variance make sense)
  2. Principal components are the linear combination of the input variables
I hope that the readers will find this post helpful in understanding the concept.

Comments are welcome!






Thursday, 16 May 2024

Importatnt Probability Distribution (Part 2)

MathJax example

In the previous post we had seen binomial and Poisson distributions. They are discrete in nature. Similarly, continuous probability distributions are also there. One of the most important distribution is normal distribution. Normal distribution for a one dimensional data has two parameters, i.e. mean \(\mu\) and standard deviation \(\sigma\). The probability density function (pdf) is defined as: \[f(x|\mu,\sigma)=\frac{1}{\sqrt{2\pi\sigma^2}} e^{[-\frac{(x-\mu)^2}{2\sigma^2}]}\] For a standard normal distribution, \(\mu\) = 0 and \(\sigma\) = 1 which leads to: \[ f(x|0,1) = \frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}} \] However, for multivariate scenario, the expression becomes complex. For multivariate normal distribution, we need to look at the covariance matrix \(\Sigma\) and the mean vector \(\mu\). The expression of pdf changes to: \[ f(x|\mu, \sigma) = \frac{1}{{(2\pi)^{d/2}}|\Sigma|^{\frac{1}{2}}}exp[-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)]\] where \(d\) is the dimension of the multivariate distribution. Normal distribution has applications in hypothesis testing in statistical inferences. More importantly, when the population standard deviation is known, this distribution can be used effectively to determine if sample means of two samples are statistically significantly different or not. However, in most of the situations, population standard deviation is not known. Hence, we need to work with another distribution that behaves similar to a normal distribution when the sample size is more and yet can be used in situations where the sample size is small (say 10) and more importantly, when the population standard deviation is unknown. Student t distribution has all the required characteristics to deal with the above three situations and hence, for statistical hypothesis tests, student t distribution is more preferred. The equation of student t distribution is given by: \[ \frac{\Gamma \left(\frac{\nu +1}{2}\right)}{\sqrt{\pi \nu}\Gamma(\frac{\nu}{2})}\left(1+\frac{x^2}{\nu}\right)^{-(\frac{\nu+1}{2})}\] where \(\nu\) is the degree of freedom and \(\Gamma[.]\) is the Gamma function. The above function is applicable for one dimensional data. For multivariate situation, the same expression changes to: \[ \frac{\Gamma \left(\frac{\nu +p}{2}\right)}{(\pi \nu)^{\frac{p}{2}}\Gamma(\frac{\nu}{2})}\left(1+\frac{(x-\mu)^T\Sigma^{-1}(x-\mu)}{\nu}\right)^{-(\frac{\nu+p}{2})}\] where \(p\) is the dimension of the data and \(\nu\) is the degree of freedom. If both the multivariate functions of both the distributions are observed carefully, they both have the component \((x-\mu)^T\Sigma^{-1}(x-\mu)\) and this part is, essentially, the squared Mahalanobis distance. The characteristics of this distance is that it is unitless and it is scale-invariant in nature. Hence, if the data under consideration is following a multivariate normal distribution, Mahalanobis distance can  be used to detect the presence of multivariate outliers. 

To understand the use, a synthetic dataset is created with only two variables so that the spread can be shown diagrammatically. The codes for generating data and viewing them is shown below.

import numpy as np
import seaborn as sns
import pandas as pd

n_inliers = 100
n_outliers = 20
n_features = 2

# Generate covariance matrix for inlier data and outlier data
inliers_cov_mat = np.array([[2,0],[0,1]])
outliers_cov_mat = np.array([[5,0],[0,2]])

np.random.seed(1) # Fix the seed for consistent output

# generate inlier and outliers data points
inlier_data = np.random.normal(size=(n_inliers, n_features)) @ inliers_cov_mat
outlier_data = np.random.normal(size=(n_outliers, n_features)) @ outliers_cov_mat

# Join the data points
final_data = pd.DataFrame(np.vstack([inlier_data, outlier_data]), 
                          columns=['x','y'])
final_data['labels'] = np.repeat(['inliers','outliers'],[100,20]) # Labels only for identification

# Generate the scatter plot for visualization
sns.scatterplot(data=final_data, x='x', y='y', hue='labels')

The image shown below shows the scatter plot with normal (or inlier) data points and outliers. Note that not all the marked outliers are actually outliers. Some of the marked outliers are very much within the region of inliers. But there are several outliers existing within the dataset. The objective here is to identify them using statistical outlier detection method, i.e., by using Mahalanobis Distance measures. This method of outlier detection is multivariate in nature and hence more appropriate while analyzing multivariate data.

Covariance matrix play an important role in calculating Mahalanobis distance and covariance is also influenced by the presence of outliers. That is why, multiple covariance matrices are estimated based on samples of the data and that covariance matrix is chosen which has the lowest determinant score. The smaller the determinant score, the closely packed the data points, i.e., plausible inliers. Python package scikit-learn has the MinCovDet (MCD) class for the said purpose. It tries to find \(\frac{n_{samples}+n_{features}+1}{2}\) samples which has the lowest determinant score, leading to a pure subset of samples from the existing dataset. We apply the same method on the above mentioned dataset to get a robust estimation of the covariance matrix that can be used subsequently to detect outliers. The code snippet is given below:
 
from sklearn.covariance import MinCovDet

mcd = MinCovDet(random_state=123)
mcd.fit(X=final_data.loc[:,['x','y']])
print(f"The robust covariance matrix is: \n\n{mcd.covariance_}")

# OUTPUT
# ======
# The robust covariance matrix is: 

# [[2.55655394 0.05976092]
#  [0.05976092 0.9295315 ]]

It can be seen that the covariance matrix is similar to that of the covariance matrix created for the inliers. The 'mcd' object also contains the Mahalanobis distance of the data points in the input dataset. This distance is calculated with respect the mean vectors associated with the data points used in calculating the robust covariance matrix. These distances are one dimensional in nature and hence, the outliers can be extracted based on boxplot or IQR based methods. The following code snippet shows the final outlier detection process.

def extract_univariate_outlier_IQR(array_1D):
    percentile_values = np.percentile(array_1D, q=[25,75])
    iqr = percentile_values[1] - percentile_values[0]
    lower_cutoff = percentile_values[0] - 1.5*iqr
    upper_cutoff = percentile_values[1] + 1.5*iqr
    out = np.where((array_1D > upper_cutoff) | (array_1D < lower_cutoff), 1, 0 )
    return out
    

outliers = extract_univariate_outlier_IQR(mcd.dist_)

# Generate the scatter plot for visualization
sns.scatterplot(data=final_data, x='x', y='y', hue=outliers)



The above plot clearly shows the outliers. It is to be noted that this process is applicable for unimodal distribution. For multimodal distribution, this process would be erroneous. In such scenarios, other methods such as Local Outlier Factor, Isolation Forest or even SVM one class novelty detection can be used.

In this post, I have tried to connect use of normal distribution to one important analysis in machine learning, i.e., outlier detection. I hope this would be helpful.  

EM Algorithm and its usage (Part 2) EM algorithm is discussed in the previous post related to the tossing of coins. The same algorithm is q...