Friday, 17 May 2024

Understanding the math behind Principal Component Analysis (PCA)

 Let us consider a dataset D in an \(m\) dimensional feature space having \(n\) samples. Mathematically, \(D=\{x_i,y_i\}\) where \(x_i \in R^m\). PCA is unsupervised in nature and hence it does not involve \(y_i\). It is quite possible that the features are having some level of linear correlation and hence, from pure mathematical consideration, the features are not orthogonal to each other. Some of the features may have very high level of multicollinearity whereas the other features may stay reasonably uncorrelated with each other. In PCA, we must consider linear correlation only (i.e., Pearson Correlation Coefficient). PCA can be considered as the process of extracting new features which are aligned along the direction of the stretch (or spread) of the data points. Let us try to understand the same with codes and diagrams. For better understanding, let us focus on a 2D space. The code below creates the datapoints and the plot shows the scatter.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

np.random.seed(20)
x = np.random.normal(10,2, size=10000)
y = 4*x + np.random.normal(2,4, size=10000)

z = np.c_[x,y]

sns.scatterplot(x=x,y=y)
plt.xlabel("X")
plt.ylabel("Y")
plt.show()
It is quite evident that the relationship between X and Y is linear. X and Y are orthogonal in this diagram and from a linear regression point of view, this scatter is desirable when Y is the target variable and X is the predictor variable. But, if both X and Y are predictor variables, then this scatter is going to create trouble due to the concept of multicollinearity of features. In the pattern recognition concept, it is quite common to transfer the data from an existing (or input) feature space to a new (or mapped) feature space. It is popularly known as feature engineering. So, here comes the question, "What is the meaningful way of shifting the data from the current input space to a new feature space?". For instance, have a look at the following new feature spaces.
Out of the four different feature spaces, which one is most meaningful? The answer is, probably the first one. This is because of the fact that the features are trying to align them along the direction of the spread of the data. But, can it be optimized even more? The answer is yes. We apply PCA.

When we want to transfer data points from one feature space to another feature space with linear mapping (i.e., by using a linear function), we need a transformation matrix. Let us assume that U is the required transformation matrix such that, \[ U=\left [\matrix {|&|&|&&&| \\ u_i & u_2 & u_3 & . & . & u_m \\ |&|&|&&&| } \right ] \] where \(u_i\) is a column vector with unit length. It is also to be understood that \(u_i\) is a vector representing the \(i^{th}\) axis of the new feature space. If a simple dot product is taken between \(X_i=[x_{i1}, x_{i2}, x_{i3},...,x_{im}]\) and \(U\), the resultant outcome will be the projection of the \(i^{th}\) data point in the input space to the \(i^{th}\) data point in the new feature space. The data point will remain in the same location, only the coordinates will change due to feature space mapping. Since the pattern within the data will remain the same irrespective of such linear feature space mapping, the pattern will also remain unchanged with shifting of origin to any arbitrary location linearly. Hence, it is convenient to shift the origin to the centroid of the dataset. In the above image also, the same is done for the new features, i.e., the origin of the new feature space is located at the centroid of the dataset. Let \(X_c\) be the centered dataset such that the origin is located at the centroid of the dataset.

Now we can focus on the objective of PCA. Our objective is to align the new features along the stretch (or spread) of the dataset. That is, along a \(j^{th}\) dimension, the variance explained should be maximum. Let \([a_{j1}, a_{j2}, a_{j3},...,a_{jn}] = <X_c, U_j>\) are the projections of the data point in the input space to the \(j^{th}\) dimension of the new feature space. Let \(\sigma_j^2\) is the variance along the \(j^{th}\) dimension in the new feature space. It is to be noted that as the origin was shifted to the centroid of the dataset, \[\mu_j = \frac{1}{n}\sum_{i=1}^n a_{ji} =0 \]

Now, we can calculate the variance of \([a_{ij}]\ \forall i \in [i,2,3,...,n]\) as: 

$$ \begin{align} \sigma_j^2 & = \frac{1}{n}\sum_{i=1}^n (a_{ij}-\mu_j)^2 \\ & = \frac{1}{n}\sum_{i=1}^n (a_{ij})^2 \\ & = \frac{1}{n} (X_cU_j)^T(X_cU_j) \\ & = \frac{1}{n}U_j^TX_c^TX_cU_j \\ & = U_j^T\left (\frac{X_c^TX_c}{n}\right )U_j \\ & = U_j^T \Sigma U_j \end{align} $$

Note that when the original data is centered, $\frac{X_c^TX_c}{n}$ is the covariance matrix and the covariance matrix is a square symmetric matrix. The objective function is to maximize $\sigma_j^2$ subjected to $U_j^TU_j=1$ because $U_j$ is a unit vector. This is a constrained optimization problem which can be solved by using LaGrange's multiplier method. $$L=U_j^T\Sigma U_j - \alpha_j(U_j^TU_j-1)$$ where $\alpha$ is the Lagrange's Multiplier. The objective is to find $U_j$ that maximizes $\sigma_j^2$ and hence, we need to equate $\frac{\partial L}{\partial U_j}=0$. This leads to $$2\Sigma U_j - 2\alpha_j U_j = 0$$ From here, it is clear that $$\Sigma U_j = \alpha_j U_j$$ The above equation is well known as the eigen decomposition of a square matrix. This means that the principal component $U_j$ is essentially the eigenvector of the covariance matrix. Since the covariance matrix is a square symmetric matrix of size $(m \times m)$, there will be exactly $m$ eigenvectors with $m$ eigenvalues. Putting the value of $\Sigma U_j$ is the expression of $\sigma_j^2$, we see that $$\sigma_j^2=U_j^T\Sigma U_j = U_j^T\alpha U_j=\alpha_j U_j^TU_j = \alpha_j$$ Thus, the eigenvalue represents the amount of variance explained by the respective principal component along the $j^{th}$ dimension. An added benefit from this analysis is that the principal components will be perfectly orthogonal to each other because eigenvectors of a square symmetric matrix are always orthogonal to each other. 

Covariance is not bounded and the variable which has higher variance always tries to dominate over other variables (or features). Hence, instead of covariance, a correlation matrix is preferred which is essentially the covariance matrix of the standardized data. 

The final principal components for the sample data are shown in the figure along with the codes.

# find correlation coefficient of z
cor_mat = np.corrcoef(z, rowvar=False)
eig_val, eig_vec = np.linalg.eig(cor_mat)

x_mean = x.mean()
y_mean = y.mean()

plt.figure(figsize=(6,6))
sns.scatterplot(x=x,y=y)
plt.quiver([x_mean, x_mean], [y_mean, y_mean],  eig_vec[0,:], eig_vec[1,:],
           [3,10], scale=5, cmap='flag')
plt.xlabel("X")
plt.ylabel("Y")
plt.show()

 

PCA as a method of dimension reduction

 

After eigen decomposition of the correlation matrix, we also get another interesting outcome. $$\Sigma_{j=1}^m\sigma_j^2 = \Sigma_{j=1}^m\alpha_j = trace(Corr Matrix)=m$$ After PCA, the eigenvalues are sorted in a descending manner and the eigenvectors are also rearranged with respect the sorted eigenvalues. Then we choose $r$ number of principal components such that $$\frac{1}{m}\Sigma_{j=1}^r\alpha_j \ge \epsilon$$ with the minimum value of $r$ so that the total variance retained is at least $\epsilon$ percentage of the total variance. Since $r < m$, with r number of mutually independent dimensions $m$ dimensional dataset is represented with some loss of information. Apart from this method, there is a criterion called the Kaiser criteria that can be used to extract a lesser number of dimensions. As per the Kaiser criteria, only those eigenvectors are extracted for which the corresponding eigenvalues are more than 1. The third method is the choice of the analyst where the analyst chooses the best number of principal components. 

Implicit Assumptions


Since PCA deals with the covariance matrix, it is associated with the implicit assumptions:
  1. The variables should be normally distributed (so that mean and variance make sense)
  2. Principal components are the linear combination of the input variables
I hope that the readers will find this post helpful in understanding the concept.

Comments are welcome!






No comments:

Post a Comment

EM Algorithm and its usage (Part 2) EM algorithm is discussed in the previous post related to the tossing of coins. The same algorithm is q...