
Friday, 17 May 2024

Understanding the math behind Principal Component Analysis (PCA)

Let us consider a dataset $D$ in an $m$-dimensional feature space having $n$ samples. Mathematically, $D = \{x_i, y_i\}$ where $x_i \in \mathbb{R}^m$. PCA is unsupervised in nature and hence does not involve $y_i$. It is quite possible that the features have some level of linear correlation and hence, from a purely mathematical standpoint, the features are not orthogonal to each other. Some of the features may have a very high level of multicollinearity whereas others may remain reasonably uncorrelated with each other. In PCA, we consider linear correlation only (i.e., the Pearson correlation coefficient). PCA can be viewed as the process of extracting new features that are aligned along the direction of the stretch (or spread) of the data points. Let us try to understand this with code and diagrams. For better understanding, let us focus on a 2D space. The code below creates the data points and the plot shows the scatter.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

np.random.seed(20)

# Two strongly correlated variables: y is a linear function of x plus Gaussian noise
x = np.random.normal(10, 2, size=10000)
y = 4*x + np.random.normal(2, 4, size=10000)

# Stack the two variables into an (n, 2) data matrix
z = np.c_[x, y]

sns.scatterplot(x=x,y=y)
plt.xlabel("X")
plt.ylabel("Y")
plt.show()
It is quite evident that the relationship between X and Y is linear. X and Y are orthogonal axes in this diagram and, from a linear regression point of view, this scatter is desirable when Y is the target variable and X is the predictor variable. But if both X and Y are predictor variables, this scatter creates trouble because of multicollinearity among the features. In pattern recognition, it is quite common to map the data from an existing (or input) feature space to a new (or mapped) feature space; this is popularly known as feature engineering. So here comes the question: "What is a meaningful way of shifting the data from the current input space to a new feature space?" For instance, have a look at the following new feature spaces.
Out of the four different feature spaces, which one is the most meaningful? The answer is probably the first one, because its features try to align themselves along the direction of the spread of the data. But can it be optimized even more? The answer is yes: we apply PCA.
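The figure with the four candidate feature spaces is not reproduced here, but the idea can be sketched in code: overlay a couple of candidate axis pairs on the centered scatter and judge visually which pair follows the spread. The angles 75 and 20 degrees below are arbitrary illustrative choices, not values computed from the data.

import numpy as np
import matplotlib.pyplot as plt

# Continue with x, y generated above; centre the cloud at the origin
xc, yc = x - x.mean(), y - y.mean()

fig, axes = plt.subplots(1, 2, figsize=(10, 5))
for ax, angle in zip(axes, [75, 20]):            # arbitrary candidate orientations (degrees)
    t = np.deg2rad(angle)
    u1 = np.array([np.cos(t), np.sin(t)])        # first candidate axis
    u2 = np.array([-np.sin(t), np.cos(t)])       # second axis, orthogonal to the first
    ax.scatter(xc, yc, s=2, alpha=0.3)
    L = 10                                       # arbitrary half-length for drawing the axes
    ax.plot([-L*u1[0], L*u1[0]], [-L*u1[1], L*u1[1]], 'r-')
    ax.plot([-L*u2[0], L*u2[0]], [-L*u2[1], L*u2[1]], 'g-')
    ax.set_title(f"Candidate axes at {angle} degrees")
plt.show()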

When we want to transfer data points from one feature space to another with a linear mapping (i.e., by using a linear function), we need a transformation matrix. Let us assume that $U$ is the required transformation matrix such that
$$U = \begin{bmatrix} | & | & | & & | \\ u_1 & u_2 & u_3 & \cdots & u_m \\ | & | & | & & | \end{bmatrix}$$
where $u_i$ is a column vector with unit length. It is also to be understood that $u_i$ is the vector representing the $i$th axis of the new feature space. If a simple dot product is taken between $X_i = [x_{i1}, x_{i2}, x_{i3}, \ldots, x_{im}]$ and $U$, the resultant outcome is the projection of the $i$th data point of the input space onto the new feature space. The data point remains in the same location; only its coordinates change due to the feature space mapping. Since the pattern within the data remains the same under such a linear mapping, it also remains unchanged when the origin is shifted linearly to any arbitrary location. Hence, it is convenient to shift the origin to the centroid of the dataset. In the above image also, the same is done for the new features, i.e., the origin of the new feature space is located at the centroid of the dataset. Let $X_c$ be the centered dataset, so that the origin is located at the centroid of the dataset.
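As a minimal sketch of the centering and projection steps (continuing with z from the earlier code), the unit vector u below is an arbitrary example direction, not a principal component:

import numpy as np

# Centre the dataset so that the origin sits at its centroid
Xc = z - z.mean(axis=0)

# An arbitrary unit-length direction -- just an example axis, not a principal component
u = np.array([1.0, 4.0])
u = u / np.linalg.norm(u)

# Coordinate of every data point along u in the new feature space (the dot products)
a = Xc @ u
print(a[:5])               # first few projected coordinates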

Now we can focus on the objective of PCA. Our objective is to align the new features along the stretch (or spread) of the dataset. That is, along the $j$th dimension, the variance explained should be maximum. Let $[a_{j1}, a_{j2}, a_{j3}, \ldots, a_{jn}]^T = X_c U_j$ be the projections of the data points of the input space onto the $j$th dimension of the new feature space, and let $\sigma_j^2$ be the variance along the $j$th dimension in the new feature space. It is to be noted that, as the origin was shifted to the centroid of the dataset, $\mu_j = \frac{1}{n}\sum_{i=1}^{n} a_{ji} = 0$.

Now, we can calculate the variance of $a_{ji}$, $i \in \{1, 2, 3, \ldots, n\}$, as:

$$\sigma_j^2 = \frac{1}{n}\sum_{i=1}^{n}\left(a_{ji} - \mu_j\right)^2 = \frac{1}{n}\sum_{i=1}^{n} a_{ji}^2 = \frac{1}{n}\left(X_c U_j\right)^T \left(X_c U_j\right) = \frac{1}{n} U_j^T X_c^T X_c U_j = U_j^T \left(\frac{X_c^T X_c}{n}\right) U_j = U_j^T \Sigma U_j$$
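This chain of equalities can be checked numerically. The sketch below continues with z from the earlier code and reuses an arbitrary unit vector u (still not a principal component):

import numpy as np

Xc = z - z.mean(axis=0)                               # centred data
u = np.array([1.0, 4.0]); u = u / np.linalg.norm(u)   # arbitrary unit vector

a = Xc @ u                                            # projections along u
Sigma = (Xc.T @ Xc) / len(Xc)                         # X_c^T X_c / n

print(np.mean(a**2))                                  # variance of the projections (mean is zero)
print(u @ Sigma @ u)                                  # U_j^T Σ U_j -- the same number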

Note that when the original data is centered, $\frac{X_c^T X_c}{n}$ is the covariance matrix $\Sigma$, and the covariance matrix is a square symmetric matrix. The objective is to maximize $\sigma_j^2$ subject to $U_j^T U_j = 1$, because $U_j$ is a unit vector. This is a constrained optimization problem which can be solved by the Lagrange multiplier method:
$$L = U_j^T \Sigma U_j - \alpha_j \left(U_j^T U_j - 1\right)$$
where $\alpha_j$ is the Lagrange multiplier. The objective is to find the $U_j$ that maximizes $\sigma_j^2$ and hence we set $\frac{\partial L}{\partial U_j} = 0$. This leads to
$$2\Sigma U_j - 2\alpha_j U_j = 0$$
From here, it is clear that
$$\Sigma U_j = \alpha_j U_j$$
The above equation is the well-known eigen decomposition of a square matrix. This means that the principal component $U_j$ is essentially an eigenvector of the covariance matrix. Since the covariance matrix is a square symmetric matrix of size $(m \times m)$, there are exactly $m$ eigenvectors with $m$ eigenvalues. Substituting $\Sigma U_j$ into the expression for $\sigma_j^2$, we see that
$$\sigma_j^2 = U_j^T \Sigma U_j = U_j^T \alpha_j U_j = \alpha_j U_j^T U_j = \alpha_j$$
Thus, the eigenvalue represents the amount of variance explained by the respective principal component along the $j$th dimension. An added benefit of this analysis is that the principal components are perfectly orthogonal to each other, because the eigenvectors of a square symmetric matrix (corresponding to distinct eigenvalues) are always orthogonal to each other.
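Both claims can be verified on the sample data. The sketch below uses the covariance matrix of the centered data and numpy's eigh (which is intended for symmetric matrices):

import numpy as np

Xc = z - z.mean(axis=0)
Sigma = (Xc.T @ Xc) / len(Xc)                 # covariance matrix of the centred data

vals, vecs = np.linalg.eigh(Sigma)            # eigenvalues and eigenvectors (columns)

# Variance of the projections on each eigenvector equals the corresponding eigenvalue
for j in range(vecs.shape[1]):
    a = Xc @ vecs[:, j]
    print(np.mean(a**2), vals[j])             # the two numbers match

# Eigenvectors of a symmetric matrix are orthonormal: vecs.T @ vecs is the identity
print(np.round(vecs.T @ vecs, 6))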

Covariance is not bounded, and a variable with a higher variance tends to dominate the other variables (or features). Hence, instead of the covariance matrix, the correlation matrix is often preferred, which is essentially the covariance matrix of the standardized data.
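As a quick check (continuing with z from above), the covariance matrix of the standardized data is indeed the correlation matrix:

import numpy as np

# Standardize each column to zero mean and unit variance
z_std = (z - z.mean(axis=0)) / z.std(axis=0)

print(np.round((z_std.T @ z_std) / len(z_std), 4))   # covariance of the standardized data
print(np.round(np.corrcoef(z, rowvar=False), 4))     # correlation matrix -- same values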

The final principal components for the sample data are shown in the figure produced by the code below.

# correlation matrix of z (equivalently, the covariance matrix of the standardized data)
cor_mat = np.corrcoef(z, rowvar=False)

# eigen decomposition: the columns of eig_vec are the principal components
eig_val, eig_vec = np.linalg.eig(cor_mat)

# centroid of the data -- the principal components are drawn from here
x_mean = x.mean()
y_mean = y.mean()

plt.figure(figsize=(6,6))
sns.scatterplot(x=x, y=y)
# draw the two eigenvectors as arrows anchored at the centroid
plt.quiver([x_mean, x_mean], [y_mean, y_mean], eig_vec[0,:], eig_vec[1,:],
           [3,10], scale=5, cmap='flag')
plt.xlabel("X")
plt.ylabel("Y")
plt.show()
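As a follow-up sketch, the standardized data can be projected onto the eigenvectors computed above to obtain the principal-component scores. Their variances match the eigenvalues, and the scores are uncorrelated with each other:

# Project the standardized data onto the eigenvectors to obtain the PC scores
z_std = (z - z.mean(axis=0)) / z.std(axis=0)
scores = z_std @ eig_vec                     # one column per principal component

print(scores.var(axis=0))                    # ≈ eig_val: variance explained by each PC
print(np.round(np.corrcoef(scores, rowvar=False), 4))   # off-diagonals ≈ 0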

 

PCA as a method of dimension reduction

 

After the eigen decomposition of the correlation matrix, we also get another interesting outcome:
$$\sum_{j=1}^{m} \sigma_j^2 = \sum_{j=1}^{m} \alpha_j = \operatorname{trace}(\text{correlation matrix}) = m$$
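Continuing with cor_mat and eig_val computed earlier, this can be checked in one line:

# The eigenvalues of the correlation matrix sum to its trace, i.e. m (here m = 2)
print(eig_val.sum(), np.trace(cor_mat))      # both ≈ 2.0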
After PCA, the eigenvalues are sorted in descending order and the eigenvectors are rearranged according to the sorted eigenvalues. Then we choose $r$ principal components such that
$$\frac{1}{m}\sum_{j=1}^{r} \alpha_j \ge \epsilon$$
with the minimum value of $r$, so that the total variance retained is at least a fraction $\epsilon$ of the total variance. Since $r < m$, the $m$-dimensional dataset is represented with $r$ mutually orthogonal dimensions at the cost of some loss of information. Apart from this method, there is a criterion called the Kaiser criterion that can be used to extract a smaller number of dimensions: only those eigenvectors are retained for which the corresponding eigenvalues are greater than 1. The third method is the choice of the analyst, where the analyst chooses the number of principal components.
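A minimal sketch of the first two selection rules follows (continuing with eig_val and eig_vec from above); the 95% threshold is an arbitrary example value, not part of the method itself:

import numpy as np

# Sort the eigenvalues (and the matching eigenvectors) in descending order
order = np.argsort(eig_val)[::-1]
sorted_vals = eig_val[order]
sorted_vecs = eig_vec[:, order]

# Rule 1: smallest r retaining at least epsilon of the total variance (epsilon = 0.95 here)
cum_ratio = np.cumsum(sorted_vals) / sorted_vals.sum()
r = int(np.searchsorted(cum_ratio, 0.95)) + 1
print("components needed to retain 95% of the variance:", r)

# Rule 2: Kaiser criterion -- keep only components whose eigenvalue exceeds 1
print("components kept by the Kaiser criterion:", int(np.sum(sorted_vals > 1)))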

Implicit Assumptions


Since PCA deals with the covariance (or correlation) matrix, it carries these implicit assumptions:
  1. The variables should be normally distributed (so that the mean and variance are meaningful summaries).
  2. The principal components are linear combinations of the input variables.
I hope that the readers will find this post helpful in understanding the concept.

Comments are welcome!






