Let us consider a dataset \(D\) in an \(m\)-dimensional feature space having \(n\) samples. Mathematically, \(D=\{(x_i,y_i)\}_{i=1}^n\) where \(x_i \in R^m\). PCA is unsupervised in nature and hence it does not involve \(y_i\). It is quite possible that the features have some level of linear correlation and hence, from a purely mathematical point of view, the features are not orthogonal to each other. Some of the features may have a very high level of multicollinearity whereas other features may remain reasonably uncorrelated with each other. In PCA, we consider linear correlation only (i.e., the Pearson correlation coefficient). PCA can be viewed as the process of extracting new features that are aligned along the direction of the stretch (or spread) of the data points. Let us try to understand this with code and diagrams. For better understanding, let us focus on a 2D space. The code below creates the data points and the plot shows the scatter.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

np.random.seed(20)
x = np.random.normal(10, 2, size=10000)        # feature X ~ N(10, 2^2)
y = 4*x + np.random.normal(2, 4, size=10000)   # feature Y: linear in X plus Gaussian noise
z = np.c_[x, y]                                # stack X and Y into a (10000, 2) dataset
sns.scatterplot(x=x, y=y)
plt.xlabel("X")
plt.ylabel("Y")
plt.show()
It is quite evident that the relationship between X and Y is linear. X and Y are orthogonal axes in this diagram, and from a linear regression point of view this scatter is desirable when Y is the target variable and X is the predictor variable. But if both X and Y are predictor variables, then this scatter is going to create trouble due to multicollinearity of the features. In pattern recognition, it is quite common to transform the data from an existing (or input) feature space to a new (or mapped) feature space. This is popularly known as feature engineering. So, here comes the question: "What is a meaningful way of shifting the data from the current input space to a new feature space?" For instance, have a look at the following new feature spaces.
Out of the four different feature spaces, which one is the most meaningful? The answer is, probably the first one. This is because the features try to align themselves along the direction of the spread of the data. But can it be optimized even more? The answer is yes. We apply PCA.
When we want to transfer data points from one feature space to another feature space with a linear mapping (i.e., by using a linear function), we need a transformation matrix. Let us assume that \(U\) is the required transformation matrix such that \[ U=\begin{bmatrix} | & | & | & & & | \\ u_1 & u_2 & u_3 & \cdots & \cdots & u_m \\ | & | & | & & & | \end{bmatrix} \] where each \(u_j\) is a column vector with unit length. It is also to be understood that \(u_j\) is the vector representing the \(j^{th}\) axis of the new feature space. If a simple dot product is taken between \(X_i=[x_{i1}, x_{i2}, x_{i3},...,x_{im}]\) and \(U\), the resultant outcome is the set of coordinates of the \(i^{th}\) data point in the new feature space, i.e., its projection onto the new axes. The data point remains in the same location; only its coordinates change due to the feature space mapping. Since the pattern within the data remains the same under such a linear feature space mapping, the pattern also remains unchanged when the origin is shifted to any arbitrary location. Hence, it is convenient to shift the origin to the centroid of the dataset. In the above image also, the same is done for the new features, i.e., the origin of the new feature space is located at the centroid of the dataset. Let \(X_c\) be the centered dataset such that the origin is located at the centroid of the dataset.
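As a minimal sketch of this idea (reusing z from the code above), we can center the data and project it onto an arbitrary unit vector. The 75-degree direction below is purely an illustrative assumption, not part of PCA itself.
# Center the data: shift the origin to the centroid of the dataset
z_c = z - z.mean(axis=0)

# An arbitrary unit-length direction (hypothetical choice, for illustration only)
theta = np.deg2rad(75)
u = np.array([np.cos(theta), np.sin(theta)])

# Dot product of each (centered) data point with u = coordinate along the new axis
a = z_c @ u
print(a.shape)             # (10000,) -> one coordinate per data point
print(round(a.mean(), 6))  # approximately 0, because the data has been centered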
Now we can focus on the objective of PCA. Our objective is to align the new features along the stretch (or spread) of the dataset. That is, along the \(j^{th}\) dimension of the new feature space, the variance explained should be maximum. Let \([a_{1j}, a_{2j}, a_{3j},...,a_{nj}] = X_cU_j\) be the projections of the data points in the input space onto the \(j^{th}\) dimension of the new feature space, and let \(\sigma_j^2\) be the variance along the \(j^{th}\) dimension in the new feature space. It is to be noted that, as the origin was shifted to the centroid of the dataset, \[\mu_j = \frac{1}{n}\sum_{i=1}^n a_{ij} =0 \]
Now, we can calculate the variance of \(a_{ij}\ \forall i \in \{1,2,3,...,n\}\) as:
$$ \begin{align} \sigma_j^2 & = \frac{1}{n}\sum_{i=1}^n (a_{ij}-\mu_j)^2 \\ & = \frac{1}{n}\sum_{i=1}^n (a_{ij})^2 \\ & = \frac{1}{n} (X_cU_j)^T(X_cU_j) \\ & = \frac{1}{n}U_j^TX_c^TX_cU_j \\ & = U_j^T\left (\frac{X_c^TX_c}{n}\right )U_j \\ & = U_j^T \Sigma U_j \end{align} $$
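The last identity is easy to verify numerically. The short sketch below reuses z_c and u from the previous snippet; it is only a sanity check of the algebra, not part of the PCA algorithm itself.
# Covariance matrix of the centered data (denominator n, matching the derivation)
Sigma = (z_c.T @ z_c) / len(z_c)

# Variance of the projections vs. the quadratic form u^T Sigma u
a = z_c @ u
print(np.isclose(a.var(), u @ Sigma @ u))   # True (up to floating point error)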
Note that when the original data is centered, $\frac{X_c^TX_c}{n}$ is the covariance matrix, and the covariance matrix is a square symmetric matrix. The objective is to maximize $\sigma_j^2$ subject to $U_j^TU_j=1$, because $U_j$ is a unit vector. This is a constrained optimization problem which can be solved by the Lagrange multiplier method. $$L=U_j^T\Sigma U_j - \alpha_j(U_j^TU_j-1)$$ where $\alpha_j$ is the Lagrange multiplier. The objective is to find $U_j$ that maximizes $\sigma_j^2$ and hence we set $\frac{\partial L}{\partial U_j}=0$. This leads to $$2\Sigma U_j - 2\alpha_j U_j = 0$$ From here, it is clear that $$\Sigma U_j = \alpha_j U_j$$ The above equation is well known as the eigen decomposition of a square matrix. This means that the principal component $U_j$ is essentially an eigenvector of the covariance matrix. Since the covariance matrix is a square symmetric matrix of size $(m \times m)$, there are exactly $m$ eigenvalues with $m$ corresponding eigenvectors. Putting the value of $\Sigma U_j$ into the expression for $\sigma_j^2$, we see that $$\sigma_j^2=U_j^T\Sigma U_j = U_j^T\alpha_j U_j=\alpha_j U_j^TU_j = \alpha_j$$ Thus, the eigenvalue represents the amount of variance explained by the respective principal component along the $j^{th}$ dimension. An added benefit of this analysis is that the principal components are perfectly orthogonal to each other, because the eigenvectors of a square symmetric matrix can always be chosen to be mutually orthogonal.
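The sketch below checks these two facts on the toy data, reusing Sigma and z_c from the previous snippets: the eigenvalues of the covariance matrix equal the variances of the projections onto the corresponding eigenvectors, and the eigenvectors are mutually orthogonal.
# Eigen decomposition of the (symmetric) covariance matrix
eval_cov, evec_cov = np.linalg.eigh(Sigma)

for j in range(Sigma.shape[0]):
    proj = z_c @ evec_cov[:, j]                  # projections onto the j-th eigenvector
    print(np.isclose(proj.var(), eval_cov[j]))   # True: eigenvalue = variance explained

# The eigenvectors are mutually orthogonal (orthonormal, in fact)
print(np.allclose(evec_cov.T @ evec_cov, np.eye(Sigma.shape[0])))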
Covariance is not bounded, and a variable with a higher variance tends to dominate the other variables (or features). Hence, instead of the covariance matrix, the correlation matrix is preferred, which is essentially the covariance matrix of the standardized data.
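A quick sketch to verify this statement on the toy data: the covariance matrix of the standardized data is the same as the correlation matrix of the raw data.
# Standardize: zero mean and unit variance for every feature
z_std = (z - z.mean(axis=0)) / z.std(axis=0)

# Covariance of the standardized data equals the correlation matrix of the raw data
cov_of_std = (z_std.T @ z_std) / len(z_std)
print(np.allclose(cov_of_std, np.corrcoef(z, rowvar=False)))   # True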
The final principal components for the sample data are shown in the figure below, along with the code that produces it.
# Correlation matrix of z (i.e., covariance matrix of the standardized data)
cor_mat = np.corrcoef(z, rowvar=False)

# Eigen decomposition: the columns of eig_vec are the principal components
eig_val, eig_vec = np.linalg.eig(cor_mat)

# Centroid of the data: the origin of the new feature space
x_mean = x.mean()
y_mean = y.mean()

plt.figure(figsize=(6,6))
sns.scatterplot(x=x, y=y)
# Draw the two eigenvectors as arrows anchored at the centroid
plt.quiver([x_mean, x_mean], [y_mean, y_mean], eig_vec[0,:], eig_vec[1,:],
           [3,10], scale=5, cmap='flag')
plt.xlabel("X")
plt.ylabel("Y")
plt.show()
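As a short follow-up sketch, projecting the standardized data onto these eigenvectors gives the coordinates of every sample in the new principal-component space (often called the scores); z_std is reused from the standardization snippet above.
# Project the standardized data onto the principal components
scores = z_std @ eig_vec                       # shape (10000, 2)

# Variance of each score column is (approximately) the corresponding eigenvalue
print(np.round(scores.var(axis=0), 6), np.round(eig_val, 6))

# The new features are uncorrelated: the off-diagonal terms are ~0
print(np.round(np.corrcoef(scores, rowvar=False), 6))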
PCA as a method of dimension reduction
After the eigen decomposition of the correlation matrix, we also get another interesting outcome: $$\sum_{j=1}^m\sigma_j^2 = \sum_{j=1}^m\alpha_j = \mathrm{trace}(\text{correlation matrix})=m$$ After PCA, the eigenvalues are sorted in descending order and the eigenvectors are rearranged according to the sorted eigenvalues. Then we choose $r$ principal components such that $$\frac{1}{m}\sum_{j=1}^r\alpha_j \ge \epsilon$$ with the minimum value of $r$, so that the variance retained is at least a fraction $\epsilon$ of the total variance. Since $r < m$, the $m$-dimensional dataset is represented with $r$ mutually orthogonal dimensions, at the cost of some loss of information. Apart from this method, there is the Kaiser criterion, which can be used to extract a smaller number of dimensions: only those eigenvectors are retained whose corresponding eigenvalues are greater than 1. The third approach is the analyst's own judgment, where the analyst chooses the number of principal components manually.
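The sketch below applies both selection rules to the toy data, reusing eig_val from the earlier eigen decomposition of the correlation matrix; the threshold of 0.95 is an arbitrary choice for illustration.
# Sort the eigenvalues in descending order
eig_val_sorted = np.sort(eig_val)[::-1]

# Fraction of the total variance explained by each principal component
explained = eig_val_sorted / eig_val_sorted.sum()

# Rule 1: smallest r retaining at least 95% of the total variance (epsilon = 0.95)
r = np.searchsorted(np.cumsum(explained), 0.95) + 1
print("components retaining >= 95% of the variance:", r)

# Rule 2: Kaiser criterion - keep components with eigenvalue > 1
print("components kept by the Kaiser criterion:", int(np.sum(eig_val_sorted > 1)))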
Implicit Assumptions
Since PCA deals with the covariance matrix, it comes with the following implicit assumptions:
- The variables should be approximately normally distributed (so that the mean and covariance summarize the distribution adequately)
- Principal components are linear combinations of the input variables
I hope that the readers will find this post helpful in understanding the concept.
Comments are welcome!