Thursday, 16 May 2024

Important Probability Distributions (Part 2)


In the previous post, we looked at the binomial and Poisson distributions, both of which are discrete. There are also continuous probability distributions, and one of the most important among them is the normal distribution. For one-dimensional data, the normal distribution has two parameters, the mean \(\mu\) and the standard deviation \(\sigma\). Its probability density function (pdf) is defined as: \[f(x|\mu,\sigma)=\frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}\] For the standard normal distribution, \(\mu = 0\) and \(\sigma = 1\), which leads to: \[ f(x|0,1) = \frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}} \]

In the multivariate case, the expression becomes more involved. The multivariate normal distribution is parameterized by the mean vector \(\mu\) and the covariance matrix \(\Sigma\), and the pdf becomes: \[ f(x|\mu, \Sigma) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{\frac{1}{2}}}\exp\left[-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right]\] where \(d\) is the dimension of the data.

The normal distribution has applications in hypothesis testing in statistical inference. In particular, when the population standard deviation is known, it can be used to determine whether the means of two samples are statistically significantly different. In most situations, however, the population standard deviation is not known. We therefore need a distribution that behaves like the normal distribution when the sample size is large, yet can also be used when the sample size is small (say 10) and, more importantly, when the population standard deviation is unknown. Student's t-distribution has all of these characteristics, and hence it is generally preferred for statistical hypothesis tests. Its pdf is given by: \[ \frac{\Gamma \left(\frac{\nu +1}{2}\right)}{\sqrt{\pi \nu}\,\Gamma\left(\frac{\nu}{2}\right)}\left(1+\frac{x^2}{\nu}\right)^{-\frac{\nu+1}{2}}\] where \(\nu\) is the degrees of freedom and \(\Gamma[\cdot]\) is the Gamma function. The above expression applies to one-dimensional data. In the multivariate case it becomes: \[ \frac{\Gamma \left(\frac{\nu +p}{2}\right)}{(\pi \nu)^{\frac{p}{2}}\,\Gamma\left(\frac{\nu}{2}\right)|\Sigma|^{\frac{1}{2}}}\left(1+\frac{(x-\mu)^T\Sigma^{-1}(x-\mu)}{\nu}\right)^{-\frac{\nu+p}{2}}\] where \(p\) is the dimension of the data and \(\nu\) is the degrees of freedom.

If the multivariate forms of the two distributions are examined carefully, both contain the term \((x-\mu)^T\Sigma^{-1}(x-\mu)\), which is the squared Mahalanobis distance. This distance is unitless and scale-invariant. Hence, if the data under consideration follows a multivariate normal distribution, the Mahalanobis distance can be used to detect multivariate outliers.
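To make the connection concrete, the short snippet below (a minimal sketch with illustrative values for \(\mu\) and \(\Sigma\)) computes the squared Mahalanobis distance of a point and shows that it is exactly the quadratic term inside the exponent of the multivariate normal pdf.

import numpy as np
from scipy.stats import multivariate_normal
from scipy.spatial.distance import mahalanobis

mu = np.array([0.0, 0.0])                 # illustrative mean vector
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])            # illustrative covariance matrix
x = np.array([1.5, -0.5])

# Squared Mahalanobis distance: (x - mu)^T Sigma^{-1} (x - mu)
d2 = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)

# Multivariate normal pdf evaluated manually and via scipy -- the two agree
d = len(mu)
pdf_manual = np.exp(-0.5*d2) / ((2*np.pi)**(d/2) * np.sqrt(np.linalg.det(Sigma)))
pdf_scipy = multivariate_normal(mean=mu, cov=Sigma).pdf(x)

# scipy's mahalanobis() returns the (non-squared) distance
print(d2, mahalanobis(x, mu, np.linalg.inv(Sigma))**2)  # same value
print(pdf_manual, pdf_scipy)                            # same value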

To illustrate this use, a synthetic dataset is created with only two variables so that the spread can be shown diagrammatically. The code for generating and visualizing the data is shown below.

import numpy as np
import seaborn as sns
import pandas as pd

n_inliers = 100
n_outliers = 20
n_features = 2

# Covariance matrices for the inlier and the outlier data
inliers_cov_mat = np.array([[2,0],[0,1]])
outliers_cov_mat = np.array([[5,0],[0,2]])

np.random.seed(1) # Fix the seed for consistent output

# Generate inlier and outlier data points
inlier_data = np.random.normal(size=(n_inliers, n_features)) @ inliers_cov_mat
outlier_data = np.random.normal(size=(n_outliers, n_features)) @ outliers_cov_mat

# Join the data points
final_data = pd.DataFrame(np.vstack([inlier_data, outlier_data]), 
                          columns=['x','y'])
final_data['labels'] = np.repeat(['inliers','outliers'],[n_inliers, n_outliers]) # Labels only for identification

# Generate the scatter plot for visualization
sns.scatterplot(data=final_data, x='x', y='y', hue='labels')

The scatter plot below shows the inlier (or normal) data points along with the generated outliers. Note that not all of the points labelled as outliers actually are outliers; some of them fall well within the inlier region. Still, several genuine outliers exist in the dataset. The objective here is to identify them using a statistical outlier detection method, namely the Mahalanobis distance. This method is multivariate in nature and hence more appropriate when analyzing multivariate data.

The covariance matrix plays an important role in calculating the Mahalanobis distance, and the covariance estimate itself is influenced by the presence of outliers. For this reason, multiple covariance matrices are estimated from subsets of the data, and the one with the lowest determinant is chosen. The smaller the determinant, the more closely packed the corresponding data points are, i.e., they are plausible inliers. The Python package scikit-learn provides the MinCovDet (MCD) class for this purpose. It searches for a subset of \(\frac{n_{samples}+n_{features}+1}{2}\) samples whose covariance matrix has the lowest determinant, leading to a pure subset of samples from the dataset. We apply this method to the dataset above to obtain a robust estimate of the covariance matrix, which can then be used to detect outliers. The code snippet is given below:
 
from sklearn.covariance import MinCovDet

mcd = MinCovDet(random_state=123)
mcd.fit(X=final_data.loc[:,['x','y']])
print(f"The robust covariance matrix is: \n\n{mcd.covariance_}")

# OUTPUT
# ======
# The robust covariance matrix is: 

# [[2.55655394 0.05976092]
#  [0.05976092 0.9295315 ]]
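As a quick check (a sketch that assumes the final_data DataFrame and the fitted mcd object from above), the plain sample covariance of the full dataset can be compared with the MCD estimate to see how much the outliers inflate the non-robust estimate:

# Non-robust sample covariance of the full dataset, for comparison
plain_cov = np.cov(final_data[['x','y']].values, rowvar=False)
print(f"Sample covariance (all points):\n{plain_cov}\n")
print(f"MCD robust covariance:\n{mcd.covariance_}")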

It can be seen that the robust covariance matrix is close to the covariance matrix used to create the inliers. The mcd object also contains the Mahalanobis distances of the data points in the input dataset; these are calculated with respect to the robust location (mean vector) estimated from the same subset of samples used for the robust covariance matrix. Since these distances are one-dimensional, outliers can be extracted with a boxplot or IQR based rule. The following code snippet shows the final outlier detection process.

def extract_univariate_outlier_IQR(array_1D):
    # Flag values outside the Tukey fences, i.e. beyond 1.5*IQR from the quartiles
    percentile_values = np.percentile(array_1D, q=[25,75])
    iqr = percentile_values[1] - percentile_values[0]
    lower_cutoff = percentile_values[0] - 1.5*iqr
    upper_cutoff = percentile_values[1] + 1.5*iqr
    out = np.where((array_1D > upper_cutoff) | (array_1D < lower_cutoff), 1, 0)
    return out
    

outliers = extract_univariate_outlier_IQR(mcd.dist_)

# Generate the scatter plot for visualization
sns.scatterplot(data=final_data, x='x', y='y', hue=outliers)



The above plot clearly shows the outliers. Note that this process is applicable to unimodal distributions; for multimodal distributions it can be misleading. In such scenarios, other methods such as Local Outlier Factor, Isolation Forest, or one-class SVM novelty detection can be used.
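As an illustration of one such alternative, the sketch below (assuming the final_data DataFrame from above) applies scikit-learn's LocalOutlierFactor, which flags points whose local density is much lower than that of their neighbours and does not assume a unimodal distribution:

from sklearn.neighbors import LocalOutlierFactor

# fit_predict returns -1 for outliers and 1 for inliers
lof = LocalOutlierFactor(n_neighbors=20)
lof_labels = lof.fit_predict(final_data.loc[:, ['x', 'y']])

# Scatter plot coloured by the LOF decision
sns.scatterplot(data=final_data, x='x', y='y', hue=(lof_labels == -1))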

In this post, I have tried to connect the use of the normal distribution to one important task in machine learning, i.e., outlier detection. I hope this is helpful.

