In the previous post we looked at the binomial and Poisson distributions, both of which are discrete. There are continuous probability distributions as well, and one of the most important among them is the normal distribution. For one-dimensional data, the normal distribution has two parameters: the mean \(\mu\) and the standard deviation \(\sigma\). Its probability density function (pdf) is defined as: \[f(x|\mu,\sigma)=\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]\] For the standard normal distribution, \(\mu = 0\) and \(\sigma = 1\), which leads to: \[ f(x|0,1) = \frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}} \]

In the multivariate case, however, the expression becomes more involved. The multivariate normal distribution is parameterized by the mean vector \(\mu\) and the covariance matrix \(\Sigma\), and the pdf changes to: \[ f(x|\mu, \Sigma) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{\frac{1}{2}}}\exp\left[-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right]\] where \(d\) is the dimension of the data.

The normal distribution has applications in hypothesis testing in statistical inference. In particular, when the population standard deviation is known, it can be used to determine whether the means of two samples are statistically significantly different. In most situations, however, the population standard deviation is not known. We therefore need a distribution that behaves like the normal distribution when the sample size is large, yet can also be used when the sample size is small (say 10) and, more importantly, when the population standard deviation is unknown. The Student's t distribution has all of these characteristics, which is why it is preferred for statistical hypothesis tests. Its density is given by: \[ \frac{\Gamma \left(\frac{\nu +1}{2}\right)}{\sqrt{\pi \nu}\,\Gamma\left(\frac{\nu}{2}\right)}\left(1+\frac{x^2}{\nu}\right)^{-\frac{\nu+1}{2}}\] where \(\nu\) is the degrees of freedom and \(\Gamma[\cdot]\) is the Gamma function. This form applies to one-dimensional data. In the multivariate case, with location vector \(\mu\) and scale matrix \(\Sigma\), the expression becomes: \[ \frac{\Gamma \left(\frac{\nu +p}{2}\right)}{(\pi \nu)^{\frac{p}{2}}\,\Gamma\left(\frac{\nu}{2}\right)|\Sigma|^{\frac{1}{2}}}\left(1+\frac{(x-\mu)^T\Sigma^{-1}(x-\mu)}{\nu}\right)^{-\frac{\nu+p}{2}}\] where \(p\) is the dimension of the data and \(\nu\) is the degrees of freedom.

If the multivariate forms of the two distributions are examined carefully, both contain the component \((x-\mu)^T\Sigma^{-1}(x-\mu)\), which is, essentially, the squared Mahalanobis distance. This distance is unitless and scale-invariant. Hence, if the data under consideration follows a multivariate normal distribution, the Mahalanobis distance can be used to detect the presence of multivariate outliers.
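To make the quadratic form concrete, here is a minimal sketch in NumPy and SciPy that computes the squared Mahalanobis distance of a single point. The values of mu, Sigma, and x below are illustrative only and are not taken from the dataset used later in this post.

import numpy as np
from scipy.spatial.distance import mahalanobis

# Illustrative (hypothetical) location, covariance, and query point
mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.0],
                  [0.0, 1.0]])
x = np.array([3.0, 1.0])

# Squared Mahalanobis distance via the quadratic form (x - mu)^T Sigma^{-1} (x - mu)
Sigma_inv = np.linalg.inv(Sigma)
d_squared = (x - mu) @ Sigma_inv @ (x - mu)
print(d_squared)  # 5.5

# Cross-check: scipy returns the (non-squared) distance, so square it
print(mahalanobis(x, mu, Sigma_inv) ** 2)  # 5.5

Note that rescaling a variable (say, from metres to centimetres) rescales the corresponding entries of \(\Sigma\) as well, which is why the distance is scale-invariant.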
To demonstrate the use, a synthetic dataset is created with only two variables so that the spread can be shown diagrammatically. The code for generating and visualizing the data is shown below.
import numpy as np
import pandas as pd
import seaborn as sns

n_inliers = 100
n_outliers = 20
n_features = 2

# Scaling matrices used to spread the inlier and outlier clouds
inliers_cov_mat = np.array([[2,0],[0,1]])
outliers_cov_mat = np.array([[5,0],[0,2]])

np.random.seed(1)  # Fix the seed for consistent output

# Generate inlier and outlier data points
inlier_data = np.random.normal(size=(n_inliers, n_features)) @ inliers_cov_mat
outlier_data = np.random.normal(size=(n_outliers, n_features)) @ outliers_cov_mat

# Join the data points
final_data = pd.DataFrame(np.vstack([inlier_data, outlier_data]), columns=['x','y'])
final_data['labels'] = np.repeat(['inliers','outliers'], [n_inliers, n_outliers])  # Labels only for identification

# Generate the scatter plot for visualization
sns.scatterplot(data=final_data, x='x', y='y', hue='labels')
Sample estimates of location and covariance are easily distorted by the very outliers we want to detect, so a robust estimate is preferred. Scikit-learn's MinCovDet class implements the Minimum Covariance Determinant (MCD) estimator, which computes the covariance from the most concentrated subset of the observations:

from sklearn.covariance import MinCovDet

mcd = MinCovDet(random_state=123)
mcd.fit(X=final_data.loc[:,['x','y']])
print(f"The robust covariance matrix is: \n\n{mcd.covariance_}")

# OUTPUT
# ======
# The robust covariance matrix is:
#
# [[2.55655394 0.05976092]
#  [0.05976092 0.9295315 ]]
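Once fitted, the estimator exposes the squared Mahalanobis distances of the training observations, computed with the robust location and covariance, through its dist_ attribute; the mahalanobis() method computes the same quantity for arbitrary points. A quick sanity check (the query point here, the robust centre itself, is chosen only for illustration):

# Squared robust Mahalanobis distances of the first few training points
print(mcd.dist_[:5])

# The same quantity for an arbitrary observation: the robust centre,
# whose squared distance from itself should be (near) zero
print(mcd.mahalanobis(mcd.location_.reshape(1, -1)))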
The distances in mcd.dist_ form a one-dimensional array, so a univariate outlier rule can be applied to them. Below, the usual IQR rule (flagging points beyond 1.5 times the interquartile range from the quartiles) is used to mark observations with unusually large robust distances:

def extract_univariate_outlier_IQR(array_1D):
    percentile_values = np.percentile(array_1D, q=[25,75])
    iqr = percentile_values[1] - percentile_values[0]
    lower_cutoff = percentile_values[0] - 1.5*iqr
    upper_cutoff = percentile_values[1] + 1.5*iqr
    out = np.where((array_1D > upper_cutoff) | (array_1D < lower_cutoff), 1, 0)
    return out

# Flag observations whose robust Mahalanobis distance is a univariate outlier
outliers = extract_univariate_outlier_IQR(mcd.dist_)

# Generate the scatter plot for visualization
sns.scatterplot(data=final_data, x='x', y='y', hue=outliers)
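As a closing note, for multivariate normal data the squared Mahalanobis distance follows a chi-square distribution with degrees of freedom equal to the data dimension, so a distributional cutoff is a common alternative to the IQR rule. A minimal sketch, assuming a 97.5% quantile threshold for our two-feature dataset:

from scipy.stats import chi2

# 97.5% quantile of the chi-square distribution with 2 degrees of freedom
cutoff = chi2.ppf(0.975, df=2)

# Flag points whose squared robust distance exceeds the distributional cutoff
chi2_outliers = (mcd.dist_ > cutoff).astype(int)
sns.scatterplot(data=final_data, x='x', y='y', hue=chi2_outliers)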