In data analysis, missing values create a lot of trouble. Several algorithms simply cannot handle missing values and fail at run time. Take K-Means clustering: the presence of missing values prevents the centroids from being estimated properly, which leads either to outright failure or to unreliable (and potentially biased) centroid estimates. Similarly, in linear regression, most implementations drop every row containing a missing value before estimating the model parameters, and that can make the parameter estimates unreliable. Random Forest, a generally more accurate machine learning method, is also incapable of handling missing values unless they are replaced by some means. XGBoost, LightGBM and the like do have built-in provisions for missing values: at each split they send the data points with missing values either to the left-hand bucket or to the right-hand bucket, whichever minimizes the loss. This is possible because their split-finding objective is quite different from that of an ordinary decision tree. In general, though, missing values are nothing but a menace to data analytics, and quite a few methods are available to deal with the problem.
Simple methods
Sometimes, dealing with missing values is as simple as dropping the rows that contain them. This is called list-wise deletion. The method is preferred when the deleted rows amount to less than about 5% of the total number of rows; this is only a rule of thumb. If the percentage is higher, the missing values are imputed instead. Keep in mind that missing values are sometimes legitimate and must NOT be imputed. This is commonly seen in the responses to large-scale surveys with funnelling questions such as "If the answer to Q4 is Yes, then go to Q7". In such situations, Q5 and Q6 are bound to contain missing values. An analyst must not try to impute them, because doing so would distort the responses significantly. Hence, care must be taken before imputing this type of missing value.
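A minimal sketch of that rule-of-thumb check with pandas (the tiny DataFrame and the 5% cut-off below are purely illustrative):

import pandas as pd

# hypothetical DataFrame with a few missing values
df = pd.DataFrame({"age": [25, None, 40, 31], "income": [50, 60, None, 55]})

# fraction of rows that would be lost under list-wise deletion
frac_dropped = 1 - len(df.dropna()) / len(df)

if frac_dropped < 0.05:
    df_clean = df.dropna()   # list-wise deletion is acceptable
else:
    df_clean = df            # too much data would be lost; impute instead
print(f"{frac_dropped:.1%} of rows contain at least one missing value")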
Before handling missing values, we need to look at what type of analysis is required. Some calculations are bi-variate, i.e. only two variables (or vectors) are involved at a time; common examples are correlation analysis and pairwise distance calculation. Since only two variables are involved at a time, such analyses can use pair-wise deletion (list-wise deletion applied to each pair of variables) instead of list-wise deletion and so retain a significant amount of information that would otherwise be lost. Hence, any algorithm that works from a correlation matrix (e.g. PCA) or a distance matrix (e.g. MDS) can benefit from this kind of missing-value treatment. However, it becomes difficult to generalize the outcome, because for unseen data with missing values the correlation matrix or the distance matrix could be significantly different. That is why any such deletion method must be applied rather carefully.
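For illustration (the DataFrame below is made up), pandas already applies pair-wise deletion when computing a correlation matrix, so each entry uses all rows where that particular pair of columns is observed:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, 4.0, 5.0],
    "y": [2.1, np.nan, 3.0, 8.2, 9.9],
    "z": [0.5, 1.5, 2.5, np.nan, 4.5],
})

pairwise_corr = df.corr()            # pair-wise deletion: each cell uses its own complete pairs
listwise_corr = df.dropna().corr()   # list-wise deletion: only fully observed rows survive
print(pairwise_corr, listwise_corr, sep="\n\n")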
Usually, missing values in numeric variables are imputed with simple techniques such as the series mean or the series median. The mean is a characteristic point of a distribution, and when the distribution of the variable is not significantly different from a normal distribution, mean-based imputation is preferred. If the distribution is skewed, the median is a better choice because the median is robust compared to the mean; the mean is highly affected by the presence of outliers. For discrete variables (categorical variables), the simplest method is to impute the missing values with the mode (the most frequent value).
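A quick sketch of these simple imputations with pandas (the column names and values are hypothetical):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "salary": [30_000, 32_000, np.nan, 310_000],   # skewed numeric variable (one outlier)
    "height": [170.0, np.nan, 165.0, 180.0],       # roughly symmetric numeric variable
    "city":   ["Pune", "Delhi", None, "Pune"],     # categorical variable
})

df["height"] = df["height"].fillna(df["height"].mean())     # mean for near-normal data
df["salary"] = df["salary"].fillna(df["salary"].median())   # median is robust to outliers
df["city"]   = df["city"].fillna(df["city"].mode()[0])      # mode for categorical data
print(df)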
Simple methods work, and they are easy to implement. However, they completely ignore the relationships among the variables, which can also induce bias. That is why model-based methods are worth exploring as well.
Model based methods
There are many model-based approaches available. In this post, I shall discuss an iterative method, which imputes the missing values over repeated passes. The process is explained below.
Step 1: Impute missing values with either the series mean or median for numeric variables and the series mode for discrete variables. Keep track of the row indices of the missing values for each variable.
Step 2: Arrange the variables in ascending order of the number of missing values.
Step 3: Start imputation with the variable having the fewest missing values. Make it the dependent (target) variable and make all the other variables the predictors.
Step 4: Convert the imputed values back to missing values, for the dependent variable only.
Step 5: Train a statistical model or a machine learning algorithm to predict the dependent variable from the rows where it is not missing, and use the model to predict its missing values. Use a regression model for a numeric variable and a classification model for a discrete variable.
Step 6: Move to the next variable and repeat steps 3 to 5.
Step 7: When all the variables have been imputed, one iteration (or epoch) is over. Repeat the entire process for a pre-fixed number of iterations (or epochs). Sometimes, to make sure that the predicted values are not too far off from actual observed values, a predictive mean matching algorithm is also applied to the predictions. However, in this blog we will not use it (mainly to keep the complexity down).
Python code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from collections import Counter
from lightgbm import LGBMRegressor, LGBMClassifier
from tqdm import tqdm, tqdm_notebook
class LGBM_Impute():
    def __init__(self, dataset, target_var=None, param_lgbm_reg=None, param_lgbm_cls=None, n_epoch=5):
        self.dataset = dataset
        self.target = target_var
        self.param_lgbm_cls = param_lgbm_cls
        self.param_lgbm_reg = param_lgbm_reg
        self.n_epoch = n_epoch
        # boolean mask marking the cells that were originally missing
        self.data_mask = self.dataset.isnull()

    def label_encode(self, x):
        # build a label -> integer code map, ignoring missing values
        unique_labels = list(np.sort(np.unique(list(x))))
        unique_labels = [y for y in unique_labels if y != 'nan']
        return {k: v for k, v in zip(unique_labels, list(range(len(unique_labels))))}

    def impute(self):
        self.data_encoded = self.dataset.copy()
        original_var_order = self.dataset.columns

        # Step 2: arrange the variables in ascending order of missing-value counts
        n_missing = self.data_mask.sum(axis=0)
        sorted_indices = np.argsort(n_missing)
        self.data_encoded = self.data_encoded.iloc[:, sorted_indices]
        self.data_mask = self.data_mask.iloc[:, sorted_indices]

        # identify categorical (object) columns after the re-ordering
        cat_vars_index = [index for index in range(self.data_encoded.shape[1])
                          if self.data_encoded.iloc[:, index].dtype == 'object']

        # label-encode the categorical columns and remember the mappings
        self.encode_list = []
        for col_index in cat_vars_index:
            label_map = self.label_encode(self.data_encoded.iloc[:, col_index])
            self.data_encoded.iloc[:, col_index] = self.data_encoded.iloc[:, col_index].map(label_map)
            self.encode_list.append((col_index, label_map))
        self.key_value_pair = {k: v for k, v in self.encode_list}

        # Step 1: initial fill -- mode for categorical columns, median for numeric columns
        for col in range(self.data_encoded.shape[1]):
            if col in cat_vars_index:
                C = Counter(self.data_encoded.iloc[:, col].dropna())
                fill_value = C.most_common(1)[0][0]
            else:
                fill_value = self.data_encoded.iloc[:, col].median()
            self.data_encoded.iloc[:, col] = self.data_encoded.iloc[:, col].fillna(fill_value)

        reg = LGBMRegressor() if self.param_lgbm_reg is None else LGBMRegressor(**self.param_lgbm_reg)
        cls = LGBMClassifier() if self.param_lgbm_cls is None else LGBMClassifier(**self.param_lgbm_cls)

        # Steps 3-7: cycle through the variables for a fixed number of epochs
        for epoch in tqdm(range(self.n_epoch)):
            for col in range(self.data_encoded.shape[1]):
                if self.data_mask.iloc[:, col].sum() == 0:
                    continue
                col_name = self.data_encoded.columns[col]
                # rows where the current variable was observed vs. originally missing
                df_train = self.data_encoded[~self.data_mask.iloc[:, col]]
                df_test = self.data_encoded[self.data_mask.iloc[:, col]].drop(col_name, axis=1)
                X = df_train.drop(col_name, axis=1)
                Y = df_train.iloc[:, col]
                # classifier for categorical targets, regressor for numeric targets
                model = cls if col in cat_vars_index else reg
                model.fit(X, Y)
                df_test[col_name] = model.predict(df_test)
                self.data_encoded = pd.concat([df_train, df_test]).sort_index()

        # map the categorical codes back to the original labels
        for col in range(self.data_encoded.shape[1]):
            if col in cat_vars_index:
                value_key_pair = {v: k for k, v in self.key_value_pair[col].items()}
                self.data_encoded.iloc[:, col] = self.data_encoded.iloc[:, col].map(value_key_pair)

        final_imputed_data = self.data_encoded.sort_index()
        return final_imputed_data[original_var_order]
The performance of imputation is very difficult to gauge because the correct values are not available. Readers can evaluate it by artificially creating missing values, using the routine to impute them, and then comparing the imputed values with the actual values. LGBM is a flexible ML algorithm, so the choice of hyperparameters will have an impact on the final quality of imputation.
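A rough sketch of that evaluation idea is given below; the synthetic data, the 10% missing fraction and the RMSE metric are my own choices for illustration, not part of the routine itself:

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# complete reference data (hypothetical)
full = pd.DataFrame({
    "x1": rng.normal(size=500),
    "x2": rng.normal(size=500),
})
full["x3"] = 0.7 * full["x1"] - 0.3 * full["x2"] + rng.normal(scale=0.1, size=500)

# knock out roughly 10% of x3 at random, then impute
with_na = full.copy()
mask = rng.random(len(with_na)) < 0.10
with_na.loc[mask, "x3"] = np.nan

imputed = LGBM_Impute(with_na, n_epoch=5).impute()

# compare the imputed values with the held-back true values
rmse = np.sqrt(np.mean((imputed.loc[mask, "x3"] - full.loc[mask, "x3"]) ** 2))
print("RMSE on artificially removed values:", rmse)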
I hope that you will find this blog informative.
Comments are welcome.