Monday, 28 June 2021

Missing Value Imputation

In data analysis, missing values create a lot of trouble. Several algorithms are simply incapable of handling missing values and fail at run time. Take the case of KMeans clustering: the presence of missing values prevents the centroids from being estimated properly, which leads either to outright failure or to unreliable (and potentially biased) centroid estimates. Similarly, most linear regression implementations drop every row containing a missing value before estimating the model parameters, and that can make the parameter estimates unreliable. Random Forest, a more accurate machine learning method, is also incapable of handling missing values unless they are first replaced by some means. XGBoost, LightGBM and similar gradient boosting libraries do have built-in provisions for missing values: at each split, data points with missing values are sent either to the left-hand bucket or to the right-hand bucket, whichever minimizes the loss. This is possible because their split-finding criterion is quite different from that of a normal decision tree. In general, though, missing values are nothing but a menace to data analytics, and quite a few methods are available to deal with the problem.
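
As a small illustration (a minimal sketch on a synthetic dataset, not part of this post's main example), LightGBM can be trained directly on a feature matrix that still contains NaNs:

import numpy as np
from sklearn.datasets import make_classification
from lightgbm import LGBMClassifier

# toy classification data with roughly 10% of the feature cells knocked out at random
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
rng = np.random.default_rng(42)
X[rng.random(X.shape) < 0.10] = np.nan

# LightGBM trains directly on a feature matrix containing NaNs --
# missing values are routed to whichever side of each split reduces the loss
model = LGBMClassifier(random_state=42)
model.fit(X, y)
print(model.predict(X[:5]))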

Simple methods

Sometimes, dealing with missing values is as simple as dropping the rows that contain them. This is called list-wise deletion. This method is preferred when the deleted rows make up less than 5% of the total number of rows; that threshold is only a rule of thumb. If the percentage is higher than 5%, the missing values are imputed instead. It is to be kept in mind that missing values are sometimes legitimate, and they must NOT be imputed. This is commonly seen in the responses to large-scale surveys that use funneling questions such as "If the answer to Q4 is Yes then go to Q7". In such situations, Q5 and Q6 are bound to contain missing values. An analyst must not try to impute them, because doing so would distort the responses significantly. Hence, care must be taken before imputing this type of missing value.
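
As a minimal sketch of this rule of thumb (the DataFrame and the 5% cut-off below are purely illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 29, 35],
    "income": [40000, 52000, 61000, np.nan, 45000, 58000],
    "city": ["Pune", "Delhi", "Mumbai", "Pune", None, "Delhi"],
})

# fraction of rows that contain at least one missing value
frac_incomplete = df.isnull().any(axis=1).mean()

if frac_incomplete < 0.05:
    # rule of thumb: safe to drop the incomplete rows (list-wise deletion)
    df_clean = df.dropna()
else:
    # too many incomplete rows -- prefer imputation instead
    df_clean = df.copy()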

Before handling missing values, we need to look at what type of analysis is required. Some calculations (bi-variate analyses) involve only two variables (or vectors) at a time; common examples are correlation analysis and pairwise distance calculation. Since only two variables are involved at a time, we can use pair-wise deletion (i.e. list-wise deletion applied to each pair of variables) instead of list-wise deletion and retain a significant amount of information that would otherwise be lost. Hence, any algorithm that uses a correlation matrix (e.g. PCA) or a distance matrix (e.g. MDS) can benefit from this method of missing value treatment. However, it becomes difficult to generalize the outcome of these processes, because for unseen data with missing values the correlation matrix or the distance matrix could be significantly different. That is why any such deletion method must be administered rather carefully.
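
For instance, pandas' corr() already performs pair-wise deletion, so each entry of the correlation matrix is computed from all the rows available for that particular pair of columns (a minimal sketch with made-up data):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x1": [1.0, 2.0, np.nan, 4.0, 5.0],
    "x2": [2.1, np.nan, 2.9, 4.2, 5.1],
    "x3": [0.5, 1.0, 1.4, np.nan, 2.4],
})

# list-wise deletion: only rows complete in every column survive
corr_listwise = df.dropna().corr()

# pair-wise deletion: NaNs are dropped separately for each pair of columns,
# so every correlation uses as many rows as that pair allows
corr_pairwise = df.corr()

print(corr_listwise)
print(corr_pairwise)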

Missing values in numeric variables are usually imputed with simple techniques such as imputing with the series mean or the series median. The mean is a characteristic point of a distribution, and when the distribution of the variable is not significantly different from a normal distribution, mean-based imputation is preferred. But if the distribution is skewed, the median is a better choice, because the median is robust whereas the mean is highly affected by the presence of outliers. For discrete (categorical) variables, the simplest method is to impute the missing values with the mode (the most frequent value).
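
In pandas, these simple rules take only a line each (a minimal sketch; the column names are made up for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "salary": [40000.0, 52000.0, np.nan, 61000.0, 45000.0],  # roughly symmetric
    "tenure": [1.0, 2.0, 35.0, np.nan, 3.0],                 # skewed / outlier present
    "dept": ["HR", "IT", None, "IT", "HR"],                  # categorical
})

df["salary"] = df["salary"].fillna(df["salary"].mean())    # near-normal -> mean
df["tenure"] = df["tenure"].fillna(df["tenure"].median())  # skewed -> median
df["dept"] = df["dept"].fillna(df["dept"].mode()[0])       # categorical -> mode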

Simple methods work, and they are easy to implement. However, simple imputation completely ignores the relationships among the variables, which can itself induce bias. That is why model based methods are also worth exploring.

Model based methods

There are many model based approaches available. In this post, I shall discuss an iterative method that imputes the variables one at a time, over several passes. The process is explained below.

Step 1: Impute missing values with either series mean or median for numeric variables and series mode for discrete data. Keep track of the row indices of the missing values for each variable.

Step 2: Arrange the variables in ascending order of their missing-value counts.

Step 3: Start imputation with the variable having the fewest missing values. Make it the dependent (target) variable and use all the other variables as predictors.

Step 4: Convert the imputed values to missing values once again for the dependent variable only. 

Step 5: Train a statistical model or a machine learning algorithm to predict the dependent variable from the rows where it is not missing. Then use the model to predict the dependent variable's missing values. Use a regression model for a numeric variable and a classification model for a discrete variable.

Step 6: Move to the next variable and repeat steps 3 to 5.

Step 7: When all the variables have been imputed, one iteration (or epoch) is over. Repeat the entire process for a prespecified number of iterations (or epochs). Sometimes, to make sure that the predicted values are not too far from plausible observed values, a predictive mean matching step is also applied to the predictions in step 5. However, in this blog we will not use it (mainly to keep the complexity down).

Python code

import numpy as np
import pandas as pd
from collections import Counter
from lightgbm import LGBMRegressor, LGBMClassifier
from tqdm import tqdm

 

class LGBM_Impute():
    def __init__(self, dataset, target_var=None, param_lgbm_reg=None, param_lgbm_cls=None, n_epoch=5):
        self.dataset = dataset
        self.target = target_var
        self.param_lgbm_cls = param_lgbm_cls
        self.param_lgbm_reg = param_lgbm_reg
        self.n_epoch = n_epoch
        
        # boolean mask marking the cells that were originally missing
        self.data_mask = self.dataset.isnull()
        
    
    def label_encode(self, x):
        # build a label -> integer-code mapping, ignoring missing values
        unique_labels = list(np.sort(np.unique(list(x))))
        unique_labels = [y for y in unique_labels if y != 'nan']
        return {k: v for k, v in zip(unique_labels, list(range(len(unique_labels))))}
    
    def impute(self):
        original_var_order = self.dataset.columns
        
        # Step 2: arrange the variables in ascending order of missing-value counts
        n_missing = self.data_mask.sum(axis=0)
        sorted_indices = np.argsort(n_missing.values)
        self.dataset = self.dataset.iloc[:, sorted_indices]
        self.data_mask = self.data_mask.iloc[:, sorted_indices]
        
        # identify categorical and numeric columns (positions refer to the sorted order)
        cat_vars_index = [index for index in range(self.dataset.shape[1]) if self.dataset.iloc[:, index].dtype == 'object']
        num_vars_index = [index for index in range(self.dataset.shape[1]) if self.dataset.iloc[:, index].dtype != 'object']
        
        self.data_encoded = self.dataset.copy()
        
        # label-encode the categorical columns and remember the mappings
        self.encode_list = []
        for col_index in cat_vars_index:
            col_name = self.data_encoded.columns[col_index]
            label_map = self.label_encode(self.data_encoded[col_name])
            self.data_encoded[col_name] = self.data_encoded[col_name].map(label_map)
            self.encode_list.append((col_index, label_map))
            
        self.key_value_pair = {k: v for k, v in self.encode_list}
        
        # Step 1: initial fill -- mode for categorical columns, median for numeric columns
        for col in range(self.data_encoded.shape[1]):
            col_name = self.data_encoded.columns[col]
            if col in cat_vars_index:
                C = Counter(self.data_encoded[col_name].dropna())
                self.data_encoded[col_name] = self.data_encoded[col_name].fillna(C.most_common(1)[0][0])
            else:
                M = self.data_encoded[col_name].median()
                self.data_encoded[col_name] = self.data_encoded[col_name].fillna(M)
        
        # use user-supplied LightGBM parameters if provided, otherwise the defaults
        if self.param_lgbm_reg is None:
            reg = LGBMRegressor()
        else:
            reg = LGBMRegressor(**self.param_lgbm_reg)
            
        if self.param_lgbm_cls is None:
            cls = LGBMClassifier()
        else:
            cls = LGBMClassifier(**self.param_lgbm_cls)
                
        # Steps 3-7: iteratively re-predict the cells that were originally missing
        for epoch in tqdm(range(self.n_epoch)):
            for col in range(self.data_encoded.shape[1]):
                if self.data_mask.iloc[:, col].sum() == 0:
                    continue
                
                col_name = self.data_encoded.columns[col]
                
                # rows with observed values form the training set,
                # rows that were originally missing form the prediction set
                df_train = self.data_encoded[~self.data_mask.iloc[:, col]]
                df_test = self.data_encoded[self.data_mask.iloc[:, col]].copy()
                df_test.drop(col_name, axis=1, inplace=True)
                
                X = df_train.drop(col_name, axis=1)
                Y = df_train[col_name]
                
                # classification model for categorical targets, regression otherwise
                if col in cat_vars_index:
                    cls.fit(X, Y)
                    pred = cls.predict(df_test)
                else:
                    reg.fit(X, Y)
                    pred = reg.predict(df_test)
                
                df_test[col_name] = pred
                # keep the original row order so the boolean mask stays aligned
                self.data_encoded = pd.concat([df_train, df_test]).sort_index()
        
        # map the encoded categorical columns back to their original labels
        for col in range(self.data_encoded.shape[1]):
            if col in cat_vars_index:
                col_name = self.data_encoded.columns[col]
                value_key_pair = {v: k for k, v in self.key_value_pair[col].items()}
                self.data_encoded[col_name] = self.data_encoded[col_name].map(value_key_pair)
                
        # restore the original row and column order
        final_imputed_data = self.data_encoded.sort_index()
        
        return final_imputed_data[original_var_order]
 

The performance of imputation is very difficult to gauge because the true values of the missing cells are not available. Readers can evaluate the performance by artificially creating missing values, using the routine to impute them, and then comparing the imputed values with the actual values. LightGBM is a flexible ML algorithm, and hence the choice of hyperparameters will have an impact on the final quality of the imputation.
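
For instance, such a check could look like the following (a minimal sketch that assumes the LGBM_Impute class above has already been defined; mean absolute error on the artificially deleted cells of one numeric column is used as a simple yardstick):

import numpy as np
import pandas as pd

# toy dataset with two numeric columns and one categorical column
rng = np.random.default_rng(0)
n = 300
df_true = pd.DataFrame({
    "x1": rng.normal(50, 10, n),
    "x2": rng.normal(0, 1, n),
    "grp": rng.choice(["a", "b", "c"], n),
})

# knock out roughly 15% of the cells at random to create artificial missing values
holes = rng.random(df_true.shape) < 0.15
df_missing = df_true.mask(pd.DataFrame(holes, index=df_true.index, columns=df_true.columns))

# impute and compare the imputed cells against the known true values
imputer = LGBM_Impute(df_missing, n_epoch=5)
df_imputed = imputer.impute()

mae_x1 = (df_imputed.loc[holes[:, 0], "x1"] - df_true.loc[holes[:, 0], "x1"]).abs().mean()
print("MAE on the imputed cells of x1:", mae_x1)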

I hope that you will find this blog informative. 

Comments are welcome.
