Analytics for Everyone: January 2020

Overview

Concept of sklearn pipeline

Designing pipelines

Implementing pipelines for better results

Introduction

Machine learning demands quite a lot of works on preparing the data. The complexity of the task varies from situation to situation. A dataset may contain missing values, irregularities and other issues such as high dimensionality of data. Data modeling, in such situations, is difficult and data preparation is a task that can easily take away a considerable amount of time if done manually. Not only that, there are several routine works that are required to be performed before the actual prediction is done by the model. This is where the concept of pipelines comes into the picture because pipelines can be called just like normal API calls and each pipeline can perform repetitive tasks efficiently. Pipeline is available in the Scikit-Learn package and if used properly, it can make the codes more readable and efficient.

In this article, we would see how we can create a simple pipeline using the sklearn package. Then we would also see how to build a slightly more complex pipeline for other tasks.

Pipelines

A pipeline, as the name suggests, is essentially a workflow process that takes in input data and finally, either builds a model or predicts the outcome simply sends out a transformed data. This is achieved by means of a base class that essentially takes in a data frame as input and in all intermediate operations, it tries to perform intermediate operations assuming that each operator has a “fit” and a “transform” method. The last operator can be a transformer (e.g. StandardScaler)or a predictor (e.g. SVM).

The interesting part of using the pipeline is that users can supply separate sets of parameters for all of its intermediate operators. This makes pipelines trainable through hyperparameter tuning operators such as GridSearchCV. Hence, it is preferable to use pipelines in ML while working with python.

Examples

Even though many examples are available on the Internet, it is always good to have more examples with explanations for better understanding. The first example is a simple one where simple preprocessing is done prior to building a model.

Building an ANN model using pipeline

For this example, the Abalone dataset is used from UCI Machine Learning repository. Here, the objective is to predict the age of abalone by predicting the number of rings (the last variable). The dataset has only one categorical variable, i.e. sex. A snapshot is shown below.

To solve this problem, a few packages and classes are required to be loaded in the python environment and loading of those packages is shown below:

import pandas as pd
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import train_test_split

This is a regression problem and hence multilayer perceptron (MLP) will be used for regression. However, MLP requires that the inputs are normalized. Hence, prior to feeding the data to MLP architecture, the data is required to be normalized. In this example, MinMax normalization is applied. Moreover, categorical data are required to be one hot encoded. So, a simple pipeline for ANN regression will have 3 operators:

One hot encoder (for categorical data)
MinMax normalization / standardization (for numeric data)
MLP regression model

The codes pertaining to custom transformers are shown below:

class dummify(BaseEstimator, TransformerMixin):
    def __init__(self):         # No initialization
        pass
        # self.y = y

    def fit(self, X, y=None):   # Returns nothing essentially
        return self

    def transform(self, X, y=None): # get_dummies() creates One-Hot-Encoding
        return pd.get_dummies(X)    # of categorical data only leaving numeric data

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)

class normalize(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
        # self.y = y
    
    def fit(self, X, y=None):           # Fits MinMax() normalization 
        return minmax_norm.fit(X)

    def transform(self, X, y=None):     # Transforms numeric data using MinMax norm
        return minmax_norm.transform(X)

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)

The next task is to combine the ANN regression method with the above transformers. Using the pipeline class in sklearn, it is really simple. One needs to simply put instances of these classes in a sequence such that the last instance belongs to the model’s instance. The code below shows the way of creating a pipeline.

ANN_reg = MLPRegressor(hidden_layer_sizes=(10,20),learning_rate='adaptive',
                       solver='adam',max_iter=2000)

ann_reg_pipeline = Pipeline([('dummy',dummify()),
                             ('norm',normalize()),
                             ('model',ANN_reg)])

Once the pipeline is created, training the final model is very easy. A pipeline has both ‘fit’ and ‘predict’ method to train the model inside the pipeline and to make predictions based on the trained model. The code below shows how to split the dataset into a training dataset and test dataset and use the training dataset to train the pipeline and use the pipeline for making predictions for the test data.

x_train, x_test, y_train, y_test = train_test_split(data.loc[:,'sex':'shell_weight'],data['rings'],test_size=0.3, random_state=1234)

ann_reg_pipeline.fit(X=x_train, y=y_train)

Finally making a prediction using the pipeline:

ann_reg_pipeline.predict(X=x_test)
RMSE = (sum((y_test-ann_reg_pipeline.predict(X=x_test))**2)/len(y_test))**(0.5)
print(RMSE)
2.167094347024997

All codes together:

path = 'https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data'

import pandas as pd
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import train_test_split

data = pd.read_csv(path, header=None)

columns = ['sex','length','diameter','height','whole_weight','shucked_weight','viscera_weight','shell_weight','rings']

data.columns = columns

data.head()

minmax_norm = MinMaxScaler()

class dummify(BaseEstimator, TransformerMixin):
    def __init__(self):         # No initialization
        pass
        # self.y = y

    def fit(self, X, y=None):   # Returns nothing essentially
        return self

    def transform(self, X, y=None):     # get_dummies() creates One-Hot-Encoding
        return pd.get_dummies(X)        # of categorical data only leaving numeric data

    def fit_transform(self, X, y=None):
        return self.fit(X,y).transform(X)

class normalize(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
        # self.y = y
    
    def fit(self, X, y=None):           # Fits MinMax() normalization
        return minmax_norm.fit(X)

    def transform(self, X, y=None):     # Transforms numeric data using MinMax norm
        return minmax_norm.transform(X)

    def fit_transform(self, X, y=None):
        return self.fit(X,y).transform(X)

ANN_reg = MLPRegressor(hidden_layer_sizes=(10,20),learning_rate='adaptive',
                       solver='adam',max_iter=2000)

ann_reg_pipeline = Pipeline([('dummy',dummify()),
                             ('norm',normalize()),
                             ('model',ANN_reg)])

x_train, x_test, y_train, y_test = train_test_split(data.loc[:,'sex':'shell_weight'],data['rings'],test_size=0.3, random_state=1234)

ann_reg_pipeline.fit(X=x_train,y=y_train)

ann_reg_pipeline.predict(X=x_test)

RMSE = (sum((y_test-ann_reg_pipeline.predict(X=x_test))**2)/len(y_test))**(0.5)
print(RMSE)

A more complex pipeline (automated document clustering):

In text mining, document clustering plays an important role. Document clustering involves a few important steps like document cleaning, creation of document-term matrix, truncated singular value decomposition and finally clustering.

Let us import the important packages:

from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer 
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
from tqdm import tqdm

Codes for TFIDF based document-term matrix creation:

tfidf = TfidfTransformer()                                                                                        tf = CountVectorizer(stop_words='english')
tsvd = TruncatedSVD(n_components=500)


class tfidf_transform(BaseEstimator, TransformerMixin): # TFIDF matrix creation
    def __init__(self, key): # Key is the column containing texts
        self.key = key
    
    def fit(self, X, y=None):
        return tfidf.fit(tf.fit_transform(X[self.key]))
    
    def transform(self,X, y=None):
        return tfidf.transform(tf.transform(X[self.key]))
    
    def fit_transform(self, X, y=None):
        return tfidf.fit_transform(tf.fit_transform(X[self.key]))

Codes for LSA using TFIDF matrix:

class lsa(BaseEstimator, TransformerMixin):             # Latent Semantic Analysis
    def __init__(self, dim_frac=0.5, threshold_diff_svd=0.5): # In SVD singular values are arranged in decreasing order. 'threshold_diff_svd' is kept to identify an elbow location from where singular values are tapered off.

        self.dim_frac = dim_frac
        self.threshold_diff_svd = threshold_diff_svd
    
    def fit(self, X, y=None):
        N = X.shape[1]//20
        tsvd = TruncatedSVD(n_components=N)
        tsvd.fit(X)
        S = tsvd.singular_values_
        d = self.threshold_diff_svd + 2
        i = 0
        while d >= self.threshold_diff_svd:
            d = S[i] - S[i+1]
            i += 1
        print("Number of components is {}".format(i))
        tsvd = TruncatedSVD(n_components=i)
        return tsvd.fit(X)
    
    def transform(self, X, y=None):
        return tsvd.transform(X)
    
    def fit_transform(self, X, y=None):
        return tsvd.fit_transform(X)

Codes for KMeans clustering:

class km_cluster(BaseEstimator, TransformerMixin):        # KMeans clustering
    def __init__(self, n_cluster = 'auto', scale= True):
        self.n_cluster = n_cluster
        self.scale = scale
    
    def fit(self, X, y=None):
        if self.scale:
            sd = StandardScaler()
            X1 = sd.fit_transform(X)
        else:
            X1 = X.copy()
        if self.n_cluster == 'auto':
            S = []
            for n in tqdm(range(2,10)):
                self.km = KMeans(n_clusters=n,n_init=10, max_iter=1000,n_jobs=-1)
                self.km.fit(X1)
                S.append(silhouette_score(X1, self.km.labels_))
            N_opt = np.argmax(np.array(S)) + 2
            print("Optimal number of clusters is {} with Silhouette Score {}".format(N_opt,max(S)))
            self.km = KMeans(n_clusters=N_opt,n_init=50, max_iter=1000,n_jobs=-1)
            return self.km.fit(X1)
        else:
            self.km = KMeans(n_clusters=self.n_cluster,n_init=50, max_iter=1000,n_jobs=-1)
            return self.km.fit(X1)
        
    def predict(self, X, y=None):
        if self.scale:
            sd = StandardScaler()
            X1 = sd.fit_transform(X)
        else:
            X1 = X.copy()
        return self.km.predict(X1)

Putting everything in a pipeline and final prediction:

review_cluster_pipeline = Pipeline([('tfidf',tfidf_transform(key='Review')), 
                                   ('lsa',lsa()),
                                   ('cluster',km_cluster())])

review_cluster_pipeline.fit(X=rev_data)

review_cluster_pipeline.predict(rev_data)

Hope you will find this article helpful.

If you have any suggestions, do let me know

Wednesday, 29 January 2020

Pipeline: Automating routine works in ML