Overview
- Concept of sklearn pipeline
- Designing pipelines
- Implementing pipelines for better results
Introduction
Machine learning demands a lot of work on data preparation, and the complexity of that work varies from one situation to another. A dataset may contain missing values, irregularities, and other issues such as high dimensionality. Modeling such data is difficult, and preparing it manually can easily take a considerable amount of time. In addition, several routine steps must be performed before the model makes its actual prediction. This is where pipelines come into the picture: a pipeline bundles these repetitive steps into a single object that can be called like any other estimator. The Pipeline class is available in the Scikit-Learn package and, if used properly, it can make code more readable and efficient.
In this article, we will see how to create a simple pipeline using the sklearn package. We will then build a slightly more complex pipeline for another task.
Pipelines
A pipeline, as the name suggests, is essentially a workflow that takes in input data and finally either builds a model, predicts an outcome, or simply returns transformed data. This is achieved through the Pipeline class, which takes the input data (for example, an array or a data frame) and passes it through a sequence of operators. Every intermediate operator is expected to have a “fit” and a “transform” method. The last operator can be a transformer (e.g. StandardScaler) or a predictor (e.g. SVM).
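As a minimal sketch of this idea, the following chains a StandardScaler with an SVM classifier. The iris dataset, the step names 'scale' and 'svm', and the split settings are assumptions chosen only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Small built-in dataset, used here only as a stand-in
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Intermediate step: a transformer (fit/transform); last step: a predictor
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("svm", SVC()),
])

# fit() fits the scaler on the training data, transforms it, then fits the SVM
pipe.fit(X_train, y_train)

# predict() applies the already fitted scaler before calling the SVM
predictions = pipe.predict(X_test)
accuracy = pipe.score(X_test, y_test)
print(accuracy)
```

The same object can then be passed around and reused wherever a single estimator is expected.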
The interesting part of using a pipeline is that users can supply a separate set of parameters for each of its operators. This makes a pipeline tunable as a whole through hyperparameter search tools such as GridSearchCV. Hence, it is preferable to use pipelines for ML work in Python.
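A short sketch of such tuning is given here. Parameters of a step are addressed as the step name, two underscores, and the parameter name. The dataset, the step names, and the grid values are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

pipe = Pipeline([("scale", StandardScaler()), ("svm", SVC())])

# "<step name>__<parameter>" routes each value to the right operator
param_grid = {
    "svm__C": [0.1, 1.0, 10.0],
    "svm__kernel": ["linear", "rbf"],
}

# The whole pipeline is refitted for every parameter combination and fold,
# so the scaler never sees the validation fold during fitting
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```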
Examples
Even though many examples are available on the Internet, it is always good to have more examples with explanations for better understanding. The first example is a simple one in which basic preprocessing is done before building a model.
Building an ANN model using pipeline
For this example, the Abalone dataset from the UCI Machine Learning Repository is used. The objective is to predict the age of an abalone by predicting its number of rings (the last variable). The dataset has only one categorical variable, i.e. sex. A snapshot is shown below.
To solve this problem, a few packages and classes need to be loaded into the Python environment, as shown below:
import pandas as pd
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import train_test_split
This is a regression problem, so a multilayer perceptron (MLP) regressor will be used. An MLP is sensitive to the scale of its inputs, so the data should be normalized before it is fed to the network. In this example, MinMax normalization is applied. Moreover, categorical data must be one-hot encoded. So, a simple pipeline for ANN regression will have three operators:
- One hot encoder (for categorical data)
- MinMax normalization / standardization (for numeric data)
- MLP regression model
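The three operators listed above can also be sketched with scikit-learn's built-in transformers instead of custom classes. This is an alternative shown for comparison, not the approach used in the rest of this article. The toy rows are made up for illustration, and the step names are assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Made-up toy rows that only mimic the shape of the Abalone data
toy = pd.DataFrame({
    "sex": ["M", "F", "I", "M", "F", "I"],
    "length": [0.45, 0.53, 0.33, 0.44, 0.55, 0.30],
    "rings": [15, 9, 7, 10, 11, 6],
})
X_toy, y_toy = toy[["sex", "length"]], toy["rings"]

# Operators 1 and 2: one-hot encode the categorical column,
# MinMax-scale the numeric column
preprocess = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["sex"]),
    ("minmax", MinMaxScaler(), ["length"]),
])

# Operator 3: the MLP regression model as the final step
model = Pipeline([
    ("prep", preprocess),
    ("mlp", MLPRegressor(hidden_layer_sizes=(10, 20), max_iter=2000,
                         random_state=0)),
])
model.fit(X_toy, y_toy)

# Three one-hot columns for 'sex' plus one scaled 'length' column
features = model.named_steps["prep"].transform(X_toy)
print(features.shape)
```

With handle_unknown="ignore", a category that appears only at prediction time is encoded as all zeros instead of raising an error.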
The code for the custom transformers is shown below:
class dummify(BaseEstimator, TransformerMixin):
    """One-hot encode categorical columns; numeric columns pass through."""

    def fit(self, X, y=None):
        # Nothing to learn; return self so the step can be chained
        return self

    def transform(self, X, y=None):
        # get_dummies() one-hot encodes only the categorical columns.
        # It assumes every category seen here also appeared in training.
        return pd.get_dummies(X, dtype=float)


class normalize(BaseEstimator, TransformerMixin):
    """MinMax normalization that keeps its own fitted scaler."""

    def fit(self, X, y=None):
        # Learn the column ranges from the training data only
        self.scaler_ = MinMaxScaler()
        self.scaler_.fit(X)
        return self

    def transform(self, X, y=None):
        # Scale the data using the ranges learned in fit()
        return self.scaler_.transform(X)

# fit_transform() is inherited from TransformerMixin for both classes
The next task is to combine the ANN regression model with the above transformers. With the Pipeline class in sklearn, this is really simple: one only needs to put instances of these classes in a sequence such that the last instance is the model. The code below shows how to create the pipeline.
ANN_reg = MLPRegressor(hidden_layer_sizes=(10, 20), learning_rate='adaptive',
                       solver='adam', max_iter=2000)
ann_reg_pipeline = Pipeline([('dummy', dummify()),
                             ('norm', normalize()),
                             ('model', ANN_reg)])
Once the pipeline is created, training the final model is very easy. A pipeline has both ‘fit’ and ‘predict’ methods, which train the model inside the pipeline and make predictions with the trained model. The code below splits the dataset into a training set and a test set and uses the training set to fit the pipeline.
x_train, x_test, y_train, y_test = train_test_split(
    data.loc[:, 'sex':'shell_weight'], data['rings'],
    test_size=0.3, random_state=1234)
ann_reg_pipeline.fit(X=x_train, y=y_train)
Finally, a prediction is made on the test data using the pipeline:
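y_pred = ann_reg_pipeline.predict(X=x_test)
RMSE = (sum((y_test - y_pred)**2) / len(y_test))**(0.5)
print(RMSE)
In the original run this printed 2.167094347024997. Since no random_state is set for the MLP, the value will differ slightly from run to run.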
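The RMSE used here is the square root of the mean squared difference between the actual and predicted values. A small sketch with made-up numbers shows that the manual formula agrees with scikit-learn's mean_squared_error. The arrays are illustrative and are not taken from the Abalone data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up values, for illustration only
y_true = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.0, 5.0, 9.0])

# Manual RMSE: square the errors, average them, take the square root
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# Same quantity from scikit-learn's mean squared error
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))

print(rmse_manual, rmse_sklearn)
```

The errors are 1, 0 and -2, so the mean squared error is 5/3 and both values equal its square root.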
All the code together:
path = 'https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data'

import pandas as pd
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import train_test_split

# Load the data; the file has no header row, so column names are set by hand
data = pd.read_csv(path, header=None)
columns = ['sex', 'length', 'diameter', 'height', 'whole_weight',
           'shucked_weight', 'viscera_weight', 'shell_weight', 'rings']
data.columns = columns
data.head()


class dummify(BaseEstimator, TransformerMixin):
    """One-hot encode categorical columns; numeric columns pass through."""

    def fit(self, X, y=None):
        # Nothing to learn; return self so the step can be chained
        return self

    def transform(self, X, y=None):
        # get_dummies() one-hot encodes only the categorical columns.
        # It assumes every category seen here also appeared in training.
        return pd.get_dummies(X, dtype=float)


class normalize(BaseEstimator, TransformerMixin):
    """MinMax normalization that keeps its own fitted scaler."""

    def fit(self, X, y=None):
        # Learn the column ranges from the training data only
        self.scaler_ = MinMaxScaler()
        self.scaler_.fit(X)
        return self

    def transform(self, X, y=None):
        # Scale the data using the ranges learned in fit()
        return self.scaler_.transform(X)


# Model and pipeline: one-hot encoding, normalization, then the MLP
ANN_reg = MLPRegressor(hidden_layer_sizes=(10, 20), learning_rate='adaptive',
                       solver='adam', max_iter=2000)
ann_reg_pipeline = Pipeline([('dummy', dummify()),
                             ('norm', normalize()),
                             ('model', ANN_reg)])

# Train on 70% of the data and evaluate on the remaining 30%
x_train, x_test, y_train, y_test = train_test_split(
    data.loc[:, 'sex':'shell_weight'], data['rings'],
    test_size=0.3, random_state=1234)
ann_reg_pipeline.fit(X=x_train, y=y_train)

y_pred = ann_reg_pipeline.predict(X=x_test)
RMSE = (sum((y_test - y_pred)**2) / len(y_test))**(0.5)
print(RMSE)
A more complex pipeline (automated document clustering):
In text mining, document clustering plays an important role. It involves a few important steps: document cleaning, creation of a document-term matrix, truncated singular value decomposition, and finally clustering.
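Before building custom classes, the steps named above can be sketched with built-in scikit-learn components. This is a simplified stand-in for the custom pipeline in this section: the six sentences are made up, and the number of SVD components and clusters is fixed by hand as an assumption.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Made-up documents, for illustration only
docs = [
    "the battery life of this phone is great",
    "great phone with a long lasting battery",
    "battery drains fast on this phone",
    "the pasta at this restaurant was delicious",
    "delicious food and friendly restaurant staff",
    "the restaurant served cold pasta",
]

doc_pipeline = Pipeline([
    # Cleaning (stop-word removal) and the TF-IDF document-term matrix
    ("tfidf", TfidfVectorizer(stop_words="english")),
    # Truncated SVD reduces the sparse matrix to a few dense dimensions
    ("svd", TruncatedSVD(n_components=2, random_state=0)),
    # Clustering in the reduced space
    ("kmeans", KMeans(n_clusters=2, n_init=10, random_state=0)),
])

# fit_predict() is available because the last step, KMeans, provides it
labels = doc_pipeline.fit_predict(docs)
print(labels)
```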
Let us import the important packages:
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
from tqdm import tqdm
Code for TF-IDF based document-term matrix creation:
class tfidf_transform(BaseEstimator, TransformerMixin):
    # TF-IDF matrix creation

    def __init__(self, key):
        # key is the column containing the texts
        self.key = key

    def fit(self, X, y=None):
        # Learn the vocabulary, then the IDF weights, from the training texts
        self.tf_ = CountVectorizer(stop_words='english')
        self.tfidf_ = TfidfTransformer()
        self.tfidf_.fit(self.tf_.fit_transform(X[self.key]))
        return self

    def transform(self, X, y=None):
        return self.tfidf_.transform(self.tf_.transform(X[self.key]))
Code for LSA using the TF-IDF matrix:
class lsa(BaseEstimator, TransformerMixin):
    # Latent Semantic Analysis

    def __init__(self, dim_frac=0.5, threshold_diff_svd=0.5):
        # Singular values come out of the SVD in decreasing order.
        # 'threshold_diff_svd' identifies the elbow from where they taper off.
        # 'dim_frac' is stored but not used in the code below.
        self.dim_frac = dim_frac
        self.threshold_diff_svd = threshold_diff_svd

    def fit(self, X, y=None):
        # First pass: a generous SVD to inspect the singular values
        N = X.shape[1] // 20
        S = TruncatedSVD(n_components=N).fit(X).singular_values_
        d = self.threshold_diff_svd + 2
        i = 0
        # Stop at the first small gap, or when the singular values run out
        while d >= self.threshold_diff_svd and i < len(S) - 1:
            d = S[i] - S[i+1]
            i += 1
        i = max(i, 1)
        print("Number of components is {}".format(i))
        # Second pass: keep the fitted SVD on the instance for transform()
        self.tsvd_ = TruncatedSVD(n_components=i)
        self.tsvd_.fit(X)
        return self

    def transform(self, X, y=None):
        return self.tsvd_.transform(X)
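The elbow rule used in the LSA step can be looked at in isolation: walk down the sorted singular values and stop at the first gap smaller than the threshold. The helper name components_before_elbow and the sample values are hypothetical and serve only to illustrate the rule.

```python
import numpy as np


def components_before_elbow(singular_values, threshold=0.5):
    """Count the leading singular values up to the first gap below threshold.

    If no gap is smaller than the threshold, all values are kept.
    """
    s = np.asarray(singular_values, dtype=float)
    gaps = s[:-1] - s[1:]                     # S[i] - S[i+1], all >= 0
    below = np.flatnonzero(gaps < threshold)  # positions of the small gaps
    if below.size == 0:
        return len(s)
    return int(below[0]) + 1


# Gaps are 4, 3, 0.2, 0.1: the first gap below 0.5 is the third one
print(components_before_elbow([10, 6, 3, 2.8, 2.7], threshold=0.5))  # 3
```

A larger threshold stops earlier and keeps fewer components, while a smaller one keeps more.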
Code for KMeans clustering:
class km_cluster(BaseEstimator, TransformerMixin):
    # KMeans clustering

    def __init__(self, n_cluster='auto', scale=True):
        self.n_cluster = n_cluster
        self.scale = scale

    def fit(self, X, y=None):
        if self.scale:
            # Keep the fitted scaler so predict() reuses the same scaling
            self.sd_ = StandardScaler()
            X1 = self.sd_.fit_transform(X)
        else:
            X1 = X.copy()
        if self.n_cluster == 'auto':
            # Try 2 to 9 clusters and keep the best silhouette score
            S = []
            for n in tqdm(range(2, 10)):
                km = KMeans(n_clusters=n, n_init=10, max_iter=1000)
                km.fit(X1)
                S.append(silhouette_score(X1, km.labels_))
            N_opt = int(np.argmax(np.array(S))) + 2
            print("Optimal number of clusters is {} with Silhouette Score {}".format(N_opt, max(S)))
        else:
            N_opt = self.n_cluster
        # n_jobs is not passed: KMeans no longer accepts it in scikit-learn 1.0+
        self.km = KMeans(n_clusters=N_opt, n_init=50, max_iter=1000)
        self.km.fit(X1)
        return self

    def predict(self, X, y=None):
        # Apply the scaling learned in fit() instead of refitting on new data
        X1 = self.sd_.transform(X) if self.scale else X.copy()
        return self.km.predict(X1)
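The silhouette-based choice of the number of clusters can be tried on synthetic data where the answer is known. In this sketch the four blob centers, the sample size, and the random seeds are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Four well-separated synthetic blobs, so the expected answer is 4 clusters
centers = [[-5, -5], [-5, 5], [5, -5], [5, 5]]
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.5,
                  random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Score every candidate number of clusters from 2 to 9
candidates = range(2, 10)
scores = []
for k in candidates:
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_scaled)
    scores.append(silhouette_score(X_scaled, labels))

# The candidate with the highest silhouette score is selected
best_k = candidates[int(np.argmax(scores))]
print(best_k, max(scores))
```

On real text data the silhouette curve is usually much flatter than on such clean blobs, so the selected number is best treated as a starting point.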
Putting everything in a pipeline and making the final prediction:
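# rev_data is expected to be a pandas DataFrame with a text column named
# 'Review'; it is not defined in this article.
review_cluster_pipeline = Pipeline([('tfidf', tfidf_transform(key='Review')),
                                    ('lsa', lsa()),
                                    ('cluster', km_cluster())])
review_cluster_pipeline.fit(X=rev_data)
review_cluster_pipeline.predict(rev_data)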
I hope you find this article helpful.
If you have any suggestions, do let me know.