Sunday, 2 August 2020

Understanding gradient descent

In machine learning, numerical optimization plays a critical role. Take the case of simple logistic regression. In logistic regression, model parameters are obtained using the MLE concept (primarily). But, parameter estimation is not as simple as in the case of linear regression. This happens because, at one step of this estimation process, equations of parameters are formed which are not in closed form. In such cases, we are bound to use numerical methods to get plausible solutions. Similarly, when we try to minimize a loss function in an ML algorithm, in most cases, the loss function turns out to be non-convex. In those situations also we need to adopt numerical optimization techniques. In literature, we find several numerical optimization techniques with relative merits and demerits. But, when speedy execution is needed, the most common method of optimization is gradient-based methods. If looked carefully, the entire set of deep learning architectures is mainly relying on some or other form of gradient-based optimization techniques. Hence, in this post, two such basic but important gradient-based optimization techniques are going to be discussed. These techniques are:

  • Steepest Descent with variable step size
  • Gradient Descent with, mostly, constant step size

To understand the methods rather easily, we would look at a 3D surface which is essentially convex in nature. Let us assume that a 3D surface is created by the function:
$$ f(x,y) = 2x^2 + 5y^2 - 1000 $$
This is a very simple function that is convex in nature and it has a global minimum at x=y=0. The function, if plotted will look like a parabolic surface as shown below.


In the case of gradient descent, we need to first calculate the gradient of the function at the current point. A gradient is, essentially, a vector which stays perpendicular to the surface at the point where the gradient is calculated. It points towards the direction of movement which would maximize the function value in the neighborhood location. Let us find the gradient vector first for the given surface. 
$$\frac{\partial f(x,y)}{\partial x} = 4x$$
$$\frac{\partial f(x,y)}{\partial y} = 10y$$
$$\nabla = [4x, 10y]_{x=x^{(c)}, y=y^{(c)}}$$
The same surface can be shown as a contour plot and when the gradient is calculated, it points in the outward direction as shown in the figure below. The gradient is calculated at [5.05, 5.45]

If we move along the direction of the arrow, the value of the function will increase. But we need to decrease the value. Hence, instead of moving along the direction of the gradient, we need to move in the opposite direction. That is why the term gradient descent is given. So, now we know, from the initial starting point, along which direction we should move to lower the value of the function. Look at the figure below:

Steepest Descent
As we move along the line in the opposite direction of the gradient, we can keep calculating the gradients and find the respective gradient vectors at those points. It would be seen that the angle between the original gradient vector and the other gradient vectors, as calculated along the line, would increase continuously. Have a look at the movement of gradient vectors as we move along the line.


From the concept of vector algebra, we know that two vectors would be independent of each other if they are perpendicular to each other. This is where the line search comes in. Along this line, it is possible to find a particular point where the gradient vector will become exactly perpendicular to the previous gradient vector. Why is this point important? It is because, till this point, if we proceed, we would get the maximum reduction of the function. Beyond this point, if we proceed, the amount of reduction will decrease. Line search is an iterative process because finding the exact location of the next point where the gradient is absolutely perpendicular to the previous gradient is extremely expensive computationally. Hence, approximate methods are used. In 1969, Peter Wolfe suggested two conditions to find the step length.
$$f(x_i+\alpha_ip_i)<f(x_i)+c\alpha \nabla f_i^Tp_i$$
$$\nabla f(x_i+\alpha p_i)^Tp_i\geq f(x_i)+c_1\nabla f_i^Tp_i$$
In the above conditions, c1>c but both c and c1 are less than 1. The hyperparameter c is chosen to be quite small. $p_i$ is the direction of the step. If the loss minimization is done by Newton Method, this condition serves well. Goldstein, proposed another condition which can be used in line search. The condition is:
$$f(x_i)+(1-c)\alpha_i \nabla f_i^Tp_i \leq f(x_i+\alpha_ip_i)\leq f(x_i)+c\alpha_i \nabla f_i^Tp_i$$

Thus, in deepest descent, both gradient and step size are required to be calculated. If the function to be minimized is a simple function as in the present case, deepest descent will quickly reach the minima within a few iterations. The solution can be seen in the figure below.

But in case of functions having several local minima, this method will become slow and will follow a very zigzag pattern before reaching the solution. That is why we usually keep the step size constant and small (we call it learning rate) so that the descent part becomes much more smoother. Equation of gradient descent is given below:
$$\theta^n = \theta^o - \eta \nabla L$$
where $\eta$ is the learning rate and $\nabla L$ is the gradient at the current location. In this problem, when eta is kept low (at 0.02) the gradient descent reached the minima in a very smooth manner as shown in the figure.

In deep learning or in any complex machine learning algorithm, if gradient descent is used (e.g. GBM, LightGBM, Xgboost etc.), the deepest descent is usually avoided because the error surface is very uneven and, in many situations, are ill-conditioned. In those situations, gradient descent with (mostly) static learning rate used. Particularly in Deep Learning, different modified optimizers are used which uses gradient information a lot (e.g. Rmsprop, Adam, Adamax etc.). 

I hope you will find this post informative.

Wednesday, 29 January 2020

Pipeline: Automating routine works in ML


Overview

  • Concept of sklearn pipeline
  • Designing pipelines
  • Implementing pipelines for better results

Introduction


Machine learning demands quite a lot of works on preparing the data. The complexity of the task varies from situation to situation. A dataset may contain missing values, irregularities and other issues such as high dimensionality of data. Data modeling, in such situations, is difficult and data preparation is a task that can easily take away a considerable amount of time if done manually. Not only that, there are several routine works that are required to be performed before the actual prediction is done by the model. This is where the concept of pipelines comes into the picture because pipelines can be called just like normal API calls and each pipeline can perform repetitive tasks efficiently. Pipeline is available in the Scikit-Learn package and if used properly, it can make the codes more readable and efficient.


In this article, we would see how we can create a simple pipeline using the sklearn package. Then we would also see how to build a slightly more complex pipeline for other tasks.
Pipelines

A pipeline, as the name suggests, is essentially a workflow process that takes in input data and finally, either builds a model or predicts the outcome simply sends out a transformed data. This is achieved by means of a base class that essentially takes in a data frame as input and in all intermediate operations, it tries to perform intermediate operations assuming that each operator has a “fit” and a “transform” method. The last operator can be a transformer (e.g. StandardScaler)or a predictor (e.g. SVM).


The interesting part of using the pipeline is that users can supply separate sets of parameters for all of its intermediate operators. This makes pipelines trainable through hyperparameter tuning operators such as GridSearchCV. Hence, it is preferable to use pipelines in ML while working with python.

Examples

Even though many examples are available on the Internet, it is always good to have more examples with explanations for better understanding. The first example is a simple one where simple preprocessing is done prior to building a model.

Building an ANN model using pipeline

For this example, the Abalone dataset is used from UCI Machine Learning repository. Here, the objective is to predict the age of abalone by predicting the number of rings (the last variable). The dataset has only one categorical variable, i.e. sex. A snapshot is shown below.


To solve this problem, a few packages and classes are required to be loaded in the python environment and loading of those packages is shown below:

import pandas as pd
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import train_test_split

This is a regression problem and hence multilayer perceptron (MLP) will be used for regression. However, MLP requires that the inputs are normalized. Hence, prior to feeding the data to MLP architecture, the data is required to be normalized. In this example, MinMax normalization is applied. Moreover, categorical data are required to be one hot encoded. So, a simple pipeline for ANN regression will have 3 operators:
  1. One hot encoder (for categorical data)
  2. MinMax normalization / standardization (for numeric data)
  3. MLP regression model
The codes pertaining to custom transformers are shown below:
class dummify(BaseEstimator, TransformerMixin):
    def __init__(self):         # No initialization
        pass
        # self.y = y

    def fit(self, X, y=None):   # Returns nothing essentially
        return self

    def transform(self, X, y=None): # get_dummies() creates One-Hot-Encoding
        return pd.get_dummies(X)    # of categorical data only leaving numeric data

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)

class normalize(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
        # self.y = y
    
    def fit(self, X, y=None):           # Fits MinMax() normalization 
        return minmax_norm.fit(X)

    def transform(self, X, y=None):     # Transforms numeric data using MinMax norm
        return minmax_norm.transform(X)

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)

The next task is to combine the ANN regression method with the above transformers. Using the pipeline class in sklearn, it is really simple. One needs to simply put instances of these classes in a sequence such that the last instance belongs to the model’s instance. The code below shows the way of creating a pipeline.
ANN_reg = MLPRegressor(hidden_layer_sizes=(10,20),learning_rate='adaptive',
                       solver='adam',max_iter=2000)

ann_reg_pipeline = Pipeline([('dummy',dummify()),
                             ('norm',normalize()),
                             ('model',ANN_reg)])

Once the pipeline is created, training the final model is very easy. A pipeline has both ‘fit’ and ‘predict’ method to train the model inside the pipeline and to make predictions based on the trained model. The code below shows how to split the dataset into a training dataset and test dataset and use the training dataset to train the pipeline and use the pipeline for making predictions for the test data.
x_train, x_test, y_train, y_test = train_test_split(data.loc[:,'sex':'shell_weight'],data['rings'],test_size=0.3, random_state=1234)

ann_reg_pipeline.fit(X=x_train, y=y_train)

Finally making a prediction using the pipeline:

ann_reg_pipeline.predict(X=x_test)
RMSE = (sum((y_test-ann_reg_pipeline.predict(X=x_test))**2)/len(y_test))**(0.5)
print(RMSE)
2.167094347024997

All codes together:

path = 'https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data'

import pandas as pd
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import train_test_split

data = pd.read_csv(path, header=None)

columns = ['sex','length','diameter','height','whole_weight','shucked_weight','viscera_weight','shell_weight','rings']

data.columns = columns

data.head()

minmax_norm = MinMaxScaler()

class dummify(BaseEstimator, TransformerMixin):
    def __init__(self):         # No initialization
        pass
        # self.y = y

    def fit(self, X, y=None):   # Returns nothing essentially
        return self

    def transform(self, X, y=None):     # get_dummies() creates One-Hot-Encoding
        return pd.get_dummies(X)        # of categorical data only leaving numeric data

    def fit_transform(self, X, y=None):
        return self.fit(X,y).transform(X)

class normalize(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
        # self.y = y
    
    def fit(self, X, y=None):           # Fits MinMax() normalization
        return minmax_norm.fit(X)

    def transform(self, X, y=None):     # Transforms numeric data using MinMax norm
        return minmax_norm.transform(X)

    def fit_transform(self, X, y=None):
        return self.fit(X,y).transform(X)

ANN_reg = MLPRegressor(hidden_layer_sizes=(10,20),learning_rate='adaptive',
                       solver='adam',max_iter=2000)

ann_reg_pipeline = Pipeline([('dummy',dummify()),
                             ('norm',normalize()),
                             ('model',ANN_reg)])

x_train, x_test, y_train, y_test = train_test_split(data.loc[:,'sex':'shell_weight'],data['rings'],test_size=0.3, random_state=1234)

ann_reg_pipeline.fit(X=x_train,y=y_train)

ann_reg_pipeline.predict(X=x_test)

RMSE = (sum((y_test-ann_reg_pipeline.predict(X=x_test))**2)/len(y_test))**(0.5)
print(RMSE)

A more complex pipeline (automated document clustering):

In text mining, document clustering plays an important role. Document clustering involves a few important steps like document cleaning, creation of document-term matrix, truncated singular value decomposition and finally clustering.

Let us import the important packages:

from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer 
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
from tqdm import tqdm

Codes for TFIDF based document-term matrix creation:

tfidf = TfidfTransformer()                                                                                        tf = CountVectorizer(stop_words='english')
tsvd = TruncatedSVD(n_components=500)


class tfidf_transform(BaseEstimator, TransformerMixin): # TFIDF matrix creation
    def __init__(self, key): # Key is the column containing texts
        self.key = key
    
    def fit(self, X, y=None):
        return tfidf.fit(tf.fit_transform(X[self.key]))
    
    def transform(self,X, y=None):
        return tfidf.transform(tf.transform(X[self.key]))
    
    def fit_transform(self, X, y=None):
        return tfidf.fit_transform(tf.fit_transform(X[self.key]))

Codes for LSA using TFIDF matrix:

class lsa(BaseEstimator, TransformerMixin):             # Latent Semantic Analysis
    def __init__(self, dim_frac=0.5, threshold_diff_svd=0.5): # In SVD singular values are arranged in decreasing order. 'threshold_diff_svd' is kept to identify an elbow location from where singular values are tapered off.

        self.dim_frac = dim_frac
        self.threshold_diff_svd = threshold_diff_svd
    
    def fit(self, X, y=None):
        N = X.shape[1]//20
        tsvd = TruncatedSVD(n_components=N)
        tsvd.fit(X)
        S = tsvd.singular_values_
        d = self.threshold_diff_svd + 2
        i = 0
        while d >= self.threshold_diff_svd:
            d = S[i] - S[i+1]
            i += 1
        print("Number of components is {}".format(i))
        tsvd = TruncatedSVD(n_components=i)
        return tsvd.fit(X)
    
    def transform(self, X, y=None):
        return tsvd.transform(X)
    
    def fit_transform(self, X, y=None):
        return tsvd.fit_transform(X)

Codes for KMeans clustering:

class km_cluster(BaseEstimator, TransformerMixin):        # KMeans clustering
    def __init__(self, n_cluster = 'auto', scale= True):
        self.n_cluster = n_cluster
        self.scale = scale
    
    def fit(self, X, y=None):
        if self.scale:
            sd = StandardScaler()
            X1 = sd.fit_transform(X)
        else:
            X1 = X.copy()
        if self.n_cluster == 'auto':
            S = []
            for n in tqdm(range(2,10)):
                self.km = KMeans(n_clusters=n,n_init=10, max_iter=1000,n_jobs=-1)
                self.km.fit(X1)
                S.append(silhouette_score(X1, self.km.labels_))
            N_opt = np.argmax(np.array(S)) + 2
            print("Optimal number of clusters is {} with Silhouette Score {}".format(N_opt,max(S)))
            self.km = KMeans(n_clusters=N_opt,n_init=50, max_iter=1000,n_jobs=-1)
            return self.km.fit(X1)
        else:
            self.km = KMeans(n_clusters=self.n_cluster,n_init=50, max_iter=1000,n_jobs=-1)
            return self.km.fit(X1)
        
    def predict(self, X, y=None):
        if self.scale:
            sd = StandardScaler()
            X1 = sd.fit_transform(X)
        else:
            X1 = X.copy()
        return self.km.predict(X1)

Putting everything in a pipeline and final prediction:


review_cluster_pipeline = Pipeline([('tfidf',tfidf_transform(key='Review')), 
                                   ('lsa',lsa()),
                                   ('cluster',km_cluster())])

review_cluster_pipeline.fit(X=rev_data)

review_cluster_pipeline.predict(rev_data)

Hope you will find this article helpful.

If you have any suggestions, do let me know

Maximum Likelihood Estimation (MLE): An important statistical tool for parameter estimation

Parameter estimation is critical for learning patterns within the data. Before the advancements in computation power, researchers used to do...