Sunday, 2 August 2020

Understanding gradient descent

In machine learning, numerical optimization plays a critical role. Take the case of simple logistic regression. In logistic regression, model parameters are obtained using the MLE concept (primarily). But, parameter estimation is not as simple as in the case of linear regression. This happens because, at one step of this estimation process, equations of parameters are formed which are not in closed form. In such cases, we are bound to use numerical methods to get plausible solutions. Similarly, when we try to minimize a loss function in an ML algorithm, in most cases, the loss function turns out to be non-convex. In those situations also we need to adopt numerical optimization techniques. In literature, we find several numerical optimization techniques with relative merits and demerits. But, when speedy execution is needed, the most common method of optimization is gradient-based methods. If looked carefully, the entire set of deep learning architectures is mainly relying on some or other form of gradient-based optimization techniques. Hence, in this post, two such basic but important gradient-based optimization techniques are going to be discussed. These techniques are:

  • Steepest Descent with variable step size
  • Gradient Descent with, mostly, constant step size

To understand the methods rather easily, we would look at a 3D surface which is essentially convex in nature. Let us assume that a 3D surface is created by the function:
$$ f(x,y) = 2x^2 + 5y^2 - 1000 $$
This is a very simple function that is convex in nature and it has a global minimum at x=y=0. The function, if plotted will look like a parabolic surface as shown below.


In the case of gradient descent, we need to first calculate the gradient of the function at the current point. A gradient is, essentially, a vector which stays perpendicular to the surface at the point where the gradient is calculated. It points towards the direction of movement which would maximize the function value in the neighborhood location. Let us find the gradient vector first for the given surface. 
$$\frac{\partial f(x,y)}{\partial x} = 4x$$
$$\frac{\partial f(x,y)}{\partial y} = 10y$$
$$\nabla = [4x, 10y]_{x=x^{(c)}, y=y^{(c)}}$$
The same surface can be shown as a contour plot and when the gradient is calculated, it points in the outward direction as shown in the figure below. The gradient is calculated at [5.05, 5.45]

If we move along the direction of the arrow, the value of the function will increase. But we need to decrease the value. Hence, instead of moving along the direction of the gradient, we need to move in the opposite direction. That is why the term gradient descent is given. So, now we know, from the initial starting point, along which direction we should move to lower the value of the function. Look at the figure below:

Steepest Descent
As we move along the line in the opposite direction of the gradient, we can keep calculating the gradients and find the respective gradient vectors at those points. It would be seen that the angle between the original gradient vector and the other gradient vectors, as calculated along the line, would increase continuously. Have a look at the movement of gradient vectors as we move along the line.


From the concept of vector algebra, we know that two vectors would be independent of each other if they are perpendicular to each other. This is where the line search comes in. Along this line, it is possible to find a particular point where the gradient vector will become exactly perpendicular to the previous gradient vector. Why is this point important? It is because, till this point, if we proceed, we would get the maximum reduction of the function. Beyond this point, if we proceed, the amount of reduction will decrease. Line search is an iterative process because finding the exact location of the next point where the gradient is absolutely perpendicular to the previous gradient is extremely expensive computationally. Hence, approximate methods are used. In 1969, Peter Wolfe suggested two conditions to find the step length.
$$f(x_i+\alpha_ip_i)<f(x_i)+c\alpha \nabla f_i^Tp_i$$
$$\nabla f(x_i+\alpha p_i)^Tp_i\geq f(x_i)+c_1\nabla f_i^Tp_i$$
In the above conditions, c1>c but both c and c1 are less than 1. The hyperparameter c is chosen to be quite small. $p_i$ is the direction of the step. If the loss minimization is done by Newton Method, this condition serves well. Goldstein, proposed another condition which can be used in line search. The condition is:
$$f(x_i)+(1-c)\alpha_i \nabla f_i^Tp_i \leq f(x_i+\alpha_ip_i)\leq f(x_i)+c\alpha_i \nabla f_i^Tp_i$$

Thus, in deepest descent, both gradient and step size are required to be calculated. If the function to be minimized is a simple function as in the present case, deepest descent will quickly reach the minima within a few iterations. The solution can be seen in the figure below.

But in case of functions having several local minima, this method will become slow and will follow a very zigzag pattern before reaching the solution. That is why we usually keep the step size constant and small (we call it learning rate) so that the descent part becomes much more smoother. Equation of gradient descent is given below:
$$\theta^n = \theta^o - \eta \nabla L$$
where $\eta$ is the learning rate and $\nabla L$ is the gradient at the current location. In this problem, when eta is kept low (at 0.02) the gradient descent reached the minima in a very smooth manner as shown in the figure.

In deep learning or in any complex machine learning algorithm, if gradient descent is used (e.g. GBM, LightGBM, Xgboost etc.), the deepest descent is usually avoided because the error surface is very uneven and, in many situations, are ill-conditioned. In those situations, gradient descent with (mostly) static learning rate used. Particularly in Deep Learning, different modified optimizers are used which uses gradient information a lot (e.g. Rmsprop, Adam, Adamax etc.). 

I hope you will find this post informative.

EM Algorithm and its usage (Part 2) EM algorithm is discussed in the previous post related to the tossing of coins. The same algorithm is q...