When training your models, there are quite a few optimisers to use. Since many researchers adopt Adam optimiser, there are also reports the instability of the optimiser in some cases. Nevertheless, it is worth to note that Adam is a variant of the basic stochastic gradient descent (SGD), which turn, is a variant of the famous gradient descent optimization algorithm.
Gradient descent
GD is a first-order iterative optimization algorithm for finding the minimum of a function. To find a local minimum of a function using GD, one takes steps proportional to the negative of the gradient. (or approximate gradient of the function at the current point).
The algorithm
Gradient descent is based on the observation that if the multi-variable function F(x)
is defined and differentiable in a neighbourhood of a point a
, then F(x) decreases fastest if one goes from a
in the opposite direction of the gradient of F at a
, i.e. negative gradient:
It follows that if
Example of a single variable function with python:
With GD, we can approximate the value.
next_x = 6 # We start the search at x=6
gamma = 0.01 # Step size multiplier
precision = 0.00001 # Desired precision of result
max_iters = 10000 # Maximum number of iterations
# Derivative function
def df(x):
return 4 * x**3 - 9 * x**2
for _i in range(max_iters):
current_x = next_x
next_x = current_x - gamma * df(current_x) #update with negative gradient
step = next_x - current_x
if abs(step) <= precision:
break
print("Minimum at ", next_x)
# The output for the above will be something like
# "Minimum at 2.2499646074278457"
Gradient descent with jupyter
I demonstrate the use of GD algorithm while optimizing a loss function while training the classification problem.
Conclusion
Important general paradigm when
- continuously parameterized hypothesis
- the error can be differentiated with respect to the hypothesis parameters
The key practical problems are:
- converging to a local minimum can be quite slow
- if there are multiple local minima, then there is no guarantee that the procedure will find the global minimum. (Notice: The gradient descent algorithm can work with other error definitions and will not have a global minimum. If we use the sum of squares error, this is not a problem.)