The development of any machine learning model involves a rigorous experimental process that follows the idea-experimentation-evaluation cycle.
This cycle is repeated several times until satisfactory performance is achieved. The “experimentation” phase involves both the coding and training stages of the machine learning model. As models become more complex and are trained on ever-larger datasets, training time inevitably lengthens. As a result, training a large deep neural network can be extremely slow.
Fortunately for data science practitioners, there are several techniques to speed up the training process, including:
- Transfer learning.
- Weight initialization strategies such as Glorot or He initialization.
- Batch normalization.
- A well-chosen activation function.
- A faster optimizer.
Although all of these techniques are important, in this article I will focus on the last point. I will describe several algorithms for optimizing neural network parameters, highlighting both their advantages and their limitations.
In the last section of this article, I will present a visualization comparing the optimization algorithms discussed.
For practical implementation, all the code used in this article is available in this GitHub repository:
Traditionally, batch gradient descent is considered the default optimization method for neural networks.
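To make the baseline concrete, here is a minimal sketch of batch gradient descent on a toy linear regression problem; the dataset, learning rate, and iteration count are illustrative assumptions of mine, not taken from the article's repository:

```python
import numpy as np

# Toy data: 100 samples, 2 features, noise-free targets
# generated from known weights [3, -2] and bias 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([3.0, -2.0]) + 1.0

Xb = np.hstack([np.ones((100, 1)), X])  # prepend a bias column
w = np.zeros(3)                         # parameters: [bias, w1, w2]
lr = 0.1                                # illustrative learning rate

for _ in range(500):
    # Gradient of the mean squared error computed over the FULL batch —
    # this is what makes it *batch* (as opposed to stochastic) descent.
    grad = (2 / len(y)) * Xb.T @ (Xb @ w - y)
    w -= lr * grad

print(np.round(w, 2))  # ≈ [ 1.  3. -2.]
```

Because every update uses the entire dataset, each step is stable but expensive, which is exactly why the faster optimizers discussed later were developed.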