In deep learning, optimizers are the unsung heroes that guide models toward better performance. Whether you’re training a small neural network or a massive transformer, selecting the right optimizer is crucial. This post explores four widely used optimizers (SGD, Momentum, RMSprop, and Adam) and how each adapts parameter updates for effective learning.
Stochastic Gradient Descent (SGD)
Stochastic Gradient Descent, or SGD for short, is the most basic optimizer; think of it as the vanilla optimizer. It updates parameters based on the gradient of the loss function computed on each mini-batch.
w_{t+1} = w_t − α · ∇f(w_t)
Where:
- α (alpha) is the learning rate.
- ∇f(w_t) is the gradient of the loss at step t.
These two terms together form the step_size.
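To make the update concrete, here is a minimal sketch of one SGD step in NumPy (the helper `sgd_step` and the toy objective are illustrative, not taken from any library):

```python
import numpy as np

# Illustrative helper, not a library API: one plain SGD step.
def sgd_step(w, grad, lr=0.01):
    # w_{t+1} = w_t - alpha * gradient
    return w - lr * grad

# Toy usage: minimize f(w) = w[0]**2 + w[1]**2, whose gradient is 2 * w.
w = np.array([1.0, -2.0])
for _ in range(100):
    w = sgd_step(w, 2 * w, lr=0.1)
print(w)  # very close to [0, 0]
```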
While simple and effective, SGD often struggles in complex loss landscapes: it oscillates in directions with steep gradients and converges slowly in flat regions.
To address these challenges, other optimizers tweak the step_size components, either α, ∇f(w_t), or both, each according to its own approach to the problem.
Momentum: Smoothing the Ride
SGD often struggles in complex loss landscapes, oscillating in steep directions and crawling through flat regions. Momentum addresses this by smoothing updates, much like adding inertia to a moving object.
Instead of directly using the gradient at each step, Momentum builds an exponentially weighted moving average of past gradients:
m_t = β_1 · m_{t−1} + (1 − β_1) · ∇f(w_t)
where
- β_1: momentum coefficient (typically 0.9).
and the weight update becomes
w_{t+1} = w_t − α · m_t
By relying on past gradients, Momentum reduces oscillations and accelerates convergence in valleys, helping the optimizer focus on long-term trends.
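A minimal sketch of the Momentum update, mirroring the formulas above (the helper `momentum_step` and its defaults are illustrative):

```python
import numpy as np

# Illustrative helper mirroring the formulas above, not a library API.
def momentum_step(w, m, grad, lr=0.01, beta1=0.9):
    m = beta1 * m + (1 - beta1) * grad  # m_t = beta_1 * m_{t-1} + (1 - beta_1) * gradient
    w = w - lr * m                      # w_{t+1} = w_t - alpha * m_t
    return w, m

# Toy usage on f(w) = w[0]**2 + w[1]**2 (gradient 2 * w); the moving average starts at zero.
w, m = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(200):
    w, m = momentum_step(w, m, 2 * w, lr=0.1)
```

Note that some frameworks (for example PyTorch’s SGD with momentum) accumulate m = β_1 · m + ∇f(w_t) without the (1 − β_1) factor; that variant simply rescales the effective learning rate.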
RMSprop: Adaptive Learning Rates
While Momentum smooths the gradient, it still applies the same fixed learning rate to every parameter. RMSprop (Root Mean Square Propagation), on the other hand, adapts the learning rate for each parameter based on the magnitude of its recent gradients.
v_t = β_2 · v_{t−1} + (1 − β_2) · (∇f(w_t))^2
α_t = α / √(v_t + ε)
where
- v_t: the exponentially weighted moving average of squared gradients.
- β_2: smoothing coefficient for the squared gradients (typically 0.999).
- ε: a small constant that prevents division by zero.
and the weight update becomes
w_{t+1} = w_t − α_t · ∇f(w_t)
RMSprop adapts the learning rate for each parameter independently, effectively ‘slowing down’ updates for parameters with large gradients and ‘speeding up’ those with smaller ones.
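A minimal sketch of the RMSprop update under the same toy setup (again, `rmsprop_step` is an illustrative helper, and the defaults mirror the values quoted above):

```python
import numpy as np

# Illustrative helper mirroring the formulas above, not a library API.
def rmsprop_step(w, v, grad, lr=0.01, beta2=0.999, eps=1e-8):
    v = beta2 * v + (1 - beta2) * grad**2  # v_t: moving average of squared gradients
    w = w - lr / np.sqrt(v + eps) * grad   # per-parameter step alpha / sqrt(v_t + eps)
    return w, v

# Toy usage on f(w) = w[0]**2 + w[1]**2; the squared-gradient average starts at zero.
w, v = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(500):
    w, v = rmsprop_step(w, v, 2 * w)
```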
Adam: The Best of Both Worlds
Adam combines the advantages of Momentum and RMSprop by using both the momentum and adaptive-learning-rate concepts. One challenge in early training is that Momentum’s and RMSprop’s moving averages start biased toward zero, so updates in the first iterations can be too small and slow down learning. Adam addresses this with bias correction, which rescales the moving averages of gradients and squared gradients so they approximate their true values from the start.
Recall that
Momentum update:
m_t = β_1 · m_{t−1} + (1 − β_1) · ∇f(w_t)
RMSprop learning rate:
α_t = α / √(v_t + ε)
and the bias-corrected versions are
m̂_t = m_t / (1 − β_1^t)
α̂_t = α / √( v_t / (1 − β_2^t) + ε )
where
- (1 − β_1^t) and (1 − β_2^t) are the bias-correction terms. They rescale the moving averages so that they approximate their true expected values in early iterations; as t grows, β_1^t and β_2^t shrink toward zero, the terms approach 1, and the correction fades out.
and the weight update becomes
w_{t+1} = w_t − α̂_t · m̂_t
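Putting the pieces together, here is a minimal sketch of one Adam step that follows the formulas above, with ε inside the square root as written here (the original Adam paper places ε outside the square root; both variants appear in practice):

```python
import numpy as np

# Illustrative helper mirroring the formulas above, not a library API.
def adam_step(w, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad                   # first moment (Momentum part)
    v = beta2 * v + (1 - beta2) * grad**2                # second moment (RMSprop part)
    m_hat = m / (1 - beta1**t)                           # bias-corrected momentum
    alpha_hat = lr / np.sqrt(v / (1 - beta2**t) + eps)   # bias-corrected adaptive rate
    w = w - alpha_hat * m_hat                            # w_{t+1} = w_t - alpha_hat * m_hat
    return w, m, v

# Toy usage on f(w) = w[0]**2 + w[1]**2; t starts at 1 so the corrections are well defined.
w, m, v = np.array([1.0, -2.0]), np.zeros(2), np.zeros(2)
for t in range(1, 501):
    w, m, v = adam_step(w, m, v, 2 * w, t, lr=0.01)
```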
Adam combines speed (Momentum) with adaptability (RMSprop), making it effective across a wide range of tasks.
Choosing the Right Optimizer
Optimizers are the driving force behind efficient neural network training. From SGD’s simplicity to Adam’s adaptability, each tool has its place. By mastering these tools, you’ll not only optimize performance but also deepen your understanding of how neural networks learn.
For practical applications, start simple. SGD or Momentum might suffice for smaller models, while Adam shines in more complex tasks. The key is experimentation. Understanding these optimizers empowers you to fine-tune performance for your specific needs. Also, as the field evolves, keep an eye on new optimizers. Staying informed ensures you’re always equipped with the best tools for your problem.
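In practice you rarely hand-roll these updates; frameworks ship them ready to use. As a quick sketch with a placeholder model and dummy data, switching optimizers in PyTorch is a one-line change:

```python
import torch

model = torch.nn.Linear(10, 1)                  # placeholder model
x, y = torch.randn(32, 10), torch.randn(32, 1)  # dummy mini-batch

# Pick one; the hyperparameters shown mirror the typical values discussed above.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)                    # plain SGD
# optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)    # SGD + Momentum
# optimizer = torch.optim.RMSprop(model.parameters(), lr=0.001)             # RMSprop
# optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999), eps=1e-8)

loss = torch.nn.functional.mse_loss(model(x), y)
optimizer.zero_grad()   # clear gradients from the previous step
loss.backward()         # compute gradients for the current mini-batch
optimizer.step()        # apply the chosen update rule
```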