In deep learning, optimizers are the unsung heroes that guide models toward better performance. Whether you’re training a small neural network or a massive transformer, selecting the right optimizer is crucial. This post explores four widely used optimizers (SGD, Momentum, RMSprop, and Adam) and how each adapts parameter updates for effective learning.
Stochastic Gradient Descent (SGD)
Stochastic Gradient Descent, or SGD for short, is the most basic optimizer; think of it as the vanilla optimizer. It updates parameters based on the gradient of the loss function computed on each mini-batch.
w_{t+1} = w_t − α · ∇f(w_t)
Where:
- α (alpha) is the learning rate.
- ∇f(w_t) is the gradient of the loss at step t.
These two terms together form the step_size.
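To make the update concrete, here is a minimal sketch of one SGD step in NumPy (the helper `sgd_step` and the toy objective are illustrative, not taken from any library):

```python
import numpy as np

# Illustrative helper, not a library API: one plain SGD step.
def sgd_step(w, grad, lr=0.01):
    # w_{t+1} = w_t - alpha * gradient
    return w - lr * grad

# Toy usage: minimize f(w) = w[0]**2 + w[1]**2, whose gradient is 2 * w.
w = np.array([1.0, -2.0])
for _ in range(100):
    w = sgd_step(w, 2 * w, lr=0.1)
print(w)  # very close to [0, 0]
```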
While simple and effective, SGD often struggles in complex loss landscapes: it oscillates in directions with steep gradients and converges slowly in flat regions.
To address these challenges, other optimizers tweak the step_size components, either α, ∇f(w_t), or both, each according to its own approach to the problem.
Momentum: Smoothing the Ride
SGD often struggles in complex loss landscapes, oscillating in steep directions and crawling through flat regions. Momentum addresses this by smoothing updates, much like adding inertia to a moving object.
Instead of directly using the gradient at each step, Momentum builds an exponentially weighted moving average of past gradients:
m_t = β_1 · m_{t−1} + (1 − β_1) · ∇f(w_t)
where
- β_1: momentum coefficient (typically 0.9).
and the weight update becomes
w_{t+1} = w_t − α · m_t
By relying on past gradients, Momentum reduces oscillations and accelerates convergence in valleys, helping the optimizer focus on long-term trends.
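A minimal sketch of the Momentum update, mirroring the formulas above (the helper `momentum_step` and its defaults are illustrative):

```python
import numpy as np

# Illustrative helper mirroring the formulas above, not a library API.
def momentum_step(w, m, grad, lr=0.01, beta1=0.9):
    m = beta1 * m + (1 - beta1) * grad  # m_t = beta_1 * m_{t-1} + (1 - beta_1) * gradient
    w = w - lr * m                      # w_{t+1} = w_t - alpha * m_t
    return w, m

# Toy usage on f(w) = w[0]**2 + w[1]**2 (gradient 2 * w); the moving average starts at zero.
w, m = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(200):
    w, m = momentum_step(w, m, 2 * w, lr=0.1)
```

Note that some frameworks (for example PyTorch’s SGD with momentum) accumulate m = β_1 · m + ∇f(w_t) without the (1 − β_1) factor; that variant simply rescales the effective learning rate.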
RMSprop: Adaptive Learning Rates
While Momentum smooths the gradient, it still applies the same fixed learning rate to every parameter. RMSprop (Root Mean Square Propagation), on the other hand, adapts the learning rate for each parameter based on the magnitude of its recent gradients.
v_t = β_2 · v_{t−1} + (1 − β_2) · (∇f(w_t))^2
α_t = α / √(v_t + ε)
where
- v_t: the exponentially weighted moving average of squared gradients.
- β_2: smoothing coefficient for the squared gradients (typically 0.999).
- ε: a small constant that prevents division by zero.
and the weight update becomes
w_{t+1} = w_t − α_t · ∇f(w_t)
RMSprop adapts the learning rate for each parameter independently, effectively ‘slowing down’ updates for parameters with large gradients and ‘speeding up’ those with smaller ones.
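A minimal sketch of the RMSprop update under the same toy setup (again, `rmsprop_step` is an illustrative helper, and the defaults mirror the values quoted above):

```python
import numpy as np

# Illustrative helper mirroring the formulas above, not a library API.
def rmsprop_step(w, v, grad, lr=0.01, beta2=0.999, eps=1e-8):
    v = beta2 * v + (1 - beta2) * grad**2  # v_t: moving average of squared gradients
    w = w - lr / np.sqrt(v + eps) * grad   # per-parameter step alpha / sqrt(v_t + eps)
    return w, v

# Toy usage on f(w) = w[0]**2 + w[1]**2; the squared-gradient average starts at zero.
w, v = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(500):
    w, v = rmsprop_step(w, v, 2 * w)
```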
Adam: The Best of Both Worlds
Adam combines the advantages of Momentum and RMSprop by using both the momentum and adaptive-learning-rate concepts. One challenge in early training is that Momentum’s and RMSprop’s moving averages start biased toward zero, so updates in the first iterations can be too small and slow down learning. Adam addresses this with bias correction, which rescales the moving averages of gradients and squared gradients so they approximate their true values from the start.
Recall that
Momentum update:
m_t = β_1 · m_{t−1} + (1 − β_1) · ∇f(w_t)
RMSprop learning rate:
α_t = α / √(v_t + ε)
and the bias-corrected versions are
m̂_t = m_t / (1 − β_1^t)
α̂_t = α / √( v_t / (1 − β_2^t) + ε )
where
- (1 − β_1^t) and (1 − β_2^t) are the bias-correction terms. They rescale the moving averages so that they approximate their true expected values in early iterations; as t grows, β_1^t and β_2^t shrink toward zero, the terms approach 1, and the correction fades out.
and the weight update becomes
w_{t+1} = w_t − α̂_t · m̂_t
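Putting the pieces together, here is a minimal sketch of one Adam step that follows the formulas above, with ε inside the square root as written here (the original Adam paper places ε outside the square root; both variants appear in practice):

```python
import numpy as np

# Illustrative helper mirroring the formulas above, not a library API.
def adam_step(w, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad                   # first moment (Momentum part)
    v = beta2 * v + (1 - beta2) * grad**2                # second moment (RMSprop part)
    m_hat = m / (1 - beta1**t)                           # bias-corrected momentum
    alpha_hat = lr / np.sqrt(v / (1 - beta2**t) + eps)   # bias-corrected adaptive rate
    w = w - alpha_hat * m_hat                            # w_{t+1} = w_t - alpha_hat * m_hat
    return w, m, v

# Toy usage on f(w) = w[0]**2 + w[1]**2; t starts at 1 so the corrections are well defined.
w, m, v = np.array([1.0, -2.0]), np.zeros(2), np.zeros(2)
for t in range(1, 501):
    w, m, v = adam_step(w, m, v, 2 * w, t, lr=0.01)
```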
Adam combines speed (Momentum) with adaptability (RMSprop), making it effective across a wide range of tasks.
Choosing the Right Optimizer
Optimizers are the driving force behind efficient neural network training. From SGD’s simplicity to Adam’s adaptability, each tool has its place. By mastering these tools, you’ll not only optimize performance but also deepen your understanding of how neural networks learn.
For practical applications, start simple. SGD or Momentum might suffice for smaller models, while Adam shines in more complex tasks. The key is experimentation. Understanding these optimizers empowers you to fine-tune performance for your specific needs. Also, as the field evolves, keep an eye on new optimizers. Staying informed ensures you’re always equipped with the best tools for your problem.
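In practice you rarely hand-roll these updates; frameworks ship them ready to use. As a quick sketch with a placeholder model and dummy data, switching optimizers in PyTorch is a one-line change:

```python
import torch

model = torch.nn.Linear(10, 1)                  # placeholder model
x, y = torch.randn(32, 10), torch.randn(32, 1)  # dummy mini-batch

# Pick one; the hyperparameters shown mirror the typical values discussed above.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)                    # plain SGD
# optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)    # SGD + Momentum
# optimizer = torch.optim.RMSprop(model.parameters(), lr=0.001)             # RMSprop
# optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999), eps=1e-8)

loss = torch.nn.functional.mse_loss(model(x), y)
optimizer.zero_grad()   # clear gradients from the previous step
loss.backward()         # compute gradients for the current mini-batch
optimizer.step()        # apply the chosen update rule
```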