Optimizers in Neural Networks: Optimize Parameter Updates

In deep learning, optimizers are the unsung heroes that guide models toward better performance. Whether you’re training a small neural network or a massive transformer, selecting the right optimizer is crucial. This post explores four widely used optimizers (SGD, Momentum, RMSprop, and Adam) and how each one adapts parameter updates for effective learning.

Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent, or SGD for short, is the most basic optimizer; think of it as the vanilla baseline. It updates parameters using the gradient of the loss function computed on each mini-batch:

w_{t+1} = w_t - \alpha \cdot \nabla f(w_t)

Where:

  • α (alpha) is the learning rate.
  • ∇f(w_t) is the gradient of the loss at step t.

Together, these two terms determine the step size of each update.

While simple and effective, SGD often struggles in complex loss landscapes: it oscillates in directions with steep gradients and converges slowly in flat regions.

To address these challenges, the optimizers below adjust one or both components of the step size, the learning rate α and the gradient term ∇f(w_t), each according to its own approach.
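
To make this concrete, here is a minimal NumPy sketch of the SGD update applied to a toy quadratic loss. The function name sgd_step and the toy loss are illustrative choices, not taken from any particular library.

```python
import numpy as np

def sgd_step(w, grad, lr=0.01):
    """Vanilla SGD: w_{t+1} = w_t - lr * grad."""
    return w - lr * grad

# Toy example: minimize f(w) = 0.5 * ||w||^2, whose gradient at w is simply w.
w = np.array([1.0, -2.0])
for _ in range(100):
    grad = w                       # gradient of the toy loss at the current w
    w = sgd_step(w, grad, lr=0.1)
print(w)  # approaches [0, 0]
```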

Momentum: Smoothing the Ride

SGD often struggles in complex loss landscapes, oscillating in steep directions and crawling through flat regions. Momentum addresses this by smoothing updates, much like adding inertia to a moving object.

Instead of directly using the gradient at each step, Momentum builds an exponentially weighted moving average of past gradients:

m_t = \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot \nabla f(w_t)

where

  • m_t: the exponentially weighted average of past gradients (with m_0 = 0).
  • β_1: the momentum coefficient (typically 0.9).

and the parameter update becomes

w_{t+1} = w_t - \alpha \cdot m_t

By relying on past gradients, Momentum reduces oscillations and accelerates convergence in valleys, helping the optimizer focus on long-term trends.
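
Below is a minimal NumPy sketch of the Momentum update exactly as written above (an exponentially weighted average of gradients, started from m_0 = 0). The name momentum_step and the toy loss are illustrative.

```python
import numpy as np

def momentum_step(w, grad, m, lr=0.01, beta1=0.9):
    """Momentum: m_t = beta1 * m_{t-1} + (1 - beta1) * grad, then w_{t+1} = w_t - lr * m_t."""
    m = beta1 * m + (1 - beta1) * grad   # exponentially weighted average of gradients
    return w - lr * m, m

# Toy quadratic loss f(w) = 0.5 * ||w||^2, whose gradient at w is simply w.
w, m = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(200):
    w, m = momentum_step(w, w, m, lr=0.1)
print(w)  # approaches [0, 0] with smoother steps than plain SGD
```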

RMSprop: Adaptive Learning Rates

While Momentum smooths the gradients used in each update, it still applies one fixed learning rate to every parameter. RMSprop (Root Mean Square Propagation), on the other hand, adapts the learning rate of each parameter based on the magnitude of its recent gradients:

v_t = \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot (\nabla f(w_t))^2

\alpha_t = \frac{\alpha}{\sqrt{v_t + \epsilon}}

where

  • v_t: the exponentially weighted average of squared gradients.
  • β_2: the smoothing coefficient for squared gradients (typically 0.999).
  • ϵ: epsilon, a small constant that prevents division by zero.
  • α_t: the resulting per-parameter learning rate at step t.

and the parameter update becomes

w_{t+1} = w_t - \alpha_t \cdot \nabla f(w_t)

RMSprop adapts the learning rate for each parameter independently, effectively ‘slowing down’ updates for parameters with large gradients and ‘speeding up’ those with smaller ones.
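
The per-parameter scaling is easiest to see in code. Here is a minimal NumPy sketch of the RMSprop update as written above, keeping ϵ inside the square root to match the formula; the name rmsprop_step and the toy loss are illustrative.

```python
import numpy as np

def rmsprop_step(w, grad, v, lr=0.01, beta2=0.999, eps=1e-8):
    """RMSprop: v_t = beta2 * v_{t-1} + (1 - beta2) * grad**2, then scale the step by lr / sqrt(v_t + eps)."""
    v = beta2 * v + (1 - beta2) * grad ** 2   # running average of squared gradients (element-wise)
    w = w - lr / np.sqrt(v + eps) * grad      # larger recent gradients -> smaller effective step
    return w, v

# Toy quadratic loss f(w) = 0.5 * ||w||^2, whose gradient at w is simply w.
w, v = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(100):
    w, v = rmsprop_step(w, w, v, lr=0.01)
print(w)  # approaches [0, 0]
```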

Adam: The Best of Both Worlds

Adam combines the advantages of Momentum and RMSprop by using both a momentum term and adaptive per-parameter learning rates. One challenge early in training is that the moving averages m_t and v_t are initialized at zero, so they start out biased toward zero. Adam addresses this with bias correction, which rescales the averages so they approximate their true expected values from the start; without it, early updates would be too small and learning would be slow. For example, with β_1 = 0.9 the first momentum value is m_1 = (1 − β_1) · ∇f(w_1), only a tenth of the actual gradient, and dividing by (1 − β_1^1) = 0.1 restores the correct scale.

Recall that

Momentum update:

m_t = \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot \nabla f(w_t)

RMSprop update:

\alpha_t = \frac{\alpha}{\sqrt{v_t + \epsilon}}

and the bias-corrected versions are

\hat{m}_t = \frac{m_t}{1 - \beta_1^t}

\hat{\alpha}_t = \frac{\alpha}{\sqrt{\dfrac{v_t}{1 - \beta_2^t} + \epsilon}}

where

  • (1 − β_1^t) and (1 − β_2^t) are the bias-correction terms. They rescale the moving averages so that they approximate their true expected values in early iterations; as t increases, β_1^t and β_2^t shrink toward zero and the correction fades away.

and the parameter update becomes

w_{t+1} = w_t - \hat{\alpha}_t \cdot \hat{m}_t

Adam combines the speed of Momentum with the adaptability of RMSprop, making it effective across a wide range of tasks.
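
Putting the pieces together, here is a minimal NumPy sketch of a full Adam step with bias correction, following the equations above (including this post’s convention of keeping ϵ inside the square root); the name adam_step and the toy loss are illustrative.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: momentum + RMSprop-style scaling, both bias-corrected."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (momentum term)
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment (squared gradients)
    m_hat = m / (1 - beta1 ** t)                # bias-corrected momentum
    v_hat = v / (1 - beta2 ** t)                # bias-corrected squared-gradient average
    w = w - lr / np.sqrt(v_hat + eps) * m_hat   # adaptive, bias-corrected update
    return w, m, v

# Toy quadratic loss f(w) = 0.5 * ||w||^2, whose gradient at w is simply w.
w, m, v = np.array([1.0, -2.0]), np.zeros(2), np.zeros(2)
for t in range(1, 1001):                        # t starts at 1 so the corrections are well-defined
    w, m, v = adam_step(w, w, m, v, t, lr=0.05)
print(w)  # ends up close to the minimum at [0, 0]
```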

Choosing the Right Optimizer

Optimizers are the driving force behind efficient neural network training. From SGD’s simplicity to Adam’s adaptability, each tool has its place. By mastering these tools, you’ll not only optimize performance but also deepen your understanding of how neural networks learn.

For practical applications, start simple. SGD or Momentum might suffice for smaller models, while Adam shines in more complex tasks. The key is experimentation. Understanding these optimizers empowers you to fine-tune performance for your specific needs. Also, as the field evolves, keep an eye on new optimizers. Staying informed ensures you’re always equipped with the best tools for your problem.
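
In practice you rarely hand-roll these updates; frameworks ship them ready to use. The sketch below assumes PyTorch and a placeholder model, and shows how switching optimizers is a one-line change. Note that torch.optim.SGD implements momentum in a slightly different form than the moving-average version described above.

```python
import torch

model = torch.nn.Linear(10, 1)  # stand-in for your actual model

# Pick one; the hyperparameters shown are common defaults, not universal recommendations.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)                          # plain SGD
# optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)          # SGD with Momentum
# optimizer = torch.optim.RMSprop(model.parameters(), lr=0.001, alpha=0.99)       # RMSprop
# optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))  # Adam

# One training step on dummy data.
x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = torch.nn.functional.mse_loss(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```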