Neural Networks: From Fundamentals to Modern AI · Training in practice: optimizers and diagnostics

Momentum and Adam: adaptive learning rates and when to use them

Introduction

Plain SGD has two well-known problems: it oscillates across narrow ravines, where the Hessian has a high condition number, and it stalls on plateaus, where a small gradient produces a small step. Two largely orthogonal ideas address them. Momentum (Polyak 1964, Nesterov 1983) accumulates a velocity along directions where successive gradients agree, which damps the oscillation across a ravine and speeds up progress along it. Adaptive per-parameter learning rates (AdaGrad 2011, RMSprop 2012, Adam 2014) scale each parameter's step by a statistic of its past squared gradients. Adam combines both ideas and is the common default today for NLP, transformers, and RL. SGD with momentum is still often preferred for ResNets and image classification, where empirically it tends to generalize better, by roughly 1 percentage point of test accuracy. This lesson derives the Adam update equations, shows why bias correction of the first and second moment estimates matters, covers AdamW as a fix for how Adam applies weight decay (Loshchilov & Hutter 2019), and gives practical intuition for when to pick which optimizer. The central question is why Adam wins for transformers but loses to SGD with momentum for ResNets on ImageNet.
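
As a preview of the update rules this lesson works toward, here is a minimal pure-Python sketch of one momentum step and one Adam step on plain lists of floats. The function names (sgd_momentum_step, adam_step) and the default hyperparameters are illustrative assumptions, not any library's API. The weight_decay argument shows the decoupled AdamW-style decay in simplified form, without a learning-rate schedule.

```python
import math


def sgd_momentum_step(w, g, v, lr=0.01, mu=0.9):
    """One heavy-ball momentum step (hypothetical helper).

    w, g, v are equal-length lists: parameters, gradients, velocities.
    """
    # Velocity accumulates gradients; directions that agree build up speed.
    v = [mu * vi + gi for vi, gi in zip(v, g)]
    w = [wi - lr * vi for wi, vi in zip(w, v)]
    return w, v


def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0):
    """One Adam step (hypothetical helper); t is the 1-based step count.

    m and v are running averages of the gradient and the squared gradient.
    With weight_decay > 0 the decay is added to the step directly rather
    than to the gradient, which is the decoupled (AdamW-style) form.
    """
    new_w, new_m, new_v = [], [], []
    for wi, gi, mi, vi in zip(w, g, m, v):
        mi = beta1 * mi + (1 - beta1) * gi        # first moment estimate
        vi = beta2 * vi + (1 - beta2) * gi * gi   # second moment estimate
        # Both averages start at zero, so early values are biased toward
        # zero; dividing by (1 - beta^t) undoes that bias.
        m_hat = mi / (1 - beta1 ** t)
        v_hat = vi / (1 - beta2 ** t)
        step = m_hat / (math.sqrt(v_hat) + eps) + weight_decay * wi
        new_w.append(wi - lr * step)
        new_m.append(mi)
        new_v.append(vi)
    return new_w, new_m, new_v


if __name__ == "__main__":
    # Two parameters whose gradients differ by a factor of 200.
    w, m, v = adam_step([1.0, -2.0], [0.5, -100.0], [0.0, 0.0], [0.0, 0.0], t=1)
    print(w)
```

On the first step the bias-corrected estimates reduce to m_hat = g and v_hat = g^2, so each parameter moves by approximately lr in the direction opposite its gradient's sign, regardless of the gradient's scale. Without the correction, the same step would be scaled by (1 - beta1) / sqrt(1 - beta2), which is about 3.16 for the defaults used here.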