Neural Networks: From Fundamentals to Modern AI · Training in practice: optimizers and diagnostics

LR schedules: step decay, cosine annealing, warmup

Introduction

A constant learning rate over an entire training run is rarely optimal. Early on we want large steps (exploration, escaping saddle points), mid-training moderate ones (steady convergence), and at the end tiny ones (settling precisely into a minimum). That is why almost every modern pipeline trains with a learning rate schedule: a function η(t) of the training step t.

This lesson covers three main families: step decay (a classic of ResNet-style CNN training; Krizhevsky 2014), cosine annealing (Loshchilov & Hutter 2017, the usual choice for LLMs today), and warmup (a linear ramp from 0 up to the peak rate, practically indispensable for transformers and very large batches). It also covers cyclical LR (Smith 2017) and the one-cycle policy.

Finally, it explains why warmup rescues Adam training on transformers. In the first steps the second-moment estimate v̂ rests on only a handful of gradients, so √v̂ is unreliable and can be far too small, which makes the effective step huge. Warmup scales those early updates down linearly until Adam has accumulated a sensible second-moment estimate.
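To make the three families concrete, here is a minimal sketch of each schedule as a plain Python function of the step index. The function names, the parameters (`drop_every`, `gamma`, `min_lr`, `warmup_steps`), and the choice to follow warmup with cosine decay are illustrative assumptions, not something prescribed by the lesson; real frameworks ship their own scheduler classes with different interfaces.

```python
import math


def step_decay(step, base_lr, drop_every, gamma=0.1):
    """Step decay: multiply the rate by `gamma` once every `drop_every` steps."""
    return base_lr * gamma ** (step // drop_every)


def cosine_annealing(step, total_steps, base_lr, min_lr=0.0):
    """Cosine annealing: go from base_lr to min_lr along half a cosine period."""
    progress = min(step, total_steps) / total_steps  # clamp so the rate stays at min_lr
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def warmup_then_cosine(step, warmup_steps, total_steps, base_lr, min_lr=0.0):
    """Linear warmup from 0 to base_lr, then cosine annealing down to min_lr.

    Assumes 0 < warmup_steps < total_steps.
    """
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return cosine_annealing(
        step - warmup_steps, total_steps - warmup_steps, base_lr, min_lr
    )


if __name__ == "__main__":
    # Print a few sample points of each schedule (hypothetical hyperparameters).
    for t in (0, 5, 10, 30, 60, 100):
        print(
            t,
            step_decay(t, base_lr=0.1, drop_every=30),
            cosine_annealing(t, total_steps=100, base_lr=0.1),
            warmup_then_cosine(t, warmup_steps=10, total_steps=100, base_lr=0.1),
        )
```

Writing η(t) as a pure function of the step keeps the schedule independent of the optimizer: at every iteration the training loop calls the function and writes the result into the optimizer's learning rate before taking the update.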