Neural Networks: From Fundamentals to Modern AI · Training in practice: optimizers and diagnostics

Gradient descent geometrically: loss surface, learning rate and mini-batch SGD

Introduction

Training a network is a walk over a high-dimensional loss surface in parameter space. The gradient ∇L(θ) points in the direction of steepest ascent at the current point, so the update θ ← θ − η·∇L(θ) takes a step in the direction of steepest descent. All the magic (and all the trouble) of this procedure hides in three things: the shape of the loss surface (ravines, plateaus, saddles), the value of the learning rate η, and how we estimate the gradient (on the full dataset, on a mini-batch, or on a single example). This lesson presents gradient descent as a first-order algorithm that is optimal only locally, in the sense that it follows the best linear approximation of the loss around the current point. It explains why mini-batch SGD is the standard: it trades off memory use, gradient noise and computational efficiency. It also explains why the non-convex loss surface of deep models is not as catastrophic as intuition suggests: in high-dimensional parameter spaces, saddle points are expected to outnumber true local minima exponentially, and even when training does settle in a local minimum, that minimum is usually good enough in practice.
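The update rule and the three ways of estimating the gradient can be made concrete with a small sketch. It is only an illustration under stated assumptions: the model is a toy one-dimensional linear regression with a squared-error loss rather than a neural network, and the helper names (`make_data`, `loss`, `grad`, `sgd`) as well as every hyperparameter value are hypothetical choices made for this example, not part of the lesson.

```python
import random


def make_data(n=200, w_true=3.0, b_true=-1.0, noise=0.1, seed=0):
    """Toy dataset: y = w_true * x + b_true + Gaussian noise (illustrative only)."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [w_true * x + b_true + rng.gauss(0.0, noise) for x in xs]
    return xs, ys


def loss(theta, xs, ys):
    """Half mean squared error L(theta) for the model y_hat = w * x + b."""
    w, b = theta
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / (2 * len(xs))


def grad(theta, xs, ys):
    """Gradient of the loss above, averaged over the given examples."""
    w, b = theta
    n = len(xs)
    residuals = [w * x + b - y for x, y in zip(xs, ys)]
    grad_w = sum(r * x for r, x in zip(residuals, xs)) / n
    grad_b = sum(residuals) / n
    return grad_w, grad_b


def sgd(xs, ys, lr, batch_size, epochs=50, seed=1):
    """Run theta <- theta - lr * grad, estimating grad on batches of the given size."""
    rng = random.Random(seed)
    theta = (0.0, 0.0)
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # new random partition into batches each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            bx = [xs[i] for i in batch]
            by = [ys[i] for i in batch]
            g_w, g_b = grad(theta, bx, by)
            theta = (theta[0] - lr * g_w, theta[1] - lr * g_b)
    return theta


xs, ys = make_data()
theta_full = sgd(xs, ys, lr=0.1, batch_size=len(xs))      # exact gradient, 1 update per epoch
theta_mini = sgd(xs, ys, lr=0.1, batch_size=16)           # noisy gradient, 13 updates per epoch
theta_single = sgd(xs, ys, lr=0.1, batch_size=1)          # noisiest gradient, 200 updates per epoch
theta_diverged = sgd(xs, ys, lr=2.5, batch_size=len(xs))  # step too large for this curvature

for name, theta in [("full batch", theta_full), ("mini-batch", theta_mini),
                    ("single example", theta_single), ("lr too large", theta_diverged)]:
    print(f"{name:>15}: theta = ({theta[0]:.3g}, {theta[1]:.3g}), loss = {loss(theta, xs, ys):.3g}")
```

The only thing that changes between the first three runs is `batch_size`, that is, how many examples feed each gradient estimate. For the same number of passes over the data, the full-batch run makes far fewer updates than the mini-batch run, which is one side of the tradeoff described above. The last run keeps the exact gradient but uses a learning rate that is too large for the curvature of this particular loss, so each step overshoots and the loss grows instead of shrinking.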