Machine Learning · Regression

Gradient descent

Introduction

Gradient descent (GD) underlies most of the optimization we do in ML, from linear regression to LLMs. In this lesson we break it down: the update rule w_{t+1} = w_t − η·∇L(w_t), why we step "opposite to the gradient" (Cauchy 1847), how to pick the learning rate η (too small → slow convergence, too large → divergence), the differences between batch GD, SGD, and mini-batch GD, why SGD noise helps in non-convex optimization, how the condition number of the Hessian governs convergence speed, and modern improvements (momentum, AdaGrad, Adam). Finally, we return to linear regression, a special case in which the minimizer that GD approaches iteratively is also available in closed form.
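The update rule and the effect of η can be sketched on a small least-squares problem. This is a minimal illustration, not the lesson's own code: the function name `gradient_descent`, the synthetic data, and the step sizes 0.1 and 3.0 are assumptions chosen for the example. It uses NumPy and compares the GD result with the least-squares solution returned by `np.linalg.lstsq`.

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, steps=500):
    """Batch GD on the loss L(w) = ||Xw - y||^2 / (2n). Hypothetical helper."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / n  # gradient of L at the current w
        w = w - lr * grad             # w_{t+1} = w_t - eta * grad L(w_t)
    return w

# Synthetic data (assumed for illustration): y = 2 - 3x + small noise.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = X @ np.array([2.0, -3.0]) + 0.1 * rng.normal(size=200)

# A moderate learning rate converges to the least-squares solution.
w_gd = gradient_descent(X, y, lr=0.1, steps=500)
w_closed, *_ = np.linalg.lstsq(X, y, rcond=None)

# A learning rate that is too large makes the iterates grow instead of shrink.
w_diverged = gradient_descent(X, y, lr=3.0, steps=50)

print("GD:         ", w_gd)
print("closed form:", w_closed)
print("norm with lr=3.0:", np.linalg.norm(w_diverged))
```

On this well-conditioned problem the two estimates should agree to many decimal places, while the run with the larger step size moves away from the minimizer. The threshold at which divergence starts depends on the largest eigenvalue of the Hessian XᵀX/n, so the value 3.0 is specific to this data.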