Neural Networks: From Fundamentals to Modern AI · Regularization — how to avoid overfitting

Weight decay and L2 regularization — penalizing large weights

Introduction

Weight decay is one of the oldest and most widely used forms of regularization in deep learning. The idea is simple: add a term to the loss function that penalizes large weights, most often the L2 penalty (the sum of squared weights) scaled by a coefficient λ. The consequences are twofold. Mathematically, the penalty pulls the weights toward zero, shifting the bias-variance tradeoff toward bias; practically, it tends to improve generalization and optimization stability. The devil is in the details: "weight decay = L2" holds exactly only for pure SGD, and even there only up to a rescaling of λ by the learning rate. In Adam and other adaptive optimizers, the gradient of the L2 penalty is rescaled by the per-parameter adaptive step size along with the loss gradient, so weights with large gradients are regularized less than intended. This is why AdamW, which applies the decay separately from the gradient update, was introduced (Loshchilov & Hutter 2017, "Decoupled Weight Decay Regularization").

This lesson goes through the geometry of L2 (pulling weights toward zero, a soft constraint), the comparison with L1 (sparsity), the interaction with normalization, and typical λ scales in real pipelines (around 10⁻⁴ in computer vision, up to 10⁻¹ in some NLP transformers).
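The claim that "weight decay = L2" holds for pure SGD but not for Adam can be checked on a single scalar weight. The sketch below is an illustration only, not code from the lesson or from any library: the function names are hypothetical, the L2 penalty is assumed to be (λ/2)·w² so that its gradient is λ·w, and the Adam case is restricted to the very first step, where bias correction gives m̂ = g and v̂ = g².

```python
import math


def sgd_l2_step(w, grad, lr, lam):
    """SGD on loss + (lam / 2) * w**2: the penalty adds lam * w to the gradient."""
    return w - lr * (grad + lam * w)


def sgd_decay_step(w, grad, lr, lam):
    """SGD with explicit weight decay: shrink w, then apply the plain gradient."""
    return (1.0 - lr * lam) * w - lr * grad


def adam_first_step(w, grad, lr, lam, decoupled, eps=1e-8):
    """First Adam step for one scalar weight (hypothetical helper).

    After bias correction on step 1, m_hat = g and v_hat = g**2,
    so the update is g / (|g| + eps), which is roughly sign(g).
    """
    # Coupled L2: the penalty gradient is folded into g and then normalized.
    g = grad if decoupled else grad + lam * w
    w_new = w - lr * g / (math.sqrt(g * g) + eps)
    # Decoupled decay (AdamW style): shrink the weight outside the adaptive update.
    if decoupled:
        w_new -= lr * lam * w
    return w_new


w, grad, lr, lam = 2.0, 0.5, 0.1, 0.01

sgd_l2 = sgd_l2_step(w, grad, lr, lam)
sgd_wd = sgd_decay_step(w, grad, lr, lam)
adam_l2 = adam_first_step(w, grad, lr, lam, decoupled=False)
adam_wd = adam_first_step(w, grad, lr, lam, decoupled=True)
adam_plain = adam_first_step(w, grad, lr, 0.0, decoupled=False)

print("SGD  L2 vs decay:", sgd_l2, sgd_wd)
print("Adam L2 vs decoupled vs no penalty:", adam_l2, adam_wd, adam_plain)
```

With these values the two SGD variants agree up to floating-point rounding. In the Adam case the coupled L2 step lands at approximately 1.9, essentially where the step with no penalty at all lands, because the normalization absorbs the penalty term. The decoupled step lands at approximately 1.898, shrinking the weight by the additional lr·λ·w.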