Neural Networks: From Fundamentals to Modern AI · Training in practice: optimizers and diagnostics

Bias-variance tradeoff and diagnosing underfitting vs overfitting

Introduction

The generalization gap (val_loss − train_loss) is a central concept in statistical learning. The classical decomposition of expected prediction error under squared loss (Geman et al. 1992) is: MSE = bias² + variance + irreducible noise. High bias means systematic error: the model is "too simple" and underfits. High variance means oversensitivity to the particular training sample: the model is "too complex" and overfits.

The lesson covers:
(1) the formal bias-variance decomposition for regression and classification;
(2) diagnostics from learning curves: how to distinguish underfitting from overfitting by curve shape;
(3) classical bias-reduction methods (larger model, more features, more capacity) and variance-reduction methods (regularization, augmentation, more data, ensembling);
(4) double descent (Belkin et al. 2019): the paradox that very large models return to good generalization despite fitting the training set perfectly;
(5) implications for the LLM era: the overparameterized regime, the lottery ticket hypothesis, and benign overfitting.

The lesson finishes with a decision toolkit: when you see a 10 pp gap between training and test performance, which concrete pipeline changes to apply in each of the two regimes.
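To make the decomposition above concrete, here is a minimal Monte Carlo sketch that estimates bias² and variance for polynomial regression. Everything specific in it is an assumption made for illustration, not part of the lesson: the ground-truth function `true_fn`, the noise level `noise_std`, the fixed training inputs (only the label noise is resampled), and the helper name `bias_variance`.

```python
import numpy as np


def true_fn(x):
    # Hypothetical ground-truth function, chosen only for illustration.
    return np.sin(2 * np.pi * x)


def bias_variance(degree, n_train=30, n_repeats=200, noise_std=0.3, seed=0):
    """Estimate bias^2 and variance of a degree-`degree` polynomial fit.

    Assumes a fixed design: the training inputs stay the same and only
    the label noise is redrawn for each simulated training set.
    """
    rng = np.random.default_rng(seed)
    x_train = np.linspace(0.0, 1.0, n_train)
    x_test = np.linspace(0.0, 1.0, 50)
    f_test = true_fn(x_test)

    preds = np.empty((n_repeats, x_test.size))
    for i in range(n_repeats):
        # New noisy labels simulate drawing a fresh training set.
        y_train = true_fn(x_train) + rng.normal(0.0, noise_std, n_train)
        coefs = np.polyfit(x_train, y_train, degree)
        preds[i] = np.polyval(coefs, x_test)

    mean_pred = preds.mean(axis=0)
    # Bias^2: squared gap between the average prediction and the truth.
    bias_sq = np.mean((mean_pred - f_test) ** 2)
    # Variance: spread of predictions across training sets.
    variance = np.mean(preds.var(axis=0))
    # Error against the noiseless target; equals bias_sq + variance.
    # Adding noise_std**2 gives the expected error against noisy labels.
    mse = np.mean((preds - f_test) ** 2)
    return bias_sq, variance, mse


for degree in (1, 3, 9):
    b, v, m = bias_variance(degree)
    print(f"degree={degree}  bias^2={b:.4f}  variance={v:.4f}  mse={m:.4f}")
```

Under these assumptions, the low-degree fit should show the larger bias² term (underfitting) and the high-degree fit the larger variance term (overfitting), while bias² + variance matches the measured error against the noiseless target.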