Neural Networks: From Fundamentals to Modern AI · Training in practice: optimizers and diagnostics

Systematic diagnostics: overfit single batch, init loss, learning curves

Introduction

A training run that fails to converge is not a signal to "try another optimizer". It is a signal to walk through a diagnostic checklist. In "A Recipe for Training Neural Networks" (2019), Andrej Karpathy lays out a procedure that narrows hundreds of possible failure modes down to a few distinguishable classes. This lesson covers five canonical tests:

(1) Sanity-check the loss at random initialization: compare it to the theoretical value for the task, e.g. log(K) for K-class classification with cross-entropy, or log(2) ≈ 0.693 for binary cross-entropy with balanced classes.
(2) Overfit a single batch: drive the training loss to (near) zero on 1–2 batches as evidence that the gradient pipeline works.
(3) Input-independent baseline: remove the input (e.g. zero out the pixels) and check that the model performs at chance level.
(4) Gradient check via finite differences: largely redundant in the autograd era, but still useful for custom layers.
(5) Interpreting the train-loss vs. val-loss curves.

The lesson also shows the common traps: unscaled inputs, unshuffled batches, label leakage into features, and gradients that are not zeroed between steps.
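Tests (1) through (4) can be sketched on a toy problem. The example below is a minimal illustration, not Karpathy's code: it assumes a tiny softmax linear classifier written in pure Python, a synthetic batch of 4 random examples with 8 features and 4 classes, and hypothetical helper names (`probs`, `mean_loss`, `grads`, `train`). The learning rate and step count are arbitrary choices for this toy setup. In a real project the same checks would run against the actual model and data loader.

```python
import math
import random

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def probs(W, b, x):
    logits = [sum(w * v for w, v in zip(row, x)) + bk for row, bk in zip(W, b)]
    return softmax(logits)

def mean_loss(W, b, X, y):
    # Mean cross-entropy over the batch.
    return sum(-math.log(probs(W, b, x)[t]) for x, t in zip(X, y)) / len(X)

def grads(W, b, X, y):
    # Analytic gradient of the mean cross-entropy: (p - onehot) * x / N.
    K, D, N = len(W), len(W[0]), len(X)
    gW = [[0.0] * D for _ in range(K)]
    gb = [0.0] * K
    for x, t in zip(X, y):
        p = probs(W, b, x)
        for k in range(K):
            delta = (p[k] - (1.0 if k == t else 0.0)) / N
            gb[k] += delta
            for j in range(D):
                gW[k][j] += delta * x[j]
    return gW, gb

def init_params(K, D, rng, scale=0.01):
    # Small random weights and zero biases, so initial predictions are near uniform.
    W = [[rng.gauss(0.0, scale) for _ in range(D)] for _ in range(K)]
    return W, [0.0] * K

def train(W, b, X, y, lr=0.1, steps=5000):
    # Plain gradient descent; gradients are recomputed from scratch every step,
    # so nothing accumulates between steps.
    for _ in range(steps):
        gW, gb = grads(W, b, X, y)
        for k in range(len(W)):
            b[k] -= lr * gb[k]
            for j in range(len(W[0])):
                W[k][j] -= lr * gW[k][j]
    return mean_loss(W, b, X, y)

rng = random.Random(0)
K, D = 4, 8
X = [[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(K)]  # one tiny batch
y = list(range(K))                                               # one example per class

# Test 1: loss at initialization should be close to log(K).
W, b = init_params(K, D, rng)
init_loss = mean_loss(W, b, X, y)
print("init loss:", init_loss, "expected about:", math.log(K))

# Test 4: central finite difference on a single weight vs. the analytic gradient.
eps = 1e-5
gW, _ = grads(W, b, X, y)
W[0][0] += eps
loss_up = mean_loss(W, b, X, y)
W[0][0] -= 2 * eps
loss_down = mean_loss(W, b, X, y)
W[0][0] += eps  # restore the original weight
grad_gap = abs((loss_up - loss_down) / (2 * eps) - gW[0][0])
print("gradient check gap:", grad_gap)

# Test 2: the model should be able to memorize this single batch.
overfit_loss = train(W, b, X, y)
print("loss after overfitting one batch:", overfit_loss)

# Test 3: with inputs zeroed out, the loss should stay at chance level.
W0, b0 = init_params(K, D, rng)
X_zero = [[0.0] * D for _ in X]
blind_loss = train(W0, b0, X_zero, y, steps=200)
print("loss with zeroed inputs:", blind_loss)
```

If the initial loss is far from log(K), the final layer's initialization or the loss reduction is the first suspect. If the single batch cannot be memorized, the bug is more likely in the gradient pipeline than in the choice of optimizer. If the zeroed-input run does as well as the real one, the model is probably not using its input at all.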