Neural Networks: From Fundamentals to Modern AI · Training in practice: optimizers and diagnostics
Gradient histograms, dead neurons and gradient clipping
Introduction
A gradient is not a single number: it is a tensor with thousands or millions of components. Aggregates such as the norm or the mean hide pathologies: dead neurons (gradient exactly 0 for 99% of the batch), exploding heads (a few neurons with gradients around 1e6 while the rest sit near 1e-3), and bias drift. This lesson teaches you to look at distributions instead. It covers:

(1) Per-layer gradient histograms in TensorBoard / W&B. In a healthy network the layers have similar, roughly N(0, σ_l) distributions, with σ_l varying smoothly from layer to layer.
(2) The "dead ReLU" phenomenon (Maas et al. 2013): a neuron whose pre-activation is always negative has a gradient that is always 0, so its weights are never updated and the neuron is lost permanently.
(3) Gradient clipping (Pascanu et al. 2013): clip_grad_norm_ rescales the gradients so that their global norm does not exceed a threshold (typically 1.0), which guards against exploding gradients in RNNs and transformers.
(4) Clipping versus gradient scaling.
(5) Per-parameter versus global clipping.

Finally, weight histograms: what to do when weights grow exponentially (no weight decay) or collapse to zero (overly aggressive weight decay).
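To make the point about aggregates concrete, here is a minimal sketch in plain Python (no TensorBoard or W&B involved). The helper `grad_summary` and the two gradient vectors are hypothetical and the numbers are made up; the only claim is that two layers can share the same L2 norm while having very different distributions.

```python
import math
from collections import Counter

def grad_summary(grads):
    """Summarise one layer's gradient components (a flat list of floats)."""
    n = len(grads)
    l2_norm = math.sqrt(sum(g * g for g in grads))
    zero_frac = sum(1 for g in grads if g == 0.0) / n
    max_abs = max(abs(g) for g in grads)
    # Histogram over orders of magnitude:
    # bucket k counts components with 10**k <= |g| < 10**(k + 1).
    buckets = Counter(math.floor(math.log10(abs(g))) for g in grads if g != 0.0)
    return {
        "l2_norm": l2_norm,
        "zero_frac": zero_frac,
        "max_abs": max_abs,
        "log10_hist": dict(sorted(buckets.items())),
    }

# Two made-up layers whose L2 norms are both about 2.0.
healthy = [0.2] * 100          # 100 components of similar size
spiky = [0.0] * 99 + [2.0]     # one component carries the whole gradient

summary_healthy = grad_summary(healthy)
summary_spiky = grad_summary(spiky)

for name, summary in [("healthy", summary_healthy), ("spiky", summary_spiky)]:
    print(name, summary)
```

The norm alone cannot tell the two layers apart, whereas the zero fraction and the order-of-magnitude histogram separate them immediately. A histogram panel in a logging tool conveys the same information visually.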
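The dead ReLU mechanism can be traced by hand on a single neuron. The sketch below is an illustration under simplifying assumptions: one scalar weight, one bias, plain SGD without weight decay, and a fixed batch. The function `relu_neuron_grads` and all values are hypothetical.

```python
def relu_neuron_grads(w, b, xs, upstream):
    """Gradients of sum_i upstream[i] * relu(w * x_i + b) with respect to w and b.

    Also returns the fraction of the batch on which the neuron is active.
    """
    grad_w = 0.0
    grad_b = 0.0
    active = 0
    for x, u in zip(xs, upstream):
        z = w * x + b          # pre-activation
        if z > 0:              # ReLU passes gradient only where z > 0
            grad_w += u * x
            grad_b += u
            active += 1
    return grad_w, grad_b, active / len(xs)

xs = [0.5, 1.0, 1.5, 2.0]          # made-up inputs
upstream = [1.0, -0.5, 0.3, 0.8]   # made-up gradients arriving from later layers

# A large negative bias pushes every pre-activation below zero.
w, b, lr = 0.4, -5.0, 0.1
for _ in range(3):
    grad_w, grad_b, active_frac = relu_neuron_grads(w, b, xs, upstream)
    w -= lr * grad_w           # gradient is 0, so nothing changes
    b -= lr * grad_b

print(w, b, active_frac)
```

Because every pre-activation is negative, both gradients are exactly zero and the SGD steps leave `w` and `b` where they started. Tracking the active fraction per neuron over many batches is one simple way to spot such units; a value stuck at 0 is the signature described above.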
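The difference between global and per-parameter clipping is easiest to see in the arithmetic. The following is a plain-Python sketch, not a library implementation: `clip_global_norm` and `clip_per_parameter` are hypothetical helpers, the parameter names and values are made up, and the small `eps` in the denominator is an assumption added to avoid division by zero. The `clip_grad_norm_` call named in the lesson applies one shared factor to all gradients, as the first helper does.

```python
import math

def clip_global_norm(grads, max_norm, eps=1e-6):
    """Rescale all gradients by one shared factor so their joint L2 norm is <= max_norm.

    `grads` maps a parameter name to a flat list of gradient components.
    Returns the clipped gradients and the norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for gs in grads.values() for g in gs))
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = {name: [g * scale for g in gs] for name, gs in grads.items()}
    return clipped, total_norm

def clip_per_parameter(grads, max_norm, eps=1e-6):
    """Clip each parameter's gradient norm separately."""
    clipped = {}
    for name, gs in grads.items():
        norm = math.sqrt(sum(g * g for g in gs))
        scale = min(1.0, max_norm / (norm + eps))
        clipped[name] = [g * scale for g in gs]
    return clipped

# Made-up gradients: layer1 has norm 5.0, layer2 has norm 0.5.
grads = {"layer1.weight": [3.0, 4.0], "layer2.weight": [0.3, 0.4]}

global_clipped, norm_before = clip_global_norm(grads, max_norm=1.0)
per_param_clipped = clip_per_parameter(grads, max_norm=1.0)

print(norm_before)
print(global_clipped)
print(per_param_clipped)
```

Global clipping shrinks both layers by the same factor, so the update direction is preserved and layer1 stays ten times larger than layer2. Per-parameter clipping shrinks only layer1 and leaves layer2 untouched, which changes the relative size of the two layers' updates and therefore the direction of the overall step.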