Neural Networks: From Fundamentals to Modern AI · Backpropagation: How a Network Learns

Weight initialization: Xavier and He — how the start decides gradient flow

Introduction

Before batch normalization and ResNets, training a deep network was a balancing act for one simple reason: with badly scaled initial weights, activations either explode or vanish layer by layer in the forward pass, and gradients do the same on the way back. Glorot and Bengio (2010) showed that for symmetric activations such as tanh, weights should have variance 2/(fan_in + fan_out); this is "Xavier init". He et al. (2015) adapted the formula for ReLU to 2/fan_in: ReLU zeroes out roughly half of the pre-activations, so the weight variance must be twice as large to preserve the signal scale.

This lesson covers the derivation: why each component of Wx has variance fan_in * Var(W) * Var(x) (assuming independent, zero-mean weights and inputs) and how this product compounds across layers; how Karpathy in "makemore part 3" diagnoses dead ReLU via activation histograms; what exactly torch.nn.init.kaiming_normal_ and torch.nn.init.kaiming_uniform_ do; how init interacts with normalization (BatchNorm, LayerNorm); and why, for transformers, the residual-stream "init scale" is what matters most today.

We also show why zero init does not work (it fails to break the symmetry between neurons), why too-large init saturates tanh and sigmoid, and why too-small init leads to total gradient collapse in deep networks.
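The variance argument can be checked numerically before working through the derivation. The following is a minimal NumPy sketch, not the lesson's own code: it pushes unit-variance random inputs through a stack of square, bias-free layers and records the root-mean-square (RMS) activation after each layer. The names init_std and layer_rms, the depth, width and batch size, and the "small" (0.01) and "large" (1.0) standard deviations are illustrative assumptions, and weights are drawn from a normal distribution for every scheme.

```python
import numpy as np

def init_std(scheme, fan_in, fan_out):
    """Standard deviation of the weight distribution for a named scheme."""
    if scheme == "xavier":   # Glorot and Bengio (2010): Var(W) = 2 / (fan_in + fan_out)
        return np.sqrt(2.0 / (fan_in + fan_out))
    if scheme == "he":       # He et al. (2015): Var(W) = 2 / fan_in
        return np.sqrt(2.0 / fan_in)
    if scheme == "small":    # illustrative "too small" value (assumption)
        return 0.01
    if scheme == "large":    # illustrative "too large" value (assumption)
        return 1.0
    raise ValueError(f"unknown scheme: {scheme}")

ACTIVATIONS = {
    "tanh": np.tanh,
    "relu": lambda z: np.maximum(z, 0.0),
}

def layer_rms(scheme, activation, depth=20, width=512, batch=256, seed=0):
    """Forward random inputs through `depth` square layers without biases.

    Returns (rms_per_layer, final_activations).
    """
    rng = np.random.default_rng(seed)
    act = ACTIVATIONS[activation]
    std = init_std(scheme, width, width)
    x = rng.standard_normal((batch, width))        # unit-variance inputs
    rms = []
    for _ in range(depth):
        w = rng.standard_normal((width, width)) * std
        x = act(x @ w)                             # one layer: linear map, then nonlinearity
        rms.append(float(np.sqrt(np.mean(x ** 2))))
    return rms, x

if __name__ == "__main__":
    configs = [("he", "relu"), ("xavier", "relu"), ("xavier", "tanh"),
               ("small", "tanh"), ("large", "tanh")]
    for scheme, activation in configs:
        rms, final = layer_rms(scheme, activation)
        saturated = float(np.mean(np.abs(final) > 0.99))
        print(f"{scheme:>6} + {activation:<4}: "
              f"rms after layer 1 = {rms[0]:.3g}, "
              f"after layer {len(rms)} = {rms[-1]:.3g}, "
              f"fraction |a| > 0.99 = {saturated:.2f}")
```

Under these assumptions, He init with ReLU should keep the RMS close to 1 at every depth, while Xavier init with ReLU (where fan_in = fan_out gives Var(W) = 1/fan_in) roughly halves the mean square per layer and so shrinks the signal geometrically. The "small" setting collapses tanh activations toward zero, and the "large" setting pushes most of them into the flat tails near ±1, where the local gradient is almost zero.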