Neural Networks: From Fundamentals to Modern AI · Backpropagation: How a Network Learns

Backprop Ninja: manual backward through cross-entropy, linear, tanh and batch-norm

Introduction

In "makemore part 4", Karpathy shows that a framework like PyTorch shields you from what backward actually does, and argues that you do not really understand a network until you have derived its gradients by hand through cross-entropy with fused softmax, a linear layer, tanh, and batch-norm.

This lesson walks through exactly those four operations. We cover why dL/dlogits = (p - y_true) / N (with p the softmax output, y_true the one-hot target, and N the batch size) is so compact only because softmax and cross-entropy fuse analytically; how a linear layer Y = X*W + b, with * denoting the matrix product, yields three gradients (dW = X^T * dY, db = sum(dY, axis=0), dX = dY * W^T) and why each gradient's shape must match the tensor it belongs to; why tanh'(x) = 1 - tanh(x)^2 lets you compute backward from the stored output alone, without touching the input again; and how batch-norm, with its mu, sigma, x_hat and affine gamma, beta, has five gradient paths that Karpathy derives line by line.

Along the way we cover numerical gradient checking (central difference with step h ~ 1e-4), common bugs (forgetting the /N that comes from averaging the loss over the batch, in-place ops breaking autograd), and the payoff: once you have derived all of this by hand, you can often recognize a training bug from a tensor shape rather than from a stack trace.
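As a preview of where the lesson is heading, the four backward passes and the gradient check can be sketched in a few dozen lines of NumPy. This is a minimal illustration under stated assumptions, not the lecture's own code: the function names (`forward`, `backward`, `numeric_grad`), the tensor sizes, and the epsilon value are all made up for the example; it uses NumPy in float64 rather than PyTorch; batch-norm uses the biased (1/N) variance, which may differ from the lecture; and the first linear layer has no bias, since a bias placed directly before batch-norm is cancelled by the mean subtraction.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, H, C = 8, 5, 6, 4      # batch, input, hidden, classes (arbitrary sizes)
EPS = 1e-5                   # batch-norm epsilon (assumed value)

X = rng.normal(size=(N, D))
y = rng.integers(0, C, size=N)          # integer class labels
params = {
    "W1": 0.5 * rng.normal(size=(D, H)),
    "gamma": 1.0 + 0.1 * rng.normal(size=H),
    "beta": 0.1 * rng.normal(size=H),
    "W2": 0.5 * rng.normal(size=(H, C)),
    "b2": 0.1 * rng.normal(size=C),
}


def forward(p):
    """linear -> batch-norm -> tanh -> linear -> softmax cross-entropy."""
    z = X @ p["W1"]                                  # (N, H)
    mu = z.mean(axis=0)
    var = z.var(axis=0)                              # biased variance (1/N), an assumption
    inv_std = 1.0 / np.sqrt(var + EPS)
    xhat = (z - mu) * inv_std
    a = np.tanh(p["gamma"] * xhat + p["beta"])       # (N, H)
    logits = a @ p["W2"] + p["b2"]                   # (N, C)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(N), y]).mean()    # mean over the batch
    return loss, (xhat, inv_std, a, probs)


def backward(p, cache):
    """Manual gradients of the mean loss with respect to every parameter."""
    xhat, inv_std, a, probs = cache
    # Fused softmax + cross-entropy: dL/dlogits = (p - y_onehot) / N
    dlogits = probs.copy()
    dlogits[np.arange(N), y] -= 1.0
    dlogits /= N
    # Linear layer Y = X @ W + b: three gradients, each shaped like its tensor
    dW2 = a.T @ dlogits                 # (H, C)
    db2 = dlogits.sum(axis=0)           # (C,)
    da = dlogits @ p["W2"].T            # (N, H)
    # tanh: the derivative needs only the stored output a
    dbn = (1.0 - a ** 2) * da
    # Batch-norm affine parameters
    dgamma = (dbn * xhat).sum(axis=0)
    dbeta = dbn.sum(axis=0)
    dxhat = dbn * p["gamma"]
    # Batch-norm input: direct path plus the paths through mu and var
    dz = (inv_std / N) * (
        N * dxhat - dxhat.sum(axis=0) - xhat * (dxhat * xhat).sum(axis=0)
    )
    dW1 = X.T @ dz                      # (D, H)
    return {"W1": dW1, "gamma": dgamma, "beta": dbeta, "W2": dW2, "b2": db2}


def numeric_grad(p, name, h=1e-4):
    """Central-difference estimate of dL/dp[name], one element at a time."""
    grad = np.zeros_like(p[name])
    for idx in np.ndindex(p[name].shape):
        old = p[name][idx]
        p[name][idx] = old + h
        loss_plus, _ = forward(p)
        p[name][idx] = old - h
        loss_minus, _ = forward(p)
        p[name][idx] = old              # restore the original value
        grad[idx] = (loss_plus - loss_minus) / (2 * h)
    return grad


loss, cache = forward(params)
grads = backward(params, cache)
for name in params:
    err = np.abs(grads[name] - numeric_grad(params, name)).max()
    print(f"{name:6s} shape={grads[name].shape} max abs diff={err:.2e}")
```

If the `dlogits /= N` line is removed, every gradient comes out N times too large and the check fails on all parameters at once, which is one quick way to see the "forgot /N" bug in practice.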