Neural Networks: From Fundamentals to Modern AI · Regularization — how to avoid overfitting

Batch Normalization: the internal covariate shift problem and its solution

Introduction

Batch Normalization (Ioffe & Szegedy 2015) changed how deep networks are trained. Presenting it as a fix for internal covariate shift, the authors proposed a layer that normalizes activations along the batch dimension: it subtracts the batch mean, divides by the batch standard deviation, then multiplies by a learned scale γ and adds a learned shift β. In practice BN made it possible to train deeper networks with larger learning rates and lower sensitivity to initialization, and the noise in the batch statistics adds a mild regularizing effect. Ten years later it is still standard in CNNs (ResNet, EfficientNet, RegNet), although the "internal covariate shift" diagnosis has been challenged (Santurkar et al. 2018, "How Does Batch Normalization Help Optimization?") in favor of a geometric interpretation: BN smooths the loss surface. This lesson walks through the exact BN algorithm in train and eval modes (batch statistics versus running statistics), the dependence on batch size, how BN fits with the residual connections in ResNets, and its weak points: small batches give unstable statistics, variable-length sequences are awkward to normalize, and distributed training splits each batch across devices.
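To make the train/eval distinction concrete before the detailed walkthrough, here is a minimal pure-Python sketch of the forward pass. The class name BatchNorm1D, the defaults eps=1e-5 and momentum=0.1, and the exponential-moving-average update rule are assumptions chosen for illustration, not something this lesson has specified; the sketch also uses the biased batch variance everywhere, whereas real frameworks may differ in such details. γ and β are held at their initial values because there is no training loop here.

```python
import math


class BatchNorm1D:
    """Illustrative batch normalization over a batch of feature vectors."""

    def __init__(self, num_features, eps=1e-5, momentum=0.1):
        self.eps = eps                # assumed small constant to avoid division by zero
        self.momentum = momentum      # assumed weight of the newest batch statistic
        self.gamma = [1.0] * num_features         # learned scale (not trained here)
        self.beta = [0.0] * num_features          # learned shift (not trained here)
        self.running_mean = [0.0] * num_features  # used only in eval mode
        self.running_var = [1.0] * num_features

    def forward(self, batch, training):
        n = len(batch)
        out = [[0.0] * len(self.gamma) for _ in range(n)]
        for j in range(len(self.gamma)):
            column = [row[j] for row in batch]
            if training:
                # Train mode: statistics come from the current batch.
                mean = sum(column) / n
                var = sum((x - mean) ** 2 for x in column) / n  # biased variance
                # Update the running statistics as an exponential moving average.
                m = self.momentum
                self.running_mean[j] = (1 - m) * self.running_mean[j] + m * mean
                self.running_var[j] = (1 - m) * self.running_var[j] + m * var
            else:
                # Eval mode: use the stored running statistics, leave them unchanged.
                mean, var = self.running_mean[j], self.running_var[j]
            inv_std = 1.0 / math.sqrt(var + self.eps)
            for i, x in enumerate(column):
                out[i][j] = self.gamma[j] * (x - mean) * inv_std + self.beta[j]
        return out


bn = BatchNorm1D(num_features=2)
batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
train_out = bn.forward(batch, training=True)          # normalized with batch statistics
eval_out = bn.forward([[2.5, 25.0]], training=False)  # normalized with running statistics
```

Note that eval mode works even for a single example, because it never computes statistics from the incoming batch. In train mode the same single-example batch would have zero variance, which is one way to see why very small batches are a weak point.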