Neural Networks: From Fundamentals to Modern AI · Regularization — how to avoid overfitting

Layer Normalization: when BatchNorm fails and how to replace it

Introduction

Layer Normalization (LN; Ba, Kiros, Hinton 2016) was proposed to address the weaknesses of BatchNorm (BN) on sequential tasks and with small batches. Instead of normalizing each feature across the batch, LN normalizes across the feature dimension within a single example: μ and σ are computed independently for every sample, so the result does not depend on batch size, no running statistics are needed, and the layer behaves identically in training and evaluation.

The change in the math is small, but the architectural consequences are large. LN is the default normalization in transformers (Vaswani et al. 2017, GPT, BERT), in Vision Transformers and ConvNeXt, and in recurrent networks (for example LSTM with LN); Llama uses the RMSNorm variant. It is the usual choice wherever batches are small, sequences have variable length, or training is distributed across devices.

This lesson covers the exact LN formula and how it differs from BN; where to place LN in a transformer block (Pre-LN vs Post-LN, following "On Layer Normalization in the Transformer Architecture", Xiong et al. 2020); variants such as RMSNorm (Zhang & Sennrich 2019, used in Llama 2/3); and the cases where LN itself becomes pathological (d_model too small, missing γ/β, the choice of the numerical ε).
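To make the difference in normalization axis concrete, here is a minimal, dependency-free sketch. It assumes plain Python lists stand in for tensors; the function names (layer_norm, rms_norm, batch_norm_train), the toy batch, and the default eps=1e-5 are illustrative choices, not taken from the lesson or from any library. The BatchNorm helper covers only training-mode statistics and omits γ/β and running averages.

```python
import math

def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """LN for one example: statistics over that example's own d features."""
    d = len(x)
    mean = sum(x) / d
    var = sum((v - mean) ** 2 for v in x) / d  # biased (population) variance
    inv_std = 1.0 / math.sqrt(var + eps)
    gamma = gamma if gamma is not None else [1.0] * d  # learnable scale
    beta = beta if beta is not None else [0.0] * d     # learnable shift
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]

def rms_norm(x, gamma=None, eps=1e-5):
    """RMSNorm: divide by the root mean square; no mean subtraction, no beta."""
    d = len(x)
    mean_sq = sum(v * v for v in x) / d
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    gamma = gamma if gamma is not None else [1.0] * d
    return [g * v * inv_rms for v, g in zip(x, gamma)]

def batch_norm_train(batch, eps=1e-5):
    """Training-mode BN (no gamma/beta): statistics per feature, across the batch."""
    n, d = len(batch), len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(d)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n for j in range(d)]
    return [
        [(row[j] - means[j]) / math.sqrt(variances[j] + eps) for j in range(d)]
        for row in batch
    ]

# Toy batch: 2 examples, 4 features each (hypothetical values).
batch = [[1.0, 2.0, 3.0, 4.0], [10.0, 0.0, -10.0, 0.0]]

# LN: the first example gets the same output alone or inside the batch.
ln_in_batch = [layer_norm(x) for x in batch]
ln_alone = layer_norm(batch[0])

# BN: the first example's output depends on what else is in the batch.
bn_in_batch = batch_norm_train(batch)
bn_alone = batch_norm_train(batch[:1])  # batch of one: every feature collapses to 0

rms_out = rms_norm(batch[0])
```

Comparing ln_in_batch[0] with ln_alone, and bn_in_batch[0] with bn_alone, shows the point made above: the LN result for an example is unaffected by its batch, while the BN result changes with batch composition and degenerates to all zeros for a batch of one.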