Neural Networks: From Fundamentals to Modern AI · Sequences: RNN, LSTM and GRU

Vanishing and exploding gradients in deep RNNs

Introduction

A classical RNN, despite the elegance of parameter sharing, has a serious practical defect: the gradient flowing backwards through T steps of BPTT (backpropagation through time) typically either shrinks toward zero (vanishing) or grows exponentially (exploding). Bengio, Simard and Frasconi 1994 ("Learning long-term dependencies with gradient descent is difficult") framed this as a fundamental problem: ∂h_T/∂h_1 = ∏_{t=2..T} ∂h_t/∂h_{t-1}, a product of T-1 Jacobians.

Pascanu, Mikolov and Bengio 2013 ("On the difficulty of training recurrent neural networks") gave a precise analysis: if the largest singular value σ_max(∂h_t/∂h_{t-1}) is bounded below 1 at every step, the product tends to zero exponentially in T (vanishing); σ_max > 1 is a necessary condition for the product to grow exponentially (exploding). Vanishing means that early, distant tokens contribute almost nothing to the weight updates, so the model cannot learn long-range dependencies. Exploding produces gradients with norms on the order of 10^6, NaNs in the weights, and a diverged training run.

There are three remedies. Two are practical fixes: (1) gradient clipping (Pascanu et al. 2013): rescale the gradient when its norm exceeds a threshold, which is simple and effective against exploding; (2) better initialization: orthogonal init (Saxe et al. 2014) or identity init for ReLU RNNs (IRNN; Le, Jaitly and Hinton 2015). The third and most radical remedy is (3) gating: LSTM (Hochreiter and Schmidhuber 1997) introduces the CEC (constant error carousel), in which the gradient flows backwards through plain addition instead of repeated multiplication, sidestepping the vanishing problem. This leads to the next lessons on LSTM and GRU.
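
The effect of the Jacobian product, and of clipping by norm, can be illustrated with a small sketch. It rests on simplifying assumptions that are not part of the lesson: the recurrence is taken to be linear, h_t = W h_{t-1}, so every Jacobian equals W, and W is a 2x2 scaled rotation, so all its singular values equal `scale`. The names `scaled_rotation`, `backprop_norm` and `clip_by_norm` are hypothetical helpers chosen for this illustration, not functions from any library.

```python
import math

def scaled_rotation(scale, angle):
    """2x2 matrix scale * R(angle); every singular value equals `scale`."""
    c, s = math.cos(angle), math.sin(angle)
    return [[scale * c, -scale * s],
            [scale * s,  scale * c]]

def matvec_transposed(W, g):
    """Return W^T g for a 2x2 matrix W and a length-2 vector g."""
    return [W[0][0] * g[0] + W[1][0] * g[1],
            W[0][1] * g[0] + W[1][1] * g[1]]

def norm(v):
    """L2 norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def backprop_norm(scale, T, angle=0.3):
    """Norm of the gradient at h_1 after backpropagating a unit gradient from h_T.

    Assumption: linear recurrence h_t = W h_{t-1}, so each of the
    T-1 Jacobians in the product is the same matrix W.
    """
    W = scaled_rotation(scale, angle)
    g = [1.0, 0.0]                 # unit gradient arriving at h_T
    for _ in range(T - 1):         # one multiplication by W^T per time step
        g = matvec_transposed(W, g)
    return norm(g)

def clip_by_norm(grad, threshold):
    """Rescale grad so its L2 norm is at most threshold; direction is kept."""
    n = norm(grad)
    if n > threshold:
        return [x * threshold / n for x in grad]
    return list(grad)

if __name__ == "__main__":
    T = 50
    print("sigma_max = 0.9:", backprop_norm(0.9, T))   # shrinks like 0.9**(T-1)
    print("sigma_max = 1.1:", backprop_norm(1.1, T))   # grows like 1.1**(T-1)
    print("clipped:", clip_by_norm([3e6, 4e6], 5.0))   # norm reduced from 5e6 to 5
```

In this toy setting the gradient norm is exactly scale^(T-1), so 50 steps at σ_max = 0.9 already shrink it by more than two orders of magnitude, while σ_max = 1.1 inflates it by about the same factor. Clipping only bounds the exploding case; it does nothing to restore a gradient that has already vanished, which is why gating is treated as a separate remedy.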