Architecture

Recurrent Neural Network

Historical

How it works

1. At each time step t, the network receives two inputs: the current input vector x_t and the previous hidden state h_{t-1}.

2. It computes the new state: h_t = tanh(W_xh · x_t + W_hh · h_{t-1} + b_h). The same weight matrices W_xh and W_hh are shared across all time steps.

3. Optional per-step output: y_t = W_hy · h_t + b_y (e.g. next-token logits, converted to probabilities with a softmax).

4. Training uses BPTT (backpropagation through time): the network is unrolled for T steps and gradients are propagated backward through the unrolled graph. Because the parameters are shared, the loss gradient w.r.t. each parameter is summed across all steps.

5. Truncated BPTT: the propagation horizon is limited to k steps to reduce cost.

6. Vanishing- and exploding-gradient problem: over long sequences, the gradient is multiplied by W_hh (and by the tanh derivative) once per step, up to T times, so it tends to decay or grow exponentially. Decay makes long-range dependencies hard to learn.
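The forward pass in steps 1 to 3 can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names `rnn_forward` and `softmax`, the dimensions, and the random initialization are assumptions made for the example.

```python
import numpy as np

def rnn_forward(xs, h0, W_xh, W_hh, W_hy, b_h, b_y):
    """Run a vanilla RNN over one sequence.

    xs has shape (T, input_dim) and h0 has shape (hidden_dim,).
    Returns hidden states of shape (T, hidden_dim) and
    per-step outputs of shape (T, output_dim).
    """
    h = h0
    hs, ys = [], []
    for x_t in xs:  # strictly sequential: h_t depends on h_{t-1}
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)  # step 2: state update
        ys.append(W_hy @ h + b_y)                 # step 3: unnormalized scores
        hs.append(h)
    return np.stack(hs), np.stack(ys)

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Illustrative sizes and random parameters (assumptions, not from the text).
rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim = 3, 5, 4
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_hy = rng.normal(scale=0.1, size=(output_dim, hidden_dim))
b_h = np.zeros(hidden_dim)
b_y = np.zeros(output_dim)

# The same parameters handle sequences of different lengths.
for T in (7, 2):
    xs = rng.normal(size=(T, input_dim))
    hs, ys = rnn_forward(xs, np.zeros(hidden_dim), W_xh, W_hh, W_hy, b_h, b_y)
    probs = softmax(ys)
    print(T, hs.shape, probs.shape)
```

The loop body is the whole model: nothing in it depends on T, which is why one set of weights covers any sequence length.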

Problem solved

Standard feed-forward networks (MLP, CNN) process fixed-size inputs and have no built-in memory, so they cannot carry information from one element of a sequence to the next (e.g. words in a sentence, audio samples). The RNN addresses this by introducing a hidden state carried across time steps, which enables modelling of variable-length sequences and temporal dependencies.

Key mechanisms

Hidden state carried across time steps

Weights shared across time — the same matrix at every step

Update: h_t = tanh(W_xh · x_t + W_hh · h_{t-1} + b_h)

Backpropagation through time (BPTT) — unrolling the network in time

Truncated BPTT — limiting propagation horizon to a window

Topology variants: one-to-many, many-to-one, many-to-many, encoder-decoder

Gradient clipping as the standard training-stabilization technique
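The list above names gradient clipping without saying which form. One common variant is clipping by global norm, sketched below in NumPy. The helper name `clip_by_global_norm` and the toy gradient values are assumptions for illustration only.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    # Scale is 1.0 when the gradients are already within the limit.
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads], total_norm

# Toy gradients standing in for dL/dW_hh and dL/db_h (made-up values).
grads = [np.full((2, 2), 3.0), np.full((2,), 4.0)]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(float(np.sum(g ** 2)) for g in clipped))
print(norm_before, norm_after)
```

Rescaling all gradients by one shared factor keeps the update direction and only limits its length, which is why this helps with explosion but does nothing for vanishing gradients.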

Strengths & limitations

Strengths

✓Natural handling of variable-length sequences

✓Linear time complexity in sequence length (vs quadratic for Transformer self-attention)

✓Low memory footprint in online inference

✓Few parameters thanks to weight sharing

✓Streaming-friendly — one token at a time, no context buffer

✓Well suited to edge devices and low-latency tasks
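The streaming and low-memory points above can be sketched as a single-step update that keeps only the current hidden state. The class name `StreamingRNN`, the sizes, and the random parameters are hypothetical and chosen for the example.

```python
import numpy as np

class StreamingRNN:
    """Hypothetical helper: holds the parameters and the current hidden state."""

    def __init__(self, W_xh, W_hh, b_h):
        self.W_xh, self.W_hh, self.b_h = W_xh, W_hh, b_h
        self.h = np.zeros(W_hh.shape[0])  # the only state kept between steps

    def step(self, x_t):
        """Consume one input vector and return the updated hidden state."""
        self.h = np.tanh(self.W_xh @ x_t + self.W_hh @ self.h + self.b_h)
        return self.h

# Illustrative sizes and random parameters (assumptions, not from the text).
rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 5
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

cell = StreamingRNN(W_xh, W_hh, b_h)
for x_t in rng.normal(size=(1000, input_dim)):  # inputs arrive one at a time
    h = cell.step(x_t)
print(h.shape)
```

After 1000 inputs the stored state is still a single vector of length `hidden_dim`, so memory use during inference does not grow with the length of the stream.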

Limitations

✗Vanishing-gradient problem — hard to learn long-term dependencies

✗Exploding-gradient problem — requires clipping

✗Sequential training — hard to parallelize across sequence length

✗Short effective memory (a few dozen steps) in the vanilla variant

✗Poor scaling to very large models vs the Transformer

✗Weaker than Transformers on long text contexts

Implementation

Implementation pitfalls

Vanishing and exploding gradients (Medium)

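During BPTT, gradients are multiplied by the recurrent weight matrix (and the activation derivative) at each step, so for long sequences they vanish or explode. Gradient clipping mitigates explosion, but it does not help with vanishing, which is usually addressed with gated architectures such as LSTM or GRU.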
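The decay can be made concrete with a small numerical experiment. For the tanh update, the one-step Jacobian is d h_t / d h_{t-1} = diag(1 - h_t^2) · W_hh, and since the tanh derivative is at most 1, the norm of the product over t steps is bounded by ||W_hh||^t. The sketch below assumes a random W_hh rescaled to spectral norm 0.5; the sizes and the seed are arbitrary. Backpropagation uses the transposes of these Jacobians, which have the same norms.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, input_dim, T = 8, 3, 50

W_hh = rng.normal(size=(hidden_dim, hidden_dim))
W_hh *= 0.5 / np.linalg.norm(W_hh, 2)  # rescale to spectral norm 0.5 (assumption)
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
b_h = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)
J = np.eye(hidden_dim)  # accumulates d h_t / d h_0
norms = []
for t in range(T):
    x_t = rng.normal(size=input_dim)
    h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
    # Chain rule: one-step Jacobian times the product accumulated so far.
    J = (np.diag(1.0 - h ** 2) @ W_hh) @ J
    norms.append(np.linalg.norm(J, 2))

# Sensitivity of h_t to h_0 after 1, 10 and 50 steps.
print(norms[0], norms[9], norms[-1])
```

With a spectral norm above 1 the same product can grow instead of shrink, which is the exploding case that clipping is meant to contain.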

Sequential processing prevents full parallelism (Medium)

An RNN must process step t before step t+1, because h_{t+1} depends on h_t, so computation cannot be parallelized along the time axis. For long sequences this is the main training-speed bottleneck.