1. At each time step t, the network receives two inputs: the current input vector x_t and the previous hidden state h_{t-1}.
2. It computes the new state: h_t = tanh(W_xh · x_t + W_hh · h_{t-1} + b_h). The same weight matrices W_xh and W_hh are shared across all time steps.
3. Optional per-step output: y_t = W_hy · h_t + b_y (e.g. logits that a softmax turns into next-token probabilities).
4. Training uses BPTT (backpropagation through time): the network is unrolled for T steps and gradients are propagated backward through the unrolled graph. Because the weights are shared, the loss gradient w.r.t. each parameter is summed across all steps.
5. Truncated BPTT: the backpropagation horizon is limited to k steps to reduce compute and memory cost.
6. Vanishing-gradient problem: over long sequences, the gradient is multiplied at every step by the recurrent Jacobian (W_hh scaled by the tanh derivative), up to T times in total → exponential decay or explosion → inability to learn long-range dependencies.
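The recurrence and per-step output described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the helper names (`matvec`, `rnn_step`, `rnn_forward`, `random_matrix`), the layer sizes, and the uniform random initialization are all assumptions made for the example.

```python
import math
import random

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One recurrence step: h_t = tanh(W_xh · x_t + W_hh · h_{t-1} + b_h)."""
    a = matvec(W_xh, x_t)
    r = matvec(W_hh, h_prev)
    return [math.tanh(a_i + r_i + b_i) for a_i, r_i, b_i in zip(a, r, b_h)]

def rnn_forward(xs, h0, W_xh, W_hh, b_h, W_hy, b_y):
    """Apply the same weights at every step; return all hidden states and outputs."""
    h = h0
    hs, ys = [], []
    for x_t in xs:
        h = rnn_step(x_t, h, W_xh, W_hh, b_h)
        # Per-step output y_t = W_hy · h_t + b_y (raw scores, no softmax applied).
        y = [o + b for o, b in zip(matvec(W_hy, h), b_y)]
        hs.append(h)
        ys.append(y)
    return hs, ys

def random_matrix(rows, cols, rng, scale=0.5):
    """Hypothetical initializer: uniform values in [-scale, scale]."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

# Illustrative sizes only.
rng = random.Random(0)
input_size, hidden_size, output_size, T = 3, 4, 2, 5
W_xh = random_matrix(hidden_size, input_size, rng)
W_hh = random_matrix(hidden_size, hidden_size, rng)
W_hy = random_matrix(output_size, hidden_size, rng)
b_h = [0.0] * hidden_size
b_y = [0.0] * output_size
xs = [[rng.uniform(-1, 1) for _ in range(input_size)] for _ in range(T)]
h0 = [0.0] * hidden_size

hs, ys = rnn_forward(xs, h0, W_xh, W_hh, b_h, W_hy, b_y)
print(len(hs), len(hs[0]), len(ys[0]))  # 5 4 2
```

The loop in `rnn_forward` makes the weight sharing explicit: the only quantity that changes from step to step is the hidden state `h`.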
Standard neural networks (MLP, CNN) process fixed-size inputs and have no built-in memory, so they cannot model dependencies between elements of a sequence (e.g. words in a sentence, audio samples). An RNN solves this by introducing a hidden state carried across time steps, which enables modelling of variable-length sequences and temporal dependencies.
During BPTT, gradients are multiplied by the recurrent weight matrix at each step, so for long sequences they vanish or explode. Gradient clipping mitigates explosion, but vanishing gradients are typically addressed with gated architectures such as LSTM or GRU.
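To make the repeated multiplication concrete, the sketch below uses a scalar RNN (hidden size 1) so that the gradient of the last state with respect to the first is an ordinary product of numbers, and adds a simple norm-based clipping helper. The function names `gradient_through_time` and `clip_by_global_norm`, the scalar simplification, and the chosen weights are assumptions made for illustration only.

```python
import math

def gradient_through_time(w_hh, xs, h0=0.0):
    """Return |dh_T/dh_0| for the scalar RNN h_t = tanh(x_t + w_hh * h_{t-1}).

    By the chain rule this is the product over t of w_hh * (1 - h_t**2).
    """
    h, grad = h0, 1.0
    for x_t in xs:
        h = math.tanh(x_t + w_hh * h)
        grad *= w_hh * (1.0 - h * h)
    return abs(grad)

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient values so their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

# Zero inputs and a zero initial state keep tanh in its linear regime,
# so each step multiplies the gradient by exactly w_hh.
steps = [0.0] * 50
vanishing = gradient_through_time(0.5, steps)  # |w_hh| < 1: shrinks every step
exploding = gradient_through_time(2.0, steps)  # |w_hh| > 1: grows every step
print(f"{vanishing:.3e}  {exploding:.3e}")

# Clipping bounds the size of an exploding gradient but cannot restore a vanished one.
clipped = clip_by_global_norm([exploding, 1.0], max_norm=1.0)
print(math.sqrt(sum(g * g for g in clipped)))
```

After 50 steps the two products differ by about thirty orders of magnitude (roughly 1e-15 versus 1e15). Clipping brings the large one back to a usable norm, while the small one carries almost no learning signal regardless of clipping.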
An RNN must process token t before token t+1, so computation cannot be parallelized along the time axis. For long sequences this is the main training-speed bottleneck.
On Penn Treebank, a vanilla RNN reached ~120 perplexity and an LSTM ~80, showing the advantage of gated variants. In WMT 2014 machine translation, RNN/LSTM seq2seq models reached ~30 BLEU, trailing Transformers by a few points. RNNs remain standard in on-device speech models and in some time-series forecasting systems.
RNN matrix operations (W_xh, W_hh) are accelerated by cuBLAS. cuDNN's fused kernels for RNN/LSTM/GRU give a 5-10× speedup over a naive PyTorch implementation.