GRU (Gated Recurrent Unit)
How it works
GRU has two gates: (1) the reset gate (r), which controls how much of the previous state feeds into the computation of the new state candidate; (2) the update gate (z), which sets the proportion between the old state and the new candidate. Unlike LSTM, GRU has no separate cell state; the hidden state is its only memory.
Problem solved
LSTM largely mitigated the vanishing gradient problem, but at the cost of complexity (3 gates, 2 states). GRU simplifies the architecture to 2 gates and 1 state, retaining the ability to model long-range dependencies at lower computational cost.
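As a rough illustration of the saving, the parameter counts of a single layer can be compared. The sketch below assumes the concatenated-input formulation (one weight matrix of shape d × (d + n) and one bias of size d per gate or candidate block); the function names and the example sizes are hypothetical, and library implementations may lay out biases differently.

```python
def gated_rnn_params(input_size: int, hidden_size: int, num_blocks: int) -> int:
    """Parameter count of one recurrent layer with `num_blocks` weight blocks.

    Assumed layout (for illustration only): each block has a weight matrix of
    shape (hidden, hidden + input) and a bias of shape (hidden,).
    """
    per_block = hidden_size * (hidden_size + input_size) + hidden_size
    return num_blocks * per_block


def lstm_params(input_size: int, hidden_size: int) -> int:
    # LSTM: input, forget and output gates plus the cell candidate = 4 blocks
    return gated_rnn_params(input_size, hidden_size, num_blocks=4)


def gru_params(input_size: int, hidden_size: int) -> int:
    # GRU: reset and update gates plus the state candidate = 3 blocks
    return gated_rnn_params(input_size, hidden_size, num_blocks=3)


n, d = 128, 256  # hypothetical input and hidden sizes
print(lstm_params(n, d), gru_params(n, d))
```

Under this layout a GRU layer has three quarters of the parameters of an LSTM layer of the same size, since it has three weight blocks instead of four.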
Key mechanisms
Strengths & limitations
Components
Reset gate (r_t): decides how much the previous hidden state h_{t-1} should influence the computation of the new state candidate h~_t. A value close to 0 = ignore the past; close to 1 = take all of the past into account. r_t = sigmoid(W_r · [h_{t-1}, x_t] + b_r).
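Update gate (z_t): interpolates between the old state h_{t-1} and the candidate h~_t. z_t = sigmoid(W_z · [h_{t-1}, x_t] + b_z), and h_t = (1 - z_t) * h_{t-1} + z_t * h~_t. A value close to 0 = keep the old state (long-term memory); close to 1 = adopt the new state. Some references and libraries (for example PyTorch's nn.GRU) use the opposite convention, h_t = (1 - z_t) * h~_t + z_t * h_{t-1}, which swaps the meaning of 0 and 1.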
Candidate state (h~_t): the proposed new hidden state, h~_t = tanh(W · [r_t * h_{t-1}, x_t] + b), where * is element-wise multiplication. It combines new information from the current input x_t with the part of the past that the reset gate lets through.
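The three components above can be put together into a single time step. The following is a minimal NumPy sketch for one example, not a reference implementation: it follows the formulas of this section (including the convention that z_t close to 1 adopts the candidate), and the parameter dictionary, the `init_params` helper, and the sizes are assumptions made for illustration.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def gru_step(x_t, h_prev, params):
    """One GRU step for a single example.

    x_t: input of shape (n,); h_prev: previous hidden state of shape (d,).
    params: dict with W_r, W_z, W of shape (d, d + n) and b_r, b_z, b of shape (d,).
    """
    hx = np.concatenate([h_prev, x_t])                  # [h_{t-1}, x_t]
    r_t = sigmoid(params["W_r"] @ hx + params["b_r"])   # reset gate
    z_t = sigmoid(params["W_z"] @ hx + params["b_z"])   # update gate
    rhx = np.concatenate([r_t * h_prev, x_t])           # [r_t * h_{t-1}, x_t]
    h_cand = np.tanh(params["W"] @ rhx + params["b"])   # candidate state
    return (1.0 - z_t) * h_prev + z_t * h_cand          # interpolate old and new


def init_params(n, d, seed=0):
    """Hypothetical helper: small random weights and zero biases."""
    rng = np.random.default_rng(seed)
    params = {}
    for w_name, b_name in (("W_r", "b_r"), ("W_z", "b_z"), ("W", "b")):
        params[w_name] = rng.normal(scale=0.1, size=(d, d + n))
        params[b_name] = np.zeros(d)
    return params


params = init_params(n=3, d=4)
h = np.zeros(4)
for x_t in np.ones((5, 3)):   # a toy sequence of 5 steps
    h = gru_step(x_t, h, params)
print(h.shape)  # (4,)
```

Because h_t is a convex combination of h_{t-1} and a tanh output, every component of the state stays inside (-1, 1) when the initial state does.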
Implementation
Step t of a GRU depends on the state from step t-1, so the time steps of a sequence cannot be computed in parallel on a GPU. For long sequences (T>1000), training is typically much slower than for a Transformer, despite the GRU having fewer parameters.
Despite its selective-memory gates, GRU degrades on very long dependencies. Information in the hidden state is progressively overwritten at each step, so the model 'forgets' older context. As a rough rule of thumb, effective dependencies reach about 100-500 steps, although this depends on the task and the model size.
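A simplified calculation shows why the memory fades. If the update gate is held at a constant scalar value z and the candidate carries no information about the initial state, the update h_t = (1 - z) * h_{t-1} + z * h~_t leaves a weight of (1 - z)^k on the initial state after k steps. The sketch below only evaluates this expression; a trained GRU has input-dependent, per-unit gates, so this is an assumption-laden illustration rather than a prediction, and the function names are hypothetical.

```python
import math


def retained_fraction(z: float, steps: int) -> float:
    """Weight of the initial state after `steps` updates with a constant gate z."""
    return (1.0 - z) ** steps


def steps_until(z: float, threshold: float) -> int:
    """Smallest number of steps after which the retained weight is at most `threshold`."""
    return math.ceil(math.log(threshold) / math.log(1.0 - z))


for z in (0.01, 0.05, 0.2):
    k = steps_until(z, threshold=0.01)
    print(z, k, retained_fraction(z, k))
```

In this toy model, an average gate value on the order of 0.01 gives a horizon of a few hundred steps before the initial state's weight falls to 1%, and larger gate values shorten it sharply.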
Although GRU handles vanishing gradients better than a vanilla RNN, gradients can still explode in very deep (multi-layer) networks or on very long sequences.
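Clipping the gradient by its global norm is a common mitigation for the exploding case. The following is a minimal NumPy sketch of that technique under stated assumptions: gradients are given as a plain list of arrays, and the function name, the example values, and the threshold are hypothetical.

```python
import numpy as np


def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so that their joint L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))  # never scale up
    return [g * scale for g in grads], total_norm


grads = [np.full((2, 2), 3.0), np.full(3, 4.0)]  # hypothetical gradients
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(float(np.sum(g ** 2)) for g in clipped))
print(norm_before, norm_after)
```

All arrays are scaled by the same factor, so the direction of the overall update is preserved and only its length is capped.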
Evolution
The paper 'Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation' (Cho et al., 2014; arXiv 1406.1078) introduces the unit later named GRU as a simplified variant of LSTM for machine translation. It is the first publication to use an encoder-decoder with gated recurrent units.
Chung, Gulcehre, Cho, and Bengio publish 'Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling' (2014; arXiv 1412.3555). They show that GRU and LSTM achieve comparable performance on polyphonic music modeling and speech signal modeling, with GRU often converging somewhat faster. The paper becomes a standard reference when choosing between GRU and LSTM.
The 'Attention Is All You Need' paper (Vaswani et al., NeurIPS 2017) introduces the Transformer, which scales far better than RNN/GRU thanks to parallel sequence processing on GPU. From this point on, GRU is gradually displaced in NLP, though it remains relevant in edge AI and streaming applications.
Mamba (Gu & Dao, 2023) and other state-space models show that recurrent models with linear complexity in sequence length can rival the Transformer on long contexts. Although technically not GRUs, they inherit the idea of selective, gated memory. GRU gains new historical value as an early, widely deployed demonstration (alongside the older LSTM) that gated recurrence works.
Compute bottleneck
GRU's main bottleneck is inherent sequentiality: step t depends on the state from step t-1, so time steps cannot be parallelized. For a sequence of length T there are T sequential steps, each dominated by matrix operations costing O(d²) per example, for O(T · d²) in total. This contrasts with the Transformer, which processes the whole sequence in parallel at the cost of O(T² · d) attention. For short sequences (T<100) GRU can be faster, but for long sequences (T>1000) the Transformer usually trains faster on parallel hardware despite its quadratic complexity in T.
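The trade-off can be made concrete with a deliberately crude cost model. The sketch below only evaluates the two big-O expressions from the paragraph above, with constants dropped, and adds the number of steps that must run one after another; it ignores the Transformer's projection and feed-forward layers, and the function names and the width d = 512 are hypothetical.

```python
def gru_cost(T: int, d: int) -> dict:
    """Order-of-magnitude cost of one GRU layer over a length-T sequence."""
    return {"total_ops": T * d * d, "sequential_steps": T}


def attention_cost(T: int, d: int) -> dict:
    """Order-of-magnitude cost of one self-attention layer over the same sequence."""
    return {"total_ops": T * T * d, "sequential_steps": 1}


d = 512  # hypothetical model width
for T in (100, 1000, 10000):
    print(T, gru_cost(T, d), attention_cost(T, d))
```

In this model the GRU actually performs fewer total operations once T exceeds d. The Transformer's advantage on long sequences is that its work is not serialized along T, so parallel hardware can absorb it, whereas the GRU always needs T dependent steps.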
Execution paradigm
Parallelism
Hardware requirements
GRU maps well to GPUs through batching (parallelism along the batch dimension) and cuDNN-optimized kernels for multi-layer GRUs. However, the time axis T must still be processed sequentially, which limits the speedup relative to a Transformer.
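The split between the two axes can be seen in a plain NumPy sketch: every matrix product handles all B examples of the batch at once, while the loop over t cannot be removed. This is an illustration on CPU arrays, not how cuDNN implements it; the function name, the weight layout of shape (d + n, d), and the sizes are assumptions.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def gru_forward_batched(x, h0, W_r, W_z, W, b_r, b_z, b):
    """Run a GRU over x of shape (B, T, n); returns hidden states of shape (B, T, d).

    Weights have shape (d + n, d) so a batch of row vectors multiplies from
    the left; biases have shape (d,). h0 has shape (B, d).
    """
    T = x.shape[1]
    h = h0
    outputs = []
    for t in range(T):                                  # sequential along time
        hx = np.concatenate([h, x[:, t]], axis=1)       # (B, d + n)
        r = sigmoid(hx @ W_r + b_r)                     # all B examples at once
        z = sigmoid(hx @ W_z + b_z)
        rhx = np.concatenate([r * h, x[:, t]], axis=1)
        h_cand = np.tanh(rhx @ W + b)
        h = (1.0 - z) * h + z * h_cand
        outputs.append(h)
    return np.stack(outputs, axis=1)


rng = np.random.default_rng(0)
B, T, n, d = 8, 20, 3, 4  # hypothetical sizes
weights = [rng.normal(scale=0.1, size=(d + n, d)) for _ in range(3)]
biases = [np.zeros(d) for _ in range(3)]
x = rng.normal(size=(B, T, n))
hs = gru_forward_batched(x, np.zeros((B, d)), *weights, *biases)
print(hs.shape)  # (8, 20, 4)
```

Examples in the batch never interact, so increasing B adds parallel work per step, but the number of dependent steps stays at T.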
GRU works well on CPU for small models, especially at batch size 1 during streaming inference. It depends less on parallel hardware than a Transformer does.
Small GRUs (a few dozen to a few hundred hidden units) can run even on ARM Cortex-M microcontrollers via TensorFlow Lite for Microcontrollers (TFLite Micro). A small parameter count and the absence of attention make GRU a natural choice for edge AI in streaming signal processing (speech, IoT sensors).
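A back-of-the-envelope estimate shows why such models fit on microcontrollers. The sketch below counts the weights of one GRU layer in the concatenated-input formulation and multiplies by the storage size per parameter. The sizes are hypothetical, activations and runtime overhead are ignored, and real quantized models typically store biases at higher precision, so treat the result as a rough lower bound.

```python
def gru_weight_bytes(input_size: int, hidden_size: int, bytes_per_param: int) -> int:
    """Weight storage of one GRU layer.

    Assumes 3 blocks (reset, update, candidate), each with a weight matrix of
    shape (hidden, hidden + input) and a bias of shape (hidden,).
    """
    params = 3 * (hidden_size * (hidden_size + input_size) + hidden_size)
    return params * bytes_per_param


n, d = 40, 64  # hypothetical: 40 input features per frame, 64 hidden units
print(gru_weight_bytes(n, d, 4) / 1024)  # float32 weights, in KiB
print(gru_weight_bytes(n, d, 1) / 1024)  # int8 weights, in KiB
```

For these example sizes the layer has 20,160 parameters, which is roughly 79 KiB in float32 and roughly 20 KiB in int8.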