GRU has two gates: (1) the reset gate (r), which controls how much of the previous state is used when computing the new candidate state; (2) the update gate (z), which sets the proportion between the old state and the new candidate. Unlike LSTM, GRU has no separate cell state.
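The gate mechanics described above can be sketched as a single GRU step in plain Python. This is an illustrative sketch, not a reference implementation: the parameter names (`Wr`, `Ur`, `br`, and so on) are hypothetical, the reset gate is applied to the previous state before the recurrent matrix multiply, and the new state is taken as `(1 - z) * h_prev + z * h_tilde`. Some papers and libraries swap the roles of `z` and `1 - z` or apply the reset gate after the multiply.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def matvec(W, v):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def gru_step(x, h_prev, p):
    """One GRU step. p is a dict of hypothetical parameter names (assumption)."""
    def affine(W, U, b, h):
        # W @ x + U @ h + b, computed element by element.
        return [wx + uh + bi for wx, uh, bi in zip(matvec(W, x), matvec(U, h), b)]

    # Reset gate r: how much of the previous state feeds the candidate.
    r = [sigmoid(a) for a in affine(p["Wr"], p["Ur"], p["br"], h_prev)]
    # Update gate z: mixing proportion between old state and candidate.
    z = [sigmoid(a) for a in affine(p["Wz"], p["Uz"], p["bz"], h_prev)]
    # Candidate state is computed from the reset-gated previous state.
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    h_tilde = [math.tanh(a) for a in affine(p["Wh"], p["Uh"], p["bh"], gated)]
    # Single hidden state, no separate cell state (assumed mixing convention).
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, h_tilde)]
```

With all weights and biases set to zero, both gates evaluate to 0.5 and the candidate to 0, so the step simply halves the previous state.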
LSTM mitigated the vanishing gradient problem, but at the cost of complexity (3 gates, 2 states). GRU simplifies the architecture to 2 gates and 1 state, retaining the ability to model long-range dependencies at lower computational cost.
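A rough parameter count makes the saving concrete. The sketch below assumes a single layer where each weight block has an input matrix, a recurrent matrix, and one bias vector. LSTM has 4 such blocks (3 gates plus the candidate) and GRU has 3 (2 gates plus the candidate). The function names are illustrative, and real libraries may count differently (for example, by keeping two bias vectors per block).

```python
def recurrent_layer_params(input_size, hidden_size, n_blocks):
    # Each block: input weights + recurrent weights + one bias vector (assumption).
    per_block = hidden_size * (input_size + hidden_size) + hidden_size
    return n_blocks * per_block

def gru_params(input_size, hidden_size):
    # 2 gates (reset, update) + candidate state = 3 blocks.
    return recurrent_layer_params(input_size, hidden_size, 3)

def lstm_params(input_size, hidden_size):
    # 3 gates (input, forget, output) + candidate cell = 4 blocks.
    return recurrent_layer_params(input_size, hidden_size, 4)
```

Under these assumptions, `gru_params(128, 256)` gives 295680 and `lstm_params(128, 256)` gives 394240, so the GRU layer has 3/4 of the LSTM layer's parameters.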
GRU processes tokens sequentially: each step depends on the previous hidden state, so computation cannot be parallelized along the time axis as it can in a Transformer. Training on long sequences is therefore slower despite fewer parameters.
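The sequential dependency is easiest to see in the time loop itself. The sketch below uses a scalar GRU (hidden size 1) with hypothetical weight names in the dict `w`. It is only meant to show that step t needs the result of step t-1, not to model a real layer.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def scalar_gru_step(x, h_prev, w):
    # Scalar GRU step; weight names in w are illustrative assumptions.
    z = sigmoid(w["wz"] * x + w["uz"] * h_prev)
    r = sigmoid(w["wr"] * x + w["ur"] * h_prev)
    h_tilde = math.tanh(w["wh"] * x + w["uh"] * (r * h_prev))
    return (1.0 - z) * h_prev + z * h_tilde

def run_sequence(xs, w, h0=0.0):
    h = h0
    states = []
    for x in xs:
        # h from the previous iteration is an input here, so the
        # iterations of this loop cannot run in parallel over time.
        h = scalar_gru_step(x, h, w)
        states.append(h)
    return states
```

Parallelism is still available across the batch and hidden dimensions, just not across time steps.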
Despite its reset and update gates, GRU still struggles to propagate gradients over very long sequences (more than about 1000 steps). For such sequences, Transformers or state-space models (Mamba, S4) are a better choice.
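A simplified calculation hints at why very long sequences remain hard. Assuming the state update `h_t = (1 - z_t) * h_prev + z_t * h_tilde`, the gradient that flows along the direct state-to-state path is scaled by `(1 - z_t)` at every step. The sketch below multiplies only those factors. It deliberately ignores the contributions through the candidate state and through the gates themselves, so it is an illustration rather than the full gradient.

```python
def direct_path_gradient(update_gates):
    # Product of (1 - z_t) over all steps: the scaling of the gradient
    # along the direct h_{t-1} -> h_t path only (simplifying assumption).
    g = 1.0
    for z in update_gates:
        g *= (1.0 - z)
    return g
```

Even with a small constant update gate of 0.01, `direct_path_gradient([0.01] * 1000)` falls below 1e-4, so the signal from the first step is almost gone after 1000 steps.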
GRU was introduced as a simplified alternative to LSTM for neural machine translation (NMT) tasks.
Empirical studies have shown that GRU and LSTM perform comparably on most tasks.
GRU remains popular for on-device and real-time tasks.
GRU gate matrix operations are accelerated by cuBLAS on GPU. cuDNN provides optimized LSTM and GRU implementations with fused kernels.
For small GRU models (embedded NLP, IoT), a CPU is sufficient: because GRU computation is sequential anyway, running on a CPU penalizes it less than it penalizes a Transformer, which relies on massive parallelism.