Architecture

GRU (Gated Recurrent Unit)

2014 · Active · Updated: 23 June 2026 · Published
Key innovation
Simplified LSTM variant with only two gates (reset and update), achieving comparable performance with fewer parameters and faster training.
Category
Architecture
Abstraction level
Building block
Operation level
Architecture block · Layer · Training · Inference
Use cases
Language models on edge devices (fewer parameters)
Time series and sensor data
Speech and audio processing
Sequence classification (sentiment, intent)
Baseline models before the transformer era

How it works

GRU has two gates: (1) the reset gate (r), which controls how much of the previous state enters the computation of the new candidate state; (2) the update gate (z), which sets the mix between the old state and the new candidate. Unlike LSTM, GRU has no separate cell state.

Problem solved

LSTM mitigated the vanishing gradient problem, but at the cost of complexity (3 gates, 2 states). GRU simplifies the architecture to 2 gates and 1 state while retaining the ability to model long-range dependencies at lower computational cost.

Key mechanisms

Reset gate r_t = sigmoid(W_r · [h_{t-1}, x_t]) — controls how much of the previous state to ignore when computing the candidate
Update gate z_t = sigmoid(W_z · [h_{t-1}, x_t]) — interpolates the proportion between old state and new state candidate
Candidate hidden state h~_t = tanh(W · [r_t * h_{t-1}, x_t]) — proposed new state modulated by the reset gate
Linear interpolation h_t = (1 - z_t) * h_{t-1} + z_t * h~_t — final mix of old and new states
No separate cell state — a single hidden state combines long- and short-term memory
Backpropagation Through Time (BPTT) — the standard gradient-learning algorithm
Sigmoid + tanh activations — differentiable gates in [0, 1] for sigmoid and [-1, 1] for tanh
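
The equations above can be sketched as a single time step in plain Python. This is a minimal illustration rather than a reference implementation: biases are omitted as in the list above, the weights are random placeholders, and the helper names (`gru_step`, `matvec`, `random_matrix`) and the sizes are assumptions made for this example.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def gru_step(x_t, h_prev, W_r, W_z, W_h):
    """One GRU time step following the equations listed above (biases omitted)."""
    concat = h_prev + x_t                                  # [h_{t-1}, x_t]
    r = [sigmoid(a) for a in matvec(W_r, concat)]          # reset gate
    z = [sigmoid(a) for a in matvec(W_z, concat)]          # update gate
    gated = [ri * hi for ri, hi in zip(r, h_prev)] + x_t   # [r_t * h_{t-1}, x_t]
    h_cand = [math.tanh(a) for a in matvec(W_h, gated)]    # candidate state
    # Linear interpolation between the old state and the candidate.
    return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, h_cand)]


def random_matrix(rows, cols, rng, scale=0.1):
    """Hypothetical initializer: small uniform random weights."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
input_size, hidden_size = 3, 4
W_r, W_z, W_h = (
    random_matrix(hidden_size, hidden_size + input_size, rng) for _ in range(3)
)

# Run the cell over a short random sequence, carrying the state forward.
h = [0.0] * hidden_size
sequence = [[rng.uniform(-1, 1) for _ in range(input_size)] for _ in range(5)]
for x_t in sequence:
    h = gru_step(x_t, h, W_r, W_z, W_h)
```

Because each new state is a weighted average of the previous state and a tanh output, every component of `h` stays inside (-1, 1) when the state starts at zero.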

Strengths & limitations

Strengths
Simpler than LSTM — 2 gates vs 3, one state vs two
Fewer parameters (~25% fewer than LSTM for the same hidden size)
Faster training — fewer operations per time step
Performance comparable to LSTM on most sequence tasks (Chung et al. 2014)
Resistant to the vanishing-gradient problem thanks to the identity path through the update gate
Good choice for small datasets (fewer parameters to fit)
Easy to deploy on edge / microcontrollers (TFLite Micro, ONNX Runtime)
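
The roughly 25% parameter saving listed above follows from counting gate-like blocks: GRU has three weight blocks acting on [h_{t-1}, x_t], LSTM has four. The sketch below assumes one bias vector per block and a single layer. Frameworks that keep two bias vectors per block (PyTorch does) report slightly different totals, but the ratio is unchanged. The function names are made up for this example.

```python
def gated_rnn_params(input_size, hidden_size, num_blocks):
    """Weights plus one bias vector per block acting on [h_{t-1}, x_t]."""
    per_block = hidden_size * (hidden_size + input_size) + hidden_size
    return num_blocks * per_block


def gru_params(input_size, hidden_size):
    # Blocks: reset gate, update gate, candidate state.
    return gated_rnn_params(input_size, hidden_size, num_blocks=3)


def lstm_params(input_size, hidden_size):
    # Blocks: input gate, forget gate, output gate, candidate cell state.
    return gated_rnn_params(input_size, hidden_size, num_blocks=4)


gru = gru_params(input_size=128, hidden_size=256)
lstm = lstm_params(input_size=128, hidden_size=256)
saving = 1 - gru / lstm
print(gru, lstm, f"{saving:.0%}")  # prints: 295680 394240 25%
```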
Limitations
Sequential processing — impossible to parallelize a whole sequence on GPU (vs Transformer)
Difficulty with very long dependencies (>1000 steps) — memory degrades despite gates
Lack of interpretability — gates act as black boxes, hard to explain why the model remembers a given piece of info
Superseded by Transformer in mainstream NLP since 2017 — most foundation models use attention, not recurrence
Scales poorly to billions of parameters (sequential time steps make very large RNN/GRU models slow to train)
Worse performance than LSTM on some tasks (mainly those with very long dependencies and large datasets)
Fewer public pretrained checkpoints than for Transformers (HuggingFace Hub)

Components

Reset Gate
Selective short-term memory filter: lets the model 'start fresh' at the right point in the sequence.

Gate that decides how much the previous hidden state h_{t-1} influences the computation of the candidate state h~_t. A value close to 0 means the past is ignored; a value close to 1 means the full previous state is used. Computed as r_t = sigmoid(W_r · [h_{t-1}, x_t] + b_r).
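
A scalar toy case makes the 'start fresh' behaviour concrete. The sketch assumes fixed, arbitrary weights (`w_h`, `w_x`) and sets the reset gate value by hand instead of computing it from a sigmoid.

```python
import math


def candidate(x_t, h_prev, r_t, w_h=1.5, w_x=0.8):
    """Scalar candidate state: tanh(w_h * (r_t * h_prev) + w_x * x_t)."""
    return math.tanh(w_h * (r_t * h_prev) + w_x * x_t)


# With r_t = 0 the candidate no longer depends on the previous state.
fresh_a = candidate(x_t=0.5, h_prev=0.9, r_t=0.0)
fresh_b = candidate(x_t=0.5, h_prev=-0.9, r_t=0.0)

# With r_t = 1 the full previous state enters the candidate.
full_a = candidate(x_t=0.5, h_prev=0.9, r_t=1.0)
full_b = candidate(x_t=0.5, h_prev=-0.9, r_t=1.0)
```

Here `fresh_a` and `fresh_b` coincide even though the two histories are opposite, while `full_a` and `full_b` differ.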

Update Gate
Main long-term memory mechanism: analog of the combined forget + input gate in LSTM.

Gate that interpolates between the old state h_{t-1} and the candidate h~_t: h_t = (1 - z_t) * h_{t-1} + z_t * h~_t. A value close to 0 keeps the old state (long-term memory); a value close to 1 adopts the new state. Computed as z_t = sigmoid(W_z · [h_{t-1}, x_t] + b_z).
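
To see why small update-gate values act as long-term memory, consider the simplified case where z_t is held constant at a scalar z. The weight of the initial state h_0 in h_T is then (1 - z)^T. The sketch below checks that closed form against the recurrence. The constant gate is an assumption made for illustration, since a trained GRU varies z_t with the input.

```python
def retained_fraction(z, steps):
    """Weight of the initial state h_0 in h_T when z_t is held constant at z."""
    return (1.0 - z) ** steps


def run_interpolation(h0, candidates, z):
    """Apply h_t = (1 - z) * h_{t-1} + z * candidate for each candidate."""
    h = h0
    for c in candidates:
        h = (1 - z) * h + z * c
    return h


# With all candidates at zero, the remaining state is exactly the retained part of h_0.
for z in (0.5, 0.1, 0.01):
    by_recurrence = run_interpolation(1.0, [0.0] * 100, z)
    by_formula = retained_fraction(z, 100)
    print(z, by_recurrence, by_formula)
```

The smaller z is, the more slowly the old state fades, which matches the reading of the gate given above.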

Candidate Hidden State
Intermediate representation combining fresh input with modulated history: the value that the update gate blends into the hidden state.

Proposed new hidden state, computed as h~_t = tanh(W · [r_t * h_{t-1}, x_t] + b). It combines new information from the current input x_t with the part of the past that the reset gate lets through.

Implementation

Implementation pitfalls
Sequential processing prevents full parallelism · Severity: High

GRU step t depends on the state from step t-1, so the time steps of a sequence cannot be computed in parallel on GPU. For long sequences (T>1000), training is significantly slower than with a Transformer, despite the smaller parameter count.

Fix: Use the cuDNN GRU implementation (optimized for multi-layer GRU and batch parallelism), train on shorter subsequences with truncated BPTT, or move from GRU to Transformer/Mamba for long sequences.
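
The chunking half of truncated BPTT can be sketched without any framework. The code below only splits a long sequence into windows and carries the state across them. The step that actually truncates the gradient (detaching the state from the autodiff graph at each window boundary) depends on the framework and is only marked by a comment. `process_chunk` is a hypothetical callback standing in for a forward/backward pass.

```python
def tbptt_chunks(sequence, chunk_len):
    """Yield consecutive windows of at most chunk_len time steps."""
    if chunk_len <= 0:
        raise ValueError("chunk_len must be positive")
    for start in range(0, len(sequence), chunk_len):
        yield sequence[start:start + chunk_len]


def run_with_tbptt(sequence, chunk_len, process_chunk, initial_state):
    """Carry the hidden state across chunks.

    process_chunk is a hypothetical callback: (chunk, state) -> new state.
    """
    state = initial_state
    for chunk in tbptt_chunks(sequence, chunk_len):
        # In an autodiff framework, the state would be detached from the
        # graph here so gradients do not flow past the chunk boundary.
        state = process_chunk(chunk, state)
    return state


# Toy usage: the "state" is a running sum over a 10-step sequence.
final_state = run_with_tbptt(list(range(10)), 4, lambda chunk, s: s + sum(chunk), 0)
```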
Difficulty with very long dependencies (>1000 steps) · Severity: Medium

Despite its selective-memory gates, GRU degrades on very long dependencies: stored information decays over successive time steps and the model 'forgets' older context. In practice, effective dependencies often span roughly 100-500 steps.

Fix: Use a hierarchical RNN (several layers at different time scales), add attention over recent hidden states (Bahdanau attention), or migrate to architectures with linear complexity (Mamba, Linear Attention) for very long sequences.
Exploding gradients in deep/long GRU networks · Severity: Medium

While GRU handles vanishing gradients better than a vanilla RNN, the gradient can still explode in very deep (multi-layer) networks or on very long sequences.

Fix: Gradient clipping (commonly clip-by-norm with a threshold of 1.0-5.0), layer normalization (LayerNorm on each GRU layer), Xavier/Glorot initialization.
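
Clip-by-norm, as mentioned in the fix above, rescales all gradients by a common factor when their joint L2 norm exceeds the threshold. The following is a minimal sketch on plain Python lists. The function name and the list-of-lists gradient layout are assumptions for this example, and deep learning frameworks ship their own equivalents.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale gradient vectors so their joint L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total = math.sqrt(sum(g * g for vec in grads for g in vec))
    if total <= max_norm:
        return [list(vec) for vec in grads], total
    scale = max_norm / total
    return [[g * scale for g in vec] for vec in grads], total


# A gradient with norm 5.0 is scaled down to norm 1.0; its direction is kept.
clipped, norm_before = clip_by_global_norm([[3.0, 4.0]], max_norm=1.0)

# A gradient already under the threshold is returned unchanged.
untouched, _ = clip_by_global_norm([[0.3, 0.4]], max_norm=1.0)
```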

Evolution

Original paper · 2014 · Kyunghyun Cho
Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation
Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, Yoshua Bengio
2014
Introduction of GRU by Cho et al.
Inflection point

The paper 'Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation' (arXiv 1406.1078) introduces GRU as a simplified variant of LSTM for machine translation. It is the first publication to use an encoder-decoder with gated recurrent units.

2014
Empirical comparison of LSTM vs GRU (Chung et al.)

Chung, Gulcehre, Cho, and Bengio publish 'Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling' (arXiv 1412.3555). They show that GRU and LSTM achieve comparable performance on polyphonic music modeling and speech signal modeling, with GRU converging faster in some of the experiments. The paper becomes the standard reference when choosing between GRU and LSTM.

2017
Transformer supersedes RNN/GRU in mainstream NLP
Inflection point

The 'Attention Is All You Need' paper (Vaswani et al., NeurIPS 2017) introduces the Transformer, which scales far better than RNN/GRU thanks to parallel sequence processing on GPU. From this point GRU is gradually displaced in NLP, though it remains relevant in edge AI and streaming applications.

2023
Mamba and SSM — revived interest in recurrence

Mamba (Gu & Dao, 2023) and other State-Space Models show that gated recurrent models with linear complexity in sequence length can rival the Transformer on long contexts. While technically not GRUs, they inherit the idea of selective, gated memory. GRU gains new historical value as an early, widely deployed demonstration that gated recurrence works.

Compute bottleneck

Sequential execution of time steps

GRU's main bottleneck is inherent sequentiality: step t depends on the state from step t-1, so steps cannot be parallelized. A sequence of length T requires T sequential matrix operations, each O(d²). This contrasts with the Transformer, which processes the whole sequence in parallel at the cost of O(T² · d) attention. For short sequences (T<100) GRU can be faster, but for long sequences (T>1000) the Transformer typically trains faster on parallel hardware despite its quadratic complexity in T.
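
The trade-off can be made concrete with a back-of-the-envelope comparison that uses only the asymptotic terms quoted above. This is a rough sketch: constant factors, projection and feed-forward layers, and hardware effects are all ignored, and the function names are invented for this example.

```python
def recurrent_cost(seq_len, hidden_size):
    """T sequential steps, each on the order of d^2 operations."""
    return {"ops": seq_len * hidden_size ** 2, "sequential_steps": seq_len}


def attention_cost(seq_len, hidden_size):
    """On the order of T^2 * d operations, all computable in one parallel step."""
    return {"ops": seq_len ** 2 * hidden_size, "sequential_steps": 1}


d = 256
for T in (100, 1000, 10000):
    rec = recurrent_cost(T, d)
    att = attention_cost(T, d)
    print(T, rec, att)
```

Under these assumptions the attention term exceeds the recurrent term in raw operation count once T > d, yet its sequential depth stays at 1 while the recurrent depth grows with T. That depth, not the operation count, is what limits GRU on parallel hardware.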

Execution paradigm

Primary mode
Conditional
Activation pattern
Input dependent

Parallelism

Parallelism level
Sequential
Scope
Training · Inference

Hardware requirements

Good fit

GRU maps well to GPU through batching (parallelism along batch dim) and cuDNN-optimized kernels for multi-layer GRU. But the sequence must be processed sequentially along the T axis, limiting speedup vs Transformer.

Good fit

GRU works well on CPU for small models (especially for a single batch during streaming inference). Less dependent on parallelism than Transformer.

Excellent fit

Small GRUs (a few dozen to a few hundred hidden neurons) run even on ARM Cortex-M microcontrollers via TFLite Micro. Few parameters + no attention make GRU a natural choice for edge AI in streaming signal processing (speech, IoT sensors).