Architecture

GRU (Gated Recurrent Unit)

2014 · Active · Updated: 23 June 2026 · Published

Key innovation

Simplified LSTM variant with only two gates (reset and update), achieving comparable performance with fewer parameters and faster training.

How it works

GRU has two gates: (1) the reset gate (r), which controls how much of the previous state is used when computing the new state candidate; (2) the update gate (z), which sets the mixing proportion between the old state and the new candidate. Unlike LSTM, GRU has no separate cell state.

Problem solved

LSTM mitigated the vanishing gradient problem, but at the cost of complexity (3 gates, 2 states). GRU simplifies the architecture to 2 gates and 1 state, retaining the ability to model long-range dependencies at lower computational cost.

Key mechanisms

Reset gate r_t = sigmoid(W_r · [h_{t-1}, x_t]) — controls how much of the previous state to ignore when computing the candidate

Update gate z_t = sigmoid(W_z · [h_{t-1}, x_t]) — interpolates the proportion between old state and new state candidate

Candidate hidden state h~_t = tanh(W · [r_t * h_{t-1}, x_t]) — proposed new state modulated by the reset gate

Linear interpolation h_t = (1 - z_t) * h_{t-1} + z_t * h~_t — final mix of old and new states

No separate cell state — a single hidden state combines long- and short-term memory

Backpropagation Through Time (BPTT) — the standard gradient-learning algorithm

Sigmoid + tanh activations: both are differentiable; sigmoid keeps gate values in (0, 1), tanh keeps the candidate state in (-1, 1)
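The mechanisms listed above can be combined into a single time step. The following is a minimal sketch in plain Python, not a reference implementation: biases are left out, as in the formulas in this list, and the helper names (`matvec`, `random_matrix`), the sizes, and the random weights are assumptions made only for illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def gru_step(x_t, h_prev, W_r, W_z, W_h):
    """One GRU time step following the listed equations (biases omitted)."""
    concat = h_prev + x_t                                  # [h_{t-1}, x_t]
    r = [sigmoid(a) for a in matvec(W_r, concat)]          # reset gate r_t
    z = [sigmoid(a) for a in matvec(W_z, concat)]          # update gate z_t
    gated = [ri * hi for ri, hi in zip(r, h_prev)] + x_t   # [r_t * h_{t-1}, x_t]
    h_cand = [math.tanh(a) for a in matvec(W_h, gated)]    # candidate state
    # Linear interpolation between the old state and the candidate.
    return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, h_cand)]


def random_matrix(rows, cols, rng, scale=0.1):
    """Hypothetical initializer: small uniform weights, for illustration only."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
input_size, hidden_size = 3, 4
W_r = random_matrix(hidden_size, hidden_size + input_size, rng)
W_z = random_matrix(hidden_size, hidden_size + input_size, rng)
W_h = random_matrix(hidden_size, hidden_size + input_size, rng)

# Run the cell over a short toy sequence, carrying the hidden state forward.
h = [0.0] * hidden_size
for x_t in [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5], [0.0, 1.0, 0.0]]:
    h = gru_step(x_t, h, W_r, W_z, W_h)
```

Because each new state is a convex combination of the previous state and a tanh output, every component of `h` stays strictly between -1 and 1 when the initial state is zero.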

Strengths & limitations

Strengths

✓ Simpler than LSTM: 2 gates vs 3, one state vs two

✓ Fewer parameters (~25% fewer than LSTM for the same hidden size)

✓ Faster training: fewer operations per time step

✓ Performance comparable to LSTM on most sequence tasks (Chung et al., 2014)

✓ More resistant to the vanishing-gradient problem than a vanilla RNN, thanks to the near-identity path through the update gate

✓ Good choice for small datasets (fewer parameters to fit)

✓ Easy to deploy on edge devices / microcontrollers (TFLite Micro, ONNX Runtime)
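The "~25% fewer parameters" figure follows from counting weight blocks: GRU has three (reset, update, candidate) where LSTM has four (input, forget, output, cell candidate). The sketch below assumes one weight matrix and one bias vector per block; some framework implementations store extra bias vectors, so their exact counts may differ slightly. The function names and the example sizes are illustrative.

```python
def gated_rnn_params(input_size, hidden_size, num_blocks):
    """Parameter count for a gated RNN layer with `num_blocks` weight blocks.

    Each block is assumed to hold a (hidden x (hidden + input)) matrix and one bias.
    """
    per_block = hidden_size * (hidden_size + input_size) + hidden_size
    return num_blocks * per_block


def gru_params(input_size, hidden_size):
    return gated_rnn_params(input_size, hidden_size, 3)   # reset, update, candidate


def lstm_params(input_size, hidden_size):
    return gated_rnn_params(input_size, hidden_size, 4)   # input, forget, output, candidate


gru = gru_params(128, 256)
lstm = lstm_params(128, 256)
saving = 1 - gru / lstm
print(f"GRU: {gru} params, LSTM: {lstm} params, saving: {saving:.0%}")
```

Under these assumptions the ratio is exactly 3/4 regardless of the input and hidden sizes.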

Limitations

✗ Sequential processing: the time steps of a sequence cannot be computed in parallel on GPU (unlike the Transformer)

✗ Difficulty with very long dependencies (>1000 steps): memory degrades despite gates

✗ Lack of interpretability: gates act as black boxes, so it is hard to explain why the model remembers a given piece of information

✗ Superseded by Transformer in mainstream NLP since 2017: most foundation models use attention, not recurrence

✗ Scales poorly to very large models (billions of parameters)

✗ Worse performance than LSTM on some tasks (mainly those with very long dependencies and large datasets)

✗ Fewer public pretrained checkpoints than for Transformers (HuggingFace Hub)

Components

Reset Gate

Selective short-term memory filter: lets the model 'start fresh' at the right point in the sequence.

Gate deciding how much the previous hidden state h_{t-1} should influence the computation of the new state candidate h~_t. A value close to 0 = ignore the past; close to 1 = take all of the past into account. Formula: r_t = sigmoid(W_r · [h_{t-1}, x_t] + b_r).

Update Gate

Main long-term memory mechanism: analog of the combined forget + input gate in LSTM.

Gate interpolating between the old state h_{t-1} and the candidate h~_t: h_t = (1 - z_t) * h_{t-1} + z_t * h~_t. A value close to 0 = keep the old state (long-term memory); close to 1 = adopt the new state. Formula: z_t = sigmoid(W_z · [h_{t-1}, x_t] + b_z).
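To see why a small update gate acts as long-term memory, consider a toy scalar case. The sketch below makes two simplifying assumptions that do not hold in a trained network: the update gate is held constant and the candidate is fixed at 0. Under those assumptions, the state after T steps is (1 - z)^T times the initial state, which is also the derivative of h_T with respect to h_0.

```python
def retained_fraction(z, steps):
    """Fraction of h_0 left after `steps` updates with a constant update gate z.

    With the candidate held at 0, each step computes h_t = (1 - z) * h_{t-1},
    so the result equals (1 - z) ** steps.
    """
    h = 1.0
    for _ in range(steps):
        candidate = 0.0
        h = (1 - z) * h + z * candidate
    return h


for z in (0.001, 0.01, 0.5):
    kept = retained_fraction(z, 100)
    print(f"z={z}: fraction of h_0 kept after 100 steps = {kept:.4g}")
```

A gate near 0 keeps most of the old state over 100 steps, while a gate of 0.5 erases it almost completely. The same geometric decay shows why memory still fades over very long spans unless the gate stays very close to 0.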

Candidate Hidden State

Intermediate representation combining fresh input with modulated history: the value that the update gate blends into the hidden state.

Proposed new hidden state, computed as h~_t = tanh(W · [r_t * h_{t-1}, x_t] + b). It contains the 'new' information extracted from the current input x_t, combined with the part of the past that the reset gate lets through.
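The effect of the reset gate on the candidate can be checked with a scalar version of the formula. This is a sketch with hypothetical weights (`w_h`, `w_x`) and inputs chosen only for illustration: with r_t = 0 the candidate depends on the input alone, and with r_t = 1 the full previous state contributes.

```python
import math


def candidate(x_t, h_prev, r_t, w_h, w_x, b=0.0):
    """Scalar candidate state: tanh(w_h * (r_t * h_prev) + w_x * x_t + b)."""
    return math.tanh(w_h * (r_t * h_prev) + w_x * x_t + b)


x_t, h_prev = 0.5, 0.9
w_h, w_x = 1.5, 1.0   # hypothetical weights, not taken from any trained model

fresh = candidate(x_t, h_prev, r_t=0.0, w_h=w_h, w_x=w_x)   # past ignored
full = candidate(x_t, h_prev, r_t=1.0, w_h=w_h, w_x=w_x)    # past fully used
input_only = math.tanh(w_x * x_t)

print(f"r=0: {fresh:.4f}  r=1: {full:.4f}  input only: {input_only:.4f}")
```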

Implementation

Implementation pitfalls

Sequential processing prevents full parallelism (severity: High)

GRU step t depends on the state from step t-1, so the entire sequence cannot be computed in parallel on GPU. For long sequences (T>1000), training is significantly slower than Transformer, despite fewer parameters.

Fix: Use the cuDNN GRU implementation (optimized for multi-layer GRU + batch parallelism), train on shorter sequences with truncated BPTT, or move from GRU to Transformer/Mamba for long sequences.
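Truncated BPTT, mentioned in the fix above, starts by cutting a long sequence into fixed-length windows. The sketch below shows only that windowing step; the function name `tbptt_windows` and the window size are assumptions for illustration, and the training loop itself is omitted.

```python
def tbptt_windows(sequence, window):
    """Yield consecutive, non-overlapping windows of at most `window` steps."""
    if window <= 0:
        raise ValueError("window must be positive")
    for start in range(0, len(sequence), window):
        yield sequence[start:start + window]


sequence = list(range(10))   # stand-in for 10 time steps of input
windows = list(tbptt_windows(sequence, window=4))

# In training, the hidden state would be carried from one window to the next,
# while gradients would be propagated only within each window.
for w in windows:
    print(w)
```

The last window is shorter when the sequence length is not a multiple of the window size; whether to pad or drop it is a design choice left open here.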

Difficulty with very long dependencies (>1000 steps) (severity: Medium)

Despite its selective-memory gates, GRU degrades on very long dependencies. The stored information fades over successive time steps and the model 'forgets' older context. A typical practical limit is 100-500 steps of effective dependencies.

Fix: Hierarchical RNN (several layers at different time scales), attention over recent hidden states (Bahdanau attention), or migration to architectures with linear complexity (Mamba, Linear Attention) for very long sequences.

Exploding gradients in deep/long GRU networks (severity: Medium)

While GRU handles vanishing gradients better than a vanilla RNN, the gradient can still explode in very deep (multi-layer) networks or on very long sequences.

Fix: Gradient clipping (standard: clip-by-norm with threshold 1.0-5.0), layer normalization (LayerNorm on each GRU layer), Xavier/Glorot initialization.
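Clip-by-norm rescales all gradients together when their joint L2 norm exceeds a threshold, which preserves the gradient direction. The following is a minimal sketch over plain Python lists; the function name `clip_by_global_norm` and the toy gradient values are assumptions for illustration, and real frameworks provide their own equivalents.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Scale a list of gradient vectors so their joint L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total = math.sqrt(sum(g * g for vec in grads for g in vec))
    if total <= max_norm or total == 0.0:
        return grads, total
    scale = max_norm / total
    return [[g * scale for g in vec] for vec in grads], total


grads = [[3.0, 4.0], [12.0]]   # toy gradients for two parameter groups
clipped, norm_before = clip_by_global_norm(grads, max_norm=5.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(f"norm before: {norm_before:.2f}, norm after: {norm_after:.2f}")
```

Gradients whose norm is already under the threshold pass through unchanged.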

Evolution

Original paper · 2014 · Kyunghyun Cho

Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, Yoshua Bengio

2014

Introduction of GRU by Cho et al.

Inflection point

The paper 'Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation' (arXiv 1406.1078) introduces GRU as a simplified variant of LSTM for machine translation. It is the first publication to use an encoder-decoder with gated recurrent units.

2014

Empirical comparison of LSTM vs GRU (Chung et al.)

Chung, Gulcehre, Cho, and Bengio publish 'Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling' (arXiv 1412.3555). They show that GRU and LSTM achieve comparable performance on polyphonic music and speech signal modeling, with GRU slightly faster to train. The paper becomes the standard reference when choosing between GRU and LSTM.

2017

Transformer supersedes RNN/GRU in mainstream NLP

Inflection point

The 'Attention Is All You Need' paper (Vaswani et al., NeurIPS 2017) introduces the Transformer, which dramatically outperforms RNN/GRU in scaling thanks to parallel sequence processing on GPU. From this point GRU is gradually displaced in NLP, though it remains relevant in edge AI and streaming applications.

2023

Mamba and SSM — revived interest in recurrence

Mamba (Gu & Dao, 2023) and other State-Space Models show that gated recurrent models with linear complexity in sequence length can rival the Transformer on long contexts. While technically not GRUs, they inherit the idea of selective, gated memory. GRU gains new historical value as an early, widely deployed demonstration that gated recurrence works.

Compute bottleneck

Sequential execution of time steps

GRU's main bottleneck is inherent sequentiality: step t depends on the state from step t-1, so time steps cannot be computed in parallel. A sequence of length T requires T dependent matrix operations of O(d²) each. This contrasts with the Transformer, which processes the whole sequence in parallel at the cost of O(T² · d) attention. For short sequences (T<100) GRU can be faster, but for long sequences (T>1000) the Transformer usually wins in wall-clock time on parallel hardware despite its quadratic complexity in T, because its work is not serialized across time steps.
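The trade-off can be made concrete with a rough cost model. The sketch below uses only the asymptotic terms quoted in this section, ignores all constants, projections, and feed-forward layers, and treats one attention layer as a single dependent step; the function names and the choice d = 512 are illustrative assumptions, so the numbers indicate orders of magnitude only.

```python
def gru_cost(T, d):
    """Rough cost of one GRU layer: T steps of O(d^2) work, executed one after another."""
    return {"total_ops": T * d * d, "sequential_steps": T}


def attention_cost(T, d):
    """Rough cost of one self-attention layer: O(T^2 * d) work, parallel across positions."""
    return {"total_ops": T * T * d, "sequential_steps": 1}


d = 512
for T in (64, 512, 4096):
    g, a = gru_cost(T, d), attention_cost(T, d)
    print(
        f"T={T}: GRU ops={g['total_ops']:.2e} in {g['sequential_steps']} dependent steps; "
        f"attention ops={a['total_ops']:.2e} in {a['sequential_steps']} dependent step"
    )
```

In this model the GRU does less total work once T exceeds d, yet it still needs T dependent steps. That chain of dependent steps, not the operation count, is what limits GRU throughput on parallel hardware.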

Execution paradigm

Primary mode: Conditional

Activation pattern: Input dependent

Parallelism

Parallelism level: Sequential

Scope: Training, Inference

Hardware requirements

GPU: Good fit

GRU maps well to GPU through batching (parallelism along batch dim) and cuDNN-optimized kernels for multi-layer GRU. But the sequence must be processed sequentially along the T axis, limiting speedup vs Transformer.

CPU: Good fit

GRU works well on CPU for small models (especially for a single batch during streaming inference). Less dependent on parallelism than Transformer.

Edge / microcontrollers: Excellent fit

Small GRUs (a few dozen to a few hundred hidden neurons) run even on ARM Cortex-M microcontrollers via TFLite Micro. Few parameters + no attention make GRU a natural choice for edge AI in streaming signal processing (speech, IoT sensors).
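A back-of-the-envelope estimate shows why such models fit on microcontrollers. The sketch below counts only the weights of a single GRU layer, assuming one weight matrix and one bias per block. The input size of 16 and the int8 storage option are assumptions added for illustration, and activations, runtime overhead, and any output layer are ignored.

```python
def gru_weight_bytes(input_size, hidden_size, bytes_per_param):
    """Weight storage for one GRU layer: 3 blocks of (hidden x (hidden + input)) plus bias."""
    params = 3 * (hidden_size * (hidden_size + input_size) + hidden_size)
    return params * bytes_per_param


input_size = 16   # hypothetical feature count, e.g. a small sensor or audio frame
for hidden_size in (32, 64, 128):
    fp32_kib = gru_weight_bytes(input_size, hidden_size, 4) / 1024
    int8_kib = gru_weight_bytes(input_size, hidden_size, 1) / 1024
    print(f"hidden={hidden_size}: {fp32_kib:.1f} KiB as float32, {int8_kib:.1f} KiB as int8")
```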

Sources

Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation (Cho et al., 2014)

Paper

arXiv / EMNLP 2014

Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling (Chung et al., 2014)

Paper

arXiv / NIPS 2014 Workshop

Long Short-Term Memory (Hochreiter & Schmidhuber, 1997) — LSTM foundation, to which GRU is an alternative

Paper

Neural Computation 9(8)

Attention Is All You Need (Vaswani et al., 2017) — Transformer supersedes GRU/LSTM in mainstream NLP

Paper

arXiv / NeurIPS 2017

Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Gu & Dao, 2023) — revival of gated recurrence

Paper

arXiv