Architecture

Bahdanau Attention

2014 · Historical · Published: 28 May 2026 · Updated: 28 May 2026

Key innovation

Introduction of a learnable soft-alignment mechanism in neural machine translation that removes the fixed-length context vector bottleneck of the encoder–decoder architecture.

How it works

At every decoder step t the mechanism performs three operations:

(1) Scoring: for each encoder hidden state h_j and the previous decoder state s_{t-1} it computes a scalar alignment score e_{t,j} = v^T · tanh(W_a · s_{t-1} + U_a · h_j), i.e. a small single-hidden-layer MLP.

(2) Normalisation: the scores are normalised via softmax into alignment weights α_{t,j}.

(3) Context: the context vector c_t = Σ_j α_{t,j} · h_j is fed to the decoder together with the previous token and state to produce the next token.

All parameters (W_a, U_a, v) are learned end-to-end together with the encoder and decoder.
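The three operations can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's code: the dimension names (d_enc, d_dec, d_att), the random parameters and the function name additive_attention are assumptions made for the example.

```python
import numpy as np

def additive_attention(s_prev, H, W_a, U_a, v):
    """One decoder step of additive (Bahdanau-style) attention.

    s_prev: (d_dec,)        previous decoder state s_{t-1}
    H:      (T, d_enc)      encoder hidden states h_1..h_T
    W_a:    (d_att, d_dec)  projection of the decoder state
    U_a:    (d_att, d_enc)  projection of the encoder states
    v:      (d_att,)        output vector of the scoring MLP
    """
    # (1) e_{t,j} = v^T tanh(W_a s_{t-1} + U_a h_j), computed for all j at once
    hidden = np.tanh(W_a @ s_prev + H @ U_a.T)   # (T, d_att)
    scores = hidden @ v                          # (T,)
    # (2) softmax over source positions (max subtracted for stability)
    exp = np.exp(scores - scores.max())
    weights = exp / exp.sum()                    # (T,)
    # (3) c_t = sum_j alpha_{t,j} h_j
    context = weights @ H                        # (d_enc,)
    return context, weights

# Hypothetical sizes and random parameters, for illustration only.
rng = np.random.default_rng(0)
T, d_enc, d_dec, d_att = 5, 8, 6, 4
H = rng.normal(size=(T, d_enc))
s_prev = rng.normal(size=d_dec)
W_a = rng.normal(size=(d_att, d_dec))
U_a = rng.normal(size=(d_att, d_enc))
v = rng.normal(size=d_att)

context, weights = additive_attention(s_prev, H, W_a, U_a, v)
print(weights.shape, context.shape)
```

In a trained model W_a, U_a and v would be learned parameters. Since U_a · h_j does not depend on the decoder step, it can be computed once per source sentence and reused at every step.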

Problem solved

In the standard RNN-based encoder–decoder architecture the entire source sentence is compressed into a single fixed-length vector, creating an information bottleneck — especially for long sentences — and causing translation quality to degrade sharply as input length grows.

Components

Alignment scoring network: computes attention energy/score.

Small feed-forward network with a single tanh hidden layer that produces a scalar alignment score for every (decoder state, encoder state) pair.


Softmax normalization: turns scores into attention weights.

Normalises the scores into a probability distribution over all source positions — the alignment weights α_{t,j}.
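A minimal sketch of this normalisation, assuming NumPy. The function name softmax and the example scores are illustrative. Scores produced by a tanh layer are bounded by the L1 norm of v, so overflow is unlikely here, but subtracting the maximum is a common precaution and does not change the result.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over the last axis."""
    shifted = scores - np.max(scores, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical, deliberately large scores: a naive exp() would overflow.
e_t = np.array([1000.0, 999.0, 998.0])
alpha_t = softmax(e_t)
print(alpha_t, alpha_t.sum())
```

Softmax is invariant to adding a constant to every score, which is why the shifted version gives the same weights as the unshifted definition.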

Context vector: dynamic source representation for the current decoder step.

Weighted sum of encoder hidden states, fed to the decoder as an additional input when generating the next token.

Bidirectional RNN encoder: produces the sequence of source hidden representations.

In the original paper the encoder is a bidirectional GRU; its hidden states h_j feed into the attention mechanism.
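The sketch below shows how a bidirectional encoder builds each h_j by concatenating a forward and a backward state. To keep it short it uses a plain tanh RNN cell instead of a GRU, which is a simplification; the function names, sizes and random weights are assumptions made for the example.

```python
import numpy as np

def rnn_pass(X, W, U, reverse=False):
    """Run a plain tanh RNN over X (T, d_in).

    Returns the states as a (T, d_h) array indexed by source position,
    regardless of the direction in which the sequence was read.
    """
    T, d_h = X.shape[0], U.shape[0]
    h = np.zeros(d_h)
    states = np.zeros((T, d_h))
    steps = range(T - 1, -1, -1) if reverse else range(T)
    for j in steps:
        h = np.tanh(W @ X[j] + U @ h)
        states[j] = h
    return states

def bidirectional_encode(X, fwd, bwd):
    """h_j = [forward state at j ; backward state at j]."""
    h_fwd = rnn_pass(X, *fwd)                  # reads x_1 .. x_T
    h_bwd = rnn_pass(X, *bwd, reverse=True)    # reads x_T .. x_1
    return np.concatenate([h_fwd, h_bwd], axis=1)   # (T, 2 * d_h)

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
T, d_in, d_h = 4, 3, 5
X = rng.normal(size=(T, d_in))
fwd = (rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h)))
bwd = (rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h)))

H = bidirectional_encode(X, fwd, bwd)
print(H.shape)
```

Each row of H summarises the words both before and after position j, which is what lets the attention weights point at a source word together with its surrounding context.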


Implementation

Reference implementations

TensorFlow Addons – BahdanauAttention

Python · TensorFlow

PyTorch tutorial — NMT with attention

Python · PyTorch

Implementation pitfalls

Slower inference than dot-product attention (severity: Medium)

The tanh MLP for each (decoder, encoder) pair is more expensive than the plain dot product used in Luong/Transformer attention.

Fix: If throughput matters, prefer dot-product (Luong) or scaled dot-product (Transformer) attention.
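To make the cost difference concrete, the following NumPy sketch puts the three score functions side by side. It is a simplified comparison under stated assumptions: the decoder and encoder states share one dimension d, learned query/key projections are omitted, and all names and sizes are illustrative. Luong-style attention also scores against the current decoder state rather than s_{t-1}, which the sketch glosses over by using a single query vector s.

```python
import numpy as np

def additive_scores(s, H, W_a, U_a, v):
    """Bahdanau-style: two projections, a tanh, then a reduction with v."""
    return np.tanh(W_a @ s + H @ U_a.T) @ v      # (T,)

def dot_scores(s, H):
    """Dot-product: one matrix-vector product, no extra parameters."""
    return H @ s                                 # (T,)

def scaled_dot_scores(s, H):
    """Scaled dot-product: dot scores divided by sqrt(d)."""
    return (H @ s) / np.sqrt(H.shape[1])         # (T,)

# Hypothetical sizes and random values, for illustration only.
rng = np.random.default_rng(0)
T, d, d_att = 6, 8, 4
H = rng.normal(size=(T, d))
s = rng.normal(size=d)
W_a = rng.normal(size=(d_att, d))
U_a = rng.normal(size=(d_att, d))
v = rng.normal(size=d_att)

e_add = additive_scores(s, H, W_a, U_a, v)
e_dot = dot_scores(s, H)
e_scaled = scaled_dot_scores(s, H)
print(e_add.shape, e_dot.shape, e_scaled.shape)
```

All three produce one score per source position. The dot-product variants reduce to a single matrix product, which batches efficiently on GPUs, while the additive form needs the extra tanh layer for every (decoder, encoder) pair.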

Sequential RNN decoder (severity: High)

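The mechanism is embedded in a recurrent decoder: each step depends on the state produced by the previous one, so steps cannot be parallelised across time, which limits GPU scaling.

Fix: Replacing the RNN with a Transformer (self-attention) removes this constraint during training, where all target positions are processed in parallel. Autoregressive generation at inference time still proceeds token by token.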

Evolution

Original paper · 2014 · ICLR 2015 (oral) · Dzmitry Bahdanau

Neural Machine Translation by Jointly Learning to Align and Translate

Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio

2014

arXiv 1409.0473 published

Inflection point

First version of the paper introducing attention in NMT.

2015

Oral presentation at ICLR 2015

The paper is accepted as an oral presentation at ICLR 2015, and the community adopts the idea rapidly.

2015

Luong Attention

Luong, Pham and Manning propose three alignment score functions (dot, general, concat) as a simplification and extension of Bahdanau Attention. The dot and general variants are multiplicative, while concat keeps an additive, tanh-based form.

Effective Approaches to Attention-based Neural Machine Translation (paper)

2017

'Attention Is All You Need' — Transformer

Inflection point

Vaswani et al. drop RNNs entirely and build the architecture around scaled dot-product attention, used both as self-attention and as encoder-decoder attention, a direct continuation of the line started by Bahdanau Attention.

Attention Is All You Need (paper)