At every decoder step t the mechanism performs three operations: (1) for each encoder hidden state h_j and the previous decoder state s_{t-1} it computes a scalar alignment score e_{t,j} = v^T · tanh(W_a · s_{t-1} + U_a · h_j) — a small single-hidden-layer MLP; (2) scores are normalised via softmax into alignment weights α_{t,j}; (3) the context vector c_t = Σ_j α_{t,j} · h_j is fed to the decoder together with the previous token and state to produce the next token. All parameters (W_a, U_a, v) are learned end-to-end together with the encoder and decoder.
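The three operations above can be sketched in NumPy for a single decoder step. This is a minimal illustration, not the paper's implementation: the function names, the dimension names (`d_s`, `d_h`, `d_a`) and the random toy inputs are assumptions made for the example.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def additive_attention(s_prev, H, W_a, U_a, v):
    """One attention step.

    s_prev: (d_s,)      previous decoder state s_{t-1}
    H:      (T_x, d_h)  encoder hidden states h_j, one row per source position
    W_a:    (d_a, d_s), U_a: (d_a, d_h), v: (d_a,)
    Returns the alignment weights (T_x,) and the context vector (d_h,).
    """
    # (1) e_{t,j} = v^T tanh(W_a s_{t-1} + U_a h_j), computed for all j at once
    hidden = np.tanh(W_a @ s_prev + H @ U_a.T)  # (T_x, d_a)
    e = hidden @ v                              # (T_x,)
    # (2) normalise the scores into alignment weights alpha_{t,j}
    alpha = softmax(e)
    # (3) context vector c_t = sum_j alpha_{t,j} h_j
    c = alpha @ H                               # (d_h,)
    return alpha, c

# Toy inputs (assumed sizes, random values)
rng = np.random.default_rng(0)
T_x, d_s, d_h, d_a = 5, 4, 6, 3
H = rng.normal(size=(T_x, d_h))
s_prev = rng.normal(size=d_s)
W_a = rng.normal(size=(d_a, d_s))
U_a = rng.normal(size=(d_a, d_h))
v = rng.normal(size=d_a)

alpha, c = additive_attention(s_prev, H, W_a, U_a, v)
```

In a trained model `W_a`, `U_a` and `v` would be learned jointly with the encoder and decoder rather than sampled at random.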
In the standard RNN-based encoder–decoder architecture the entire source sentence is compressed into a single fixed-length vector, creating an information bottleneck — especially for long sentences — and causing translation quality to degrade sharply as input length grows.
Small feed-forward network with a single tanh hidden layer that produces a scalar alignment score for every (decoder state, encoder state) pair.
Normalises the scores into a probability distribution over all source positions — the alignment weights α_{t,j}.
Weighted sum of encoder hidden states, fed to the decoder as an additional input when generating the next token.
In the original paper the encoder is a bidirectional GRU; its hidden states h_j feed into the attention mechanism.
The tanh MLP for each (decoder, encoder) pair is more expensive than the plain dot product used in Luong/Transformer attention.
The mechanism is embedded in a recurrent decoder — steps cannot be parallelised in time, limiting GPU scaling.
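To illustrate the scoring-cost limitation, the sketch below places the additive score next to a plain and a scaled dot-product score. The function names and the `query` argument are assumptions for illustration; the dot-product variants additionally assume that the query and the encoder states have the same dimensionality.

```python
import numpy as np

def additive_scores(s_prev, H, W_a, U_a, v):
    """Additive scores: two projections, a tanh, and a dot product with v."""
    return np.tanh(W_a @ s_prev + H @ U_a.T) @ v  # (T_x,)

def dot_scores(query, H):
    """Plain dot-product scores: a single matrix-vector product."""
    return H @ query                              # (T_x,)

def scaled_dot_scores(query, H):
    """Dot-product scores divided by sqrt(d), as in scaled dot-product attention."""
    return (H @ query) / np.sqrt(H.shape[-1])     # (T_x,)

# Toy inputs (assumed sizes, random values)
rng = np.random.default_rng(1)
T_x, d, d_a = 4, 8, 5
H = rng.normal(size=(T_x, d))
q = rng.normal(size=d)
W_a = rng.normal(size=(d_a, d))
U_a = rng.normal(size=(d_a, d))
v = rng.normal(size=d_a)

e_add = additive_scores(q, H, W_a, U_a, v)
e_dot = dot_scores(q, H)
e_scaled = scaled_dot_scores(q, H)
```

The dot-product variants need no extra parameters and no nonlinearity, which is where the cost difference comes from.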
First version of the paper introducing attention in NMT.
The paper is accepted as an oral presentation at ICLR 2015, and the community adopts the idea rapidly.
Luong, Pham and Manning propose three scoring functions (dot, general, concat), of which dot and general are multiplicative, as a simplification and extension of Bahdanau Attention.
Vaswani et al. drop RNNs entirely and build the architecture on scaled dot-product attention (self-attention plus encoder-decoder attention), a direct continuation of the line started by Bahdanau Attention.
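Time complexity: O(T_x · T_y · d) for scoring all (decoder step, source position) pairs, where T_x and T_y are the source and target lengths and d is the hidden size. Space complexity: O(T_x · T_y) for the alignment weights.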
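A rough multiply-add count makes the complexity figures concrete. The breakdown below is an assumption-laden sketch: it assumes the encoder projection U_a · h_j is computed once per source position and reused at every decoder step, it ignores the tanh, the softmax and the rest of the decoder, and the function name and example sizes are hypothetical.

```python
def attention_cost(T_x, T_y, d_s, d_h, d_a):
    """Rough multiply-add counts for additive attention over one sentence pair."""
    # Projections that do not depend on the (t, j) pair:
    encoder_proj = T_x * d_a * d_h      # U_a h_j, once per source position
    decoder_proj = T_y * d_a * d_s      # W_a s_{t-1}, once per decoder step
    # Work repeated for every (decoder step, source position) pair:
    pairwise = T_x * T_y * (d_a + d_h)  # v^T tanh(.) plus the weighted sum for c_t
    # One alignment weight per pair if the full alignment matrix is kept:
    weights = T_x * T_y
    return {
        "encoder_proj": encoder_proj,
        "decoder_proj": decoder_proj,
        "pairwise": pairwise,
        "weights": weights,
    }

# Example sizes (assumed): 30 source tokens, 30 target tokens
cost = attention_cost(T_x=30, T_y=30, d_s=1000, d_h=2000, d_a=1000)
```

The `pairwise` entry is the term that grows as T_x · T_y · d, and `weights` is the T_x · T_y storage for the alignment matrix.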
Each decoder step uses all source positions (soft attention).
Because Bahdanau Attention is embedded in a recurrent decoder (RNN/GRU), token generation is sequential; the attention operation at a given step t can be vectorised over source positions, but decoder steps must run one after another.
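The contrast between vectorised scoring and sequential decoding can be sketched as a short loop. This is a simplified illustration under stated assumptions: the state update `np.tanh(W_s @ s + W_c @ c)` is a hypothetical stand-in for the GRU cell, the previous-token input is omitted, and `decode_with_attention`, `W_s` and `W_c` are names invented for the example.

```python
import numpy as np

def decode_with_attention(H, s0, W_a, U_a, v, W_s, W_c, T_y):
    """Run T_y decoder steps and return the alignment matrix and final state."""
    enc_proj = H @ U_a.T                     # (T_x, d_a), reused at every step
    s = s0
    alphas = []
    for _ in range(T_y):                     # sequential: step t needs s_{t-1}
        # All T_x scores for this step in one vectorised expression
        e = np.tanh(W_a @ s + enc_proj) @ v  # (T_x,)
        w = np.exp(e - e.max())              # stable softmax numerator
        alpha = w / w.sum()
        c = alpha @ H                        # context vector c_t, (d_h,)
        # Hypothetical stand-in for the GRU update; token input omitted
        s = np.tanh(W_s @ s + W_c @ c)
        alphas.append(alpha)
    return np.stack(alphas), s               # (T_y, T_x), (d_s,)

# Toy inputs (assumed sizes, random values)
rng = np.random.default_rng(2)
T_x, T_y, d_s, d_h, d_a = 6, 4, 5, 7, 3
H = rng.normal(size=(T_x, d_h))
s0 = np.zeros(d_s)
W_a = rng.normal(size=(d_a, d_s))
U_a = rng.normal(size=(d_a, d_h))
v = rng.normal(size=d_a)
W_s = rng.normal(size=(d_s, d_s))
W_c = rng.normal(size=(d_s, d_h))

A, s_final = decode_with_attention(H, s0, W_a, U_a, v, W_s, W_c, T_y)
```

The inner scoring line touches every source position at once, while the outer loop cannot be removed because each step consumes the state produced by the previous one.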
Hidden size of the scoring MLP (typically equal to the encoder hidden size).
Hidden size of the encoder (bidirectional RNN/GRU) hidden states.
Operations are matrix-based, but the sequential RNN decoder limits tensor-core utilisation compared to a pure Transformer.
The mechanism itself is a small MLP plus softmax — runs on essentially any accelerator that supports standard neural network ops.