Classical self-attention computes score_ij = (W_Q x_i)·(W_K x_j) / √d. RPR extends the formula with learned relative position representations: score_ij = (W_Q x_i)·(W_K x_j + a^K_{i-j}) / √d, and the attention output gets an analogous term: out_i = Σ_j softmax(score)_ij · (W_V x_j + a^V_{i-j}). Tables a^K and a^V are small — they contain (2k+1) vectors corresponding to relative distances clipped to [-k, +k]. All distances |i-j| > k are mapped to the clipped value ±k, so the model sees "far / very far" as a single category. Absolute positions are NOT added to the input embedding — position information lives only inside the attention blocks. Each layer can have its own a^K, a^V tables or share them across layers (both variants are studied in the paper).
Absolute position encodings (sinusoidal, learned PE) model "position as a number", which is unnatural for many language tasks — grammar and meaning depend on the distance between words rather than their numeric position in the sentence. RPR shows that explicitly modelling the relation "two tokens to the left" yields significantly better results on machine translation (WMT En→De/En→Fr) than classical absolute PE — without input-side PE parameters.
A small table of (2k+1) × d_z learned vectors, indexed by relative distance clipped to [-k, +k]. Added to the Key projection before the dot product with Query.
Official
A second, optional table of (2k+1) × d_z added to the Value projection in the attention output. In the original paper it yields a small improvement; T5 drops it.
Official
Function mapping any distance |i-j| to an index in [-k, +k]. All distances outside this range are treated identically — a key architectural decision of RPR.
Official
Building the tensor of relative vectors for every pair (i, j) directly scales as T²·d, which explodes for long sequences. This historically limited RPR to short contexts.
For small k all distances > k are indistinguishable to the model — a loss for long-context. Too large k inflates parameter count and memory cost.
Shaw et al. show that RPR is a full replacement for absolute PE. Combining both does not improve results and increases parameter count.
The original Transformer introduces absolute sinusoidal / learned PE. The open question: is relative position modelled efficiently enough?
Shaw, Uszkoreit, Vaswani publish RPR at NAACL 2018. They show that explicit modelling of relative distance in attention beats absolute PE on WMT 2014 and can replace input-side PE entirely.
Huang et al. (Google Brain) implement RPR efficiently for very long music sequences using a "skewing" trick that reduces memory from O(T²·d) to O(T·d).
Dai et al. combine relative PE (a variant extending RPR) with segment recurrence, allowing the Transformer to maintain consistency over sequences much longer than a single attention window.
T5 (Raffel et al., Google) introduces a much simpler RPR variant: a scalar bias per head per bucket, with 32 logarithmic buckets. Shared across layers. This solution becomes popular in encoder-decoder LLMs.
RoPE (Su et al.) and ALiBi (Press et al.) refine the RPR idea in another direction: instead of learned relative vectors they use a deterministic function of distance (rotation vs linear bias). Both inherit RPR's central intuition: "what matters is distance, not absolute position".
Time complexity: O(T² · d) dla score + a^K (jak vanilla attention, ze stałym narzutem). Space complexity: O((2k+1) · d) parametrów; O(T² · d) aktywacji w naiwnej impl., O(T · d) ze skewing trickiem.
The naive implementation must materialise a [T, T, d_z] tensor of relative vectors, which explodes memory-wise for large T. This was RPR's main limitation before the T5/skewing-trick variants.
Maximum relative distance for which separate vectors are learned. All positions farther than k are clipped to ±k. The original paper achieves best results with k = 16; T5 uses a variant with 32 logarithmic buckets.
Whether to also modify the Value stream (a^V) or only Key (a^K). Shaw et al. report a small gain from adding a^V — T5 and several later works drop it.
Whether a^K/a^V tables are shared across layers (as in T5) or per-layer (as in the Shaw et al. variant). Sharing reduces parameter count substantially without noticeable quality loss.
T5 reduces RPR to a scalar bias per head per bucket (instead of a d-dimensional vector). Each attention head gets its own "map" of how to distribute attention over distances.
RPR modifies only the dense attention computation — no routing and no conditional activation.
The additional a^K, a^V vectors fit into the same parallel structure as standard attention. The original RPR implementation has O(T²·d) overhead to build the relative-vector matrix — Music Transformer introduces a "skewing trick" reducing memory to O(T·d) without losing parallelism.
RPR is a purely algorithmic attention modification — small-table lookup + addition. No hardware requirements beyond a typical accelerator.
Works well with optimised attention kernels (e.g. the T5 scalar-bias variant is fused into FlashAttention for T5-like models).