Architecture

RPR

2018 · Active · Published: 9 June 2026 · Updated: 9 June 2026

Key innovation

Moves position information from the model input INTO the attention mechanism itself, as learned representations of RELATIVE distances between query and key (clipped to a window of ±k) — instead of adding absolute position encoding to the input embedding.

How it works

Classical self-attention computes score_ij = (W_Q x_i)·(W_K x_j) / √d_z. RPR adds a learned relative position representation to the key: score_ij = (W_Q x_i)·(W_K x_j + a^K_{i-j}) / √d_z. The attention output gets an analogous term on the value side: out_i = Σ_j α_ij · (W_V x_j + a^V_{i-j}), where α_ij = softmax_j(score_ij). (Shaw et al. index the tables by j-i; the sign convention does not change the method.) The tables a^K and a^V are small: each holds 2k+1 vectors, one per relative distance in [-k, +k]. Every distance with |i-j| > k is mapped to the clipped value ±k, so the model treats everything beyond the window on a given side as a single "far" category. Absolute positions are NOT added to the input embedding; position information lives only inside the attention blocks. Shaw et al. use separate a^K, a^V tables per layer and note that the tables can be shared across attention heads.
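The two formulas above can be sketched in a few lines. This is a minimal, dependency-free illustration for a single head, assuming the inputs are already projected (q = W_Q x, and so on) and that row r of each table stores the vector for distance r - k. The function and variable names are illustrative and are not taken from the paper or from tensor2tensor.

```python
import math
import random


def clip(x, lo, hi):
    return max(lo, min(hi, x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def relative_attention(q, k_proj, v_proj, a_k, a_v, k):
    """Single-head attention with relative position representations.

    q, k_proj, v_proj: T lists of d-dimensional vectors (already projected).
    a_k, a_v: tables of 2k+1 d-dimensional vectors; row r holds distance r - k
    (an assumed storage layout).
    """
    T, d = len(q), len(q[0])
    out = []
    for i in range(T):
        # Table row for each key position j: clip(i - j, -k, k), shifted to 0..2k.
        idx = [clip(i - j, -k, k) + k for j in range(T)]
        scores = [
            dot(q[i], [kj + ak for kj, ak in zip(k_proj[j], a_k[idx[j]])])
            / math.sqrt(d)
            for j in range(T)
        ]
        w = softmax(scores)
        out.append([
            sum(w[j] * (v_proj[j][c] + a_v[idx[j]][c]) for j in range(T))
            for c in range(d)
        ])
    return out


def rand_rows(rows, dim):
    return [[random.gauss(0, 1) for _ in range(dim)] for _ in range(rows)]


random.seed(0)
T, d, k = 6, 4, 2
q, kp, vp = rand_rows(T, d), rand_rows(T, d), rand_rows(T, d)
a_k, a_v = rand_rows(2 * k + 1, d), rand_rows(2 * k + 1, d)
out = relative_attention(q, kp, vp, a_k, a_v, k)
print(len(out), len(out[0]))  # one d-dimensional output per position
```

The explicit double loop mirrors the formula and is only practical for toy sizes. With both tables set to zero the function reduces to ordinary scaled dot-product attention.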

Problem solved

Absolute position encodings (sinusoidal or learned) model "position as a number", which is unnatural for many language tasks: grammar and meaning depend on the distance between words rather than on their numeric position in the sentence. RPR shows that explicitly modelling relations such as "two tokens to the left" improves machine translation over classical absolute PE, by 1.3 BLEU on WMT 2014 En→De and 0.3 BLEU on En→Fr in Shaw et al., with no position encoding added at the input.

Components

Relative Key Bias (a^K) · Modifier of the Query-Key similarity matrix in attention

A small table of 2k+1 learned vectors of dimension d_z, indexed by the relative distance clipped to [-k, +k]. The looked-up vector is added to the Key projection before the dot product with the Query.

IN: Matrix of relative indices clip(i-j, -k, +k) for every token pair.

OUT: Per-token-pair relative vector (after table lookup).

Per-layer a^K: Each layer has its own a^K table (Shaw et al. variant).

Shared across layers: One a^K table shared across all layers (T5 variant).

T5 scalar bucketed bias: Scalar bias per head per logarithmic bucket instead of a d-dimensional vector.

Official
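As a small illustration of the IN/OUT description of the lookup, the sketch below builds the index matrix for T = 5 and k = 2 and uses it to read from a table. Shifting the indices by +k so that they run from 0 to 2k is an assumed storage layout, and the helper name is hypothetical.

```python
def relative_index_matrix(T, k):
    """Entry [i][j] is clip(i - j, -k, k) + k, a row index into a (2k+1)-row table."""
    return [[max(-k, min(k, i - j)) + k for j in range(T)] for i in range(T)]


T, k = 5, 2
idx = relative_index_matrix(T, k)
for row in idx:
    print(row)
# [2, 1, 0, 0, 0]
# [3, 2, 1, 0, 0]
# [4, 3, 2, 1, 0]
# [4, 4, 3, 2, 1]
# [4, 4, 4, 3, 2]

# Stand-in table: strings instead of learned vectors, to make the lookup visible.
table = [f"a[{r - k:+d}]" for r in range(2 * k + 1)]
lookup_row0 = [table[r] for r in idx[0]]
print(lookup_row0)
# ['a[+0]', 'a[-1]', 'a[-2]', 'a[-2]', 'a[-2]']
```

The matrix is constant along each diagonal, which is why a table of only 2k+1 rows serves every token pair.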

Relative Value Bias (a^V) · Modifier of the Value stream in attention

A second, optional table of 2k+1 vectors of dimension d_z, added to the Value projection when the attention output is computed. In the original paper it yields a small improvement; T5 drops it.

IN: Same relative indices as for a^K.

OUT: Per-token-pair value vector.

Official

Clipping function clip(i-j, -k, +k) · Distance gate that defines the model's relative horizon

Function mapping any signed offset i-j to an index in [-k, +k]. All offsets beyond this range on the same side are treated identically, which is a key architectural decision of RPR.

Hard clipping (Shaw et al.): Clips the offset to ±k. Simple, but loses distance information in long contexts.

Logarithmic bucketing (T5): Near distances get separate buckets; far distances are grouped into progressively wider bins.

Official
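The difference between the two variants can be sketched with scalar functions. The bucketing function below is loosely modelled on the bidirectional bucketing used in T5 implementations, not copied from any library. The 32 buckets follow the text; the max_distance of 128 and the function names are assumptions of this sketch.

```python
import math


def hard_clip(offset, k):
    """Shaw et al. style: clip the signed offset to [-k, +k]."""
    return max(-k, min(k, offset))


def t5_style_bucket(offset, num_buckets=32, max_distance=128):
    """Map a signed offset to a bucket id in [0, num_buckets).

    Half of the buckets serve each direction. Within a direction, small
    distances get one bucket each and larger distances share
    logarithmically wider buckets up to max_distance.
    """
    half = num_buckets // 2
    bucket = half if offset > 0 else 0   # direction selects the half
    dist = abs(offset)
    max_exact = half // 2                # distances below this are kept exact
    if dist < max_exact:
        return bucket + dist
    log_ratio = math.log(dist / max_exact) / math.log(max_distance / max_exact)
    large = max_exact + int(log_ratio * (half - max_exact))
    return bucket + min(large, half - 1)  # everything farther shares the last bucket


for offset in (-1, -7, -20, -100, -5000):
    print(offset, hard_clip(offset, 16), t5_style_bucket(offset))
```

Hard clipping with k = 16 cannot tell -20 from -5000, while the bucketed variant still separates -20 from -100 before everything beyond max_distance collapses into one bucket.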

Implementation

Reference implementations

tensor2tensor — relative_attention_inner

Python (TensorFlow) · Google Brain (Shaw et al.)

Official

Hugging Face Transformers — T5RelativePositionBias

Python · Hugging Face / Google (T5)

Transformer-XL (kimiyoung/transformer-xl)

Python (PyTorch / TF) · Carnegie Mellon (Dai et al.)

Official

Implementation pitfalls

Naive O(T²·d) memory implementation · High

Building the tensor of relative vectors for every pair (i, j) directly scales as T²·d, which explodes for long sequences. This historically limited RPR to short contexts.

Fix: Use the Music Transformer "skewing trick" or the T5 bucket variant. Both avoid materialising the T×T×d tensor of relative vectors; skewing brings the intermediate memory down to O(T·d).
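The skewing idea can be illustrated on a toy example for the causal, unclipped case. The sketch multiplies queries by the relative table once, giving a T×T matrix indexed by [query, relative offset], then re-indexes it to [query, key] with a pad, a reshape and a slice. The table layout (row r holds key offset j - i = r - (T - 1)) and all names are assumptions of this sketch rather than the Music Transformer code.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def skew(rel):
    """Re-index rel from [query, relative offset] to [query, key].

    Entries above the diagonal (key j > query i) are leftovers of the
    reshape and must be removed by the causal mask.
    """
    T = len(rel)
    padded = [[0.0] + row for row in rel]            # (T, T+1): dummy left column
    flat = [x for row in padded for x in row]        # row-major flatten
    reshaped = [flat[r * T:(r + 1) * T] for r in range(T + 1)]  # (T+1, T)
    return reshaped[1:]                              # drop the first row: (T, T)


random.seed(0)
T, d = 5, 3
q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(T)]
e_r = [[random.gauss(0, 1) for _ in range(d)] for _ in range(T)]  # relative table

# One (T x d) by (d x T) product: no T x T x d tensor is ever built.
rel = [[dot(q[i], e_r[r]) for r in range(T)] for i in range(T)]
s_rel = skew(rel)

# For every causal pair, the skewed logit equals the direct lookup.
ok = all(
    s_rel[i][j] == rel[i][T - 1 - (i - j)]
    for i in range(T) for j in range(i + 1)
)
print(ok)
```

The result s_rel would be added to the content logits before the causal mask and the softmax; only the T×d table and T×T matrices are held in memory.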

Clipping to ±k loses information in long contexts · Medium

For small k, all distances greater than k are indistinguishable to the model, which hurts long-context tasks. Too large a k inflates the parameter count and memory cost.

Fix: Use logarithmic bucketing (T5) or switch to RoPE/ALiBi, which distinguish distances continuously.

Mixing absolute PE with RPR · Low

Shaw et al. report that combining relative and absolute position representations gives no further improvement in translation quality, so RPR can fully replace absolute PE. Keeping both only adds parameters and complexity.

Fix: Disable absolute input-side PE when using RPR.

Evolution

Original paper · 2018 · NAACL 2018 · Peter Shaw

Self-Attention with Relative Position Representations

Peter Shaw, Jakob Uszkoreit, Ashish Vaswani

2017

Transformer and absolute PE (Vaswani et al.)

The original Transformer introduces absolute position encodings, sinusoidal or learned. The open question is whether absolute encodings let the model capture relative position well enough.

Transformer (concept)

2018

Shaw et al. — Relative Position Representations

Inflection point

Shaw, Uszkoreit, Vaswani publish RPR at NAACL 2018. They show that explicit modelling of relative distance in attention beats absolute PE on WMT 2014 and can replace input-side PE entirely.

Self-Attention with Relative Position Representations (paper)

2018

Music Transformer — skewing trick

Huang et al. (Google Brain) implement RPR efficiently for very long music sequences using a "skewing" trick that reduces memory from O(T²·d) to O(T·d).

2019

Transformer-XL — relative PE + segment recurrence

Dai et al. combine relative PE (a variant extending RPR) with segment recurrence, allowing the Transformer to maintain consistency over sequences much longer than a single attention window.

2020

T5 — simplified per-head relative bias

T5 (Raffel et al., Google) introduces a much simpler RPR variant: a scalar bias per head per bucket, with 32 logarithmic buckets, shared across layers. This design becomes popular in encoder-decoder LLMs.

2021

RoPE and ALiBi — further steps toward relative PE

RoPE (Su et al.) and ALiBi (Press et al.) refine the RPR idea in another direction: instead of learned relative vectors they use a deterministic function of distance (rotation vs linear bias). Both inherit RPR's central intuition: "what matters is distance, not absolute position".

RoPE (concept)
