Architecture

RoPE

2021 · Updated: 4 May 2026

Key innovation

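Encodes token positions as rotations of the query and key vectors (each pair of dimensions is treated as a complex number), so that attention depends on relative position. This relative formulation is the basis for extending models to sequences longer than those seen during training.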

How it works

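Each pair of dimensions in the query and key vectors is rotated by an angle proportional to the token position, with a different frequency per pair, before the dot product is computed. Because the rotations of a query at position m and a key at position n combine into a single rotation by (m - n), the attention score depends on the relative distance between tokens rather than on their absolute positions.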
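The rotation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names `rope_angles`, `apply_rope`, and `score` are made up for this example, the base of 10000 and the pairing of consecutive dimensions follow the RoFormer paper, and everything runs in float64 for clarity.

```python
import numpy as np

def rope_angles(positions, head_dim, base=10000.0):
    """Angles of shape (len(positions), head_dim // 2): position * theta_i."""
    # theta_i = base^(-2i/d), one frequency per pair of dimensions
    inv_freq = base ** (-np.arange(0, head_dim, 2, dtype=np.float64) / head_dim)
    return np.outer(np.asarray(positions, dtype=np.float64), inv_freq)

def apply_rope(x, positions, base=10000.0):
    """Rotate each pair (x[2i], x[2i+1]) of every row by position * theta_i."""
    angles = rope_angles(positions, x.shape[-1], base)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x, dtype=np.float64)
    out[..., 0::2] = x_even * cos - x_odd * sin  # real part of the rotated pair
    out[..., 1::2] = x_even * sin + x_odd * cos  # imaginary part
    return out

# Hypothetical single query and key vectors with head_dim = 8
rng = np.random.default_rng(0)
q = rng.standard_normal((1, 8))
k = rng.standard_normal((1, 8))

def score(m, n):
    """Attention logit between q at position m and k at position n."""
    return (apply_rope(q, [m]) @ apply_rope(k, [n]).T).item()

# Same offset (3) at different absolute positions gives the same logit
print(score(5, 2), score(105, 102))
```

The two printed values agree up to floating-point rounding, which is the relative-position property in action. Rotation also leaves vector norms unchanged, so RoPE does not alter the scale of queries or keys.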

Problem solved

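Absolute positional encodings that are added to the token embeddings (learned or sinusoidal) generalize poorly to sequences longer than those seen during training. RoPE instead applies a position-dependent rotation matrix to queries and keys, so the attention score depends only on the relative offset between tokens. That property carries over to longer sequences more gracefully, although in practice it still relies on the scaling techniques listed under the pitfalls.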

Implementation

Implementation pitfalls

Degradation when extrapolating beyond training context · Medium

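A model trained with RoPE on sequences of up to N tokens degrades on sequences longer than N unless a context-extension technique is applied (YaRN, LongRoPE, NTK-aware scaling). Naive context extension feeds the model rotation angles it never saw during training, and attention scores become erratic.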
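As an illustration of one of these techniques, the sketch below shows NTK-aware scaling in the form it is commonly described: the frequency base is enlarged so that the slowest-rotating pair is slowed down by the extension factor while the fastest pair is left untouched. The function names and the values of `head_dim`, `train_len`, and `scale` are hypothetical, and the sketch does not reproduce YaRN, LongRoPE, or any particular library's implementation.

```python
import numpy as np

def rope_inv_freq(head_dim, base=10000.0):
    """Per-pair frequencies theta_i = base^(-2i/d)."""
    return base ** (-np.arange(0, head_dim, 2, dtype=np.float64) / head_dim)

def ntk_scaled_inv_freq(head_dim, scale, base=10000.0):
    """NTK-aware variant (assumed form): base' = base * scale^(d / (d - 2))."""
    new_base = base * scale ** (head_dim / (head_dim - 2))
    return rope_inv_freq(head_dim, new_base)

# Hypothetical setup: trained on 4096 tokens, extended 4x
head_dim, train_len, scale = 64, 4096, 4.0
plain = rope_inv_freq(head_dim)
scaled = ntk_scaled_inv_freq(head_dim, scale)

# Largest angle reached by the slowest pair, before and after extension
max_angle_trained = (train_len - 1) * plain[-1]
max_angle_extended = (scale * train_len - 1) * scaled[-1]
print(max_angle_trained, max_angle_extended)
```

With this choice of base, the slowest pair at the end of the extended context reaches roughly the same angle it reached at the end of the training context, so the low-frequency dimensions stay inside the range the model has seen. The high-frequency dimensions, which encode fine local order, are unchanged.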

Implementation requires float32 precision for small angles · Medium

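At large positions (e.g. position 100k) the rotation angle, position times frequency, becomes large for the high-frequency pairs, while the angle difference between neighboring positions stays small. float16 and bfloat16 cannot resolve both: float16 overflows above 65504, and bfloat16 carries only 8 significant bits, so neighboring positions collapse to the same value and the resulting angles are wrong. Recommended: compute the angles and apply the rotation in float32, then cast the result to bf16.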
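The precision loss can be made concrete with a small NumPy experiment. NumPy has no native bfloat16, so this sketch emulates it by zeroing the low 16 bits of a float32 value (truncation rather than round-to-nearest, which is an assumption of the example). The helper names are hypothetical, and float16 stands in for the low-precision output dtype.

```python
import numpy as np

def to_bf16_truncated(x):
    """Emulate bfloat16 by keeping only the top 16 bits of each float32."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

positions = np.array([100000.0, 100001.0], dtype=np.float32)

# bfloat16 spacing near 100k is 512, so both positions truncate to 99840
pos_bf16 = to_bf16_truncated(positions)

# float16 cannot represent these positions at all (max finite value is 65504)
with np.errstate(over="ignore"):
    pos_fp16 = positions.astype(np.float16)

# For the highest-frequency pair theta = 1, so the angle equals the position;
# the phase error below is many full turns, not a rounding nuisance
angle_error = positions[1] - pos_bf16[1]
print(pos_bf16, pos_fp16, angle_error)

def rope_cos_sin(positions, head_dim, base=10000.0, out_dtype=np.float16):
    """Compute angles in float32 and cast only the final cos/sin tables."""
    exponents = -np.arange(0, head_dim, 2, dtype=np.float32) / head_dim
    inv_freq = (base ** exponents).astype(np.float32)
    angles = np.outer(positions.astype(np.float32), inv_freq)
    return np.cos(angles).astype(out_dtype), np.sin(angles).astype(out_dtype)

cos, sin = rope_cos_sin(positions, head_dim=8)
```

Casting after `cos` and `sin` is safe because their values lie in [-1, 1], where low-precision formats have enough resolution. The damage comes from storing the positions or the angles themselves in reduced precision.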

Sources

RoFormer: Enhanced Transformer with Rotary Position Embedding (Su et al., 2021)