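In RoPE (rotary position embedding), each pair of dimensions in the query and key vectors is rotated by an angle proportional to the token position before the dot product is computed. The rotation of a query at position m and that of a key at position n combine into a single rotation by (m - n), so the attention score depends on the relative offset between tokens rather than on their absolute positions.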
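The relative-position property can be checked numerically. The following is a minimal pure-Python sketch, not a reference implementation: the helper names `rope` and `dot`, the interleaved pairing of dimensions, and the base of 10000 are assumptions chosen for illustration.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each pair (vec[2i], vec[2i+1]) by the angle pos * base**(-2i/d)."""
    d = len(vec)
    assert d % 2 == 0, "head dimension must be even"
    out = []
    for i in range(d // 2):
        theta = pos * base ** (-2.0 * i / d)  # angle grows linearly with position
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out += [x * c - y * s, x * s + y * c]  # 2D rotation of the pair
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Illustrative query/key vectors (arbitrary values).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same offset (3) at different absolute positions: scores agree.
score_a = dot(rope(q, 5), rope(k, 2))
score_b = dot(rope(q, 105), rope(k, 102))

# A different offset (4) gives a different score.
score_c = dot(rope(q, 6), rope(k, 2))
```

Here `score_a` and `score_b` agree up to floating-point rounding because both pairs of positions are three tokens apart, while `score_c` differs because the offset changed.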
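Standard absolute positional encodings (learned or sinusoidal vectors added to the token embeddings) generalize poorly to sequences longer than those seen during training. RoPE instead encodes position as a rotation of the query and key vectors, so attention scores depend only on relative offsets; this transfers to longer sequences more gracefully, though not without limits.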
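A RoPE model trained on sequences of up to N tokens degrades on sequences longer than N unless an extrapolation technique is applied (YaRN, LongRoPE, NTK-aware scaling). Naive context extension exposes the model to rotation angles it never saw during training, and attention scores become erratic.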
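As a rough illustration of how such techniques change the rotation angles, the sketch below compares plain RoPE angles with two simple rescalings: linear position interpolation and the commonly cited NTK-aware base adjustment, base * scale^(d/(d-2)). The function names, the head dimension of 64, the trained length of 4096, and the scale factor of 4 are assumptions for illustration. YaRN and LongRoPE are more involved and are not shown.

```python
import math

def inv_freqs(dim, base=10000.0):
    """Per-pair rotation frequencies base**(-2i/dim)."""
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]

def angles_linear(pos, dim, scale, base=10000.0):
    """Position interpolation: divide positions by `scale` so they stay in the trained range."""
    return [(pos / scale) * f for f in inv_freqs(dim, base)]

def angles_ntk(pos, dim, scale, base=10000.0):
    """NTK-aware scaling (commonly cited form): enlarge the base instead of shrinking positions."""
    new_base = base * scale ** (dim / (dim - 2))
    return [pos * f for f in inv_freqs(dim, new_base)]

# Hypothetical setup: trained on 4096 tokens, extended 4x.
dim, trained_len, scale = 64, 4096, 4.0
pos = 16000  # beyond the trained length

plain = [pos * f for f in inv_freqs(dim)]   # unscaled angles, out of the trained range
linear = angles_linear(pos, dim, scale)     # same angles as position 4000 without scaling
ntk = angles_ntk(pos, dim, scale)           # high frequencies kept, low frequencies stretched
```

With these formulas, linear interpolation compresses every frequency equally, whereas the NTK-aware variant leaves the highest-frequency pair untouched and slows the lowest-frequency pair by the full scale factor.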
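At large positions (e.g. position 100k) the rotation angles of the high-frequency dimensions become very large, and low-precision formats cannot resolve them. bfloat16 has only 8 significant bits, so it cannot represent every integer position above 256, and float16 overflows above 65504. The resulting rounding errors translate directly into wrong rotation phases. Recommended: compute the angles, their cos/sin values, and the rotation in float32, then cast the result to bf16.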
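To see why low precision is risky at large positions, the sketch below simulates bfloat16 rounding with the standard library only. The helper names `to_bf16` and `to_f32` are hypothetical, and the rounding routine is a simplified round-to-nearest-even that assumes finite, normal inputs.

```python
import struct

def to_f32(x):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def to_bf16(x):
    """Round to the nearest bfloat16 value (round-to-nearest-even).

    Simplified: assumes a finite, normal input; NaN/inf are not handled.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias, ties go to even
    bits &= 0xFFFF0000                   # keep sign, exponent, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

pos = 100001
pos_f32 = to_f32(float(pos))    # exact: float32 holds integers up to 2**24
pos_bf16 = to_bf16(float(pos))  # rounded to a multiple of 512 in this range

# For the highest-frequency pair (frequency 1.0) the angle equals the position,
# so the rounding error in the position is the phase error in radians.
phase_error = abs(pos - pos_bf16)
```

In this range bfloat16 values are 512 apart, so the stored position is off by many full rotations for the fastest-rotating pair, while float32 still represents the position exactly.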
Applying RoPE amounts to element-wise multiply-adds between each Q and K vector and cos/sin values that are typically precomputed once per position. This runs efficiently on GPUs and is often fused into attention kernels (FlashAttention-3 supports RoPE fusion).
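The element-wise nature of the operation is easiest to see when the trigonometric values are cached. This is a minimal sketch with hypothetical helper names (`build_rope_cache`, `apply_rope`) and an assumed interleaved pair layout; real fused kernels do the same arithmetic on tensors inside the attention kernel.

```python
import math

def build_rope_cache(max_pos, dim, base=10000.0):
    """Precompute cos/sin once; rows index position, columns index the rotation pair."""
    inv_freq = [base ** (-2.0 * i / dim) for i in range(dim // 2)]
    cos = [[math.cos(p * f) for f in inv_freq] for p in range(max_pos)]
    sin = [[math.sin(p * f) for f in inv_freq] for p in range(max_pos)]
    return cos, sin

def apply_rope(vec, cos_row, sin_row):
    """Apply the rotation using only multiplies and adds; no trig calls here."""
    out = []
    for i, (c, s) in enumerate(zip(cos_row, sin_row)):
        x, y = vec[2 * i], vec[2 * i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

cos, sin = build_rope_cache(max_pos=8, dim=4)
q = [1.0, 0.0, 0.0, 1.0]           # illustrative query vector
q_rot = apply_rope(q, cos[3], sin[3])  # rotate as if the token sat at position 3
```

The cache is built once and reused for every Q and K vector, which is what makes the per-token cost a handful of multiply-adds and a natural candidate for fusion.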