YaRN modifies RoPE rotations in a frequency-dependent way ("NTK-by-parts"): high-frequency dimensions (encoding local, short-range relations) keep their original frequencies and are extrapolated without interpolation, low-frequency dimensions (encoding long-range dependencies) are interpolated as in Position Interpolation, and intermediate dimensions are smoothly blended between the two regimes. In addition, attention logits are divided by a constant temperature t, with sqrt(1/t) = 0.1·ln(s) + 1 recommended for Llama-family models, to correct the attention entropy at longer sequence lengths; in practice q and k are each scaled by sqrt(1/t), so the attention kernel itself is unchanged. The modified model is then fine-tuned on a small budget of long sequences (on the order of ~0.1% of pretraining tokens) to stabilise quality. A "Dynamic-YaRN" variant recomputes the scale factor from the current sequence length at inference, s = max(1, current length / pretraining length), so no scaling is applied until the sequence actually exceeds the pretraining length, minimising regression on short prompts.
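The frequency blending described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `yarn_inv_freq` and its parameter names are hypothetical, and the defaults (base 10000, a 4096-token pretraining context, thresholds alpha = 1 and beta = 32) are assumptions matching the values the YaRN paper suggests for Llama-family models.

```python
import math

def yarn_inv_freq(dim, base=10000.0, orig_len=4096, scale=8.0,
                  alpha=1.0, beta=32.0):
    """Per-pair RoPE angular frequencies after NTK-by-parts blending.

    All names and defaults are illustrative assumptions.
    """
    freqs = []
    for i in range(dim // 2):
        theta = base ** (-2.0 * i / dim)      # original RoPE frequency of pair i
        wavelength = 2.0 * math.pi / theta    # tokens per full rotation
        r = orig_len / wavelength             # rotations inside the pretraining context
        # Linear ramp: 0 -> fully interpolated, 1 -> original frequency kept
        gamma = min(1.0, max(0.0, (r - alpha) / (beta - alpha)))
        freqs.append((1.0 - gamma) * theta / scale + gamma * theta)
    return freqs

freqs = yarn_inv_freq(dim=128)
print(freqs[0], freqs[-1])  # highest- and lowest-frequency pairs
```

The highest-frequency pair is left untouched, while the lowest-frequency pair is divided by the scale factor exactly as Position Interpolation would do.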
RoPE-based LLMs degrade sharply when the context length exceeds the length seen during pretraining: positions outside the training range were never observed, and naively extrapolating RoPE leads to attention collapse. Earlier methods (Position Interpolation, NTK-aware interpolation) either required longer fine-tuning or gave worse perplexity at very long contexts.
YaRN without at least a short fine-tuning on long sequences yields significantly worse quality than the fine-tuned variant — especially on "needle in a haystack" tasks.
Permanently enabling YaRN for all input lengths can slightly reduce model quality on short prompts (shorter than the pretraining length).
Some implementations apply only the RoPE interpolation (NTK-by-parts) and skip the attention-temperature scaling — results then resemble NTK-aware rather than full YaRN.
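The short-prompt regression noted above is what dynamic scaling is meant to avoid. A minimal sketch, assuming a hypothetical helper named `dynamic_scale` and a 4096-token pretraining length: the scale factor is recomputed from the current sequence length and never drops below 1.

```python
def dynamic_scale(current_len, orig_len=4096):
    """Scale factor s = max(1, current_len / orig_len).

    Hypothetical helper; orig_len is the assumed pretraining context length.
    """
    return max(1.0, current_len / orig_len)

for n in (1024, 4096, 8192, 32768):
    print(n, dynamic_scale(n))
```

Because the factor changes as the sequence grows, the rotations applied to earlier positions change as well. The YaRN paper notes that with KV caching the keys should therefore be cached before RoPE is applied.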
Su et al. introduce RoPE — encoding positions via rotation of embedding dimension pairs, the foundation for the entire line of context-extension work.
Chen et al. (Meta) show that linearly scaling RoPE position indices extends Llama context with short fine-tuning. The first breakthrough in "cheap context extension".
The community (Reddit user "bloc97") proposes NTK-aware interpolation: instead of uniformly scaling indices, the RoPE base is modified, preserving sharpness of local dimensions. Works better than PI without fine-tuning.
Peng, Quesnelle, Fan, Shippole publish YaRN (arXiv:2309.00071). They combine NTK-by-parts (different interpolation regimes per frequency band) with attention-temperature scaling, fine-tune on ~0.1% of pretraining tokens, and outperform Position Interpolation and earlier NTK-aware methods on long contexts.
The authors release open Llama 2 7B/13B checkpoints fine-tuned with YaRN to 64k and 128k context windows, which become popular long-context open-source LLM references.
The paper is accepted at ICLR 2024 and YaRN becomes the de-facto standard for context extension in open-source LLMs (Qwen, Mistral-derived models, DeepSeek-V2/V3, Yi-200k, many Llama fine-tunes).
Context extension ratio: target context length / pretraining length. For Llama 2 (4k) → 32k this is s=8.
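Constant temperature t applied to attention logits: logits are divided by t, which is implemented by scaling q and k each by sqrt(1/t). The paper recommends sqrt(1/t) = 0.1·ln(s) + 1 for Llama-family models, i.e. a logarithmic dependence on the scale factor. Corrects attention entropy at longer sequence lengths.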
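Thresholds (α and β in the paper, applied to the number of full rotations a dimension completes within the pretraining context) splitting RoPE dimensions into three regimes: pure extrapolation (high frequencies, more than β rotations), a transition zone, and pure interpolation (low frequencies, fewer than α rotations). The paper suggests α = 1 and β = 32 for Llama-family models.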
Number of long-sequence tokens needed to fine-tune the YaRN-modified model. The original paper reports ~0.1% of pretraining tokens is sufficient.
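To make the regime thresholds concrete, the sketch below counts how many RoPE dimension pairs fall into each regime. The setup is an assumption chosen for illustration: head dimension 128, base 10000, a 4096-token pretraining context, and thresholds alpha = 1 and beta = 32 (the values the YaRN paper suggests for Llama-family models). The helper name `classify_dims` is hypothetical.

```python
import math

def classify_dims(dim=128, base=10000.0, orig_len=4096, alpha=1.0, beta=32.0):
    """Count RoPE dimension pairs per regime (illustrative helper)."""
    counts = {"extrapolated": 0, "blended": 0, "interpolated": 0}
    for i in range(dim // 2):
        wavelength = 2.0 * math.pi * base ** (2.0 * i / dim)
        r = orig_len / wavelength  # rotations inside the pretraining context
        if r >= beta:
            counts["extrapolated"] += 1   # high frequency: left unchanged
        elif r <= alpha:
            counts["interpolated"] += 1   # low frequency: divided by s
        else:
            counts["blended"] += 1        # transition zone
    return counts

print(classify_dims())
```

Raising beta widens the transition zone at the expense of the untouched high-frequency pairs, while raising alpha pushes more pairs into pure interpolation.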
YaRN is a positional-encoding modification within a dense Transformer. The mechanism itself is deterministic and introduces neither routing nor conditional activation.
YaRN does not change the compute cost of a single attention operation: it only modifies how RoPE rotations are computed and adds an attention-temperature scaling. Parallelism behaves exactly as in a standard RoPE-based Transformer.
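One way to see why the temperature adds no attention cost: scaling q and k each by sqrt(1/t) gives the same logit as dividing the unscaled logit by t, so the factor can be folded into the RoPE embeddings ahead of time. The sketch below checks this on toy vectors. The helper names are hypothetical, and the formula sqrt(1/t) = 0.1·ln(s) + 1 is the one the YaRN paper recommends for Llama-family models.

```python
import math

def attention_factor(scale):
    """sqrt(1/t) for scale factor s; 1.0 when no extension is applied."""
    return 0.1 * math.log(scale) + 1.0 if scale > 1.0 else 1.0

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy query/key vectors (arbitrary illustrative values)
q = [0.5, -1.0, 2.0, 0.25]
k = [1.5, 0.5, -0.75, 1.0]

m = attention_factor(8.0)  # sqrt(1/t) for s = 8

# Option A: scale q and k before the dot product (can be baked into RoPE)
logit_a = dot([m * x for x in q], [m * x for x in k])
# Option B: scale the logit afterwards by 1/t = m**2
logit_b = (m * m) * dot(q, k)

print(abs(logit_a - logit_b) < 1e-12)
```

Since option A only touches the inputs, an unmodified fused attention kernel can be used.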
YaRN is a purely algorithmic modification of positional encoding and attention scaling — it requires no special hardware instructions or custom kernels.
YaRN models are fine-tuned on standard GPU clusters; long-context inference typically relies on FlashAttention/PagedAttention, which use tensor cores effectively.