Architecture

YaRN

2023 · Active · Published: 9 June 2026 · Updated: 9 June 2026

Key innovation

Extends the context window of RoPE-based LLMs far beyond the pretraining length using "NTK-by-parts" RoPE interpolation combined with attention-temperature scaling, requiring only minimal fine-tuning.

How it works

YaRN modifies RoPE rotations in a frequency-dependent way ("NTK-by-parts"): high-frequency dimensions (encoding local, short-range relations) are left unchanged, that is, extrapolated without interpolation; low-frequency dimensions (encoding long-range dependencies) are interpolated as in Position Interpolation; and intermediate dimensions are smoothly blended between the two regimes. In addition, attention logits are divided by a constant temperature t to correct the attention entropy at longer sequence lengths; for Llama-family models the paper recommends sqrt(1/t) = 0.1 ln(s) + 1, where s is the context scale factor. The modified model is then fine-tuned on a small budget of long sequences (on the order of ~0.1% of pretraining tokens) to stabilise quality. A "Dynamic-YaRN" variant recomputes the scale factor at inference time as s = max(1, l'/L), where l' is the current sequence length and L the pretraining length, so scaling applies only when the sequence actually exceeds the pretraining length, minimising regression on short prompts.
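The frequency-dependent rule described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the reference implementation: the function name `yarn_frequencies` is hypothetical, the ramp thresholds `alpha=1` and `beta=32` are the values the paper suggests for Llama-family models, and the defaults for `base`, `scale`, and `orig_ctx` are assumptions chosen for the example.

```python
import math


def yarn_frequencies(head_dim, base=10000.0, scale=16.0, orig_ctx=4096,
                     alpha=1.0, beta=32.0):
    """Return one NTK-by-parts RoPE frequency per dimension pair (illustrative)."""
    freqs = []
    for i in range(0, head_dim, 2):
        theta = base ** (-i / head_dim)      # original RoPE frequency of this pair
        wavelength = 2 * math.pi / theta     # tokens needed for one full rotation
        r = orig_ctx / wavelength            # rotations completed within the original context
        # Ramp: 0 below alpha (interpolate fully), 1 above beta (leave unchanged),
        # linear blend in between.
        gamma = min(1.0, max(0.0, (r - alpha) / (beta - alpha)))
        freqs.append((1.0 - gamma) * theta / scale + gamma * theta)
    return freqs


if __name__ == "__main__":
    f = yarn_frequencies(128)
    print(f[0], f[-1])
```

With `head_dim=128`, the first (highest-frequency) pair keeps its original frequency, while the last (lowest-frequency) pair is divided by `scale`, as Position Interpolation would do for every pair.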

Problem solved

RoPE-based LLMs degrade sharply when the context length exceeds the length seen during pretraining: positions outside the training range were never observed, and naively extrapolating RoPE leads to attention collapse. Earlier methods (Position Interpolation, NTK-aware interpolation) required longer fine-tuning or reached worse perplexity at very long contexts.
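To make the two earlier methods concrete, the sketch below contrasts how each changes the RoPE frequencies. It assumes the common formulations: Position Interpolation divides every frequency by the scale factor s, while NTK-aware interpolation replaces the base b with b · s^(D/(D-2)) for head dimension D. The function names and default values are hypothetical and chosen only for illustration.

```python
import math


def rope_freqs(head_dim, base=10000.0):
    """Standard RoPE frequencies, one per dimension pair."""
    return [base ** (-i / head_dim) for i in range(0, head_dim, 2)]


def pi_freqs(head_dim, scale, base=10000.0):
    """Position Interpolation: scaling positions by 1/scale is the same as
    dividing every frequency by scale."""
    return [f / scale for f in rope_freqs(head_dim, base)]


def ntk_aware_freqs(head_dim, scale, base=10000.0):
    """NTK-aware interpolation: enlarge the base so that the lowest frequency
    is divided by scale while the highest frequency stays untouched."""
    new_base = base * scale ** (head_dim / (head_dim - 2))
    return rope_freqs(head_dim, new_base)


if __name__ == "__main__":
    plain = rope_freqs(128)
    pi = pi_freqs(128, 8.0)
    ntk = ntk_aware_freqs(128, 8.0)
    print("highest freq:", plain[0], pi[0], ntk[0])
    print("lowest freq: ", plain[-1], pi[-1], ntk[-1])
```

Position Interpolation compresses the high-frequency pairs as much as the low-frequency ones, which blurs local ordering; NTK-aware interpolation spreads the compression unevenly across pairs. NTK-by-parts, as used in YaRN, instead chooses the regime per frequency band.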

Implementation

Reference implementations

jquesnelle/yarn (official repo)

Python (PyTorch) · Jeffrey Quesnelle et al. (paper authors)

Official

Hugging Face Transformers — rope_scaling: "yarn"

Python · Hugging Face

vLLM — rope_scaling type "yarn" support

Python / CUDA · vLLM project

llama.cpp — YaRN rope scaling

C/C++ · ggerganov and community
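As a rough illustration of how the `rope_scaling` setting named in the entries above is typically switched on, the sketch below builds the kind of block that some model cards document for a Hugging Face-style `config.json`. The key names (`type`, `factor`, `original_max_position_embeddings`) and the numeric values are assumptions for this example; they differ between library versions and models (some releases expect `rope_type` instead of `type`), so check the documentation of the version in use.

```python
import json

# Hypothetical values: a model pretrained on 32,768 tokens, extended 4x with YaRN.
rope_scaling = {
    "type": "yarn",                             # key name assumed; may be "rope_type"
    "factor": 4.0,                              # context scale factor s
    "original_max_position_embeddings": 32768,  # pretraining context length L
}

# Target context length implied by the settings above.
extended_context = int(
    rope_scaling["factor"] * rope_scaling["original_max_position_embeddings"]
)

if __name__ == "__main__":
    print(json.dumps({"rope_scaling": rope_scaling}, indent=2))
    print("extended context:", extended_context)
```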

Implementation pitfalls

No fine-tuning after enabling YaRN · High

YaRN without at least a short fine-tuning on long sequences yields significantly worse quality than the fine-tuned variant, especially on "needle in a haystack" tasks.

Fix: Fine-tune the model on a small budget of long sequences (on the order of ~0.1% of pretraining tokens) following the recipe from the original paper.

Quality regression on short prompts after static YaRN · Medium

Permanently enabling YaRN for all input lengths can slightly reduce model quality on short prompts (shorter than the pretraining length).

Fix: Use the Dynamic-YaRN variant, which activates scaling only when the current sequence length exceeds the pretraining length.
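A minimal sketch of the dynamic rule, assuming the scale factor is recomputed on each forward pass as s = max(1, l'/L), with l' the current sequence length and L the pretraining length. The function name `dynamic_scale` is hypothetical.

```python
def dynamic_scale(current_len, orig_ctx):
    """Scale factor for dynamic scaling: 1.0 (no change) until the sequence
    grows past the pretraining context, then current_len / orig_ctx."""
    return max(1.0, current_len / orig_ctx)


if __name__ == "__main__":
    for n in (1024, 4096, 8192, 65536):
        print(n, dynamic_scale(n, 4096))
```

Because the RoPE embedding of every token changes whenever s changes, an implementation that uses a KV cache would likely need to cache keys before RoPE is applied and rotate them again with the current scale.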

Skipping attention temperature scaling · Medium

Some implementations apply only the RoPE interpolation (NTK-by-parts) and skip the attention-temperature scaling; the result is then plain NTK-by-parts interpolation rather than full YaRN.

Fix: Ensure attention logits are multiplied by the constant temperature factor 1/t as defined in the paper (equivalently, scale both queries and keys by sqrt(1/t)).
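The sketch below illustrates the temperature step on a single query/key pair, assuming the paper's recommendation for Llama-family models, sqrt(1/t) = 0.1 ln(s) + 1. It shows that multiplying the logit by 1/t gives the same result as scaling both the query and the key by sqrt(1/t), which is why the factor can be folded into the rotary embedding without touching the attention kernel. The function names are hypothetical.

```python
import math


def yarn_mscale(scale):
    """sqrt(1/t) for a context scale factor `scale`; 1.0 when no extension is used."""
    if scale <= 1.0:
        return 1.0
    return 0.1 * math.log(scale) + 1.0


def logit_scaled_directly(q, k, scale):
    """Attention logit multiplied by 1/t = mscale**2."""
    dot = sum(a * b for a, b in zip(q, k))
    return dot * yarn_mscale(scale) ** 2 / math.sqrt(len(q))


def logit_scaled_via_qk(q, k, scale):
    """Same logit, obtained by scaling q and k each by sqrt(1/t) first."""
    m = yarn_mscale(scale)
    q_scaled = [m * a for a in q]
    k_scaled = [m * b for b in k]
    dot = sum(a * b for a, b in zip(q_scaled, k_scaled))
    return dot / math.sqrt(len(q))


if __name__ == "__main__":
    q = [1.0, 2.0, 3.0, 4.0]
    k = [0.5, -1.0, 2.0, 0.0]
    print(logit_scaled_directly(q, k, 16.0), logit_scaled_via_qk(q, k, 16.0))
```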

Evolution

Original paper · 2023 · arXiv:2309.00071 (later ICLR 2024) · Bowen Peng

YaRN: Efficient Context Window Extension of Large Language Models

Bowen Peng, Jeffrey Quesnelle, Honglu Fan, Enrico Shippole

2021

RoPE (Rotary Position Embeddings)

Su et al. introduce RoPE — encoding positions via rotation of embedding dimension pairs, the foundation for the entire line of context-extension work.

RoPE (concept)

2023

Position Interpolation (PI)

Chen et al. (Meta) show that linearly scaling RoPE position indices extends Llama context with short fine-tuning. The first breakthrough in "cheap context extension".

2023

NTK-aware interpolation

The community (Reddit user "bloc97") proposes NTK-aware interpolation: instead of uniformly scaling indices, the RoPE base is modified, preserving sharpness of local dimensions. Works better than PI without fine-tuning.

2023

YaRN — paper on arXiv

Inflection point

Peng, Quesnelle, Fan, Shippole publish YaRN (arXiv:2309.00071). They combine NTK-by-parts (different interpolation regimes per frequency band) with attention-temperature scaling, fine-tune on ~0.1% of pretraining tokens, and outperform Position Interpolation and earlier NTK-aware methods on long contexts.

YaRN: Efficient Context Window Extension of Large Language Models (paper)

2023

Llama 2 64k/128k YaRN checkpoints

The authors release open Llama 2 7B/13B checkpoints fine-tuned with YaRN to 64k and 128k context windows, which become popular long-context open-source LLM references.

2024

YaRN accepted at ICLR 2024

The paper is accepted at ICLR 2024 and YaRN becomes the de-facto standard for context extension in open-source LLMs (Qwen, Mistral-derived models, DeepSeek-V2/V3, Yi-200k, many Llama fine-tunes).
