Architecture

LongRoPE

2024 · Active · Published: 9 June 2026 · Updated: 9 June 2026

Key innovation

Extends the context window of RoPE-based LLMs beyond 2 million tokens by identifying two non-uniformities in positional interpolation (per dimension and per token position) and searching their rescaling factors with an evolutionary algorithm, combined with a progressive extension strategy (256k → 2048k) and a short-context readjustment step.

How it works

LongRoPE exploits two non-uniformities in positional interpolation: (1) per-dimension: each RoPE dimension gets its own interpolation factor instead of one shared factor, and (2) per-token: the first n̂ token positions are interpolated less than later positions (in the paper they keep their original RoPE values). The factors are not derived from a formula; they are searched offline with an evolutionary algorithm (mutation plus selection by perplexity on a sample of long texts). A progressive extension strategy follows: factors are first searched and the model is fine-tuned until it is stable at 256k, then a second search finds factors for 2048k without any further fine-tuning. The final step is readjustment: for short contexts (4k/8k) a separate, slightly modified set of RoPE factors is used, which recovers the model's quality on typical short prompts.
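The two non-uniformities can be sketched in a few lines of plain Python. This is a minimal illustration, not the official implementation: the function names, the example factors, and the choice to leave positions below `n_hat` completely untouched are assumptions made for the sketch.

```python
def rope_angles(position, head_dim, base=10000.0):
    """Original RoPE: the angle for dimension pair i is position * base**(-2i/head_dim)."""
    half = head_dim // 2
    return [position * base ** (-2.0 * i / head_dim) for i in range(half)]


def longrope_angles(position, head_dim, factors, n_hat, base=10000.0):
    """Rescaled angles (illustrative): positions below n_hat keep their original
    angles; later positions are divided by a per-dimension factor."""
    half = head_dim // 2
    if len(factors) != half:
        raise ValueError("need one rescale factor per RoPE dimension pair")
    angles = rope_angles(position, head_dim, base)
    if position < n_hat:
        return angles  # per-token non-uniformity: early tokens are not interpolated
    # per-dimension non-uniformity: each dimension pair has its own factor
    return [angle / factor for angle, factor in zip(angles, factors)]


# Hypothetical factors for a toy head_dim of 8 (4 dimension pairs).
factors = [1.0, 1.5, 4.0, 8.0]
early = longrope_angles(2, 8, factors, n_hat=4)    # identical to plain RoPE
late = longrope_angles(100, 8, factors, n_hat=4)   # each pair rescaled separately
```

In a real model the factor vector has one entry per pair of head dimensions and comes from the search, not from hand-picked values like these.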

Problem solved

Earlier RoPE interpolation methods either apply one scaling factor to every dimension (Position Interpolation) or vary it across dimensions by a fixed formula (NTK-aware, YaRN), and all of them treat every token position the same way. Empirically this is not optimal: different RoPE frequencies and different positions (especially the first tokens) respond differently to interpolation. Pushing these methods to extreme lengths (hundreds of thousands or millions of tokens) leads to a sharp rise in perplexity and a breakdown of attention over the long context.

Implementation

Reference implementations

microsoft/LongRoPE (official repo)

Python (PyTorch) · Microsoft Research (paper authors)

Official

Hugging Face Transformers — rope_scaling: "longrope"

Python · Hugging Face

Phi-3 Mini 128k Instruct (reference deployment)

Python · Microsoft

Official

Implementation pitfalls

Skipping the evolutionary search step · Critical

Reproducing LongRoPE with arbitrary or uniform factors effectively reduces it to Position Interpolation (or, with hand-picked per-dimension factors, to a YaRN-like heuristic) and removes the main benefit: scaling to millions of tokens.

Fix: Run the evolutionary algorithm (code from microsoft/LongRoPE) on a sample of long texts for the target context length, or use published factors for a given model (e.g., Phi-3 128k).
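The shape of such a search can be sketched as a mutation-plus-selection loop. This is a toy under stated assumptions, not the code from microsoft/LongRoPE: `toy_perplexity` is a hypothetical stand-in for measuring model perplexity on long texts, and the population size, mutation step, and factor bounds are arbitrary.

```python
import random


def toy_perplexity(factors, target):
    """Hypothetical stand-in objective. A real search would evaluate the
    model's perplexity on long documents with these rescale factors."""
    return sum((f - t) ** 2 for f, t in zip(factors, target))


def mutate(factors, rng, step=0.5, low=1.0, high=16.0):
    """Perturb every factor slightly and clamp it to the allowed range."""
    return [min(high, max(low, f + rng.uniform(-step, step))) for f in factors]


def evolutionary_search(fitness, n_dims, population_size=16, n_parents=4,
                        iterations=40, seed=0):
    """Keep the best candidates each round and refill the population with mutants."""
    rng = random.Random(seed)
    population = [[rng.uniform(1.0, 16.0) for _ in range(n_dims)]
                  for _ in range(population_size)]
    history = []  # best fitness seen at the start of each iteration
    for _ in range(iterations):
        population.sort(key=fitness)
        parents = population[:n_parents]  # selection: lower is better
        history.append(fitness(parents[0]))
        children = [mutate(rng.choice(parents), rng)
                    for _ in range(population_size - n_parents)]
        population = parents + children
    return min(population, key=fitness), history


hidden_target = [1.0, 2.0, 4.0, 8.0]  # made-up optimum for the toy objective
best, history = evolutionary_search(
    lambda f: toy_perplexity(f, hidden_target), n_dims=4)
```

Because the parents survive each round, the best score never gets worse. The real cost sits in the fitness call, since every candidate requires a forward pass over long sequences; this is what the search budget hyperparameter controls.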

Missing short-context readjustment · High

Without separate factors for sequences shorter than the pretraining length, LongRoPE noticeably reduces quality on short prompts. The effect resembles the short-context degradation of static YaRN scaling, but is more pronounced because the scaling is more aggressive.

Fix: Apply two sets of factors (short/long context) following the readjustment procedure described in the paper.
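At inference time, the two-set approach comes down to choosing a factor vector from the sequence length. The sketch below assumes the switch happens exactly at the pretraining length; the function name and the example values are hypothetical, and a given implementation may place the boundary differently.

```python
def select_factors(seq_len, original_max_len, short_factors, long_factors):
    """Pick the readjusted short-context factors when the sequence fits in the
    pretraining window, otherwise the long-context factors (assumed boundary)."""
    if seq_len <= original_max_len:
        return short_factors
    return long_factors


# Made-up factor vectors for a toy model pretrained on 4096 tokens.
short_factors = [1.0, 1.0, 1.1, 1.2]
long_factors = [1.0, 2.0, 8.0, 32.0]

for_prompt = select_factors(512, 4096, short_factors, long_factors)
for_document = select_factors(100_000, 4096, short_factors, long_factors)
```

Both vectors must have one entry per RoPE dimension pair, so they can be swapped without touching the rest of the attention code.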

Jumping to 2M without the progressive 256k stage · Medium

The paper shows that searching factors directly for 2M without a prior stabilising fine-tuning at 256k worsens quality and destabilises the search.

Fix: Keep the two-stage strategy: search + fine-tune to 256k, then search to 2048k without further fine-tuning.
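The arithmetic behind the two stages is easy to check. The snippet below only restates the lengths given in this entry (4k pretraining window, 256k, 2048k); the dictionary layout is an illustrative assumption, not a configuration format from the paper or the repo.

```python
K = 1024

# Two-stage schedule as described above; field names are hypothetical.
stages = [
    {"step": "search + fine-tune", "from_len": 4 * K, "to_len": 256 * K, "fine_tune": True},
    {"step": "search only", "from_len": 256 * K, "to_len": 2048 * K, "fine_tune": False},
]

# Extension ratio handled by each stage.
ratios = [stage["to_len"] // stage["from_len"] for stage in stages]  # [64, 8]

# Overall extension relative to the 4k pretraining window.
total_ratio = stages[-1]["to_len"] // stages[0]["from_len"]  # 512
```

The second search therefore only has to bridge an 8x gap on a model that is already stable at 256k, not the full 512x gap in one step.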

Evolution

Original paper · 2024 · arXiv:2402.13753 (later ICML 2024) · Yiran Ding

LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens

Yiran Ding, Li Lyna Zhang, Chengruidong Zhang, Yuanyuan Xu, Ningxin Zheng, Jiahang Xu, Fan Yang, Mao Yang

2021

RoPE (Rotary Position Embeddings)

Su et al. introduce RoPE — the foundation for the whole family of context extension methods, including Position Interpolation, NTK-aware, YaRN, and LongRoPE.

RoPE (concept)

2023

Position Interpolation and NTK-aware

Position Interpolation (Meta) and NTK-aware interpolation (community) show that uniformly scaling positions, or rescaling the RoPE base, extends context with little or no fine-tuning.

2023

YaRN — NTK-by-parts + temperature

YaRN combines NTK-by-parts (different interpolation regimes per frequency band) with attention temperature scaling and becomes the de-facto standard for 64k–128k.

YaRN (concept)

2024

LongRoPE — Microsoft paper

Inflection point

Microsoft publishes LongRoPE (arXiv:2402.13753). Two non-uniformities (per-dimension and per-token), evolutionary search, progressive 256k → 2048k strategy, and short-context readjustment — the first method to reach a >2M token context window.

LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens (paper)

2024

Phi-3 / Phi-3.5 with LongRoPE

Microsoft releases Phi-3 Mini 128k and Phi-3.5 Mini 128k models, in which LongRoPE is the official method to extend the context from 4k to 128k — the first wide production deployment.

2024

ICML 2024 acceptance

LongRoPE is published at ICML 2024.

Hyperparameters (configurable axes)

Target context length · Critical

Target context window length after extension. The original paper demonstrates scaling from 4k to 2048k (×512).

128k

256k

2048k

Per-dimension RoPE rescale factors · Critical

Non-uniform vector of rescaling factors, one per RoPE dimension, found by the evolutionary search. This is the main novelty of LongRoPE over YaRN/NTK-aware, which set per-dimension scaling by formula.

Initial token rescale (n̂) · High

Number of initial token positions (n̂) that receive less interpolation than later positions; in the paper these positions keep their original RoPE values. This is the second non-uniformity, found experimentally.

Evolutionary search budget · Medium

Number of evolutionary algorithm iterations and population size. Determines the offline cost of the factor search.

Short-context readjust factors · High

Separate set of RoPE factors used when the sequence is shorter than the pretraining length, to preserve quality on typical short prompts.
