LongRoPE introduces two non-uniformities: (1) per-dimension: each RoPE dimension gets its own interpolation factor rather than one shared value, and (2) per-token: the first tokens receive smaller scaling than later positions. The optimal values of these factors are searched offline with an evolutionary algorithm (mutation plus selection by perplexity on a sample of long texts) rather than derived from a formula. A progressive extension strategy follows: factors are first searched and the model is fine-tuned to stabilise at 256k, then the algorithm searches factors for 2048k without further fine-tuning. The final step is readjustment: on short contexts (4k/8k), separate, slightly modified RoPE factors are used, recovering the model's quality on typical short prompts.
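A minimal sketch of how the two non-uniformities could enter the RoPE angle computation, assuming the piecewise form in which positions below a threshold keep the original angles and later positions have each frequency divided by its own factor. The function name `longrope_angles`, the threshold name `n_start`, and the factor values are illustrative assumptions, not values from the paper or from any released model.

```python
import numpy as np

def longrope_angles(positions, head_dim, factors, n_start, base=10000.0):
    """Return rotation angles of shape (len(positions), head_dim // 2).

    factors: one rescale factor per RoPE frequency (hypothetical values here;
             in LongRoPE they would come from the offline search).
    n_start: positions below this threshold keep the unscaled RoPE angles
             (the per-token non-uniformity).
    """
    half = head_dim // 2
    factors = np.asarray(factors, dtype=np.float64)
    assert factors.shape == (half,), "need one factor per RoPE frequency"
    # Standard RoPE frequencies: theta_i = base^(-2i / head_dim)
    theta = base ** (-2.0 * np.arange(half) / head_dim)
    pos = np.asarray(positions, dtype=np.float64)[:, None]  # shape (n, 1)
    plain = pos * theta             # original RoPE angles
    scaled = pos * theta / factors  # per-dimension interpolated angles
    # Early positions use the plain angles, later ones the scaled angles
    return np.where(pos < n_start, plain, scaled)

# Usage with made-up factors for a tiny head dimension
factors = np.array([1.0, 1.5, 4.0, 8.0])
angles = longrope_angles(np.arange(16), head_dim=8, factors=factors, n_start=4)
cos, sin = np.cos(angles), np.sin(angles)  # fed to the usual RoPE rotation
```

Only the angle table changes; the rotation applied to queries and keys is the same as in standard RoPE.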
Earlier RoPE interpolation methods (Position Interpolation, NTK-aware, YaRN) derive their scaling from a fixed formula and treat all token positions alike; Position Interpolation also scales every dimension identically. Empirically this is too coarse: different RoPE frequencies and different positions (especially the first tokens) respond differently. Attempting extreme extension (hundreds of thousands or millions of tokens) with these methods leads to a sharp rise in perplexity and complete attention collapse over long contexts.
Attempting to reproduce LongRoPE with hand-picked or uniform factors instead of searched ones effectively reduces it to YaRN- or PI-style interpolation and removes the main benefit (scaling to millions of tokens).
Without separate factors for sequences shorter than the pretraining length, LongRoPE noticeably reduces quality on short prompts, analogous to static YaRN but stronger due to the more aggressive scaling.
The paper shows that searching factors directly for 2M without a prior stabilising fine-tuning at 256k worsens quality and destabilises the search.
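The short-prompt pitfall above can be pictured as keeping two factor sets and choosing between them by sequence length. The following is a small sketch under that assumption; `select_factors`, `short_factors`, `long_factors`, and `original_max_len` are hypothetical names, and the numbers are placeholders rather than searched values.

```python
import numpy as np

def select_factors(seq_len, original_max_len, short_factors, long_factors):
    """Pick the per-dimension factor set for the current sequence length.

    Sequences that fit in the pretraining window use the readjusted
    short-context factors; longer sequences use the long-context factors.
    """
    if seq_len <= original_max_len:
        return np.asarray(short_factors, dtype=np.float64)
    return np.asarray(long_factors, dtype=np.float64)

# Placeholder factor sets for a head with four RoPE frequencies
short_factors = [1.0, 1.0, 1.1, 1.2]
long_factors = [1.0, 2.0, 8.0, 32.0]

for_short_prompt = select_factors(2048, 4096, short_factors, long_factors)
for_long_prompt = select_factors(100_000, 4096, short_factors, long_factors)
```

The choice is a single comparison per forward pass, so it adds no meaningful inference cost.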
Su et al. introduce RoPE, the foundation for the whole family of context extension methods, including Position Interpolation, NTK-aware, YaRN, and LongRoPE.
Position Interpolation (Meta) and NTK-aware interpolation (community) show that uniformly scaling positions, or rescaling the RoPE base, extends context with only a short fine-tuning run.
YaRN combines NTK-by-parts (different interpolation regimes per frequency band) with attention temperature scaling and becomes the de-facto standard for 64k–128k contexts.
Microsoft publishes LongRoPE (arXiv:2402.13753): two non-uniformities (per-dimension and per-token), evolutionary search, a progressive 256k → 2048k strategy, and short-context readjustment. It is the first method to reach a >2M token context window.
Microsoft releases Phi-3 Mini 128k and Phi-3.5 Mini 128k models, in which LongRoPE is the official method to extend the context from 4k to 128k. This is the first wide production deployment.
LongRoPE is published at ICML 2024 as a major contribution to the problem of extreme LLM context extension.
Target context window length after extension. The original paper demonstrates scaling from 4k to 2048k (×512).
Non-uniform vector of rescaling factors, one per RoPE dimension, found by the evolutionary search. The main novelty of LongRoPE over YaRN/NTK-aware.
Number of initial tokens for which a smaller position scaling is applied than for later positions. The second non-uniformity discovered experimentally.
Number of evolutionary algorithm iterations and population size. Determines the offline cost of the factor search.
Separate set of RoPE factors used when the sequence is shorter than the pretraining length, to preserve quality on typical short prompts.
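To make the search-related parameters above concrete, here is a toy sketch of an evolutionary loop with mutation and selection only. It is not the paper's algorithm: `evolutionary_search` and its arguments are hypothetical, the fitness is a stand-in for perplexity on long texts, and seeding the population with a uniform factor vector is a choice made for this example.

```python
import numpy as np

def evolutionary_search(fitness, dim, target_scale, population_size=16,
                        iterations=20, n_parents=4, mutation_std=0.1, seed=0):
    """Minimise `fitness` over factor vectors in [1, target_scale]^dim."""
    rng = np.random.default_rng(seed)
    # Initial population: random factors within the allowed range
    population = rng.uniform(1.0, target_scale, size=(population_size, dim))
    # Seed one individual with uniform scaling (assumption of this sketch)
    population[0] = target_scale
    for _ in range(iterations):
        scores = np.array([fitness(ind) for ind in population])
        # Selection: keep the individuals with the lowest fitness values
        parents = population[np.argsort(scores)[:n_parents]]
        # Mutation: perturb randomly chosen parents multiplicatively
        idx = rng.integers(0, n_parents, size=population_size - n_parents)
        noise = rng.normal(0.0, mutation_std, size=(len(idx), dim))
        children = np.clip(parents[idx] * np.exp(noise), 1.0, target_scale)
        # Elitism: parents survive unchanged next to their children
        population = np.vstack([parents, children])
    scores = np.array([fitness(ind) for ind in population])
    best = int(np.argmin(scores))
    return population[best], float(scores[best])

# Stand-in fitness: squared distance to a made-up "ideal" factor vector.
# A real search would evaluate model perplexity on long documents instead.
ideal = np.array([1.0, 2.0, 4.0, 8.0])
toy_fitness = lambda f: float(np.sum((f - ideal) ** 2))
best_factors, best_score = evolutionary_search(toy_fitness, dim=4, target_scale=8.0)
```

The offline cost grows with `population_size * iterations` fitness evaluations, and each evaluation in a real search is a long-context perplexity run, which is why these two parameters dominate the search budget.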
LongRoPE is a positional-encoding modification within a dense Transformer. There is no routing or conditional activation.
LongRoPE does not change the cost of a single attention operation โ it only modifies how RoPE rotations are computed. The factor search runs offline, once before model deployment, and does not affect inference cost.
The runtime RoPE modification is purely algorithmic and works wherever a standard RoPE-Transformer works.
The offline evolutionary search requires running perplexity evaluations on long contexts, in practice on GPU clusters with FlashAttention and ≥80 GB memory. Long-context inference typically relies on PagedAttention/FlashAttention.