Architecture

PI

2023 · Active · Published: 9 June 2026 · Updated: 9 June 2026

Key innovation

The first practical "cheap context extension" method for RoPE-based LLMs: it linearly scales position indices down to fit within the pretraining range and recovers quality with a short fine-tune (~1000 steps) instead of full long-context pretraining.

How it works

In standard RoPE, position pos rotates pairs of embedding dimensions by the angle pos · ω_i, with frequency ω_i = 1 / base^(2i/d). Position Interpolation does not feed the original position into RoPE; it feeds a rescaled one: pos' = pos · L_pretrain / L_target = pos / s, where s = L_target / L_pretrain is the scale factor. The entire target sequence of length L_target is thereby compressed into the position range [0, L_pretrain) that the model knows from pretraining. Because pos' can be fractional while the model has only seen integer indices, quality drops without adaptation, so a short fine-tune is needed (Chen et al. report ~1000 steps for 2k → 8k, ~5000 for 2k → 32k) to stabilise attention on the new fractional positions. The operation is deterministic and adds no new learnable parameters; the fine-tune only updates the model's existing weights through standard backprop.
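The rescaling described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the reference implementation: the function names `rope_angles` and `apply_rope`, the default base of 10000, and the interleaved pairing of dimensions (2i, 2i+1) are assumptions made for the example, and real libraries may lay out the pairs differently.

```python
import math

def rope_angles(pos, dim, base=10000.0, scale=1.0):
    """Rotation angles for one position; scale > 1 applies Position Interpolation."""
    scaled_pos = pos / scale  # pos' = pos / s
    # omega_i = 1 / base^(2i/d), one frequency per pair of dimensions
    return [scaled_pos / base ** (2 * i / dim) for i in range(dim // 2)]

def apply_rope(vec, pos, base=10000.0, scale=1.0):
    """Rotate each pair (vec[2i], vec[2i+1]) by the angle of frequency i (assumed layout)."""
    out = []
    for i, theta in enumerate(rope_angles(pos, len(vec), base, scale)):
        x, y = vec[2 * i], vec[2 * i + 1]
        c, sn = math.cos(theta), math.sin(theta)
        out += [x * c - y * sn, x * sn + y * c]
    return out

L_PRETRAIN, L_TARGET = 2048, 8192
S = L_TARGET / L_PRETRAIN          # scale factor s = 4.0
last_scaled = (L_TARGET - 1) / S   # largest rescaled index, stays below L_PRETRAIN

# With s = 4, target position 8 gets exactly the angles of pretraining position 2.
same = rope_angles(8, dim=8, scale=S) == rope_angles(2, dim=8)
```

With these numbers, target positions 0..8191 map to 0, 0.25, 0.5, ... and never leave the pretraining range; only every fourth target position lands on an integer index the model saw during pretraining, which is why the fine-tune is needed.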

Problem solved

RoPE-based LLMs degrade sharply once a position exceeds the pretraining length: the RoPE rotations for such positions were never seen in training, and attention behaviour collapses. Full long-context pretraining is prohibitively expensive. PI was the first to show that the window can be extended cheaply, by 4× (2k → 8k) and up to 16× (2k → 32k), with a few thousand fine-tune steps on publicly available hardware.

Components

Position rescaling (pos → pos/s) · Position gate: compresses the target range into the pretraining range

The only real component of the method — dividing the position index by scale factor s before computing the RoPE rotation. Fully deterministic.

IN: Original position indices in the target window [0, L_target-1].

OUT: Compressed indices in the range [0, L_pretrain), typically fractional.

Standard PI (Chen et al.): Linear pos → pos/s with ~1000–5000 fine-tune steps.

NTK-aware (alternative): Instead of scaling indices, modifies the RoPE base; works without fine-tuning.

YaRN (successor): NTK-by-parts + attention-temperature scaling; higher quality than PI after fine-tune.


Implementation

Reference implementations

Hugging Face Transformers — rope_scaling: "linear"

Python · Hugging Face

llama.cpp — RoPE linear scaling

C/C++ · ggerganov and community

vLLM — rope_scaling type "linear"

Python / CUDA · vLLM project

Together AI LLaMA-2-7B-32K-Instruct (example PI fine-tuned checkpoint)

Python · Together AI

Implementation pitfalls

No fine-tuning after enabling PI · Critical

PI without fine-tuning yields poor results: fractional positions are new to the model, and quality clearly drops. NTK-aware works without fine-tuning; PI does not.

Fix: Fine-tune the model on a small budget of long sequences (~1000–5000 steps) per Chen et al., or use NTK-aware as a no-fine-tune alternative.

Uniformly compressing high frequencies · Medium

PI scales all RoPE dimensions uniformly, including the high-frequency ones that encode local relations. This hurts short-range positional precision and is why NTK-aware and YaRN yield better results.

Fix: For maximum long-context quality, prefer YaRN (NTK-by-parts + temperature) or LongRoPE.
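To make the uniform-compression issue concrete, the sketch below compares how much each RoPE frequency is slowed down under PI and under a base change. It is only an illustration: the helper names are hypothetical, and the NTK-aware formula used here, base' = base · s^(d/(d−2)), is the commonly cited form of bloc97's proposal and is an assumption not taken from this page.

```python
def pi_freq_ratio(i, dim, scale):
    """PI divides every position by s, so every frequency is effectively scaled by 1/s."""
    return 1.0 / scale

def ntk_freq_ratio(i, dim, scale, base=10000.0):
    """Effective frequency ratio when the base is enlarged (assumed NTK-aware formula)."""
    new_base = base * scale ** (dim / (dim - 2))
    # omega_i' / omega_i = (base / new_base)^(2i/d)
    return (base / new_base) ** (2 * i / dim)

DIM, SCALE = 128, 4.0
rows = [
    (i, pi_freq_ratio(i, DIM, SCALE), ntk_freq_ratio(i, DIM, SCALE))
    for i in (0, DIM // 4, DIM // 2 - 1)
]
for i, pi_ratio, ntk_ratio in rows:
    print(f"pair {i:3d}: PI x{pi_ratio:.3f}   base change x{ntk_ratio:.3f}")
```

Under these assumptions, PI slows the fastest pair (i = 0) by the full factor 1/s, while the base change leaves it untouched and only approaches 1/s for the slowest pair. That difference is the short-range precision loss the pitfall describes.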

Loading a PI checkpoint with the wrong scale factor at inference · High

If the model was fine-tuned with s=4 but inference uses s=1 or s=8, attention breaks because the fractional positions no longer match what the model learned.

Fix: Always use a `rope_scaling.factor` that matches the original checkpoint config.
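A guard like the following can catch a mismatched factor before inference starts. It is a minimal sketch that operates on a plain dict standing in for a loaded model config; the function name `check_rope_scaling` is hypothetical, and the key names (`rope_scaling`, `type` or `rope_type`, `factor`) are assumed to follow the Hugging Face style mentioned above, so they should be verified against the library version in use.

```python
def check_rope_scaling(config, trained_factor):
    """Return the configured factor, or raise if it differs from the fine-tuning factor.

    `config` is a plain dict with an assumed shape, not a real library object.
    """
    scaling = config.get("rope_scaling") or {}
    kind = scaling.get("rope_type", scaling.get("type"))
    if kind != "linear":
        raise ValueError(f"expected linear RoPE scaling (PI), got {kind!r}")
    factor = float(scaling.get("factor", 1.0))
    if abs(factor - trained_factor) > 1e-6:
        raise ValueError(
            f"rope_scaling.factor={factor} does not match fine-tuning factor {trained_factor}"
        )
    return factor

# Example: a checkpoint fine-tuned with s = 4
cfg = {"rope_scaling": {"type": "linear", "factor": 4.0}}
factor = check_rope_scaling(cfg, trained_factor=4.0)
```

A config with no `rope_scaling` entry (that is, s=1) or with a different factor raises `ValueError` instead of silently producing broken attention.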

Evolution

Original paper · 2023 · arXiv:2306.15595 (Meta AI / FAIR) · Shouyuan Chen

Extending Context Window of Large Language Models via Positional Interpolation

Shouyuan Chen, Sherman Wong, Liangjian Chen, Yuandong Tian

2021

RoPE (Su et al.) — the foundation

Rotary Position Embeddings — position encoding via rotation of dimension pairs. All subsequent context-extension methods modify how RoPE operates.

RoPE (concept)

2023

kaiokendev — early PI intuition in the "SuperHOT" blog

Independently and slightly earlier, blogger kaiokendev publishes "SuperHOT": a simple rescaling of RoPE positions (equivalently, lowering the RoPE frequencies) that makes Llama work at 8k. Chen et al. acknowledge it as concurrent work.

2023

Position Interpolation — Meta paper

Inflection point

Chen, Wong, Chen, Tian (Meta AI) publish PI (arXiv:2306.15595). They formalise the idea of linear position scaling + short fine-tune, demonstrating good quality up to 32k tokens. This is the first academic paper on "cheap context extension" — triggering an avalanche of follow-up methods.

Extending Context Window of Large Language Models via Positional Interpolation (paper)

2023

NTK-aware Interpolation — community response

Reddit user bloc97 proposes NTK-aware interpolation, which modifies the RoPE base rather than the position indices and works without fine-tuning. It is the first improvement over PI and shows that PI does not exploit RoPE's full potential.

NTK-aware (concept)

2023

YaRN — NTK-by-parts + temperature

Peng et al. combine NTK-by-parts (per-band) with attention-temperature scaling. Requires a short fine-tune, but yields higher quality than PI and NTK-aware. Becomes the standard for 64k–128k.

YaRN (concept)

2024

LongRoPE — 2M+ tokens via evolutionary search

Microsoft shows that hand-crafted formulas (PI, NTK-aware, YaRN) can be surpassed with evolutionary search of non-uniform factors per dimension and per position, reaching >2M tokens.

LongRoPE (concept)
