Architecture

ALiBi

2021 · Active · Published: 9 June 2026 · Updated: 9 June 2026

Key innovation

Replaces classical positional encoding (sinusoidal, learned, RoPE) with a static, non-learned linear bias added to attention logits, proportional to the distance between query and key — enabling sequence-length extrapolation far beyond the training length ("train short, test long").

How it works

In a standard Transformer, a positional encoding is added to each input embedding, and attention logits are computed from query·key dot products. ALiBi removes positional embeddings entirely. Instead, a static bias of the form -m·|i-j| is added to the attention logit matrix, where i and j are the query and key positions and m is a constant slope specific to each head. The slopes form a geometric sequence (for 8 heads: 1/2, 1/4, 1/8, …, 1/256), so different heads cover different context ranges: heads with steep slopes penalise distant keys heavily and attend locally, while heads with shallow slopes can reach across the whole context. The bias is fixed in advance and adds no learned position parameters. Because the bias is defined for any distance |i-j|, a model trained on sequences of length L can be run on 2L, 4L, and beyond without new position parameters, and the original paper reports that quality holds up at such lengths.
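The mechanism above can be sketched in a few lines of plain Python. This is a minimal illustration, not the reference implementation: the function names are hypothetical, the slope formula 2^(-8k/n) is only valid here for power-of-two head counts, and the causal mask is an assumption about a decoder-only setup.

```python
import math

def alibi_slopes(n_heads):
    """Geometric slopes 2^(-8k/n) for k = 1..n (power-of-two head counts only)."""
    if n_heads < 1 or (n_heads & (n_heads - 1)) != 0:
        raise ValueError("this sketch only handles power-of-two head counts")
    return [2.0 ** (-8.0 * k / n_heads) for k in range(1, n_heads + 1)]

def alibi_bias(slope, seq_len):
    """bias[i][j] = -slope * |i - j| for query position i and key position j."""
    return [[-slope * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]

def causal_attention_weights(logits, slope):
    """Add the ALiBi bias to raw logits, apply a causal mask, softmax each row."""
    n = len(logits)
    bias = alibi_bias(slope, n)
    weights = []
    for i in range(n):
        # Causal mask (assumed): query i only sees keys 0..i.
        row = [logits[i][j] + bias[i][j] for j in range(i + 1)]
        peak = max(row)  # subtract the max for numerical stability
        exps = [math.exp(x - peak) for x in row]
        total = sum(exps)
        weights.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return weights

slopes = alibi_slopes(8)                      # 1/2, 1/4, ..., 1/256
flat_logits = [[0.0] * 4 for _ in range(4)]   # identical content scores everywhere
steep = causal_attention_weights(flat_logits, slopes[0])
shallow = causal_attention_weights(flat_logits, slopes[-1])
```

With identical content scores, the head with the steepest slope concentrates its weight on the most recent key, while the head with the shallowest slope spreads it almost evenly over the visible context.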

Problem solved

Classical positional encodings (sinusoidal, learned, and to a lesser extent RoPE) extrapolate poorly to sequences longer than those seen in training. Positions outside the training range are new to the model, and quality drops sharply. Other remedies require either training on longer sequences or additional fine-tuning with rescaled positions (Position Interpolation, YaRN). ALiBi addresses the problem structurally: the bias is a function of relative distance, not absolute position, so it is defined for sequences of any length without modifying the model.
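The structural argument can be illustrated with a small sketch. The names, the slope value, and the toy lengths below are assumptions chosen for illustration only. A learned position table has a fixed number of rows, whereas the ALiBi bias for a longer sequence simply extends the bias used at training length.

```python
def alibi_bias(slope, seq_len):
    """bias[i][j] = -slope * |i - j|; depends only on distance, not on seq_len."""
    return [[-slope * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]

train_len, test_len = 4, 16   # hypothetical training and inference lengths
slope = 0.25                  # hypothetical slope for one head

short_bias = alibi_bias(slope, train_len)
long_bias = alibi_bias(slope, test_len)

# The bias seen in training is exactly the top-left block of the longer bias,
# so the positions the model already knows are treated identically.
prefix_matches = all(
    long_bias[i][:train_len] == short_bias[i] for i in range(train_len)
)

# A learned position table, by contrast, has no entry beyond the training length.
learned_table = [[0.0, 0.0] for _ in range(train_len)]
try:
    learned_table[test_len - 1]
    learned_has_entry = True
except IndexError:
    learned_has_entry = False
```

Being well-defined at any length is what the sketch shows. Whether quality actually holds at a given length is an empirical question that it does not answer.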

Implementation

Reference implementations

ofirpress/attention_with_linear_biases (official repo)

Python (PyTorch / Fairseq) · Ofir Press

Official

BLOOM (BigScience) — reference LLM using ALiBi

Python · BigScience / Hugging Face

MPT-7B / MPT-30B (MosaicML)

Python · MosaicML / Databricks

Implementation pitfalls

Combining ALiBi with a separate positional embedding · High

ALiBi by design replaces positional embeddings. Keeping sinusoidal/learned/RoPE together with ALiBi produces a double position signal and worsens results.

Fix: Completely remove positional encoding when ALiBi is enabled.
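One way to enforce this fix is a configuration guard that rejects the combination before a model is built. The sketch below is hypothetical: `PositionConfig` and its field names are invented for illustration and do not correspond to any particular library.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PositionConfig:
    """Hypothetical config object; field names are assumptions for this sketch."""
    use_alibi: bool = True
    # e.g. "sinusoidal", "learned", "rope"; None means no positional embedding.
    positional_embedding: Optional[str] = None

    def __post_init__(self):
        # ALiBi replaces positional embeddings, so the two must not be combined.
        if self.use_alibi and self.positional_embedding is not None:
            raise ValueError(
                "ALiBi is enabled: set positional_embedding=None "
                f"(got {self.positional_embedding!r})"
            )

config = PositionConfig(use_alibi=True)  # valid: ALiBi is the only position signal
```

Failing at configuration time makes the mistake visible immediately, where otherwise it would only show up later as degraded results.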

Wrong slopes for a non-standard number of heads · Medium

The geometric pattern, in which head k of n gets slope 2^(-8k/n), assumes n is a power of two. For other head counts the authors provide a specific extension of the formula; skipping it can lower quality.

Fix: Use slopes computed per the procedure in the official Press et al. repository.
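The sketch below follows our understanding of that procedure and should be checked against the official repository before use. For a head count that is not a power of two, it takes the slopes for the closest lower power of two and fills the remaining heads with every other slope from the sequence for twice that power. The function names are hypothetical.

```python
def slopes_power_of_two(n):
    """Slopes 2^(-8k/n) for k = 1..n; n must be a power of two."""
    start = 2.0 ** (-8.0 / n)
    return [start ** k for k in range(1, n + 1)]

def alibi_slopes(n_heads):
    """Slopes for any head count (assumed to mirror the official procedure)."""
    if n_heads < 1:
        raise ValueError("n_heads must be positive")
    if n_heads & (n_heads - 1) == 0:
        # Power of two: the plain geometric sequence applies.
        return slopes_power_of_two(n_heads)
    # Otherwise start from the closest lower power of two ...
    closest = 1 << (n_heads.bit_length() - 1)
    base = slopes_power_of_two(closest)
    # ... and borrow every other slope from the next power of two.
    extra = slopes_power_of_two(2 * closest)[0::2][: n_heads - closest]
    return base + extra

slopes_8 = alibi_slopes(8)    # 1/2, 1/4, ..., 1/256
slopes_12 = alibi_slopes(12)  # 8 base slopes plus 4 interleaved extras
```

The borrowed slopes fall between the base slopes, so the extra heads add intermediate context ranges without repeating the ones already covered.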

Assuming ALiBi always beats RoPE+YaRN/LongRoPE · Low

ALiBi yields strong extrapolation "for free", but on long-context benchmarks (NIAH, RULER, LongBench) RoPE + YaRN/LongRoPE based models typically score higher at comparable scale.

Fix: Choose ALiBi when simplicity and low cost of long-sequence inference are the priority. For maximum long-context quality, prefer RoPE + YaRN/LongRoPE.

Evolution

Original paper · 2021 · arXiv:2108.12409 (later ICLR 2022) · Ofir Press

Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation

Ofir Press, Noah A. Smith, Mike Lewis

2017

Sinusoidal positional encoding (Vaswani et al.)

The original Transformer introduces sinusoidal positional encoding as an additive embedding. It extrapolates poorly to sequences longer than those seen in training, which is the starting point for all later alternatives.

Transformer (concept)

2021

RoPE (Su et al.)

Rotary Position Embeddings are an alternative method that encodes positions by rotating pairs of dimensions. RoPE extrapolates better than sinusoidal encoding, but its extrapolation is still limited without PI/YaRN-style modifications.

RoPE (concept)

2021

ALiBi — Press et al. paper

Inflection point

Press, Smith, Lewis publish ALiBi (arXiv:2108.12409). They show that a static linear bias in attention replaces positional embeddings and yields strong length extrapolation ("train short, test long").

Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation (paper)

2022

ICLR 2022 acceptance

ALiBi is accepted at ICLR 2022. The idea starts being adopted in new open LLMs.

2022

BLOOM (BigScience) uses ALiBi

BLOOM-176B, one of the first large open multilingual LLMs, uses ALiBi as its positional encoding, popularising the method in the open-source community.

2023

MPT (MosaicML) and BloombergGPT — production deployments of ALiBi

MosaicML releases the MPT family (7B/30B) with ALiBi, marketing the "context length flexibility" capability. BloombergGPT-50B also relies on ALiBi. ALiBi becomes an established alternative to RoPE.

2024

RoPE dominance in newer models

Most new large LLMs (Llama 2/3, Qwen, DeepSeek, Mistral) choose RoPE, often extended with scaling methods such as YaRN or LongRoPE, as the standard long-context path. ALiBi is still chosen mainly where simplicity of deployment and "free" extrapolation matter more than absolute benchmark quality.
