In a standard Transformer, a positional encoding is added to each input embedding, and attention logits are computed from query·key dot products alone. ALiBi (Attention with Linear Biases) removes positional embeddings; instead, a static bias of the form -m·|i-j| is added to the attention logit matrix before the softmax, where i and j are the query and key positions and m is a head-specific constant slope. The slopes form a geometric sequence (for 8 heads: 1/2, 1/4, 1/8, …, 1/256), so different heads attend over different context ranges: heads with steep slopes (large m) focus on nearby tokens, while heads with shallow slopes (small m) can reach across most of the context. The bias is fixed, with no learned position parameters. Because the bias is well-defined for any distance |i-j|, a model trained on sequences of length L can be run on inputs of length 2L, 4L, and longer without modification, and the original paper reports that quality holds up well in that regime.
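The construction above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function names `alibi_slopes` and `alibi_bias` are hypothetical, and the slope formula assumes the number of heads is a power of two.

```python
import numpy as np

def alibi_slopes(n_heads: int) -> np.ndarray:
    """Geometric slopes 2^(-8/n), 2^(-16/n), ... (assumes n_heads is a power of two)."""
    h = np.arange(1, n_heads + 1)
    return 2.0 ** (-8.0 * h / n_heads)

def alibi_bias(n_heads: int, seq_len: int) -> np.ndarray:
    """Bias tensor of shape (n_heads, seq_len, seq_len) with entries -m_h * |i - j|."""
    pos = np.arange(seq_len)
    distance = np.abs(pos[:, None] - pos[None, :])  # |i - j| for every query/key pair
    return -alibi_slopes(n_heads)[:, None, None] * distance[None, :, :]

slopes = alibi_slopes(8)              # 1/2, 1/4, ..., 1/256
bias = alibi_bias(n_heads=8, seq_len=5)
print(slopes)
print(bias[0])                        # head with the steepest slope
```

The bias would then be added to the scaled query·key logits before the softmax; in a causal model the entries with j > i are masked out anyway, so only the lower triangle matters.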
Classical positional encodings (sinusoidal, learned, and to a lesser extent RoPE) extrapolate poorly to sequences longer than those seen in training. Positions outside the pretraining range are "new" to the model, and quality drops sharply. Other remedies require either training on longer sequences or an additional adaptation step, typically fine-tuning (Position Interpolation, YaRN). ALiBi addresses the problem structurally: the bias is a function of relative distance, not absolute position, so it is defined for sequences of any length without modifying the model.
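The "function of distance" property can be checked directly. The sketch below uses a hypothetical helper `distance_bias` and an arbitrary slope of 0.5; it only illustrates that the bias at a longer length contains the training-length bias unchanged, and it says nothing about model quality at that length.

```python
import numpy as np

def distance_bias(slope: float, seq_len: int) -> np.ndarray:
    """Single-head ALiBi-style bias: -slope * |i - j|."""
    pos = np.arange(seq_len)
    return -slope * np.abs(pos[:, None] - pos[None, :])

train_len = 4
short = distance_bias(0.5, train_len)
longer = distance_bias(0.5, 2 * train_len)

# The bias seen during training is the leading block of the longer bias ...
same_leading_block = np.array_equal(longer[:train_len, :train_len], short)
# ... and the same pattern reappears at any offset, because only |i - j| matters.
same_shifted_block = np.array_equal(longer[train_len:, train_len:], short)
print(same_leading_block, same_shifted_block)
```

By contrast, a learned absolute position table of size L simply has no row for position L or beyond.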
ALiBi by design replaces positional embeddings. Keeping sinusoidal/learned/RoPE together with ALiBi produces a double position signal and worsens results.
The 2^(-8/n) geometric pattern assumes that the number of heads n is a power of two. For other values of n, the authors provide a specific extension of the formula; skipping it lowers quality.
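One such extension, as it commonly appears in implementations derived from the authors' reference code, takes the slopes for the nearest lower power of two and fills the remaining heads with every other slope from the sequence for twice that power. The sketch below follows that scheme; treat the exact interleaving as an assumption and verify it against the implementation you need to match. The function names are hypothetical.

```python
import math

def power_of_two_slopes(n: int) -> list[float]:
    """Slopes 2^(-8/n), 2^(-16/n), ..., valid when n is a power of two."""
    start = 2.0 ** (-8.0 / n)
    return [start ** (i + 1) for i in range(n)]

def alibi_slopes(n_heads: int) -> list[float]:
    """Slopes for any positive head count (interleaving scheme assumed, see lead-in)."""
    if n_heads & (n_heads - 1) == 0:          # n_heads is a power of two
        return power_of_two_slopes(n_heads)
    closest = 2 ** math.floor(math.log2(n_heads))
    # Borrow every other slope from the sequence for 2 * closest heads.
    extra = power_of_two_slopes(2 * closest)[0::2][: n_heads - closest]
    return power_of_two_slopes(closest) + extra

print(alibi_slopes(12))
```

For 12 heads this yields the eight slopes for n = 8 followed by four slopes taken from the n = 16 sequence.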
ALiBi yields strong extrapolation "for free", but on long-context benchmarks (NIAH, RULER, LongBench), models based on RoPE with YaRN or LongRoPE typically score higher at comparable scale.
The original Transformer introduces sinusoidal positional encoding as an additive embedding. It extrapolates poorly to lengths beyond those seen in training, which makes it the starting point for all later alternatives.
Rotary Position Embeddings (RoPE) are introduced as an alternative that encodes positions by rotating pairs of dimensions. They perform better than sinusoidal encodings but still extrapolate only to a limited degree without PI/YaRN-style modifications.
Press, Smith, and Lewis publish ALiBi (arXiv:2108.12409). They show that a static linear bias on the attention logits can replace positional embeddings and yields strong length extrapolation ("train short, test long").
ALiBi is accepted at ICLR 2022. The idea starts being adopted in new open LLMs.
BLOOM-176B — the first large open multilingual LLM — chooses ALiBi as its positional encoding, popularising the method in the open-source community.
MosaicML releases the MPT family (7B/30B) with ALiBi, marketing the "context length flexibility" capability. BloombergGPT-50B also relies on ALiBi. ALiBi becomes an established alternative to RoPE.
Most new large LLMs (Llama 2/3, Qwen, DeepSeek, Mistral) choose RoPE, often extended with scaling methods such as YaRN or LongRoPE, as the standard long-context path. ALiBi remains the choice mainly where simplicity of deployment and "free" extrapolation matter more than absolute benchmark quality.
Vector of bias slopes, one per attention head. In the original paper this is a geometric sequence whose first term and common ratio are both 2^(-8/n), where n is the number of heads, so head h (counting from 1) gets slope 2^(-8h/n).
Sequence length used during pretraining. Thanks to ALiBi extrapolation, the model performs well on sequences of 2L–4L without fine-tuning.
ALiBi by design REPLACES classical positional encoding (sinusoidal/learned/RoPE) — the model is trained WITHOUT positional embeddings. Mixing ALiBi with RoPE is not a standard configuration.
ALiBi only modifies the attention logits; attention remains a fully dense, deterministic mechanism without routing or conditional activation.
Adding the bias is a cheap vectorised operation on the attention logit matrix. No additional sequential dependencies — parallelism is identical to a standard Transformer.
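As an illustration of how little the computation changes, here is a minimal NumPy sketch of causal attention with an ALiBi-style bias. The function name `alibi_attention`, the tensor layout (heads, seq, dim), and the example slopes are assumptions made for this sketch; production code would normally use a fused attention kernel instead.

```python
import numpy as np

def alibi_attention(q, k, v, slopes):
    """Causal attention with a linear distance bias.

    q, k, v: arrays of shape (heads, seq, dim); slopes: array of shape (heads,).
    Returns (output, attention_weights).
    """
    n_heads, seq_len, dim = q.shape
    logits = q @ k.transpose(0, 2, 1) / np.sqrt(dim)        # (heads, seq, seq)
    pos = np.arange(seq_len)
    distance = np.abs(pos[:, None] - pos[None, :])           # |i - j|
    logits = logits - slopes[:, None, None] * distance       # single broadcasted add
    future = pos[None, :] > pos[:, None]                     # True where key j > query i
    logits = np.where(future, -np.inf, logits)               # causal mask
    logits = logits - logits.max(axis=-1, keepdims=True)     # numerically stable softmax
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((2, 4, 8)) for _ in range(3))
out, weights = alibi_attention(q, k, v, slopes=np.array([0.5, 0.25]))
print(out.shape, weights.sum(axis=-1))
```

The only ALiBi-specific step is the broadcasted subtraction; everything else is ordinary scaled dot-product attention, so no new sequential dependency is introduced.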
ALiBi is a purely algorithmic attention modification — adding a static bias to the logit matrix. It requires no special kernels or hardware instructions.
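ALiBi works well with FlashAttention, which supports the ALiBi bias out of the box. Because the bias depends only on the head slopes and token distances, it can be precomputed once and reused across all layers that share the same number of heads.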