Architecture

Linear Attention

2020 · Active · Published: 7 June 2026 · Updated: 7 June 2026

Key innovation

Replaces the costly softmax(QKᵀ)V computation with a kernel approximation φ(Q)·(φ(K)ᵀV). By associativity of matrix multiplication, this reduces complexity from O(n²·d) to O(n·d²) for sequence length n and head dimension d, and it enables autoregressive inference as a recurrent update whose memory does not grow with n.
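The reordering can be checked numerically. The following is a minimal sketch in plain Python, assuming the ELU+1 feature map and tiny made-up inputs; the function names are illustrative and do not come from any library. It compares the explicit n×n form with the reordered form that builds φ(K)ᵀV once.

```python
import math

def elu_plus_one(x):
    """Feature map phi(x) = ELU(x) + 1, element-wise; the result is always > 0."""
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def quadratic_kernel_attention(Q, K, V):
    """Reference form: builds every pairwise weight phi(q_i) . phi(k_j)."""
    fq = [elu_plus_one(q) for q in Q]
    fk = [elu_plus_one(k) for k in K]
    out = []
    for q in fq:
        w = [dot(q, k) for k in fk]            # one row of the n x n matrix
        total = sum(w)
        out.append([sum(wj * v[c] for wj, v in zip(w, V)) / total
                    for c in range(len(V[0]))])
    return out

def linear_attention(Q, K, V):
    """Reordered form: phi(K)^T V and sum_j phi(k_j) are computed once."""
    fq = [elu_plus_one(q) for q in Q]
    fk = [elu_plus_one(k) for k in K]
    d, d_v = len(fk[0]), len(V[0])
    KV = [[sum(k[a] * v[c] for k, v in zip(fk, V)) for c in range(d_v)]
          for a in range(d)]                    # phi(K)^T V, shape d x d_v
    z = [sum(k[a] for k in fk) for a in range(d)]   # sum_j phi(k_j)
    return [[sum(q[a] * KV[a][c] for a in range(d)) / dot(q, z)
             for c in range(d_v)] for q in fq]

# Hypothetical toy inputs: n = 3 tokens, d = 2, d_v = 2.
Q = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]]
K = [[1.0, 0.0], [-0.5, 0.5], [0.3, -1.2]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
ref = quadratic_kernel_attention(Q, K, V)
fast = linear_attention(Q, K, V)
```

The two results should agree up to floating-point rounding. Only the second function avoids materialising the n×n weight matrix. A practical implementation would use a tensor library rather than nested lists.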

How it works

1) Choose a feature map φ(·) (e.g. ELU+1, cosine, FAVOR+ orthogonal random features). Non-negativity is important: it keeps every attention weight φ(q)ᵀφ(k) and the normaliser non-negative.

2) Instead of computing A = softmax(QKᵀ) and then A·V, compute φ(K)ᵀV (shape d×d) once, then φ(Q) · (φ(K)ᵀV).

3) Normalise each output row i by φ(q_i)ᵀ Σ_j φ(k_j).

4) In the autoregressive regime, maintain a cumulative state S_t = S_{t−1} + φ(k_t)v_tᵀ and z_t = z_{t−1} + φ(k_t); the output is y_t = (φ(q_t)ᵀ S_t) / (φ(q_t)ᵀ z_t).

5) Training uses a parallel (chunkwise / blockwise) form to exploit GPUs and keep parallelism along the sequence dimension.
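Steps 4 and 5 can be sketched side by side. The code below is an illustrative plain-Python version, assuming ELU+1 as φ and a hypothetical chunk size; it is not how any particular library implements its kernels. The recurrent function processes one token at a time. The chunkwise function handles tokens inside a chunk with causally masked pairwise weights and carries S and z across chunks.

```python
import math

def phi(x):
    """ELU+1 feature map (one of the choices named in step 1)."""
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def recurrent_attention(Q, K, V):
    """Step 4: token by token, carrying S (d x d_v) and z (d)."""
    d, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d)]
    z = [0.0] * d
    out = []
    for q, k, v in zip(Q, K, V):
        fq, fk = phi(q), phi(k)
        for a in range(d):
            z[a] += fk[a]                      # z_t = z_{t-1} + phi(k_t)
            for c in range(d_v):
                S[a][c] += fk[a] * v[c]        # S_t = S_{t-1} + phi(k_t) v_t^T
        den = dot(fq, z)
        out.append([sum(fq[a] * S[a][c] for a in range(d)) / den
                    for c in range(d_v)])
    return out

def chunkwise_attention(Q, K, V, chunk=2):
    """Step 5 (sketch): masked pairwise weights inside a chunk, plus the
    state S, z accumulated from all earlier chunks."""
    d, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d)]
    z = [0.0] * d
    out = []
    for start in range(0, len(Q), chunk):
        fq = [phi(q) for q in Q[start:start + chunk]]
        fk = [phi(k) for k in K[start:start + chunk]]
        vs = V[start:start + chunk]
        for i, q in enumerate(fq):
            w = [dot(q, fk[j]) for j in range(i + 1)]    # causal: j <= i
            den = dot(q, z) + sum(w)
            out.append([(sum(q[a] * S[a][c] for a in range(d))
                         + sum(w[j] * vs[j][c] for j in range(i + 1))) / den
                        for c in range(d_v)])
        for k, v in zip(fk, vs):               # fold the finished chunk into the state
            for a in range(d):
                z[a] += k[a]
                for c in range(d_v):
                    S[a][c] += k[a] * v[c]
    return out

# Hypothetical toy inputs: n = 5 tokens, d = 2, d_v = 2.
Q = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8], [0.1, 0.1], [-2.0, 1.0]]
K = [[1.0, 0.0], [-0.5, 0.5], [0.3, -1.2], [0.7, 0.7], [-0.1, 0.4]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
rec = recurrent_attention(Q, K, V)
chk = chunkwise_attention(Q, K, V, chunk=2)
```

Both functions should return the same outputs up to rounding for any chunk size. The chunkwise form trades a small amount of quadratic work inside each chunk for the ability to process a whole chunk at once.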

Problem solved

Standard scaled dot-product attention has O(n²) time and memory complexity in sequence length, which is impractical for very long contexts and expensive in autoregressive inference. Linear Attention breaks the quadratic barrier, enabling training and inference on long sequences while preserving training parallelism and supporting recurrent inference with constant memory.
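As a rough illustration of the memory argument at inference time, the sketch below counts stored values per attention head. The dimensions (64 for keys, values and features) are assumed for the example and are not taken from any specific model.

```python
def kv_cache_values(n, d_k, d_v):
    """Softmax attention keeps every past key and value: n * (d_k + d_v)."""
    return n * (d_k + d_v)

def linear_state_values(d_phi, d_v):
    """Linear attention keeps S (d_phi x d_v) and z (d_phi), independent of n."""
    return d_phi * d_v + d_phi

# Assumed head dimensions for illustration only.
for n in (1_000, 100_000):
    print(n, kv_cache_values(n, 64, 64), linear_state_values(64, 64))
```

Under these assumptions, a 100,000-token context needs 12,800,000 cached values per head for softmax attention, against 4,160 for the linear-attention state, and the second figure stays the same as the context grows.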

Components

Feature map φ(·)

Stands in for the softmax kernel (or approximates it, as FAVOR+ does) and allows factorising the Q·Kᵀ product into linear operations.

A non-negative map applied independently to queries and keys; its choice governs expressiveness and stability. Common choices: ELU+1, cosine, FAVOR+ (orthogonal random features).

IN: Query/key tensor.

OUT: Tensor after applying φ.

ELU+1: Simple non-negative map used in the original Linear Transformer (Katharopoulos et al., 2020).

FAVOR+: Softmax approximation via orthogonal random features (Performer, 2020).

Cosine / sin-cos: Cosine map used e.g. in cosFormer.
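Two of these maps can be sketched in a few lines. The ELU+1 function below follows its definition directly. The random-feature function is a simplified stand-in for FAVOR+: it uses independent Gaussian projections rather than the orthogonal ones FAVOR+ specifies, and the names and sizes are assumptions made for this example.

```python
import math
import random

def elu_plus_one(x):
    """phi(x) = ELU(x) + 1: x + 1 for x > 0, exp(x) otherwise; always > 0."""
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]

def positive_random_features(x, omegas):
    """phi(x)_i = exp(w_i . x - |x|^2 / 2) / sqrt(m) with w_i ~ N(0, I).
    In expectation phi(q) . phi(k) equals exp(q . k), the unnormalised
    softmax kernel. Simplification: the w_i here are NOT orthogonalised."""
    half_sq_norm = sum(xi * xi for xi in x) / 2.0
    m = len(omegas)
    return [math.exp(sum(w * xi for w, xi in zip(om, x)) - half_sq_norm)
            / math.sqrt(m) for om in omegas]

rng = random.Random(0)
d, m = 4, 20000                                # assumed sizes for the demo
omegas = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
q = [0.3, -0.2, 0.1, 0.4]
k = [0.1, 0.5, -0.3, 0.2]
approx = sum(a * b for a, b in zip(positive_random_features(q, omegas),
                                   positive_random_features(k, omegas)))
exact = math.exp(sum(a * b for a, b in zip(q, k)))
```

With enough random features, `approx` should land close to `exact`; the gap shrinks roughly with the square root of the number of features. ELU+1 makes no such approximation claim: it defines a different, cheaper kernel.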


Recurrent state S_t

Replaces the K/V cache of classical attention with constant memory of size d_φ × d_v.

Matrix accumulating outer products φ(k_t)v_tᵀ; serves as "memory" in the autoregressive regime.

IN: Matrix accumulated over time steps.

OUT: State after update.

Normaliser z_t

Stabilises the output scale and preserves a probabilistic interpretation.

Vector summing φ(k_t), used to normalise the output similarly to the softmax denominator.

IN: Vector accumulated over time.

OUT: Cumulative sum of key features up to time t.


Implementation

Reference implementations

fast-transformers (linear attention)

Python · Idiap Research Institute · Official

Performer-pytorch

Python · Phil Wang (lucidrains)

Flash Linear Attention (FLA)

Python · fla-org

Implementation pitfalls

Numerical instability of the denominator · High

The denominator φ(q_t)ᵀ z_t, where z_t is the running sum of φ(k), can be zero or very small early in the sequence, when few keys have been accumulated, so the output is divided by a near-zero value.

Fix: Add a small ε to the denominator, use layer normalisation, or choose φ carefully.
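The ε fix can be illustrated with a deliberately extreme case. In this sketch (plain Python, assumed names, ELU+1 as φ) the pre-activations are so negative that φ underflows to exactly zero in double precision; in lower-precision formats the same effect appears at far milder values.

```python
import math

def elu_plus_one(x):
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]

def readout(phi_q, S, z, eps=1e-6):
    """y = (phi(q)^T S) / (phi(q)^T z + eps); eps keeps the division finite."""
    den = sum(a * b for a, b in zip(phi_q, z)) + eps
    return [sum(phi_q[a] * S[a][c] for a in range(len(z))) / den
            for c in range(len(S[0]))]

# First time step with strongly negative pre-activations:
# exp(-800) underflows to 0.0, so phi(k_1) and phi(q_1) are all zeros.
phi_k = elu_plus_one([-800.0, -800.0])
phi_q = elu_plus_one([-800.0, -800.0])
v = [1.0, 2.0]
S = [[pk * vc for vc in v] for pk in phi_k]    # S_1 = phi(k_1) v_1^T
z = list(phi_k)                                # z_1 = phi(k_1)
y = readout(phi_q, S, z)                       # finite because eps > 0
```

Calling `readout` with `eps=0.0` on the same inputs divides zero by zero and raises `ZeroDivisionError`. The value of ε is a tuning choice; 1e-6 here is only a placeholder.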

Weaker performance on retrieval tasks · Medium

Pure linear attention struggles with precise long-range recall because its state is a compressed sum.

Fix: Add the delta rule (DeltaNet) or gates (Gated Linear Attention); use hybrids with local attention.
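The difference between the additive update and a delta-rule update can be seen on a two-step example. This is a simplified sketch: it omits the feature map, the normaliser and the learned, per-token β used in practice, and the function names are illustrative. The key is unit-norm, and β = 1 is assumed so that a write fully replaces the stored value.

```python
def additive_update(S, k, v):
    """Plain linear attention: S <- S + k v^T, so values under the same key pile up."""
    return [[S[a][c] + k[a] * v[c] for c in range(len(v))]
            for a in range(len(k))]

def delta_update(S, k, v, beta=1.0):
    """Delta rule: read what is stored under k, then write only the correction:
    S <- S + beta * k (v - S^T k)^T."""
    v_old = [sum(S[a][c] * k[a] for a in range(len(k))) for c in range(len(v))]
    return [[S[a][c] + beta * k[a] * (v[c] - v_old[c]) for c in range(len(v))]
            for a in range(len(k))]

def read(S, q):
    """Readout S^T q."""
    return [sum(S[a][c] * q[a] for a in range(len(q))) for c in range(len(S[0]))]

k = [0.6, 0.8]                                  # unit-norm key
S_add = [[0.0, 0.0], [0.0, 0.0]]
S_delta = [[0.0, 0.0], [0.0, 0.0]]
for v in ([1.0, 0.0], [0.0, 1.0]):              # two writes under the same key
    S_add = additive_update(S_add, k, v)
    S_delta = delta_update(S_delta, k, v)

mixed = read(S_add, k)        # sum of both values
latest = read(S_delta, k)     # only the most recent value
```

Reading back with the same key returns the sum of both values for the additive state and only the latest value for the delta-rule state, which is the behaviour that helps precise recall.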

Feature-map selection · Medium

A poor choice of φ degrades expressiveness or training stability.

Fix: Use proven maps (ELU+1, FAVOR+, cosine); calibrate at small scale before a full training run.

Evolution

Original paper · 2020 · ICML 2020 · Angelos Katharopoulos

Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention

Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, François Fleuret

2020

Linear Transformer (Katharopoulos et al.)

Inflection point

Introduction of the kernel attention form with φ = ELU+1; demonstration of equivalence to an RNN in the autoregressive regime.

Transformers are RNNs (paper)

2020

Performer / FAVOR+

Softmax approximation via orthogonal random features; theoretical error guarantees.

Rethinking Attention with Performers (paper)

2023

RetNet

Hybrid of parallel and recurrent forms with exponential decay; demonstrated scalability to large language models.

Retentive Network: A Successor to Transformer for Large Language Models (paper)

2024

Mamba2 / SSD — bridge to Linear Attention

Inflection point

"Transformers are SSMs" shows that selective SSMs and linear attention are two sides of the same structured-matrix duality.

SSM (concept) · Transformers are SSMs (paper)

2024

DeltaNet & Gated DeltaNet

Linear attention augmented with the delta rule and gating; significant improvements on retrieval and long-context tasks.

Gated Delta Networks: Improving Mamba2 with Delta Rule (paper)

Sources

Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention

Paper

arXiv

Originating paper introducing the term Linear Transformer / Linear Attention.

Rethinking Attention with Performers

Paper

arXiv

Performer and FAVOR+: softmax approximation via random features.

Flash Linear Attention (GitHub)

Repository

fla-org

Library with efficient kernels for various linear-attention variants.
