1) Choose a feature map φ(·) (e.g. ELU+1, cosine, FAVOR+ orthogonal random features) — non-negativity is important. 2) Instead of computing A = softmax(QKᵀ) and then A·V, compute φ(K)ᵀV (shape d×d), then φ(Q) · (φ(K)ᵀV). 3) The normaliser is φ(Q)·Σ φ(K). 4) In the autoregressive regime, maintain a cumulative state S_t = S_{t−1} + φ(k_t)v_tᵀ and z_t = z_{t−1} + φ(k_t); output: y_t = (φ(q_t)ᵀ S_t) / (φ(q_t)ᵀ z_t). 5) Training uses a parallel (chunkwise / blockwise) form to exploit GPUs and keep parallelism along the sequence dimension.
Standard scaled dot-product attention has O(n²) time and memory complexity in sequence length, which is impractical for very long contexts and expensive in autoregressive inference. Linear Attention breaks the quadratic barrier, enabling training and inference on long sequences while preserving training parallelism and supporting recurrent inference with constant memory.
A non-negative map applied independently to queries and keys; its choice governs expressiveness and stability. Common choices: ELU+1, cosine, FAVOR+ (orthogonal random features).
Official
Matrix accumulating outer products φ(k_t)v_tᵀ; serves as "memory" in the autoregressive regime.
Vector summing φ(k_t), used to normalise the output similarly to the softmax denominator.
Official
The sum φ(K) can approach zero or very small values early in the sequence, leading to division by near-zero.
Pure linear attention struggles with precise long-range recall because its state is a compressed sum.
A poor choice of φ degrades expressiveness or training stability.
Introduction of the kernel attention form with φ = ELU+1; demonstration of equivalence to an RNN in the autoregressive regime.
Softmax approximation via orthogonal random features; theoretical error guarantees.
Hybrid of parallel and recurrent forms with exponential decay; demonstrated scalability to large language models.
"Transformers are SSMs" shows that selective SSMs and linear attention are two sides of the same structured-matrix duality.
Linear attention augmented with the delta rule and gating; significant improvements on retrieval and long-context tasks.
Time complexity: O(n · d²). Space complexity: O(d²).
The dominant cost is accumulating the state φ(k_t)v_tᵀ and projecting it through φ(q_t). It dominates for small n; the advantage over softmax grows with n.
Choice of non-negative function approximating softmax.
Dimension of the space after mapping with φ; affects state capacity and compute cost.
Chunk size in the chunkwise form — a trade-off between parallelism and memory.
Whether to divide by Σ φ(K) (softmax-like normalisation) or use an unnormalised variant.
All tokens participate in the recurrent state update.
Linear Attention has no routing. Gating/delta mechanisms appear in variants (e.g. Gated Linear Attention, DeltaNet).
Training is parallelised across sequence chunks; inference uses the recurrent form.
The chunkwise form maps to large matmuls that exploit Tensor Cores; no quadratic attention reduces HBM pressure.
Linear scaling and regular access patterns suit systolic MAC arrays.
Autoregressive inference is feasible (constant-size state), but throughput is bounded by φ computations.