Architecture

DeltaNet

2021ActivePublished: 7 June 2026Updated: 7 June 2026Published

Key innovation

Replaces the additive state update of linear attention with the delta rule — at each step the model stores the prediction error and corrects the state in a targeted way, while preserving linear complexity.

How it works

1) Each token produces (q_t, k_t, v_t) and an optional learning-rate coefficient β_t ∈ (0, 1]. 2) The key passes through a feature map φ(·), typically with extra L2 normalisation. 3) The model reads the current mapping: ṽ_t = S_{t-1} · φ(k_t). 4) Computes the error Δ = β_t · (v_t − ṽ_t). 5) Updates the state: S_t = S_{t-1} + Δ · φ(k_t)ᵀ — exactly the 1960s delta rule (Widrow-Hoff) adapted to matrices. 6) Output: y_t = S_t · φ(q_t). 7) Training uses chunkwise parallelisation across sequence length based on Householder-matrix products, which preserves correctness of the delta rule and exploits GPU tensor cores.

Problem solved

Plain (additive) linear attention has limited associative capacity — its state grows monotonically and new key-value pairs do not overwrite previous ones, which leads to poor long-context retrieval. DeltaNet addresses this with online memory correction.

Components

Fast-weights state S_tStores and updates key-value associations via the delta rule.

Matrix holding the current φ(k) → v mapping; acts as the layer's "short-term memory".

INMatrix accumulated over time steps.

OUTState after a rank-1 update.

Delta ruleReplaces the purely additive outer-product update of linear attention.

Update mechanism: compute the error between the current prediction and the target value, then correct the state to reduce that error.

INTarget value, state prediction, and learning rate.

OUTCorrection vector Δ used in the rank-1 update.

Constant βConstant learning rate treated as a hyperparameter.

Learned β_tToken-dependent learning rate (sigmoid of a projection).

Official

Feature map φ(·) with normalisationEnsures new keys are relatively orthogonal, so the delta rule does not erase prior associations.

Mapping for keys/queries — typically SiLU/short-conv + L2 norm, for orthogonality and delta-rule stability.

INQuery/key tensor.

OUTNormalised features.

Official

Implementation

Reference implementations

Flash Linear Attention (DeltaNet)

Python · fla-org

Official

sustcsonglin/flash-linear-attention (early DeltaNet release)

Python · Songlin Yang

Official

Implementation pitfalls

Instability without L2 feature normalisationHigh

Without L2 norm on φ(k) the delta rule can numerically diverge and erase previously stored associations.

Fix:Use L2 normalisation on features and a careful choice of β_t (typically via a sigmoid).

Choice of β_tMedium

Too-large β overwrites fresh associations; too-small β makes the state effectively static.

Fix:Learn β_t per token (sigmoid of a projection); calibrate at small scale.

Complexity of the Householder-chunkwise algorithmMedium

A naive DeltaNet implementation does not parallelise across sequence length; the Yang et al. algorithm requires careful kernel implementation.

Fix:Use proven kernels from the FLA library rather than writing them from scratch.

Evolution

Original paper · 2021 · ICML 2021 · Imanol Schlag

Linear Transformers Are Secretly Fast Weight Programmers

Imanol Schlag, Kazuki Irie, Jürgen Schmidhuber

2021

Schlag, Irie, Schmidhuber — DeltaNet and fast weight programmers

Inflection point

Showed the formal equivalence of linear attention with fast weight programmers and introduced the delta rule as an alternative to additive updates.

Linear Transformers Are Secretly Fast Weight Programmers (paper)

2024

Yang et al. — parallelizing DeltaNet over sequence length

Inflection point

Hardware-efficient training algorithm for DeltaNet based on Householder-matrix products; scaled to 1.3B parameters / 100B tokens; better perplexity than Mamba and GLA.

Parallelizing Linear Transformers with the Delta Rule over Sequence Length (paper)

2024

Gated DeltaNet — gating + delta rule

Combination of gating (fast memory erasure) with the delta rule (precise corrections); accepted at ICLR 2025.

Gated Delta Networks: Improving Mamba2 with Delta Rule (paper)

2025

Adoption in production language models

DeltaNet / Gated DeltaNet layers adopted in models such as Qwen3-Next and OLMo Hybrid hybrids.