1) Each token produces (q_t, k_t, v_t) and an optional learning-rate coefficient β_t ∈ (0, 1].
2) The key passes through a feature map φ(·), typically followed by L2 normalisation.
3) The model reads the current mapping: ṽ_t = S_{t-1} · φ(k_t).
4) It computes the error: Δ = β_t · (v_t − ṽ_t).
5) It updates the state: S_t = S_{t-1} + Δ · φ(k_t)ᵀ. This is the 1960s Widrow-Hoff delta rule, applied to a matrix-valued state.
6) Output: y_t = S_t · φ(q_t).
7) Training uses chunkwise parallelisation across sequence length based on Householder-matrix products, which computes the same result as the sequential delta rule while exploiting GPU tensor cores.
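Steps 1 to 6 above can be sketched as a single recurrent update. This is a minimal illustration in plain Python, not the reference implementation: the function names are hypothetical, φ is reduced to L2 normalisation only (no SiLU or short convolution), and the state S is assumed to have shape d_v × d_k.

```python
import math

def l2_normalize(x, eps=1e-6):
    """Scale x to (approximately) unit L2 norm."""
    norm = math.sqrt(sum(xi * xi for xi in x))
    return [xi / (norm + eps) for xi in x]

def matvec(S, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(s * xj for s, xj in zip(row, x)) for row in S]

def delta_step(S, q, k, v, beta=1.0):
    """One recurrent delta-rule step; returns (new_state, output)."""
    phi_k = l2_normalize(k)                      # step 2 (assumption: φ = L2 norm only)
    phi_q = l2_normalize(q)
    v_pred = matvec(S, phi_k)                    # step 3: read the current mapping
    delta = [beta * (vi - pi) for vi, pi in zip(v, v_pred)]   # step 4: scaled error
    S_new = [[s + di * kj for s, kj in zip(row, phi_k)]       # step 5: rank-1 correction
             for row, di in zip(S, delta)]
    y = matvec(S_new, phi_q)                     # step 6: output from the updated state
    return S_new, y

# Usage: store one pair, then query with the same direction as the key.
S0 = [[0.0, 0.0], [0.0, 0.0]]
S1, y1 = delta_step(S0, q=[3.0, 4.0], k=[3.0, 4.0], v=[1.0, 2.0], beta=1.0)
```

With β = 1 and a query aligned with the key, `y1` comes back approximately equal to the stored value `[1.0, 2.0]`.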
Plain (additive) linear attention has limited associative capacity — its state grows monotonically and new key-value pairs do not overwrite previous ones, which leads to poor long-context retrieval. DeltaNet addresses this with online memory correction.
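The difference between the two update rules shows up when the same key is written twice with different values. The sketch below is a toy example with hypothetical helper names, a unit-norm key, and β = 1; it is meant only to illustrate overwrite versus accumulation.

```python
def matvec(S, x):
    return [sum(s * xj for s, xj in zip(row, x)) for row in S]

def add_outer(S, u, k):
    """Return S + u kᵀ."""
    return [[s + ui * kj for s, kj in zip(row, k)] for row, ui in zip(S, u)]

def additive_write(S, k, v):
    """Plain linear attention: always add v kᵀ."""
    return add_outer(S, v, k)

def delta_write(S, k, v, beta=1.0):
    """Delta rule: add only the scaled error between v and the current read."""
    current = matvec(S, k)
    err = [beta * (vi - ci) for vi, ci in zip(v, current)]
    return add_outer(S, err, k)

zeros = [[0.0, 0.0], [0.0, 0.0]]
k = [1.0, 0.0]                       # unit-norm key
v_first, v_second = [1.0, 1.0], [5.0, -2.0]

S_add = additive_write(additive_write(zeros, k, v_first), k, v_second)
S_delta = delta_write(delta_write(zeros, k, v_first), k, v_second)

read_add = matvec(S_add, k)          # [6.0, -1.0]: the two values are summed
read_delta = matvec(S_delta, k)      # [5.0, -2.0]: the second value replaced the first
```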
Matrix holding the current φ(k) → v mapping; acts as the layer's "short-term memory".
Update mechanism: compute the error between the current prediction and the target value, then correct the state to reduce that error.
Mapping for keys/queries — typically SiLU/short-conv + L2 norm, for orthogonality and delta-rule stability.
Without L2 norm on φ(k) the delta rule can numerically diverge and erase previously stored associations.
A β that is too large overwrites stored associations too aggressively; a β that is too small leaves the state effectively static.
A naive DeltaNet implementation does not parallelise across sequence length; the Yang et al. algorithm requires careful kernel implementation.
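The first two pitfalls above can be seen in a toy experiment: write the same (k, v) pair repeatedly and track the read error. Substituting the update into v − S_t · k shows that each write multiplies the error by (1 − β · ‖k‖²), so the error shrinks only when that factor has magnitude below 1. The sketch below is an illustration with a hypothetical function name and an untransformed key (no φ), not a description of any particular implementation.

```python
def repeated_write_errors(k, v, beta, steps):
    """Write the same (k, v) pair `steps` times; return max |v - S k| after each write."""
    S = [[0.0] * len(k) for _ in v]
    errors = []
    for _ in range(steps):
        pred = [sum(s * kj for s, kj in zip(row, k)) for row in S]
        delta = [beta * (vi - pi) for vi, pi in zip(v, pred)]
        S = [[s + di * kj for s, kj in zip(row, k)] for row, di in zip(S, delta)]
        pred = [sum(s * kj for s, kj in zip(row, k)) for row in S]
        errors.append(max(abs(vi - pi) for vi, pi in zip(v, pred)))
    return errors

# Unit-norm key, beta = 0.5: factor 1 - 0.5 = 0.5, the error halves per write.
unit = repeated_write_errors([1.0, 0.0], [1.0], beta=0.5, steps=4)
# Unnormalised key with ||k||^2 = 4, beta = 1: factor 1 - 4 = -3, the error triples.
big = repeated_write_errors([2.0, 0.0], [1.0], beta=1.0, steps=4)
# Unit-norm key, very small beta: factor 0.99, the state barely moves.
slow = repeated_write_errors([1.0, 0.0], [1.0], beta=0.01, steps=4)
```

With unit-norm keys and β ∈ (0, 1], the factor stays in [0, 1), which is one way to read the stability role of the L2 normalisation.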
Showed the formal equivalence of linear attention with fast weight programmers and introduced the delta rule as an alternative to additive updates.
Hardware-efficient training algorithm for DeltaNet based on Householder-matrix products; scaled to 1.3B parameters / 100B tokens; better perplexity than Mamba and GLA.
Combination of gating (fast memory erasure) with the delta rule (precise corrections); accepted at ICLR 2025.
DeltaNet / Gated DeltaNet layers adopted in hybrid models such as Qwen3-Next and OLMo Hybrid.
Time complexity: O(n · d²). Space complexity: O(d²).
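As a rough worked example of these bounds, the recurrent form performs three d × d operations per token (read, rank-1 update, output) on a state whose size does not depend on n. The helper below is a back-of-the-envelope sketch with a hypothetical name; the constant 3 is an assumption of this sketch and it ignores projections, the feature map, multiple heads, and chunkwise overhead.

```python
def deltanet_cost(n_tokens, d):
    """Approximate multiply-add count and state size for one head with a d x d state."""
    per_token = 3 * d * d            # S·φ(k) read, Δ·φ(k)ᵀ update, S·φ(q) output
    return {"macs_total": n_tokens * per_token, "state_entries": d * d}

short = deltanet_cost(1_000, 64)     # 12_288_000 multiply-adds, 4096 state entries
long = deltanet_cost(2_000, 64)      # twice the work, same state size
```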
The Yang et al. (2024) algorithm removes the delta rule's sequentiality via an efficient Householder-product representation; computing that representation is the dominant cost besides the d × d matmuls.
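The link to Householder-style matrices follows from rearranging the update rule. The derivation below writes k_t for φ(k_t); the symbols A_t, P_t and w_t are notation introduced here for illustration, and the compact form in the last line is one standard way to express such a product, not necessarily the exact layout used in the published kernels.

```latex
\begin{aligned}
S_t &= S_{t-1} + \beta_t\,(v_t - S_{t-1} k_t)\,k_t^{\top}
     = S_{t-1}\,\underbrace{(I - \beta_t k_t k_t^{\top})}_{A_t} + \beta_t\, v_t k_t^{\top}, \\
P_t &= A_1 A_2 \cdots A_t = P_{t-1}\,(I - \beta_t k_t k_t^{\top}), \qquad P_0 = I, \\
P_t &= I - \sum_{i=1}^{t} w_i k_i^{\top}, \qquad
w_t = \beta_t \Bigl( k_t - \sum_{i<t} w_i\,(k_i^{\top} k_t) \Bigr).
\end{aligned}
```

The last line follows by induction: if P_{t-1} = I − Σ_{i<t} w_i k_iᵀ, then P_{t-1} k_t = k_t − Σ_{i<t} w_i (k_iᵀ k_t), and multiplying by β_t gives w_t. Within a chunk, the w_i depend only on keys and β values, which is what lets the work be grouped into matrix products.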
How β_t is chosen — constant, sigmoid projection, layer-wise schedule.
Choice of φ together with optional (L2) normalisation.
Chunk size in chunkwise training — parallelism vs memory trade-off.
Number of independent delta-rule heads — affects associative capacity.
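The first two choices above can be sketched in a few lines. This is a hedged illustration: `w_beta` is a hypothetical learned projection vector, the feature map is assumed to be elementwise SiLU followed by L2 normalisation, and the short convolution mentioned for φ is omitted.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def silu(x):
    return x * sigmoid(x)

def beta_from_token(x, w_beta):
    """Sigmoid projection: β_t = σ(w_beta · x_t), which always lies in (0, 1)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w_beta, x)))

def feature_map(k, eps=1e-6):
    """φ(k): elementwise SiLU, then L2 normalisation to (approximately) unit norm."""
    a = [silu(ki) for ki in k]
    norm = math.sqrt(sum(ai * ai for ai in a))
    return [ai / (norm + eps) for ai in a]

x_t = [0.2, -1.0, 0.7]               # toy token representation
beta_t = beta_from_token(x_t, w_beta=[0.5, 0.1, -0.3])
phi_k = feature_map([1.0, -2.0, 0.5])
```

A constant β or a layer-wise schedule would replace `beta_from_token` with a fixed number per layer; the rest of the update is unchanged.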
All tokens update the state; selectivity comes only from β_t or, in variants, from gates.
DeltaNet has no expert routing; some variants (Gated DeltaNet) add a gating mechanism.
Training fully exploits GPUs; inference uses the recurrent form.
The chunkwise form with Householder products maps onto large matmuls that exploit Tensor Cores.
Linear scaling and regular access patterns suit systolic MAC arrays.
Autoregressive inference with constant-size state is feasible, but throughput is bounded by delta-rule computation.