1) The starting point is a continuous-time linear state-space system with parameters (A, B, C, D), in which the hidden state x(t) evolves linearly under the input u(t).
2) The system is discretized with a step size Delta (e.g. zero-order hold or bilinear transform), yielding the recurrence x_k = A_bar x_{k-1} + B_bar u_k and y_k = C x_k (omitting the skip term D u_k).
3) Because the system is linear and time-invariant, the recurrence can be unrolled into a global 1D convolution with a kernel K. Suitable structure on A (e.g. diagonal plus low-rank with HiPPO-LegS initialization) makes K cheap to compute, and the convolution itself is applied efficiently via FFT.
4) Training runs in convolutional mode (parallel over the sequence), while autoregressive inference runs in recurrent mode with O(1) memory per token.
5) In Mamba the matrices B, C and the step Delta become input-dependent (selective SSM). This removes the LTI restriction, and with it the convolutional mode, so training requires a custom hardware-aware selective scan that keeps state in GPU SRAM.
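Steps 2 to 4 can be checked numerically on a toy system. The sketch below assumes a diagonal A (so each state mode is a scalar), zero-order-hold discretization, and made-up parameter values. The function names are illustrative and do not come from any SSM library.

```python
import math

def discretize_zoh(a, b, delta):
    """Zero-order hold for a diagonal system: one scalar (a_n, b_n) per state mode."""
    a_bar = [math.exp(delta * an) for an in a]
    b_bar = [(abn - 1.0) / an * bn for abn, an, bn in zip(a_bar, a, b)]
    return a_bar, b_bar

def run_recurrent(a_bar, b_bar, c, u):
    """x_k = A_bar x_{k-1} + B_bar u_k, y_k = C x_k, starting from a zero state."""
    x = [0.0] * len(a_bar)
    ys = []
    for uk in u:
        x = [ab * xn + bb * uk for ab, bb, xn in zip(a_bar, b_bar, x)]
        ys.append(sum(cn * xn for cn, xn in zip(c, x)))
    return ys

def conv_kernel(a_bar, b_bar, c, length):
    """K[j] = C A_bar^j B_bar, written out for diagonal A_bar."""
    return [sum(cn * ab ** j * bb for cn, ab, bb in zip(c, a_bar, b_bar))
            for j in range(length)]

def run_convolutional(kernel, u):
    """Causal convolution y_k = sum_j K[j] u_{k-j} (direct form, no FFT)."""
    return [sum(kernel[j] * u[k - j] for j in range(k + 1)) for k in range(len(u))]

# Hypothetical toy values, chosen only for illustration.
a = [-1.0, -2.0]
b = [1.0, 1.0]
c = [0.5, -0.25]
u = [1.0, 0.0, 2.0, -1.0, 0.5]

a_bar, b_bar = discretize_zoh(a, b, delta=0.1)
y_rec = run_recurrent(a_bar, b_bar, c, u)
y_conv = run_convolutional(conv_kernel(a_bar, b_bar, c, len(u)), u)
```

Both modes produce the same outputs up to floating-point error. The recurrent form carries only the state x between steps, while the convolutional form needs the whole input at once.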
The quadratic cost of self-attention in Transformers (O(L^2) in time and memory) limits context scaling to tens of thousands of tokens and makes million-step sequence modeling (audio, genomics, long documents, robot control) prohibitively expensive. SSMs offer linear time complexity in sequence length and constant memory per token during autoregressive inference.
Matrix defining hidden-state dynamics. S4 initializes A from HiPPO-LegS and parameterizes it as diagonal plus low-rank (DPLR); S4D and S5 use a diagonal variant, which substantially simplifies computation.
Learned (or input-dependent) parameter that converts the continuous system into a discrete recurrence via zero-order hold or bilinear transform.
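For reference, the two discretization rules named above are usually written as follows. This is the standard textbook form, with I the identity matrix and the continuous parameters A and B as defined earlier.

```latex
% Zero-order hold (ZOH)
\bar{A} = \exp(\Delta A), \qquad
\bar{B} = (\Delta A)^{-1}\bigl(\exp(\Delta A) - I\bigr)\,\Delta B

% Bilinear (Tustin) transform
\bar{A} = \bigl(I - \tfrac{\Delta}{2} A\bigr)^{-1}\bigl(I + \tfrac{\Delta}{2} A\bigr), \qquad
\bar{B} = \bigl(I - \tfrac{\Delta}{2} A\bigr)^{-1}\,\Delta B
```

When A is diagonal with entries a_n, the ZOH rule reduces to elementwise operations: a_bar_n = exp(Delta a_n) and b_bar_n = (exp(Delta a_n) - 1) / a_n * b_n.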
B projects the input u_k into state space, C reads the output from the hidden state. In Mamba both become input-dependent (selective SSM).
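A minimal sketch of what "input-dependent" means in practice, for a single channel with a diagonal A. It rests on several simplifying assumptions: B, C and Delta are computed from the scalar input u_k alone (real models project the full token vector), Delta passes through a softplus to stay positive, and B is discretized with the simplified rule B_bar = Delta * B. All weight names are hypothetical.

```python
import math

def softplus(z):
    """Smooth positive map log(1 + exp(z)), used here to keep Delta > 0."""
    return math.log1p(math.exp(z))

def selective_ssm(u, a, w_b, w_c, w_delta, bias_delta):
    """Sequential reference loop: B, C and Delta are recomputed at every step."""
    x = [0.0] * len(a)
    ys = []
    for uk in u:
        delta = softplus(w_delta * uk + bias_delta)   # input-dependent step size
        b_k = [w * uk for w in w_b]                   # input-dependent B
        c_k = [w * uk for w in w_c]                   # input-dependent C
        a_bar = [math.exp(delta * an) for an in a]    # ZOH on the diagonal A
        b_bar = [delta * bn for bn in b_k]            # simplified discretization of B (assumption)
        x = [ab * xn + bb * uk for ab, bb, xn in zip(a_bar, b_bar, x)]
        ys.append(sum(cn * xn for cn, xn in zip(c_k, x)))
    return ys

# Hypothetical toy parameters.
y = selective_ssm(u=[1.0, -0.5, 2.0], a=[-1.0, -0.5],
                  w_b=[0.3, -0.2], w_c=[1.0, 0.5],
                  w_delta=0.7, bias_delta=-2.0)
```

Because a_bar and b_bar change at every step, no single kernel K describes the layer, which is why the convolutional mode of LTI SSMs does not apply here.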
Hardware-aware parallel-scan recurrence that keeps state in GPU SRAM, required when parameters are input-dependent and convolutional mode is unavailable.
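The reason a scan applies at all is that the per-step update h_k = a_k h_{k-1} + b_k composes associatively. The sketch below illustrates that property with a Hillis-Steele scan in plain Python. It is not the fused GPU kernel, and the function names are illustrative only.

```python
def combine(left, right):
    """Compose two affine steps h -> a*h + b, with `left` applied first."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)

def sequential_scan(a, b):
    """Reference loop: h_k = a_k * h_{k-1} + b_k with h_{-1} = 0."""
    h, out = 0.0, []
    for ak, bk in zip(a, b):
        h = ak * h + bk
        out.append(h)
    return out

def parallel_style_scan(a, b):
    """Hillis-Steele inclusive scan: about log2(L) rounds, each round's updates are independent."""
    elems = list(zip(a, b))
    shift = 1
    while shift < len(elems):
        elems = [combine(elems[i - shift], elems[i]) if i >= shift else elems[i]
                 for i in range(len(elems))]
        shift *= 2
    # The b-component of each prefix is the state reached from a zero initial state.
    return [bk for _, bk in elems]

# Hypothetical per-step coefficients, as a selective SSM would produce them.
a = [0.9, 0.5, 0.8, 0.7, 0.95, 0.6]
b = [1.0, -2.0, 0.5, 0.0, 3.0, 1.5]
h_seq = sequential_scan(a, b)
h_par = parallel_style_scan(a, b)
```

Each round of the list comprehension could run in parallel across positions, so the depth drops from L sequential steps to roughly log2(L) rounds. Production kernels typically use a more work-efficient scan variant and fuse it with discretization to avoid writing intermediate states to slower memory.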
A Delta step initialized outside the recommended log-uniform range causes activations to explode or the state to die, especially in deep SSM stacks.
When B, C or Delta depend on the input (Mamba), there is no global convolutional kernel, so training must use the recurrence, and a naive sequential loop over the sequence is slow.
A small hidden state (e.g. N=16) limits exact copying of long context spans, where Transformers with a KV-cache remain stronger.
Gu et al. introduce HiPPO, a framework for state operators that compress the input history by projecting it onto an orthogonal polynomial basis, from which the signal can be reconstructed. It is the theoretical foundation of later SSMs.
First deep-learning SSM layer exposing the recurrent-convolutional duality; computationally expensive in practice.
Introduction of the DPLR structure and frequency-domain parameterization; SOTA on Long Range Arena, sequences up to 16k tokens.
Showed that a diagonal A matrix matches full S4 quality with a much simpler implementation.
Hungry Hungry Hippos demonstrates SSMs becoming competitive with Transformers on language modeling.
Gu and Dao introduce input-dependent B, C, Delta and a hardware-aware selective scan; the first SSM to match Transformers on language benchmarks at linear complexity.
Mamba-2 unifies SSMs and attention via State Space Duality; Jamba (AI21) and Zamba mix Mamba and Transformer layers in production-scale LLMs.
Time complexity: O(L) recurrent inference; O(L log L) convolutional training (FFT); O(L) Mamba training with selective scan. Space complexity: O(1) per token during autoregressive inference; O(L) during training.
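The O(L log L) figure comes from applying the convolution kernel in the frequency domain. The sketch below uses a small radix-2 FFT written with the standard library only, zero-pads to at least 2L to avoid circular wrap-around, and checks the result against the direct O(L^2) sum. It assumes the kernel and the input have the same length L. Real implementations would call an optimized FFT routine instead.

```python
import cmath

def fft(x, inverse=False):
    """Recursive radix-2 FFT; len(x) must be a power of two. The inverse is unscaled."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def causal_conv_fft(kernel, u):
    """y_k = sum_j K[j] u_{k-j} via FFT, for len(kernel) == len(u)."""
    length = len(u)
    n = 1
    while n < 2 * length:          # pad so the linear convolution fits without wrap-around
        n *= 2
    kf = fft([complex(v) for v in kernel] + [0j] * (n - len(kernel)))
    uf = fft([complex(v) for v in u] + [0j] * (n - length))
    y = fft([p * q for p, q in zip(kf, uf)], inverse=True)
    return [v.real / n for v in y[:length]]   # divide by n to finish the inverse transform

def causal_conv_direct(kernel, u):
    """Reference O(L^2) causal convolution."""
    return [sum(kernel[j] * u[k - j] for j in range(k + 1)) for k in range(len(u))]

# Hypothetical decaying kernel and input.
kernel = [0.5 ** j for j in range(5)]
u = [1.0, 2.0, 3.0, 4.0, 5.0]
y_fft = causal_conv_fft(kernel, u)
y_direct = causal_conv_direct(kernel, u)
```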
Hidden-state dimension per channel. Typically 16-64 in S4/Mamba. Controls long-range memory capacity.
Initialization range of the discretization step Delta (typically log-uniform). Critical for stability and for modeling dependencies at different time scales.
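A minimal sketch of log-uniform initialization for Delta. The bounds dt_min and dt_max are illustrative assumptions, not values taken from this text. The inverse-softplus step assumes Delta is produced by applying a softplus to a learned bias, which is one common parameterization.

```python
import math
import random

def sample_delta(rng, dt_min=1e-3, dt_max=1e-1):
    """Draw Delta log-uniformly from [dt_min, dt_max] (bounds are assumed values)."""
    return math.exp(rng.uniform(math.log(dt_min), math.log(dt_max)))

def softplus(z):
    return math.log1p(math.exp(z))

def inverse_softplus(y):
    """Return z with softplus(z) == y for y > 0, using a numerically stable form."""
    return y + math.log(-math.expm1(-y))

rng = random.Random(0)                              # fixed seed for reproducibility
deltas = [sample_delta(rng) for _ in range(1000)]   # one Delta per channel
biases = [inverse_softplus(d) for d in deltas]      # bias values that reproduce each Delta
```

Sampling in log space spreads the channels evenly across orders of magnitude, so some channels start with short time scales and others with long ones.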
Choice of A structure: DPLR with HiPPO-LegS initialization (S4) or diagonal (S4D, S5, Mamba). Affects efficiency and memory properties.
Whether B, C, Delta are input-dependent (Mamba) or fixed (S4/S4D/S5).
Classical SSMs (S4, S4D) are LTI — parameters do not depend on input. Mamba introduces input dependence (selectivity) while remaining computationally dense.
In convolutional mode the SSM is fully parallel across the sequence at training time. In recurrent mode (autoregressive inference) it is inherently sequential. Mamba recovers training parallelism via a parallel scan.
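Mamba and S4 ship official CUDA kernels. Mamba's selective scan keeps the state in GPU SRAM, while S4 pairs a custom kernel for its convolution-kernel computation with FFT-based convolution. Tensor Cores come into play mainly with Mamba-2, whose formulation is built around matrix multiplications.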
Constant per-token state memory makes SSMs attractive for CPU/edge inference; there are no official AVX kernels as optimized as the GPU ones.
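The memory argument can be made concrete with back-of-the-envelope arithmetic. The sketch below compares a Transformer KV-cache with an SSM recurrent state under stated assumptions: standard multi-head attention without grouped or multi-query sharing, 2 bytes per value, and only the SSM state itself counted (no convolution buffers). The model sizes are hypothetical.

```python
def kv_cache_bytes(layers, d_model, seq_len, bytes_per_value=2):
    """Keys and values: two (seq_len, d_model) tensors per layer, growing with seq_len."""
    return 2 * layers * seq_len * d_model * bytes_per_value

def ssm_state_bytes(layers, d_inner, state_size, bytes_per_value=2):
    """One (d_inner, state_size) state per layer, independent of sequence length."""
    return layers * d_inner * state_size * bytes_per_value

# Hypothetical model sizes, for illustration only.
layers, d_model, d_inner, state_size = 48, 2048, 4096, 16

for seq_len in (1_000, 100_000):
    kv = kv_cache_bytes(layers, d_model, seq_len)
    ssm = ssm_state_bytes(layers, d_inner, state_size)
    print(f"L={seq_len}: KV-cache {kv / 2**20:.1f} MiB, SSM state {ssm / 2**20:.1f} MiB")
```

Under these assumptions the KV-cache grows a hundredfold between the two sequence lengths, while the SSM state stays at the same few mebibytes.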