Architecture

SSM

2021 · Active · Published: 7 June 2026 · Updated: 7 June 2026

Key innovation

Linear-time O(L) sequence modeling via continuous linear state-space systems with efficient discretization, unifying a recurrent inference mode with a convolutional training mode as an alternative to quadratic self-attention.

How it works

1) The starting point is a continuous linear state-space system with parameters (A, B, C, D), in which the hidden state x(t) evolves linearly under the input u(t): x'(t) = A x(t) + B u(t), y(t) = C x(t) + D u(t).

2) The system is discretized with a step size Delta (e.g. zero-order hold or bilinear transform), yielding the recurrence x_k = A_bar x_{k-1} + B_bar u_k and y_k = C x_k (the skip term D u_k is left out here).

3) Because the discretized system is linear and time-invariant, the recurrence unrolls into a global 1D convolution y = K * u with kernel K = (C B_bar, C A_bar B_bar, C A_bar^2 B_bar, ...). Suitable structure on A (e.g. diagonal plus low-rank, as in HiPPO-LegS) makes K cheap to compute, and the convolution itself is evaluated via FFT.

4) Training runs in convolutional mode (parallel over the sequence), while autoregressive inference runs in recurrent mode with O(1) memory per token.

5) In Mamba, the matrices B and C and the step Delta become input-dependent (selective SSM). This removes the LTI restriction, and with it the convolutional form, so training relies on a custom hardware-aware selective scan that keeps the state in GPU SRAM.
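The recurrent and convolutional views of steps 2 to 4 can be checked against each other numerically. The following is a minimal NumPy sketch, not the S4 implementation: it assumes a diagonal, real, negative A (chosen arbitrarily for illustration), zero-order-hold discretization, and drops the D term. The function names are hypothetical.

```python
import numpy as np

def discretize_zoh(a, b, delta):
    """Zero-order hold for a diagonal system; a and b are length-N vectors."""
    a_bar = np.exp(delta * a)
    b_bar = (a_bar - 1.0) / a * b
    return a_bar, b_bar

def run_recurrent(a_bar, b_bar, c, u):
    """Step-by-step mode: x_k = a_bar * x_{k-1} + b_bar * u_k, y_k = c . x_k."""
    x = np.zeros_like(a_bar)
    y = np.empty(len(u))
    for k, u_k in enumerate(u):
        x = a_bar * x + b_bar * u_k
        y[k] = np.dot(c, x)
    return y

def run_convolutional(a_bar, b_bar, c, u):
    """Parallel mode: kernel K_j = c . (a_bar**j * b_bar), applied with an FFT."""
    L = len(u)
    powers = a_bar[None, :] ** np.arange(L)[:, None]   # shape (L, N)
    kernel = powers @ (c * b_bar)                      # shape (L,)
    n_fft = 2 * L                                      # zero-pad so the circular FFT product is a linear convolution
    y = np.fft.irfft(np.fft.rfft(u, n_fft) * np.fft.rfft(kernel, n_fft), n_fft)
    return y[:L]

rng = np.random.default_rng(0)
N, L = 4, 64
a = -np.arange(1, N + 1, dtype=float)   # assumed stable diagonal of A
b = np.ones(N)
c = rng.standard_normal(N)
u = rng.standard_normal(L)

a_bar, b_bar = discretize_zoh(a, b, delta=0.1)
y_rec = run_recurrent(a_bar, b_bar, c, u)
y_conv = run_convolutional(a_bar, b_bar, c, u)
max_gap = np.max(np.abs(y_rec - y_conv))   # agreement up to floating-point error
```

The loop touches one token at a time with a fixed-size state, which is the inference mode; the FFT path processes the whole sequence at once, which is the training mode.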

Problem solved

The quadratic cost of self-attention in Transformers (O(L^2) in time and memory) limits context scaling to tens of thousands of tokens and makes million-step sequence modeling (audio, genomics, long documents, robot control) prohibitively expensive. SSMs offer linear time complexity in sequence length and constant memory per token during autoregressive inference.

Components

State transition matrix A

Carries information through time and governs long-range memory properties.

Matrix defining the hidden-state dynamics. S4 uses a HiPPO-LegS structure in diagonal-plus-low-rank form; S4D and S5 use a diagonal variant, which substantially simplifies computation.
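To make the "diagonal plus low-rank" remark concrete, the sketch below builds the HiPPO-LegS matrix as it is commonly written (lower triangular, with diagonal entries -(n+1)) and checks that adding a rank-1 term P P^T turns it into a normal matrix, which is what allows it to be diagonalized stably. The vector P with entries sqrt(n + 1/2) follows the construction described for S4 as best recalled here; treat it as an assumption rather than a quotation of the reference code.

```python
import numpy as np

def hippo_legs(N):
    """HiPPO-LegS state matrix A (N x N, lower triangular) and input vector B."""
    n = np.arange(N)
    scale = np.sqrt(2 * n + 1)
    # Below the diagonal: -sqrt((2n+1)(2k+1)); on the diagonal: -(n+1).
    A = -np.tril(np.outer(scale, scale), k=-1) - np.diag(n + 1.0)
    B = scale.copy()
    return A, B

N = 8
A, B = hippo_legs(N)
P = np.sqrt(np.arange(N) + 0.5)           # assumed rank-1 correction vector
normal_part = A + np.outer(P, P)          # so that A = normal_part - P P^T
skew = normal_part + 0.5 * np.eye(N)      # what remains after removing the -1/2 diagonal
eigvals = np.linalg.eigvals(normal_part)  # complex eigenvalues with real part -1/2
```

Since `normal_part` equals a skew-symmetric matrix minus one half of the identity, it is unitarily diagonalizable, and A itself becomes "diagonal minus rank one" in that basis.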

Discretization step Delta

Controls how strongly new input influences the state and how long information is retained.

Learned (or input-dependent) parameter that converts the continuous system into a discrete recurrence via zero-order hold or bilinear transform.
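The two discretization rules named above can be compared on a single scalar mode. This is an illustrative sketch with hypothetical function names and arbitrarily chosen values (a = -1, b = 1); it only shows the textbook zero-order-hold and bilinear formulas, not any library's API.

```python
import numpy as np

def zoh(a, b, delta):
    """Zero-order hold: a_bar = exp(delta*a), b_bar = (a_bar - 1)/a * b."""
    a_bar = np.exp(delta * a)
    b_bar = (a_bar - 1.0) / a * b
    return a_bar, b_bar

def bilinear(a, b, delta):
    """Bilinear (Tustin) transform for the same scalar system."""
    denom = 1.0 - 0.5 * delta * a
    a_bar = (1.0 + 0.5 * delta * a) / denom
    b_bar = delta * b / denom
    return a_bar, b_bar

a, b = -1.0, 1.0
zoh_small = zoh(a, b, 0.01)
bil_small = bilinear(a, b, 0.01)   # nearly identical to ZOH for a small step
zoh_large = zoh(a, b, 1.0)         # larger step: smaller a_bar, larger b_bar
```

With a small Delta, a_bar stays close to 1 (the state is retained for a long time) and b_bar is small (each input has little influence). A large Delta does the opposite, which is the trade-off the component description refers to.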

Input/output projection matrices B and C

Input and output of the SSM channel.

B projects the input u_k into state space; C reads the output from the hidden state. In Mamba both become input-dependent (selective SSM).

Selective scan (Mamba)

Enables linear-time training of Mamba despite the absence of a convolutional form.

Hardware-aware parallel-scan recurrence that keeps state in GPU SRAM, required when parameters are input-dependent and convolutional mode is unavailable.
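What the selective scan computes can be written as a plain sequential loop. The sketch below is a slow NumPy reference for illustration only, not the fused CUDA kernel. The projection matrices `W_B`, `W_C`, `W_delta` and the bias are hypothetical stand-ins for the learned layers that produce B, C and Delta from the input, and the input term uses the simplified form Delta * B * u rather than the exact zero-order-hold expression.

```python
import numpy as np

def softplus(x):
    return np.logaddexp(0.0, x)

def selective_scan_reference(u, A, W_B, W_C, W_delta, delta_bias):
    """u: (L, D) inputs; A: (D, N) diagonal state matrix per channel."""
    L, D = u.shape
    B = u @ W_B                                   # (L, N), depends on the input
    C = u @ W_C                                   # (L, N), depends on the input
    delta = softplus(u @ W_delta + delta_bias)    # (L, D), kept positive
    h = np.zeros_like(A)                          # (D, N) recurrent state
    y = np.empty((L, D))
    for t in range(L):
        A_bar = np.exp(delta[t][:, None] * A)                     # per-step discretization
        Bu = delta[t][:, None] * B[t][None, :] * u[t][:, None]    # (D, N) input contribution
        h = A_bar * h + Bu
        y[t] = h @ C[t]                                           # read out with the step's C
    return y

rng = np.random.default_rng(0)
L, D, N = 32, 3, 4
u = rng.standard_normal((L, D))
A = -np.exp(rng.standard_normal((D, N)))          # negative entries keep the state stable
W_B = rng.standard_normal((D, N))
W_C = rng.standard_normal((D, N))
W_delta = 0.1 * rng.standard_normal((D, D))
delta_bias = np.full(D, -3.0)
y = selective_scan_reference(u, A, W_B, W_C, W_delta, delta_bias)
```

Because A_bar and Bu change at every step, no single kernel K exists, which is why the convolutional mode is unavailable and a scan is needed instead.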

Implementation

Reference implementations

state-spaces/s4

Python · HazyResearch / Albert Gu

Official

state-spaces/mamba

Python / CUDA · Albert Gu, Tri Dao

Official

Implementation pitfalls

Unstable discretization with poor Delta initialization

Severity: High

A Delta step initialized outside the recommended log-uniform range causes activations to explode or the state to die, especially in deep SSM stacks.

Fix: Use the recommended log-uniform initialization (e.g. 0.001 to 0.1) and a softplus when deriving Delta from the input, as in S4/Mamba.
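One way to realize this fix is to sample Delta log-uniformly and store the inverse softplus of each sample as the bias, so that applying softplus at run time returns exactly the sampled value. The sketch below is a NumPy illustration under that assumption; the function name `init_delta_bias` and its defaults are hypothetical, with the range taken from the 0.001 to 0.1 example above.

```python
import numpy as np

def softplus(x):
    return np.logaddexp(0.0, x)

def init_delta_bias(n_channels, dt_min=1e-3, dt_max=1e-1, seed=0):
    """Sample Delta log-uniformly in [dt_min, dt_max] and return the matching bias."""
    rng = np.random.default_rng(seed)
    log_dt = rng.uniform(np.log(dt_min), np.log(dt_max), size=n_channels)
    dt = np.exp(log_dt)
    # Inverse softplus: softplus(bias) == dt, written in a numerically stable form.
    bias = dt + np.log(-np.expm1(-dt))
    return dt, bias

dt, bias = init_delta_bias(n_channels=256)
recovered = softplus(bias)   # equals dt up to floating-point error
```

Sampling in log space spreads the initial time scales evenly across two orders of magnitude instead of clustering them near the upper bound.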

No convolutional form in selective SSM

Severity: High

When B, C or Delta depend on the input (Mamba), there is no global convolutional kernel, so a naive recurrent training loop is slow.

Fix: Use the official selective_scan_cuda (parallel scan) kernel that keeps state in GPU SRAM.
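The reason a parallel scan applies at all is that the update h_t = a_t * h_{t-1} + b_t composes associatively: applying (a1, b1) and then (a2, b2) is the same as applying (a1 * a2, a2 * b1 + b2). The sketch below demonstrates this with a simple doubling scheme in NumPy. It is an illustration of the idea only, with hypothetical function names, and says nothing about how the official kernel is organized internally.

```python
import numpy as np

def scan_sequential(a, b):
    """Reference: h_t = a_t * h_{t-1} + b_t with h_{-1} = 0, one step at a time."""
    h = np.zeros_like(b[0])
    out = np.empty_like(b)
    for t in range(len(b)):
        h = a[t] * h + b[t]
        out[t] = h
    return out

def scan_doubling(a, b):
    """Same result in about log2(L) vectorized rounds using the associative rule."""
    a, b = a.copy(), b.copy()
    d = 1
    while d < len(b):
        new_a, new_b = a.copy(), b.copy()
        # Combine each position with the partial result d steps earlier.
        new_b[d:] = a[d:] * b[:-d] + b[d:]
        new_a[d:] = a[d:] * a[:-d]
        a, b = new_a, new_b
        d *= 2
    return b

rng = np.random.default_rng(0)
L, D = 37, 3                                  # L is deliberately not a power of two
a = rng.uniform(0.5, 1.0, size=(L, D))        # per-step decay, as A_bar would be
b = rng.standard_normal((L, D))               # per-step input term, as B_bar * u would be
h_seq = scan_sequential(a, b)
h_par = scan_doubling(a, b)
```

Each round of `scan_doubling` is fully parallel over the sequence, so the number of dependent steps drops from L to roughly log2(L).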

Weaker copy and retrieval behavior

Severity: Medium

A small hidden state (e.g. N=16) limits exact copying of long context spans, where Transformers with a KV-cache remain stronger.

Fix: Use hybrids (Jamba, Zamba) that combine SSM layers with attention for tasks that require precise retrieval.
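A rough count of stored values shows where the retrieval gap comes from. The sketch below uses made-up model dimensions purely for illustration; it assumes standard multi-head attention without cache compression and ignores any auxiliary state an SSM layer may keep besides its recurrent state.

```python
def kv_cache_floats(seq_len, n_layers, d_model):
    """Keys and values cached for every past token in every layer."""
    return 2 * seq_len * n_layers * d_model

def ssm_state_floats(n_layers, d_inner, state_dim):
    """One (d_inner x state_dim) recurrent state per layer, independent of seq_len."""
    return n_layers * d_inner * state_dim

# Hypothetical dimensions, not taken from any specific model.
n_layers, d_model, d_inner, state_dim = 24, 1024, 2048, 16
short_ctx = kv_cache_floats(1_000, n_layers, d_model)
long_ctx = kv_cache_floats(100_000, n_layers, d_model)
ssm_fixed = ssm_state_floats(n_layers, d_inner, state_dim)
```

The KV-cache grows linearly with context and keeps every token addressable, while the SSM state stays the same size however long the input is, so long spans must be compressed into it rather than stored verbatim.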

Evolution

Original paper · 2021 · ICLR 2022 · Albert Gu

Efficiently Modeling Long Sequences with Structured State Spaces

Albert Gu, Karan Goel, Christopher Ré

2020

HiPPO: a theory of optimal history compression in state

Gu et al. introduce HiPPO, a framework for state operators that reconstruct the input signal in an orthogonal basis. Theoretical foundation of later SSMs.

HiPPO: Recurrent Memory with Optimal Polynomial Projections (paper)

2021

LSSL: the Linear State-Space Layer

First deep-learning SSM layer exposing the recurrent-convolutional duality; computationally expensive in practice.

Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers (paper)

2021

S4: Structured State Space

Inflection point

Introduces the diagonal-plus-low-rank (DPLR) structure and a frequency-domain parameterization; state of the art on Long Range Arena, handling sequences of up to 16k tokens.

Efficiently Modeling Long Sequences with Structured State Spaces (paper)

2022

S4D: simpler diagonal parameterization

Showed that a diagonal A matrix matches full S4 quality with a much simpler implementation.

On the Parameterization and Initialization of Diagonal State Space Models (paper)

2022

H3: SSM for language modeling

Hungry Hungry Hippos (H3) shows that SSMs can become competitive with Transformers on language modeling.

Hungry Hungry Hippos: Towards Language Modeling with State Space Models (paper)

2023

Mamba: selective state spaces

Inflection point

Gu and Dao introduce input-dependent B, C, Delta and a hardware-aware selective scan; the first SSM to match Transformers on language benchmarks at linear complexity.

Mamba: Linear-Time Sequence Modeling with Selective State Spaces (paper)

2024

Mamba-2 and hybrid models (Jamba, Zamba)

Mamba-2 unifies SSMs and attention via State Space Duality; Jamba (AI21) and Zamba mix Mamba and Transformer layers in production-scale LLMs.

Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality (paper)