Sparse Transformer splits attention into TWO FACTORISED heads operating in parallel: (1) Local attention — each token i attends to the previous L tokens (where L=√T), (2) Strided/Fixed attention — a second pattern ensuring that at least one position in every chunk of context is visible. Strided variant: token i also attends to positions i-L, i-2L, i-3L, … (every L-th token back). Fixed variant: selected "summary" positions within the sequence (one token per L positions) attend to the entire previous L and are visible to all subsequent — analogous to global tokens in Longformer/BigBird. Key theorem: after one layer token i communicates with O(L)=O(√T) positions. After TWO layers — with the entire sequence (any two tokens share a common neighbour in the attention graph). This guarantees that two layers of sparse composition are functionally equivalent to one dense layer, at O(T·√T) cost instead of O(T²). In practice OpenAI models had 128 layers, giving ample depth for propagation.
Standard self-attention scales as O(T²·d) — for T=12 288 (typical CIFAR-10 64×64 image resolution as a pixel sequence), the attention matrix takes ~600 MB per layer. The practical limit on 16GB GPUs of the 2019 era was ~3000 tokens. Sparse Transformer solves this via deterministic sparse patterns: instead of the full [T, T] matrix, each head only computes [T, √T]. This allowed OpenAI to train autoregressive models for images, raw audio (Wavenet-scale), and MIDI music — previously impossible with dense attention.
The first of two factorised heads. Each token attends to L previous positions (causally). Responsible for local precision.
Official
The second factorised head. Strided variant: token attends to positions i-L, i-2L, … providing global communication. Fixed variant: selected summary tokens every L positions are visible to all subsequent ones. Without this head, global propagation would require O(T/L) layers.
Official
Implementation component crucial for efficiency. Operates on fixed-size blocks (typically 32×32 or 64×64), never materialising the full [T, T] matrix. Without it the sparse pattern gives no real saving.
Official
Implementing the sparse pattern by masking the full [T, T] matrix keeps the O(T²) cost — negating the whole point of the method. Sparse Transformer requires a kernel that operates natively on blocks.
O(T·√T) optimality holds only for L ≈ √T. Too small L = too many layers needed for global propagation. Too large L = loss of cost savings vs dense.
Strided works well for periodic data (images, audio), fixed for non-periodic (text). Using strided for text yields worse results than a simple dense baseline at short contexts.
Original Transformer with O(T²·d) attention. Practical sequence length limit of 512–1024 tokens on 2017–2018 hardware. The starting point for all sparse alternatives.
Dai et al. (CMU/Google) introduce inter-segment recurrence — an alternative long-context approach without modifying the attention matrix itself. Parallel work to Sparse Transformer (publications a few months apart).
Child, Gray, Radford, Sutskever publish Sparse Transformer (arXiv:1904.10509). The first practical autoregressive architecture with deterministic sparse attention. Introduces head factorisation, a custom CUDA block-sparse kernel, and 128-layer models. Trains on images, audio, and text up to T=12 288.
OpenAI in GPT-3 (175B) uses alternating dense and sparse layers (a Sparse Transformer variant) — the first deployment of sparse attention in a large production language model.
Direct successor of Sparse Transformer for encoders. Simplifies the pattern to local window + global tokens, dropping the strided component.
Google publishes BigBird combining SWA + global + random and formally proves universality of such combination. Closes the theoretical gap left by the empirical Sparse Transformer.
Mistral AI releases Mistral 7B with causal SWA. Continuation of the Sparse Transformer → Longformer → SWA line, but in a modern large LLM. Sparse Transformer remains a historical reference.
Time complexity: O(T · √T · d) per warstwa (zamiast O(T² · d) dense). Space complexity: O(T · √T · d) aktywacji + O(T · d) KV.
Without an efficient block-sparse kernel the entire O(T·√T) saving is lost. The largest historical limitation — required custom CUDA, which for a long time was a deployment barrier.
Local window size and strided spacing. Optimally L ≈ √T so each head costs O(T·√T). For T=12 288 that gives L≈128.
Strided: even spacing every L positions — works well for data with strong periodic structure (images, audio). Fixed: summary tokens every L positions — works better for non-periodic data (text).
How to split the two patterns across MHA heads. The paper tests variants: (a) all heads run both patterns sequentially, (b) half heads local, half strided, (c) interleaved per layer.
The paper introduces optimised gradient checkpointing for long sequences — enables training 128-layer models on 12 288-token sequences on a single V100.
OpenAI in the paper also publishes a custom CUDA kernel ("block-sparse attention") for efficient GPU execution — operates on fixed-size blocks.
No learned routing — sparsity is deterministic (geometric strided/fixed patterns).
Both factorised heads (local and strided/fixed) are parallel. Sparse patterns are deterministic and statically known, so they can be precomputed before training.
Sparse Transformer was designed alongside a custom block-sparse CUDA kernel for V100. The block-based nature of the pattern maps well onto tensor cores.
Without a block-sparse kernel the method is slower than dense for T<2048. Real benefit requires a dedicated low-level implementation.