Architecture

Sparse Transformer

2019 · Historical · Published: 9 June 2026 · Updated: 9 June 2026

Key innovation

The first widely recognised sparse attention scheme for autoregressive Transformers. It reduces attention cost from O(T²) to O(T·√T) via factorised, deterministic patterns (strided and fixed), designed so that any token can reach any earlier token in at most 2 attention hops. It opened the line of long-context architectures that followed (Longformer, BigBird, SWA).

How it works

Sparse Transformer factorises attention into two sparse patterns, each cheap on its own. They can run as separate heads in parallel or be interleaved across consecutive layers.

(1) Local attention: each token i attends to the previous L tokens, where L ≈ √T.

(2) Strided or fixed attention: a second pattern ensuring that at least one position in every chunk of context is visible. In the strided variant, token i also attends to positions i-L, i-2L, i-3L, … (every L-th token back). In the fixed variant, selected "summary" positions (one per L positions) attend to the preceding L tokens and are visible to all subsequent tokens, analogous to global tokens in Longformer/BigBird.

Key property: after one layer, token i has communicated with O(L) = O(√T) positions. After two layers it can have received information from the entire preceding sequence, because any two tokens share a common neighbour in the attention graph. Two sparse layers therefore match the connectivity of one dense layer (though not necessarily the same function) at O(T·√T) cost instead of O(T²). In practice the deepest model in the paper had 128 layers, giving ample depth for propagation.
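The two-hop property can be checked by brute force on a small sequence. The sketch below assumes the strided variant and assumes that both patterns are available in every layer (as with parallel heads); the function names are illustrative, not taken from the official code.

```python
import math

def strided_pattern(T, L):
    """Per-token key sets for the two causal patterns (local, strided)."""
    local = [set(range(max(0, i - L), i + 1)) for i in range(T)]
    strided = [set(range(i, -1, -L)) for i in range(T)]  # i, i-L, i-2L, ...
    return local, strided

def all_pairs_within_two_hops(T, L):
    """True if every position j <= i is reachable from i in at most two steps."""
    local, strided = strided_pattern(T, L)
    one_hop = [local[i] | strided[i] for i in range(T)]
    for i in range(T):
        # Everything visible from any position that i itself can see.
        two_hop = set().union(*(one_hop[k] for k in one_hop[i]))
        if two_hop != set(range(i + 1)):
            return False
    return True

T = 64
L = math.isqrt(T)  # stride and local window, here 8
local, strided = strided_pattern(T, L)
max_keys = max(len(local[i] | strided[i]) for i in range(T))
print(all_pairs_within_two_hops(T, L), max_keys)
```

Each token touches at most about 2·L keys, yet two applications of the pattern cover the whole causal prefix.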

Problem solved

Standard self-attention scales as O(T²·d). For T = 12 288 (a 64×64 RGB image, as in downsampled ImageNet, flattened to a sequence of 64·64·3 byte values; CIFAR-10 at 32×32 gives T = 3072), the attention matrix has about 151 million entries, roughly 600 MB per head per layer in fp32. On the 16 GB GPUs of the 2019 era, dense attention was practical only up to a few thousand tokens. Sparse Transformer addresses this with deterministic sparse patterns: instead of the full [T, T] matrix, each head computes only about [T, √T] entries. This allowed OpenAI to train autoregressive models on images, text, and raw audio at sequence lengths that were impractical with dense attention.
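The memory figure can be reproduced with a few lines of arithmetic. This is a back-of-the-envelope sketch that assumes one head, 4-byte (fp32) entries, and exactly ⌊√T⌋ keys per query for the sparse case; `attention_matrix_bytes` is a hypothetical helper.

```python
import math

def attention_matrix_bytes(T, keys_per_query, bytes_per_entry=4):
    """Size of one head's attention score matrix for T queries."""
    return T * keys_per_query * bytes_per_entry

T = 64 * 64 * 3                                       # 12288 positions
dense = attention_matrix_bytes(T, T)                  # full [T, T] matrix
sparse = attention_matrix_bytes(T, math.isqrt(T))     # about [T, sqrt(T)]

print(f"dense:  {dense / 1e6:.0f} MB")
print(f"sparse: {sparse / 1e6:.1f} MB")
```

Under these assumptions the dense matrix is about two orders of magnitude larger than the sparse one, before counting activations stored for the backward pass.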

Components

Local attention head · Local coherence and short-range dependencies

The first of two factorised heads. Each token attends to L previous positions (causally). Responsible for local precision.

IN: Local backward-window (query, key) pairs.

OUT: Local value aggregation.

Official

Strided / Fixed head · Global propagation in 2 hops

The second factorised head. Strided variant: token attends to positions i-L, i-2L, … providing global communication. Fixed variant: selected summary tokens every L positions are visible to all subsequent ones. Without this head, global propagation would require O(T/L) layers.

IN: (Query, key) pairs to every L-th token (strided) or to summary tokens (fixed).

OUT: Global value aggregation in 2 hops.

Strided: every L-th token back, for periodic data (images, audio).

Fixed: summary tokens every L positions, for non-periodic data (text).

Official
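The fixed variant can be written down as explicit index sets. The sketch below is one possible reading: the local set is the token's own chunk of L positions, and the summary set is the last `c` positions of every chunk at or before the token. The parameter `c` and the function name are assumptions made for illustration.

```python
def fixed_pattern(T, L, c=1):
    """Per-token key sets (local, summary) for a causal fixed pattern."""
    local, summary = [], []
    for i in range(T):
        chunk_start = (i // L) * L
        # Local: earlier positions inside the same chunk of L tokens.
        local.append(set(range(chunk_start, i + 1)))
        # Summary: the last c positions of each chunk, if not in the future.
        summary.append({j for j in range(i + 1) if j % L >= L - c})
    return local, summary

local, summary = fixed_pattern(T=16, L=4, c=1)
print(sorted(local[10]), sorted(summary[10]))
```

Token 10 sees its own chunk (positions 8 to 10) directly and reaches older chunks through their summary positions 3 and 7, which in turn saw their whole chunk in the previous layer.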

Block-sparse CUDA kernel · Bridge between theoretical sparsity and practical GPU efficiency

Implementation component crucial for efficiency. Operates on fixed-size blocks (typically 32×32 or 64×64), never materialising the full [T, T] matrix. Without it the sparse pattern gives no real saving.

Official
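A block-sparse kernel is driven by a block-level layout: a small 0/1 matrix saying which tiles of the [T, T] score matrix are computed at all. The sketch below builds such a layout for a fixed-style pattern at block granularity. It is an illustrative construction, not the layout format of the official kernel; `fixed_block_layout` and `c_blocks` are hypothetical names, and L is assumed to be a multiple of the block size.

```python
def fixed_block_layout(T, L, block, c_blocks=1):
    """0/1 matrix over blocks: 1 means the kernel computes that tile."""
    assert T % block == 0 and L % block == 0
    n = T // block   # number of blocks along each side
    g = L // block   # blocks per chunk of L tokens
    layout = [[0] * n for _ in range(n)]
    for r in range(n):
        for col in range(r + 1):                 # causal: no future blocks
            same_chunk = (col // g) == (r // g)  # local part
            is_summary = (col % g) >= g - c_blocks  # last blocks of a chunk
            if same_chunk or is_summary:
                layout[r][col] = 1
    return layout

layout = fixed_block_layout(T=1024, L=32, block=8)
nonzero = sum(map(sum, layout))
density = nonzero / (len(layout) ** 2)
print(nonzero, f"{density:.3f}")
```

Only the tiles marked 1 are ever loaded or multiplied; tiles on the diagonal still need an element-wise causal mask inside the kernel.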

Implementation

Reference implementations

openai/sparse_attention (official repo with block-sparse CUDA kernel)

Python (TensorFlow) / CUDA · OpenAI (Child, Gray)

Official

Triton — sparse attention reimplementations

Python / Triton · OpenAI

Official

Implementation pitfalls

Naive implementation via mask on the full matrix · Critical

Implementing the sparse pattern by masking the full [T, T] matrix keeps the O(T²) cost — negating the whole point of the method. Sparse Transformer requires a kernel that operates natively on blocks.

Fix: Use OpenAI's official block-sparse CUDA kernel or a Triton/FlashAttention reimplementation with `sparse_block_pattern`.
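The difference between masking and computing sparsely can be seen in a toy, pure-Python comparison. Both functions below return the same attention output for a strided pattern, but the masked version scores all T·T pairs while the gathered version scores only the listed keys. This is a didactic sketch with hypothetical function names, not a stand-in for a GPU kernel.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]   # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def masked_dense_attention(q, k, v, pattern):
    """Scores every (i, j) pair, then masks: quadratic work."""
    T, d, out, n_scores = len(q), len(q[0]), [], 0
    for i in range(T):
        scores = []
        for j in range(T):
            n_scores += 1
            s = dot(q[i], k[j]) / math.sqrt(d)
            scores.append(s if j in pattern[i] else -math.inf)
        w = softmax(scores)
        out.append([sum(w[j] * v[j][c] for j in range(T)) for c in range(d)])
    return out, n_scores

def gathered_sparse_attention(q, k, v, pattern):
    """Scores only the keys listed in the pattern."""
    d, out, n_scores = len(q[0]), [], 0
    for i in range(len(q)):
        keys = sorted(pattern[i])
        n_scores += len(keys)
        w = softmax([dot(q[i], k[j]) / math.sqrt(d) for j in keys])
        out.append([sum(wj * v[j][c] for wj, j in zip(w, keys)) for c in range(d)])
    return out, n_scores

T, L, d = 16, 4, 4
rng = random.Random(0)
q, k, v = ([[rng.gauss(0, 1) for _ in range(d)] for _ in range(T)] for _ in range(3))
# Strided pattern: local window of L plus every L-th earlier position.
pattern = [set(range(max(0, i - L), i + 1)) | set(range(i, -1, -L)) for i in range(T)]

dense_out, dense_n = masked_dense_attention(q, k, v, pattern)
sparse_out, sparse_n = gathered_sparse_attention(q, k, v, pattern)
print(dense_n, sparse_n)
```

The gap between the two counts widens as T grows, which is the saving a block-sparse kernel realises on hardware.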

Wrong choice of L vs sequence length · High

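O(T·√T) optimality holds only for L ≈ √T. Each token attends to about L local positions plus T/L strided or summary positions. Too small an L makes the second term dominate (or, with local attention alone, requires many layers for global propagation). Too large an L erodes the cost savings vs dense.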

Fix: For each target length T choose L ≈ √T, rounded to a hardware-friendly value (e.g. T=12 288 → √T ≈ 111 → L=128, T=4096 → L=64).
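The trade-off can be made concrete by counting keys per query as L + T/L. The sketch below assumes that simple cost model and rounds L to the nearest power of two on a log scale; `pick_stride` and `keys_per_query` are hypothetical helpers.

```python
import math

def keys_per_query(T, L):
    """Approximate keys seen by one token: L local plus T/L strided."""
    return L + math.ceil(T / L)

def pick_stride(T):
    """Power of two closest to sqrt(T) on a log scale."""
    return 2 ** round(math.log2(math.sqrt(T)))

T = 12288
for L in (16, pick_stride(T), 1024):
    print(L, keys_per_query(T, L))
```

For T = 12 288 the chosen L = 128 gives 224 keys per query, close to the 2·√T ≈ 222 minimum, while L = 16 or L = 1024 costs several times more.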

Confusing strided vs fixed variant for the wrong data type · Medium

Strided works well for periodic data (images, audio), fixed for non-periodic (text). Using strided for text yields worse results than a simple dense baseline at short contexts.

Fix: For text use the fixed variant (analogous to global tokens in Longformer/BigBird).

Evolution

Original paper · 2019 · arXiv:1904.10509 (OpenAI) · Rewon Child

Generating Long Sequences with Sparse Transformers

Rewon Child, Scott Gray, Alec Radford, Ilya Sutskever

2017

Transformer (Vaswani et al.) — dense baseline

Original Transformer with O(T²·d) attention. Practical sequence length limit of 512–1024 tokens on 2017–2018 hardware. The starting point for all sparse alternatives.

Transformer (concept)

2019

Transformer-XL — segment recurrence

Dai et al. (CMU/Google) introduce inter-segment recurrence — an alternative long-context approach without modifying the attention matrix itself. Parallel work to Sparse Transformer (publications a few months apart).

2019

Sparse Transformer — OpenAI paper

Inflection point

Child, Gray, Radford, Sutskever publish Sparse Transformer (arXiv:1904.10509). The first practical autoregressive architecture with deterministic sparse attention. Introduces head factorisation, a custom CUDA block-sparse kernel, and models up to 128 layers deep. Trains on images and text at T=12 288, and on raw audio at longer sequence lengths.

Generating Long Sequences with Sparse Transformers (paper)

2020

GPT-3 — sparse layers in a production LLM

In GPT-3 (175B), OpenAI uses alternating dense and locally banded sparse attention layers, a pattern similar to the Sparse Transformer. This is the first deployment of sparse attention in a large production language model.

2020

Longformer (Beltagy et al.) — SWA + global

Direct successor of Sparse Transformer for encoders. Simplifies the pattern to local window + global tokens, dropping the strided component.

SWA (concept)

2020

BigBird (Zaheer et al., Google) — formal theory

Google publishes BigBird combining SWA + global + random and formally proves universality of such combination. Closes the theoretical gap left by the empirical Sparse Transformer.

BigBird (concept)

2023

Mistral 7B — SWA as a full LLM architecture

Mistral AI releases Mistral 7B with causal SWA. Continuation of the Sparse Transformer → Longformer → SWA line, but in a modern large LLM. Sparse Transformer remains a historical reference.
