BigBird builds the sparse attention matrix from three components: (1) Window attention — each token attends to W neighbours (like SWA; in the paper W=3, i.e. ±1), (2) Random attention — each token additionally attends to R randomly chosen keys (R=2–3 in the paper); from graph theory perspective this adds random edges to the attention graph, drastically shortening the average distance between any two tokens, (3) Global attention — g selected tokens attend to all and all attend to them (g typically 2–8, e.g. [CLS] + question tokens in QA). In total each layer performs O((W+R)·T + g·T) ≈ O(T) operations. Implementation-wise BigBird reorganises the sequence into blocks — random attention is sparse within pre-permuted blocks to retain GPU efficiency. The authors publish two variants: ETC (Extended Transformer Construction) — without random attention, only window + global; and full BigBird ITC (Internal Transformer Construction) with all three. Random attention is critical for the theoretical proofs.
Earlier sparse attention (Longformer/SWA, Sparse Transformer) worked empirically but lacked theoretical expressivity guarantees — it was unclear whether restricting attention to a window and a few global tokens stripped the model of fundamental capabilities. BigBird formalises the problem: it proves (1) SWA + global + random is a universal approximator of sequence functions, (2) it is Turing-complete, and (3) the window must have a lower bound on the number of global tokens of O(√T) for these properties to hold. Practically: it lets a Transformer safely scale to 4096–8192 tokens (8×–16× over BERT) without quality loss.
Each token attends to its W nearest neighbours. Mechanism taken directly from SWA/Longformer — responsible for preserving local coherence.
Official
Each token attends to R randomly chosen positions (with a fixed seed). From a graph-theoretic perspective, this adds random edges to the attention graph, reducing the average distance between any two tokens to O(log T) — crucial for the universality proof.
g selected tokens attend to the entire sequence and are visible to all other tokens. In practice [CLS], [SEP], and/or question tokens in QA. Theoretical lower bound: g = Ω(√T).
Official
Without the random component, BigBird reduces to Longformer (SWA + global) and loses its formal universality guarantees. ETC is a deliberate compromise; accidental omission breaks the semantics.
The paper formally requires g = Ω(√T) global tokens to preserve expressivity. For T=8192 that is ~90 tokens. Using only g=2 ([CLS]+[SEP]) on very long sequences degrades quality below theoretical predictions.
A literal random scatter-gather over the sequence is catastrophic on GPU (random access). Requires block-wise permutation.
The first widely cited work on deterministic attention sparsification (local + strided). Empirical results, no theoretical expressivity proofs.
Beltagy, Peters, Cohan (AI2) introduce SWA + global attention. They empirically show that this combination works for long-document encoders, but theory is missing.
Zaheer et al. (Google) publish BigBird at NeurIPS 2020. They introduce the third component (random attention) and — crucially — PROVE that SWA + global + random preserves universal approximation and Turing completeness of a standard Transformer. The first theoretical justification of sparse attention.
Google releases trained BigBird models on Hugging Face (based on RoBERTa and Pegasus) for QA and summarisation tasks. Support added to the Transformers library.
Newer large LLMs (Mistral, Mixtral, Gemma) choose plain SWA without random attention — random turns out harder to implement efficiently on GPU and empirically L·W (depth × window) suffices. BigBird remains an important theoretical reference.
Time complexity: O((W + R) · T · d + g · T · d) ≈ O(T · d) per warstwa. Space complexity: O(T · (W + R + g) · d) aktywacji + O(T · d) KV cache.
The real bottleneck in BigBird is the memory access pattern of random attention — without block-wise permutation it would be catastrophic on GPU. With permutation it is acceptable but still slower than pure SWA. The main reason newer LLMs (Mistral, Gemma) chose plain SWA.
Number of local neighbours per token. Very small in the original paper (W=3, i.e. ±1) — the BigBird design assumes expressivity is achieved mostly via random+global, not via a large window.
Number of random keys each token attends to. Small but non-zero — critical for the theoretical proofs. The paper uses R=2–3.
Number of tokens with full attention. The theoretical lower bound is O(√T). In practice typically 2–32 special tokens (e.g. [CLS], question tokens).
ETC: window + global, no random — easier to implement, used in encoder QA. ITC: full window + random + global — the variant with strong theoretical guarantees.
Implementation parameter: random attention is realised at the block level (not single tokens) to keep memory access contiguous. The paper uses a block size of 64.
Global tokens operate in "dense" mode (attention over the whole sequence), the rest of tokens in "sparse" mode — this is a mixed variant.
No learned routing. Window and global are deterministic, random is random (with a fixed seed) but not learned.
All three streams (window, random, global) are parallel within a layer. Random attention is implemented via a block-wise sequence permutation to keep memory access sequential (efficient on GPU).
Random attention requires careful block-based implementation to avoid random VRAM access. With 64×64 blocks BigBird achieves good tensor core utilisation, though worse than dense or plain SWA.
The original Google implementation was optimised for TPUs — block-based random attention maps very well onto the TPU systolic array structure.
The algorithm itself works anywhere, but real efficiency requires block-based kernels — on CPU/AVX BigBird is slower than full attention for T<2048.