Architecture

BigBird

2020 · Active · Published: 9 June 2026 · Updated: 9 June 2026

Key innovation

Combines three complementary sparse attention patterns (local sliding window, random connections, and global tokens) into a single O(T) per-layer mechanism. The authors formally prove that this mechanism, like full attention, is a universal approximator of sequence functions and Turing-complete.

How it works

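BigBird builds the sparse attention matrix from three components:

(1) Window attention: each token attends to W neighbours, as in sliding-window attention (SWA). The paper uses W=3, i.e. the token itself and one neighbour on each side.
(2) Random attention: each token additionally attends to R randomly chosen keys (R=2–3 in the paper). From a graph-theory perspective this adds random edges to the attention graph, which sharply shortens the average distance between any two tokens.
(3) Global attention: g selected tokens attend to all tokens and all tokens attend to them (g is typically 2–8, e.g. [CLS] plus the question tokens in QA).

In total each layer performs O((W+R)·T + g·T) operations, which is O(T) as long as W, R, and g are constants that do not grow with T. In the released configurations these sizes are counted in blocks rather than single tokens: BigBird groups the sequence into blocks and applies all three patterns at block level, so random attention becomes a gather of whole blocks followed by dense matrix multiplications, which keeps it efficient on GPU.

The authors describe two variants that differ in how the global tokens are obtained. ITC (Internal Transformer Construction) promotes existing tokens to global; ETC (Extended Transformer Construction) adds extra global tokens, and the ETC configuration reported in the paper runs without random attention (window + global only). The paper motivates random attention with random-graph arguments, while its universal-approximation proof relies on the attention graph containing the star graph formed by the global tokens.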
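The three patterns can be illustrated with a small token-level mask. This is a minimal sketch for intuition only, not the official block-based implementation: the function name `bigbird_mask` and its parameters are hypothetical, and sizes are counted in tokens rather than blocks.

```python
import random

def bigbird_mask(T, window=3, num_random=2, global_idx=(0,), seed=0):
    """Return a T x T boolean matrix; mask[q][k] is True if query q may attend to key k.

    Hypothetical helper for illustration: token-level, not block-level.
    """
    rng = random.Random(seed)  # fixed seed so the random pattern is reproducible
    half = window // 2
    mask = [[False] * T for _ in range(T)]
    for q in range(T):
        # (1) Window attention: the token itself plus `half` neighbours on each side.
        for k in range(max(0, q - half), min(T, q + half + 1)):
            mask[q][k] = True
        # (2) Random attention: sample keys that the window does not already cover.
        candidates = [k for k in range(T) if not mask[q][k]]
        for k in rng.sample(candidates, min(num_random, len(candidates))):
            mask[q][k] = True
    # (3) Global attention: global tokens see every key and are seen by every query.
    for g in global_idx:
        for i in range(T):
            mask[g][i] = True
            mask[i][g] = True
    return mask

mask = bigbird_mask(T=16, window=3, num_random=2, global_idx=(0,))
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Printing the mask shows a diagonal band (window), a full first row and first column (global token 0), and a few scattered entries per row (random). Every non-global row has at most W + R + g allowed keys, which is where the linear cost comes from.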

Problem solved

Earlier sparse attention (Longformer/SWA, Sparse Transformer) worked empirically but lacked theoretical expressivity guarantees: it was unclear whether restricting attention to a window and a few global tokens stripped the model of fundamental capabilities. BigBird formalises the problem and states that (1) SWA + global + random is a universal approximator of sequence functions, (2) it is Turing-complete, and (3) the number of global tokens needs a lower bound of g = Ω(√T) for these properties to hold. In practice it lets a Transformer scale to 4096–8192 tokens (8×–16× BERT's 512-token limit) while keeping attention cost roughly linear in T.
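To make the scaling argument concrete, the following sketch counts (query, key) pairs per layer for full and sparse attention. It assumes W, R, and g are counted in tokens and uses illustrative default values; `attention_pairs` is a hypothetical helper, and the sparse count is an upper bound because overlaps between the three patterns are not subtracted.

```python
def attention_pairs(T, window=3, num_random=3, num_global=2):
    """Return (full, sparse) counts of (query, key) pairs for sequence length T."""
    full = T * T
    # Each non-global query reads window + random + global keys;
    # each global query reads all T keys.
    sparse = (T - num_global) * (window + num_random + num_global) + num_global * T
    return full, sparse

for T in (512, 4096, 8192):
    full, sparse = attention_pairs(T)
    print(f"T={T}: full={full}, sparse={sparse}, ratio={full / sparse:.1f}")
```

Doubling T roughly doubles the sparse count but quadruples the full count, so the gap widens as sequences get longer.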

Components

Window attention (W) · Local coherence, token neighbourhood

Each token attends to its W nearest neighbours. The mechanism is taken directly from SWA/Longformer and is responsible for preserving local coherence.

IN: Local-window (query, key) pairs.

OUT: Local value aggregation.

Random attention (R) · Global information propagation in O(log T) hops

Each token attends to R randomly chosen positions (with a fixed seed). From a graph-theoretic perspective, this adds random edges to the attention graph, reducing the average distance between any two tokens to O(log T). This short-path property of random graphs is what the paper uses to motivate the component.
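The effect of random edges on path length can be checked with a breadth-first search. This is a rough sketch under simplifying assumptions: attention edges are treated as undirected, sizes are in tokens, and `build_graph` and `mean_distance` are hypothetical helper names.

```python
import random
from collections import deque

def build_graph(T, window=3, num_random=0, seed=0):
    """Adjacency sets for a window graph with optional random edges (undirected)."""
    rng = random.Random(seed)
    half = window // 2
    adj = [set() for _ in range(T)]
    for i in range(T):
        # Window edges to neighbours within +-half positions.
        for j in range(max(0, i - half), min(T, i + half + 1)):
            if j != i:
                adj[i].add(j)
                adj[j].add(i)
        # Random edges to uniformly sampled positions.
        for j in rng.sample(range(T), num_random):
            if j != i:
                adj[i].add(j)
                adj[j].add(i)
    return adj

def mean_distance(adj, source=0):
    """Mean BFS hop count from `source` to every other reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return sum(dist.values()) / (len(dist) - 1)

T = 256
window_only = mean_distance(build_graph(T, num_random=0))
with_random = mean_distance(build_graph(T, num_random=2))
print(f"window only: {window_only:.1f} hops, window + random: {with_random:.1f} hops")
```

With W=3 and no random edges the graph is a chain, so the mean distance from token 0 grows linearly with T (128 hops for T=256). Adding two random edges per token collapses it to a handful of hops.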

IN: Random (query, key) pairs per token.

OUT: Randomly sampled value aggregation.

ITC random (canonical BigBird) · R=2–3 random connections per token with block-wise permutation.

ETC without random · Random component dropped; easier implementation, weaker guarantees.

Global attention (g tokens) · Global bridge: information from any position can reach any other position in at most two hops

g selected tokens attend to the entire sequence and are visible to all other tokens. In practice [CLS], [SEP], and/or question tokens in QA. Theoretical lower bound: g = Ω(√T).

IN: Bidirectional full attention for g tokens.

OUT: Values enriched with the global signal.

Implementation

Reference implementations

google-research/bigbird (official repo)

Python (TensorFlow / JAX) · Google Research (Zaheer et al.)

Hugging Face Transformers — BigBirdModel / BigBirdPegasus

Python (PyTorch) · Hugging Face / Google

Implementation pitfalls

Skipping random attention (degeneration to Longformer) · Medium

Without the random component, BigBird reduces to a Longformer-style pattern (SWA + global) and loses the random-graph shortcuts that motivate the full design. Running ETC without random attention is a deliberate compromise; omitting the component by accident silently changes the attention pattern the model was configured for.

Fix: If the goal is to match the paper's full construction, use the ITC variant with R≥2. If only efficiency matters, deliberately pick ETC and document the configuration.

Too few global tokens for long sequences · Medium

The theoretical lower bound for preserving expressivity is g = Ω(√T) global tokens. For T=8192 that is ~90 tokens (√8192 ≈ 90.5). Using only g=2 ([CLS]+[SEP]) on very long sequences falls far below that bound and can degrade quality.

Fix: Scale the number of global tokens proportionally to √T or use task-specific global tokens (e.g. all question tokens in QA).
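A small helper can make the √T rule explicit when choosing global positions. This is a sketch under assumptions: the constant in Ω(√T) is taken as 1, and both `min_global_tokens` and `pick_global_positions` (including its evenly spaced fill strategy) are hypothetical, not part of any BigBird release.

```python
import math

def min_global_tokens(T):
    """Smallest integer g with g >= sqrt(T) (assumes a constant factor of 1)."""
    root = math.isqrt(T)
    return root if root * root == T else root + 1

def pick_global_positions(T, task_positions=()):
    """Keep task-specific global tokens, then fill with evenly spaced positions."""
    target = min_global_tokens(T)
    chosen = set(task_positions)
    stride = max(1, T // target)
    for pos in range(0, T, stride):
        if len(chosen) >= target:
            break
        chosen.add(pos)
    return sorted(chosen)

# Example: [CLS] at position 0 and two question tokens at positions 1 and 2.
positions = pick_global_positions(8192, task_positions=(0, 1, 2))
print(min_global_tokens(8192), len(positions))
```

For T=8192 the helper asks for 91 global tokens, in line with the ~90 quoted above; task tokens are always kept and the remainder is spread across the sequence.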

Naive random-attention implementation: scattered GPU memory access · High

A literal token-level random scatter-gather over the sequence performs very poorly on GPU, because every query reads keys from unrelated memory locations. Efficient implementations select random keys block-wise instead.

Fix: Use the official block-based implementation (block_size=64) instead of token-level random access.
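The block-wise idea can be sketched as a plan that lists, for each query block, which key blocks it reads. This is an illustration rather than the official code: `block_sparse_plan` is a hypothetical name, and treating the first and last blocks as global is an assumption made for the example.

```python
import random

def block_sparse_plan(T, block_size=64, num_random_blocks=3, seed=0):
    """For each query block, return the sorted list of key-block indices it attends to."""
    assert T % block_size == 0, "sequence length must be a multiple of block_size"
    n_blocks = T // block_size
    rng = random.Random(seed)
    global_blocks = {0, n_blocks - 1}  # assumption: first and last block are global
    plan = []
    for qb in range(n_blocks):
        if qb in global_blocks:
            # Global query blocks attend to every key block.
            plan.append(list(range(n_blocks)))
            continue
        # Window: the block itself and its immediate neighbours.
        window = {b for b in (qb - 1, qb, qb + 1) if 0 <= b < n_blocks}
        taken = window | global_blocks
        # Random: whole blocks sampled from what is not yet covered.
        candidates = [b for b in range(n_blocks) if b not in taken]
        rand = rng.sample(candidates, min(num_random_blocks, len(candidates)))
        plan.append(sorted(taken | set(rand)))
    return plan

plan = block_sparse_plan(T=4096, block_size=64, num_random_blocks=3)
print(len(plan), plan[5])
```

Because every entry in the plan is a whole block, the keys for one query block can be gathered as a few contiguous slices and multiplied as dense block_size × block_size tiles, which avoids per-token random access.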

Evolution

Original paper · 2020 · NeurIPS 2020 (Google Research) · Manzil Zaheer

Big Bird: Transformers for Longer Sequences

Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed

2019

Sparse Transformer (Child et al., OpenAI)

The first widely cited work on deterministic attention sparsification (local + strided). Empirical results, no theoretical expressivity proofs.

2020

Longformer — SWA + global tokens

Beltagy, Peters, Cohan (AI2) introduce SWA + global attention. They empirically show that this combination works for long-document encoders, but theory is missing.

2020

BigBird — formal theory of sparse attention

Inflection point

Zaheer et al. (Google) publish BigBird at NeurIPS 2020. They introduce the third component (random attention) and, crucially, prove that SWA + global + random preserves the universal approximation and Turing completeness of a standard Transformer. It is among the first theoretical justifications of sparse attention.

Big Bird: Transformers for Longer Sequences (paper)

2021

BigBird-Pegasus, BigBird-Roberta — production checkpoints

Google releases trained BigBird models on Hugging Face (based on RoBERTa and Pegasus) for QA and summarisation tasks. Support added to the Transformers library.

2023

Decline of BigBird in new large LLMs

Newer LLMs that restrict attention locally (e.g. Mistral 7B, Gemma 2) choose plain SWA without random attention. Random attention turns out to be harder to implement efficiently on GPU, and empirically the receptive field of roughly L·W tokens (depth × window) reached by stacking L layers suffices. BigBird remains an important theoretical reference.

