Architecture

Compressive Transformer

2019 · Historical · Published: 9 June 2026 · Updated: 9 June 2026

Key innovation

Extends Transformer-XL with a SECOND tier of memory — instead of discarding old hidden states from the FIFO buffer, it compresses them with a function (1D conv / pooling / most-used) into a denser representation. Yields ~3–4× longer effective context at comparable VRAM. Introduces the PG-19 dataset (books) as the first systematic long-range language modelling benchmark.

How it works

Compressive Transformer extends the Transformer-XL memory loop with a third hierarchy level. Attention in each layer draws on three sources:

(1) Current segment (T tokens): full attention, with queries, keys and values active.
(2) Short-term memory M (M tokens): hidden states of the last N segments, as in Transformer-XL, acting as keys/values for the current segment.
(3) Long-term compressed memory M_c (M_c tokens): when the oldest segment is about to be evicted from the M buffer, it is not discarded but compressed by a function f_compress: R^{c×d} → R^{1×d}, where c is the compression rate (typically 3 or 4). Compressed tokens land in the M_c buffer, which also acts as keys/values for the current attention.

f_compress functions tested in the paper:

(a) 1D mean pooling: average of c consecutive vectors.
(b) 1D max pooling: pointwise max over c consecutive vectors.
(c) 1D conv (kernel size c, stride c): learned, best empirical results.
(d) Dilated conv: wider receptive field.
(e) Most-used: keep the tokens with the highest cumulative attention from previous queries and drop the rest.

Training the compressors: besides the standard cross-entropy, the authors introduce an attention-reconstruction loss. For the same queries, attention over the original hidden states and attention over their compressed counterparts should give similar outputs. This pushes the compressor to preserve the information that attention actually uses.

Position: relative PE as in Transformer-XL, but compressed tokens receive a special positional offset proportional to c.
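The memory loop described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's code: the function names (`mean_pool_compress`, `update_memories`) are hypothetical, mean pooling stands in for f_compress, and the sketch assumes the number of evicted states is a multiple of c and that `cmem_len` is positive.

```python
import numpy as np

def mean_pool_compress(states: np.ndarray, c: int) -> np.ndarray:
    """Average every c consecutive rows: (n, d) -> (n // c, d)."""
    n, d = states.shape
    usable = (n // c) * c  # assumption: a ragged tail, if any, is dropped
    return states[:usable].reshape(n // c, c, d).mean(axis=1)

def update_memories(memory, compressed_memory, new_states, mem_len, cmem_len, c):
    """Push new_states into the FIFO memory M and compress whatever falls out."""
    memory = np.concatenate([memory, new_states], axis=0)
    overflow = max(memory.shape[0] - mem_len, 0)
    evicted, memory = memory[:overflow], memory[overflow:]
    if overflow >= c:
        compressed = mean_pool_compress(evicted, c)
        compressed_memory = np.concatenate([compressed_memory, compressed], axis=0)
        # M_c is itself a FIFO: its oldest slots are dropped for good.
        compressed_memory = compressed_memory[-cmem_len:]
    return memory, compressed_memory

# Usage: d = 4, segments of 6 tokens, M holds 6 slots, M_c holds 4, c = 3.
d = 4
mem, cmem = np.zeros((0, d)), np.zeros((0, d))
segment = np.arange(24, dtype=float).reshape(6, d)
mem, cmem = update_memories(mem, cmem, segment, mem_len=6, cmem_len=4, c=3)
mem, cmem = update_memories(mem, cmem, segment + 100.0, mem_len=6, cmem_len=4, c=3)
print(mem.shape, cmem.shape)
```

After the second call the first segment has left M and occupies two compressed slots in M_c, each the mean of three original hidden states.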

Problem solved

Transformer-XL keeps hidden states of the last N segments in a FIFO buffer and discards older ones. This is simple but wasteful: discarded information is irretrievably lost, even though distant references are crucial for tasks like book modelling (PG-19). Naively increasing M (memory length) scales VRAM linearly with M, which is impractical for very long sequences. Compressive Transformer addresses this by compressing old states instead of discarding them: c tokens become 1 token in the long-term buffer. The result is a tiered memory hierarchy (fresh → short-term → compressed) whose total size stays fixed at T + M + M_c slots.
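The trade-off can be checked with simple arithmetic. The sketch below counts, for one layer, how many key/value slots attention touches and how many original tokens those slots summarise. The helper names and the sizes (512, 512, 512, c = 3) are illustrative assumptions, not figures from the source.

```python
def attended_slots(seg_len: int, mem_len: int, cmem_len: int) -> int:
    """Key/value slots visible to one attention layer."""
    return seg_len + mem_len + cmem_len

def covered_tokens(seg_len: int, mem_len: int, cmem_len: int, c: int) -> int:
    """Original tokens summarised by those slots (each M_c slot stands for c tokens)."""
    return seg_len + mem_len + c * cmem_len

# Illustrative sizes only (assumed for this example).
seg_len, mem_len, cmem_len, c = 512, 512, 512, 3
slots = attended_slots(seg_len, mem_len, cmem_len)     # 512 + 512 + 512 = 1536
span = covered_tokens(seg_len, mem_len, cmem_len, c)   # 512 + 512 + 3 * 512 = 2560
print(slots, span)
```

With the same 1536 slots, an uncompressed memory would cover 1536 tokens per layer, while the compressed variant covers 2560. Only the M_c part of the buffer is stretched by the factor c.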

Components

Short-term memory (M, FIFO)

First memory hierarchy tier: fresh, uncompressed contexts

FIFO buffer holding hidden states of the last N segments. Identical semantics to Transformer-XL — the difference is that INSTEAD of being discarded, the oldest segment goes to the compressor.

IN: Hidden states from N previous segments.

OUT: Keys/values available to current attention.

Compression function (f_compress)

Bridge between short-term and long-term memory

Function mapping c consecutive hidden states to one. Invoked when the oldest segment is about to be evicted from M. Can be learned (1D conv) or deterministic (pooling, most-used).

IN: c consecutive hidden states to compress.

OUT: Compressed representation.

1D conv (best): Learned convolution with kernel size = c, stride = c. Best empirical results in the paper.

Dilated conv: Dilated convolution, giving a wider receptive field at the same parameter count.

Mean / max pooling: No parameters; a simple baseline.

Most-used selection: Keeps the tokens with the highest cumulative attention from previous queries and drops the rest. Interpretable.
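To make the variants concrete, here is a hedged NumPy sketch of four of them (the dilated convolution is omitted). The function names, the weight layout `(c, d_in, d_out)` and the `usage` vector are assumptions made for illustration. Because kernel size equals stride, the convolution reduces to a linear map over non-overlapping windows of c states.

```python
import numpy as np

def _windows(states: np.ndarray, c: int) -> np.ndarray:
    """Split (n, d) into non-overlapping windows of c rows: (n // c, c, d)."""
    n, d = states.shape
    return states[: (n // c) * c].reshape(n // c, c, d)

def mean_pool(states, c):
    return _windows(states, c).mean(axis=1)

def max_pool(states, c):
    return _windows(states, c).max(axis=1)

def strided_conv(states, weight, bias, c):
    """1D conv with kernel size c and stride c.
    weight: (c, d_in, d_out), bias: (d_out,). Learned in practice."""
    return np.einsum("wkd,kde->we", _windows(states, c), weight) + bias

def most_used(states, usage, keep):
    """Keep the `keep` rows with the highest usage score, in temporal order.
    `usage` is assumed to hold cumulative attention per row; keep must be > 0."""
    idx = np.sort(np.argsort(usage)[-keep:])
    return states[idx]

x = np.arange(12, dtype=float).reshape(6, 2)   # 6 hidden states, d = 2
print(mean_pool(x, 3))   # rows [2, 3] and [8, 9]
print(max_pool(x, 3))    # rows [4, 5] and [10, 11]
```

A convolution whose weights are `identity / c` at every kernel position reproduces mean pooling exactly, which is one way to see pooling as a special case of the learned compressor.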


Long-term compressed memory (M_c)

Second (oldest) memory hierarchy tier: distant context in compressed form

A second FIFO buffer holding M_c compressed tokens. Each represents c original tokens. Together with M it forms a memory hierarchy: fresh → short-term → compressed.

IN: Compressed representations of old segments.

OUT: Keys/values available to current attention.
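How both buffers feed attention can be sketched as follows. This is a simplified single-head version under stated assumptions: no relative positional encoding, one head, and hypothetical names (`attend_with_memories`, `w_q`, `w_k`, `w_v`). Keys and values come from the concatenation [M_c, M, current segment]; queries come from the current segment only.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend_with_memories(h, memory, compressed_memory, w_q, w_k, w_v):
    """h: (T, d) current segment; memory: (M, d); compressed_memory: (M_c, d)."""
    context = np.concatenate([compressed_memory, memory, h], axis=0)
    q, k, v = h @ w_q, context @ w_k, context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Every memory slot is in the past, so it is always visible.
    # Inside the current segment, position i may only see positions <= i.
    t, n_past = h.shape[0], context.shape[0] - h.shape[0]
    future = np.triu(np.ones((t, t), dtype=bool), k=1)
    mask = np.concatenate([np.zeros((t, n_past), dtype=bool), future], axis=1)
    scores = np.where(mask, -np.inf, scores)
    return softmax(scores) @ v

rng = np.random.default_rng(0)
d = 8
h = rng.normal(size=(4, d))
mem, cmem = rng.normal(size=(6, d)), rng.normal(size=(2, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
print(attend_with_memories(h, mem, cmem, w_q, w_k, w_v).shape)
```

The attention cost per segment scales with T × (M_c + M + T), so adding M_c slots costs the same as adding uncompressed slots while covering c times more history.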

Implementation

Reference implementations

lucidrains/compressive-transformer-pytorch

Python (PyTorch) · Phil Wang (lucidrains) — community

DeepMind PG-19 dataset

Python · DeepMind · official

Implementation pitfalls

Missing attention-reconstruction loss with learned compression (severity: High)

Learned compression (1D conv) trained without the attention-reconstruction auxiliary loss tends to degenerate into an uninformative mapping: the model learns that it is easier to ignore compressed tokens than to use them.

Fix: Always apply the attention-reconstruction loss, with a weight on the order of 0.1–1.0 relative to the main cross-entropy loss.
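A minimal NumPy sketch of such an auxiliary loss is shown here, assuming single-head attention without positional terms and a mean-squared-error distance. The function names and the default weight of 0.1 are illustrative choices within the range quoted above, not values confirmed by the source.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(h, mem, w_q, w_k, w_v):
    q, k, v = h @ w_q, mem @ w_k, mem @ w_v
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def attention_reconstruction_loss(h, old_mem, new_cmem, w_q, w_k, w_v):
    """Distance between attending over the states about to be evicted
    and attending over their compressed replacement, for the same queries h."""
    target = attention(h, old_mem, w_q, w_k, w_v)
    approx = attention(h, new_cmem, w_q, w_k, w_v)
    return float(np.mean((target - approx) ** 2))

def total_loss(cross_entropy, aux, aux_weight=0.1):
    # aux_weight is a tunable assumption; the text suggests roughly 0.1 to 1.0.
    return cross_entropy + aux_weight * aux

rng = np.random.default_rng(0)
d, c = 8, 3
h = rng.normal(size=(4, d))
old_mem = rng.normal(size=(6, d))
new_cmem = old_mem.reshape(2, c, d).mean(axis=1)   # mean-pooled stand-in compressor
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
aux = attention_reconstruction_loss(h, old_mem, new_cmem, w_q, w_k, w_v)
print(total_loss(cross_entropy=3.2, aux=aux))
```

The loss is zero when the compressed memory yields exactly the same attention output as the original states, and grows as the compressor loses information that the queries rely on.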

Backprop through compression function without stop-gradient on memory (severity: High)

Full backprop through all compression steps (e.g. 100 segments back) is memory-infeasible and unstable.

Fix: Apply stop-gradient on M_c after each compression, so that the gradient propagates only through the most recent compression operation and not through the entire history.
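In PyTorch, the stop-gradient can be expressed with `Tensor.detach()`. The sketch below is one possible arrangement, not the reference implementation: `ConvCompressor` and `append_compressed` are hypothetical names, and the compressor is a plain `nn.Conv1d` with kernel size and stride both equal to c.

```python
import torch
from torch import nn

class ConvCompressor(nn.Module):
    """Learned compressor: Conv1d with kernel size c and stride c."""
    def __init__(self, d_model: int, c: int):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=c, stride=c)

    def forward(self, states: torch.Tensor) -> torch.Tensor:
        x = states.T.unsqueeze(0)            # (n, d) -> (1, d, n), as Conv1d expects
        return self.conv(x).squeeze(0).T     # (1, d, n // c) -> (n // c, d)

def append_compressed(compressed_memory, evicted, compressor):
    """Compress the evicted states and append them to M_c.
    The existing M_c is detached, so gradients reach only the newest slots."""
    new_slots = compressor(evicted)
    return torch.cat([compressed_memory.detach(), new_slots], dim=0)

torch.manual_seed(0)
compressor = ConvCompressor(d_model=4, c=3)
cmem = torch.zeros(0, 4)
x1 = torch.randn(6, 4, requires_grad=True)   # older evicted segment
x2 = torch.randn(6, 4, requires_grad=True)   # most recent evicted segment
cmem = append_compressed(cmem, x1, compressor)
cmem = append_compressed(cmem, x2, compressor)
cmem.sum().backward()
print(x1.grad is None, x2.grad is not None)
```

After the backward pass, only the most recently compressed segment receives a gradient. The older slots still contribute their values to attention, but the graph that produced them is no longer retained.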

Wrong positional offset for compressed tokens (severity: Medium)

A compressed token represents c originals. If it receives the same relative PE as a regular token, the model treats neighbouring compressed tokens as immediately consecutive in time, which confuses attention.

Fix: Apply a positional offset proportional to compression rate c; see appendix A.3 of the original paper.
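The exact formula is not given here, so the following is only one hypothetical scheme consistent with "an offset proportional to c": each compressed slot is assigned the distance of the centre of the c-token block it summarises. The function name and the centre-of-block choice are assumptions for illustration.

```python
def slot_distances(mem_len: int, cmem_len: int, c: int) -> list:
    """Token distance from the first position of the current segment to each
    past slot, ordered oldest first: [M_c slots..., M slots...]."""
    # Uncompressed memory: the newest slot is 1 token away, the oldest mem_len away.
    mem = [mem_len - i for i in range(mem_len)]
    cmem = []
    for j in range(cmem_len):
        blocks_newer = cmem_len - 1 - j                # compressed blocks closer to the present
        nearest = mem_len + blocks_newer * c + 1       # closest original token in this block
        cmem.append(nearest + (c - 1) / 2)             # centre of the c-token block (assumed)
    return cmem + mem

# M = 4 slots, M_c = 2 slots, c = 3.
print(slot_distances(4, 2, 3))   # [9.0, 6.0, 4, 3, 2, 1]
```

Treating the compressed slots as ordinary tokens would give distances 6 and 5 instead of 9 and 6, understating how far back they reach. The gap grows linearly with both c and the number of compressed slots.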

Evolution

Original paper · 2019 · ICLR 2020 (DeepMind) · Jack W. Rae

Compressive Transformers for Long-Range Sequence Modelling

Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Chloe Hillier, Timothy P. Lillicrap

2019

Transformer-XL — segment recurrence + relative PE

Dai et al. introduce hidden-states cache and relative PE. The direct foundation of Compressive Transformer — single-tier FIFO memory.

Transformer-XL (concept)

2019

Sparse Transformer (Child et al., OpenAI) — parallel work

An independent long-context path: deterministic sparse patterns. Compressive Transformer and Sparse Transformer emerged in the same year as parallel answers to the same problem.

Sparse Transformer (concept)

2019

Compressive Transformer — DeepMind paper

Inflection point

Rae, Potapenko, Jayakumar, Hillier, Lillicrap publish Compressive Transformer (arXiv:1911.05507). Two-tier memory: M (FIFO) + M_c (compressed). They also introduce the PG-19 dataset — the first systematic long-range LM benchmark on books.

Compressive Transformers for Long-Range Sequence Modelling (paper)

2020

Longformer / BigBird — sparse attention wins popularity

Longformer and BigBird (both 2020) offer a simpler route to long context, with no segment-level recurrence and no compressors. Compressive Transformer remains theoretically important but is less frequently deployed in production.

2022

Memorizing Transformers (Wu et al., Google)

Extension of the compressed-memory idea to UNBOUNDED memory — kNN lookup in a huge external hidden-states database. A direct heir to Compressive Transformer.

2024

Infini-attention (Google) and SSM-attention hybrids

Google publishes Infini-attention, with compressed memory built directly into the attention layer without a separate buffer. Mamba and RWKV in turn realise compression via a fixed-size recurrent state (an SSM hidden state in Mamba's case). All these approaches are conceptually close to Compressive Transformer.
