Compressive Transformer extends the Transformer-XL memory loop with a third hierarchy level: (1) Current segment (T tokens) — full attention, queries+keys+values active. (2) Short-term memory M (M tokens) — hidden states of the last N segments, as in XL, acting as keys/values for the current segment. (3) Long-term compressed memory M_c (M_c tokens) — when the oldest segment is about to be evicted from the M buffer, instead of being discarded it is compressed by a function f_compress: R^{c×d} → R^{1×d}, where c is the compression rate (typically 3 or 4). Compressed tokens land in the M_c buffer, which also acts as keys/values for the current attention. f_compress functions tested in the paper: (a) 1D mean pooling — average of c consecutive vectors, (b) 1D max pooling — pointwise max, (c) 1D conv (kernel size c, stride c) — learned, best empirical results, (d) dilated conv — wider receptive field, (e) most-used — keep the c tokens with highest cumulative attention from previous queries. Training the compressors: besides standard cross-entropy, the authors introduce an attention-reconstruction loss — original hidden states and their compressed counterparts should produce similar attention patterns for retained queries. This further motivates the compressor to preserve information important for attention. Position: relative PE as in Transformer-XL, but compressed tokens receive a special positional offset proportional to c.
Transformer-XL keeps hidden states of the last N segments in a FIFO buffer and discards older ones. Simple but wasteful — discarded information is irretrievably lost, even though for tasks like book modelling (PG-19) distant references are crucial. Naively increasing M (memory length) scales VRAM linearly with M, impractical for very long sequences. Compressive Transformer solves this: instead of discarding, COMPRESSES — c tokens become 1 token in the long-term buffer. This yields a logarithmic memory hierarchy (fresh → short-term → compressed) at constant total memory cost.
FIFO buffer holding hidden states of the last N segments. Identical semantics to Transformer-XL — the difference is that INSTEAD of being discarded, the oldest segment goes to the compressor.
Function mapping c consecutive hidden states to one. Invoked when the oldest segment is about to be evicted from M. Can be learned (1D conv) or deterministic (pooling, most-used).
Official
A second FIFO buffer holding M_c compressed tokens. Each represents c original tokens. Together with M it forms a memory hierarchy: fresh → short-term → compressed.
Learned compression (1D conv) without the attention-reconstruction auxiliary loss degenerates to identity — the model learns it's easier to ignore compressed tokens than to use them.
Full backprop through all compression steps (e.g. 100 segments back) is memory-infeasible and unstable.
A compressed token represents c originals — if we use relative PE as for regular tokens, the model thinks they are immediately consecutive, confusing attention.
Dai et al. introduce hidden-states cache and relative PE. The direct foundation of Compressive Transformer — single-tier FIFO memory.
An independent long-context path: deterministic sparse patterns. Compressive Transformer and Sparse Transformer emerged in the same year as parallel answers to the same problem.
Rae, Potapenko, Jayakumar, Hillier, Lillicrap publish Compressive Transformer (arXiv:1911.05507). Two-tier memory: M (FIFO) + M_c (compressed). They also introduce the PG-19 dataset — the first systematic long-range LM benchmark on books.
Longformer and BigBird (both 2020) offer simpler long-context without sequentiality and compressors. Compressive Transformer remains theoretically important but less frequently deployed in production.
Extension of the compressed-memory idea to UNBOUNDED memory — kNN lookup in a huge external hidden-states database. A direct heir to Compressive Transformer.
Google publishes Infini-attention — compressed memory built directly into the attention layer without a separate buffer. Mamba and RWKV in turn realise compression via SSM hidden state. All these approaches are conceptually close to Compressive Transformer.
Time complexity: O(T · (T + M + M_c) · d) per segment + O(c · M · d²) per kompresję (rzadko). Space complexity: O((M + M_c) · d · L) cache + O(T · d · L) bieżąca aktywacja.
The training bottleneck is the extra compressor forward pass and auxiliary loss computation. Inference is cheaper — compression runs only once every N segments, and ultimately M_c is ready and merely served as keys/values.
Number of original tokens represented by 1 compressed token. The paper tests c=2, 3, 4. Larger c = longer effective context, but lower precision of preserved information.
Number of compressed tokens in long-term memory. Effective backward reach: M + c·M_c. The paper uses M=512–1024, M_c=512–1024, c=3 — giving effective context on the order of 4000–5000 tokens.
Choice of mechanism compressing c tokens to 1. The paper tests 5 variants: mean pool, max pool, 1D conv, dilated conv, most-used. Best empirically: 1D conv with attention-reconstruction loss.
Auxiliary loss forcing attention computed on compressed keys/values to resemble attention on the originals. Critical for the quality of compression based on learned functions (1D conv, dilated conv).
The "most-used" variant introduces soft selectivity — tokens with highest attention from previous queries are preserved preferentially. A simple form of learning what information is worth keeping.
No learned routing between memory tiers. All keys/values (current + M + M_c) enter the same attention.
Training the compressor requires one forward pass on M tokens every k-th segment + an additional attention-reconstruction loss — overhead on the order of 5–15% over Transformer-XL.
The original DeepMind experiments were trained on TPU. The 1D conv compressor maps perfectly onto the TPU systolic array.
Standard attention + 1D conv compressor work well on GPU. Two-tier memory requires extra VRAM (M·d + M_c·d) — small overhead on modern GPUs.
The algorithm is standard attention + additional compression kernels — portable, but real efficiency requires GPU/TPU.