Transformer-XL comprises two fundamental mechanisms: (1) Segment-level recurrence — when processing segment τ+1 the model receives as input the original tokens of segment τ+1 AND the hidden states from each layer of the previous segment τ (frozen, with stop-gradient). The segment-τ hidden states form an extended "memory" set of keys/values that new segment-τ+1 queries can attend to. Effective context grows from T (segment length) to T·N (where N is the number of retained previous segments) — at O(T²) attention cost for the new segment (instead of O((T·N)²)). (2) Relative positional encoding — the authors derive a special attention score form: A_ij = Q_i·K_j + Q_i·R_{i-j} + u·K_j + v·R_{i-j}, where R are learned embeddings of RELATIVE distances i-j (sinusoidal, but added like a^K embeddings from RPR). The four-term decomposition isolates positional from content effects, and R does not depend on absolute segment position — making the recurrence consistent. Implementation-wise, segment-τ hidden states are kept in a GPU memory buffer; with each new segment the oldest are overwritten (FIFO).
A standard autoregressive Transformer splits long text into fixed-length segments (e.g. 512 tokens) and treats each independently — leading to two problems: (1) "context fragmentation" — the first tokens of a segment have no context from the previous one, (2) maximum effective context is bounded by a single segment length. Naively extending to longer segments grows attention memory quadratically. Transformer-XL solves this: instead of lengthening the segment, it adds inter-segment recurrence with cached states. The flip side is broken absolute positional encoding — the token at position 0 of the new segment and the token at position 0 of the old segment share the same absolute position, confusing attention. Hence the paper introduces a relative PE specifically tailored to the recurrence.
FIFO buffer holding hidden states from N previous segments for each layer. Stop-gradient isolates it from backprop. Loaded as keys/values for the new segment.
Official
Four-term decomposition of the attention score with two learned vectors u, v and relative-distance embeddings R. Necessary for segment recurrence to remain positionally consistent.
Official
An implementation without stop-gradient on cached states is unstable — the gradient propagates many segments back, which explodes memory and can cause training divergence.
Combining standard absolute PE (sinusoidal/learned) with segment recurrence confuses attention — tokens at position 0 of the new and old segments share identical PE. Relative PE is required.
For M ≪ T the cache adds no meaningful context extension — the effect is comparable to baseline. Optimal M ≈ T to M ≈ 5·T.
The original Transformer splits text into fixed-length segments and treats them independently, creating the context fragmentation problem.
RPR shows that position can be modelled as distance instead of as an absolute index. Direct precursor of Transformer-XL's relative PE.
Dai, Yang, Yang, Carbonell, Le, Salakhutdinov publish Transformer-XL at ACL 2019. They introduce segment-level recurrence and a new four-term relative PE form. SOTA on enwik8, text8, WikiText-103, One Billion Word.
XLNet (with significant author overlap) uses Transformer-XL as its backbone and adds permutation language modelling. One of the most prominent post-BERT breakthroughs.
DeepMind extends Transformer-XL with COMPRESSIBLE memory — older hidden states are compressed (rather than discarded), extending effective context multifold. A direct successor.
After Sparse Transformer (2019), Longformer/BigBird (2020), and RoPE (2021) appeared, the Transformer-XL recurrent approach became rarer in new large LLMs — most models choose a longer window + sparse/RoPE over recurrence + relative PE.
The idea of hidden "memory" states passed between sequence steps returns in SSM architectures (Mamba) and RWKV — implementation differs from Transformer-XL's cached hidden states, but the "recurrent memory alongside attention" intuition comes straight from 2019.
Time complexity: O(T · (T + M) · d) per segment. Space complexity: O(M · d · L) cache + O(T · d · L) bieżąca aktywacja.
The main bottleneck is sequential segment processing — segment τ+1 requires segment τ to finish. This prevents trivial parallelisation along the sequence dimension during training.
Number of tokens processed in one forward pass. The paper tests T=64–640 across various benchmarks.
Number of previous tokens (from earlier segments) kept in cache and visible to the current segment. M=T·N where N is the number of retained segments.
Four-term decomposition: score = (Q_i + u)·K_j + (Q_i + v)·R_{i-j} with two learned vectors u, v. Sinusoidal vs learned R affects extrapolation.
Whether backprop propagates through the previous segment's hidden states. The paper always uses stop-gradient — full backprop through the cache is unstable and expensive.
Attention within a single segment is dense (every token vs every current + cached one). Recurrence introduces conditionality at the segment level, not at the single-token level.
No learned routing — segment recurrence is a deterministic FIFO memory buffer policy.
A conscious trade-off: we sacrifice perfect parallelism across the segment stream for long effective context at constant attention memory budget.
Standard dense attention within a segment maps well onto GPUs. The hidden-states cache requires extra VRAM proportional to M·d·layers.
The original One Billion Word experiment was trained on TPU — segment recurrence pairs well with TPU pipelining.
The algorithm is standard attention + a memory buffer — works on any hardware. Real efficiency requires GPU/TPU with sufficient VRAM.