Architecture

Transformer-XL

2019 · Active · Published: 9 June 2026 · Updated: 9 June 2026

Key innovation

Introduces segment-level recurrence: hidden states from the previous segment are cached (with stop-gradient) and reused as extended context for the next segment. This is combined with a relative positional encoding, which is required so that tokens at the same position in different segments stay distinguishable. It was the first effective route to long context in autoregressive Transformers beyond a fixed-length window.

How it works

Transformer-XL combines two mechanisms.

(1) Segment-level recurrence. When processing segment τ+1, each layer receives the hidden states of the current segment together with the hidden states that the layer below produced for the previous segment τ (frozen, with stop-gradient). The cached segment-τ states form an extended "memory" of keys/values that the queries of segment τ+1 can attend to. With N retained previous segments, i.e. a memory of M = N·T states for segment length T, each query sees up to T + M keys instead of T, and because every layer reads from the cache of the layer below, the longest possible dependency grows further with depth. Attention cost for the new segment is O(T·(T+M)), since only the T new positions issue queries, instead of the O((T+M)²) needed to recompute the whole window.

(2) Relative positional encoding. The authors derive a four-term attention score: A_ij = Q_i·K_j + Q_i·R_{i-j} + u·K_j + v·R_{i-j}. Here R_{i-j} is a fixed sinusoidal encoding of the RELATIVE distance i-j, passed through a learned projection (it enters the score on the key side, like the a^K embeddings from RPR), and u and v are learned vectors shared by all query positions. The four-term decomposition isolates positional from content effects, and R does not depend on absolute segment position, which keeps the recurrence consistent.

Implementation-wise, segment-τ hidden states are kept in a GPU memory buffer; with each new segment the oldest are overwritten (FIFO).
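
The segment-level recurrence described above can be sketched in plain Python. This is a toy illustration, not the paper's model: `toy_layer` is a hypothetical stand-in that replaces attention with a causal average, hidden states are single floats, and the names `forward_segment` and `run_stream` are assumptions made for this example. It only shows how each layer reads a cache of the states that fed it during earlier segments.

```python
from typing import List, Tuple


def toy_layer(current: List[float], memory: List[float]) -> List[float]:
    """Stand-in for an attention layer (assumption): position i averages the
    cached states plus the current-segment states up to and including i."""
    out = []
    for i in range(len(current)):
        context = memory + current[: i + 1]
        out.append(sum(context) / len(context))
    return out


def forward_segment(
    segment: List[float], mems: List[List[float]], mem_len: int
) -> Tuple[List[float], List[List[float]]]:
    """Run one segment through all layers; mems[n] caches the inputs of layer n."""
    hidden = list(segment)
    new_mems = []
    for layer_mem in mems:
        # Cache the input of this layer, keeping only the newest mem_len states.
        # Plain floats carry no gradients; an autodiff framework would detach here.
        combined = layer_mem + hidden
        new_mems.append(combined[-mem_len:] if mem_len > 0 else [])
        hidden = toy_layer(hidden, layer_mem)
    return hidden, new_mems


def run_stream(
    stream: List[float], seg_len: int, mem_len: int, n_layers: int = 2
) -> List[List[float]]:
    """Process a long stream segment by segment, carrying the cache forward."""
    mems: List[List[float]] = [[] for _ in range(n_layers)]
    outputs = []
    for start in range(0, len(stream), seg_len):
        out, mems = forward_segment(stream[start : start + seg_len], mems, mem_len)
        outputs.append(out)
    return outputs


stream = [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0]
with_memory = run_stream(stream, seg_len=4, mem_len=4)
without_memory = run_stream(stream, seg_len=4, mem_len=0)
print(with_memory[1][0], without_memory[1][0])
```

With `mem_len=0` the first token of the second segment sees only itself, which is the context fragmentation case; with a cache it also mixes in states from the first segment.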

Problem solved

A standard autoregressive Transformer splits long text into fixed-length segments (e.g. 512 tokens) and treats each independently — leading to two problems: (1) "context fragmentation" — the first tokens of a segment have no context from the previous one, (2) maximum effective context is bounded by a single segment length. Naively extending to longer segments grows attention memory quadratically. Transformer-XL solves this: instead of lengthening the segment, it adds inter-segment recurrence with cached states. The flip side is broken absolute positional encoding — the token at position 0 of the new segment and the token at position 0 of the old segment share the same absolute position, confusing attention. Hence the paper introduces a relative PE specifically tailored to the recurrence.
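
The positional collision mentioned above can be made concrete with a small sketch. It assumes a cache holding exactly one previous segment of the same length; the helper names `absolute_key_positions` and `relative_distances` are hypothetical and exist only for this illustration.

```python
from typing import List


def absolute_key_positions(mem_len: int, seg_len: int) -> List[int]:
    """Absolute positions restart at 0 in every segment, so cached keys and
    new keys carry overlapping position indices."""
    return list(range(mem_len)) + list(range(seg_len))


def relative_distances(query_local_idx: int, mem_len: int) -> List[int]:
    """Distances i - j from one query to every key it may attend to
    (all cached keys plus current-segment keys up to the query itself)."""
    query_global = mem_len + query_local_idx
    return [query_global - j for j in range(query_global + 1)]


abs_pos = absolute_key_positions(mem_len=4, seg_len=4)
rel = relative_distances(query_local_idx=3, mem_len=4)
print(abs_pos)  # [0, 1, 2, 3, 0, 1, 2, 3]
print(rel)      # [7, 6, 5, 4, 3, 2, 1, 0]
```

Every absolute index appears twice among the keys, while each relative distance is unique, which is why the recurrence needs a distance-based encoding.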

Components

Hidden states cache (memory buffer)

Extends the effective attention context without increasing segment length.

FIFO buffer holding hidden states from N previous segments for each layer. Stop-gradient isolates it from backprop. Loaded as keys/values for the new segment.

IN: Hidden states from previous segments, preserved with detached gradient.

OUT: Concatenation of cache and new segment used as keys/values.

FIFO cache (canonical Transformer-XL): Oldest segments are evicted when the buffer fills up.

Compressed cache (Compressive Transformer): Older segments are compressed (e.g. via 1D conv) instead of discarded (DeepMind, 2019).
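
A minimal sketch of the FIFO variant, assuming one bounded buffer per layer built on `collections.deque`. The class name `SegmentMemory` and its methods are hypothetical, and hidden states are plain lists rather than tensors, so the stop-gradient step is only noted in a comment.

```python
from collections import deque
from typing import Deque, List


class SegmentMemory:
    """Per-layer FIFO cache of hidden states from the last few segments."""

    def __init__(self, n_layers: int, max_segments: int) -> None:
        # deque(maxlen=...) silently drops the oldest entry when full.
        self.buffers: List[Deque[List[float]]] = [
            deque(maxlen=max_segments) for _ in range(n_layers)
        ]

    def push(self, layer_states: List[List[float]]) -> None:
        """Store one segment's hidden states, one list per layer."""
        for buf, states in zip(self.buffers, layer_states):
            # Copy the values; a real framework would also detach them here.
            buf.append(list(states))

    def keys_values(self, layer: int, current: List[float]) -> List[float]:
        """Concatenate cached states (oldest first) with the new segment."""
        cached = [h for segment in self.buffers[layer] for h in segment]
        return cached + list(current)


memory = SegmentMemory(n_layers=2, max_segments=2)
memory.push([[1.0, 2.0], [3.0, 4.0]])
memory.push([[5.0, 6.0], [7.0, 8.0]])
memory.push([[9.0, 10.0], [11.0, 12.0]])  # evicts the first segment
print(memory.keys_values(0, [13.0, 14.0]))
```

A compressed variant would replace the eviction step with a function that shrinks the oldest segment instead of dropping it.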

Relative positional encoding (4-term form)

Positional consistency across segments and content/distance isolation.

Four-term decomposition of the attention score with two learned vectors u, v and relative-distance embeddings R. Necessary for segment recurrence to remain positionally consistent.

IN: Distance matrix between new-segment tokens and all keys (cache + new).

OUT: Attention logit matrix for the new segment.
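
The four-term score for a single query/key pair can be sketched in plain Python. This is an illustration under simplifying assumptions: the query, key, `u`, and `v` vectors are passed in directly instead of coming from learned projections, the relative encoding is used without its own projection, and the function names are hypothetical.

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def sinusoid(distance: int, dim: int) -> List[float]:
    """Fixed sinusoidal encoding of a relative distance (sin half, then cos half)."""
    half = dim // 2
    freqs = [1.0 / (10000 ** (2 * k / dim)) for k in range(half)]
    return [math.sin(distance * f) for f in freqs] + [
        math.cos(distance * f) for f in freqs
    ]


def rel_attention_score(
    q_i: List[float], k_j: List[float], r_ij: List[float],
    u: List[float], v: List[float],
) -> float:
    """A_ij = Q_i·K_j + Q_i·R_{i-j} + u·K_j + v·R_{i-j}."""
    content_content = dot(q_i, k_j)    # how well the query matches the key content
    content_position = dot(q_i, r_ij)  # query-dependent preference for this distance
    global_content = dot(u, k_j)       # query-independent content bias
    global_position = dot(v, r_ij)     # query-independent distance bias
    return content_content + content_position + global_content + global_position


q = [1.0, 0.5, -0.5, 2.0]
k = [0.2, -1.0, 0.3, 0.7]
u = [0.1, 0.1, 0.1, 0.1]
v = [-0.2, 0.0, 0.2, 0.4]
r = sinusoid(distance=3, dim=4)  # same vector for any (i, j) with i - j == 3
score = rel_attention_score(q, k, r, u, v)
print(score)
```

Grouping the terms gives the equivalent form (Q_i + u)·K_j + (Q_i + v)·R_{i-j}, which needs only two dot products per pair.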

Implementation

Reference implementations

kimiyoung/transformer-xl (official repo)

Python (PyTorch / TensorFlow) · Carnegie Mellon University (Dai et al.)

Hugging Face Transformers — TransfoXLModel

Python (PyTorch) · Hugging Face

Implementation pitfalls

Gradient leak through cached hidden states · High

An implementation without stop-gradient on cached states is unstable — the gradient propagates many segments back, which explodes memory and can cause training divergence.

Fix: Always apply `tensor.detach()` (PyTorch) or `tf.stop_gradient` on segment-τ hidden states before reusing them in segment τ+1.

Using absolute PE with segment recurrence · Critical

Combining standard absolute PE (sinusoidal/learned) with segment recurrence confuses attention — tokens at position 0 of the new and old segments share identical PE. Relative PE is required.

Fix: Use the four-term relative PE from the Transformer-XL paper or another relative scheme (e.g. RoPE, T5 relative position bias).

Too small M (memory length) vs T · Medium

For M ≪ T the cache adds no meaningful context extension — the effect is comparable to baseline. Optimal M ≈ T to M ≈ 5·T.

Fix: Set M at least equal to T, ideally 2–5× T per the paper's ablations.
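
To see how the choice of M changes what each query can reach and what it costs, a small arithmetic sketch may help. The function `attention_cost` is a hypothetical helper; it counts score-matrix entries per layer and ignores causal masking and constant factors.

```python
from typing import Dict


def attention_cost(seg_len: int, mem_len: int) -> Dict[str, int]:
    """Rough per-layer counts for one new segment of length seg_len."""
    span = seg_len + mem_len
    return {
        "keys_per_query": span,                # what each new token can attend to
        "scores_with_cache": seg_len * span,   # only new tokens issue queries
        "scores_full_recompute": span * span,  # recomputing the whole window
    }


for mem_len in (0, 32, 512, 2560):
    print(mem_len, attention_cost(seg_len=512, mem_len=mem_len))
```

For T = 512, a cache of M = 32 widens the span only from 512 to 544 keys, while M = 512 doubles it to 1024 at half the score count of recomputing a 1024-token window.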

Evolution

Original paper · 2019 · ACL 2019 (Carnegie Mellon University + Google Brain) · Zihang Dai

Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov

2017

Transformer (Vaswani et al.) — fixed-length context

Language models built on the original Transformer split text into fixed-length segments and process them independently, which creates the context fragmentation problem.

Transformer (concept)

2018

Relative Position Representations (Shaw et al.)

RPR shows that position can be modelled as distance instead of as an absolute index. Direct precursor of Transformer-XL's relative PE.

RPR (concept)

2019

Transformer-XL — CMU + Google Brain paper

Inflection point

Dai, Yang, Yang, Carbonell, Le, Salakhutdinov publish Transformer-XL at ACL 2019. They introduce segment-level recurrence and a new four-term relative PE form. SOTA on enwik8, text8, WikiText-103, One Billion Word.

Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context (paper)

2019

XLNet (Yang et al.) — Transformer-XL as backbone

XLNet (with significant author overlap) uses Transformer-XL as its backbone and adds permutation language modelling. One of the most prominent post-BERT breakthroughs.

2019

Compressive Transformer (Rae et al., DeepMind)

DeepMind extends Transformer-XL with COMPRESSIBLE memory — older hidden states are compressed (rather than discarded), extending effective context multifold. A direct successor.

2020

Decline in favour of sparse attention and RoPE

After Sparse Transformer (2019), Longformer/BigBird (2020), and RoPE (2021) appeared, the Transformer-XL recurrent approach became rarer in new large LLMs — most models choose a longer window + sparse/RoPE over recurrence + relative PE.

2024

Return in SSM hybrids (Mamba, RWKV)

The idea of hidden "memory" states passed between sequence steps returns in SSM architectures (Mamba) and RWKV — implementation differs from Transformer-XL's cached hidden states, but the "recurrent memory alongside attention" intuition comes straight from 2019.
