Do Language Models Need Sleep? Offline Recurrence and Memory Consolidation

Transformers are the default backbone of large language models, but their attention mechanism scales poorly: compute grows quadratically with context length and cache memory grows linearly. Four researchers from Carnegie Mellon University and the University of Maryland propose an unusual, biology-inspired fix: let the model "sleep". During sleep the model repeatedly processes the accumulated context and writes it into persistent fast weights before clearing its attention cache.

Key takeaways

• "LLM sleep" is a consolidation phase: the model performs N offline recurrent forward passes over the accumulated context and updates the fast weights inside its state-space model (SSM) blocks before evicting the key-value (KV) cache.

• All the extra computation is shifted into the sleep phase. Prediction (wake time) stays a single forward pass, so response latency is unchanged.

• Core insight: the bottleneck in SSM-attention hybrids is not memory capacity but the amount of computation available to turn evicted context into a useful internal state.

• Longer sleep (larger N) improves results on synthetic tasks (the Rule 110 cellular automaton and the Depo multi-hop graph retrieval task) and on the realistic GSM-Infinite math benchmark. The largest gains appear where deeper reasoning is required.

• The method has a price: training requires forward and backward passes that are N times deeper, which can be slow and unstable.

What is "LLM sleep"?

Modern large language models rely on the transformer architecture, which stores context in an attention cache (the KV cache) and retrieves past tokens on demand. This yields high quality but scales badly: total attention compute grows quadratically with context length, and the cache itself grows linearly.
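The scaling claim can be made concrete with a small count. The sketch below is only an illustration: the function names are hypothetical, and it counts query-key pairs and cached entries for a single causal attention head rather than measuring any real model.

```python
def causal_attention_pairs(n_tokens: int) -> int:
    """Number of (query, key) pairs scored by causal attention over n tokens."""
    # Token i attends to itself and to every earlier token: 1 + 2 + ... + n.
    return n_tokens * (n_tokens + 1) // 2


def kv_cache_entries(n_tokens: int) -> int:
    """Number of cached key/value entries for one head: one per token."""
    return n_tokens


for n in (1_000, 2_000, 4_000):
    print(n, causal_attention_pairs(n), kv_cache_entries(n))
```

Doubling the context roughly quadruples the number of scored pairs while the cache only doubles, which is the gap the rest of the article is concerned with.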

Hybrid models interleave attention layers with recurrent (SSM) layers that store the past in a fixed-size "fast weight" memory. The authors ask whether such memory is enough to reason about content the model can no longer see inside its context window. Their answer: capacity alone is not enough — extra computation is needed to transform stored context into a state useful for later inference.

"LLM sleep" is a memory-consolidation mechanism. When the context window fills up, the model enters a sleep phase: it receives no new tokens but processes the accumulated context N times, recursively updating its fast weights via a learned local rule. Only after consolidation does it clear the attention cache and resume work. The inspiration is neuroscience: in animals, the transfer from short-term to long-term memory is linked to hippocampal replay, especially during sleep.

How it works

The starting point is an SSM-attention hybrid with a fixed context window of size L. The sequence is split into non-overlapping chunks of at most L tokens, and after each chunk the KV cache is fully evicted (hard eviction). This naturally divides processing into two phases: a consolidation phase (the model must encode context into fast weights) and a prediction phase (the model predicts the answer).
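The chunking step described above can be sketched in a few lines. This is a minimal illustration, assuming tokens are plain integers; `split_into_chunks` and `window` are hypothetical names standing in for the window size L.

```python
from typing import List, Sequence


def split_into_chunks(tokens: Sequence[int], window: int) -> List[List[int]]:
    """Split tokens into consecutive, non-overlapping chunks of at most `window` tokens."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [list(tokens[i:i + window]) for i in range(0, len(tokens), window)]


chunks = split_into_chunks(list(range(10)), window=4)
print(chunks)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Under hard eviction, nothing from one chunk is visible to attention in the next, so whatever must survive has to be written into the fast weights before the boundary.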

The recurrent layer uses a gated Hebbian-like rule (a Mamba-2-style update). The memory state S is updated by an outer product of keys and values, with forget and write gates:

\dots
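One common way to write a gated outer-product update of this kind is given below. It is an assumption based on the description above, not the authors' exact notation: the symbols α_t (forget gate), β_t (write gate), k_t (key) and v_t (value) are introduced here for illustration only.

```latex
S_t = \alpha_t \, S_{t-1} + \beta_t \, v_t k_t^{\top},
\qquad \alpha_t \in [0, 1], \quad
k_t \in \mathbb{R}^{d_k}, \quad
v_t \in \mathbb{R}^{d_v}, \quad
S_t \in \mathbb{R}^{d_v \times d_k}
```

In this form the forget gate decays the old memory and the write gate scales the new key-value association; the shape of S_t is fixed by d_k and d_v, not by the sequence length.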

Unlike the KV cache, the state S does not grow with sequence length — the past must be compressed into a fixed-size memory. In their experiments the authors use Gated Delta Networks (GDNs), which add a delta-rule correction to this update. During sleep the model loops N times over the architecture blocks, each time refining the fast weights in the SSM blocks. With N = 1 the method reduces to a vanilla hybrid.
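A toy version of a delta-rule update and of repeated passes over the same context can be sketched as follows. This is a heavily simplified illustration, not the paper's method: in the paper the update rule is learned and each loop runs through the model's blocks, whereas here a fixed rule with gates set to 1 simply replays hand-picked key-value pairs. All names (`gated_delta_update`, `consolidate`, `recall_error`) are hypothetical.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]
Vector = Sequence[float]
Pair = Tuple[Vector, Vector]  # (key, value)


def matvec(S: Matrix, k: Vector) -> List[float]:
    """Read from memory: returns S @ k."""
    return [sum(s_ij * k_j for s_ij, k_j in zip(row, k)) for row in S]


def gated_delta_update(S: Matrix, k: Vector, v: Vector,
                       alpha: float = 1.0, beta: float = 1.0) -> Matrix:
    """Decay S by alpha, then write the recall error for key k, scaled by beta."""
    decayed = [[alpha * s_ij for s_ij in row] for row in S]
    error = [v_i - r_i for v_i, r_i in zip(v, matvec(decayed, k))]
    # Outer product of the recall error with the key (the delta-rule correction).
    return [[s_ij + beta * e_i * k_j for s_ij, k_j in zip(row, k)]
            for row, e_i in zip(decayed, error)]


def consolidate(pairs: Sequence[Pair], n_passes: int, d_v: int, d_k: int) -> Matrix:
    """Start from empty memory and replay the same pairs n_passes times."""
    S = [[0.0] * d_k for _ in range(d_v)]
    for _ in range(n_passes):
        for k, v in pairs:
            S = gated_delta_update(S, k, v)
    return S


def recall_error(S: Matrix, pairs: Sequence[Pair]) -> float:
    """Sum of squared differences between stored values and what S recalls."""
    return sum((v_i - r_i) ** 2
               for k, v in pairs
               for v_i, r_i in zip(v, matvec(S, k)))


# Two unit-norm keys that overlap, so a single pass cannot store both exactly.
pairs = [((1.0, 0.0), (1.0, 0.0)), ((0.6, 0.8), (0.0, 1.0))]
for n in (1, 2, 4):
    print(n, recall_error(consolidate(pairs, n, d_v=2, d_k=2), pairs))
```

In this toy the recall error shrinks with every extra pass because the keys interfere with each other. That is a property of the fixed delta rule used here and should not be read as a reproduction of the paper's results.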

What matters most is where the gradient flows. Unlike prior looped models, where gradients pass through recursively refined feature vectors, here the gradient flows through the refined fast weights — because the refined features are simply discarded after sleep. As a result all the extra computation is "frozen" into the weights and serves the later single-pass prediction.

Diagram 1 — The wake-sleep cycle under hard eviction

flowchart TD
  A[Token stream] --> B[Context window full - L tokens]
  B --> C{Phase}
  C -->|Consolidation| D[SLEEP: N offline forward passes]
  D --> E[Update fast weights S in SSM blocks]
  E --> F[Evict KV cache]
  F --> B
  C -->|Prediction| G[Single forward pass]
  G --> H[Answer - fixed wake-time latency]

The diagram shows that extra computation goes only into the consolidation (sleep) phase. The prediction phase always remains a single forward pass, keeping response latency constant.
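The same cycle can be written as control flow that only counts forward passes. This is a sketch under simplifying assumptions: `PassCounter` is a hypothetical stand-in for a model, and all context is treated as consolidated before any answer token is produced.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class PassCounter:
    """Hypothetical stand-in for a model; it only counts forward passes."""
    sleep_passes: int = 0
    wake_passes: int = 0

    def sleep(self, chunk: Sequence[int], n_loops: int) -> None:
        # Consolidation: N offline passes over the chunk, after which the cache is evicted.
        self.sleep_passes += n_loops

    def predict_token(self) -> None:
        # Prediction: exactly one forward pass per answer token.
        self.wake_passes += 1


def run(context: Sequence[int], n_answer_tokens: int,
        window: int, n_loops: int) -> PassCounter:
    model = PassCounter()
    for start in range(0, len(context), window):
        model.sleep(context[start:start + window], n_loops)
    for _ in range(n_answer_tokens):
        model.predict_token()
    return model


for n_loops in (1, 2, 4):
    m = run(list(range(1000)), n_answer_tokens=5, window=256, n_loops=n_loops)
    print(n_loops, m.sleep_passes, m.wake_passes)
```

The sleep count scales with N while the wake count depends only on the number of answer tokens, which is the latency argument the diagram makes.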

Key components

Hard eviction and two phases. Every L tokens the window is cleared. In the consolidation phase the loss mask is all-zero (the model only encodes), while in the prediction phase the model computes a masked cross-entropy loss on the answer tokens.
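The role of the loss mask can be illustrated with a small masked cross-entropy function. This is a sketch, not the authors' training code: `target_probs` is assumed to hold the probability the model assigns to the correct token at each position, and averaging over unmasked positions is an assumption about normalization.

```python
import math
from typing import Sequence


def masked_cross_entropy(target_probs: Sequence[float],
                         loss_mask: Sequence[int]) -> float:
    """Mean negative log-likelihood over positions where loss_mask is 1."""
    n_active = sum(loss_mask)
    if n_active == 0:
        return 0.0  # consolidation phase: all-zero mask, so no loss term
    total = -sum(math.log(p) for p, m in zip(target_probs, loss_mask) if m)
    return total / n_active


probs = [0.9, 0.1, 0.5, 0.25]
print(masked_cross_entropy(probs, [0, 0, 0, 0]))  # consolidation phase
print(masked_cross_entropy(probs, [0, 0, 1, 1]))  # prediction phase, answer tokens only
```

With an all-zero mask the consolidation chunk contributes no direct loss, so any training signal for it has to arrive indirectly through what it wrote into the fast weights.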

Prediction-phase latency constraint. In the prediction phase each answer token is produced in one standard forward pass. Extra loops or chain-of-thought tokens are disallowed because they would add latency. All knowledge needed to predict must be consolidated beforehand.

Sleep as reasoning depth. The architecture is related to looped / depth-recurrent models. The authors show that simply increasing the number of sleep loops (N from 2 to 4) systematically improves results on the hardest instances — those requiring the deepest reasoning over evicted context.

The test tasks. Rule 110 is a one-dimensional cellular automaton whose t-step prediction problem is P-complete; the parameter t controls the required reasoning depth. Depo is a multi-hop graph retrieval task in which queries with larger k require deeper traversal. GSM-Infinite is a synthetic math benchmark modeled after GSM8K that stresses long context and multi-step reasoning at once.
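To show what "reasoning depth" means in the two synthetic tasks, here is a minimal sketch of each. The Rule 110 update table is standard, but the periodic boundary is an assumption, and `k_hop` is a generic k-step lookup over a successor map rather than Depo's actual data format.

```python
from typing import Dict, List

RULE = 110  # binary 01101110: bit i is the next state for neighborhood value i


def rule110_step(cells: List[int]) -> List[int]:
    """Apply one step of Rule 110 with periodic (wrap-around) boundaries."""
    n = len(cells)
    return [
        (RULE >> (4 * cells[i - 1] + 2 * cells[i] + cells[(i + 1) % n])) & 1
        for i in range(n)
    ]


def rule110(cells: List[int], t: int) -> List[int]:
    """Run t sequential steps; each step depends on the previous one."""
    for _ in range(t):
        cells = rule110_step(cells)
    return cells


def k_hop(successor: Dict[str, str], start: str, k: int) -> str:
    """Follow k edges from start; larger k means a deeper traversal."""
    node = start
    for _ in range(k):
        node = successor[node]
    return node


print(rule110([0, 0, 0, 0, 0, 0, 0, 1], t=3))  # [0, 0, 0, 0, 1, 1, 0, 1]
print(k_hop({"a": "b", "b": "c", "c": "a"}, "a", k=2))  # c
```

In both cases the answer after t steps or k hops cannot be read off the input directly; it requires a chain of dependent computations, which is why these tasks probe depth rather than storage.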

Diagram 2 — Vanilla hybrid vs "LLM sleep"

flowchart LR
  subgraph Base[Vanilla hybrid - N=1]
    A1[Context] --> A2[1 pass] --> A3[Fast weights] --> A4[Shallow reasoning]
  end
  subgraph Sleep[LLM sleep - N greater than 1]
    B1[Context] --> B2[N passes - sleep] --> B3[Refined fast weights] --> B4[Deeper reasoning]
  end

The comparison captures the essence of the method: with the same context length, eviction rule and prediction-phase cost, the number of consolidation loops is what makes the difference. More sleep means more steps to turn context into a representation that supports inference.

Differences vs. alternatives

Versus vanilla SSM-attention hybrids. Standard hybrids have enough capacity to store context, but their performance degrades as the required reasoning depth grows — even when the amount of information to store is held fixed. "LLM sleep" attacks this compute deficit, not a capacity deficit.

Versus context compression and distillation. Compression methods shorten what remains in the attention window. Context distillation trains a model to imitate a "contextful teacher" via gradient descent on predefined losses. LLM sleep replaces gradient descent with a learned recurrent update rule, a more flexible form of consolidation.

Versus test-time training. Related work takes a single gradient step per context chunk. Here the memory-update rule is a learned forward pass and need not correspond to a one-step gradient update. Unlike models that loop at prediction time, this method does not loop while answering: the extra computation has already been spent forming the weights.

Applications

The most immediate application area is long-horizon tasks where the model must reason about information already evicted from the active attention window: long math and logic problems, multi-hop knowledge retrieval, and simulation of sequential processes.

The authors also validate the method on pre-trained models. They fine-tune the Jet-Nemotron 2B hybrid and the looped Ouro 1.4B model on GSM-Infinite. For Jet, six loops raise accuracy on six-operation problems from 0.742 to 0.812 and on eight-operation problems from 0.351 to 0.388. For Ouro, four loops raise six-operation accuracy from 0.419 to 0.615 and eight-operation accuracy from 0.210 to 0.272.

A sliding-window eviction variant retains the most recent L−1 tokens. With a window of L = 512 and Ouro 1.4B, longer sleep lifts accuracy on two-operation problems from 0.596 to 0.905 (about 31 percentage points, or a relative improvement of roughly 52%). This suggests that when the active window is much smaller than the full sequence, longer sleep helps not only with reasoning but also with compressing and retrieving relevant context.
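The accuracy figures quoted in this section can be turned into absolute and relative gains with a few lines of arithmetic. The numbers below are copied from the text above; the dictionary labels and the `gains` helper are illustrative names only.

```python
from typing import Dict, Tuple

# (accuracy before, accuracy after), as reported in the text above.
reported: Dict[str, Tuple[float, float]] = {
    "Jet-Nemotron 2B, 6 ops": (0.742, 0.812),
    "Jet-Nemotron 2B, 8 ops": (0.351, 0.388),
    "Ouro 1.4B, 6 ops": (0.419, 0.615),
    "Ouro 1.4B, 8 ops": (0.210, 0.272),
    "Ouro 1.4B, sliding window L=512, 2 ops": (0.596, 0.905),
}


def gains(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute gain, relative gain) for an accuracy change."""
    absolute = after - before
    return absolute, absolute / before


for name, (before, after) in reported.items():
    absolute, relative = gains(before, after)
    print(f"{name}: +{absolute * 100:.1f} points, +{relative * 100:.1f}% relative")
```

Separating percentage points from relative change matters here because the baselines differ widely, so the same absolute gain corresponds to very different relative improvements.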

Limitations

Training cost. The method moves extra computation into the consolidation phase, but this is not free: training requires forward and backward passes that are N times deeper, which can be slow and unstable. Cost grows roughly linearly with the number of loops N.

Sequentiality. Sleep makes training sequential across context and depth — before processing the next window you must finish the previous one and run N sleep passes. This prevents full parallelization along the sequence axis, although with a large window L it need not hurt wall-clock time, because the GPU stays saturated.

Scope of evidence. The study rests on controlled synthetic tasks and modest-scale models. The authors themselves note that stabilizing deep recurrence (e.g., via implicit gradients or truncated backpropagation through time) remains an open and active research topic.

Why it matters

The work reframes the question of memory in reasoning models. It shows that scalable memory is not the same as scalable reasoning — and that recurrence can serve not only to generate answers but also to consolidate knowledge.

First, it separates two costs that usually blur together in long context: the cost of storing information and the cost of processing it into a useful state. This lets reasoning scale without raising response latency.

Second, the sleep analogy is more than a metaphor — it is a design principle for budgeting compute. Expensive "thinking" is moved out of the answer moment, much as memory consolidation in animals happens offline, when the organism does not respond to stimuli.

Third, the result is a warning sign for SSM-attention hybrids themselves. Having enough memory does not guarantee that such an architecture can handle deep, sequential inference over content it can no longer attend to. Without extra recurrence, the model can easily fall back on brittle shortcut solutions.

Sources

1. Lee S., McLeish S., Goldstein T., Fanti G. "Do Language Models Need Sleep? Offline Recurrence for Improved Online Inference". arXiv:2605.26099v2 (2026). https://arxiv.org/abs/2605.26099

2. Full text (PDF). https://arxiv.org/pdf/2605.26099

3. HTML version. https://arxiv.org/html/2605.26099v2