Inference

LLM Sleep

2025 · Research · Published: 6 June 2026

Key innovation

Shifts part of an LLM's reasoning from query time (test-time) to idle periods (sleep-time), pre-computing useful conclusions about a context before the user even asks a question.

How it works

1) The system receives a context (document, agent history, task state) without an active query.
2) During idle time the LLM performs context-directed reasoning: anticipating likely questions, drawing inferences, summarizing, planning.
3) The results are stored as enriched context or persistent agent memory.
4) When a real user query arrives, the model reuses these pre-computed artifacts, so the test-time compute needed for an accurate answer is significantly reduced.
5) In multi-query settings against the same context, the sleep-time cost is amortized across all subsequent queries.
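The steps above can be sketched as a small two-phase pipeline. This is a minimal illustration and not the Letta implementation: `SleepTimeAgent`, `fake_llm`, and the prompt wording are hypothetical names invented here, and the "LLM" is a stub so the example runs offline.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical type: any callable that maps a prompt string to a text reply.
LLM = Callable[[str], str]


@dataclass
class SleepTimeAgent:
    llm: LLM
    context: str = ""
    notes: list[str] = field(default_factory=list)  # pre-computed artifacts

    def observe(self, context: str) -> None:
        """Step 1: receive a context with no active query."""
        self.context = context
        self.notes.clear()

    def sleep(self) -> None:
        """Steps 2-3: reason about the context while idle and store the result."""
        prompt = (
            "Anticipate likely questions about the context and write down "
            "useful inferences.\n\nContext:\n" + self.context
        )
        self.notes.append(self.llm(prompt))

    def answer(self, query: str) -> str:
        """Step 4: answer the query against the enriched context."""
        enriched = self.context + "\n\nNotes:\n" + "\n".join(self.notes)
        return self.llm(f"{enriched}\n\nQuestion: {query}")


def fake_llm(prompt: str) -> str:
    """Stub standing in for a real model call (assumption for this sketch)."""
    if "Question:" in prompt:
        if "inference:" in prompt:
            return "answered with notes"
        return "answered without notes"
    return "inference: the invoice total is 42"
```

Calling `observe`, then `sleep`, then `answer` several times reuses the same stored notes for every query, which is where the amortization in step 5 comes from.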

Problem solved

Test-time scaling, in which long reasoning chains run only once a query arrives, sharply increases LLM inference latency and cost. Sleep-time compute addresses this by moving part of that reasoning into idle periods, before the user asks anything.
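The trade-off can be made concrete with a little arithmetic. The sketch below assumes sleep-time and test-time tokens cost the same per token, and the token counts in the usage example are made-up illustrations rather than figures from the paper; the function names are hypothetical.

```python
import math


def avg_cost_per_query(sleep_cost: float, test_cost: float, n_queries: int) -> float:
    """Average cost per query when one sleep-time pass is shared by n_queries."""
    if n_queries < 1:
        raise ValueError("n_queries must be at least 1")
    return sleep_cost / n_queries + test_cost


def break_even_queries(sleep_cost: float, test_cost: float, baseline_cost: float) -> int:
    """Smallest number of queries at which sleep-time compute is no more
    expensive per query than answering from scratch (baseline_cost)."""
    saving = baseline_cost - test_cost
    if saving <= 0:
        raise ValueError("sleep-time compute never pays off without a per-query saving")
    return math.ceil(sleep_cost / saving)
```

For example, with an assumed 1000-token sleep-time pass that cuts each answer from 700 to 200 reasoning tokens, `break_even_queries(1000, 200, 700)` gives 2, and ten queries against the same context average `avg_cost_per_query(1000, 200, 10)` = 300 tokens each.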

Implementation

Reference implementations

letta-ai/sleep-time-compute

Python · Letta AI

Official

Implementation pitfalls

Low query predictability · High

The efficacy of sleep-time compute correlates strongly with the predictability of future queries. When user questions are highly open-ended and unexpected, the pre-computed inferences are not useful and the sleep-time work is wasted.

Fix: Use in settings with a strong query prior (agents with persistent memory, long documents, repetitive tasks). Profile the query distribution before investing in pre-computation.
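One rough way to profile the query distribution is to measure how many past queries would have been covered by the questions the sleep-time pass anticipates. The sketch below is an assumption-laden illustration: `coverage` and `tokens` are hypothetical helpers, and it uses crude lexical (Jaccard) overlap where a real system would more likely compare embeddings.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric word set of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def coverage(past_queries: list[str], anticipated: list[str], threshold: float = 0.5) -> float:
    """Fraction of past queries whose best Jaccard overlap with any
    anticipated question is at least `threshold`."""
    if not past_queries:
        return 0.0
    anticipated_sets = [tokens(a) for a in anticipated]
    hits = 0
    for query in past_queries:
        q = tokens(query)
        best = max(
            (len(q & a) / len(q | a) for a in anticipated_sets if q | a),
            default=0.0,
        )
        if best >= threshold:
            hits += 1
    return hits / len(past_queries)
```

A low coverage score on historical traffic suggests the queries are too open-ended for pre-computation to pay off; the 0.5 threshold is an arbitrary starting point to tune.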

Stale pre-computed context · Medium

If the context changes faster than the sleep-time cycle, pre-computed inferences can become stale and introduce errors into responses.

Fix: Track context changes and invalidate or refresh pre-computed artifacts. Restrict sleep-time pre-computation to stable parts of the context.
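Change tracking can be as simple as storing a content hash next to each artifact and treating a mismatch as an invalidation. The following is a minimal sketch under that assumption; `ArtifactCache` and its methods are hypothetical and not part of the reference repository.

```python
import hashlib
from typing import Callable


class ArtifactCache:
    """Keeps sleep-time artifacts keyed by chunk id, together with a hash
    of the context chunk each artifact was derived from."""

    def __init__(self, precompute: Callable[[str], str]) -> None:
        self._precompute = precompute
        self._store: dict[str, tuple[str, str]] = {}  # chunk_id -> (digest, artifact)

    @staticmethod
    def _digest(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def is_fresh(self, chunk_id: str, text: str) -> bool:
        """True if an artifact exists and was computed from this exact text."""
        entry = self._store.get(chunk_id)
        return entry is not None and entry[0] == self._digest(text)

    def get(self, chunk_id: str, text: str) -> str:
        """Return the artifact for a chunk, recomputing it if the chunk changed."""
        if not self.is_fresh(chunk_id, text):
            self._store[chunk_id] = (self._digest(text), self._precompute(text))
        return self._store[chunk_id][1]
```

In this sketch a stale entry is recomputed on access, which pushes the cost back to query time; a scheduler that calls `is_fresh` during idle periods and refreshes stale chunks then would keep that work in sleep time.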

Evolution

Original paper · 2025 · arXiv preprint (UC Berkeley / Letta) · Kevin Lin

Sleep-time Compute: Beyond Inference Scaling at Test-time

Kevin Lin, Charlie Snell, Yu Wang, Charles Packer, Sarah Wooders, Ion Stoica, Joseph E. Gonzalez

2025

Introduction of the Sleep-time Compute paradigm

Inflection point

Lin et al. publish arXiv:2504.13171 and reference code letta-ai/sleep-time-compute, defining sleep-time compute as an alternative to test-time scaling.

Sleep-time Compute: Beyond Inference Scaling at Test-time (paper)

Sources

Sleep-time Compute: Beyond Inference Scaling at Test-time

Paper

arXiv

letta-ai/sleep-time-compute (GitHub)

Repository

Letta AI