1) The system receives a context (document, agent history, task state) without an active query.
2) During idle time the LLM performs context-directed reasoning: anticipating likely questions, drawing inferences, summarizing, planning.
3) The results are stored as enriched context or persistent agent memory.
4) When a real user query arrives, the model reuses these pre-computed artifacts, so the test-time compute needed for an accurate answer is significantly reduced.
5) In multi-query settings against the same context, the sleep-time cost is amortized across all subsequent queries.
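The five steps above can be sketched as a small toy agent. This is a minimal illustration, not the code from letta-ai/sleep-time-compute: the class SleepTimeAgent, its method names, the prompt strings, and the stub fake_llm are all hypothetical, and the exact-match lookup in answer() stands in for whatever retrieval a real system would use.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SleepTimeAgent:
    """Toy agent: precompute during idle time, reuse at query time."""

    llm: Callable[[str], str]  # hypothetical LLM call: prompt -> text
    context: str = ""
    memory: Dict[str, str] = field(default_factory=dict)  # pre-computed artifacts

    def receive_context(self, context: str) -> None:
        # Step 1: a context arrives with no active query.
        self.context = context
        self.memory.clear()

    def sleep(self, anticipated_questions: List[str]) -> None:
        # Steps 2-3: reason over the context while idle and store the results.
        self.memory["summary"] = self.llm(f"Summarize: {self.context}")
        for question in anticipated_questions:
            self.memory[question] = self.llm(
                f"Context: {self.context}\nAnswer in advance: {question}"
            )

    def answer(self, query: str) -> str:
        # Step 4: reuse a stored artifact when one matches the query exactly.
        if query in self.memory:
            return self.memory[query]
        # Otherwise reason at test time, starting from the stored summary.
        notes = self.memory.get("summary", "")
        return self.llm(f"Notes: {notes}\nQuestion: {query}")


# Stub LLM that records every call, so the saved test-time work is visible.
calls: List[str] = []


def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"response#{len(calls)}"


agent = SleepTimeAgent(llm=fake_llm)
agent.receive_context("Q3 revenue rose 12% to $4.1M.")
agent.sleep(["What was Q3 revenue?"])
calls_after_sleep = len(calls)

# Step 5: repeated queries against the same context trigger no new LLM calls.
first = agent.answer("What was Q3 revenue?")
second = agent.answer("What was Q3 revenue?")
calls_after_queries = len(calls)
```

In this sketch all LLM calls happen inside sleep(); an anticipated query is served from memory, and only an unanticipated one falls through to a fresh call at test time.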
Test-time scaling (long reasoning chains executed when a query arrives) drastically increases LLM inference latency and cost. Sleep-time compute addresses this by shifting part of the reasoning into idle periods before the user asks anything.
The efficacy of sleep-time compute correlates strongly with the predictability of future queries. When user questions are open-ended and hard to anticipate, the pre-computed inferences rarely apply and the sleep-time work is wasted.
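The interaction between predictability and amortization can be made concrete with a back-of-envelope cost model. The model below is an assumption for illustration, not a formula from the paper: hit_rate is the fraction of queries that the pre-computed artifacts actually serve, and all costs are in arbitrary units such as tokens.

```python
def expected_cost_per_query(
    sleep_cost: float,
    n_queries: int,
    hit_rate: float,
    hit_cost: float,
    miss_cost: float,
) -> float:
    """Expected per-query cost under an assumed hit/miss model.

    sleep_cost: one-off idle-time cost, shared by all n_queries.
    hit_cost:   test-time cost when a pre-computed artifact applies.
    miss_cost:  test-time cost when the query was not anticipated.
    """
    if n_queries <= 0:
        raise ValueError("n_queries must be positive")
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be in [0, 1]")
    amortized_sleep = sleep_cost / n_queries
    return amortized_sleep + hit_rate * hit_cost + (1.0 - hit_rate) * miss_cost


# Illustrative numbers only: full test-time reasoning costs 500 units per query.
baseline = 500.0
predictable = expected_cost_per_query(1000.0, 10, 0.5, 50.0, baseline)
unpredictable = expected_cost_per_query(1000.0, 10, 0.0, 50.0, baseline)
```

With these made-up numbers, the half-predictable workload comes in below the baseline, while the fully unpredictable one pays the baseline plus the amortized sleep-time cost, which is the wasted work described above.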
If the context changes faster than the sleep-time cycle, pre-computed inferences can become stale and introduce errors into responses.
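One way to guard against stale artifacts is to tag each one with a fingerprint of the context it was computed from and to discard it when the context no longer matches. This is a sketch of a generic cache-invalidation pattern, not something the source describes; Artifact, fingerprint, and fresh_or_none are hypothetical names.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


def fingerprint(context: str) -> str:
    # Content hash of the context the artifact was derived from.
    return hashlib.sha256(context.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class Artifact:
    text: str          # pre-computed inference
    context_hash: str  # fingerprint of the context at sleep time


def fresh_or_none(artifact: Optional[Artifact], current_context: str) -> Optional[str]:
    """Return the artifact text only if the context is unchanged since sleep time."""
    if artifact is None:
        return None
    if artifact.context_hash != fingerprint(current_context):
        return None  # stale: caller should recompute or reason at test time
    return artifact.text


old_context = "Ticket status: open"
artifact = Artifact(text="The ticket is still open.", context_hash=fingerprint(old_context))
```

Calling fresh_or_none(artifact, old_context) returns the stored text, while any edit to the context yields None, so a stale inference is never served silently. A whole-context hash is deliberately coarse; a real system might track changes per document or per field.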
Lin et al. (arXiv:2504.13171) define sleep-time compute as an alternative to test-time scaling; reference code is available at letta-ai/sleep-time-compute.