Inference

RLM (Recursive Language Models)

2025 · Research · Published
Key innovation
The LLM receives the long context as a variable in a REPL environment and decides for itself how to partition, search, and delegate it to recursive sub-calls of smaller model instances, rather than reading the entire context at once.
Category
Inference
Abstraction level
Paradigm
Operation level
Inference · System · Orchestration · Agent runtime
Use cases
Handling prompts >10M tokens without a retriever
Long-context Q&A (OOLONG, multi-hop over tens of thousands of entries)
Deep Research over ~100k document corpora (BrowseComp-Plus)
Programmatic history processing (LoCoDiff, git log)
Long-output generation (e.g. BibTeX from a list of 100+ papers)
Mitigating context rot in long Claude Code / Cursor sessions
Scaling test-time compute as an alternative to CoT and agents

How it works

The root LM receives only the query q and the fact that context C exists as a variable in the REPL. The environment (typically a Python REPL / Jupyter kernel) lets it execute arbitrary code over that variable: peek (read the first N characters), grep (regex search), partition + map (chunk the context and issue parallel recursive LLM calls over the fragments), and summarization (condense subsets). Each recursive call RLM_M(q̂, Ĉ) spawns an isolated sub-RLM instance with its own environment; its result is returned to the caller's environment. In the reference implementation the authors capped depth at 1 (the root can call LLMs but not other RLMs). The final answer is returned via a FINAL(...) tag (directly) or FINAL_VAR(variable_name) (from REPL memory).
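A minimal sketch of this loop is shown below. It assumes a generic chat(messages) -> str client and a plain dict as the REPL namespace; the names rlm, run_in_repl, ctx, and llm are illustrative and not taken from the authors' reference implementation.

```python
import contextlib
import io
import re


def run_in_repl(code: str, env: dict) -> str:
    """Execute model-emitted code in the shared namespace and capture its prints."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, env)  # NOTE: sandbox this in any real deployment
    except Exception as exc:
        buf.write(f"Error: {exc!r}")
    return buf.getvalue() or "(no output)"


def rlm(query: str, context: str, chat, max_iterations: int = 20) -> str:
    """Recursive Language Model call: the root LM never reads `context` directly.

    `chat(messages) -> str` stands in for any chat-completion API.
    The context lives only inside the REPL namespace, as the variable `ctx`.
    """
    env = {
        "ctx": context,
        # Sub-calls: with depth capped at 1 (as in the reference setup) these are
        # plain LLM calls; the general paradigm could recurse into rlm() here.
        "llm": lambda q, c: chat([{"role": "user", "content": f"{q}\n\n{c}"}]),
    }
    messages = [
        {"role": "system", "content": (
            "A long context is stored in the Python variable `ctx`. "
            "Write Python to inspect it (peek, regex, chunk) and call "
            "llm(query, text) on pieces of it. When done, reply with "
            "FINAL(answer) or FINAL_VAR(variable_name).")},
        {"role": "user", "content": query},
    ]
    for _ in range(max_iterations):
        reply = chat(messages)
        if m := re.search(r"FINAL_VAR\((\w+)\)", reply):
            return str(env.get(m.group(1), ""))   # answer built up in REPL memory
        if m := re.search(r"FINAL\((.*)\)", reply, re.S):
            return m.group(1).strip()             # answer given inline
        output = run_in_repl(reply, env)          # otherwise: treat the reply as code
        messages += [{"role": "assistant", "content": reply},
                     {"role": "user", "content": output}]
    return "max_iterations reached without FINAL"
```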

Problem solved

Frontier LLMs degrade on long contexts (the "context rot" phenomenon), and hard context windows (e.g. 272k tokens for GPT-5) rule out corpora of millions of tokens. Standard solutions such as RAG, ReAct with a retriever, or agentic problem-decomposition scaffolds impose a fixed strategy; RLM instead hands the context-decomposition decision to the model itself.

Key mechanisms

Context C stored as a variable in the REPL environment — the model never reads it all at once
Root LM (depth=0) sees only the query q and interacts with the context by generating Python code
Peek — reading the first N characters of context to recognize its structure
Grep / regex — narrowing the search space by pattern-matching in the context variable
Partition + Map — chunking the context and sequential or parallel recursive LM calls over fragments
Summarization — condensing context subsets via sub-call before returning to the root LM
Recursive LM calls as function calls in the REPL — a sub-LLM instance (depth=1) returns its result to the calling environment
Termination via FINAL(answer) or FINAL_VAR(var_name) — answer directly or from REPL memory
Drop-in compatibility — an RLM call is interface-identical to an LLM call (same input/output)
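To make the mechanisms above concrete, here are illustrative snippets of the kind of code the root LM might emit into the REPL. They assume the context is bound to a variable ctx and recursive calls are exposed as a function llm(query, text); both names are assumptions, not the paper's exact interface, and the regex and chunk sizes are arbitrary.

```python
# Peek: read the first N characters to learn the structure of the context.
print(ctx[:2000])

# Grep / regex: narrow the search space with pattern matching over the variable.
import re
hits = [m.start() for m in re.finditer(r"Invoice #\d+", ctx)]
print(len(hits), hits[:10])

# Partition + map: chunk the context and fan recursive sub-calls out over the pieces.
chunks = [ctx[i:i + 50_000] for i in range(0, len(ctx), 50_000)]
partials = [llm("List every invoice number you can find.", chunk) for chunk in chunks]

# Summarization: condense a subset before reasoning over it at the root.
summary = llm("Summarize the key events in 10 bullets.", ctx[hits[0]:hits[0] + 30_000])
print(summary)
```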

Strengths & limitations

Strengths
Handles >10M token contexts without additional foundation model training — only an inference wrapper needed
RLM(GPT-5-mini) outperforms GPT-5 by over 114% on OOLONG (132k tokens) at comparable query cost
On OOLONG (263k tokens) RLM(GPT-5-mini) still beats GPT-5 by ~49%
On BrowseComp-Plus (1,000 documents) RLM(GPT-5) is the only method maintaining near-perfect accuracy
General paradigm — the REPL environment can be replaced by any other environment (vector DB, filesystem)
Interpretable trajectories — peek/grep/partition+map readable as REPL logs, no black-box latents
RLM-Qwen3-8B (post-trained) outperforms base Qwen3-8B by 28.3% and approaches GPT-5 on 3 tasks
Self-improving as the base model improves — stronger LLM = stronger RLM without framework changes
Official open-source code on GitHub (alexzhang13/rlm, alexzhang13/rlm-minimal), CC BY 4.0 paper license
Limitations
Reference implementation is blocking and without prefix caching — each query takes seconds to minutes
No hard cost budget: max_iterations bounds steps but outlier queries can consume significant API spend
Executing arbitrary Python code over user data in the REPL requires sandboxing — risk of exfiltration or side effects without isolation
Recursion depth validated to 1; behavior at depth>1 or with >4 agents has not been systematically studied
BrowseComp-Plus performance based on 20 random queries — too small a sample for full generalization
Does not eliminate context rot in the sub-calls themselves — recursive LLMs still face window limits when processing chunks
Requires a base model capable of generating correct Python code — weaker models may produce faulty REPL code
Decomposition strategies (peek, grep, partition+map) are emergent and may be suboptimal for specific tasks

Components

Root LM: Context-decomposition orchestrator

The main language model (depth=0). Receives only the query q and the information that context C exists in the environment. Decides the decomposition strategy and emits code to the REPL.

Official

REPL environment (Python): Memory and context-access mediator

A Python notebook (Jupyter-like) with context C loaded as an in-memory variable. Lets the root LM peek, grep, chunk, and invoke recursive LMs as function calls.

Python REPL / Jupyter: reference variant from the RLM paper.
Other execution environment: the authors note that the choice of environment is flexible; the REPL is an example, not a requirement.

Official

Recursive LM calls: Sub-query executors

LLM sub-instances (depth=1) invoked from the REPL over fragments or transformed context. In the reference setup these are GPT-5 or GPT-5-mini, with depth limited to 1.

Official

FINAL / FINAL_VAR tags: Answer-return protocol

Termination mechanism: FINAL(answer) returns the answer directly; FINAL_VAR(var_name) returns the value of a variable from REPL memory.
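A small sketch of how these tags could be resolved against the REPL namespace; the regex patterns and helper name are assumptions, not the exact format used in the paper.

```python
import re

def resolve_final(reply: str, namespace: dict):
    """Return (done, answer). FINAL(...) carries the answer inline;
    FINAL_VAR(name) dereferences a variable built up in REPL memory."""
    if m := re.search(r"FINAL_VAR\(\s*([A-Za-z_]\w*)\s*\)", reply):
        return True, namespace.get(m.group(1), f"<undefined variable {m.group(1)}>")
    if m := re.search(r"FINAL\((.*)\)", reply, re.S):
        return True, m.group(1).strip()
    return False, None

# e.g. resolve_final("FINAL_VAR(bibtex_output)", {"bibtex_output": "@article{...}"})
```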

Implementation

Implementation pitfalls
Lack of asynchrony in sub-calls · High

The reference implementation runs each recursive LM call blocking and without prefix caching, stretching a single query from seconds to minutes.

Fix: Implement an async worker pool and prefix caching over shared chunk prefixes.
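A sketch of the fan-out part of that fix, using a thread pool over a hypothetical blocking llm_call(query, chunk) client. Sub-calls are I/O-bound, so threads usually suffice; asyncio or provider batch APIs are alternatives, and keeping a shared prompt prefix across chunks lets provider-side prefix caching apply.

```python
from concurrent.futures import ThreadPoolExecutor

def map_chunks(query: str, chunks: list[str], llm_call, max_workers: int = 8) -> list[str]:
    """Run recursive sub-calls over chunks concurrently instead of one by one;
    results come back in the original chunk order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda chunk: llm_call(query, chunk), chunks))
```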
No hard cost/time budget · Medium

Max iterations bounds the number of REPL steps but does not guarantee a hard cap on API spend or wall-clock time; outlier queries can be very expensive.

Fix: Add hard watchdogs on cumulative API cost and wall-clock time per call.
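A sketch of such a watchdog, charged after every sub-call; the token prices and caps below are placeholders.

```python
import time

class BudgetExceeded(RuntimeError):
    """Raised when a query exceeds its hard cost or wall-clock cap."""

class Budget:
    def __init__(self, max_usd: float = 2.0, max_seconds: float = 300.0):
        self.max_usd, self.max_seconds = max_usd, max_seconds
        self.spent_usd, self.start = 0.0, time.monotonic()

    def charge(self, prompt_tokens: int, completion_tokens: int,
               usd_per_1k_prompt: float = 0.001, usd_per_1k_completion: float = 0.004):
        # Accumulate spend after each API call and enforce both caps.
        self.spent_usd += (prompt_tokens * usd_per_1k_prompt
                           + completion_tokens * usd_per_1k_completion) / 1000
        if self.spent_usd > self.max_usd:
            raise BudgetExceeded(f"cost cap hit: ${self.spent_usd:.2f}")
        if time.monotonic() - self.start > self.max_seconds:
            raise BudgetExceeded("wall-clock cap hit")
```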
Python REPL code sandboxing · High

The REPL executes arbitrary LM-generated code over user data — without isolation this is a security risk (exfiltration, side effects).

Fix: Run the REPL in a sandbox (container, restricted net/disk permissions, library allowlist).
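One possible isolation setup, sketched below: each REPL step runs in a short-lived container with no network access and capped resources. The image name, mount path, and limits are placeholders; gVisor, Firecracker, or a restricted in-process interpreter are alternative isolation layers.

```python
import os
import pathlib
import subprocess
import tempfile

def run_sandboxed(code: str, workdir: str, timeout: int = 60) -> str:
    """Execute model-generated code in an isolated container: no network,
    capped memory and CPU, read-only root FS, only `workdir` mounted."""
    fd, path = tempfile.mkstemp(suffix=".py", dir=workdir)
    os.close(fd)
    pathlib.Path(path).write_text(code)
    result = subprocess.run(
        ["docker", "run", "--rm",
         "--network", "none",           # blocks exfiltration over the network
         "--memory", "2g", "--cpus", "1",
         "--read-only",                 # immutable root filesystem
         "-v", f"{workdir}:/work",      # only the working directory is visible
         "python:3.12-slim", "python", f"/work/{pathlib.Path(path).name}"],
        capture_output=True, text=True, timeout=timeout,
    )
    return result.stdout + result.stderr
```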

Evolution

Original paper · 2025 · arXiv preprint (MIT CSAIL) · Alex L. Zhang
Recursive Language Models
Alex L. Zhang, Tim Kraska, Omar Khattab
2025
Recursive Language Models blog post
Inflection point

October 2025 — Alex Zhang publishes the RLM concept on the MIT CSAIL blog along with first results on OOLONG and BrowseComp-Plus.

2025
arXiv v1 preprint

December 31, 2025 — first arXiv version (2512.24601).

2026
arXiv v3 preprint + RLM-Qwen3-8B
Inflection point

May 11, 2026 — v3 with the first post-trained RLM-specialized model: RLM-Qwen3-8B outperforms the base Qwen3-8B by 28.3% and approaches GPT-5 on three long-context tasks.

Technical details

Hyperparameters (configurable axes)

Maximum recursion depth · High

How many levels of RLM calls are allowed. The reference implementation uses depth=1 (root + one sub-call level). Greater depth may yield more capable systems but increases cost.

Max root-LM iterations · High

Limit on REPL steps performed by the root LM before forced termination. Controls cost and time but does not guarantee a hard API budget.

Model for recursive sub-calls · Critical

Which LLM handles sub-calls (e.g. GPT-5-mini as a cheap decomposer under a GPT-5 root). Strongly impacts cost and quality.

Environment type · High

What is actually exposed to the root LM (REPL, vector DB, filesystem). Reference: Python REPL with context as a variable.
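These axes could be collected into a single configuration object; a minimal sketch with illustrative defaults mirroring the reference setup (field names are not from the official repo).

```python
from dataclasses import dataclass

@dataclass
class RLMConfig:
    root_model: str = "gpt-5"           # model that drives the REPL loop
    subcall_model: str = "gpt-5-mini"   # cheaper model for recursive sub-calls
    max_depth: int = 1                  # reference setup: root + one sub-call level
    max_iterations: int = 20            # REPL steps before forced termination (illustrative)
    environment: str = "python_repl"    # alternatives: vector DB, filesystem, ...
```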

Computational complexity

Computational characteristics
Cost per query: comparable to a direct GPT-5 call at the median; outlier queries may be more expensive
Latency: seconds to minutes per query with the reference blocking implementation
Scaling with depth: O(d × n) LLM calls for depth d and n chunks per level — at depth=1 linear in n
RLM-Qwen3-8B: +28.3% over base Qwen3-8B on 3 long-context tasks (post-training)
OOLONG 132k: RLM(GPT-5-mini) +114% over GPT-5 (raw score more than double) at comparable cost
OOLONG 263k: RLM(GPT-5-mini) +49% over GPT-5; counting problems reveal quality degradation as context grows
BrowseComp-Plus: the only method maintaining near-100% accuracy at 1,000 documents (~5k words/doc, ~5M tokens)
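A back-of-the-envelope call-count bound following the O(d × n) characterization above, assuming a fixed fan-out of n chunks per recursion level; real trajectories are adaptive, so this is only a rough upper bound.

```python
def llm_calls_upper_bound(depth: int, chunks_per_level: int, root_iterations: int) -> int:
    """Rough upper bound on LLM calls for one RLM query: the root takes up to
    `root_iterations` REPL steps, and each recursion level can fan out over
    `chunks_per_level` sub-calls."""
    return root_iterations + depth * chunks_per_level

# Depth 1 with ~20 chunks and ~20 root steps stays around 40 calls.
print(llm_calls_upper_bound(depth=1, chunks_per_level=20, root_iterations=20))
```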
Benchmark notes

RLM was evaluated on four task categories. (1) OOLONG trec_coarse: distributional queries over ~3,000–6,000 entries (contexts of 132k and 263k tokens). RLM(GPT-5-mini) beat GPT-5 by ~114% at 132k and ~49% at 263k; the no-recursion ablation (REPL only, no sub-calls) drops by ~10%. (2) BrowseComp-Plus: 20 randomly sampled multi-hop queries over a document corpus (10–1,000 docs). RLM(GPT-5) was the only method maintaining near-perfect accuracy at 1,000 docs. (3) Four long-context benchmarks (full v3 paper): median +26% vs compaction, +130% vs CodeAct sub-calls, +13% vs Claude Code. (4) RLM-Qwen3-8B (post-training): +28.3% over base Qwen3-8B, approaching GPT-5 on 3 of 4 tasks. Important caveat: BrowseComp-Plus results are based on 20 queries — the authors explicitly flag these as preliminary.

Execution paradigm

Primary mode
conditional

Context routing is emergent — learned from trajectories or improvised by the model at inference time.

Activation pattern
input_dependent
Routing mechanism

The root LM itself decides (by generating code in the REPL) which context fragments to analyze and when to delegate to a recursive call.

Parallelism

Parallelism level
partially_parallel

Recursive calls over chunks can run in parallel (e.g. partition+map), but the authors' reference implementation is blocking (each sub-call sequential). The authors mark this as significant low-hanging fruit for systems-level optimization.

Scope
inference · across_devices

Hardware requirements

Primary

RLM is an inference paradigm over existing LLMs — independent of the underlying model's hardware.