RLM
How it works
The root LM receives only the query q and the information that context C exists as a variable in the REPL. The environment (typically Python REPL / Jupyter) lets it execute arbitrary code over that variable — peek (read the first N characters), grep (regex), partition + map (chunking + parallel recursive LLM calls over fragments), summarization (condensing subsets). Each recursive call RLM_M(q̂, Ĉ) spawns an isolated sub-RLM instance with its own environment; its result returns to the caller's environment. In the reference implementation the authors capped depth at 1 (root can call LLMs but not other RLMs). The final answer is returned via a FINAL(...) tag (directly) or FINAL_VAR(variable_name) (from REPL memory).
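A minimal sketch of this control flow, assuming a generic llm_complete(prompt) -> str completion helper and an illustrative <python>...</python> code-wrapping convention; the authors' actual prompts, helpers, and limits differ:

```python
import contextlib
import io
import re

def run_rlm(query: str, context: str, llm_complete, max_iterations: int = 20) -> str:
    """Minimal root-RLM loop. llm_complete(prompt) -> str stands in for any
    chat-completions call; prompt wording and limits here are illustrative."""
    namespace = {"context": context}  # the full context lives only in the REPL namespace
    transcript = (
        f"Query: {query}\n"
        f"A variable named context ({len(context)} chars) is loaded in a Python REPL.\n"
        "Reply with code wrapped in <python>...</python> to inspect it (peek, regex, "
        "chunk, call sub-LLMs), or finish with FINAL(answer) or FINAL_VAR(variable_name).\n"
    )
    for _ in range(max_iterations):
        reply = llm_complete(transcript)
        transcript += "\n" + reply

        # Termination: answer stored in a REPL variable, or given inline.
        if m := re.search(r"FINAL_VAR\((\w+)\)", reply):
            return str(namespace.get(m.group(1), ""))
        if m := re.search(r"FINAL\((.*)\)", reply, re.DOTALL):
            return m.group(1).strip()

        # Otherwise execute the emitted code and feed its printed output back.
        if code := re.search(r"<python>(.*?)</python>", reply, re.DOTALL):
            buffer = io.StringIO()
            try:
                with contextlib.redirect_stdout(buffer):
                    exec(code.group(1), namespace)  # note: no sandboxing in this sketch
                observation = buffer.getvalue()
            except Exception as exc:
                observation = f"Error: {exc!r}"
            transcript += "\nREPL output:\n" + observation[:2000]  # cap what flows back
    return "No answer produced within the iteration budget."
```

The key property is that the full context never enters the root LM's prompt; only what the generated code chooses to print flows back into the transcript.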
Problem solved
Frontier LLMs degrade on long context (the "context rot" phenomenon), and the hard context-window limit (e.g. 272k tokens for GPT-5) prevents handling corpora of millions of tokens. Standard solutions such as RAG, ReAct with a retriever, and agentic problem-decomposition scaffolds impose a fixed strategy, whereas RLM hands the context-decomposition decision to the model itself.
Key mechanisms
Strengths & limitations
Components
The main language model (depth=0). Receives only the query q and the information that context C exists in the environment. Decides the decomposition strategy and emits code to the REPL.
Official
A Python notebook environment (Jupyter-like) with the context C loaded as an in-memory variable. It lets the root LM peek, grep, chunk, and invoke recursive LMs as function calls (a sketch of such helpers follows below).
Official
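A minimal sketch of what such an environment might expose; the helper names (peek, grep_context, llm_query) are assumptions rather than the official API:

```python
import re

def build_repl_namespace(context: str, llm_complete):
    """Illustrative helpers exposed to the root LM's generated code. The point is
    that the full context is reachable only as data plus cheap inspection utilities."""
    def peek(start: int = 0, n: int = 2000) -> str:
        # Read a small window instead of the whole context.
        return context[start:start + n]

    def grep_context(pattern: str, window: int = 200) -> list:
        # Return short snippets surrounding each regex match.
        return [context[max(m.start() - window, 0):m.end() + window]
                for m in re.finditer(pattern, context)]

    def llm_query(prompt: str, fragment: str) -> str:
        # Delegate one fragment to a (possibly cheaper) sub-model.
        return llm_complete(prompt + "\n\n" + fragment)

    return {"context": context, "peek": peek,
            "grep_context": grep_context, "llm_query": llm_query}
```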
Sub-instances of the LLM (depth=1) invoked from the REPL over fragments or transformed context. In the reference setup these are GPT-5 or GPT-5-mini, with depth capped at 1 (a depth-guard sketch follows below).
Official
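One way the depth cap could be enforced, sketched with assumed names; run_rlm stands for whatever entry point launches a fresh RLM instance over a fragment:

```python
from typing import Callable

def make_rlm_call(run_rlm: Callable[[str, str], str],
                  llm_complete: Callable[[str], str],
                  depth: int, max_depth: int = 1) -> Callable[[str, str], str]:
    """Hypothetical depth guard for recursive calls; names and signatures are assumptions."""
    def rlm_call(sub_query: str, sub_context: str) -> str:
        if depth >= max_depth:
            # At the cap, fall back to a plain completion over the fragment.
            return llm_complete(sub_query + "\n\n" + sub_context)
        # Below the cap, spawn an isolated sub-RLM with its own environment.
        return run_rlm(sub_query, sub_context)
    return rlm_call
```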
Termination mechanism: FINAL(answer) returns the answer directly; FINAL_VAR(var_name) returns the value of a variable from REPL memory.
Implementation
The reference implementation runs each recursive LM call synchronously (blocking) and without prefix caching, stretching a single query from seconds to minutes.
The max-iterations limit bounds the number of REPL steps but does not guarantee a hard cap on API spend or wall-clock time; outlier queries can be very expensive (a budget-guard sketch follows below).
The REPL executes arbitrary LM-generated code over user data — without isolation this is a security risk (exfiltration, side effects).
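A hedged sketch of the kind of hard cap the iteration limit does not provide: every LLM call, root or recursive, is charged against a shared token budget. The character-based token estimate and the 500k default are illustrative assumptions; a real guard would use the provider's usage numbers and a wall-clock deadline as well.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a query exceeds its allotted spend."""

def with_budget(llm_complete, max_total_tokens: int = 500_000):
    """Wrap a completion helper so all calls share one token budget."""
    spent = 0
    def guarded(prompt: str) -> str:
        nonlocal spent
        estimate = len(prompt) // 4  # rough chars-per-token heuristic
        if spent + estimate > max_total_tokens:
            raise BudgetExceeded(f"budget of {max_total_tokens} tokens exhausted")
        reply = llm_complete(prompt)
        spent += estimate + len(reply) // 4
        return reply
    return guarded
```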
Evolution
October 2025 — Alex Zhang publishes the RLM concept on the MIT CSAIL blog along with first results on OOLONG and BrowseComp-Plus.
December 31, 2025 — first arXiv version (2512.24601).
May 11, 2026 — v3 with the first post-trained RLM-specialized model: RLM-Qwen3-8B outperforms the base Qwen3-8B by 28.3% and approaches GPT-5 on three long-context tasks.
Technical details
Hyperparameters (configurable axes)
Recursion depth: how many levels of RLM calls are allowed. The reference implementation uses depth=1 (root plus one level of sub-calls). Greater depth yields stronger systems but increases cost.
Max iterations: the limit on REPL steps the root LM may take before forced termination. Controls cost and time but does not guarantee a hard API budget.
Recursive-call model: which LLM handles sub-calls (e.g. GPT-5-mini as a cheap decomposer under a GPT-5 root). Strongly impacts cost and quality.
Environment: what is actually exposed to the root LM (REPL, vector DB, filesystem). Reference: a Python REPL with the context as a variable. A configuration sketch covering these axes follows below.
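A hypothetical grouping of these axes into one configuration object; the field names and defaults below are illustrative, not the authors' API:

```python
from dataclasses import dataclass

@dataclass
class RLMConfig:
    """Assumed configuration container for the axes described above."""
    max_depth: int = 1                   # levels of recursive RLM calls (reference: 1)
    max_iterations: int = 20             # REPL steps before forced termination
    root_model: str = "gpt-5"            # model driving the REPL
    recursive_model: str = "gpt-5-mini"  # cheaper model handling sub-calls
    environment: str = "python_repl"     # what the root LM is given (REPL, vector DB, ...)
```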
Computational complexity
RLM was evaluated on four task categories.
(1) OOLONG trec_coarse: distributional queries over ~3,000–6,000 entries (contexts of 132k and 263k tokens). RLM(GPT-5-mini) beat GPT-5 by ~114% at 132k and ~49% at 263k; the no-recursion ablation (REPL only, no sub-calls) drops by ~10%.
(2) BrowseComp-Plus: 20 randomly sampled multi-hop queries over a document corpus (10–1,000 docs). RLM(GPT-5) was the only method maintaining near-perfect accuracy at 1,000 docs.
(3) Four long-context benchmarks (full v3 paper): median +26% vs compaction, +130% vs CodeAct sub-calls, +13% vs Claude Code.
(4) RLM-Qwen3-8B (post-training): +28.3% over base Qwen3-8B, approaching GPT-5 on 3 of 4 tasks.
Important caveat: the BrowseComp-Plus results are based on 20 queries; the authors explicitly flag them as preliminary.
Execution paradigm
Context routing is emergent — learned from trajectories or improvised by the model at inference time.
The root LM itself decides (by generating code in the REPL) which context fragments to analyze and when to delegate to a recursive call.
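For illustration, a root LM answering a timeline-style query might emit something like the following into the REPL, assuming the context variable and an llm_query helper as in the environment sketch above; the chunk size, prompts, and aggregation strategy are the model's own choices, not the scaffold's:

```python
# Code a root LM might emit for a timeline-style query (illustrative).
chunk_size = 50_000
chunks = [context[i:i + chunk_size] for i in range(0, len(context), chunk_size)]

notes = []
for chunk in chunks:
    # Delegate each fragment to a sub-call, keeping only a condensed result.
    notes.append(llm_query("List every dated event in this text.", chunk))

answer = llm_query("Merge these notes into one chronological timeline.", "\n".join(notes))
print(answer[:500])  # inspect before terminating
# The model would then end its next reply with FINAL_VAR(answer).
```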
Parallelism
Recursive calls over chunks can run in parallel (e.g. partition+map), but the authors' reference implementation is blocking (each sub-call sequential). The authors mark this as significant low-hanging fruit for systems-level optimization.
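A minimal sketch of that optimization: recursive calls over independent chunks are I/O-bound API requests, so a thread pool suffices (llm_query is the assumed sub-call helper from the earlier sketches):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_map_llm(chunks, sub_query, llm_query, max_workers: int = 8):
    """Run sub-calls over independent chunks concurrently."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves chunk order, so downstream aggregation stays deterministic.
        return list(pool.map(lambda chunk: llm_query(sub_query, chunk), chunks))
```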
Hardware requirements
RLM is an inference paradigm over existing LLMs — independent of the underlying model's hardware.