Inference

RLM (Recursive Language Models)

2025 · Research · Published
Key innovation
The LLM receives the long context as a variable in a REPL environment and decides for itself how to partition, search, and delegate it to recursive sub-calls of smaller model instances, rather than reading the entire context at once.
Category
Inference
Abstraction level
Paradigm
Operation level
Inference · System · Orchestration · Agent runtime
Use cases
Handling prompts >10M tokens without a retriever
Long-context Q&A (OOLONG, multi-hop over tens of thousands of entries)
Deep Research over ~100k document corpora (BrowseComp-Plus)
Programmatic history processing (LoCoDiff, git log)
Long-output generation (e.g. BibTeX from a list of 100+ papers)
Mitigating context rot in long Claude Code / Cursor sessions
Scaling test-time compute as an alternative to CoT and agents

How it works

The root LM receives only the query q and the fact that context C exists as a variable in the REPL. The environment (typically a Python REPL / Jupyter kernel) lets it execute arbitrary code over that variable: peek (read the first N characters), grep (regex search), partition + map (chunk the context and issue parallel recursive LLM calls over the fragments), and summarization (condense subsets). Each recursive call RLM_M(q̂, Ĉ) spawns an isolated sub-RLM instance with its own environment; its result is returned to the caller's environment. In the reference implementation the authors capped depth at 1 (the root can call LLMs but not other RLMs). The final answer is returned via a FINAL(...) tag (directly) or FINAL_VAR(variable_name) (from REPL memory).
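A minimal sketch of this loop is shown below. It assumes a generic chat(messages) -> str client and a plain dict as the REPL namespace; the names rlm, run_in_repl, ctx, and llm are illustrative and not taken from the authors' reference implementation.

```python
import contextlib
import io
import re


def run_in_repl(code: str, env: dict) -> str:
    """Execute model-emitted code in the shared namespace and capture its prints."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, env)  # NOTE: sandbox this in any real deployment
    except Exception as exc:
        buf.write(f"Error: {exc!r}")
    return buf.getvalue() or "(no output)"


def rlm(query: str, context: str, chat, max_iterations: int = 20) -> str:
    """Recursive Language Model call: the root LM never reads `context` directly.

    `chat(messages) -> str` stands in for any chat-completion API.
    The context lives only inside the REPL namespace, as the variable `ctx`.
    """
    env = {
        "ctx": context,
        # Sub-calls: with depth capped at 1 (as in the reference setup) these are
        # plain LLM calls; the general paradigm could recurse into rlm() here.
        "llm": lambda q, c: chat([{"role": "user", "content": f"{q}\n\n{c}"}]),
    }
    messages = [
        {"role": "system", "content": (
            "A long context is stored in the Python variable `ctx`. "
            "Write Python to inspect it (peek, regex, chunk) and call "
            "llm(query, text) on pieces of it. When done, reply with "
            "FINAL(answer) or FINAL_VAR(variable_name).")},
        {"role": "user", "content": query},
    ]
    for _ in range(max_iterations):
        reply = chat(messages)
        if m := re.search(r"FINAL_VAR\((\w+)\)", reply):
            return str(env.get(m.group(1), ""))   # answer built up in REPL memory
        if m := re.search(r"FINAL\((.*)\)", reply, re.S):
            return m.group(1).strip()             # answer given inline
        output = run_in_repl(reply, env)          # otherwise: treat the reply as code
        messages += [{"role": "assistant", "content": reply},
                     {"role": "user", "content": output}]
    return "max_iterations reached without FINAL"
```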

Problem solved

Frontier LLMs degrade on long contexts (the "context rot" phenomenon), and hard context windows (e.g. 272k tokens for GPT-5) rule out corpora of millions of tokens. Standard solutions such as RAG, ReAct with a retriever, or agentic problem-decomposition scaffolds impose a fixed strategy; RLM instead hands the context-decomposition decision to the model itself.

Key mechanisms

Context C stored as a variable in the REPL environment — the model never reads it all at once
Root LM (depth=0) sees only the query q and interacts with the context by generating Python code
Peek — reading the first N characters of context to recognize its structure
Grep / regex — narrowing the search space by pattern-matching in the context variable
Partition + Map — chunking the context and sequential or parallel recursive LM calls over fragments
Summarization — condensing context subsets via sub-call before returning to the root LM
Recursive LM calls as function calls in the REPL — a sub-LLM instance (depth=1) returns its result to the calling environment
Termination via FINAL(answer) or FINAL_VAR(var_name) — answer directly or from REPL memory
Drop-in compatibility — an RLM call is interface-identical to an LLM call (same input/output)
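To make the mechanisms above concrete, here are illustrative snippets of the kind of code the root LM might emit into the REPL. They assume the context is bound to a variable ctx and recursive calls are exposed as a function llm(query, text); both names are assumptions, not the paper's exact interface, and the regex and chunk sizes are arbitrary.

```python
# Peek: read the first N characters to learn the structure of the context.
print(ctx[:2000])

# Grep / regex: narrow the search space with pattern matching over the variable.
import re
hits = [m.start() for m in re.finditer(r"Invoice #\d+", ctx)]
print(len(hits), hits[:10])

# Partition + map: chunk the context and fan recursive sub-calls out over the pieces.
chunks = [ctx[i:i + 50_000] for i in range(0, len(ctx), 50_000)]
partials = [llm("List every invoice number you can find.", chunk) for chunk in chunks]

# Summarization: condense a subset before reasoning over it at the root.
summary = llm("Summarize the key events in 10 bullets.", ctx[hits[0]:hits[0] + 30_000])
print(summary)
```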

Strengths & limitations

Strengths
Handles >10M token contexts without additional foundation model training — only an inference wrapper needed
RLM(GPT-5-mini) outperforms GPT-5 by over 114% on OOLONG (132k tokens) at comparable query cost
On OOLONG (263k tokens) RLM(GPT-5-mini) still beats GPT-5 by ~49%
On BrowseComp-Plus (1,000 documents) RLM(GPT-5) is the only method maintaining near-perfect accuracy
General paradigm — the REPL environment can be replaced by any other environment (vector DB, filesystem)
Interpretable trajectories — peek/grep/partition+map readable as REPL logs, no black-box latents
RLM-Qwen3-8B (post-trained) outperforms base Qwen3-8B by 28.3% and approaches GPT-5 on 3 tasks
Self-improving as the base model improves — stronger LLM = stronger RLM without framework changes
Official open-source code on GitHub (alexzhang13/rlm, alexzhang13/rlm-minimal), CC BY 4.0 paper license
Limitations
Reference implementation is blocking and without prefix caching — each query takes seconds to minutes
No hard cost budget: max_iterations bounds steps but outlier queries can consume significant API spend
Executing arbitrary Python code over user data in the REPL requires sandboxing — risk of exfiltration or side effects without isolation
Recursion depth validated to 1; behavior at depth>1 or with >4 agents has not been systematically studied
BrowseComp-Plus performance based on 20 random queries — too small a sample for full generalization
Does not eliminate context rot in the sub-calls themselves — recursive LLMs still face window limits when processing chunks
Requires a base model capable of generating correct Python code — weaker models may produce faulty REPL code
Decomposition strategies (peek, grep, partition+map) are emergent and may be suboptimal for specific tasks

Components

Root LM: Context-decomposition orchestrator

The main language model (depth=0). Receives only the query q and the information that context C exists in the environment. Decides the decomposition strategy and emits code to the REPL.

Official

REPL environment (Python): Memory and context-access mediator

A Python notebook (Jupyter-like) with context C loaded as an in-memory variable. Lets the root LM peek, grep, chunk, and invoke recursive LMs as function calls.

Python REPL / Jupyter: reference variant from the RLM paper.
Other execution environment: the authors note that the choice of environment is flexible; the REPL is an example, not a requirement.

Official

Recursive LM calls: Sub-query executors

LLM sub-instances (depth=1) invoked from the REPL over fragments or transformed context. In the reference setup these are GPT-5 or GPT-5-mini, with depth limited to 1.

Official

FINAL / FINAL_VAR tags: Answer-return protocol

Termination mechanism: FINAL(answer) returns the answer directly; FINAL_VAR(var_name) returns the value of a variable from REPL memory.
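A small sketch of how these tags could be resolved against the REPL namespace; the regex patterns and helper name are assumptions, not the exact format used in the paper.

```python
import re

def resolve_final(reply: str, namespace: dict):
    """Return (done, answer). FINAL(...) carries the answer inline;
    FINAL_VAR(name) dereferences a variable built up in REPL memory."""
    if m := re.search(r"FINAL_VAR\(\s*([A-Za-z_]\w*)\s*\)", reply):
        return True, namespace.get(m.group(1), f"<undefined variable {m.group(1)}>")
    if m := re.search(r"FINAL\((.*)\)", reply, re.S):
        return True, m.group(1).strip()
    return False, None

# e.g. resolve_final("FINAL_VAR(bibtex_output)", {"bibtex_output": "@article{...}"})
```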

Implementation

Implementation pitfalls
Lack of asynchrony in sub-calls · High

The reference implementation runs each recursive LM call blocking and without prefix caching, stretching a single query from seconds to minutes.

Fix: Implement an async worker pool and prefix caching over shared chunk prefixes.
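A sketch of the fan-out part of that fix, using a thread pool over a hypothetical blocking llm_call(query, chunk) client. Sub-calls are I/O-bound, so threads usually suffice; asyncio or provider batch APIs are alternatives, and keeping a shared prompt prefix across chunks lets provider-side prefix caching apply.

```python
from concurrent.futures import ThreadPoolExecutor

def map_chunks(query: str, chunks: list[str], llm_call, max_workers: int = 8) -> list[str]:
    """Run recursive sub-calls over chunks concurrently instead of one by one;
    results come back in the original chunk order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda chunk: llm_call(query, chunk), chunks))
```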
No hard cost/time budget · Medium

Max iterations bounds the number of REPL steps but does not guarantee a hard cap on API spend or wall-clock time; outlier queries can be very expensive.

Fix: Add hard watchdogs on cumulative API cost and wall-clock time per call.
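A sketch of such a watchdog, charged after every sub-call; the token prices and caps below are placeholders.

```python
import time

class BudgetExceeded(RuntimeError):
    """Raised when a query exceeds its hard cost or wall-clock cap."""

class Budget:
    def __init__(self, max_usd: float = 2.0, max_seconds: float = 300.0):
        self.max_usd, self.max_seconds = max_usd, max_seconds
        self.spent_usd, self.start = 0.0, time.monotonic()

    def charge(self, prompt_tokens: int, completion_tokens: int,
               usd_per_1k_prompt: float = 0.001, usd_per_1k_completion: float = 0.004):
        # Accumulate spend after each API call and enforce both caps.
        self.spent_usd += (prompt_tokens * usd_per_1k_prompt
                           + completion_tokens * usd_per_1k_completion) / 1000
        if self.spent_usd > self.max_usd:
            raise BudgetExceeded(f"cost cap hit: ${self.spent_usd:.2f}")
        if time.monotonic() - self.start > self.max_seconds:
            raise BudgetExceeded("wall-clock cap hit")
```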
Python REPL code sandboxing · High

The REPL executes arbitrary LM-generated code over user data — without isolation this is a security risk (exfiltration, side effects).

Fix: Run the REPL in a sandbox (container, restricted net/disk permissions, library allowlist).
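One possible isolation setup, sketched below: each REPL step runs in a short-lived container with no network access and capped resources. The image name, mount path, and limits are placeholders; gVisor, Firecracker, or a restricted in-process interpreter are alternative isolation layers.

```python
import os
import pathlib
import subprocess
import tempfile

def run_sandboxed(code: str, workdir: str, timeout: int = 60) -> str:
    """Execute model-generated code in an isolated container: no network,
    capped memory and CPU, read-only root FS, only `workdir` mounted."""
    fd, path = tempfile.mkstemp(suffix=".py", dir=workdir)
    os.close(fd)
    pathlib.Path(path).write_text(code)
    result = subprocess.run(
        ["docker", "run", "--rm",
         "--network", "none",           # blocks exfiltration over the network
         "--memory", "2g", "--cpus", "1",
         "--read-only",                 # immutable root filesystem
         "-v", f"{workdir}:/work",      # only the working directory is visible
         "python:3.12-slim", "python", f"/work/{pathlib.Path(path).name}"],
        capture_output=True, text=True, timeout=timeout,
    )
    return result.stdout + result.stderr
```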

Evolution

Original paper · 2025 · arXiv preprint (MIT CSAIL) · Alex L. Zhang
Recursive Language Models
Alex L. Zhang, Tim Kraska, Omar Khattab
2025
Recursive Language Models blog post
Inflection point

October 2025 — Alex Zhang publishes the RLM concept on the MIT CSAIL blog along with first results on OOLONG and BrowseComp-Plus.

2025
arXiv v1 preprint

December 31, 2025 — first arXiv version (2512.24601).

2026
arXiv v3 preprint + RLM-Qwen3-8B
Inflection point

May 11, 2026 — v3 with the first post-trained RLM-specialized model: RLM-Qwen3-8B outperforms the base Qwen3-8B by 28.3% and approaches GPT-5 on three long-context tasks.

Technical details

Hyperparameters (configurable axes)

Maximum recursion depth · High

How many levels of RLM calls are allowed. The reference implementation uses depth=1 (root + one sub-call level). Greater depth may yield more capable systems but increases cost.

Max root-LM iterations · High

Limit on REPL steps performed by the root LM before forced termination. Controls cost and time but does not guarantee a hard API budget.

Model for recursive sub-calls · Critical

Which LLM handles sub-calls (e.g. GPT-5-mini as a cheap decomposer under a GPT-5 root). Strongly impacts cost and quality.

Environment type · High

What is actually exposed to the root LM (REPL, vector DB, filesystem). Reference: Python REPL with context as a variable.
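These axes could be collected into a single configuration object; a minimal sketch with illustrative defaults mirroring the reference setup (field names are not from the official repo).

```python
from dataclasses import dataclass

@dataclass
class RLMConfig:
    root_model: str = "gpt-5"           # model that drives the REPL loop
    subcall_model: str = "gpt-5-mini"   # cheaper model for recursive sub-calls
    max_depth: int = 1                  # reference setup: root + one sub-call level
    max_iterations: int = 20            # REPL steps before forced termination (illustrative)
    environment: str = "python_repl"    # alternatives: vector DB, filesystem, ...
```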

Computational complexity

Computational characteristics
Cost per query: comparable to a direct GPT-5 call at the median; outlier queries may be more expensive
Latency: seconds to minutes per query with the reference blocking implementation
Scaling with depth: O(d × n) LLM calls for depth d and n chunks per level — at depth=1 linear in n
RLM-Qwen3-8B: +28.3% over base Qwen3-8B on 3 long-context tasks (post-training)
OOLONG 132k: RLM(GPT-5-mini) +114% over GPT-5 (raw score more than double) at comparable cost
OOLONG 263k: RLM(GPT-5-mini) +49% over GPT-5; counting problems reveal quality degradation as context grows
BrowseComp-Plus: the only method maintaining near-100% accuracy at 1,000 documents (~5k words/doc, ~5M tokens)
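A back-of-the-envelope call-count bound following the O(d × n) characterization above, assuming a fixed fan-out of n chunks per recursion level; real trajectories are adaptive, so this is only a rough upper bound.

```python
def llm_calls_upper_bound(depth: int, chunks_per_level: int, root_iterations: int) -> int:
    """Rough upper bound on LLM calls for one RLM query: the root takes up to
    `root_iterations` REPL steps, and each recursion level can fan out over
    `chunks_per_level` sub-calls."""
    return root_iterations + depth * chunks_per_level

# Depth 1 with ~20 chunks and ~20 root steps stays around 40 calls.
print(llm_calls_upper_bound(depth=1, chunks_per_level=20, root_iterations=20))
```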
Benchmark notes

RLM was evaluated on four task categories. (1) OOLONG trec_coarse: distributional queries over ~3,000–6,000 entries (contexts of 132k and 263k tokens). RLM(GPT-5-mini) beat GPT-5 by ~114% at 132k and ~49% at 263k; the no-recursion ablation (REPL only, no sub-calls) drops by ~10%. (2) BrowseComp-Plus: 20 randomly sampled multi-hop queries over a document corpus (10–1,000 docs). RLM(GPT-5) was the only method maintaining near-perfect accuracy at 1,000 docs. (3) Four long-context benchmarks (full v3 paper): median +26% vs compaction, +130% vs CodeAct sub-calls, +13% vs Claude Code. (4) RLM-Qwen3-8B (post-training): +28.3% over base Qwen3-8B, approaching GPT-5 on 3 of 4 tasks. Important caveat: BrowseComp-Plus results are based on 20 queries — the authors explicitly flag these as preliminary.

Execution paradigm

Primary mode
conditional

Context routing is emergent — learned from trajectories or improvised by the model at inference time.

Activation pattern
input_dependent
Routing mechanism

The root LM itself decides (by generating code in the REPL) which context fragments to analyze and when to delegate to a recursive call.

Parallelism

Parallelism level
partially_parallel

Recursive calls over chunks can run in parallel (e.g. partition+map), but the authors' reference implementation is blocking (each sub-call sequential). The authors mark this as significant low-hanging fruit for systems-level optimization.

Scope
inference · across_devices

Hardware requirements

Primary

RLM is an inference paradigm over existing LLMs — independent of the underlying model's hardware.