RecursiveMAS: AI agents communicate without tokens, up to 2.4× faster with 75% fewer tokens

Researchers from the University of Illinois Urbana-Champaign and Stanford University have developed RecursiveMAS, a framework that removes a fundamental cost of modern multi-agent systems: communication through text. Instead of generating and parsing token sequences, agents exchange internal latent representations (embeddings), achieving up to 2.4× faster inference and a reduction in token usage of about 75% compared to an equivalent text-based system. Code and trained model weights are available under the Apache 2.0 license.

Key takeaways

RecursiveMAS framework developed by UIUC and Stanford researchers, published May 2026
Agents communicate through embeddings instead of text, giving 1.2× to 2.4× faster inference depending on configuration
75.6% token reduction by the third recursion round compared to Recursive-TextMAS
8.3% accuracy improvement over the strongest baselines across 9 benchmarks
Training costs less than half as much as full fine-tuning and updates only ~0.31% of parameters

The problem: agents generating text for agents

Multi-agent systems (MAS) work on a simple principle: one model generates a text response and passes it to the next, which processes it and passes its own output along. Each step involves a full token generation cycle: the sending model must translate its internal reasoning into text so the next agent can read it, and the receiving model must then translate that text back into a vector representation.

This double translation creates three kinds of losses. First, latency: each agent waits for the previous one to finish generating text before starting its own processing. Second, token cost: intermediate reasoning that the user never needs must be encoded as visible token sequences. Third, training difficulty: updating a full multi-agent system via gradients requires backpropagation through text generation — a computationally demanding operation.

The standard approach to improving the system — per-agent fine-tuning or LoRA — does not solve the problem, because each agent still needs to communicate with the rest of the system through text.

RecursiveMAS: recursive architecture

The authors drew inspiration from recursive LLMs (RLMs), where instead of a linear layer stack, a set of layers processes data in a loop — the same "layer" handles multiple passes. RecursiveMAS extends this principle to an entire multi-agent system.

In practical terms: each agent acts like a single recursive layer. Instead of generating text for the next agent, it passes its last-layer hidden states — which contain the full semantic representation of its reasoning — directly. The final agent in the chain sends its embeddings back to the first, starting a new recursion round. Text appears only once: when the last agent returns the final answer in the last round.
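The control flow described above can be sketched in a few lines. This is a toy illustration only, not the authors' implementation: the names run_recursive_mas, planner, solver and decode are hypothetical, the "agents" are plain functions on small vectors rather than language models, and decoding is modeled as a separate function for simplicity.

```python
from typing import Callable, List

Vector = List[float]
Agent = Callable[[Vector], Vector]  # hypothetical interface: latent in, latent out


def run_recursive_mas(agents: List[Agent], query_latent: Vector,
                      rounds: int, decode: Callable[[Vector], str]) -> str:
    """Pass a latent state around the agent chain for `rounds` rounds, then decode once."""
    state = query_latent
    for _ in range(rounds):
        for agent in agents:
            # The hand-off between agents is a vector, never a token sequence.
            state = agent(state)
        # After the last agent, the state loops back to the first agent.
    # Text is produced a single time, after the final round.
    return decode(state)


# Toy stand-ins for real agents (assumptions for illustration only).
def planner(h: Vector) -> Vector:
    return [x + 1.0 for x in h]


def solver(h: Vector) -> Vector:
    return [x * 2.0 for x in h]


def decode(h: Vector) -> str:
    return "answer:" + ",".join(f"{x:.1f}" for x in h)


result = run_recursive_mas([planner, solver], [0.0, 1.0], rounds=2, decode=decode)
print(result)  # answer:6.0,10.0
```

The point of the sketch is the shape of the loop: intermediate results never leave vector form, and the only call that produces text sits outside both loops.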

The key technical element is the RecursiveLink module, a lightweight, two-layer component used in two roles. The inner RecursiveLink operates inside an agent: instead of decoding text during reasoning, it maps generated embeddings back into the same model's input embedding space, creating a loop of latent thoughts. The outer RecursiveLink serves as the bridge between agents: because different models (Qwen, Llama-3, Gemma3, Mistral) may have embeddings in spaces of different dimensions, this layer aligns representations across models.
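To make the dimension-alignment idea concrete, here is a minimal sketch of a two-layer projection in pure Python. The article only says the module is lightweight and two-layer, so everything else here is an assumption: the class name TwoLayerLink, the tanh activation, the hidden width, the random initialization, and the toy dimensions.

```python
import math
import random
from typing import List

Vector = List[float]


class TwoLayerLink:
    """Hypothetical two-layer projection from a d_in-dim space to a d_out-dim space."""

    def __init__(self, d_in: int, d_hidden: int, d_out: int, seed: int = 0):
        rng = random.Random(seed)
        s1 = 1.0 / math.sqrt(d_in)
        s2 = 1.0 / math.sqrt(d_hidden)
        # w1 has shape (d_hidden, d_in); w2 has shape (d_out, d_hidden).
        self.w1 = [[rng.uniform(-s1, s1) for _ in range(d_in)] for _ in range(d_hidden)]
        self.w2 = [[rng.uniform(-s2, s2) for _ in range(d_hidden)] for _ in range(d_out)]

    @staticmethod
    def _matvec(w: List[Vector], x: Vector) -> Vector:
        return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

    def __call__(self, h: Vector) -> Vector:
        hidden = [math.tanh(v) for v in self._matvec(self.w1, h)]  # assumed activation
        return self._matvec(self.w2, hidden)

    def num_parameters(self) -> int:
        return sum(len(r) for r in self.w1) + sum(len(r) for r in self.w2)


# Inner link: same model on both sides, so input and output widths match.
inner = TwoLayerLink(d_in=8, d_hidden=4, d_out=8)
# Outer link: bridges two models with different hidden sizes (toy values).
outer = TwoLayerLink(d_in=8, d_hidden=4, d_out=6)

h = [0.1] * 8
print(len(inner(h)), len(outer(h)), outer.num_parameters())  # 8 6 56
```

In this sketch the inner and outer links differ only in their output width, which is what lets a single small module type cover both the within-agent loop and the cross-model bridge.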

Base model weights remain frozen; training updates only the RecursiveLink parameters. That amounts to approximately 13 million parameters, or 0.31% of the total parameters of the frozen models. As a result, training costs less than half as much as full fine-tuning.
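As a quick sanity check, the two figures together imply the combined size of the frozen models. The variable names below are illustrative, and both inputs are rounded in the article, so the result is only approximate.

```python
link_params = 13_000_000     # trainable RecursiveLink parameters (from the article)
trainable_fraction = 0.0031  # 0.31% of all frozen-model parameters (from the article)

# If 13M parameters are 0.31% of the total, the total is 13M / 0.0031.
implied_total = link_params / trainable_fraction
print(f"implied frozen parameters: {implied_total / 1e9:.2f}B")  # 4.19B
```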

Results: what swapping text for embeddings delivers

The researchers evaluated RecursiveMAS across 9 benchmarks covering mathematics, science and medicine, code generation, and search-based question answering. Comparisons included standalone models enhanced with LoRA or full fine-tuning, alternative multi-agent frameworks (Mixture-of-Agents, TextGrad), and Recursive-TextMAS — the same recursive loop structure but with text-based communication.

The average accuracy improvement over the strongest baselines was 8.3%. The largest gap appears on reasoning-intensive tasks: RecursiveMAS outperformed TextGrad by 18.1% on AIME2025 and 13% on AIME2026.

Token savings accumulate over rounds: in the first recursion round, consumption is 34.6% lower than Recursive-TextMAS. By the third round, the advantage grows to 75.6%. Inference speedup ranges from 1.2× to 2.4× depending on configuration.
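Percentage reductions and "times fewer" factors are easy to confuse, so the conversion is worth spelling out. The helper name reduction_to_factor is hypothetical; the percentages are the ones reported above.

```python
def reduction_to_factor(reduction_pct: float) -> float:
    """Convert 'X% fewer tokens' into 'N times fewer tokens'."""
    return 1.0 / (1.0 - reduction_pct / 100.0)


for label, pct in [("round 1", 34.6), ("round 3", 75.6)]:
    print(f"{label}: {pct}% fewer tokens = {reduction_to_factor(pct):.2f}x fewer")
# round 1: 34.6% fewer tokens = 1.53x fewer
# round 3: 75.6% fewer tokens = 4.10x fewer
```

A 75.6% reduction therefore means the text-based baseline uses roughly four times as many tokens by the third round. That is a statement about token counts, not about wall-clock time, which is why it can sit alongside a smaller inference speedup.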

An additional benefit is shared backbone handling. If two agents in the system use the same foundation model in different roles, only one copy needs to be loaded into GPU memory — a single model copy with two separate RecursiveLink parameter sets.
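The sharing pattern can be illustrated with a small cache. This is a sketch under stated assumptions, not the framework's API: BackboneRegistry, Agent, fake_loader and the model name "model-a" are all hypothetical, and the loader returns a placeholder object instead of real weights.

```python
from typing import Callable, Dict, List


class BackboneRegistry:
    """Hypothetical cache that loads each foundation model at most once."""

    def __init__(self, loader: Callable[[str], object]):
        self._loader = loader
        self._cache: Dict[str, object] = {}
        self.load_count = 0

    def get(self, name: str) -> object:
        if name not in self._cache:
            self._cache[name] = self._loader(name)
            self.load_count += 1
        return self._cache[name]


class Agent:
    def __init__(self, role: str, backbone: object, link_params: List[float]):
        self.role = role
        self.backbone = backbone        # shared and frozen
        self.link_params = link_params  # private to this agent and trainable


def fake_loader(name: str) -> object:
    # Stand-in for loading real weights into GPU memory.
    return {"name": name}


registry = BackboneRegistry(fake_loader)
planner = Agent("planner", registry.get("model-a"), link_params=[0.0] * 4)
critic = Agent("critic", registry.get("model-a"), link_params=[0.0] * 4)

print(registry.load_count)                        # 1
print(planner.backbone is critic.backbone)        # True
print(planner.link_params is critic.link_params)  # False
```

Because the backbone is frozen, sharing it between roles is safe: the only state that differs between the two agents is their small set of link parameters.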

Why it matters

Production deployments of multi-agent AI systems currently run into two walls: token cost and latency. Every inter-agent communication step is an API call with a bill and a delay from sequential generation. In systems with multiple planning rounds — such as coding agents, multi-step search-and-verification systems, or medical agents — those costs quickly become prohibitive.

RecursiveMAS proposes a qualitatively different approach: instead of optimizing text generation, it eliminates it entirely from the inter-agent communication layer. This is an architectural change, not a parametric one. For organizations deploying multi-model agentic pipelines, it means potentially cheaper operation at higher accuracy: the reported figures are up to 2.4× faster inference and up to 75.6% fewer tokens than the text-based equivalent.

The training approach is also significant: freezing base models and updating only lightweight connector layers lets organizations build and improve multi-agent systems without training large models from scratch or incurring LoRA costs for each one. Code and weights are publicly available under Apache 2.0 on GitHub, lowering the entry barrier for both research teams and enterprise adopters.

What's next?

The authors have published code and model weights on GitHub and Hugging Face (Apache 2.0), so the framework is available for production testing now
Experiments used open-weights models (Qwen, Llama-3, Gemma3, Mistral) — the next step is validation on closed models and MoE architectures
A key open question remains: how the system scales beyond 3–4 agents and with very long contexts