
MoA (Mixture-of-Agents)

2024 · production · preprint
Key innovation
A layered multi-agent architecture in which each layer contains n agent-LLMs generating responses in parallel for the same query, after which all responses are aggregated and passed as context to the agents in the next layer. Wang et al. demonstrated the "collaborativeness of LLMs": an effect where open-source LLMs produce higher-quality responses when given access to other models' proposals, even when those proposals are individually weaker.
Category
Architecture
Abstraction level
Pattern
Operation level
System · Inference · Orchestration · Agent runtime
Use cases
Open-source conversational systems at GPT-4 level: MoA composition of multiple OSS LLMs delivers quality exceeding closed frontier models (Together MoA on AlpacaEval 2.0)
High-quality long-form response generation (essays, reports, technical analysis) where different models contribute different perspectives
Mathematical and logical reasoning: agents with different specializations (code, formal logic, language reasoning) critique each other
Translation and style adaptation: proposers from different models generate variants, the aggregator selects the best elements
Code review and refactoring: multiple LLMs propose fixes, the aggregator synthesizes a consensus
Hybrid OSS + commercial stacks: a cheaper way to reach GPT-4-class quality using MoA over OSS models instead of a single frontier API call
RAG enhancement: multiple proposer agents interpret retrieved context from different perspectives, the aggregator synthesizes the final answer

How it works

The MoA architecture consists of L proposer layers followed by a final aggregator. Each layer l contains n proposer agents A_{l,1}, ..., A_{l,n}, where each agent is a call to a specific LLM with a dedicated prompt. The input to layer l is (1) the original user query x and (2) the concatenation of all responses from layer l-1: y_{l-1} = [y_{l-1,1}, ..., y_{l-1,n}]. Each agent A_{l,i} receives a prompt of the form "Here is query x and proposals from the previous layer y_{l-1}; produce an improved response" and outputs y_{l,i}. All proposals y_{l,1}, ..., y_{l,n} within a layer are independent (parallel calls). The final aggregator receives the proposals from layer L and generates the final response. Model selection follows two criteria: (a) performance: stronger models on the given task type serve as aggregators; (b) diversity: proposers from different model families, for heterogeneous error profiles. The Together MoA reference uses 4 layers with 6 agents in each, plus the final aggregator.
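In code, this control flow is a short loop. Below is a minimal sketch of the topology described above, assuming a hypothetical call_llm placeholder in place of a real provider client; the model names and prompt wording are illustrative, not the reference configuration.

```python
import asyncio

# Placeholder LLM call: substitute a real provider client (e.g. an
# OpenAI-compatible chat-completions request). Returns a dummy string
# so the sketch runs end to end.
async def call_llm(model: str, prompt: str, max_tokens: int = 512) -> str:
    await asyncio.sleep(0)
    return f"<{model}: response to {len(prompt)}-char prompt>"

PROPOSERS = ["model-a", "model-b", "model-c"]  # illustrative; use diverse families
AGGREGATOR = "model-strongest"                 # strongest model for the task

def build_prompt(query: str, prev: list[str]) -> str:
    """Aggregate-and-Synthesize style prompt; wording is illustrative."""
    if not prev:
        return query
    proposals = "\n\n".join(f"[Proposal {i + 1}]\n{p}" for i, p in enumerate(prev))
    return (
        f"Query: {query}\n\nProposals from the previous layer:\n{proposals}\n\n"
        "Critically evaluate these proposals, identify errors, and produce "
        "an improved response."
    )

async def mixture_of_agents(query: str, num_layers: int = 4) -> str:
    prev: list[str] = []
    for _ in range(num_layers):
        # Parallel fan-out within a layer; synchronization barrier between layers.
        prev = list(await asyncio.gather(
            *(call_llm(m, build_prompt(query, prev)) for m in PROPOSERS)
        ))
    # Final step: a single aggregator synthesizes the last layer's proposals.
    return await call_llm(AGGREGATOR, build_prompt(query, prev))

print(asyncio.run(mixture_of_agents("Explain CAP theorem trade-offs.")))
```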

Problem solved

A single LLM, even the strongest, is bounded by its training data, biases, and knowledge gaps. Classical LLM ensembling (e.g., self-consistency, majority voting) scales poorly because all votes come from the same model or models with similar error profiles. Conversely, standalone calls to different LLMs (zero-shot multi-model voting) fail to exploit their mutual complementarity. MoA addresses this through structural composition: agents from different models see each other's proposals and can critique, refine, and synthesize them. This unlocks the complementarity of different models' strengths (e.g., math, coding, language) without training, fine-tuning, or external infrastructure.

Key mechanisms

Layered structure of L layers, where each layer contains n LLM agents
Parallel proposal generation within a single layer (parallel fan-out)
Concatenation of layer-l proposals as the input context for layer l+1
Aggregate-and-Synthesize prompt instructing agents how to critically evaluate and improve previous-layer proposals
A single aggregator in the final layer synthesizes the final answer
Model heterogeneity: proposers from different families (Qwen, Llama, Mixtral, DBRX) provide complementary error profiles
No fine-tuning: the entire architecture is defined by prompts and a static topology
Collaborativeness effect: even weaker models as proposers improve the quality of a stronger aggregator
Static configuration: model choice and topology are fixed upfront; no runtime routing

Strengths & limitations

Strengths
✓ Achieves GPT-4-class quality using only OSS models: Together MoA 65.1% AlpacaEval 2.0 vs GPT-4 Omni 57.5%
✓ Requires no fine-tuning or training: pure prompt engineering plus orchestration
✓ Exploits model complementarity: diverse proposers catch different error types
✓ Easy implementation: a single loop over layers with parallel calls, ~100 LoC in the reference implementation
✓ Quality scalability: more layers or proposers yield higher quality (with diminishing returns)
✓ Composability with other techniques: MoA + CoT, MoA + RAG, MoA + ReAct yield additional improvements
✓ Error detection via peer review: proposers cross-evaluate each other's proposals
✓ Open stack: reference code available, OSS models, no vendor lock-in
Limitations
✗ High token cost: full MoA consumes 20–100× the tokens of a single query
✗ High latency: sequential layers yield 30–120s per query (vs 5–15s for a single LLM)
✗ Context bloat: concatenating n proposals grows the context linearly with the number of proposers
✗ Aggregator sensitivity: a weak final model nullifies the gain from proposer diversity
✗ Echo chamber: proposers from similar model families can mutually reinforce errors
✗ No dynamic routing: all proposers are invoked for every query, even trivial ones
✗ Debugging difficulty: an error in the final answer can come from any layer and any agent
✗ Prompt sensitivity: different Aggregate-and-Synthesize prompt variants yield 5–10 pp differences on benchmarks
✗ No gradient propagation: MoA is defined at inference time and cannot be trained end-to-end

Components

Proposer Agent

An LLM agent in one of the L proposer layers that independently generates a response proposal based on the query and the previous layer's proposals. Typically n=6 proposers per layer in the reference Together MoA.

Aggregator Agent

An LLM agent in the final aggregation step that synthesizes the proposals from layer L into one coherent answer. Selected as the strongest model for the given task.

Layered Structure

A sequence of L layers where outputs of layer l are concatenated and fed as context to layer l+1. Typically L=4 in the reference Together configuration.

Aggregate-and-Synthesize Prompt

A special prompt instructing the agent how to interpret previous-layer proposals: critical evaluation, error identification, synthesis of best elements. The entire architecture is defined by this prompt plus a static topology.
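In practice this is an instruction prepended to the concatenated proposals. A prompt in this spirit might look like the following (a paraphrase of the behavior described above, not the verbatim prompt from Wang et al.):

```python
# Illustrative Aggregate-and-Synthesize instruction (paraphrased, not the
# paper's exact wording). {query} and {numbered_proposals} are filled in
# at call time.
AGGREGATE_AND_SYNTHESIZE = """\
You have been provided with responses from several models to the user query
below. Critically evaluate these proposals: identify factual errors, biases,
and gaps, then synthesize their best elements into a single refined, accurate,
and comprehensive answer. Do not simply copy any one proposal.

Query: {query}

Proposals:
{numbered_proposals}
"""
```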

Implementation

Reference implementations: Together AI's open-source release accompanying the paper; ports exist in LangChain, LlamaIndex, and CrewAI (see Evolution).
Implementation pitfalls
High · Context and token blow-up

Each proposer in layer l receives a concatenation of n proposals from layer l-1. For L=4, n=6, and 500-token responses, each agent in layer 4 receives ~3000 tokens of intermediate context, across 6 calls per layer, a substantial overhead. Set hard max_tokens limits for proposers.
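A simple guard, reusing the hypothetical call_llm helper from the sketch in "How it works" (the cap values are assumptions, not recommendations from the paper):

```python
MAX_PROPOSAL_TOKENS = 384  # hard cap for intermediate proposals (illustrative)
FINAL_MAX_TOKENS = 1024    # only the final aggregator gets a generous budget

async def call_proposer(model: str, prompt: str) -> str:
    # Bounding proposal length keeps the next layer's context (n proposals)
    # bounded as well. call_llm is the placeholder from the earlier sketch.
    return await call_llm(model, prompt, max_tokens=MAX_PROPOSAL_TOKENS)
```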

Medium · Sequential-layer latency

Despite intra-layer parallelism, the L layers must run sequentially. Full MoA can have 30–120s latency vs 5–15s for a single query.

Medium · Echo chamber

If the proposers are too similar (same model families, similar training data), they can mutually reinforce errors. Diversity in model selection is critical.

Medium · Weak aggregator

A weak aggregator model cannot exploit diverse proposals well; the aggregator must have higher reasoning ability than most proposers.

Low · Prompt sensitivity

The Aggregate-and-Synthesize prompt significantly affects results; wording variations can yield 5–10 pp differences on benchmarks.

Evolution

Original paper · 2024
Mixture-of-Agents Enhances Large Language Model Capabilities
Wang et al. publish "Mixture-of-Agents Enhances Large Language Model Capabilities" on arXiv (2406.04692). Together AI releases an implementation alongside the paper.
Together MoA reaches 65.1% on AlpacaEval 2.0, beating GPT-4 Omni (57.5%): the first time an OSS configuration beats a closed frontier model on this benchmark.
MoA implementations appear in LangChain, LlamaIndex, CrewAI, and other agent orchestration frameworks.
Technical details

Hyperparameters (configurable axes)

Number of layers L

Typical values: 2–6. Together MoA uses L=4. Greater depth yields higher quality but linearly growing cost and latency.

Number of proposers n per layer

Typical values: 3–6. Higher n brings greater diversity, but token cost grows roughly quadratically within a layer (each of the n proposers reads the concatenation of all n previous proposals).

Model selection (performance vs diversity)

Trade-off between selecting the strongest models (all GPT-4-class) versus diversity (heterogeneity of families: Qwen + Llama + Mixtral + DBRX). Wang et al. recommend diversity for proposers, performance for the aggregator.

Aggregate-and-Synthesize prompt

Variant of the prompt used by agents to interpret previous-layer proposals โ€” has a significant impact on final-answer quality.

Token budget / cost

Full MoA (4 layers × 6 proposers) vs MoA-Lite (3 layers with fewer proposers). Lite reaches 59.3% AlpacaEval at significantly lower cost.
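These axes fit naturally into a single configuration object. A hypothetical sketch (the model names are placeholders, not the paper's exact rosters):

```python
from dataclasses import dataclass

@dataclass
class MoAConfig:
    num_layers: int          # L: typically 2-6
    proposers: list[str]     # n models per layer, ideally from diverse families
    aggregator: str          # strongest available model for the task
    max_proposal_tokens: int = 384  # cost/quality lever (illustrative default)

FULL_MOA = MoAConfig(
    num_layers=4,
    proposers=["family-a-70b", "family-b-72b", "family-c-8x22b",
               "family-d-instruct", "family-e-110b", "family-f-34b"],
    aggregator="family-e-110b",
)

MOA_LITE = MoAConfig(  # fewer layers and proposers, significantly lower cost
    num_layers=3,
    proposers=["family-a-70b", "family-b-72b", "family-c-8x22b"],
    aggregator="family-b-72b",
)
```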

Computational complexity

Computational characteristics
→ LLM call count: L × n + 1 (reference: 4 × 6 + 1 = 25 calls per query)
→ Token cost: 20–100× a single call, depending on L, n, and proposal length
→ Latency: sequential layers = L × max(per-proposer latency), typically 30–120s
→ Memory scaling: linear in n proposers per layer (context grows with n)
→ Hardware: agnostic; MoA is API-call orchestration and works with any LLM provider
→ Together MoA cost: ~$0.50–$2.00 per query (6 OSS models, 4 layers) vs ~$0.10–$0.50 for a single GPT-4 call
→ AlpacaEval 2.0: Together MoA 65.1%, GPT-4 Omni 57.5%, MoA-Lite 59.3%; MoA beats closed frontier models at higher cost
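The call-count and context arithmetic above is easy to sanity-check in code (the 500-token proposal length is the same illustrative assumption used in the pitfalls section):

```python
def moa_call_count(num_layers: int, num_proposers: int) -> int:
    # L layers of n proposers each, plus one final aggregator call.
    return num_layers * num_proposers + 1

def forwarded_context_tokens(num_layers: int, num_proposers: int,
                             tokens_per_proposal: int = 500) -> int:
    # Input tokens spent re-reading forwarded proposals (query and output
    # tokens excluded): layers 2..L each have n agents reading n proposals,
    # and the aggregator reads the last layer's n proposals once.
    per_agent = num_proposers * tokens_per_proposal
    return (num_layers - 1) * num_proposers * per_agent + per_agent

assert moa_call_count(4, 6) == 25       # matches the reference figure above
print(forwarded_context_tokens(4, 6))   # 57000 input tokens of overhead
```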
Benchmark notes

MoA is evaluated primarily on 3 benchmarks. (1) AlpacaEval 2.0 (LC win rate): Together MoA 65.1%, MoA-Lite 59.3%, GPT-4 Omni 57.5%, GPT-4 Turbo 55.0%; the first instance where an OSS configuration beats a frontier closed model on AlpacaEval. (2) MT-Bench: Together MoA 9.25, GPT-4 Turbo 9.32, GPT-4 Omni 9.19; MoA is comparable to or marginally below the closed models. (3) FLASK (fine-grained skill assessment): Together MoA surpasses GPT-4 Omni in 10 of 12 evaluation dimensions, including robustness, correctness, and efficiency. Caveats: MoA cost is 5–10× higher than a single GPT-4o call; for latency-sensitive applications (real-time chat) MoA can be unacceptably slow (30–120s vs 5–15s). The Wang et al. benchmarks do not cover math or HumanEval/MATH-level coding; those domains require separate validation.

Execution paradigm

Primary mode
always_on

All agents in a layer are activated for every query (parallel fan-out), but the choice of proposers and aggregator is input-independent, defined statically in the architecture configuration.

Activation pattern
input_independent
Routing mechanism

MoA does not use dynamic routing: every query passes through all proposers in every layer. Heterogeneity comes from the static selection of different models, not runtime routing.

Parallelism

Parallelism level
highly_parallel

Within a single MoA layer, execution is fully parallel: the n proposers are called concurrently. Between layers, sequentiality is enforced by the y_l → y_{l+1} dependency.

Scope
inference · across_devices
Constraints
! All proposers in layer l must finish generating before layer l+1 starts; this synchronization barrier introduces latency equal to max(proposer time).
! Parallel calls to multiple LLMs from different providers may hit throttling; careful quota management is required.
! Concatenating the previous layer's proposals grows the next layer's context; with n proposers and long answers, context grows linearly with n.
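For the throttling constraint, a common mitigation is a per-provider concurrency cap. A sketch using asyncio.Semaphore (the provider names and limit values are assumptions, not provider-specific guidance):

```python
import asyncio

async def call_llm(model: str, prompt: str) -> str:
    # Stub standing in for the placeholder client from the earlier sketch.
    await asyncio.sleep(0)
    return f"<{model} response>"

# Per-provider concurrency limits to stay under rate limits (illustrative).
LIMITS = {
    "provider-a": asyncio.Semaphore(8),
    "provider-b": asyncio.Semaphore(4),
}

async def throttled_call(provider: str, model: str, prompt: str) -> str:
    # The semaphore caps in-flight requests per provider, so a full-layer
    # fan-out queues instead of tripping the provider's throttling.
    async with LIMITS[provider]:
        return await call_llm(model, prompt)
```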