MoA
How it works
The MoA architecture consists of L layers. Each layer l (for l < L) contains n proposer agents A_{l,1}, ..., A_{l,n}, where each agent is a call to a specific LLM with a dedicated prompt. The input to layer l is (1) the original user query x and (2) the concatenation of all responses from layer l-1: y_{l-1} = [y_{l-1,1}, ..., y_{l-1,n}]. Each agent A_{l,i} receives a prompt of the form "Here is query x and the proposals from the previous layer, y_{l-1}; produce an improved response" and outputs y_{l,i}. The proposals y_{l,1}, ..., y_{l,n} are generated independently (parallel calls). The final layer L contains a single aggregator that receives the proposals from layer L-1 and generates the final response. Model selection follows two criteria: (a) performance: stronger models on a given task type serve as aggregators; (b) diversity: proposers come from different model families, giving heterogeneous error profiles. The reference Together MoA configuration uses 4 layers with 6 agents each.
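The layered flow above can be sketched in a few lines of Python. Here `call_llm` is a hypothetical stand-in for any chat-completion client, and the prompt wording is illustrative, not the exact prompt from Wang et al.:

```python
from concurrent.futures import ThreadPoolExecutor

def call_llm(model: str, prompt: str) -> str:
    # Placeholder: swap in a real chat-completion API call here.
    return f"[{model}] response to: {prompt[:40]}"

def moa(query: str, proposer_layers: list[list[str]], aggregator: str) -> str:
    previous: list[str] = []  # y_{l-1}: proposals from the previous layer
    for models in proposer_layers:  # layers run sequentially
        context = "\n\n".join(previous)
        prompt = (
            f"Query: {query}\n\nProposals from the previous layer:\n{context}\n\n"
            "Critically evaluate these proposals and produce an improved response."
            if previous
            else f"Query: {query}\n\nProduce a response."
        )
        # Intra-layer parallelism: the n proposers are independent calls.
        with ThreadPoolExecutor() as pool:
            previous = list(pool.map(lambda m: call_llm(m, prompt), models))
    # Final layer: a single aggregator synthesizes the last layer's proposals.
    final_prompt = (
        f"Query: {query}\n\nProposals:\n" + "\n\n".join(previous)
        + "\n\nSynthesize the best elements into one coherent answer."
    )
    return call_llm(aggregator, final_prompt)
```

In a real deployment each entry of `proposer_layers` would name a distinct model family, per the diversity criterion above.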
Problem solved
A single LLM, even the strongest, is bounded by its training data, biases, and knowledge gaps. Classical LLM ensembling (e.g., self-consistency, majority voting) scales poorly because all votes come from the same model or models with similar error profiles. Conversely, standalone calls to different LLMs (zero-shot multi-model voting) fail to exploit their mutual complementarity. MoA addresses this through structural composition: agents from different models see each other's proposals and can critique, refine, and synthesize them. This unlocks the complementarity of different models' strengths (e.g., math, coding, language) without training, fine-tuning, or external infrastructure.
Key mechanisms
Strengths & limitations
Components
Proposer: an LLM agent in layer l < L that independently generates a response proposal from the query and the previous layer's proposals. The reference Together MoA typically uses n=6 proposers per layer.
Aggregator: an LLM agent in the final layer L that synthesizes the proposals from layer L-1 into one coherent answer. It is chosen as the strongest model for the given task.
Layer pipeline: a sequence of L layers in which the outputs of layer l are concatenated and fed as context to layer l+1. Typically L=4 in the reference Together configuration.
Aggregate-and-Synthesize prompt: a special prompt instructing the agent how to treat previous-layer proposals (critical evaluation, error identification, synthesis of the best elements). The entire architecture is defined by this prompt plus a static topology.
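A minimal sketch of such a prompt, assuming a generic template; the three instructions match the behaviors listed above, but the exact wording in Wang et al. differs:

```python
# Illustrative Aggregate-and-Synthesize template (not the verbatim paper prompt).
AGGREGATE_AND_SYNTHESIZE = """\
You have been provided with responses from several models to the query below.
Your task:
1. Critically evaluate each response; some may be biased or incorrect.
2. Identify factual errors and weaknesses.
3. Synthesize the strongest elements into a single refined, accurate answer.

Query:
{query}

Responses:
{proposals}
"""

prompt = AGGREGATE_AND_SYNTHESIZE.format(query="...", proposals="...")
```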
Implementation
Each proposer in layer l receives the concatenation of the n proposals from layer l-1. For L=4, n=6, and 500-token responses, each agent in a later layer receives ~3,000 tokens of intermediate context; with 6 calls per layer, that is ~18,000 extra input tokens per layer. Set hard max_tokens limits for proposers.
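The per-layer arithmetic can be made explicit; the numbers below assume the n proposals of ~500 tokens described above:

```python
def layer_input_overhead(n: int, resp_tokens: int) -> int:
    # Each of the n agents in layer l reads all n proposals from layer l-1,
    # so the layer consumes n * (n * resp_tokens) extra input tokens.
    return n * n * resp_tokens

# ~3,000 tokens of context per call x 6 calls = 18,000 tokens for the layer.
assert layer_input_overhead(6, 500) == 18_000
```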
Despite intra-layer parallelism, the L layers must run sequentially. A full MoA pass can take 30–120 s, versus 5–15 s for a single-model query.
If proposers are too similar (same model families, similar training data), they can mutually reinforce errors. Diversity of model selection is critical.
A weak aggregator model cannot exploit diverse proposals well: the aggregator must have higher reasoning ability than most proposers.
The Aggregate-and-Synthesize prompt significantly affects results: wording variations can yield 5–10 pp differences on benchmarks.
Evolution
Technical details
Hyperparameters (configurable axes)
Number of layers L. Typical values: 2–6; Together MoA uses L=4. Greater depth yields higher quality, but cost and latency grow linearly with L.
Number of proposers per layer n. Typical values: 3–6. Higher n gives greater diversity, but token cost grows quadratically with n, since each of the n proposers sees the concatenation of all n previous proposals.
Model selection. A trade-off between picking only the strongest models (all GPT-4-class) and diversity (heterogeneous families: Qwen + Llama + Mixtral + DBRX). Wang et al. recommend diversity for proposers and raw performance for the aggregator.
Prompt variant. The wording of the prompt agents use to interpret previous-layer proposals; it has a significant impact on final-answer quality.
Configuration scale. Full MoA (4 layers × 6 proposers) vs. MoA-Lite (3 layers with fewer proposers). MoA-Lite reaches 59.3% on AlpacaEval at significantly lower cost.
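A rough budget helper makes these axes concrete. It assumes proposer layers 1..L-1, a single aggregator in layer L, and uniform ~500-token responses (an assumption for illustration, not a measured figure):

```python
def moa_budget(L: int, n: int, resp_tokens: int = 500) -> dict[str, int]:
    calls = (L - 1) * n + 1  # proposer layers 1..L-1, plus one aggregator
    # Agents in layers 2..L-1 each read n previous proposals; the single
    # aggregator in layer L reads n proposals as well.
    context_tokens = (L - 2) * n * n * resp_tokens + n * resp_tokens
    return {"calls": calls, "context_tokens": context_tokens}

full = moa_budget(4, 6)  # Full MoA: 19 calls, 39,000 tokens of extra context
```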
Computational complexity
MoA is evaluated primarily on three benchmarks. (1) AlpacaEval 2.0 (LC win rate): Together MoA 65.1%, MoA-Lite 59.3%, GPT-4 Omni 57.5%, GPT-4 Turbo 55.0%. This was the first instance of an open-source configuration beating a frontier closed model on AlpacaEval. (2) MT-Bench: Together MoA 9.25, GPT-4 Turbo 9.32, GPT-4 Omni 9.19; MoA is comparable to or marginally below the closed models. (3) FLASK (fine-grained skill assessment): Together MoA surpasses GPT-4 Omni in 10 of 12 evaluation dimensions, including robustness, correctness, and efficiency. Caveats: MoA costs 5–10× more than a single GPT-4o call, and for latency-sensitive applications (real-time chat) it can be unacceptably slow (30–120 s vs. 5–15 s). The Wang et al. benchmarks do not cover math or HumanEval/MATH-level coding; those domains require separate validation.
Execution paradigm
All agents in a layer are activated for every query (parallel fan-out), but the choice of proposers and aggregator is input-independent: it is fixed statically in the architecture configuration.
MoA does not use dynamic routing: every query passes through every proposer in every layer. Heterogeneity comes from the static selection of different models, not from runtime routing.
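Such a static topology is just declarative configuration. The model names below are illustrative placeholders drawn from the open-source families named in this document, not a verified reproduction of the Together MoA line-up:

```python
# Static, input-independent MoA topology as plain configuration.
PROPOSERS = [
    "qwen-1.5-110b", "qwen-1.5-72b", "llama-3-70b",
    "mixtral-8x22b", "dbrx-instruct", "wizardlm-2-8x22b",
]
MOA_CONFIG = {
    "layers": [PROPOSERS] * 3,      # the same 6 proposers in layers 1-3 (L=4)
    "aggregator": "qwen-1.5-110b",  # layer 4: a single strong aggregator
}
```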
Parallelism
Within a single MoA layer, execution is fully parallel: the n proposers are called concurrently. Between layers, sequentiality is enforced by the y_l → y_{l+1} dependency.
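This fan-out/serialize pattern can be demonstrated with asyncio: `asyncio.gather` overlaps the proposer calls of one layer, while the `await` between layers enforces the y_l → y_{l+1} dependency. The calls here are simulated sleeps, not real LLM requests:

```python
import asyncio
import time

async def fake_call(model: str, delay: float) -> str:
    # Stands in for a real LLM request: network + generation time.
    await asyncio.sleep(delay)
    return f"proposal from {model}"

async def run_layer(models: list[str], delay: float) -> list[str]:
    # Intra-layer parallelism: all proposers of the layer run concurrently.
    return list(await asyncio.gather(*(fake_call(m, delay) for m in models)))

async def main() -> tuple[float, int]:
    start = time.perf_counter()
    proposals: list[str] = []
    for layer in [["a", "b", "c"], ["d", "e", "f"]]:  # layers serialize
        proposals = await run_layer(layer, 0.1)       # y_l feeds layer l+1
    return time.perf_counter() - start, len(proposals)

elapsed, n_proposals = asyncio.run(main())
# Two layers of 0.1 s calls finish in roughly 0.2 s of wall time, not 0.6 s.
```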