MoA
How it works
The MoA architecture consists of L layers. Each layer l (for l < L) contains n proposer agents A_{l,1}, ..., A_{l,n}, where each agent is a call to a specific LLM with a dedicated prompt. The input to layer l is (1) the original user query x and (2) the concatenation of all responses from layer l-1: y_{l-1} = [y_{l-1,1}, ..., y_{l-1,n}]. Each agent A_{l,i} receives a prompt of the form "Here is query x and the proposals from the previous layer, y_{l-1}; produce an improved response" and outputs y_{l,i}. The proposals y_{l,1}, ..., y_{l,n} are generated independently (parallel calls). The final layer L contains a single aggregator that receives the proposals from layer L-1 and generates the final response. Model selection follows two criteria: (a) performance: stronger models on a given task type serve as aggregators; (b) diversity: proposers come from different model families, giving heterogeneous error profiles. The reference Together MoA configuration uses 4 layers with 6 agents each.
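The layered flow above can be sketched in a few lines of Python. Here `call_llm` is a hypothetical stand-in for any chat-completion client, and the prompt wording is illustrative, not the exact prompt from Wang et al.:

```python
from concurrent.futures import ThreadPoolExecutor

def call_llm(model: str, prompt: str) -> str:
    # Placeholder: swap in a real chat-completion API call here.
    return f"[{model}] response to: {prompt[:40]}"

def moa(query: str, proposer_layers: list[list[str]], aggregator: str) -> str:
    previous: list[str] = []  # y_{l-1}: proposals from the previous layer
    for models in proposer_layers:  # layers run sequentially
        context = "\n\n".join(previous)
        prompt = (
            f"Query: {query}\n\nProposals from the previous layer:\n{context}\n\n"
            "Critically evaluate these proposals and produce an improved response."
            if previous
            else f"Query: {query}\n\nProduce a response."
        )
        # Intra-layer parallelism: the n proposers are independent calls.
        with ThreadPoolExecutor() as pool:
            previous = list(pool.map(lambda m: call_llm(m, prompt), models))
    # Final layer: a single aggregator synthesizes the last layer's proposals.
    final_prompt = (
        f"Query: {query}\n\nProposals:\n" + "\n\n".join(previous)
        + "\n\nSynthesize the best elements into one coherent answer."
    )
    return call_llm(aggregator, final_prompt)
```

In a real deployment each entry of `proposer_layers` would name a distinct model family, per the diversity criterion above.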
Problem solved
A single LLM, even the strongest, is bounded by its training data, biases, and knowledge gaps. Classical LLM ensembling (e.g., self-consistency, majority voting) scales poorly because all votes come from the same model or models with similar error profiles. Conversely, standalone calls to different LLMs (zero-shot multi-model voting) fail to exploit their mutual complementarity. MoA addresses this through structural composition: agents from different models see each other's proposals and can critique, refine, and synthesize them. This unlocks the complementarity of different models' strengths (e.g., math, coding, language) without training, fine-tuning, or external infrastructure.
Key mechanisms
Strengths & limitations
Components
Proposer: an LLM agent in layer l < L that independently generates a response proposal from the query and the previous layer's proposals. The reference Together MoA typically uses n=6 proposers per layer.
Aggregator: an LLM agent in the final layer L that synthesizes the proposals from layer L-1 into one coherent answer. It is chosen as the strongest model for the given task.
Layer pipeline: a sequence of L layers in which the outputs of layer l are concatenated and fed as context to layer l+1. Typically L=4 in the reference Together configuration.
Aggregate-and-Synthesize prompt: a special prompt instructing the agent how to treat previous-layer proposals (critical evaluation, error identification, synthesis of the best elements). The entire architecture is defined by this prompt plus a static topology.
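A minimal sketch of such a prompt, assuming a generic template; the three instructions match the behaviors listed above, but the exact wording in Wang et al. differs:

```python
# Illustrative Aggregate-and-Synthesize template (not the verbatim paper prompt).
AGGREGATE_AND_SYNTHESIZE = """\
You have been provided with responses from several models to the query below.
Your task:
1. Critically evaluate each response; some may be biased or incorrect.
2. Identify factual errors and weaknesses.
3. Synthesize the strongest elements into a single refined, accurate answer.

Query:
{query}

Responses:
{proposals}
"""

prompt = AGGREGATE_AND_SYNTHESIZE.format(query="...", proposals="...")
```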
Implementation
Each proposer in layer l receives the concatenation of the n proposals from layer l-1. For L=4, n=6, and 500-token responses, each agent in a later layer receives ~3,000 tokens of intermediate context; with 6 calls per layer, that is ~18,000 extra input tokens per layer. Set hard max_tokens limits for proposers.
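The per-layer arithmetic can be made explicit; the numbers below assume the n proposals of ~500 tokens described above:

```python
def layer_input_overhead(n: int, resp_tokens: int) -> int:
    # Each of the n agents in layer l reads all n proposals from layer l-1,
    # so the layer consumes n * (n * resp_tokens) extra input tokens.
    return n * n * resp_tokens

# ~3,000 tokens of context per call x 6 calls = 18,000 tokens for the layer.
assert layer_input_overhead(6, 500) == 18_000
```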
Despite intra-layer parallelism, the L layers must run sequentially. A full MoA pass can take 30–120 s, versus 5–15 s for a single-model query.
If proposers are too similar (same model families, similar training data), they can mutually reinforce errors. Diversity of model selection is critical.
A weak aggregator model cannot exploit diverse proposals well: the aggregator must have higher reasoning ability than most proposers.
The Aggregate-and-Synthesize prompt significantly affects results: wording variations can yield 5–10 pp differences on benchmarks.
Evolution
Technical details
Hyperparameters (configurable axes)
Number of layers L. Typical values: 2–6; Together MoA uses L=4. Greater depth yields higher quality, but cost and latency grow linearly with L.
Number of proposers per layer n. Typical values: 3–6. Higher n gives greater diversity, but token cost grows quadratically with n, since each of the n proposers sees the concatenation of all n previous proposals.
Model selection. A trade-off between picking only the strongest models (all GPT-4-class) and diversity (heterogeneous families: Qwen + Llama + Mixtral + DBRX). Wang et al. recommend diversity for proposers and raw performance for the aggregator.
Prompt variant. The wording of the prompt agents use to interpret previous-layer proposals; it has a significant impact on final-answer quality.
Configuration scale. Full MoA (4 layers × 6 proposers) vs. MoA-Lite (3 layers with fewer proposers). MoA-Lite reaches 59.3% on AlpacaEval at significantly lower cost.
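A rough budget helper makes these axes concrete. It assumes proposer layers 1..L-1, a single aggregator in layer L, and uniform ~500-token responses (an assumption for illustration, not a measured figure):

```python
def moa_budget(L: int, n: int, resp_tokens: int = 500) -> dict[str, int]:
    calls = (L - 1) * n + 1  # proposer layers 1..L-1, plus one aggregator
    # Agents in layers 2..L-1 each read n previous proposals; the single
    # aggregator in layer L reads n proposals as well.
    context_tokens = (L - 2) * n * n * resp_tokens + n * resp_tokens
    return {"calls": calls, "context_tokens": context_tokens}

full = moa_budget(4, 6)  # Full MoA: 19 calls, 39,000 tokens of extra context
```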
Computational complexity
MoA is evaluated primarily on three benchmarks. (1) AlpacaEval 2.0 (LC win rate): Together MoA 65.1%, MoA-Lite 59.3%, GPT-4 Omni 57.5%, GPT-4 Turbo 55.0%. This was the first instance of an open-source configuration beating a frontier closed model on AlpacaEval. (2) MT-Bench: Together MoA 9.25, GPT-4 Turbo 9.32, GPT-4 Omni 9.19; MoA is comparable to or marginally below the closed models. (3) FLASK (fine-grained skill assessment): Together MoA surpasses GPT-4 Omni in 10 of 12 evaluation dimensions, including robustness, correctness, and efficiency. Caveats: MoA costs 5–10× more than a single GPT-4o call, and for latency-sensitive applications (real-time chat) it can be unacceptably slow (30–120 s vs. 5–15 s). The Wang et al. benchmarks do not cover math or HumanEval/MATH-level coding; those domains require separate validation.
Execution paradigm
All agents in a layer are activated for every query (parallel fan-out), but the choice of proposers and aggregator is input-independent: it is fixed statically in the architecture configuration.
MoA does not use dynamic routing: every query passes through every proposer in every layer. Heterogeneity comes from the static selection of different models, not from runtime routing.
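Such a static topology is just declarative configuration. The model names below are illustrative placeholders drawn from the open-source families named in this document, not a verified reproduction of the Together MoA line-up:

```python
# Static, input-independent MoA topology as plain configuration.
PROPOSERS = [
    "qwen-1.5-110b", "qwen-1.5-72b", "llama-3-70b",
    "mixtral-8x22b", "dbrx-instruct", "wizardlm-2-8x22b",
]
MOA_CONFIG = {
    "layers": [PROPOSERS] * 3,      # the same 6 proposers in layers 1-3 (L=4)
    "aggregator": "qwen-1.5-110b",  # layer 4: a single strong aggregator
}
```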
Parallelism
Within a single MoA layer, execution is fully parallel: the n proposers are called concurrently. Between layers, sequentiality is enforced by the y_l → y_{l+1} dependency.
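This fan-out/serialize pattern can be demonstrated with asyncio: `asyncio.gather` overlaps the proposer calls of one layer, while the `await` between layers enforces the y_l → y_{l+1} dependency. The calls here are simulated sleeps, not real LLM requests:

```python
import asyncio
import time

async def fake_call(model: str, delay: float) -> str:
    # Stands in for a real LLM request: network + generation time.
    await asyncio.sleep(delay)
    return f"proposal from {model}"

async def run_layer(models: list[str], delay: float) -> list[str]:
    # Intra-layer parallelism: all proposers of the layer run concurrently.
    return list(await asyncio.gather(*(fake_call(m, delay) for m in models)))

async def main() -> tuple[float, int]:
    start = time.perf_counter()
    proposals: list[str] = []
    for layer in [["a", "b", "c"], ["d", "e", "f"]]:  # layers serialize
        proposals = await run_layer(layer, 0.1)       # y_l feeds layer l+1
    return time.perf_counter() - start, len(proposals)

elapsed, n_proposals = asyncio.run(main())
# Two layers of 0.1 s calls finish in roughly 0.2 s of wall time, not 0.6 s.
```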