Multi-Token Prediction
Training a language model to predict n future tokens at once (instead of one) using n independent output heads on a shared backbone — yielding better model quality, higher sample efficiency, and native support for speculative decoding without a separate drafter model.
The MTP architecture consists of a shared transformer backbone and n independent output heads. Head i predicts the token at position t+i given the context up to position t, for i = 1..n. The loss is the sum of cross-entropy losses across all n heads. Heads typically share the input embedding layer but have separate output projections. At inference, one can use only the first head (preserving compatibility with next-token sampling) or all n heads as a native drafter in speculative decoding: head 1 emits the next token, heads 2..n propose continuations, and the model verifies them all in a single step. The shared backbone and KV-cache eliminate the typical draft-plus-target implementation pitfalls.
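The core idea is compact enough to sketch directly. The following is an illustrative PyTorch sketch, not the paper's exact implementation; the module sizes, class names, and the plain summed loss are assumptions made for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MTPModel(nn.Module):
    """Shared backbone with n independent output heads (illustrative sketch)."""

    def __init__(self, vocab_size: int, d_model: int = 512, n_future: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)               # shared input embedding
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=6)   # shared trunk
        # One independent output projection per future offset (+1, +2, ..., +n).
        self.heads = nn.ModuleList(
            nn.Linear(d_model, vocab_size) for _ in range(n_future)
        )

    def forward(self, tokens: torch.Tensor) -> list[torch.Tensor]:
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.backbone(self.embed(tokens), mask=causal)           # (B, T, d_model)
        return [head(h) for head in self.heads]                      # n sets of logits


def mtp_loss(logits_per_head: list[torch.Tensor], tokens: torch.Tensor) -> torch.Tensor:
    """Sum of cross-entropy losses: head i is trained to predict the token i positions ahead.

    Assumes the sequence is longer than the number of heads.
    """
    total = 0.0
    for i, logits in enumerate(logits_per_head, start=1):
        pred = logits[:, :-i, :]       # position t predicts token t+i
        target = tokens[:, i:]         # drop positions that have no target
        total = total + F.cross_entropy(pred.reshape(-1, pred.size(-1)), target.reshape(-1))
    return total
```

At inference, sampling only from `self.heads[0]` recovers standard next-token decoding, while the remaining heads can serve as the native drafter described above.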
The standard next-token prediction loss biases training toward short-sighted, local dependencies. This weakens sample efficiency and forces a separate drafter model for speculative decoding (with the burden of coordinating two weight sets, KV-caches, and tokenizers). MTP addresses both at once: a better training signal plus a native drafter inside the model.
Number of prediction heads (n)
- 4: value used by Meta in the source paper.
- 1 (DeepSeek-V3): DeepSeek-V3 uses a single extra MTP head as an auxiliary objective.
Number of future tokens the model learns to predict in parallel. Increasing n past some point yields diminishing quality gains and raises training cost.
Auxiliary loss weight
Weight of the MTP loss relative to the next-token loss. Too high degrades main-head quality; too low reduces benefit.
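A hedged sketch of how such a weight is typically applied when MTP is an auxiliary objective (the 0.3 default and the function name are placeholders, not values taken from any of the papers above):

```python
import torch


def combined_loss(next_token_loss: torch.Tensor,
                  mtp_losses: list[torch.Tensor],
                  aux_weight: float = 0.3) -> torch.Tensor:
    """Main next-token loss plus a down-weighted average of the extra MTP head losses.

    aux_weight is the knob described above: too high degrades the main head,
    too low removes most of the benefit. 0.3 is an arbitrary placeholder.
    """
    aux = torch.stack(mtp_losses).mean() if mtp_losses else torch.tensor(0.0)
    return next_token_loss + aux_weight * aux
```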
GENESIS · Source paper
Better & Faster Large Language Models via Multi-token Prediction: MTP introduced (Meta AI)
Breakthrough: Gloeckle et al. formalize the training objective and show that 13B models trained with 4-token prediction solve 12% more problems on HumanEval and 17% more on MBPP than next-token-only baselines, with inference up to 3x faster even at large batch sizes.
DeepSeek-V3 uses MTP at 671B scale
DeepSeek-V3 (671B MoE, 37B activated) adopts MTP as an auxiliary training objective to strengthen quality. Open-weight model, trained with 2.788M H800 GPU-hours.
Gemma 4 MTP drafter models (Google)
On May 6, 2026, Google releases experimental MTP drafter models for the Gemma 4 family under Apache 2.0: 74M-parameter drafters for multi-billion-parameter targets, supported by MLX, vLLM, SGLang, and Ollama. Reported speedups: 2.8x and 3.1x on Pixel (E2B/E4B), 2.5x on Apple M4 (31B), and 2x on RTX PRO 6000 (26B), with no quality loss.
Gemma 4 MTP drafters run on consumer GPUs (RTX PRO 6000) with 2x speedup, and on Pixel mobile GPUs with 2.8x–3.1x.
Apple Silicon (M4) with unified memory achieves 2.5x speedup on Gemma 4 31B via MLX.
BUILT ON
Transformer
Transformer is a neural network architecture proposed by Vaswani et al. in "Attention Is All You Need" (NeurIPS 2017). It replaced earlier approaches based on recurrent (RNN, LSTM) and convolutional (CNN) networks in sequential tasks. The key element is the multi-head self-attention mechanism, which lets every position in a sequence attend directly to every other position, so long-range dependencies are learned across a constant number of sequential operations (rather than the linearly growing path length of RNNs). The architecture consists of encoder and decoder blocks (or encoder-only / decoder-only variants) containing multi-head attention layers, feed-forward networks, residual connections, and layer normalization (LayerNorm). Sequence positions are encoded via positional encoding (sinusoidal or learned). Transformer has become the foundation of LLMs (GPT, BERT, T5, LLaMA, Claude, Gemini), Vision Transformers (ViT), multimodal models (CLIP, Flamingo), and tabular foundation models (TabPFN). The main limitation, the quadratic attention complexity with respect to sequence length (O(n²)), is an active research direction (FlashAttention, sliding window, linear attention, SSM).
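As a reference for the mechanism described above, single-head scaled dot-product self-attention fits in a few lines (a simplified sketch; production implementations add multiple heads, causal masking, and dropout):

```python
import torch
import torch.nn.functional as F


def self_attention(x: torch.Tensor, w_q: torch.Tensor,
                   w_k: torch.Tensor, w_v: torch.Tensor) -> torch.Tensor:
    """Single-head scaled dot-product self-attention over a (T, d) sequence."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v                       # project into query/key/value
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)    # every position scores every other
    return F.softmax(scores, dim=-1) @ v                      # weighted sum of values
```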
LLM
A Large Language Model (LLM) is a class of machine learning models based on the Transformer architecture, trained on large text datasets via autoregressive language modeling (next-token prediction). These models have billions of parameters and can generate coherent text, answer questions, write code, translate languages, and perform many other language and reasoning tasks without task-specific fine-tuning. The term covers models such as GPT, LLaMA, Gemini, Claude, and Mistral. Most modern LLMs are instruction-tuned (SFT + RLHF) after the pre-training phase.
EXTENDS
Speculative Decoding
Speculative Decoding is an inference acceleration algorithm introduced by Yaniv Leviathan, Matan Kalman, and Yossi Matias (Google) in "Fast Inference from Transformers via Speculative Decoding" (ICML 2023 oral), and concurrently by Charlie Chen and the DeepMind team in "Accelerating Large Language Model Decoding with Speculative Sampling" (February 2023). The technique exploits the observation that many sub-tasks of generation are simple enough for a much smaller model to predict accurately: when the small model's predictions agree with the large model's, several tokens can be accepted in a single step. Verification requires only a single forward pass of the target model over the token sequence produced by the drafter, so generating n tokens needs on average one (rather than n) loads of the target model's parameters from memory. The algorithm is provably distribution-preserving: it uses modified rejection sampling that guarantees an output distribution identical to standard decoding. The speedup is largest when the inference bottleneck is memory bandwidth (typically consumer GPUs and mobile devices), because compute is then underutilized and an extra parallel forward pass costs relatively little.
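The distribution-preserving acceptance step can be sketched for a single draft token as follows (illustrative only; tensor names are hypothetical and the loop over multiple draft positions is omitted):

```python
import torch


def accept_or_resample(p_target: torch.Tensor,
                       p_draft: torch.Tensor,
                       draft_token: int) -> tuple[bool, int]:
    """Accept the drafted token with probability min(1, p_target/p_draft);
    otherwise resample from the residual distribution max(0, p_target - p_draft).
    This keeps the output distribution identical to sampling from the target alone."""
    if torch.rand(()) < p_target[draft_token] / p_draft[draft_token]:
        return True, draft_token
    residual = torch.clamp(p_target - p_draft, min=0.0)
    residual = residual / residual.sum()
    return False, int(torch.multinomial(residual, 1))
```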
Pretraining
Pretraining (self-supervised pretraining) is the first and most expensive stage in building modern foundation models. The model learns to predict missing or next portions of data — next tokens in text, masked words, future video frames, future robot states — without human labels. This unlocks virtually unlimited raw data (web crawls, code, books, YouTube video, robot telemetry). The result is a set of weights encoding "world knowledge" — dense statistical representations that can later be fine-tuned, instruction-tuned, or RLHF-aligned for any downstream task. Pretraining underpins GPT, BERT, CLIP, Llama, Gemini, and robotics foundation models (Pi-Zero, Gemini Robotics, Ti0).
Commonly used with
Speculative Decoding
See the description under EXTENDS above.
MoE
Mixture of Experts (MoE) is an architecture in which a model is composed of multiple parallel sub-networks — the experts — along with a gating (routing) network that determines, for each input, which subset of experts to activate and how to combine their outputs. The gating network produces a weighting over experts; in the original soft formulation (Jacobs et al., 1991), all experts are weighted and summed. In the sparse formulation (Shazeer et al., 2017), only the top-k scoring experts are activated, and the remaining experts produce no output and incur no compute cost for that input. In the context of large language models, MoE is typically applied as a replacement for the feed-forward network (FFN) sub-layer within each Transformer block. Each token is routed to a small number of expert FFNs (commonly top-1 or top-2), with the router being a learned linear projection followed by a softmax. The outputs of the selected experts are weighted by the corresponding router scores and summed. A central challenge in sparse MoE is load balancing: without explicit regularization, the router tends to collapse onto a small set of preferred experts, leaving others undertrained. This is addressed via auxiliary load balancing losses added to the training objective, which encourage a roughly uniform distribution of tokens across experts. Expert parallelism is the standard distributed training and inference strategy for large MoE models: each expert is placed on a separate device, so that the total parameter count scales with the number of devices without increasing per-device memory or per-token FLOPs proportionally. The capacity factor controls the maximum number of tokens each expert can process per batch; tokens that overflow the capacity are either dropped or passed through a residual connection. Tuning the capacity factor is a critical practical consideration.
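A minimal sketch of the sparse top-k routing described above (sizes and names are illustrative; the load-balancing loss and capacity-factor handling are omitted):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMoE(nn.Module):
    """Replaces a Transformer FFN sub-layer with top-k routed expert FFNs (sketch)."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)    # learned routing projection
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (n_tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)         # router weights per token
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                routed = topk_idx[:, slot] == e            # tokens sent to expert e in this slot
                if routed.any():
                    out[routed] += topk_scores[routed, slot].unsqueeze(-1) * expert(x[routed])
        return out
```

The per-expert loop is written for clarity; real implementations dispatch tokens in parallel (and across devices under expert parallelism) rather than iterating over experts.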
| Title | Publisher | Type |
|---|---|---|
| Better & Faster Large Language Models via Multi-token Prediction (source paper from Meta AI/FAIR: Gloeckle, Idrissi, Rozière, Lopez-Paz, Synnaeve) | arXiv | scientific article |
| DeepSeek-V3 Technical Report (DeepSeek-V3 uses MTP as an auxiliary training objective) | arXiv | scientific article |
| Google's Gemma 4 AI models get 3x speed boost by predicting future tokens (coverage of the Gemma 4 MTP release, May 6, 2026) | Ars Technica | article |