Multi-Token Prediction
Training a language model to predict n future tokens at once (instead of one) using n independent output heads on a shared backbone — yielding better model quality, higher sample efficiency, and native support for speculative decoding without a separate drafter model.
The MTP architecture consists of a shared transformer backbone and n independent output heads. Head i predicts the token at position t+i given the context up to position t, for i = 1..n. The loss is the sum of cross-entropy losses across all n heads. Heads typically share the input embedding layer but have separate output projections. At inference, one can use only the first head (preserving compatibility with next-token sampling) or all n heads as a native drafter in speculative decoding: head 1 emits the next token, heads 2..n propose continuations, and the model verifies them all in a single step. The shared backbone and KV-cache eliminate the typical draft-plus-target implementation pitfalls.
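The core idea is compact enough to sketch directly. The following is an illustrative PyTorch sketch, not the paper's exact implementation; the module sizes, class names, and the plain summed loss are assumptions made for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MTPModel(nn.Module):
    """Shared backbone with n independent output heads (illustrative sketch)."""

    def __init__(self, vocab_size: int, d_model: int = 512, n_future: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)               # shared input embedding
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=6)   # shared trunk
        # One independent output projection per future offset (+1, +2, ..., +n).
        self.heads = nn.ModuleList(
            nn.Linear(d_model, vocab_size) for _ in range(n_future)
        )

    def forward(self, tokens: torch.Tensor) -> list[torch.Tensor]:
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.backbone(self.embed(tokens), mask=causal)           # (B, T, d_model)
        return [head(h) for head in self.heads]                      # n sets of logits


def mtp_loss(logits_per_head: list[torch.Tensor], tokens: torch.Tensor) -> torch.Tensor:
    """Sum of cross-entropy losses: head i is trained to predict the token i positions ahead.

    Assumes the sequence is longer than the number of heads.
    """
    total = 0.0
    for i, logits in enumerate(logits_per_head, start=1):
        pred = logits[:, :-i, :]       # position t predicts token t+i
        target = tokens[:, i:]         # drop positions that have no target
        total = total + F.cross_entropy(pred.reshape(-1, pred.size(-1)), target.reshape(-1))
    return total
```

At inference, sampling only from `self.heads[0]` recovers standard next-token decoding, while the remaining heads can serve as the native drafter described above.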
The standard next-token prediction loss biases training toward short-sighted, local dependencies. This weakens sample efficiency and forces a separate drafter model for speculative decoding (with the burden of coordinating two weight sets, KV-caches, and tokenizers). MTP addresses both at once: a better training signal plus a native drafter inside the model.
Number of prediction heads (n)
- 4: value used by Meta in the source paper.
- 1 (DeepSeek-V3): DeepSeek-V3 uses a single extra MTP head as an auxiliary objective.
Number of future tokens the model learns to predict in parallel. Increasing n past some point yields diminishing quality gains and raises training cost.
Auxiliary loss weight
Weight of the MTP loss relative to the next-token loss. Too high degrades main-head quality; too low reduces benefit.
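A hedged sketch of how such a weight is typically applied when MTP is an auxiliary objective (the 0.3 default and the function name are placeholders, not values taken from any of the papers above):

```python
import torch


def combined_loss(next_token_loss: torch.Tensor,
                  mtp_losses: list[torch.Tensor],
                  aux_weight: float = 0.3) -> torch.Tensor:
    """Main next-token loss plus a down-weighted average of the extra MTP head losses.

    aux_weight is the knob described above: too high degrades the main head,
    too low removes most of the benefit. 0.3 is an arbitrary placeholder.
    """
    aux = torch.stack(mtp_losses).mean() if mtp_losses else torch.tensor(0.0)
    return next_token_loss + aux_weight * aux
```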
GENESIS · Source paper
Better & Faster Large Language Models via Multi-token Prediction: MTP introduced (Meta AI)
Breakthrough: Gloeckle et al. formalize the training objective and show that 13B models trained with 4-token prediction solve 12% more problems on HumanEval and 17% more on MBPP than next-token-only baselines, with inference up to 3x faster even at large batch sizes.
DeepSeek-V3 uses MTP at 671B scale
DeepSeek-V3 (671B MoE, 37B activated) adopts MTP as an auxiliary training objective to strengthen quality. Open-weight model, trained with 2.788M H800 GPU-hours.
Gemma 4 MTP drafter models (Google)
On May 6, 2026, Google releases experimental MTP drafter models for the Gemma 4 family under Apache 2.0: 74M-parameter drafters for multi-billion-parameter targets, supported by MLX, vLLM, SGLang, and Ollama. Reported speedups: 2.8x and 3.1x on Pixel (E2B/E4B), 2.5x on Apple M4 (31B), and 2x on RTX PRO 6000 (26B), with no quality loss.
Gemma 4 MTP drafters run on consumer GPUs (RTX PRO 6000) with 2x speedup, and on Pixel mobile GPUs with 2.8x–3.1x.
Apple Silicon (M4) with unified memory achieves 2.5x speedup on Gemma 4 31B via MLX.
BUILT ON
Transformer
Transformer is a neural network architecture proposed by Vaswani et al. in "Attention Is All You Need" (NeurIPS 2017). It replaced earlier approaches based on recurrent (RNN, LSTM) and convolutional (CNN) networks in sequential tasks. The key element is the multi-head self-attention mechanism, which lets every position in a sequence attend directly to every other position, so long-range dependencies are learned across a constant number of sequential operations (rather than the linearly growing path length of RNNs). The architecture consists of encoder and decoder blocks (or encoder-only / decoder-only variants) containing multi-head attention layers, feed-forward networks, residual connections, and layer normalization (LayerNorm). Sequence positions are encoded via positional encoding (sinusoidal or learned). Transformer has become the foundation of LLMs (GPT, BERT, T5, LLaMA, Claude, Gemini), Vision Transformers (ViT), multimodal models (CLIP, Flamingo), and tabular foundation models (TabPFN). The main limitation, the quadratic attention complexity with respect to sequence length (O(n²)), is an active research direction (FlashAttention, sliding window, linear attention, SSM).
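As a reference for the mechanism described above, single-head scaled dot-product self-attention fits in a few lines (a simplified sketch; production implementations add multiple heads, causal masking, and dropout):

```python
import torch
import torch.nn.functional as F


def self_attention(x: torch.Tensor, w_q: torch.Tensor,
                   w_k: torch.Tensor, w_v: torch.Tensor) -> torch.Tensor:
    """Single-head scaled dot-product self-attention over a (T, d) sequence."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v                       # project into query/key/value
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)    # every position scores every other
    return F.softmax(scores, dim=-1) @ v                      # weighted sum of values
```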
LLM
A Large Language Model (LLM) is a class of machine learning models based on the Transformer architecture, trained on large text datasets via autoregressive language modeling (next-token prediction). These models have billions of parameters and can generate coherent text, answer questions, write code, translate languages, and perform many other language and reasoning tasks without task-specific fine-tuning. The term covers models such as GPT, LLaMA, Gemini, Claude, and Mistral. Most modern LLMs are instruction-tuned (SFT + RLHF) after the pre-training phase.
EXTENDS
Speculative Decoding
Speculative Decoding is an inference acceleration algorithm introduced by Yaniv Leviathan, Matan Kalman, and Yossi Matias (Google) in "Fast Inference from Transformers via Speculative Decoding" (ICML 2023 oral), and concurrently by Charlie Chen and the DeepMind team in "Accelerating Large Language Model Decoding with Speculative Sampling" (February 2023). The technique exploits the observation that many sub-tasks of generation are simple enough for a much smaller model to predict accurately: when the small model's predictions agree with the large model's, several tokens can be accepted in a single step. Verification requires only a single forward pass of the target model over the token sequence produced by the drafter, so generating n tokens needs on average one (rather than n) loads of the target model's parameters from memory. The algorithm is provably distribution-preserving: it uses modified rejection sampling that guarantees an output distribution identical to standard decoding. The speedup is largest when the inference bottleneck is memory bandwidth (typically consumer GPUs and mobile devices), because compute is then underutilized and an extra parallel forward pass costs relatively little.
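The distribution-preserving acceptance step can be sketched for a single draft token as follows (illustrative only; tensor names are hypothetical and the loop over multiple draft positions is omitted):

```python
import torch


def accept_or_resample(p_target: torch.Tensor,
                       p_draft: torch.Tensor,
                       draft_token: int) -> tuple[bool, int]:
    """Accept the drafted token with probability min(1, p_target/p_draft);
    otherwise resample from the residual distribution max(0, p_target - p_draft).
    This keeps the output distribution identical to sampling from the target alone."""
    if torch.rand(()) < p_target[draft_token] / p_draft[draft_token]:
        return True, draft_token
    residual = torch.clamp(p_target - p_draft, min=0.0)
    residual = residual / residual.sum()
    return False, int(torch.multinomial(residual, 1))
```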
Pretraining
Pretraining (self-supervised pretraining) is the first and most expensive stage in building modern foundation models. The model learns to predict missing or next portions of data — next tokens in text, masked words, future video frames, future robot states — without human labels. This unlocks virtually unlimited raw data (web crawls, code, books, YouTube video, robot telemetry). The result is a set of weights encoding "world knowledge" — dense statistical representations that can later be fine-tuned, instruction-tuned, or RLHF-aligned for any downstream task. Pretraining underpins GPT, BERT, CLIP, Llama, Gemini, and robotics foundation models (Pi-Zero, Gemini Robotics, Ti0).
Commonly used with
Speculative Decoding
See the description under EXTENDS above.
MoE
Mixture of Experts (MoE) is an architecture in which a model is composed of multiple parallel sub-networks — the experts — along with a gating (routing) network that determines, for each input, which subset of experts to activate and how to combine their outputs. The gating network produces a weighting over experts; in the original soft formulation (Jacobs et al., 1991), all experts are weighted and summed. In the sparse formulation (Shazeer et al., 2017), only the top-k scoring experts are activated, and the remaining experts produce no output and incur no compute cost for that input. In the context of large language models, MoE is typically applied as a replacement for the feed-forward network (FFN) sub-layer within each Transformer block. Each token is routed to a small number of expert FFNs (commonly top-1 or top-2), with the router being a learned linear projection followed by a softmax. The outputs of the selected experts are weighted by the corresponding router scores and summed. A central challenge in sparse MoE is load balancing: without explicit regularization, the router tends to collapse onto a small set of preferred experts, leaving others undertrained. This is addressed via auxiliary load balancing losses added to the training objective, which encourage a roughly uniform distribution of tokens across experts. Expert parallelism is the standard distributed training and inference strategy for large MoE models: each expert is placed on a separate device, so that the total parameter count scales with the number of devices without increasing per-device memory or per-token FLOPs proportionally. The capacity factor controls the maximum number of tokens each expert can process per batch; tokens that overflow the capacity are either dropped or passed through a residual connection. Tuning the capacity factor is a critical practical consideration.
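A minimal sketch of the sparse top-k routing described above (sizes and names are illustrative; the load-balancing loss and capacity-factor handling are omitted):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMoE(nn.Module):
    """Replaces a Transformer FFN sub-layer with top-k routed expert FFNs (sketch)."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)    # learned routing projection
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (n_tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)         # router weights per token
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                routed = topk_idx[:, slot] == e            # tokens sent to expert e in this slot
                if routed.any():
                    out[routed] += topk_scores[routed, slot].unsqueeze(-1) * expert(x[routed])
        return out
```

The per-expert loop is written for clarity; real implementations dispatch tokens in parallel (and across devices under expert parallelism) rather than iterating over experts.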
| Title | Publisher | Type |
|---|---|---|
| Better & Faster Large Language Models via Multi-token Prediction (source paper from Meta AI/FAIR: Gloeckle, Idrissi, Rozière, Lopez-Paz, Synnaeve) | arXiv | scientific article |
| DeepSeek-V3 Technical Report (DeepSeek-V3 uses MTP as an auxiliary training objective) | arXiv | scientific article |
| Google's Gemma 4 AI models get 3x speed boost by predicting future tokens (coverage of the Gemma 4 MTP release, May 6, 2026) | Ars Technica | article |