NVIDIA Nemotron-Labs-Diffusion: AR and diffusion in one model

NVIDIA has released the Nemotron-Labs-Diffusion model family in 3B, 8B, and 14B sizes — the first production-grade implementation combining classical autoregression with block-wise diffusion decoding. The models are open-source on Hugging Face, with a technical report published on NVIDIA Research.

Key takeaways

NVIDIA released three language models (3B, 8B, 14B parameters) as open-source — anyone can download them, run locally, and fine-tune them for their own needs.
The models can write text in three different ways and switch between them on the fly: classic token-by-token generation, generating entire blocks of tokens in parallel, or "speculatively", quickly guessing and then verifying.
For a single user (e.g. assistant chat, local coding) the models generate text up to 3-4 times faster than classic LLMs of comparable size — at the same answer quality.
On the latest NVIDIA GB200 platform the 8B model beats the competition (Qwen3-8B-Eagle3) by 40% in throughput at the same number of concurrent users.
Answer quality (math, coding, reasoning) holds up against Qwen3-8B and Ministral-8B, the top tier in this size class: on the reported averages the 8B model beats Ministral-8B and trails Qwen3-8B by about 4 points, so the speed comes at only a small cost in accuracy.
Caveat: the speed advantage disappears on servers handling hundreds of concurrent users. Nemotron-Labs-Diffusion (NLD) is a tool for single-user and low-concurrency scenarios, not for mass-API platforms.

The third mode — Self-Speculation — is the key novelty

To understand Self-Speculation, you first need to know how the other two modes work.

AR mode (autoregression) is the classic — the model generates text token by token, left to right. Like writing a sentence letter by letter: quick to start, but every token requires a full pass through the network. Slow but predictable.
Diffusion mode works differently — the model is given a whole block of masked tokens (e.g. 32 empty slots) and fills them in parallel across a few "denoising" rounds. Like solving a crossword: you see the entire grid at once and write words into many cells simultaneously. Faster than AR, but quality can suffer because the model has to guess all positions at once.
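
The crossword analogy can be made concrete with a toy sketch. Everything in it is an assumption made for illustration: `toy_predict` stands in for a real denoiser, the confidence-ranked unmasking schedule is one common choice for masked-diffusion decoders and not necessarily the one NVIDIA uses, and the block is shrunk from 32 slots to 8.

```python
from typing import Callable, List, Optional, Tuple

MASK = None  # placeholder for a masked slot

def denoise_block(
    block_size: int,
    rounds: int,
    predict: Callable[[List[Optional[str]]], List[Tuple[str, float]]],
) -> Tuple[List[str], int]:
    """Fill a fully masked block over several parallel denoising rounds.

    Each round calls `predict` once (one forward pass) to get a
    (token, confidence) guess for every position, then commits the most
    confident slots that are still masked.
    """
    block: List[Optional[str]] = [MASK] * block_size
    per_round = -(-block_size // rounds)  # ceiling division
    passes = 0
    while any(tok is MASK for tok in block):
        guesses = predict(block)
        passes += 1
        masked = [i for i, tok in enumerate(block) if tok is MASK]
        # Commit the highest-confidence masked positions first.
        masked.sort(key=lambda i: guesses[i][1], reverse=True)
        for i in masked[:per_round]:
            block[i] = guesses[i][0]
    return block, passes


# Hypothetical stand-in for the model: it "knows" the target sentence and is
# more confident about positions next to slots that are already filled.
TARGET = "the quick brown fox jumps over lazy dogs".split()

def toy_predict(block):
    out = []
    for i, word in enumerate(TARGET):
        neighbours = sum(
            1 for j in (i - 1, i + 1)
            if 0 <= j < len(block) and block[j] is not MASK
        )
        out.append((word, 0.5 + 0.2 * neighbours - 0.01 * i))
    return out

tokens, passes = denoise_block(block_size=8, rounds=4, predict=toy_predict)
print(" ".join(tokens), "| forward passes:", passes)
```

In this toy run the 8 slots are filled in 4 forward passes, where token-by-token AR decoding would need 8.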

Self-Speculation is the trick that combines both. The idea: the same model is run in two roles simultaneously.

"Draft" version (lightweight layer configuration) — runs fast and proposes 6-8 candidate next tokens. Think of an assistant blurting out a quick guess.
"Verifier" version (full configuration) — checks all candidates in a single pass. Accepts those it would have generated itself, rejects the rest.

The key gain: one pass through the large model = 5.9 accepted tokens (TPF — tokens per forward pass), instead of 1 as in AR. That is where the 3-4× speedups come from.

It is an adaptation of Speculative Decoding with one twist: instead of training a separate small "drafter" model (as Eagle3 does), NVIDIA uses the same model in a reduced configuration. Hence the name Self-Speculation — the model speculates against itself. Less infrastructure, fewer parameters to maintain.
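
The draft-then-verify loop can be sketched as follows. This is a minimal illustration of greedy speculative decoding in general, not NVIDIA's implementation: `toy_draft` and `toy_verify` are hypothetical stand-ins for the reduced and full configurations of the model, and keeping the verifier's own token at the first mismatch is standard practice in speculative decoding that is assumed here.

```python
from typing import Callable, List, Sequence

def speculative_step(
    context: List[str],
    draft: Callable[[Sequence[str], int], List[str]],
    verify: Callable[[Sequence[str], Sequence[str]], List[str]],
    k: int = 8,
) -> List[str]:
    """One draft-then-verify step with greedy acceptance.

    `draft(context, k)` returns up to k proposed tokens.
    `verify(context, proposal)` returns the verifier's own greedy choice at
    each proposed position plus one more, all from a single forward pass.
    """
    proposal = draft(context, k)
    verified = verify(context, proposal)
    accepted: List[str] = []
    for proposed, wanted in zip(proposal, verified):
        if proposed != wanted:
            break
        accepted.append(proposed)
    # Keep the verifier's token at the first mismatch (or after a fully
    # accepted draft), so a step normally yields at least one token.
    if len(accepted) < len(verified):
        accepted.append(verified[len(accepted)])
    return accepted


# Hypothetical toy models: the verifier always continues REFERENCE, while the
# draft gets every 4th proposed token wrong.
REFERENCE = "to be or not to be that is the question".split()

def toy_verify(context, proposal):
    start = len(context)
    return REFERENCE[start:start + len(proposal) + 1]

def toy_draft(context, k):
    start = len(context)
    guess = REFERENCE[start:start + k]
    if len(guess) > 3:
        guess[3] = "maybe"  # inject one wrong guess
    return guess

context = ["to"]
passes = 0
while len(context) < len(REFERENCE):
    context += speculative_step(context, toy_draft, toy_verify, k=6)
    passes += 1
print(" ".join(context), "| verifier passes:", passes)
```

Here 9 new tokens are produced in 3 verifier passes, an average of 3 tokens per forward pass. The output is identical to what the verifier alone would have generated, which is why the quality does not change.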

An important caveat about deployment scale. The entire Self-Speculation and Diffusion gain vanishes when the server handles many concurrent users. Why:

Low concurrency (1-64 users at once): the GPU has spare compute — Self-Speculation fills it by verifying candidates. Each extra "shot" costs almost nothing because the processor would be idle anyway. NLD wins here.
High concurrency (more than 64 users at once): the GPU is already fully loaded running AR for all sessions in parallel. Tossing in extra candidates to verify buys nothing because there are no free resources. Classical AR wins here, and NVIDIA openly admits this, recommending mode switching depending on server load.

In other words: NLD shines in single-user / low-concurrency scenarios (ChatGPT-style 1-on-1 chat, coding assistant, local inference on a workstation). For a mass API platform serving thousands of concurrent requests the advantage disappears.
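
A load-based switch of the kind described above could look like the sketch below. The function name, the mode strings, and the treatment of exactly 64 sessions as "low" are all assumptions for illustration. The article only gives 64 as the approximate crossover, and a real deployment would measure it on its own hardware.

```python
def pick_decoding_mode(concurrent_sessions: int, threshold: int = 64) -> str:
    """Return a hypothetical mode name for the current server load."""
    if concurrent_sessions < 1:
        raise ValueError("concurrent_sessions must be at least 1")
    if concurrent_sessions <= threshold:
        # Spare GPU compute is available, so verifying drafts is nearly free.
        return "self_speculation"
    # The GPU is saturated by plain decoding; extra candidates only add cost.
    return "autoregressive"

for load in (1, 64, 65, 512):
    print(load, pick_decoding_mode(load))
```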

Diagram of three Nemotron-Labs-Diffusion inference modes: AR, Diffusion, Self-Speculation plus the shared dual-loss training

Training recipe — Dual Loss and Global Loss Averaging

The training recipe starts from Ministral 3B/8B/14B checkpoints. Phase one: 1 trillion tokens of AR-only pre-training. Phase two: 300 billion tokens of joint AR + Diffusion training using Global Loss Averaging — both loss signals are averaged to eliminate gradient instability when training two heads on the same backbone. This is followed by SFT and VLM alignment.
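
The article does not reproduce the report's exact formula, so the sketch below shows only one plausible reading of "averaging both loss signals": pooling the per-token losses of both objectives and dividing by the total number of supervised tokens, as opposed to averaging each objective separately first. The function names and the numbers are hypothetical.

```python
from typing import Sequence

def mean_of_means(ar_losses: Sequence[float], diff_losses: Sequence[float]) -> float:
    """Average each objective separately, then average the two means."""
    ar_mean = sum(ar_losses) / len(ar_losses)
    diff_mean = sum(diff_losses) / len(diff_losses)
    return 0.5 * (ar_mean + diff_mean)

def global_average(ar_losses: Sequence[float], diff_losses: Sequence[float]) -> float:
    """Pool all supervised tokens so each one carries the same weight."""
    total = sum(ar_losses) + sum(diff_losses)
    count = len(ar_losses) + len(diff_losses)
    return total / count

# Invented example: 8 AR-supervised tokens, but only 2 masked diffusion tokens.
ar = [2.0] * 8
diff = [4.0] * 2
print(mean_of_means(ar, diff))   # the 2 diffusion tokens get half the weight
print(global_average(ar, diff))  # each of the 10 tokens gets 1/10 of the weight
```

With a per-objective mean, a batch that happens to contain few masked tokens gives those tokens an outsized share of the gradient. Pooling removes that batch-to-batch swing, which is one way such averaging could reduce instability.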

Key implementation techniques: Global Loss Averaging for training stability, DP-rank Variable Encoding for flexible sequence length handling, Strict Causal Masking (blocking backward attention leakage), and a LoRA-grown draft: a lightweight side model derived from the main model's weights via LoRA, with no external parameters.

Benchmarks: where the model stands out and where it is limited

The most significant benchmark result is SPEED-Bench — measuring inference efficiency in low-batch scenarios. Nemotron-Labs-Diffusion-8B achieves a mean accepted length of 8.7 tokens per step, translating to 5.9 TPF on GB200. For comparison: Qwen3-5B-MTP achieves 4.7 TPF, and Qwen3-8B-Eagle3 — 2.81 TPF. These figures apply to single-user inference; at high concurrency the picture changes.

On standard quality benchmarks (QA, coding, math, reasoning), Nemotron-Labs-Diffusion-8B scores close to or better than prior dLLMs (LLaDA, Dream, SDAR), with 9–22.4% improvement on large test sets. NVIDIA clarifies that the primary advantage lies in efficiency metrics, not state-of-the-art accuracy — the models target practical deployment over leaderboard wins.

NLD 8B speedup vs. AR baseline (single user)

Platform	Precision	Speedup vs. AR
DGX Spark	FP8	3.14×
DGX Spark	INT4	2.7× (112 vs 41.8 tok/s AR)
RTX Pro 6000	FP8	3.4×
RTX Pro 6000	INT4	2.3×
GB200	FP8	3.3× (850 tok/s)
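
The ratios in the table can be checked with a few lines of arithmetic. The sketch assumes that the 850 tok/s on GB200 is the NLD figure; the AR baseline it derives is implied by the table and is not a number quoted from the report.

```python
def speedup(nld_tok_s: float, ar_tok_s: float) -> float:
    """Ratio of NLD throughput to the AR baseline."""
    return nld_tok_s / ar_tok_s

# DGX Spark, INT4: both throughputs are given in the table.
spark_int4 = speedup(112.0, 41.8)
print(f"DGX Spark INT4 speedup: {spark_int4:.2f}x")

# GB200, FP8: only the NLD figure and the ratio are given, so the AR
# baseline is inferred (assumption: 850 tok/s refers to NLD).
implied_gb200_ar = 850.0 / 3.3
print(f"Implied GB200 AR baseline: {implied_gb200_ar:.0f} tok/s")
```

The first line prints 2.68x, which rounds to the 2.7× in the table. The second gives an implied AR baseline of roughly 258 tok/s.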

Bar chart: NLD 8B speedup over autoregression — 2.4× (diffusion), 3.4× (self-speculation, H100), 4.0× (GB200), 4.8× (GB200 + optimized)

Positioning against prior dLLMs and AR competitors

In the dLLM ecosystem, earlier projects include LLaDA (Renmin University of China and Ant Group), Dream, and SDAR, all experimental and without full production inference pipelines. Nemotron-Labs-Diffusion is the first in this class with: an integrated Self-Speculation mode, published training recipes, and NLD (NVIDIA Language Deployment) client support. On the autoregressive side, the primary comparison point is Qwen3-8B-Eagle3; here Nemotron achieves 1.4× the throughput on GB200 at the same session count.

Important caveat: the throughput advantage is visible at low concurrency (<64 sessions). At high traffic (>64 sessions) AR has comparable or better system throughput — a fact NVIDIA acknowledges directly in the technical report, available at cloudfront.net.

Quality benchmarks — comparison with competitors

The table below contrasts average accuracy scores for Nemotron-Labs-Diffusion-8B against competing dLLMs and the autoregressive Qwen3-8B baseline (source: NVIDIA technical report, Table 3).

Model	QA + Instruct	Coding	Math	Average
Qwen3-8B (AR)	68.21	49.45	88.28	64.85
Qwen3-4B (AR)	67.37	36.20	85.20	62.75
Ministral-8B (AR)	63.07	38.07	70.91	57.36
LLaDA-8B (dLLM)	46.32	7.32	11.00	24.71
Dream-7B (dLLM)	54.50	24.07	46.10	40.45
SDAR-8B (dLLM)	58.06	24.05	53.94	43.69
NLD-8B (Diff)	64.41	36.49	74.50	57.29
NLD-8B (Quad SS)	67.42	38.07	78.95	60.83

The key takeaway: Nemotron-Labs-Diffusion-8B in Quadratic Self-Speculation mode achieves 60.83 average accuracy, the highest among diffusion models (LLaDA 24.71, Dream 40.45, SDAR 43.69) and within 2 points of the autoregressive Qwen3-4B (62.75). It trails Qwen3-8B (64.85) by about 4 points, an acceptable trade-off given the roughly 3× single-user speedup at low concurrency.
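
The gaps quoted above follow directly from the Average column of the table. The short script below recomputes them, using only the values from the table; the variable names are illustrative.

```python
# Average column from the table (NVIDIA technical report, Table 3).
AVERAGES = {
    "Qwen3-8B (AR)": 64.85,
    "Qwen3-4B (AR)": 62.75,
    "Ministral-8B (AR)": 57.36,
    "LLaDA-8B (dLLM)": 24.71,
    "Dream-7B (dLLM)": 40.45,
    "SDAR-8B (dLLM)": 43.69,
    "NLD-8B (Diff)": 57.29,
}
NLD_QUAD_SS = 60.83

# Positive gap: NLD-8B (Quad SS) is ahead; negative: it trails.
gaps = {name: round(NLD_QUAD_SS - score, 2) for name, score in AVERAGES.items()}
for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name:20s} {gap:+.2f}")

# Strongest of the three earlier diffusion models in the table.
prior_dllms = [name for name in AVERAGES if "(dLLM)" in name]
best_prior = max(prior_dllms, key=AVERAGES.get)
print("Best prior dLLM:", best_prior, f"(gap {gaps[best_prior]:+.2f})")
```

The script shows a 4.02-point deficit against Qwen3-8B, a 1.92-point deficit against Qwen3-4B, and a lead of 17.14 points over SDAR-8B, the strongest of the earlier diffusion models.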

Pareto diagram: GPU vs per-user throughput

The strongest architectural argument comes from Pareto analysis. The ideal position on such a chart is the upper-right corner — high throughput both for a single user and across the entire GPU. Nemotron-Labs-Diffusion-8B in Self-Speculation mode dominates regions where autoregression (black curve) and Qwen3-8B-Eagle3 (cyan) must trade one metric for the other.

Pareto chart: GPU throughput vs per-user throughput for AR, Qwen3-Eagle3 and NLD-8B on GB200 — NLD curve dominates

The NLD curve (green) sits above and to the right of competitors across the entire range. At c=128 (128 concurrent sessions), Self-Speculation reaches roughly 10,000 tok/sec total on a GB200 GPU — 3.3× more than classical AR and 1.4× more than Qwen3-8B-Eagle3 at the same per-user throughput.
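
What "dominates" means on such a chart can be stated precisely: one operating point dominates another if it is at least as good on both axes and strictly better on at least one. The sketch below computes the non-dominated set for a handful of points. The coordinates are invented for illustration and are not measurements from the report, which is why the points carry neutral labels.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (per-user tok/s, whole-GPU tok/s)

def dominates(a: Point, b: Point) -> bool:
    """True if `a` is at least as good as `b` on both axes and not identical."""
    return a[0] >= b[0] and a[1] >= b[1] and a != b

def pareto_front(points: Dict[str, Point]) -> List[str]:
    """Names of the points that no other point dominates."""
    return sorted(
        name
        for name, p in points.items()
        if not any(dominates(q, p) for q in points.values())
    )

# Invented operating points (not data from the report).
points = {
    "A": (200.0, 1600.0),
    "B": (80.0, 5000.0),
    "C": (150.0, 1500.0),
    "D": (80.0, 10000.0),
}
print(pareto_front(points))
```

Point A beats C on both axes and D beats B on GPU throughput at equal per-user speed, so only A and D remain on the frontier. A curve that "sits above and to the right" is one whose points are never dominated by points from the other curves.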

Training recipe — phases

The Nemotron-Labs-Diffusion training pipeline consists of four phases. The starting point is the Ministral 3B/8B/14B weights, and the final model undergoes alignment for visual tasks (VLM):

Phase	Tokens	Training type	Purpose
1	1T	AR-only pretraining	Stable language base from Ministral
2	300B	AR + Diffusion joint	Introduce diffusion head, Global Loss Averaging
3	—	SFT (Supervised Fine-Tuning)	Instruction and preference alignment
4	—	VLM alignment	Visual multimodality

Why this matters

Nemotron-Labs-Diffusion is a meaningful confirmation that dLLMs can be practical production tools, not just academic experiments. For years, text diffusion lagged behind autoregression in both output quality and hardware efficiency. NVIDIA demonstrates that with the right training recipe and a hybrid inference architecture, these disadvantages can be reversed — at least in single-user and low-concurrency scenarios.

The core forward-looking claim concerns sampling quality. The report explicitly states: if a better Trained Sampler for diffusion mode is developed, the theoretical upper bound for dLLM advantage over AR exceeds 76.5% — not a marketing projection, but an analysis of parallel forward-pass counts per token. For inference framework developers (vLLM, TRT-LLM), this opens a new optimization front.

The open question is scalability to 70B+ models, where diffusion forward-pass costs grow faster than in AR. NVIDIA stays below 15B for now, suggesting this remains unsolved. As an open-source research baseline, however, this is the strongest publicly available dLLM foundation in 2025.

What comes next

Training recipe is published — applicable to other backbones; community fine-tunes on Hugging Face expected
Key bottleneck is Trained Sampler quality — NVIDIA identifies it as the primary vector for future improvements (potential 76.5% advantage over AR)
NLD (NVIDIA Language Deployment) is expected to integrate diffusion modes in future releases — no date on roadmap, but current results serve as proof of concept