
DMD2

2024 · Active · Draft
Key innovation
Distills a multi-step diffusion model into a 1–4 step generator without the costly regression on a teacher-generated dataset, by combining a two time-scale update rule (TTUR) with an additional GAN loss against real data.
Category
Training
Abstraction level
Pattern
Use cases
Distillation of Stable Diffusion XL to 1–4 steps
Real-time image generation
Mobile text-to-image applications
Low-latency image editing
Production text-to-image inference scaling

How it works

The DMD2 pipeline uses three networks: (1) a frozen teacher, a pretrained diffusion model (e.g. SDXL) that provides the "real score", an approximation of the gradient of the log of the true data distribution; (2) a generator G, trained to map noise to images in 1–4 steps; (3) a fake score model, trained in parallel to track the distribution produced by G.

The distillation loss is the KL divergence between G's distribution and the real data distribution; its gradient with respect to G's parameters equals (score_fake − score_real) backpropagated through the generator. TTUR updates the fake score model at every step but the generator only every few steps, so the fake score model can keep up with the generator's changing distribution, which prevents instability. In addition, a GAN discriminator is trained to distinguish G's samples from real images; its signal adds an adversarial loss that improves detail quality. The whole procedure requires no teacher-generated dataset, which distinguishes DMD2 from DMD v1.
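
To make the update schedule concrete, below is a minimal PyTorch-style sketch of one training iteration. The `teacher`, `generator`, `fake_score`, and `discriminator` modules, their call signatures, the noising scheme, and all hyperparameters are illustrative assumptions, not the paper's reference implementation (which, among other differences, couples the discriminator to the fake score network); the sketch only shows the TTUR schedule and the combined distribution-matching plus GAN signal.

```python
# Illustrative DMD2-style training step. Module interfaces, the simplified noising,
# and the hyperparameters are hypothetical, not the official implementation.
import torch
import torch.nn.functional as F

def add_noise(x0, t, sigmas):
    """Diffuse clean images x0 to the noise level sigmas[t] (simplified)."""
    return x0 + sigmas[t].view(-1, 1, 1, 1) * torch.randn_like(x0)

def train_step(step, teacher, generator, fake_score, discriminator,
               opt_g, opt_fake, opt_d, real_images, noise, sigmas,
               generator_update_every=5, gan_weight=0.1):
    # 1. Sample images from the few-step generator (1-step case for brevity).
    fake_images = generator(noise)
    batch = fake_images.shape[0]

    # 2. Fast time scale: update the fake score model every iteration so it
    #    keeps tracking the generator's current output distribution.
    t = torch.randint(0, len(sigmas), (batch,))
    noisy_fake = add_noise(fake_images.detach(), t, sigmas)
    loss_fake = F.mse_loss(fake_score(noisy_fake, t), fake_images.detach())
    opt_fake.zero_grad()
    loss_fake.backward()
    opt_fake.step()

    # 3. Update the GAN discriminator against real data (non-saturating loss).
    loss_d = (F.softplus(-discriminator(real_images)).mean()
              + F.softplus(discriminator(fake_images.detach())).mean())
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # 4. Slow time scale (TTUR): update the generator only every few iterations.
    if step % generator_update_every == 0:
        t = torch.randint(0, len(sigmas), (batch,))
        noisy = add_noise(fake_images, t, sigmas)
        with torch.no_grad():
            denoised_real = teacher(noisy, t)      # pull toward the data distribution
            denoised_fake = fake_score(noisy, t)   # push away from G's current distribution
        # Distribution matching direction (score_fake - score_real), realized as a
        # surrogate loss whose gradient w.r.t. fake_images equals it (up to scaling).
        grad = denoised_fake - denoised_real
        loss_dm = 0.5 * F.mse_loss(fake_images, (fake_images - grad).detach())
        loss_gan = F.softplus(-discriminator(fake_images)).mean()
        loss_g = loss_dm + gan_weight * loss_gan
        opt_g.zero_grad()
        loss_g.backward()
        opt_g.step()
```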

Problem solved

Multi-step diffusion sampling (DDPM/DDIM) requires 25–50 network evaluations per image, making it expensive at inference time and impractical for real-time applications. Earlier distillation methods (Progressive Distillation, Consistency Models, DMD v1) either lost quality or required a costly upfront regression phase on a teacher-generated dataset. DMD2 addresses both problems: it maintains near-teacher quality at 1–4 steps, eliminates dataset pre-generation, and trains more stably thanks to TTUR.

Key mechanisms

Distribution Matching loss (KL divergence; see the formula after this list)
Two Time-scale Update Rule (TTUR)
Auxiliary GAN loss against real data
Multi-step generator (separate noise level heads)
Elimination of regression loss
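
As a sketch of the first mechanism, the distribution matching gradient can be written as follows, in the notation of the "How it works" section above; the time-dependent weight w_t and the exact noising rule are illustrative assumptions rather than the paper's precise formulation.

```latex
% Sketch: gradient of the distribution matching (KL) loss w.r.t. generator parameters.
% s_real is the frozen teacher's score, s_fake the fake score model's score,
% w_t an (assumed) time-dependent weight, and x_t a noised version of G_theta(z).
\nabla_\theta \, D_{\mathrm{KL}}\!\left(p_G \,\|\, p_{\mathrm{data}}\right)
  \approx \mathbb{E}_{z,\,t,\,\epsilon}\!\left[
    w_t \,\bigl(s_{\mathrm{fake}}(x_t, t) - s_{\mathrm{real}}(x_t, t)\bigr)\,
    \frac{\partial G_\theta(z)}{\partial \theta}
  \right],
  \qquad x_t = G_\theta(z) + \sigma_t\,\epsilon,\quad \epsilon \sim \mathcal{N}(0, I).
```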

Strengths & limitations

Strengths
1–4 inference steps vs 25–50 for DDIM
Near-teacher quality (FID comparable to SDXL)
No need to pre-generate a teacher dataset
More stable training than DMD v1 thanks to TTUR
Multi-step configuration support (quality–speed trade-off; see the sampling sketch after this list)
Limitations
Requires three networks during training (a frozen teacher, the generator, and a fake score model trained in parallel with it)
GAN discriminator adds hyperparameter sensitivity
Quality gains are concentrated in high-frequency detail; composition can be weaker in the 1-step setting
Limited to diffusion model distillation (not applicable to other generative model types)
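
A minimal sketch of how a few-step configuration might be run at inference time, following the separate noise-level heads mentioned under Key mechanisms; the step schedule, the re-noising rule, and the `heads` mapping are illustrative assumptions, not the paper's exact sampler.

```python
# Hypothetical few-step sampling loop for a distilled generator (illustration only).
# Assumes one generator head per noise level, each mapping a noisy image to a
# clean-image prediction; the noise-level schedule `sigmas` is an assumption.
import torch

@torch.no_grad()
def sample(heads, sigmas, shape, device="cpu"):
    """heads: dict mapping a noise level to a callable(noisy_image) -> clean image."""
    x = sigmas[0] * torch.randn(shape, device=device)   # start from pure noise
    for i, sigma in enumerate(sigmas):
        x0_pred = heads[sigma](x)                        # denoise at this level
        if i + 1 < len(sigmas):
            # Re-inject noise at the next (smaller) level before the next step.
            x = x0_pred + sigmas[i + 1] * torch.randn_like(x0_pred)
        else:
            x = x0_pred                                  # final step: clean image
    return x

# A 4-entry `sigmas` schedule trades a little latency for quality; a single-entry
# schedule gives the fastest (1-step) configuration.
```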

Implementation