
DMD2

2024 · Active · Draft
Key innovation
Distills a multi-step diffusion model into a 1–4 step generator without the costly regression on a teacher-generated dataset, by combining a two time-scale update rule (TTUR) with an additional GAN loss against real data.
Category
Training
Abstraction level
Pattern
Use cases
Distillation of Stable Diffusion XL to 1–4 steps
Real-time image generation
Mobile text-to-image applications
Low-latency image editing
Production text-to-image inference scaling

How it works

The DMD2 pipeline uses three networks: (1) a frozen teacher, a pretrained diffusion model (e.g. SDXL) that provides the "real score", an approximation of the gradient of the log of the true data distribution; (2) a generator G, trained to map noise to images in 1–4 steps; (3) a fake score model, trained in parallel to track the distribution produced by G.

The distillation loss is the KL divergence between G's distribution and the real data distribution; its gradient with respect to G's parameters equals (score_fake − score_real) backpropagated through the generator. TTUR updates the fake score model at every step but the generator only every few steps, so the fake score model can keep up with the generator's changing distribution, which prevents instability. In addition, a GAN discriminator is trained to distinguish G's samples from real images; its signal adds an adversarial loss that improves detail quality. The whole procedure requires no teacher-generated dataset, which distinguishes DMD2 from DMD v1.
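
To make the update schedule concrete, below is a minimal PyTorch-style sketch of one training iteration. The `teacher`, `generator`, `fake_score`, and `discriminator` modules, their call signatures, the noising scheme, and all hyperparameters are illustrative assumptions, not the paper's reference implementation (which, among other differences, couples the discriminator to the fake score network); the sketch only shows the TTUR schedule and the combined distribution-matching plus GAN signal.

```python
# Illustrative DMD2-style training step. Module interfaces, the simplified noising,
# and the hyperparameters are hypothetical, not the official implementation.
import torch
import torch.nn.functional as F

def add_noise(x0, t, sigmas):
    """Diffuse clean images x0 to the noise level sigmas[t] (simplified)."""
    return x0 + sigmas[t].view(-1, 1, 1, 1) * torch.randn_like(x0)

def train_step(step, teacher, generator, fake_score, discriminator,
               opt_g, opt_fake, opt_d, real_images, noise, sigmas,
               generator_update_every=5, gan_weight=0.1):
    # 1. Sample images from the few-step generator (1-step case for brevity).
    fake_images = generator(noise)
    batch = fake_images.shape[0]

    # 2. Fast time scale: update the fake score model every iteration so it
    #    keeps tracking the generator's current output distribution.
    t = torch.randint(0, len(sigmas), (batch,))
    noisy_fake = add_noise(fake_images.detach(), t, sigmas)
    loss_fake = F.mse_loss(fake_score(noisy_fake, t), fake_images.detach())
    opt_fake.zero_grad()
    loss_fake.backward()
    opt_fake.step()

    # 3. Update the GAN discriminator against real data (non-saturating loss).
    loss_d = (F.softplus(-discriminator(real_images)).mean()
              + F.softplus(discriminator(fake_images.detach())).mean())
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # 4. Slow time scale (TTUR): update the generator only every few iterations.
    if step % generator_update_every == 0:
        t = torch.randint(0, len(sigmas), (batch,))
        noisy = add_noise(fake_images, t, sigmas)
        with torch.no_grad():
            denoised_real = teacher(noisy, t)      # pull toward the data distribution
            denoised_fake = fake_score(noisy, t)   # push away from G's current distribution
        # Distribution matching direction (score_fake - score_real), realized as a
        # surrogate loss whose gradient w.r.t. fake_images equals it (up to scaling).
        grad = denoised_fake - denoised_real
        loss_dm = 0.5 * F.mse_loss(fake_images, (fake_images - grad).detach())
        loss_gan = F.softplus(-discriminator(fake_images)).mean()
        loss_g = loss_dm + gan_weight * loss_gan
        opt_g.zero_grad()
        loss_g.backward()
        opt_g.step()
```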

Problem solved

Multi-step diffusion sampling (DDPM/DDIM) requires 25–50 network evaluations per image, making it expensive at inference time and impractical for real-time applications. Earlier distillation methods (Progressive Distillation, Consistency Models, DMD v1) either lost quality or required a costly upfront regression phase on a teacher-generated dataset. DMD2 addresses both problems: it maintains near-teacher quality at 1–4 steps, eliminates dataset pre-generation, and trains more stably thanks to TTUR.

Key mechanisms

Distribution Matching loss (KL divergence; see the formula after this list)
Two Time-scale Update Rule (TTUR)
Auxiliary GAN loss against real data
Multi-step generator (separate noise level heads)
Elimination of regression loss
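
As a sketch of the first mechanism, the distribution matching gradient can be written as follows, in the notation of the "How it works" section above; the time-dependent weight w_t and the exact noising rule are illustrative assumptions rather than the paper's precise formulation.

```latex
% Sketch: gradient of the distribution matching (KL) loss w.r.t. generator parameters.
% s_real is the frozen teacher's score, s_fake the fake score model's score,
% w_t an (assumed) time-dependent weight, and x_t a noised version of G_theta(z).
\nabla_\theta \, D_{\mathrm{KL}}\!\left(p_G \,\|\, p_{\mathrm{data}}\right)
  \approx \mathbb{E}_{z,\,t,\,\epsilon}\!\left[
    w_t \,\bigl(s_{\mathrm{fake}}(x_t, t) - s_{\mathrm{real}}(x_t, t)\bigr)\,
    \frac{\partial G_\theta(z)}{\partial \theta}
  \right],
  \qquad x_t = G_\theta(z) + \sigma_t\,\epsilon,\quad \epsilon \sim \mathcal{N}(0, I).
```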

Strengths & limitations

Strengths
1–4 inference steps vs 25–50 for DDIM
Near-teacher quality (FID comparable to SDXL)
No need to pre-generate a teacher dataset
More stable training than DMD v1 thanks to TTUR
Multi-step configuration support (quality–speed trade-off; see the sampling sketch after this list)
Limitations
Requires three networks during training (a frozen teacher, the generator, and a fake score model trained in parallel with it)
GAN discriminator adds hyperparameter sensitivity
Quality gains are concentrated in high-frequency detail; composition can be weaker in the 1-step setting
Limited to diffusion model distillation (not applicable to other generative model types)
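
A minimal sketch of how a few-step configuration might be run at inference time, following the separate noise-level heads mentioned under Key mechanisms; the step schedule, the re-noising rule, and the `heads` mapping are illustrative assumptions, not the paper's exact sampler.

```python
# Hypothetical few-step sampling loop for a distilled generator (illustration only).
# Assumes one generator head per noise level, each mapping a noisy image to a
# clean-image prediction; the noise-level schedule `sigmas` is an assumption.
import torch

@torch.no_grad()
def sample(heads, sigmas, shape, device="cpu"):
    """heads: dict mapping a noise level to a callable(noisy_image) -> clean image."""
    x = sigmas[0] * torch.randn(shape, device=device)   # start from pure noise
    for i, sigma in enumerate(sigmas):
        x0_pred = heads[sigma](x)                        # denoise at this level
        if i + 1 < len(sigmas):
            # Re-inject noise at the next (smaller) level before the next step.
            x = x0_pred + sigmas[i + 1] * torch.randn_like(x0_pred)
        else:
            x = x0_pred                                  # final step: clean image
    return x

# A 4-entry `sigmas` schedule trades a little latency for quality; a single-entry
# schedule gives the fastest (1-step) configuration.
```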

Implementation