In the forward process, Gaussian noise is gradually added to data over many steps. A neural network learns to reverse this process, predicting and removing noise step by step. During generation, the model starts from pure noise and iteratively denoises it.
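Generating high-quality images, audio, and other continuous data was challenging for earlier generative models: GANs are prone to unstable adversarial training and mode collapse, and VAEs tend to produce blurry samples. Diffusion models are trained with a stable denoising (regression-style) objective and reach comparable or better sample quality.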
A fixed, non-learnable Markov chain that iteratively corrupts a data sample x0 by adding Gaussian noise according to a predefined variance schedule {β1, ..., βT}, transforming it toward an isotropic Gaussian. Analytically tractable: xt at any timestep t can be sampled directly from x0 in closed form, without simulating the intermediate steps.
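The closed-form property can be illustrated with a minimal sketch. It assumes the standard DDPM identity x_t = sqrt(ᾱ_t)·x0 + sqrt(1 − ᾱ_t)·ε with ᾱ_t = ∏_{s≤t}(1 − β_s), and a linear schedule from 1e-4 to 0.02 over T = 1000 steps. The function names (`linear_betas`, `cumulative_alphas`, `q_sample`) are illustrative, not taken from any library.

```python
import math
import random

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced variances beta_1..beta_T (assumed schedule for this sketch)."""
    if T == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]

def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s), for t = 1..T."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t ~ q(x_t | x_0) in a single step; t is 1-indexed."""
    a_bar = alpha_bars[t - 1]
    signal_scale = math.sqrt(a_bar)
    noise_scale = math.sqrt(1.0 - a_bar)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [signal_scale * x + noise_scale * e for x, e in zip(x0, eps)]
    return x_t, eps

rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_betas(1000))
x0 = [1.0, -0.5, 0.25]          # toy "data sample"
x_t, eps = q_sample(x0, 500, alpha_bars, rng)
```

Because `q_sample` never loops over intermediate timesteps, the cost of producing a noised sample is independent of t.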
A learned Markov chain that models the reverse transition p_θ(x_{t-1}|x_t) as a Gaussian whose mean is predicted by a neural network. The variance is either fixed to a schedule-dependent constant (original DDPM) or also learned (Improved DDPM). At inference, T sequential denoising steps transform pure noise xT ~ N(0, I) into a sample x0.
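A minimal sketch of the reverse chain follows. It assumes the DDPM update x_{t-1} = (x_t − β_t/sqrt(1 − ᾱ_t)·ε_θ(x_t, t))/sqrt(1 − β_t) + σ_t·z with the fixed choice σ_t² = β_t. In place of a trained network it uses a hypothetical `oracle_eps`, which is the exact noise predictor for a toy dataset whose every sample is the zero vector, so the sampler should return (numerically) zero.

```python
import math
import random

def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def ddpm_sample(eps_model, betas, dim, rng):
    """Ancestral sampling: start at x_T ~ N(0, I), then apply T reverse steps."""
    alpha_bars = cumulative_alphas(betas)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for t in range(len(betas), 0, -1):              # t = T, ..., 1
        beta, a_bar = betas[t - 1], alpha_bars[t - 1]
        eps_hat = eps_model(x, t)                   # one model call per step
        coef = beta / math.sqrt(1.0 - a_bar)
        inv_sqrt_alpha = 1.0 / math.sqrt(1.0 - beta)
        mean = [inv_sqrt_alpha * (xi - coef * e) for xi, e in zip(x, eps_hat)]
        sigma = math.sqrt(beta) if t > 1 else 0.0   # no noise on the final step
        x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
    return x

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]  # assumed linear schedule
alpha_bars = cumulative_alphas(betas)

def oracle_eps(x_t, t):
    """Stand-in for a trained network: exact noise when all data equals zero."""
    scale = math.sqrt(1.0 - alpha_bars[t - 1])
    return [xi / scale for xi in x_t]

sample = ddpm_sample(oracle_eps, betas, dim=4, rng=random.Random(0))
```

With a real model, `eps_model` would be the trained denoising network; the loop structure is what makes sampling cost T network evaluations.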
A neural network parameterized by θ, conditioned on the noised input xt and the timestep t, that predicts the noise component ε (in DDPM parameterization) or the score function. Typically implemented as a U-Net with sinusoidal timestep embeddings and self-attention layers; alternatively a Transformer (DiT).
A sequence of hyperparameters {β1, ..., βT} specifying how much noise is added at each forward step. The schedule determines the rate at which the data distribution transitions to Gaussian noise and affects the difficulty of the reverse denoising task.
Sinusoidal or learned embedding of the integer timestep t, injected into each residual block of the denoising network to condition it on the current noise level.
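As an illustration of the sinusoidal variant, the sketch below maps an integer timestep to a vector of sines and cosines at geometrically spaced frequencies. The layout (all sines followed by all cosines) and the `max_period` of 10000 follow the common Transformer-style convention and are assumptions here; implementations differ in these details.

```python
import math

def sinusoidal_embedding(t, dim, max_period=10000.0):
    """Return a length-`dim` embedding of timestep t (dim must be even)."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    half = dim // 2
    # Frequencies decay geometrically from 1 down toward 1 / max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

emb_start = sinusoidal_embedding(0, 8)    # sines are 0, cosines are 1 at t = 0
emb_later = sinusoidal_embedding(10, 8)
```

In a full network this vector would typically pass through a small MLP before being added to, or used to modulate, the features of each residual block.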
The default DDPM reverse process requires T=1000 sequential denoising steps, each requiring a full network forward pass, making inference orders of magnitude slower than single-pass generative models like GANs.
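The linear noise schedule from the original DDPM can destroy data signal faster than necessary. Nichol and Dhariwal report that it works well for high-resolution images but is suboptimal at low resolutions (64×64 and 32×32), where the final portion of the forward process is almost pure noise and contributes little to sample quality. No single schedule is optimal across data types and resolutions.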
High classifier-free guidance (CFG) weights improve condition adherence but cause out-of-distribution denoised samples, resulting in oversaturated or artifact-ridden outputs due to a train-inference mismatch.
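The extrapolation behind this failure mode can be seen in a small sketch of the guidance combination. It assumes the convention ε̃ = ε_uncond + w·(ε_cond − ε_uncond), where w = 1 recovers the conditional prediction; some papers instead write (1 + w)·ε_cond − w·ε_uncond, which is the same rule with a shifted weight. `cfg_combine` is an illustrative name.

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Blend unconditional and conditional noise predictions with a guidance scale."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 1.0]

no_guidance = cfg_combine(eps_uncond, eps_cond, 0.0)      # equals eps_uncond
conditional = cfg_combine(eps_uncond, eps_cond, 1.0)      # equals eps_cond
strong_guidance = cfg_combine(eps_uncond, eps_cond, 3.0)  # extrapolates past eps_cond
```

For scales above 1 the result lies outside the segment between the two predictions, which is one way to see why large weights can push samples away from what the network saw in training.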
Diffusion models typically require very long training runs (hundreds of thousands to millions of gradient steps) to converge to high sample quality, especially at high resolutions.
Sohl-Dickstein et al. published 'Deep Unsupervised Learning using Nonequilibrium Thermodynamics' at ICML 2015, introducing the forward-reverse diffusion framework inspired by non-equilibrium thermodynamics as a tractable generative model.
Ho, Jain, and Abbeel published 'Denoising Diffusion Probabilistic Models' (DDPM) at NeurIPS 2020, reframing diffusion models with a simplified noise-prediction objective and achieving GAN-competitive image quality on CIFAR-10 (FID 3.17).
Song, Meng, and Ermon proposed Denoising Diffusion Implicit Models (DDIM), enabling non-Markovian sampling that reduces required inference steps from 1000 to 50–100 without retraining.
Nichol and Dhariwal published 'Improved Denoising Diffusion Probabilistic Models', introducing the cosine noise schedule and learned variance, improving log-likelihoods and generation quality.
Dhariwal and Nichol demonstrated that diffusion models with classifier guidance surpass state-of-the-art GANs on FID metrics on ImageNet 256×256, establishing diffusion models as the leading paradigm for high-quality image generation.
Song et al. published 'Score-Based Generative Modeling through Stochastic Differential Equations' (ICLR 2021), unifying DDPM and score-based generative models under a continuous-time SDE framework.
Rombach et al. published 'High-Resolution Image Synthesis with Latent Diffusion Models' (CVPR 2022), applying diffusion in a learned latent space to reduce computational cost. This work led directly to Stable Diffusion, released publicly in 2022 through a collaboration involving CompVis, Runway, and Stability AI.
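Time complexity: O(T · C_net) per sample at inference, where T is the number of denoising steps and C_net is the cost of one forward pass of the denoising network. Space complexity: O(D) for the latent state, where D is the data dimensionality, plus O(P) for the P model parameters.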
Inference requires T sequential passes through the denoising network because each step depends on the output of the previous step (Markov property), making latency proportional to T and preventing naive step-level parallelism.
Controls the number of forward and reverse Markov chain steps. Larger T generally improves sample quality but increases inference cost linearly.
Defines the variance schedule {β1, ..., βT}. Common choices: linear (original DDPM), cosine (Improved DDPM), sigmoid.
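To make the linear and cosine choices concrete, the sketch below computes the cumulative signal level ᾱ_t for both. The cosine form assumes the Improved DDPM definition ᾱ_t = f(t)/f(0) with f(t) = cos²(((t/T + s)/(1 + s))·π/2), offset s = 0.008, and β_t clipped at 0.999; the linear endpoints 1e-4 and 0.02 are the original DDPM values. Function names are illustrative.

```python
import math

def linear_alpha_bars(T, beta_start=1e-4, beta_end=0.02):
    """alpha_bar_t under a linear beta schedule."""
    out, prod = [], 1.0
    for i in range(T):
        beta = beta_start + (beta_end - beta_start) * i / (T - 1)
        prod *= 1.0 - beta
        out.append(prod)
    return out

def cosine_alpha_bars(T, s=0.008):
    """alpha_bar_t defined directly by a squared-cosine curve."""
    def f(t):
        return math.cos((t / T + s) / (1.0 + s) * math.pi / 2.0) ** 2
    return [f(t) / f(0) for t in range(1, T + 1)]

def betas_from_alpha_bars(alpha_bars, max_beta=0.999):
    """Recover per-step betas: beta_t = 1 - alpha_bar_t / alpha_bar_{t-1}, clipped."""
    betas, prev = [], 1.0
    for a_bar in alpha_bars:
        betas.append(min(1.0 - a_bar / prev, max_beta))
        prev = a_bar
    return betas

T = 1000
lin = linear_alpha_bars(T)
cos_bars = cosine_alpha_bars(T)
cos_betas = betas_from_alpha_bars(cos_bars)
mid = T // 2 - 1                      # index of timestep t = T / 2
signal_at_mid = (lin[mid], cos_bars[mid])
```

Under these defaults the cosine schedule retains noticeably more signal at the midpoint than the linear one, which is the motivation usually given for it.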
Whether the denoising network predicts the noise ε (epsilon-parameterization, standard in DDPM), the original data x0, or the score function.
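The three targets carry the same information once xt and the noise level are known. The sketch below converts between them, assuming the forward relation x_t = sqrt(ᾱ_t)·x0 + sqrt(1 − ᾱ_t)·ε and the usual identification of the score with −ε/sqrt(1 − ᾱ_t). Function names are illustrative.

```python
import math

def x0_from_eps(x_t, eps, a_bar):
    """Recover the clean sample implied by a noise prediction."""
    return [(x - math.sqrt(1.0 - a_bar) * e) / math.sqrt(a_bar) for x, e in zip(x_t, eps)]

def eps_from_x0(x_t, x0, a_bar):
    """Recover the noise implied by a clean-sample prediction."""
    return [(x - math.sqrt(a_bar) * c) / math.sqrt(1.0 - a_bar) for x, c in zip(x_t, x0)]

def score_from_eps(eps, a_bar):
    """Score of the noised distribution implied by a noise prediction."""
    return [-e / math.sqrt(1.0 - a_bar) for e in eps]

a_bar = 0.3                            # example cumulative signal level
x0 = [0.5, -1.0]
eps = [0.2, 0.7]
x_t = [math.sqrt(a_bar) * c + math.sqrt(1.0 - a_bar) * e for c, e in zip(x0, eps)]

x0_recovered = x0_from_eps(x_t, eps, a_bar)
eps_recovered = eps_from_x0(x_t, x0, a_bar)
score = score_from_eps(eps, a_bar)
```

The parameterizations differ in how prediction errors are weighted across noise levels, so the choice still affects training even though the conversions are exact.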
Architecture of the neural network parameterizing the reverse process. Affects capacity, training speed, and generalization.
Each denoising step applies the full network to the entire data tensor. The base diffusion model concept includes no expert routing or conditional activation sparsity.
Training is fully parallel: each training example is noised to its own randomly sampled timestep t in closed form, so a batch of independent examples can be processed in a single pass without simulating the chain. Inference is sequential for a single sample, but multiple samples can be generated in parallel (throughput parallelism). Approaches such as Picard iteration (ParaDiGMS) trade additional compute for lower latency by denoising several steps in parallel.
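A minimal sketch of the per-example training computation follows, assuming the simplified DDPM objective (mean squared error between the true and predicted noise) and a linear schedule. It only evaluates the loss and performs no gradient step; `zero_model` and `oracle_model` are hypothetical stand-ins for a network. The Python loop is sequential only for readability, since no iteration depends on another.

```python
import math
import random

def simple_loss(batch, eps_model, alpha_bars, rng):
    """Monte Carlo estimate of E[(eps - eps_model(x_t, t))^2], averaged per element."""
    T = len(alpha_bars)
    total, count = 0.0, 0
    for x0 in batch:                          # examples are mutually independent
        t = rng.randint(1, T)                 # fresh uniform timestep per example
        a_bar = alpha_bars[t - 1]
        eps = [rng.gauss(0.0, 1.0) for _ in x0]
        x_t = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e
               for x, e in zip(x0, eps)]
        eps_hat = eps_model(x_t, t)
        total += sum((e - h) ** 2 for e, h in zip(eps, eps_hat))
        count += len(x0)
    return total / count

T = 1000
alpha_bars, prod = [], 1.0
for i in range(T):
    prod *= 1.0 - (1e-4 + (0.02 - 1e-4) * i / (T - 1))   # assumed linear schedule
    alpha_bars.append(prod)

batch = [[0.0] * 8 for _ in range(64)]        # toy data: every example is zero

def zero_model(x_t, t):
    """Untrained stand-in that always predicts zero noise."""
    return [0.0] * len(x_t)

def oracle_model(x_t, t):
    """Exact noise predictor for the all-zero toy dataset."""
    scale = math.sqrt(1.0 - alpha_bars[t - 1])
    return [v / scale for v in x_t]

loss_untrained = simple_loss(batch, zero_model, alpha_bars, random.Random(0))
loss_oracle = simple_loss(batch, oracle_model, alpha_bars, random.Random(0))
```

The untrained stand-in yields a loss near 1 (the variance of the noise), while the oracle drives it to essentially zero, which is the gap that training closes.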
Training and inference for diffusion models involve large batches of dense floating-point operations (convolutions, attention) on image-resolution tensors, which map well to GPU Tensor Core parallelism. Training at high resolutions requires substantial VRAM.
TPUs are used to train large diffusion models (e.g., Imagen by Google Brain) and handle the dense matrix operations required by U-Net and Transformer backbones via JAX/Flax.