Architecture

Diffusion Model

Key innovation

Diffusion models introduced a generative paradigm based on learning to reverse a gradual Gaussian noising process. This enables stable training of deep generative models without adversarial objectives and without the architectural constraints of invertible flow-based networks.

How it works

In the forward process, Gaussian noise is gradually added to data over many steps. A neural network learns to reverse this process by predicting and removing the noise step by step. During generation, the model starts from pure noise and iteratively denoises it.

Problem solved

Earlier generative models struggled to produce high-quality images, audio, and other continuous data: GANs are prone to unstable training and mode collapse, and VAEs tend to produce blurry samples. Diffusion models offer higher sample quality with more stable training.

Components

Forward Diffusion Process

Gradually adds Gaussian noise to data over T steps, producing a sequence of noisy samples used as training inputs.

A fixed, non-learnable Markov chain that iteratively corrupts a data sample x0 by adding Gaussian noise according to a predefined variance schedule {β1, ..., βT}, transforming it toward an isotropic Gaussian. The process is analytically tractable: xt at any timestep t can be sampled directly from x0 in closed form.

IN: Clean data sample x0 from the training dataset; shape depends on data modality.

OUT: Noised sample xt at a given timestep t, with added Gaussian noise.
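The closed-form property can be sketched in a few lines. This is a minimal, dependency-free illustration, assuming the linear schedule endpoints β1 = 1e-4 and βT = 0.02 with T = 1000 from Ho et al. (2020). The function names `linear_betas`, `cumulative_alphas`, and `q_sample` are hypothetical, and a real implementation would operate on tensors rather than Python lists.

```python
import math
import random


def linear_betas(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Variances increasing linearly from beta_start to beta_end."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar[t] = product of (1 - beta_s) for all s <= t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def q_sample(x0, t, alpha_bars, rng=random):
    """Draw x_t ~ q(x_t | x_0) directly, without simulating t separate steps.

    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps, eps ~ N(0, I).
    Returns both x_t and the noise eps that was mixed in.
    """
    signal = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [signal * x + sigma * e for x, e in zip(x0, eps)]
    return x_t, eps


betas = linear_betas()
alpha_bars = cumulative_alphas(betas)
x0 = [0.5, -1.0, 0.25]                       # toy data sample; timesteps are 0-indexed
x_mid, _ = q_sample(x0, 499, alpha_bars)     # partially noised
x_end, _ = q_sample(x0, 999, alpha_bars)     # last forward step
```

Under this schedule `alpha_bars[-1]` falls below 1e-4, so `x_end` retains almost none of the original signal.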

Linear noise schedule: Variances β increase linearly from β1 to βT, as used in the original DDPM paper by Ho et al. (2020).

Cosine noise schedule: The cumulative signal level ᾱt follows a squared-cosine curve, introduced by Nichol and Dhariwal (2021) in Improved DDPM to avoid destroying information too early in the forward process.
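As a rough illustration of the cosine variant, the sketch below follows the formulation in Nichol and Dhariwal (2021): the cumulative signal level is defined first and the per-step variances are derived from it. The offset s = 0.008 and the clipping value 0.999 are taken from that paper; the function names are hypothetical.

```python
import math


def cosine_alpha_bars(num_steps=1000, s=0.008):
    """alpha_bar(t) = f(t) / f(0), with f(t) = cos^2(((t / T) + s) / (1 + s) * pi / 2)."""
    def f(t):
        return math.cos((t / num_steps + s) / (1.0 + s) * math.pi / 2.0) ** 2
    return [f(t) / f(0) for t in range(num_steps + 1)]      # indices 0 .. T


def cosine_betas(num_steps=1000, s=0.008, max_beta=0.999):
    """beta_t = 1 - alpha_bar(t) / alpha_bar(t - 1), clipped so beta stays below 1 near t = T."""
    bars = cosine_alpha_bars(num_steps, s)
    return [min(1.0 - bars[t] / bars[t - 1], max_beta) for t in range(1, num_steps + 1)]


alpha_bars = cosine_alpha_bars()
betas = cosine_betas()
```

Halfway through this schedule roughly half of the signal variance remains (`alpha_bars[500]` is about 0.49), whereas a linear schedule from 1e-4 to 0.02 has already dropped to roughly 0.08 at the same point.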

Reverse Diffusion Process

Iteratively denoises a noisy sample over T steps, progressing from pure Gaussian noise to a sample from the data distribution.

A learned Markov chain that models the reverse transition p_θ(x_{t-1}|x_t) as a Gaussian whose mean, and in some variants variance, is predicted by a neural network. At inference, T sequential denoising steps transform pure noise xT ~ N(0, I) into a sample x0.
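A single reverse step and the full sampling loop can be sketched as follows. This is a minimal illustration, assuming the noise-prediction parameterization with the fixed variance σt² = βt (one of the choices in Ho et al., 2020). `predict_noise` is a hypothetical stand-in for a trained network, and the zero predictor used at the end only exercises the loop; its output is meaningless.

```python
import math
import random


def p_sample_step(x_t, t, eps_pred, betas, alpha_bars, rng=random):
    """One ancestral sampling step x_t -> x_{t-1} (t is 0-indexed).

    mean = (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps_pred) / sqrt(1 - beta_t)
    The variance is fixed to beta_t; no noise is added at the final step t = 0.
    """
    beta_t = betas[t]
    coef = beta_t / math.sqrt(1.0 - alpha_bars[t])
    scale = 1.0 / math.sqrt(1.0 - beta_t)
    mean = [scale * (x - coef * e) for x, e in zip(x_t, eps_pred)]
    if t == 0:
        return mean
    sigma = math.sqrt(beta_t)
    return [m + sigma * rng.gauss(0.0, 1.0) for m in mean]


def sample(predict_noise, dim, betas, alpha_bars, rng=random):
    """Run the full reverse chain from x_T ~ N(0, I) down to x_0."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for t in reversed(range(len(betas))):
        x = p_sample_step(x, t, predict_noise(x, t), betas, alpha_bars, rng)
    return x


# Linear schedule (assumed endpoints 1e-4 and 0.02, T = 1000).
betas = [1e-4 + i * (0.02 - 1e-4) / 999 for i in range(1000)]
alpha_bars, running = [], 1.0
for b in betas:
    running *= 1.0 - b
    alpha_bars.append(running)

# Placeholder predictor: a trained denoising network would go here.
x0 = sample(lambda x, t: [0.0] * len(x), dim=4, betas=betas, alpha_bars=alpha_bars)
```

Each loop iteration needs one call to the network, which is why the number of steps T dominates sampling cost.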

Denoising network (backbone)

Noise-prediction network conditioned on the timestep index, estimating either the noise or the mean of the reverse distribution at each denoising step.

A neural network parameterized by θ, conditioned on the noised input xt and the timestep t, that predicts the noise component ε (in DDPM parameterization) or the score function. Typically implemented as a U-Net with sinusoidal timestep embeddings and self-attention layers; alternatively a Transformer (DiT).
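In the DDPM parameterization the network is trained by regressing the noise that was mixed into a sample. The sketch below shows one term of that simplified objective, assuming a mean-squared-error loss and a uniformly sampled timestep. `predict_noise`, `zero_predictor`, and the toy schedule are hypothetical placeholders, not part of any library.

```python
import math
import random


def simple_loss(eps_true, eps_pred):
    """Mean squared error between the true and the predicted noise."""
    return sum((a - b) ** 2 for a, b in zip(eps_true, eps_pred)) / len(eps_true)


def training_example(x0, alpha_bars, predict_noise, rng=random):
    """Build one loss term: pick t, noise x0 to x_t, score the prediction."""
    t = rng.randrange(len(alpha_bars))               # uniform timestep
    eps = [rng.gauss(0.0, 1.0) for _ in x0]          # regression target
    a = alpha_bars[t]
    x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return simple_loss(eps, predict_noise(x_t, t))


def zero_predictor(x_t, t):
    """Placeholder 'network' that always predicts zero noise."""
    return [0.0] * len(x_t)


alpha_bars = [0.99 ** (t + 1) for t in range(100)]   # toy schedule, for illustration only
loss = training_example([0.5, -1.0, 0.25], alpha_bars, zero_predictor)
```

In a real training loop the loss would be averaged over a batch and backpropagated through the network; here it is only evaluated.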

U-Net backbone: Encoder-decoder architecture with skip connections and residual blocks, used in the original DDPM and most image-generation diffusion models.

Diffusion Transformer (DiT): Transformer-based backbone for the denoising network, using patch embeddings of the noised input, as used in Sora and related video-generation models.


Noise Schedule

Defines the variance schedule {β1, ..., βT} controlling the rate of noise addition during the diffusion process, directly affecting generation quality and training stability.

A sequence of hyperparameters {β1, ..., βT} specifying how much noise is added at each forward step. The schedule determines the rate at which the data distribution transitions to Gaussian noise and affects the difficulty of the reverse denoising task.


Timestep Embedding

Encodes the timestep index t as a continuous vector and injects it into the denoising network, enabling the model to adapt its behavior to the current noise level.

Sinusoidal or learned embedding of the integer timestep t, injected into each residual block of the denoising network to condition it on the current noise level.
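One common sinusoidal variant can be sketched as below. The ordering of sine and cosine halves, the `max_period` of 10000, and the zero-padding for odd dimensions are assumptions of this sketch; implementations differ in these details. The name `timestep_embedding` is hypothetical here.

```python
import math


def timestep_embedding(t, dim, max_period=10000.0):
    """Map an integer timestep to a vector of sines and cosines.

    Frequencies are spaced geometrically, starting at 1 and decreasing
    toward 1 / max_period.
    """
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    emb = [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]
    if dim % 2 == 1:                # pad odd dimensions with a zero
        emb.append(0.0)
    return emb


emb_0 = timestep_embedding(0, 8)
emb_500 = timestep_embedding(500, 8)
```

In a typical network this vector is passed through a small MLP and then added to, or used to scale, the activations of each residual block.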


Implementation

Reference implementations

hojonathanho/diffusion (original DDPM)

Python (TensorFlow) · Jonathan Ho · Official

Hugging Face Diffusers

Python (PyTorch) · Hugging Face

openai/improved-diffusion

Python (PyTorch) · OpenAI · Official

Implementation pitfalls

Very slow inference due to high step count · Severity: High

The default DDPM reverse process requires T=1000 sequential denoising steps, each requiring a full network forward pass, making inference orders of magnitude slower than single-pass generative models like GANs.

Fix: Use accelerated samplers such as DDIM, DPM-Solver, or PNDM to reduce effective steps to 20–100. Alternatively, use Latent Diffusion Models to operate in a compressed latent space.
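To make the speed-up concrete, the sketch below shows a deterministic DDIM update (η = 0) applied over a strided subsequence of the training timesteps. It assumes the number of training steps is divisible by the number of sampling steps; the function names are hypothetical and the predicted noise would come from the trained network.

```python
import math


def strided_timesteps(num_train_steps=1000, num_sample_steps=50):
    """Evenly spaced subsequence of training timesteps, in descending order."""
    stride = num_train_steps // num_sample_steps
    return list(range(0, num_train_steps, stride))[::-1]


def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """Deterministic DDIM update (eta = 0) from timestep t to an earlier one.

    First estimate x0 from the predicted noise, then re-noise that estimate
    to the earlier noise level using the same predicted noise.
    """
    sqrt_a, sqrt_1ma = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    x0_pred = [(x - sqrt_1ma * e) / sqrt_a for x, e in zip(x_t, eps_pred)]
    sqrt_p, sqrt_1mp = math.sqrt(alpha_bar_prev), math.sqrt(1.0 - alpha_bar_prev)
    return [sqrt_p * x0 + sqrt_1mp * e for x0, e in zip(x0_pred, eps_pred)]


timesteps = strided_timesteps()     # 50 network evaluations instead of 1000
```

Because the update only needs the cumulative signal levels at the two timesteps involved, it can jump across many training steps at once without retraining the model.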

Noise schedule mismatch relative to data resolution and domain · Severity: Medium

The linear noise schedule from the original DDPM adds noise faster than necessary, which is most harmful for low-resolution images (Nichol and Dhariwal report this at 64×64 and 32×32): the final forward steps are almost pure noise and contribute little to sample quality. No single schedule is optimal across data types and resolutions.

Fix: Use a cosine noise schedule (Nichol and Dhariwal, 2021) or explore schedules tailored to the specific domain and data resolution.

Image saturation at high classifier-free guidance weights · Severity: Medium

High classifier-free guidance (CFG) weights improve condition adherence but cause out-of-distribution denoised samples, resulting in oversaturated or artifact-ridden outputs due to a train-inference mismatch.

Fix: Use dynamic thresholding (Saharia et al., Imagen) or carefully tune the CFG weight. Values in the 5–15 range are typical for text-to-image; exceeding this range risks quality degradation.
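Both ideas fit in a few lines. The sketch below assumes the guidance convention in which a weight of 1 recovers the conditional prediction (the convention under which values of 5–15 are typical), and it approximates dynamic thresholding with a simple nearest-rank percentile. The function names and the 0.995 default are illustrative assumptions.

```python
def cfg_combine(eps_uncond, eps_cond, weight):
    """Extrapolate from the unconditional toward the conditional prediction.

    weight = 1 recovers the conditional prediction; larger values push further.
    """
    return [u + weight * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def dynamic_threshold(x0_pred, percentile=0.995):
    """Clip the predicted x0 to [-s, s] and rescale by s.

    s is a high percentile of |x0_pred| and is never below 1, so predictions
    already inside [-1, 1] pass through unchanged.
    """
    magnitudes = sorted(abs(v) for v in x0_pred)
    rank = min(len(magnitudes) - 1, int(percentile * len(magnitudes)))
    s = max(1.0, magnitudes[rank])      # nearest-rank percentile, a simplification
    return [max(-s, min(s, v)) / s for v in x0_pred]


guided = cfg_combine([0.0, 1.0], [1.0, 3.0], weight=7.5)
rescaled = dynamic_threshold([4.0, -2.0, 1.0])
```

The thresholding is applied to the predicted clean sample at every sampling step, which keeps pixel values in the range the model saw during training even at high guidance weights.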

Insufficient number of training steps · Severity: High

Diffusion models typically require very long training runs (hundreds of thousands to millions of gradient steps) to converge to high sample quality, especially at high resolutions.

Fix: Monitor FID on a held-out validation set throughout training. Maintain an exponential moving average (EMA) of the model weights and sample from the EMA weights, which typically produce better samples than the raw weights.
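The EMA update itself is simple, as the sketch below shows. It uses scalar parameters in a dictionary as a stand-in for model tensors; `ema_update` is a hypothetical helper, and the 0.9999 default decay is a commonly used value rather than a requirement.

```python
def ema_update(ema_params, params, decay=0.9999):
    """In-place update: ema <- decay * ema + (1 - decay) * current."""
    for name, value in params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * value
    return ema_params


# Toy "model" with scalar parameters; real code would iterate over tensors.
params = {"w": 0.0, "b": 0.0}
ema_params = dict(params)               # the EMA copy starts at the initial weights
for step in range(1, 101):
    # Fake optimizer step, standing in for a gradient update.
    params = {"w": params["w"] + 0.1, "b": params["b"] - 0.05}
    ema_update(ema_params, params, decay=0.99)
```

The EMA copy is updated after every optimizer step and used only for evaluation and sampling; gradients are still computed on the raw weights.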

Evolution

Original paper · 2015 · ICML 2015 · Jascha Sohl-Dickstein

Deep Unsupervised Learning using Nonequilibrium Thermodynamics

Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, Surya Ganguli

2015

First formal definition of diffusion generative models (Sohl-Dickstein et al.)

Inflection point

Sohl-Dickstein et al. published 'Deep Unsupervised Learning using Nonequilibrium Thermodynamics' at ICML 2015, introducing the forward-reverse diffusion framework inspired by non-equilibrium thermodynamics as a tractable generative model.

Deep Unsupervised Learning using Nonequilibrium Thermodynamics (paper)

2020

DDPM: practical high-quality image generation (Ho et al.)

Inflection point

Ho, Jain, and Abbeel published 'Denoising Diffusion Probabilistic Models' (DDPM) at NeurIPS 2020, reframing diffusion models with a simplified noise-prediction objective and achieving GAN-competitive image quality on CIFAR-10 (FID 3.17).

Denoising Diffusion Probabilistic Models (paper)

2020

DDIM: accelerated deterministic sampling (Song et al.)

Inflection point

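Song, Meng, and Ermon proposed Denoising Diffusion Implicit Models (DDIM), which generalize DDPM to non-Markovian forward processes that share the same training objective. The resulting sampler can be made deterministic and reduces the required inference steps from 1000 to 50–100 without retraining.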

Denoising Diffusion Implicit Models (paper)

2021

Improved DDPM: cosine schedule and log-likelihood (Nichol and Dhariwal)

Nichol and Dhariwal published 'Improved Denoising Diffusion Probabilistic Models', introducing the cosine noise schedule and learned variance, improving log-likelihoods and generation quality.

Improved Denoising Diffusion Probabilistic Models (paper)

2021

Diffusion Models surpass GANs in image synthesis (Dhariwal and Nichol)

Inflection point

Dhariwal and Nichol demonstrated that diffusion models with classifier guidance surpass state-of-the-art GANs on FID metrics on ImageNet 256×256, establishing diffusion models as the leading paradigm for high-quality image generation.

Diffusion Models Beat GANs on Image Synthesis (paper)

2021

Unification via SDE (Song et al.)

Song et al. published 'Score-Based Generative Modeling through Stochastic Differential Equations' (ICLR 2021), unifying DDPM and score-based generative models under a continuous-time SDE framework.

Score-Based Generative Modeling through Stochastic Differential Equations (paper)

2022

Latent Diffusion Models and Stable Diffusion (Rombach et al.)

Inflection point

Rombach et al. published 'High-Resolution Image Synthesis with Latent Diffusion Models' (CVPR 2022), applying diffusion in a learned latent space to reduce computational cost. This work led directly to Stable Diffusion, open-sourced by Stability AI.

High-Resolution Image Synthesis with Latent Diffusion Models (paper)

Sources

Denoising Diffusion Probabilistic Models
