The LDM pipeline has two separate training stages and one inference stage. Training stage 1: an autoencoder (encoder E + decoder D) is trained on image data with a loss combining perceptual reconstruction (LPIPS), regularization (KL or VQ), and an adversarial discriminator. Its weights are then frozen. Training stage 2: for an image x, compute the latent z = E(x). The forward diffusion adds Gaussian noise ε ~ 𝒩(0,I): z_t = √(α̅_t)·z + √(1-α̅_t)·ε. A U-Net ε_θ(z_t, t, c) learns to predict the noise ε conditioned on an embedding c (for text, injected via cross-attention). Inference: start from pure noise z_T ~ 𝒩(0,I) and iteratively denoise using a scheduler (DDIM, DPM-Solver, Euler) over 20-50 steps, obtaining ẑ_0. The final image is x̂ = D(ẑ_0). Classifier-free guidance (CFG) strengthens conditioning by extrapolating from the unconditional prediction toward the conditional one with guidance scale w: ε̃ = ε_θ(z_t, t, ∅) + w·(ε_θ(z_t, t, c) − ε_θ(z_t, t, ∅)).
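The forward-noising formula and the CFG combination can be sketched in a few lines of plain Python. This is only an illustration on toy lists, not any library's implementation. The linear beta range (1e-4 to 0.02) is an assumption taken from the common DDPM defaults, and the helper names `alpha_bar_linear`, `add_noise`, and `cfg_combine` are hypothetical.

```python
import math
import random


def alpha_bar_linear(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule.

    The default beta range is an assumption (common DDPM values).
    """
    alpha_bars, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(z, eps, alpha_bar_t):
    """Forward diffusion: z_t = sqrt(alpha_bar_t) * z + sqrt(1 - alpha_bar_t) * eps."""
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    return [a * zi + b * ei for zi, ei in zip(z, eps)]


def cfg_combine(eps_uncond, eps_cond, w):
    """Classifier-free guidance: eps_uncond + w * (eps_cond - eps_uncond)."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]


rng = random.Random(0)
z = [0.5, -1.0, 2.0]                      # toy flattened latent
eps = [rng.gauss(0.0, 1.0) for _ in z]    # Gaussian noise sample
alpha_bars = alpha_bar_linear()
z_t = add_noise(z, eps, alpha_bars[500])  # noisy latent at an intermediate step
```

With w = 1 the combination reduces to the conditional prediction, and with w = 0 to the unconditional one. Values above 1 push the prediction further along the conditional direction.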
Classical pixel-space diffusion models (DDPM, ADM) require enormous compute and memory because the U-Net must operate on image-resolution tensors over hundreds of denoising steps. Training SoTA pixel-space diffusion at 256×256 takes hundreds of GPU-days. LDM reduces this by an order of magnitude by running diffusion in a 4×-16× downsampled latent space, enabling 512-1024 px generation on a single consumer-grade GPU.
Encoder E: x → z and decoder D: z → x. Trained with LPIPS + KL/VQ + adversarial discriminator. In SD 1.x the downsampling factor is 8× (latent 64×64×4 for a 512×512 image).
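The shape arithmetic above can be checked with a small sketch. The helper names are hypothetical, and the 3-channel RGB input is an assumption.

```python
def latent_shape(height, width, f=8, channels=4):
    """Latent tensor shape (H/f, W/f, C) for an image of the given size."""
    if height % f or width % f:
        raise ValueError("image size must be divisible by the downsampling factor")
    return (height // f, width // f, channels)


def compression_ratio(height, width, f=8, channels=4, image_channels=3):
    """How many times fewer values the latent holds than the RGB image."""
    h, w, c = latent_shape(height, width, f, channels)
    return (height * width * image_channels) / (h * w * c)


shape = latent_shape(512, 512)        # (64, 64, 4), matching the SD 1.x figure above
ratio = compression_ratio(512, 512)   # 786432 / 16384 = 48.0
```

Under these assumptions the U-Net processes 48 times fewer values per sample than a pixel-space model would at 512×512.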
U-Net with residual blocks plus self-attention and cross-attention layers. Inputs: noisy latent z_t and timestep t (sinusoidal embedding). Output: predicted noise ε.
External encoder (CLIP text encoder, T5, OpenCLIP) producing token vectors fed into the U-Net via cross-attention.
Q from U-Net activations, K/V from conditioning embedding c. Realizes text and multimodal conditioning.
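A minimal single-head sketch of this cross-attention, in plain Python on lists of rows, may help make the Q/K/V roles concrete. Real implementations are multi-head, batched, and run on tensors. The projection matrices `w_q`, `w_k`, `w_v` and all function names here are illustrative assumptions.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(h, c, w_q, w_k, w_v):
    """Single-head cross-attention.

    h: (n_latent, d_h) U-Net activations, one row per spatial position.
    c: (n_tokens, d_c) conditioning embedding, one row per token.
    Returns the attended values and the attention weights.
    """
    q = matmul(h, w_q)                       # queries come from the U-Net
    k = matmul(c, w_k)                       # keys come from the conditioning
    v = matmul(c, w_v)                       # values come from the conditioning
    scale = 1.0 / math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]     # transpose keys to (d, n_tokens)
    scores = matmul(q, k_t)                  # (n_latent, n_tokens)
    weights = [softmax([s * scale for s in row]) for row in scores]
    return matmul(weights, v), weights


identity = [[1.0, 0.0], [0.0, 1.0]]
h = [[1.0, 0.0], [0.0, 1.0]]                 # two latent positions
c = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]     # three conditioning tokens
out, attn = cross_attention(h, c, identity, identity, identity)
```

Each latent position receives a weighted mix of token values, so the output has one row per latent position regardless of how many tokens the conditioning contains.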
Classical noise schedules: linear, cosine, scaled-linear. Inference samplers: DDIM, DDPM, DPM-Solver, Euler, Heun, UniPC.
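As an illustration of what one of these samplers does per step, here is a sketch of a single deterministic DDIM update (the η = 0 case) on toy lists. The function name and argument layout are hypothetical. A real sampler also handles timestep selection, batching, and the model call that produces `eps_pred`.

```python
import math


def ddim_step(z_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0) from timestep t to an earlier one."""
    sqrt_ab_t = math.sqrt(alpha_bar_t)
    sqrt_1m_ab_t = math.sqrt(1.0 - alpha_bar_t)
    # Estimate the clean latent implied by the predicted noise.
    z0_pred = [(zt - sqrt_1m_ab_t * e) / sqrt_ab_t for zt, e in zip(z_t, eps_pred)]
    # Re-noise that estimate to the earlier timestep, reusing the same predicted noise.
    a = math.sqrt(alpha_bar_prev)
    b = math.sqrt(1.0 - alpha_bar_prev)
    return [a * z0 + b * e for z0, e in zip(z0_pred, eps_pred)]


# Sanity check: with a perfect noise prediction, stepping to alpha_bar = 1
# recovers the clean latent.
z0 = [0.5, -1.0]
eps = [0.3, -0.7]
ab_t = 0.25
z_t = [math.sqrt(ab_t) * z + math.sqrt(1.0 - ab_t) * e for z, e in zip(z0, eps)]
recovered = ddim_step(z_t, eps, ab_t, 1.0)
```

Because the update has no random term, the same starting noise and the same predictions always give the same result, which is one reason DDIM can use far fewer steps than the training chain.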
The autoencoder is a bottleneck — anything that cannot be reconstructed from the latent is lost regardless of U-Net quality.
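Standard (linear and scaled-linear) schedules do not reach full noise at T: the terminal signal-to-noise ratio stays above zero, so the model never trains on pure noise even though inference starts from it. This mismatch biases samples toward medium brightness and muted contrast.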
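How far a schedule falls short of pure noise at T can be checked numerically. The sketch below assumes a scaled-linear schedule with beta range 0.00085 to 0.012 over 1000 steps, values commonly used in SD 1.x scheduler configs but stated here as an assumption. The function names are hypothetical.

```python
import math


def scaled_linear_betas(num_steps=1000, beta_start=0.00085, beta_end=0.012):
    """Betas spaced linearly in sqrt(beta); the default range is an assumption."""
    lo, hi = math.sqrt(beta_start), math.sqrt(beta_end)
    return [(lo + (hi - lo) * i / (num_steps - 1)) ** 2 for i in range(num_steps)]


def terminal_sqrt_alpha_bar(betas):
    """sqrt(alpha_bar_T): the fraction of the clean signal left at the last step."""
    prod = 1.0
    for beta in betas:
        prod *= 1.0 - beta
    return math.sqrt(prod)


def terminal_snr(betas):
    """Signal-to-noise ratio alpha_bar_T / (1 - alpha_bar_T) at the last step."""
    ab = terminal_sqrt_alpha_bar(betas) ** 2
    return ab / (1.0 - ab)


betas = scaled_linear_betas()
signal_left = terminal_sqrt_alpha_bar(betas)  # small but nonzero
snr_T = terminal_snr(betas)                   # small but nonzero
```

A zero-SNR schedule would make both quantities exactly zero, so that z_T during training is pure noise, as it is at the start of inference.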
Excessive classifier-free guidance produces oversaturation, posterization, and strange textures.
A model trained at e.g. 512×512 generates repeated objects at 1024×1024 (the "double heads" issue).
Ho et al. introduce Denoising Diffusion Probabilistic Models in pixel space.
Ho & Salimans introduce CFG, the key conditioning mechanism in later LDMs.
The first Rombach et al. preprint introduces the idea of diffusion in autoencoder latent space.
Stability AI / RunwayML release SD 1.4/1.5 based on LDM, democratizing text-to-image generation.
SDXL: larger U-Net (2.6B parameters), two text encoders, two-stage refinement; 1024 px natively.
Extension of LDM to video sequences with temporal attention blocks.
SD3 replaces the U-Net with an MMDiT architecture and switches from the DDPM noise-prediction objective to rectified flow (trained with flow matching).
Chi et al. show that an LDM-style architecture effectively models the action distribution in robotic manipulation.
Ratio of image resolution to latent resolution (typically 4, 8, 16). f=8 is the SD 1.x standard.
Number of latent channels (4 in SD 1.x/2.x, 16 in SD3 for better reconstruction).
Forward chain length (usually 1000 in training, 20-50 at inference with DDIM/DPM-Solver).
Linear, cosine, scaled-linear, zero-SNR — strongly affects quality and contrast.
Classifier-free guidance strength (typically 5-12 for image, 1-3 for video).
What the network predicts: noise ε, original x₀, or v-prediction (better for video and SD2).
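The three prediction targets are linear reparameterizations of one another, given z_t and α̅_t. The sketch below assumes the common v-prediction definition v = √(α̅_t)·ε − √(1−α̅_t)·x₀ (with x₀ standing for the clean latent in an LDM). The function names are hypothetical.

```python
import math


def v_target(x0, eps, alpha_bar_t):
    """v-prediction target: sqrt(alpha_bar) * eps - sqrt(1 - alpha_bar) * x0."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [a * e - b * x for x, e in zip(x0, eps)]


def from_v(z_t, v, alpha_bar_t):
    """Recover the clean-sample and noise estimates from a v prediction."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    x0 = [a * zt - b * vi for zt, vi in zip(z_t, v)]
    eps = [b * zt + a * vi for zt, vi in zip(z_t, v)]
    return x0, eps


# Round trip on toy values: noise a sample, form v, then invert.
x0 = [0.5, -1.0, 2.0]
eps = [0.1, 0.4, -0.3]
ab = 0.36
z_t = [math.sqrt(ab) * x + math.sqrt(1.0 - ab) * e for x, e in zip(x0, eps)]
v = v_target(x0, eps, ab)
x0_back, eps_back = from_v(z_t, v, ab)
```

Since the conversions are exact, the choice of target changes how the loss is weighted across timesteps rather than what information the network must produce.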
The entire U-Net (or DiT) is active at each denoising step.
Training is fully batch-parallel. Inference requires sequential denoising steps (20-1000), but each step is a dense U-Net forward pass that fully exploits GPU parallelism.
Latent-space diffusion and the U-Net convolution/attention ops map ideally onto tensor cores. SD 1.5 fits in 4 GB VRAM, SDXL in 8-12 GB.
Training and inference in JAX/TPU are well supported (e.g. Diffusers has a Flax backend).
CPU inference is possible but extremely slow (minutes per image), even with AVX/MKL optimizations.