LDM

Stable Diffusion (CompVis)

Generative Models (Stability AI, SDXL/SD3)

Diffusers (Hugging Face)

Original paper · 2022 · CVPR 2022 · Robin Rombach

Implementation pitfalls

Weak VAE reconstructions · High

The autoencoder is a bottleneck — anything that cannot be reconstructed from the latent is lost regardless of U-Net quality.

Fix: Use a better autoencoder (more latent channels, as in SD3's 16-channel VAE), fine-tune the VAE for the task, or use a smaller downsampling factor.
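To make the bottleneck concrete, the sketch below counts how many values the latent keeps per image. It assumes the commonly cited geometries (downsampling factor 8 with 4 latent channels for SD 1.x, and factor 8 with 16 channels for SD3); the helper name `latent_compression` is hypothetical and only does arithmetic, it does not touch any real VAE.

```python
def latent_compression(height, width, downsample, latent_channels, image_channels=3):
    """Return (latent_shape, compression_ratio) for an autoencoder geometry."""
    if height % downsample or width % downsample:
        raise ValueError("height and width must be divisible by the downsampling factor")
    latent_shape = (latent_channels, height // downsample, width // downsample)
    pixel_values = image_channels * height * width
    latent_values = latent_channels * latent_shape[1] * latent_shape[2]
    return latent_shape, pixel_values / latent_values


# Assumed geometries: SD 1.x (f=8, 4 channels) and SD3 (f=8, 16 channels).
sd1_shape, sd1_ratio = latent_compression(512, 512, downsample=8, latent_channels=4)
sd3_shape, sd3_ratio = latent_compression(512, 512, downsample=8, latent_channels=16)

print(sd1_shape, sd1_ratio)  # (4, 64, 64) 48.0
print(sd3_shape, sd3_ratio)  # (16, 64, 64) 12.0
```

Under these assumptions the 4-channel latent stores 48 times fewer values than the RGB image, while the 16-channel latent stores 12 times fewer, which is one way to see why more channels or a smaller downsampling factor leave more room for fine detail.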

Zero-SNR and image bleaching · Medium

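Standard (linear) schedules leave a small amount of signal at the final step T, so the model never trains on pure noise even though sampling starts from it. The mismatch pushes generated images toward medium brightness and muted contrast.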

Fix: Rescale the schedule to zero terminal SNR and train with v-prediction (Common Diffusion Noise Schedules paper).
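The rescaling step can be sketched in plain Python. This follows the idea described in the Common Diffusion Noise Schedules paper (shift and scale the square root of the cumulative alpha product so that its last value is exactly zero while its first value is unchanged). The "scaled linear" schedule and its start/end values are assumptions chosen to resemble the Stable Diffusion defaults, and the function names are illustrative.

```python
import math


def scaled_linear_betas(num_steps=1000, beta_start=0.00085, beta_end=0.012):
    """Assumed SD-like schedule: linear in sqrt(beta), then squared."""
    lo, hi = math.sqrt(beta_start), math.sqrt(beta_end)
    return [(lo + (hi - lo) * i / (num_steps - 1)) ** 2 for i in range(num_steps)]


def sqrt_alphas_cumprod(betas):
    """sqrt(alpha_bar_t) for every step; this is the signal scale at step t."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(math.sqrt(prod))
    return out


def rescale_zero_terminal_snr(betas):
    """Shift and scale sqrt(alpha_bar) so the last step carries no signal."""
    s = sqrt_alphas_cumprod(betas)
    s_first, s_last = s[0], s[-1]
    # Shift so the last value is zero, then scale so the first value is kept.
    s = [(x - s_last) * s_first / (s_first - s_last) for x in s]
    alpha_bar = [x * x for x in s]
    # Recover per-step alphas from the cumulative product, then betas.
    alphas = [alpha_bar[0]] + [alpha_bar[i] / alpha_bar[i - 1] for i in range(1, len(alpha_bar))]
    return [1.0 - a for a in alphas]


betas = scaled_linear_betas()
fixed = rescale_zero_terminal_snr(betas)

print("terminal signal scale before:", sqrt_alphas_cumprod(betas)[-1])
print("terminal signal scale after: ", sqrt_alphas_cumprod(fixed)[-1])
```

With the rescaled schedule the last beta equals 1, so noise prediction becomes ill-posed at step T; that is why the fix pairs the schedule change with v-prediction.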

CFG artifacts at high w · Medium

Excessive classifier-free guidance produces oversaturation, posterization, and strange textures.

Fix: Dynamic thresholding (Imagen), CFG rescale, or a smaller w (4-7) for SDXL/SD3.
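As a rough illustration of the CFG rescale option, the sketch below applies guidance to a flat list of prediction values and then pulls the standard deviation of the guided output back toward that of the conditional prediction. The blend factor `phi=0.7` is an assumed default, and operating on one flat list per sample is a simplification; a real implementation would work on tensors and compute the statistics per sample.

```python
from statistics import pstdev


def cfg(cond, uncond, w):
    """Classifier-free guidance: move from the unconditional toward the conditional output."""
    return [u + w * (c - u) for c, u in zip(cond, uncond)]


def cfg_rescale(cond, uncond, w, phi=0.7):
    """Guidance followed by a std rescale; phi blends rescaled and raw outputs."""
    guided = cfg(cond, uncond, w)
    std_guided = pstdev(guided)
    if std_guided == 0.0:
        return guided  # nothing to rescale
    scale = pstdev(cond) / std_guided
    rescaled = [g * scale for g in guided]
    return [phi * r + (1.0 - phi) * g for r, g in zip(rescaled, guided)]


# Toy values standing in for model predictions (purely illustrative).
cond = [1.0, -1.0, 0.5, -0.5]
uncond = [0.2, -0.1, 0.0, 0.1]

raw = cfg(cond, uncond, w=7.5)
tamed = cfg_rescale(cond, uncond, w=7.5)
print("std cond:", pstdev(cond))
print("std raw CFG:", pstdev(raw))
print("std rescaled CFG:", pstdev(tamed))
```

At high w the raw guided output has a much larger spread than the conditional prediction, which is one plausible source of oversaturation; the rescale keeps the guidance direction but limits that growth.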

Resolution mismatch · Medium

A model trained at e.g. 512×512 generates repeated objects at 1024×1024 (the "double heads" issue).

Fix: Resolution conditioning (SDXL), MultiDiffusion, hierarchical sampling.
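One way to read "resolution conditioning" is SDXL-style size conditioning, where the original image size, crop offset, and target size are embedded and fed to the denoiser alongside the timestep. The sketch below shows only the embedding step. The 256-dimensional sinusoidal embedding per scalar, the ordering of the six scalars, and both function names are assumptions for illustration, not the layout of any particular library.

```python
import math


def sinusoidal_embedding(value, dim=256, max_period=10000.0):
    """Standard sin/cos embedding of one scalar (dim is an assumed size)."""
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(value * f) for f in freqs] + [math.cos(value * f) for f in freqs]


def size_conditioning(original_size, crop_top_left, target_size, dim=256):
    """Concatenate embeddings of the six size scalars into one conditioning vector."""
    scalars = [*original_size, *crop_top_left, *target_size]
    vector = []
    for s in scalars:
        vector.extend(sinusoidal_embedding(float(s), dim))
    return vector


low_res = size_conditioning((512, 512), (0, 0), (512, 512))
high_res = size_conditioning((1024, 1024), (0, 0), (1024, 1024))
print(len(low_res), low_res != high_res)
```

Because the two vectors differ, a denoiser trained with this signal can in principle learn that a 1024×1024 canvas should hold one subject rather than a tiling of 512×512 compositions.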

Evolution

High-Resolution Image Synthesis with Latent Diffusion Models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer

2020

DDPM — foundations of diffusion models

Ho et al. introduce Denoising Diffusion Probabilistic Models in pixel space.

Diffusion Model (concept) · Denoising Diffusion Probabilistic Models (paper)

2021

Classifier-Free Guidance

Ho & Salimans introduce CFG, the key conditioning mechanism in later LDMs.

Classifier-Free Diffusion Guidance (paper)

2021

LDM preprint (CompVis)

Inflection point

The first Rombach et al. preprint introduces the idea of diffusion in autoencoder latent space.

2022

Stable Diffusion 1.x — first open SoTA T2I

Inflection point

Stability AI / RunwayML release SD 1.4/1.5 based on LDM, democratizing text-to-image generation.

2023

SDXL — scaling LDM

Larger U-Net (2.6B), two text encoders, two-stage refinement; 1024 px natively.

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis (paper)

2023

Stable Video Diffusion / AnimateDiff

Extension of LDM to video sequences with temporal attention blocks.

2024

SD3 — Diffusion Transformer + rectified flow

Inflection point

SD3 replaces the U-Net with an MMDiT architecture and switches from DDPM to rectified flow matching.

Scaling Rectified Flow Transformers for High-Resolution Image Synthesis (paper)
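For readers unfamiliar with the rectified flow objective, a minimal sketch of how one training pair could be built follows. It assumes the straight-line interpolation z_t = (1 - t)·x0 + t·ε with velocity target ε - x0; the function name and the use of plain lists in place of latent tensors are illustrative choices.

```python
import random


def rectified_flow_pair(x0, t, rng):
    """Build (z_t, velocity_target, eps) for one sample at time t in [0, 1]."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    # Straight line between the data point (t=0) and pure noise (t=1).
    z_t = [(1.0 - t) * x + t * e for x, e in zip(x0, eps)]
    # The regression target is the constant velocity along that line.
    velocity = [e - x for x, e in zip(x0, eps)]
    return z_t, velocity, eps


rng = random.Random(0)
x0 = [0.5, -1.0, 2.0]  # toy stand-in for a latent
z_t, velocity, eps = rectified_flow_pair(x0, t=0.25, rng=rng)

# Stepping back along the velocity recovers the data point.
recovered = [z - 0.25 * v for z, v in zip(z_t, velocity)]
print(recovered)
```

A network trained to predict `velocity` from `z_t` and `t` would then be integrated from t=1 to t=0 at sampling time.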

2023

Diffusion Policy in robotics

Chi et al. show that conditional denoising diffusion, applied to action sequences rather than image latents, effectively models the action distribution in robotic manipulation.

Diffusion Policy: Visuomotor Policy Learning via Action Diffusion (paper)

Sources

High-Resolution Image Synthesis with Latent Diffusion Models

arXiv / CVPR 2022

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

arXiv

Scaling Rectified Flow Transformers for High-Resolution Image Synthesis (SD3)

arXiv / ICML 2024

Denoising Diffusion Probabilistic Models (DDPM)

arXiv / NeurIPS 2020

Classifier-Free Diffusion Guidance

arXiv

Diffusion Policy: Visuomotor Policy Learning via Action Diffusion

arXiv