Training: for a sample x, the encoder returns the parameters of q_φ(z|x) = 𝒩(μ_φ(x), σ²_φ(x)·I). A latent is sampled as z = μ + σ·ε with ε ~ 𝒩(0,I); this reparameterization lets gradients flow to μ and σ. The decoder produces p_θ(x|z) (Gaussian or Bernoulli). Loss (the negative ELBO, which is minimized): ℒ = ℒ_recon + β·KL(q_φ(z|x) ∥ p(z)), where ℒ_recon = −E_q[log p_θ(x|z)] (e.g. MSE for a Gaussian decoder, BCE for a Bernoulli decoder); β=1 is the vanilla VAE, and β-VAE uses other weights. Generation: sample z ~ p(z) = 𝒩(0,I) and run it through the decoder. Variants: β-VAE (disentanglement control), VQ-VAE (discrete codes via vector quantization), KL-VAE (continuous, used in Stable Diffusion), conditional VAE (conditioning), hierarchical VAE (NVAE, VDVAE). In LDM pipelines the VAE is trained with LPIPS and adversarial losses for better perceptual reconstruction.
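The objective described above can be sketched for a single sample as follows. This is a minimal illustration in plain Python, not a reference implementation: the function name `negative_elbo` and its arguments are hypothetical, the reconstruction term is assumed to be a sum of squared errors (a Gaussian decoder with fixed variance, up to scale and an additive constant), and the prior is assumed to be 𝒩(0,I).

```python
import math

def negative_elbo(x, x_hat, mu, log_var, beta=1.0):
    """Per-sample loss: reconstruction error + beta * KL(q(z|x) || N(0, I)).

    x, x_hat : observed and reconstructed values (lists of floats)
    mu, log_var : encoder outputs, mean and log-variance per latent dimension
    beta : KL weight (1.0 corresponds to the vanilla VAE)
    """
    # Reconstruction term: sum of squared errors (assumed Gaussian decoder).
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    # Closed-form KL between a diagonal Gaussian and the standard normal prior.
    kl = 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))
    return recon + beta * kl, recon, kl
```

Returning the two terms separately is a convenience for logging; when tuning β it usually helps to watch the reconstruction and KL terms individually rather than only their sum.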
Classical autoencoders learn an arbitrary (deterministic) latent space that does not support generating new samples. The VAE addresses this by imposing a probabilistic structure with KL regularization: the latent space becomes smooth and samplable, which enables generation of new images/sequences and interpretable interpolations.
A neural network producing parameters of the posterior distribution (usually μ and log σ² of a Gaussian). For images: CNN. For sequences: RNN/Transformer.
A network mapping z to a reconstruction / sample x̂. The decoder defines the conditional distribution of observations.
z = μ + σ·ε, ε ~ 𝒩(0,I). Enables gradient propagation through a stochastic node, essential for SGD on ELBO.
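A minimal sketch of the sampling step, in plain Python for illustration. The function name `reparameterize` and the `rng` parameter are assumptions, and the encoder is assumed to output the log-variance rather than σ directly. Plain Python has no automatic differentiation; in an autodiff framework the same expression would be written on tensors so that gradients reach μ and σ.

```python
import math
import random

def reparameterize(mu, log_var, rng=random):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1) per latent dimension."""
    z = []
    for m, lv in zip(mu, log_var):
        eps = rng.gauss(0.0, 1.0)      # noise drawn independently of mu and sigma
        sigma = math.exp(0.5 * lv)     # sigma = sqrt(exp(log_var))
        z.append(m + sigma * eps)      # deterministic function of (mu, sigma) given eps
    return z
```

Because the randomness is isolated in `eps`, the sample is a deterministic function of the encoder outputs, which is what allows backpropagation through the sampling step.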
KL(q_φ(z|x) ∥ p(z)) — analytically tractable for Gaussians. Regularization preventing collapse to a plain autoencoder.
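For reference, the closed form for a diagonal Gaussian posterior and a standard normal prior can be written out as below. The index j and the latent dimension d are notation introduced here for illustration; μ_j and σ_j denote the components of μ_φ(x) and σ_φ(x).

```latex
\mathrm{KL}\!\left(\mathcal{N}(\mu, \operatorname{diag}(\sigma^2)) \,\|\, \mathcal{N}(0, I)\right)
  = \frac{1}{2} \sum_{j=1}^{d} \left( \mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2 \right)
```

Each summand is non-negative and equals zero only when μ_j = 0 and σ_j = 1, that is, when the posterior matches the prior in that dimension.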
Classically 𝒩(0,I). Hierarchical VAEs use multi-level priors; VQ-VAE uses a categorical prior learned separately (PixelCNN, Transformer).
The posterior q_φ(z|x) collapses to the prior p(z), the decoder ignores z, and the model loses representational power.
Pixel-wise MSE/BCE losses average over plausible reconstructions → blurry images.
Most codebook entries stop being used — effective vocabulary size drops drastically.
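One way to monitor this is to track how many codes are actually selected and the perplexity of the code distribution. The sketch below is an illustration only: `codebook_usage` is a hypothetical helper, and it assumes the quantizer's selected code indices for a batch are available as a flat list of integers.

```python
import math
from collections import Counter

def codebook_usage(indices, codebook_size):
    """Return (fraction of codes used, perplexity of the empirical code distribution)."""
    counts = Counter(indices)
    total = len(indices)
    used_fraction = len(counts) / codebook_size
    # Entropy of the empirical distribution over the codes that were selected.
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    # Perplexity = exp(entropy): the effective number of codes in use.
    return used_fraction, math.exp(entropy)
```

A perplexity far below the codebook size, or a used fraction that shrinks during training, would suggest that the codebook is collapsing.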
Sampling from p(z) hits regions the aggregated posterior avoids → poor quality.
Kingma & Welling formalize VAE with the reparameterization trick; in parallel Rezende, Mohamed & Wierstra publish "Stochastic Backpropagation".
Higgins et al. introduce β-VAE, controlling representation disentanglement via the KL weight.
van den Oord et al. introduce vector quantization in latent space; foundation of DALL·E 1, MUSE, Parti.
Razavi et al. obtain high-resolution samples via hierarchical codes.
Hafner et al. use variational latent dynamics for model-based RL from pixels.
Vahdat & Kautz (NVAE) and Child (VDVAE) show that very deep hierarchical VAEs can compete with autoregressive models in likelihood and sample quality.
Rombach et al. use a KL-VAE with LPIPS + adversarial loss as the first stage of LDM.
Stability AI scales the VAE latent from 4 to 16 channels for substantially better reconstruction in SD3.
Latent space size. Too small → information loss; too large → harder sampling.
KL weight in β-VAE. β>1 → stronger disentanglement, β<1 → better reconstruction.
Gradually increasing KL weight against posterior collapse, especially in sequential VAEs.
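A linear warm-up is one common form of such a schedule. The sketch below is an assumption about the schedule shape, not a prescribed recipe; `kl_weight`, `warmup_steps`, and `beta_max` are hypothetical names.

```python
def kl_weight(step, warmup_steps, beta_max=1.0):
    """Linearly ramp the KL weight from 0 to beta_max over warmup_steps."""
    if warmup_steps <= 0:
        return beta_max                      # no warm-up requested
    return beta_max * min(1.0, step / warmup_steps)
```

The returned value would multiply the KL term of the loss at each training step, so early updates are dominated by reconstruction and the decoder learns to use z before the regularizer reaches full strength.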
Threshold below which KL is not penalized; prevents posterior collapse.
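A simplified sketch of the idea, applied per latent dimension and per sample, is shown below. Published variants often apply the threshold to groups of dimensions and to minibatch averages, so treat this form as an assumption; `free_bits_kl` and the default threshold are hypothetical.

```python
import math

def free_bits_kl(mu, log_var, free_bits=0.5):
    """KL term with a per-dimension floor: values below free_bits are clamped."""
    per_dim = [
        0.5 * (m * m + math.exp(lv) - 1.0 - lv)   # closed-form KL per dimension
        for m, lv in zip(mu, log_var)
    ]
    # Below the threshold the term is constant, so it contributes no gradient.
    return sum(max(kl, free_bits) for kl in per_dim)
```

Since a clamped dimension contributes a constant to the loss, the optimizer has no incentive to push its KL all the way to zero, which leaves room for each dimension to carry some information.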
Diagonal Gaussian / full-covariance Gaussian / VQ (categorical) / hierarchical.
MSE / BCE / LPIPS + adversarial (as in SD VAE) — strongly affects perceptual quality.
The entire encoder and decoder are active at every step.
Encoder, decoder, and ELBO computations are fully batch-parallel. No time-recurrence (except in sequential variants such as RSSM).
Encoder and decoder are mostly CNN/Transformer layers, which map well onto tensor cores. KL and ELBO are cheap tensor arithmetic.
VAE training on JAX/TPU is well supported (Diffusers, Flax).