Architecture

VAE

2014 · Active · Published: 8 June 2026 · Updated: 8 June 2026

Key innovation

Combines an autoencoder with a probabilistic latent space and variational inference, so a single differentiable model both learns representations and generates samples by maximizing the evidence lower bound (ELBO).

How it works

Training: for a sample x, the encoder returns the parameters of q_φ(z|x) = 𝒩(μ_φ(x), σ²_φ(x)·I). A latent is drawn as z = μ + σ·ε with ε ~ 𝒩(0,I); this reparameterization lets gradients flow through the sampling step. The decoder defines p_θ(x|z) (Gaussian or Bernoulli). The objective to maximize is ℒ_ELBO = 𝔼_{q_φ(z|x)}[log p_θ(x|z)] − β·KL(q_φ(z|x) ∥ p(z)). Equivalently, the training loss to minimize is −ℒ_ELBO = ℒ_recon + β·KL(q_φ(z|x) ∥ p(z)), where ℒ_recon is the negative log-likelihood of the reconstruction (e.g. MSE for a Gaussian decoder, BCE for a Bernoulli decoder). β = 1 is the vanilla VAE; β-VAE uses other weights.

Generation: sample z ~ p(z) = 𝒩(0,I) and run it through the decoder.

Variants: β-VAE (disentanglement control), VQ-VAE (discrete codes via vector quantization), KL-VAE (continuous, used in SD), conditional VAE (conditioning), hierarchical VAE (NVAE, VDVAE). In LDM pipelines the VAE is trained with LPIPS and adversarial losses for better perceptual reconstruction.
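The training loss described above can be sketched in NumPy. This is a minimal illustration, assuming a Bernoulli decoder (BCE reconstruction term) and a diagonal Gaussian posterior with a standard normal prior; the function name `negative_elbo` and its arguments are hypothetical and not taken from any library.

```python
import numpy as np

def negative_elbo(x, x_hat, mu, logvar, beta=1.0, eps=1e-7):
    """Return (loss, recon, kl), each averaged over the batch.

    x, x_hat   : arrays of shape (batch, features) with values in [0, 1]
    mu, logvar : encoder outputs of shape (batch, latent_dim)
    """
    # Bernoulli negative log-likelihood (binary cross-entropy), summed over features.
    x_hat = np.clip(x_hat, eps, 1.0 - eps)
    recon = -np.sum(x * np.log(x_hat) + (1.0 - x) * np.log(1.0 - x_hat), axis=1)
    # Closed-form KL(N(mu, sigma^2 I) || N(0, I)), summed over latent dimensions.
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=1)
    # Minimizing recon + beta * kl maximizes the beta-weighted ELBO.
    loss = recon + beta * kl
    return float(loss.mean()), float(recon.mean()), float(kl.mean())

# Toy inputs: one sample, two pixels, two latent dimensions.
x = np.array([[1.0, 0.0]])
x_hat = np.array([[0.5, 0.5]])
mu = np.array([[1.0, 0.0]])
logvar = np.zeros((1, 2))
loss, recon, kl = negative_elbo(x, x_hat, mu, logvar, beta=4.0)
```

Returning the two terms separately makes it easy to log them during training, which helps when tuning β.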

Problem solved

Classical autoencoders learn an arbitrary (deterministic) latent space that does not support generating new samples. VAE solves this by imposing a probabilistic structure and KL regularization — the latent space becomes smooth and samplable, enabling generation of new images/sequences and interpretable interpolations.

Components

Encoder (recognition network): approximates the posterior p(z|x)

A neural network producing parameters of the posterior distribution (usually μ and log σ² of a Gaussian). For images: CNN. For sequences: RNN/Transformer.

IN: Observation x.

OUT: Posterior parameters (μ and log σ²).

Decoder (generative network): generates x from latent z

A network mapping z to a reconstruction / sample x̂. The decoder defines the conditional distribution of observations.

Reparameterization trick: differentiable sampling from the posterior

z = μ + σ·ε, ε ~ 𝒩(0,I). Enables gradient propagation through a stochastic node, essential for SGD on ELBO.

Gaussian (classic): standard diagonal Gaussian parameterization.

Gumbel-softmax / straight-through (VQ-VAE, categorical): discrete / categorical latents with gradient approximation.
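The Gaussian case of the reparameterization can be illustrated with a short NumPy sketch. It assumes the encoder outputs μ and log σ² as described in this section; `reparameterize` is an illustrative name. NumPy has no automatic differentiation, so the comments only indicate where gradients would flow in an autodiff framework.

```python
import numpy as np

def reparameterize(mu, logvar, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I)."""
    sigma = np.exp(0.5 * logvar)          # log-variance -> standard deviation
    eps = rng.standard_normal(mu.shape)   # noise does not depend on mu or sigma
    # z is a deterministic function of (mu, sigma) given eps, so
    # dz/dmu = 1 and dz/dsigma = eps are well defined.
    return mu + sigma * eps

rng = np.random.default_rng(0)
mu = np.full((10000, 2), 3.0)
logvar = np.full((10000, 2), np.log(0.25))   # sigma = 0.5
z = reparameterize(mu, logvar, rng)
```

With these inputs the samples should scatter around 3.0 with a standard deviation close to 0.5.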

KL divergence regularization: pulls the posterior close to the prior

KL(q_φ(z|x) ∥ p(z)) — analytically tractable for Gaussians. Regularization preventing collapse to a plain autoencoder.
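To make "analytically tractable" concrete, the sketch below compares the closed-form KL between a diagonal Gaussian and 𝒩(0,I) with a Monte Carlo estimate of the same quantity. The function names and the sample count are assumptions made for illustration.

```python
import numpy as np

def kl_closed_form(mu, logvar):
    """KL(N(mu, diag(sigma^2)) || N(0, I)) in nats."""
    return float(0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar))

def kl_monte_carlo(mu, logvar, rng, n=100_000):
    """Estimate the same KL as the sample mean of log q(z) - log p(z), z ~ q."""
    sigma = np.exp(0.5 * logvar)
    z = mu + sigma * rng.standard_normal((n, mu.size))
    # The -0.5 * d * log(2*pi) constants are identical in both terms and cancel.
    log_q = -0.5 * np.sum(((z - mu) / sigma) ** 2 + logvar, axis=1)
    log_p = -0.5 * np.sum(z**2, axis=1)
    return float(np.mean(log_q - log_p))

mu = np.array([1.0, -0.5])
logvar = np.array([0.0, -1.0])
exact = kl_closed_form(mu, logvar)
approx = kl_monte_carlo(mu, logvar, np.random.default_rng(0))
```

The two values should agree up to sampling noise, and the closed form is exactly zero when μ = 0 and log σ² = 0, i.e. when the posterior equals the prior.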

Prior p(z): defines the latent space for sampling

Classically 𝒩(0,I). Hierarchical VAEs use multi-level priors; VQ-VAE uses a categorical prior learned separately (PixelCNN, Transformer).

Implementation

Reference implementations

Diffusers: AutoencoderKL (Python, official)

PyTorch VAE examples (Kingma reference) (Python, official)

VQ-VAE / VQ-VAE-2 (DeepMind sonnet)

taming-transformers (KL-VAE + GAN, basis for SD) (Python, official)

Implementation pitfalls

Posterior collapse · Critical

The posterior q_φ(z|x) collapses to the prior p(z), the decoder ignores z, and the model loses representational power.

Fix: KL annealing, free bits, an initial β < 1, a stronger encoder, or restricting the context of an autoregressive decoder so that it has to rely on z.
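Two of these fixes, KL annealing and free bits, are simple enough to sketch in plain Python. The linear schedule, the warm-up length, and the 0.5-nat threshold are illustrative assumptions, not recommended settings.

```python
def kl_weight(step, warmup_steps=10_000, beta_max=1.0):
    """Linear KL annealing: the weight rises from 0 to beta_max over warmup_steps."""
    return beta_max * min(1.0, step / warmup_steps)

def free_bits_kl(kl_per_dim, free_bits=0.5):
    """Sum per-dimension KL values (in nats), each clamped from below.

    Dimensions whose KL is already under `free_bits` contribute a constant,
    so the optimizer gains nothing by pushing them further toward the prior.
    """
    return sum(max(kl, free_bits) for kl in kl_per_dim)

# Weight at the start, midway through, and after the warm-up.
weights = [kl_weight(s) for s in (0, 5_000, 20_000)]
# One nearly collapsed dimension (0.1 nats) and one active dimension (2.0 nats).
penalty = free_bits_kl([0.1, 2.0])
```

In a training loop, the KL term would be multiplied by `kl_weight(step)`, optionally after the clamping applied by `free_bits_kl`.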

Blurry reconstructions (vanilla MSE) · High

Pixel-wise MSE/BCE losses reward averaging over all plausible outputs, which yields blurry images.

Fix: LPIPS + adversarial loss (as in SD VAE), VQ-VAE with discrete codes, hierarchical VAEs.

Codebook collapse in VQ-VAE · High

Most codebook entries stop being used — effective vocabulary size drops drastically.

Fix: EMA codebook updates, dead-code restart, low-dimensional projection (as in FSQ).
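Nearest-neighbour quantization and a dead-code restart can be sketched as follows. This is a simplified illustration: it measures usage on a single batch, whereas practical implementations usually track usage over many batches (for example with an exponential moving average). Both function names are hypothetical.

```python
import numpy as np

def quantize(z, codebook):
    """Return, for each row of z, the index of its nearest codebook vector."""
    # Squared Euclidean distance between every latent and every code: shape (n, K).
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def restart_dead_codes(z, codebook, indices, rng):
    """Re-initialize codes unused in this batch from randomly chosen encoder outputs."""
    usage = np.bincount(indices, minlength=len(codebook))
    dead = np.flatnonzero(usage == 0)
    new_codebook = codebook.copy()
    if dead.size:
        new_codebook[dead] = z[rng.integers(0, len(z), size=dead.size)]
    return new_codebook, usage

z = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0]])
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [10.0, 10.0]])
indices = quantize(z, codebook)
new_codebook, usage = restart_dead_codes(z, codebook, indices, np.random.default_rng(0))
```

Here the third code is far from every latent, so it receives no assignments and is moved onto one of the encoder outputs.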

Prior–aggregated-posterior mismatch · Medium

Sampling from p(z) hits regions the aggregated posterior avoids → poor quality.

Fix: Learn the prior separately (PixelCNN over VQ codes), VampPrior, two-stage VAE.
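A cheap way to spot this mismatch is to compare the first two moments of the aggregated posterior with those of 𝒩(0,I). The sketch below assumes the encoder outputs (μ, log σ²) have been collected over a dataset; the function name is illustrative. Matching moments do not prove that the distributions match, but a large gap is a clear warning sign.

```python
import numpy as np

def aggregated_posterior_moments(mus, logvars):
    """Per-dimension mean and variance of the mixture (1/N) * sum_i q(z|x_i).

    mus, logvars : arrays of shape (N, latent_dim) gathered from the encoder.
    """
    mean = mus.mean(axis=0)
    # Law of total variance: average posterior variance + variance of the means.
    var = np.exp(logvars).mean(axis=0) + mus.var(axis=0)
    return mean, var

# Toy encoder outputs: the first dimension spreads its means, the second does not.
mus = np.array([[-1.0, 0.0], [1.0, 0.0]])
logvars = np.zeros((2, 2))
mean, var = aggregated_posterior_moments(mus, logvars)
```

In this toy case the first dimension has variance 2 rather than 1, so samples drawn from 𝒩(0,I) would under-cover the region the encoder actually uses.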

Evolution

Original paper · 2014 · ICLR 2014 · Diederik P. Kingma

Auto-Encoding Variational Bayes

Diederik P. Kingma, Max Welling

2014

VAE — Auto-Encoding Variational Bayes

Inflection point

Kingma & Welling formalize VAE with the reparameterization trick; in parallel Rezende, Mohamed & Wierstra publish "Stochastic Backpropagation".

Stochastic Backpropagation and Approximate Inference in Deep Generative Models (paper)

2017

β-VAE — disentanglement

Higgins et al. introduce β-VAE, controlling representation disentanglement via the KL weight.

β-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework (paper)

2017

VQ-VAE — discrete latent codes

Inflection point

van den Oord et al. introduce vector quantization in latent space; foundation of DALL·E 1, MUSE, Parti.

Neural Discrete Representation Learning (paper)

2019

VQ-VAE-2 — hierarchical codes

Razavi et al. obtain high-resolution samples via hierarchical codes.

Generating Diverse High-Fidelity Images with VQ-VAE-2 (paper)

2019

PlaNet / Dreamer — VAE-like RSSM in RL

Hafner et al. use variational latent dynamics for model-based RL from pixels.

RSSM (concept)

2021

NVAE / VDVAE — deep hierarchical VAE

Vahdat & Kautz (NVAE) and Child (VDVAE) show that very deep hierarchical VAEs narrow the gap to autoregressive models in likelihood on image benchmarks, and in VDVAE's case can outperform them.

NVAE: A Deep Hierarchical Variational Autoencoder (paper)

2022

KL-VAE as compressor in Stable Diffusion

Inflection point

Rombach et al. use a KL-VAE with LPIPS + adversarial loss as the first stage of LDM.

LDM (concept)

2024

SD3 — 16-channel KL-VAE

Stability AI scales the VAE from 4 to 16 channels for substantially better reconstruction in SD3.

Sources

Auto-Encoding Variational Bayes

Paper

arXiv / ICLR 2014

Stochastic Backpropagation and Approximate Inference in Deep Generative Models

Paper

arXiv / ICML 2014

β-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework

Paper

ICLR 2017

Neural Discrete Representation Learning (VQ-VAE)

Paper

arXiv / NeurIPS 2017

Generating Diverse High-Fidelity Images with VQ-VAE-2

Paper

arXiv / NeurIPS 2019

NVAE: A Deep Hierarchical Variational Autoencoder

Paper

arXiv / NeurIPS 2020

High-Resolution Image Synthesis with Latent Diffusion Models (KL-VAE)

Paper

arXiv / CVPR 2022

An Introduction to Variational Autoencoders (Kingma & Welling)

Paper

arXiv (book-length)
