The forward process gradually adds Gaussian noise to an image over T steps: x_t = √(α̅_t)·x_0 + √(1−α̅_t)·ε with ε ~ 𝒩(0,I), where α̅_t decreases with t. A network ε_θ(x_t, t), typically a U-Net, learns to predict the added noise ε (the ε-prediction parameterization) by minimizing ℒ = E[‖ε − ε_θ(x_t,t)‖²]. Training is performed directly on full-resolution pixels, with no autoencoder in between. At inference, sampling starts from pure noise x_T ~ 𝒩(0,I) and iteratively denoises over T steps (or fewer with DDIM or DPM-Solver), again directly on pixels. Text or class conditioning enters through cross-attention or adaLN and is amplified by classifier guidance or classifier-free guidance. For high resolutions, cascades are used: a base model generates a small image, e.g. 64×64, and subsequent super-resolution diffusion models upscale it to 256×256 and 1024×1024 (Imagen, Cascaded Diffusion).
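The ε-prediction objective above can be sketched for a single training example. This is a minimal illustration, not a training loop: the image is a flat Python list, `alpha_bar_t = 0.5` is an arbitrary assumed schedule value, and the "network" is a hypothetical stand-in that always predicts zero noise.

```python
import math
import random

def noisy_sample(x0, alpha_bar_t, rng):
    """Draw x_t = sqrt(a)*x0 + sqrt(1-a)*eps and return (x_t, eps)."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    x_t = [a * x + b * e for x, e in zip(x0, eps)]
    return x_t, eps

def eps_loss(eps, eps_pred):
    """Squared error between true and predicted noise, averaged over pixels.

    This differs from the squared norm in the text only by a constant factor.
    """
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred)) / len(eps)

rng = random.Random(0)
x0 = [rng.uniform(-1.0, 1.0) for _ in range(3 * 8 * 8)]  # toy 8x8 RGB image, flattened
alpha_bar_t = 0.5  # assumed value of the cumulative schedule at some step t
x_t, eps = noisy_sample(x0, alpha_bar_t, rng)

# Hypothetical stand-in for the U-Net: always predicts zero noise.
zero_pred = [0.0] * len(x_t)
loss_untrained = eps_loss(eps, zero_pred)  # roughly the variance of eps
loss_perfect = eps_loss(eps, eps)          # a perfect predictor has zero loss
```

In real training, t is drawn at random for each example and the expectation is approximated by averaging this loss over a batch.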
Pixel-space diffusion targets high-quality, diverse image generation with exact pixel fidelity, avoiding the information loss introduced by autoencoder compression. It models the full data distribution in the original pixel space and so avoids the VAE reconstruction artifacts present in latent approaches.
A Markov chain adding Gaussian noise: x_t = √(α̅_t)·x_0 + √(1−α̅_t)·ε. Fixed, with no learned parameters.
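The closed form follows from composing the per-step transitions x_t = √(1−β_t)·x_{t−1} + √(β_t)·ε_t, with α̅_t = ∏(1−β_s). The sketch below tracks only the signal coefficient and the accumulated noise variance through the chain and compares them with the closed form. The β values are arbitrary illustrative numbers, not a real schedule.

```python
import math

def compose_chain(betas):
    """Step through the Markov chain, tracking x_t = signal * x_0 + noise.

    Returns the coefficient on x_0 and the variance of the accumulated noise.
    """
    signal, noise_var = 1.0, 0.0
    for beta in betas:
        alpha = 1.0 - beta
        signal *= math.sqrt(alpha)             # x_0 is scaled by sqrt(alpha) each step
        noise_var = alpha * noise_var + beta   # old noise is scaled, fresh noise is added
    return signal, noise_var

betas = [0.01, 0.02, 0.05, 0.10, 0.20]  # illustrative values only
alpha_bar = math.prod(1.0 - b for b in betas)

signal, noise_var = compose_chain(betas)
closed_form_signal = math.sqrt(alpha_bar)
closed_form_noise_var = 1.0 - alpha_bar
```

Because independent Gaussians add in variance, the two computations agree, which is why x_t can be sampled in a single step during training instead of simulating the chain.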
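A U-Net with attention and timestep embedding operating directly on pixels. Convolution cost grows linearly with the number of pixels, and self-attention cost grows quadratically with the number of spatial positions it covers.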
Linear (DDPM), cosine (Improved DDPM), zero-SNR. Determines the noising rate and sample quality.
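A short sketch of how the linear and cosine schedules define α̅_t. The constants are the commonly cited defaults (β from 1e-4 to 0.02 over T = 1000 for the linear schedule, offset s = 0.008 for the cosine schedule) and should be read as assumptions here. The zero-SNR variant is not shown.

```python
import math

def linear_alpha_bar(T, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for linearly spaced betas."""
    out, prod = [], 1.0
    for i in range(T):
        beta = beta_start + (beta_end - beta_start) * i / (T - 1)
        prod *= 1.0 - beta
        out.append(prod)
    return out

def cosine_alpha_bar(T, s=0.008):
    """alpha_bar_t = f(t) / f(0) with f(t) = cos^2(((t/T + s) / (1 + s)) * pi/2)."""
    def f(t):
        return math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    return [f(t) / f(0) for t in range(1, T + 1)]

T = 1000
lin = linear_alpha_bar(T)
cos = cosine_alpha_bar(T)
mid = T // 2 - 1  # list index of step t = T/2

# Signal retained halfway through the chain, and at the final step.
lin_mid, cos_mid = lin[mid], cos[mid]
lin_end, cos_end = lin[-1], cos[-1]
```

Under these defaults the cosine schedule keeps noticeably more signal at mid-chain than the linear one. The linear schedule also ends with a small but nonzero α̅_T, which is the residual signal that zero-SNR schedules are designed to remove.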
Successive super-resolution diffusion models upscaling resolution (Imagen: 64→256→1024). Mitigates the cost of direct high-resolution generation.
A full-resolution U-Net evaluated over hundreds of steps is one to two orders of magnitude more expensive than Latent Diffusion.
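A back-of-the-envelope count shows where a gap of this size comes from. The sketch assumes a 512×512 image and a latent model whose autoencoder downsamples by 8× per side. It compares only the number of spatial positions the denoiser processes, not measured FLOPs.

```python
def spatial_positions(height, width, downsample=1):
    """Number of spatial positions the denoiser sees per forward pass."""
    return (height // downsample) * (width // downsample)

pixel_positions = spatial_positions(512, 512)      # pixel-space grid
latent_positions = spatial_positions(512, 512, 8)  # assumed 8x autoencoder downsampling

# Convolution cost scales roughly linearly with the number of positions.
conv_ratio = pixel_positions / latent_positions    # 64.0
# Full self-attention over all positions scales roughly quadratically.
attn_ratio = conv_ratio ** 2                       # 4096.0
```

Real U-Nets apply attention only at downsampled feature maps, so the overall ratio lands between these two figures and depends on the architecture.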
High guidance scale causes oversaturated, burnt-out images.
Artifacts from the base model are amplified by subsequent super-resolution stages.
Sohl-Dickstein et al. introduce the original idea of pixel-space diffusion models.
Ho et al. establish a simple, effective training recipe (ε-prediction, U-Net), launching the diffusion era.
Nichol & Dhariwal introduce the cosine schedule and learned variance, improving log-likelihood.
Dhariwal & Nichol use classifier guidance to surpass GANs on ImageNet, setting the pixel-space state of the art.
OpenAI applies pixel-space diffusion with classifier-free guidance (CFG) for text conditioning (DALL·E 2, also known as unCLIP = prior + diffusion decoder).
Google uses T5 + a 64→256→1024 cascade with dynamic thresholding, reaching high photorealistic quality.
LDM/Stable Diffusion shows that latent-space diffusion is one to two orders of magnitude cheaper at comparable quality, narrowing the role of pure pixel-space models.
Chain length (1000 in training, 25-250 at inference with DDIM/DPM-Solver).
Linear / cosine / zero-SNR — affects quality and contrast.
Base model resolution (e.g. 64×64 in Imagen before cascade).
Number of super-resolution models in the cascade.
Classifier-free guidance strength (Imagen uses dynamic thresholding at high w).
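The guidance scale w and dynamic thresholding interact as follows: guidance extrapolates the noise prediction away from the unconditional one, which can push the implied clean image outside [−1, 1], and dynamic thresholding rescales it back. The sketch below is a simplified illustration on flat lists. The percentile value 0.995 and the nearest-rank percentile computation are assumptions, not settings taken from the source.

```python
def cfg_combine(eps_uncond, eps_cond, w):
    """Classifier-free guidance: eps_uncond + w * (eps_cond - eps_uncond).

    w = 1 reproduces the conditional prediction; w > 1 extrapolates past it.
    """
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]

def dynamic_threshold(x0_pred, percentile=0.995):
    """Clip the predicted clean image to [-s, s] and divide by s.

    s is a high percentile of |x0_pred|, floored at 1 so that in-range
    predictions are left unchanged. Nearest-rank percentile for simplicity.
    """
    mags = sorted(abs(v) for v in x0_pred)
    idx = min(len(mags) - 1, int(percentile * len(mags)))
    s = max(1.0, mags[idx])
    return [max(-s, min(s, v)) / s for v in x0_pred]

guided = cfg_combine([0.0, 1.0], [1.0, 1.0], w=3.0)

x0_pred = [-3.0, -0.5, 0.2, 1.5, 4.0]   # toy over-range prediction at high w
static = [max(-1.0, min(1.0, v)) for v in x0_pred]  # plain clipping to [-1, 1]
dynamic = dynamic_threshold(x0_pred)
```

Plain clipping maps every over-range value to exactly ±1, which saturates them; the dynamic version keeps their relative ordering while still returning values in [−1, 1].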
The entire U-Net is active at each denoising step at full resolution.
Training is fully batch-parallel. Inference requires sequential denoising steps, each a dense full-resolution U-Net forward pass — much more expensive than in Latent Diffusion.
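The sequential dependency at inference can be seen in a minimal deterministic DDIM loop (η = 0), sketched below on flat lists. The `eps_model(x, alpha_bar_t)` signature is hypothetical (real networks take a timestep index), the five α̅ values are illustrative, and the oracle predictor exists only to make the example checkable.

```python
import math
import random

def ddim_sample(x, alpha_bars, eps_model):
    """Deterministic DDIM (eta = 0) over a subsampled sequence of steps.

    alpha_bars lists the cumulative schedule at the chosen timesteps,
    noisiest first. Each iteration consumes the previous iterate, so the
    steps cannot run in parallel.
    """
    targets = alpha_bars[1:] + [1.0]  # alpha_bar = 1 corresponds to a clean image
    for a_t, a_prev in zip(alpha_bars, targets):
        eps_hat = eps_model(x, a_t)  # one full network forward pass per step
        # Clean image implied by the current noise prediction.
        x0_hat = [(xi - math.sqrt(1 - a_t) * e) / math.sqrt(a_t)
                  for xi, e in zip(x, eps_hat)]
        # Re-noise the implied clean image to the next (less noisy) level.
        x = [math.sqrt(a_prev) * x0 + math.sqrt(1 - a_prev) * e
             for x0, e in zip(x0_hat, eps_hat)]
    return x

# Hypothetical oracle that knows the clean image; stands in for a trained U-Net.
target = [0.5, -0.25, 0.0, 1.0]

def oracle_eps(x, a_t):
    return [(xi - math.sqrt(a_t) * x0) / math.sqrt(1 - a_t)
            for xi, x0 in zip(x, target)]

rng = random.Random(0)
x_T = [rng.gauss(0.0, 1.0) for _ in target]
sample = ddim_sample(x_T, [0.02, 0.1, 0.3, 0.6, 0.9], oracle_eps)
```

With the oracle, the loop returns the target image regardless of the starting noise. With a learned predictor, the number of entries in `alpha_bars` is the number of network evaluations, which is the quantity that faster samplers reduce.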
U-Net convolutions and attention map well onto tensor cores, but full-resolution activations demand large HBM capacity, so A100/H100-class GPUs are preferred.
Imagen was trained on TPU; cascades and convolutions scale well on TPU pods.