Architecture

Pixel Diffusion

2020 · Active · Published: 8 June 2026 · Updated: 8 June 2026

Key innovation

Runs the full diffusion process directly on image pixels, with no compression to a latent space. This avoids autoencoder reconstruction loss and preserves pixel-level detail, at the cost of a large compute and memory budget.

How it works

The forward process gradually adds Gaussian noise to the image over T steps. In closed form, x_t = √(α̅_t)·x_0 + √(1−α̅_t)·ε with ε ~ 𝒩(0, I), where α̅_t = ∏_{s≤t}(1−β_s) decreases with t and β_s is the noise variance added at step s.

A network ε_θ(x_t, t), typically a U-Net, learns to predict the added noise ε (the ε-prediction parameterization) by minimizing ℒ = E[‖ε − ε_θ(x_t, t)‖²]. Training runs directly on full-resolution pixels.

Inference starts from pure noise x_T ~ 𝒩(0, I) and iteratively denoises directly on pixels, over T steps or fewer with a fast sampler such as DDIM or DPM-Solver. Text or class conditioning enters through cross-attention or adaLN and is amplified by classifier guidance or classifier-free guidance (CFG).

For high resolutions, cascades are used: a base model generates, for example, 64×64 images, and subsequent super-resolution diffusion models upscale them to 256×256 and 1024×1024 (Imagen, Cascaded Diffusion).
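The training objective and the ancestral sampling loop can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: `toy_eps_model` is a hypothetical stand-in for a trained U-Net, and the data is assumed to be standard normal so that the optimal noise prediction is known in closed form. The linear schedule values (1e-4 to 0.02, T = 1000) follow DDPM.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)     # linear schedule (DDPM)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)        # alpha_bar_t, decreasing in t


def q_sample(x0, t, eps):
    """Closed-form forward process: noise x0 directly to step t (0-indexed)."""
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps


def eps_loss(eps_model, x0):
    """One Monte Carlo sample of the epsilon-prediction loss."""
    t = int(rng.integers(0, T))
    eps = rng.standard_normal(x0.shape)
    x_t = q_sample(x0, t, eps)
    return float(np.mean((eps - eps_model(x_t, t)) ** 2))


def ddpm_sample(eps_model, shape):
    """Ancestral sampling: start from N(0, I) and denoise for T steps."""
    x = rng.standard_normal(shape)
    for t in reversed(range(T)):
        eps_hat = eps_model(x, t)
        # Mean of x_{t-1} given x_t and the predicted noise.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        noise = rng.standard_normal(shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise   # variance choice sigma_t^2 = beta_t
    return x


def toy_eps_model(x_t, t):
    """Hypothetical stand-in for a trained U-Net. ASSUMPTION: the data is
    exactly N(0, I), so E[eps | x_t] = sqrt(1 - alpha_bar_t) * x_t."""
    return np.sqrt(1.0 - alpha_bars[t]) * x_t


x0 = rng.standard_normal((3, 32, 32))          # toy "image" drawn from N(0, I)
loss = eps_loss(toy_eps_model, x0)
sample = ddpm_sample(toy_eps_model, x0.shape)
```

With a real model, `toy_eps_model` would be replaced by the U-Net forward pass and `eps_loss` would be averaged over a batch and backpropagated; the sampling loop itself keeps the same shape.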

Problem solved

Pixel-space diffusion generates high-quality, diverse images without the information loss introduced by autoencoder compression. It models the data distribution in the original pixel space, avoiding the VAE reconstruction artifacts present in latent approaches.

Components

Forward (noising) process · Gradually noises image pixels

A Markov chain that adds a small amount of Gaussian noise at each step. Its marginal has the closed form x_t = √(α̅_t)·x_0 + √(1−α̅_t)·ε. The process is fixed and has no learned parameters.
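The link between the step-by-step chain and the closed form can be checked numerically. The sketch below assumes the DDPM linear β schedule and tracks only the coefficient on x_0 and the accumulated noise variance, which is enough because every step is linear and Gaussian.

```python
import math

T = 1000
# Linear beta schedule from DDPM; beta_t is the noise variance added at step t.
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]

signal = 1.0      # coefficient multiplying x_0 after t chain steps
variance = 0.0    # variance of the accumulated Gaussian noise
alpha_bar = 1.0   # running product of (1 - beta_s)
max_gap = 0.0     # largest deviation from the closed form seen so far

for beta in betas:
    # One chain step: x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps_t
    signal *= math.sqrt(1.0 - beta)
    variance = (1.0 - beta) * variance + beta
    alpha_bar *= 1.0 - beta
    # Closed form predicts signal = sqrt(alpha_bar), variance = 1 - alpha_bar.
    max_gap = max(
        max_gap,
        abs(signal - math.sqrt(alpha_bar)),
        abs(variance - (1.0 - alpha_bar)),
    )
```

After the loop, `max_gap` stays at floating-point rounding level, which is why training can jump straight to any step t without simulating the chain.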

Denoising network (U-Net) · Predicts noise at full pixel resolution

A U-Net with attention and timestep embedding operating directly on pixels. Cost grows with image resolution.
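A rough sense of how cost grows with resolution comes from simple counting. The sketch below is back-of-the-envelope arithmetic under two simplifying assumptions that are not from the source: convolution-like work scales with the number of pixels, and global self-attention scales with the square of the number of tokens. Practical U-Nets usually apply attention only at downsampled feature maps, so the second figure is an upper-bound style illustration.

```python
def relative_cost(side, base_side=64):
    """Cost of one network evaluation relative to base_side x base_side,
    under two simplified scaling rules (both are assumptions)."""
    pixel_ratio = (side * side) / (base_side * base_side)
    return {
        "conv_like": pixel_ratio,               # work proportional to pixel count
        "global_attention": pixel_ratio ** 2,   # work proportional to tokens squared
    }


costs = {side: relative_cost(side) for side in (64, 256, 1024)}
for side, cost in costs.items():
    print(f"{side:>4}px  conv x{cost['conv_like']:.0f}  attention x{cost['global_attention']:.0f}")
```

Each figure is per network evaluation; a sampler that runs hundreds of steps multiplies it again by the step count.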

Noise schedule · The α̅_t schedule

Common choices are linear (DDPM), cosine (Improved DDPM), and schedules rescaled to zero terminal SNR. The schedule sets the noising rate and affects sample quality.
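The difference between the linear and cosine schedules is easiest to see by computing α̅_t for both. The sketch below uses the published defaults (β from 1e-4 to 0.02 for linear; offset s = 0.008 and β clipped at 0.999 for cosine) with T = 1000; the function names are illustrative only.

```python
import math


def linear_alpha_bars(T=1000, beta_start=1e-4, beta_end=0.02):
    """alpha_bar_t for the DDPM linear beta schedule."""
    alpha_bars, prod = [], 1.0
    for i in range(T):
        beta = beta_start + (beta_end - beta_start) * i / (T - 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars


def cosine_alpha_bars(T=1000, s=0.008, max_beta=0.999):
    """alpha_bar_t for the Improved DDPM cosine schedule."""
    def f(t):
        return math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2

    alpha_bars, prod = [], 1.0
    for i in range(T):
        beta = min(1.0 - f(i + 1) / f(i), max_beta)   # clip to avoid beta = 1
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars


def snr(alpha_bar):
    """Signal-to-noise ratio of x_t for a given alpha_bar_t."""
    return alpha_bar / (1.0 - alpha_bar)


lin = linear_alpha_bars()
cos = cosine_alpha_bars()
mid_lin, mid_cos = lin[499], cos[499]          # halfway through the schedule
final_snr_lin, final_snr_cos = snr(lin[-1]), snr(cos[-1])
```

Halfway through, the cosine schedule retains much more signal than the linear one, which destroys information early. Both end with a small but nonzero terminal SNR, which is the gap that zero-terminal-SNR rescaling is meant to close.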

Cascade / super-resolution stages · Scaling to high resolutions

A sequence of super-resolution diffusion models, each conditioned on the output of the previous stage, raises the resolution step by step (Imagen: 64→256→1024). This mitigates the cost of generating high-resolution images directly.
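The data flow of a 64→256→1024 cascade can be sketched as follows. Both stage functions are hypothetical placeholders: a real base stage and a real super-resolution stage would each run a full denoising loop with a trained network, whereas here they only produce arrays of the right shape so the wiring is visible.

```python
import numpy as np

rng = np.random.default_rng(0)


def upsample_nearest(img, factor):
    """Nearest-neighbour upsampling of a (C, H, W) array."""
    return img.repeat(factor, axis=1).repeat(factor, axis=2)


def base_stage(shape=(3, 64, 64)):
    """Placeholder for sampling the low-resolution base diffusion model."""
    return np.clip(rng.standard_normal(shape), -1.0, 1.0)


def super_res_stage(low_res, factor):
    """Placeholder for a super-resolution diffusion model.

    A real stage would denoise at the target resolution while conditioned on
    the upsampled low-resolution image; here small random detail is added.
    """
    cond = upsample_nearest(low_res, factor)
    detail = 0.05 * rng.standard_normal(cond.shape)
    return np.clip(cond + detail, -1.0, 1.0)


x64 = base_stage()                   # (3, 64, 64)
x256 = super_res_stage(x64, 4)       # (3, 256, 256)
x1024 = super_res_stage(x256, 4)     # (3, 1024, 1024)
```

Only the final stage ever touches 1024×1024 tensors, and it solves the easier problem of adding detail to an existing image rather than composing a scene from noise.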

Implementation

Reference implementations

guided-diffusion / ADM (OpenAI)

Python

Official

improved-diffusion (OpenAI)

Python

Official

denoising-diffusion-pytorch (lucidrains)

Python

imagen-pytorch (lucidrains)

Python

Implementation pitfalls

Very high compute cost · Severity: High

A full-resolution U-Net evaluated over hundreds of sampling steps is one to two orders of magnitude more expensive than Latent Diffusion.

Fix: Cascades (low-res base + super-res), faster samplers (DPM-Solver), distillation, or moving to latent space.

Oversaturation at high CFG · Severity: Medium

A high guidance scale causes oversaturated, burnt-out images.

Fix: Dynamic thresholding (Imagen), CFG rescale, zero-SNR.
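A minimal sketch of classifier-free guidance followed by dynamic thresholding is given below. The guidance formula and the clip-then-rescale rule follow the published descriptions; the percentile value of 99.5 and the synthetic over-range prediction are illustrative assumptions, not values taken from this page.

```python
import numpy as np


def cfg_eps(eps_cond, eps_uncond, w):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction towards the conditional one with guidance scale w."""
    return eps_uncond + w * (eps_cond - eps_uncond)


def predict_x0(x_t, eps_hat, alpha_bar_t):
    """Clean-image estimate implied by a noise prediction."""
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_hat) / np.sqrt(alpha_bar_t)


def dynamic_threshold(x0_pred, percentile=99.5):
    """Clip to the chosen percentile of |x0| (at least 1), then rescale
    into [-1, 1]. The percentile is an assumed, tunable value."""
    s = max(float(np.percentile(np.abs(x0_pred), percentile)), 1.0)
    return np.clip(x0_pred, -s, s) / s


rng = np.random.default_rng(0)
# Synthetic over-range x0 estimate, mimicking the effect of a large guidance scale.
x0_pred = 3.0 * rng.standard_normal((3, 64, 64))

static = np.clip(x0_pred, -1.0, 1.0)       # plain clipping to [-1, 1]
dynamic = dynamic_threshold(x0_pred)

static_saturated = float(np.mean(np.abs(static) == 1.0))
dynamic_saturated = float(np.mean(np.abs(dynamic) == 1.0))
```

In a sampler, the thresholded estimate replaces `x0_pred` at every step before the next x_{t-1} is computed. In this synthetic case plain clipping pins most pixels to ±1, the burnt-out look, while dynamic thresholding saturates only the extreme tail.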

Cascade instability · Severity: Medium

Artifacts from the base model are amplified by subsequent super-resolution stages.

Fix: Noise conditioning augmentation between cascade stages.
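Noise conditioning augmentation can be sketched as corrupting the low-resolution conditioning image with a random amount of Gaussian noise during training and telling the super-resolution model how much was applied. The variance-preserving mixing rule and the uniform sampling of `aug_level` below are assumptions made for illustration; actual implementations may parameterize the corruption differently.

```python
import numpy as np

rng = np.random.default_rng(0)


def augment_conditioning(low_res, aug_level):
    """Corrupt the low-res conditioning image with Gaussian noise.

    aug_level in [0, 1]: 0 leaves the image intact, 1 replaces it with noise.
    (Variance-preserving mixing is an assumed choice.)
    """
    noise = rng.standard_normal(low_res.shape)
    return np.sqrt(1.0 - aug_level) * low_res + np.sqrt(aug_level) * noise


def training_conditioning(low_res):
    """Draw a random augmentation level and return both the corrupted image
    and the level, so the model can be conditioned on it."""
    aug_level = float(rng.uniform(0.0, 1.0))
    return augment_conditioning(low_res, aug_level), aug_level


low_res = rng.uniform(-1.0, 1.0, size=(3, 64, 64))
cond, level = training_conditioning(low_res)
```

Because the super-resolution model never sees perfectly clean conditioning during training, it learns not to trust every low-resolution pixel, which makes it more tolerant of base-model artifacts at inference time.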

Evolution

Original paper · 2020 · NeurIPS 2020 · Jonathan Ho

Denoising Diffusion Probabilistic Models

Jonathan Ho, Ajay Jain, Pieter Abbeel

2015

Deep Unsupervised Learning using Nonequilibrium Thermodynamics

Sohl-Dickstein et al. introduce the original idea of pixel-space diffusion models.

Deep Unsupervised Learning using Nonequilibrium Thermodynamics (paper)

2020

DDPM — practical pixel-space diffusion

Inflection point

Ho et al. establish a simple, effective training recipe (ε-prediction, U-Net), launching the diffusion era.

Diffusion Model (concept)

2021

Improved DDPM

Nichol & Dhariwal introduce the cosine schedule and learned variance, improving log-likelihood.

Improved Denoising Diffusion Probabilistic Models (paper)

2021

ADM — Diffusion Models Beat GANs

Inflection point

Using classifier guidance, Dhariwal & Nichol surpass GANs on ImageNet and set the pixel-space state of the art.

Diffusion Models Beat GANs on Image Synthesis (paper)

2022

GLIDE and DALL·E 2 — pixel text-to-image

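OpenAI applies pixel-space diffusion with classifier-free guidance to text conditioning: GLIDE (arXiv, December 2021) and DALL·E 2 (April 2022), whose unCLIP design pairs a CLIP-embedding prior with a diffusion decoder.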

2022

Imagen — cascaded pixel-space diffusion

Inflection point

Google combines a frozen T5 text encoder with a 64→256→1024 cascade and dynamic thresholding, reaching high photorealism.

Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding (Imagen) (paper)

2022

Latent Diffusion shifts the lead to latent space

LDM/Stable Diffusion shows that latent-space diffusion is one to two orders of magnitude cheaper at comparable quality, narrowing the role of pure pixel-space models.

LDM (concept)

Sources

Denoising Diffusion Probabilistic Models (DDPM)

Paper

arXiv / NeurIPS 2020

Deep Unsupervised Learning using Nonequilibrium Thermodynamics

Paper

arXiv / ICML 2015

Improved Denoising Diffusion Probabilistic Models

Paper

arXiv / ICML 2021

Diffusion Models Beat GANs on Image Synthesis (ADM)

Paper

arXiv / NeurIPS 2021

Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding (Imagen)

Paper

arXiv / NeurIPS 2022

GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

Paper

arXiv
