The encoder has several levels; each level is two 3×3 convs + ReLU + BatchNorm/GroupNorm followed by 2× downsampling (max-pool or stride-2). Channels double at each level (64 → 128 → 256 → 512 → 1024 in the original). A bottleneck at the bottom connects the paths. The decoder upsamples (transposed conv or interpolation + conv) and CONCATENATES the result with the corresponding encoder activation (skip connection), then applies two convs. The output is a 1×1 convolution mapping to class count (segmentation) or channel count (regression, noise prediction). In diffusion models the U-Net is additionally conditioned on: (a) a timestep t embedding (sinusoidal, added in every block), (b) a condition c embedding via cross-attention. Self-attention is applied in low-resolution blocks (8×8, 16×16, 32×32) where the N² cost is affordable.
Classical classification CNNs lose spatial precision through deep pooling. Pure decoder networks without direct access to early activations cannot recover precise object boundaries. U-Net solves this with skip connections that deliver "fresh" local information from encoder to decoder, enabling pixel-perfect prediction at original resolution.
A sequence of convolutional blocks with downsampling. Each level compresses spatially and doubles channels, extracting increasingly global features.
A symmetric upsampling path (transposed conv or interpolation + conv) restoring the original size while reducing channels.
Encoder activations are concatenated with decoder activations at the same resolution level. Critical for boundary precision.
Convolutional (and often attention) blocks operating at the smallest resolution and largest channel count; holds the most global context.
Sinusoidal embedding of t passed through an MLP and added as bias in each residual block. Specific to diffusion U-Nets.
Official
Q from U-Net activations, K/V from condition embedding c (e.g. CLIP). Inserted in attention blocks at multiple resolutions.
Official
Transposed conv with poor kernel sizing produces regular checkerboard patterns visible especially in generation.
Encoder-decoder resolution mismatches at odd sizes or wrong padding break concatenation.
3D medical volumes cost O(D·H·W·C) memory — easy to exceed VRAM even on an A100.
Batch statistics depend on the noise distribution and are unstable across timesteps t.
Ronneberger, Fischer & Brox publish U-Net for biomedical segmentation; wins the ISBI cell tracking challenge.
Extension to 3D medical volumes (Çiçek et al., Milletari et al.).
Isensee et al. build a generic U-Net framework that auto-configures hyperparameters per dataset.
Ho et al. use a U-Net with attention and timestep embedding as the standard ε_θ in diffusion.
Latent Diffusion (Rombach et al.) introduces a U-Net with text-conditioned cross-attention; backbone of SD 1.x/2.x/SDXL.
SDXL scales the U-Net to 2.6B parameters with two-stage refinement for native 1024 px.
Zhang & Agrawala add a parallel frozen copy of the U-Net for precise control (depth, pose, edges).
Diffusion Transformer (Peebles & Xie) and SD3 replace the U-Net with a pure Transformer architecture; U-Net remains dominant in many pipelines, however.
Number of pyramid levels (typically 4-5 in segmentation, 4-6 in diffusion).
Channels at the first level (64 in the original, 128-320 in SD/SDXL).
Per-level channel multipliers (e.g. [1,2,4,4] in SD).
Resolutions where self/cross-attention is applied (e.g. 32×32, 16×16, 8×8).
BatchNorm (original), GroupNorm (DDPM/SD), LayerNorm. GroupNorm is the diffusion standard.
Transposed conv vs interpolation + conv (fewer checkerboard artifacts).
The entire network is active for each image / diffusion step.
All convolutions, skip connections and attention layers are fully parallel within batch and spatial dimensions. No inner recurrence.
Convolutions, attention and upsampling map ideally to tensor cores. The entire U-Net family runs optimally on NVIDIA/AMD GPUs.
TPUs handle 2D convolutions very well; training of Imagen and other diffusion models runs on TPUs.
A small medical-segmentation U-Net runs on CPUs with AVX2/AVX-512, but diffusion inference is practically infeasible.