The input to DiT is a noisy latent z_t (e.g. 32×32×4 from the Stable Diffusion VAE). (1) Patchify: the latent is split into p×p patches (p ∈ {2, 4, 8}) and each patch is linearly projected to a token of dimension d, forming a sequence of T = (h/p)·(w/p) tokens. (2) Positional embedding: 2D position embeddings (usually sinusoidal or learned) are added to the tokens. (3) Conditioning: the timestep t and condition c are embedded and injected into every block. The best-performing variant, adaLN-Zero, regresses the LayerNorm scale and shift (γ, β) and the residual gates (α) from (t, c), with α initialized to zero so that each block starts as the identity. (4) Transformer blocks: standard self-attention + MLP over the patch tokens. (5) Decode: a final LayerNorm and linear layer map each token back to a patch of predicted noise (and optionally covariance), and the patches are rearranged into a latent-shaped prediction ε. Inference works as in any diffusion model: iterative denoising with a scheduler, with classifier-free guidance (CFG) for conditioning. MMDiT (SD3) extends this with separate text and image token streams that are joined in the attention blocks.
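The CFG step mentioned for inference can be sketched as a single extrapolation between two noise predictions. This is a minimal illustration, not DiT's reference code: `cfg_combine`, the array shapes, and the guidance scale of 4.0 are assumptions chosen for the example, and the two predictions would normally come from running the model with and without the condition c.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: start from the unconditional prediction and
    move toward (and past) the conditional one by `guidance_scale`."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Stand-in predictions with the shape of a 32x32x4 latent (assumed values).
eps_uncond = np.zeros((32, 32, 4))
eps_cond = np.ones((32, 32, 4))

guided = cfg_combine(eps_uncond, eps_cond, guidance_scale=4.0)
same_as_cond = cfg_combine(eps_uncond, eps_cond, guidance_scale=1.0)
```

With a scale of 1 the result is just the conditional prediction; larger scales push the sample further toward the condition at the cost of diversity.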
The convolutional U-Net used in earlier diffusion models is argued to scale poorly: quality saturates as parameters grow, and the inductive biases of convolution make it harder to exploit very large compute budgets. DiT shows that a pure Transformer scales much better (sample quality improves steadily as model FLOPs increase), with scaling behavior similar to that observed for LLMs.
Splits the latent into p×p patches and linearly projects each to a token of dimension d. Smaller p → more tokens → higher quality but higher cost.
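A minimal NumPy sketch of patchify, assuming a channels-last latent of shape (h, w, c). The function name `patchify`, the token width d = 384, and the random projection weights are illustrative assumptions; a trained model would use learned weights.

```python
import numpy as np

def patchify(latent, p):
    """Split an (h, w, c) latent into T = (h/p)*(w/p) flattened patches
    of length p*p*c, in row-major patch order."""
    h, w, c = latent.shape
    assert h % p == 0 and w % p == 0, "patch size must divide the latent"
    gh, gw = h // p, w // p
    x = latent.reshape(gh, p, gw, p, c)   # split both spatial axes
    x = x.transpose(0, 2, 1, 3, 4)        # (gh, gw, p, p, c)
    return x.reshape(gh * gw, p * p * c)

rng = np.random.default_rng(0)
latent = rng.standard_normal((32, 32, 4))
p, d = 2, 384                             # d is an assumed token width

patches = patchify(latent, p)             # (256, 16)
W = rng.standard_normal((p * p * 4, d)) * 0.02   # stand-in for learned weights
b = np.zeros(d)
tokens = patches @ W + b                  # (256, 384)
```

Each row of `patches` is one p×p×c patch, so the linear projection is a single matrix multiply over the whole sequence.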
Adaptive LayerNorm regressing scale/shift (γ,β) and residual gates (α) from the (t,c) embedding. α initialized to 0 → each block starts as identity, stabilizing training.
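The identity-at-initialization property can be checked with a small NumPy sketch of one gated residual branch. The names `adaln_zero_branch`, `W_mod`, and `b_mod` are hypothetical, the `(1 + γ)` scaling follows a convention seen in common implementations, and the conditioning vector is assumed to be an already-computed embedding of (t, c).

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """LayerNorm over the last axis, without learned affine parameters."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def adaln_zero_branch(x, cond, W_mod, b_mod, sublayer):
    """x + alpha * sublayer(LN(x) * (1 + gamma) + beta), where
    gamma, beta, alpha are regressed from the conditioning vector."""
    gamma, beta, alpha = np.split(cond @ W_mod + b_mod, 3)
    h = layer_norm(x) * (1.0 + gamma) + beta
    return x + alpha * sublayer(h)

rng = np.random.default_rng(0)
T, d, d_c = 16, 8, 8                      # small assumed sizes
x = rng.standard_normal((T, d))
cond = rng.standard_normal(d_c)           # stand-in for the (t, c) embedding
W_mlp = rng.standard_normal((d, d)) * 0.1
mlp = lambda h: np.tanh(h @ W_mlp)        # stand-in for attention or the MLP

# Zero-initialized modulation layer: alpha = 0, so the branch is the identity.
W_mod = np.zeros((d_c, 3 * d))
b_mod = np.zeros(3 * d)
out_init = adaln_zero_branch(x, cond, W_mod, b_mod, mlp)

# After training the modulation weights are nonzero and the branch contributes.
W_mod_trained = rng.standard_normal((d_c, 3 * d)) * 0.1
out_trained = adaln_zero_branch(x, cond, W_mod_trained, b_mod, mlp)
```

Because α multiplies the whole sublayer output, zero-initializing only the modulation layer is enough to make every block start as the identity.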
Standard Transformer blocks (multi-head self-attention + feed-forward). Global receptive field over all patches from the first layer.
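The global receptive field can be seen in a single-head self-attention sketch: every output token is a weighted sum over all T patch tokens. The weight matrices and sizes here are random stand-ins, and multi-head splitting is omitted for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over (T, d) tokens.
    Returns the attended tokens and the (T, T) attention weights."""
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
T, d = 16, 8                              # small assumed sizes
tokens = rng.standard_normal((T, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))

out, weights = self_attention(tokens, Wq, Wk, Wv)
```

The (T, T) weight matrix has no structural zeros, which is what distinguishes this from a convolution's local window.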
Sinusoidal or learned position embeddings added to patch tokens (self-attention itself is permutation-equivariant, so without them the model cannot tell where a patch came from).
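One common way to build fixed 2D sinusoidal embeddings is to encode the row and column indices separately and concatenate them. The sketch below assumes that layout; the function names and the base of 10000 are illustrative choices rather than a specific library's API.

```python
import numpy as np

def sincos_1d(d, pos):
    """(len(pos), d) sinusoidal embedding of 1-D positions; d must be even."""
    omega = 1.0 / 10000 ** (np.arange(d // 2) / (d / 2))
    angles = np.outer(pos, omega)                     # (N, d/2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

def sincos_2d(d, gh, gw):
    """(gh*gw, d) embedding: the first half of the channels encodes the
    patch row, the second half the patch column."""
    assert d % 4 == 0, "d must be divisible by 4"
    rows, cols = np.meshgrid(np.arange(gh), np.arange(gw), indexing="ij")
    emb_rows = sincos_1d(d // 2, rows.reshape(-1))
    emb_cols = sincos_1d(d // 2, cols.reshape(-1))
    return np.concatenate([emb_rows, emb_cols], axis=1)

# 32x32 latent with p = 2 gives a 16x16 grid of patches (d = 384 assumed).
pos = sincos_2d(384, 16, 16)              # (256, 384), added to the tokens
```

The embedding rows follow the same row-major order as the patch tokens, so they can be added elementwise.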
LayerNorm + linear mapping each token to a predicted-noise patch (and optionally covariance), rearranged into the latent ε.
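A sketch of the decode step, assuming channels-last layout and a head that predicts both noise and covariance (hence 2·C output channels per pixel). `unpatchify`, `W_out`, and all sizes are illustrative assumptions, with random weights standing in for learned ones.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """LayerNorm over the last axis, without learned affine parameters."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def unpatchify(patch_out, gh, gw, p, c_out):
    """Rearrange (T, p*p*c_out) patch predictions into (gh*p, gw*p, c_out)."""
    x = patch_out.reshape(gh, gw, p, p, c_out)
    x = x.transpose(0, 2, 1, 3, 4)        # (gh, p, gw, p, c_out)
    return x.reshape(gh * p, gw * p, c_out)

rng = np.random.default_rng(0)
gh = gw = 16
p, C, d = 2, 4, 384                       # assumed sizes
tokens = rng.standard_normal((gh * gw, d))

# Final LayerNorm + linear: each token becomes one p x p x 2C output patch.
W_out = rng.standard_normal((d, p * p * 2 * C)) * 0.02
patch_out = layer_norm(tokens) @ W_out    # (256, 32)

out = unpatchify(patch_out, gh, gw, p, 2 * C)        # (32, 32, 8)
eps_pred, cov_pred = out[..., :C], out[..., C:]      # noise and covariance
```

When covariance is not predicted, the head outputs only C channels and the final split is unnecessary.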
Reducing p from 8 to 2 increases the token count 16× and the attention cost 256×, which can easily exceed available memory.
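The 16× and 256× factors follow directly from T = (h/p)·(w/p) and the quadratic attention cost. A short check, assuming the 32×32 latent used as the running example:

```python
def num_tokens(h, w, p):
    """Sequence length for an h x w latent split into p x p patches."""
    return (h // p) * (w // p)

h = w = 32
base = num_tokens(h, w, 8)                # coarsest patch size as baseline

ratios = {}
for p in (8, 4, 2):
    T = num_tokens(h, w, p)
    token_ratio = T // base               # growth in sequence length
    attn_ratio = token_ratio ** 2         # attention scales with T^2
    ratios[p] = (T, token_ratio, attn_ratio)
    print(f"p={p}: T={T}, tokens x{token_ratio}, attention x{attn_ratio}")
```

Halving p quadruples T, so each halving multiplies the attention term by 16.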
Naive conditioning (in-context, cross-attn) trains worse and less stably than adaLN-Zero.
DiT needs more data and compute than U-Net to learn local correlations that convolution provides for free.
Peebles & Xie show that a Transformer replacing the U-Net improves steadily with FLOPs and outperforms prior U-Net-based diffusion models on class-conditional ImageNet generation.
Chen et al. train a text-to-image DiT at a fraction of Stable Diffusion's training cost, using cross-attention to inject text.
OpenAI describes Sora as a diffusion transformer operating on spacetime latent video patches.
Stability AI introduces MMDiT with separate text and image token streams plus rectified flow.
Black Forest Labs releases Flux, a leading open model based on DiT/MMDiT (12B).
Time complexity: O(T² · d) for self-attention per denoising step, where T is the number of patches and d is the model dimension.
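A rough FLOP count makes the T² · d term concrete. The sketch counts only matrix multiplies in one attention layer, with one multiply-add counted as 2 FLOPs. The helper names are hypothetical and the sizes (T = 256, d = 1152) are assumed for illustration.

```python
def attention_flops(T, d):
    """FLOPs for the two T x T matmuls (Q @ K^T and weights @ V) in one layer.
    Each needs T*T*d multiply-adds, i.e. 2*T*T*d FLOPs."""
    return 2 * (2 * T * T * d)

def projection_flops(T, d):
    """FLOPs for the Q, K, V and output projections (four d x d matmuls)."""
    return 4 * (2 * T * d * d)

T, d = 256, 1152                          # assumed sizes
attn = attention_flops(T, d)
proj = projection_flops(T, d)
print(f"attention: {attn:,} FLOPs, projections: {proj:,} FLOPs")

# Doubling T quadruples the attention term but only doubles the projections.
attn_growth = attention_flops(2 * T, d) / attn
proj_growth = projection_flops(2 * T, d) / proj
```

Under these assumptions the ratio of the two terms is T / (2d), so the quadratic term only dominates once the sequence is long relative to the model width. That is why small patch sizes and high resolutions are where attention cost becomes the bottleneck.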
Patch size (2, 4, 8). Smaller p → more tokens → higher quality, with attention cost growing quadratically in the token count.
DiT-S/B/L/XL configurations analogous to ViT. Scaling improves FID monotonically.
adaLN-Zero vs in-context vs cross-attention. adaLN-Zero achieved the best results.
Determined by latent resolution and patch size. Attention cost grows O(T²).
All parameters of the Transformer are active at every denoising step (a dense model), except in mixture-of-experts (MoE-DiT) variants.
Training and a single inference step are fully token-parallel (as in any Transformer). Only the diffusion denoising steps remain sequential — common to all diffusion models.
A pure Transformer is a matmul-heavy workload ideal for tensor cores; it benefits from flash-attention and LLM-grade optimizations.
Transformers parallelize well on TPU (XLA); large DiT/MMDiT training runs use TPU pods or multi-GPU clusters.