Architecture

DiT

2023 · Active · Published: 8 June 2026 · Updated: 8 June 2026

Key innovation

Replaces the convolutional U-Net in diffusion models with a pure Transformer operating on latent patches, yielding a simpler architecture that scales better: sample quality improves consistently as model FLOPs increase.

How it works

The input to DiT is a noisy latent z_t of size h×w×c (e.g. 32×32×4 from the Stable Diffusion VAE). It passes through five stages:

(1) Patchify: the latent is split into p×p patches (p = 2, 4 or 8) and each patch is linearly projected to a token of dimension d, forming a sequence of T = (h/p)·(w/p) tokens.
(2) Positional embedding: 2D position embeddings (usually sinusoidal or learned) are added to the tokens.
(3) Conditioning: the timestep t and the condition c are encoded and injected into each block. The best variant, adaLN-Zero, parameterizes the LayerNorm scale and shift (γ, β) and the residual gates (α) as functions of (t, c), with α initialized to zero so that each block starts as the identity.
(4) Transformer blocks: standard self-attention + MLP over the patch tokens.
(5) Decode: a final LayerNorm and linear layer project each token back to a patch of predicted noise (and, optionally, covariance); the patches are rearranged into the output ε at the latent's spatial size.

Inference works as in any diffusion model: iterative denoising with a scheduler, using classifier-free guidance (CFG) for conditioning. MMDiT (SD3) extends this design with separate text and image token streams that are merged in the attention blocks.
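The stages above can be followed as a shape trace for a single sample. This is a minimal sketch, not the reference implementation: `dit_shapes` is a hypothetical helper, the token width d = 1152 is assumed as an example value, and the output is assumed to carry 2·c channels when covariance is predicted alongside the noise.

```python
def dit_shapes(h=32, w=32, c=4, p=2, d=1152, learn_sigma=True):
    """Trace tensor shapes through the DiT stages for one sample (hypothetical helper)."""
    if h % p or w % p:
        raise ValueError("latent height and width must be divisible by p")
    tokens = (h // p) * (w // p)           # T = (h/p) * (w/p)
    out_c = 2 * c if learn_sigma else c    # noise channels, plus covariance if predicted
    return {
        "latent": (h, w, c),                 # noisy latent z_t
        "patches": (tokens, p * p * c),      # flattened p x p patches
        "tokens": (tokens, d),               # after projection and position embedding
        "decoded": (tokens, p * p * out_c),  # output of the final linear layer
        "output": (h, w, out_c),             # patches rearranged into the latent grid
    }


shapes = dit_shapes()
for stage, shape in shapes.items():
    print(f"{stage:8s} {shape}")
```

The Transformer blocks in stage (4) leave the `(T, d)` token shape unchanged, which is why they do not appear as a separate entry.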

Problem solved

The convolutional U-Net in diffusion models has limited scalability: quality saturates as parameters grow, and the inductive biases of convolution make it harder to exploit very large compute budgets. DiT shows that a pure Transformer scales much better (quality improves consistently with FLOPs), mirroring the scaling behaviour observed for LLMs.

Components

Patchify layer: Converts the latent into a token sequence

Splits the latent into p×p patches and linearly projects each to a token of dimension d. Smaller p → more tokens → higher quality but higher cost.

IN: Noisy latent from the VAE.

OUT: Sequence of T patch tokens.
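A minimal NumPy sketch of the patchify step, assuming a channels-last (h, w, c) latent and row-major patch order. The projection matrix `W` is a random stand-in for the learned linear layer, and d = 64 is an arbitrary illustrative width.

```python
import numpy as np


def patchify(latent, p):
    """Split an (h, w, c) latent into (T, p*p*c) flattened patches, row-major."""
    h, w, c = latent.shape
    assert h % p == 0 and w % p == 0, "h and w must be divisible by p"
    x = latent.reshape(h // p, p, w // p, p, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (h/p, w/p, p, p, c)
    return x.reshape((h // p) * (w // p), p * p * c)


rng = np.random.default_rng(0)
latent = rng.standard_normal((32, 32, 4))
p, d = 2, 64

patches = patchify(latent, p)
W = rng.standard_normal((p * p * 4, d)) * 0.02  # stand-in for the learned projection
b = np.zeros(d)
tokens = patches @ W + b  # one d-dimensional token per patch

print(patches.shape, tokens.shape)
```

Each row of `patches` holds one p×p×c block of the latent, so the second row corresponds to `latent[0:2, 2:4, :]`.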

adaLN-Zero conditioning: Conditions on timestep t and class/text c

Adaptive LayerNorm regressing scale/shift (γ,β) and residual gates (α) from the (t,c) embedding. α initialized to 0 → each block starts as identity, stabilizing training.

adaLN-Zero (best): With residual gates initialized to zero.

In-context conditioning: Conditions as extra tokens in the sequence.

Cross-attention conditioning: Separate cross-attention blocks (as in SD U-Net).
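The adaLN-Zero variant can be sketched for a single residual sublayer as follows. This is an illustration under stated assumptions rather than the official code: the `(1 + γ)` scaling convention, the SiLU applied to the conditioning embedding, and the zero-initialized modulation layer (which makes α, β and γ all zero at the start) are assumptions, and `toy_sublayer` is a hypothetical stand-in for attention or the MLP.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    """LayerNorm without learned affine terms; scale and shift come from conditioning."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


def silu(x):
    return x / (1.0 + np.exp(-x))


def adaln_zero_sublayer(x, cond, W_mod, b_mod, sublayer):
    """x: (T, d) tokens; cond: (d,) embedding of (t, c)."""
    beta, gamma, alpha = np.split(silu(cond) @ W_mod + b_mod, 3)
    h = layer_norm(x) * (1.0 + gamma) + beta  # conditioned normalization
    return x + alpha * sublayer(h)            # gated residual branch


rng = np.random.default_rng(0)
T, d = 16, 8
x = rng.standard_normal((T, d))
cond = rng.standard_normal(d)
W_sub = rng.standard_normal((d, d))


def toy_sublayer(h):
    """Hypothetical stand-in for self-attention or the MLP."""
    return np.tanh(h @ W_sub)


# Zero-initialized modulation layer: alpha = 0, so the block is the identity.
W_zero, b_zero = np.zeros((d, 3 * d)), np.zeros(3 * d)
out_init = adaln_zero_sublayer(x, cond, W_zero, b_zero, toy_sublayer)

# After training the weights are non-zero and the branch contributes.
W_trained = rng.standard_normal((d, 3 * d)) * 0.1
out_trained = adaln_zero_sublayer(x, cond, W_trained, b_zero, toy_sublayer)

print(np.allclose(out_init, x), np.allclose(out_trained, x))
```

With zero gates the residual branch contributes nothing, so early gradients flow through the skip path only, which is the stabilizing effect the text describes.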


Transformer blocks: Processes tokens via self-attention + MLP

Standard Transformer blocks (multi-head self-attention + feed-forward). Global receptive field over all patches from the first layer.
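The global receptive field can be seen in a single-head self-attention sketch: every token receives a non-zero weight from every other token in one layer. The random projection matrices and the sizes T = 256, d = 32 are illustrative assumptions; a real block uses multiple heads and learned weights.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x, Wq, Wk, Wv):
    """Single-head attention over (T, d) tokens; returns output and (T, T) weights."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v, weights


rng = np.random.default_rng(0)
T, d = 256, 32
x = rng.standard_normal((T, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

out, weights = self_attention(x, Wq, Wk, Wv)
print(out.shape, weights.shape)
```

The `(T, T)` weight matrix is also the source of the quadratic cost in the number of patch tokens.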

Positional embedding: Encodes 2D patch positions

Sinusoidal or learned position embeddings added to patch tokens. Self-attention is permutation-equivariant, so without them the model could not tell where a patch sits in the 2D grid.
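One common way to build fixed 2D sinusoidal embeddings is to encode the row index in half of the channels and the column index in the other half. The sketch below assumes that layout and a base frequency of 10000; the function names are hypothetical, and learned embeddings would replace the whole table with a trainable array.

```python
import numpy as np


def sincos_1d(positions, dim):
    """Sinusoidal embedding of 1D positions -> (len(positions), dim); dim must be even."""
    assert dim % 2 == 0
    freqs = 1.0 / (10000.0 ** (np.arange(dim // 2) / (dim / 2)))
    angles = np.outer(positions, freqs)  # (N, dim/2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)


def sincos_2d(grid_h, grid_w, dim):
    """Half of the channels encode the row, half the column -> (grid_h*grid_w, dim)."""
    assert dim % 4 == 0
    rows, cols = np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij")
    emb_rows = sincos_1d(rows.reshape(-1), dim // 2)
    emb_cols = sincos_1d(cols.reshape(-1), dim // 2)
    return np.concatenate([emb_rows, emb_cols], axis=1)


# 32x32 latent with p = 2 gives a 16x16 grid of patch tokens.
pos = sincos_2d(16, 16, 64)
print(pos.shape)
```

The table has one row per patch token in the same row-major order as the token sequence, so it can be added to the tokens directly.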

Final linear decoder: Projects tokens back to the noise latent

LayerNorm + linear mapping each token to a predicted-noise patch (and optionally covariance), rearranged into the latent ε.
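The rearrangement back into a latent is the inverse of patchify. A minimal sketch, assuming channels-last layout, row-major patch order, and that the noise channels come first and the covariance channels second (that ordering is an assumption); `decoded` is random data standing in for the output of the final linear layer.

```python
import numpy as np


def unpatchify(tokens_out, h, w, p, out_c):
    """Rearrange (T, p*p*out_c) per-token predictions into an (h, w, out_c) latent."""
    gh, gw = h // p, w // p
    assert tokens_out.shape == (gh * gw, p * p * out_c)
    x = tokens_out.reshape(gh, gw, p, p, out_c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, p, gw, p, out_c)
    return x.reshape(h, w, out_c)


rng = np.random.default_rng(0)
h, w, p, c = 32, 32, 2, 4
num_tokens = (h // p) * (w // p)

# Stand-in for the final linear layer's output: noise and covariance per patch.
decoded = rng.standard_normal((num_tokens, p * p * 2 * c))

full = unpatchify(decoded, h, w, p, 2 * c)
eps_pred, cov_pred = full[..., :c], full[..., c:]  # assumed channel ordering
print(eps_pred.shape, cov_pred.shape)
```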

Implementation

Reference implementations

DiT (official, Meta/Berkeley)

Diffusers — DiTTransformer2DModel / SD3

Python

Official

Flux (Black Forest Labs)

Python

Official

Implementation pitfalls

Quadratic attention cost at small patch size · High

Reducing p from 8 to 2 increases the token count 16× and the quadratic attention cost 256×, so it is easy to run out of memory.

Fix: Larger patch + better VAE, flash-attention, token merging, resolution curriculum.
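The 16× and 256× figures follow directly from T = (h/p)·(w/p) and the T² pairwise scores in self-attention. A short check for a 32×32 latent, where `attention_cost` is a hypothetical helper that counts score-matrix entries rather than measuring real FLOPs or memory:

```python
def attention_cost(h, w, p):
    """Return (tokens, pairwise attention scores) for an h x w latent."""
    tokens = (h // p) * (w // p)
    return tokens, tokens ** 2


base_tokens, base_pairs = attention_cost(32, 32, 8)
for p in (8, 4, 2):
    tokens, pairs = attention_cost(32, 32, p)
    print(f"p={p}: T={tokens:4d} ({tokens // base_tokens}x), "
          f"T^2={pairs:6d} ({pairs // base_pairs}x)")
```

Only the attention score matrix grows with T²; the MLP and the linear projections grow linearly with T, so they scale by 16× rather than 256× over the same change in p.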

Training instability without adaLN-Zero · Medium

Naive conditioning (in-context, cross-attn) trains worse and less stably than adaLN-Zero.

Fix: Use adaLN-Zero with residual gates initialized to zero.

No locality inductive bias · Medium

DiT needs more data and compute than U-Net to learn local correlations that convolution provides for free.

Fix: Pretraining on large datasets, larger compute budget, optionally hybrid conv blocks.

Evolution

Original paper · 2023 · ICCV 2023 · William Peebles

Scalable Diffusion Models with Transformers

William Peebles, Saining Xie

2022

DiT — preprint and scaling laws

Inflection point

Peebles & Xie show that a Transformer replacing the U-Net improves consistently with FLOPs and outperforms U-Net-based diffusion models on class-conditional ImageNet.

U-Net (concept)

2023

PixArt-α — efficient T2I DiT

Chen et al. train a text-to-image DiT at a fraction of SD's cost, using cross-attention for text.

PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis (paper)

2024

Sora — DiT for video (spacetime patches)

Inflection point

OpenAI describes Sora as a diffusion transformer operating on spacetime latent video patches.

Video generation models as world simulators (Sora technical report) (paper)

2024

SD3 — MMDiT (multimodal diffusion transformer)

Inflection point

Stability AI introduces MMDiT with separate text and image token streams plus rectified flow.

LDM (concept) · Scaling Rectified Flow Transformers for High-Resolution Image Synthesis (SD3) (paper)

2024

Flux — large open DiT

Black Forest Labs releases Flux, a leading open-weights model based on DiT/MMDiT (12B parameters).

Sources

Scalable Diffusion Models with Transformers (DiT)

Paper

arXiv / ICCV 2023

PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

Paper

arXiv / ICLR 2024

Scaling Rectified Flow Transformers for High-Resolution Image Synthesis (SD3 / MMDiT)

Paper

arXiv / ICML 2024

Video generation models as world simulators (Sora)

Blog

OpenAI

DiT official repository

Repository

GitHub / Meta
