Training: the condition c (e.g. a text embedding) is randomly replaced with an empty token ∅ with probability p_uncond (typically 0.1-0.2), so a single set of weights learns both the conditional prediction ε_θ(x,c) and the unconditional prediction ε_θ(x,∅). Inference: at each denoising (or generation) step, two passes are computed, one conditional and one unconditional, and the result is linearly extrapolated: ε̃ = ε_θ(x,∅) + w·(ε_θ(x,c) − ε_θ(x,∅)). This is algebraically identical to ε̃ = (1−w)·ε_θ(x,∅) + w·ε_θ(x,c). Some sources, including the original Ho & Salimans paper, instead write ε̃ = (1+w)·ε_θ(x,c) − w·ε_θ(x,∅), in which case w = 0 means no guidance. The vector (ε_θ(x,c) − ε_θ(x,∅)) points in the "condition direction"; the scale w amplifies it. In the convention used here, w = 1 means no guidance (the purely conditional prediction), w > 1 amplifies the condition, and w = 0 gives the purely unconditional prediction. Cost: roughly 2× inference compute (two forward passes per step); batching the conditional and unconditional passes together reduces the latency overhead but not the total amount of computation.
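The extrapolation formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: `cfg_combine` is a hypothetical helper name, and the two arrays are made-up stand-ins for the model outputs ε_θ(x,∅) and ε_θ(x,c).

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, w):
    """Extrapolate from the unconditional prediction toward the conditional one."""
    return eps_uncond + w * (eps_cond - eps_uncond)

# Stand-in predictions (hypothetical values, not real model outputs).
eps_uncond = np.array([0.0, 1.0, -1.0])
eps_cond = np.array([1.0, 1.0, 0.0])

# w = 1 returns the conditional prediction unchanged (no guidance).
no_guidance = cfg_combine(eps_uncond, eps_cond, w=1.0)

# w > 1 pushes the result past the conditional prediction.
guided = cfg_combine(eps_uncond, eps_cond, w=3.0)

# The weighted-sum form is the same expression rearranged.
w = 3.0
weighted_sum = (1.0 - w) * eps_uncond + w * eps_cond
```

With these values, `guided` and `weighted_sum` agree element by element, which is the equivalence of the two forms of the formula.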
Conditional generative models often follow the condition (prompt) weakly, producing content only loosely related to it. Earlier classifier guidance required a separate classifier trained on noisy data, which is costly and difficult. Classifier-free guidance (CFG) strongly amplifies conditioning using only the generative model itself, with no extra networks.
During training, the condition c is randomly replaced with an empty token ∅ with probability p_uncond (usually 0.1-0.2).
Two model passes per step: ε_θ(x,c) and ε_θ(x,∅). Often batched together.
ε̃ = ε_θ(x,∅) + w·(ε_θ(x,c) − ε_θ(x,∅)). The scale w controls conditioning strength.
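The three components above (condition dropout, two batched passes, and the guidance combination) can be sketched together as follows. Everything here is an assumption made for illustration: `toy_model` is a hypothetical stand-in for ε_θ, the null condition is represented as a zero vector, and `drop_condition` and `guided_eps` are invented helper names rather than any library's API.

```python
import numpy as np

rng = np.random.default_rng(0)

def drop_condition(cond, null_cond, p_uncond, rng):
    """Replace each sample's condition with the null embedding with prob. p_uncond."""
    drop = rng.random(cond.shape[0]) < p_uncond      # one decision per sample
    return np.where(drop[:, None], null_cond[None, :], cond), drop

def toy_model(x, cond):
    """Hypothetical stand-in for eps_theta(x, c); a real model is a neural network."""
    return 0.5 * x + cond

def guided_eps(x, cond, null_cond, w):
    """Run both passes as one batch of size 2B, then apply the guidance formula."""
    batch = x.shape[0]
    x_in = np.concatenate([x, x], axis=0)
    null_batch = np.broadcast_to(null_cond, cond.shape)
    c_in = np.concatenate([null_batch, cond], axis=0)
    eps = toy_model(x_in, c_in)                      # single batched forward pass
    eps_uncond, eps_cond = eps[:batch], eps[batch:]
    return eps_uncond + w * (eps_cond - eps_uncond)

# Training side: a fraction of the conditions is swapped for the null embedding.
cond = rng.normal(size=(8, 4))
null_cond = np.zeros(4)                              # assumed null representation
train_cond, dropped = drop_condition(cond, null_cond, p_uncond=0.1, rng=rng)

# Inference side: one guided prediction for a batch of noisy inputs.
x = rng.normal(size=(8, 4))
eps_tilde = guided_eps(x, cond, null_cond, w=7.5)
```

Concatenating the inputs keeps the model call count at one per step, although the batch the model processes is twice as large.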
Large guidance scale causes oversaturated colors, posterization, and unnatural textures.
Two forward passes per step (conditional + unconditional) double the compute cost.
High w increases condition fidelity at the cost of sample diversity.
Dhariwal & Nichol introduce guidance using a separate classifier trained on noisy data.
Ho & Salimans show the separate classifier is unnecessary — a joint conditional/unconditional model suffices.
CFG becomes the standard conditioning mechanism across most leading text-to-image models.
Lin et al. diagnose oversaturation at high w and propose rescaling the guided prediction together with a zero-terminal-SNR noise schedule.
Distilling CFG into a single forward pass removes the 2× compute overhead (e.g. in few-step models).
Conditioning strength. w = 1 means no guidance; typical values are 5-12 for images and 1-3 for video. Values that are too high cause artifacts.
Condition-dropout probability during training (typically 0.1-0.2).
Variance renormalization coefficient against oversaturation at high w.
Constant vs time-varying guidance scale (e.g. disabling CFG in late steps).
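The last two hyperparameters can be illustrated with a short sketch. The rescaling follows the general form described by Lin et al. (match the standard deviation of the guided prediction to that of the conditional one, then blend with a coefficient, here called `phi`); the step schedule is a purely hypothetical example of a time-varying scale. The function names, the `cutoff` parameter, and the choice of computing statistics over all non-batch axes are assumptions of this sketch.

```python
import numpy as np

def cfg_rescaled(eps_uncond, eps_cond, w, phi):
    """CFG followed by per-sample standard-deviation rescaling, blended by phi."""
    eps_cfg = eps_uncond + w * (eps_cond - eps_uncond)
    axes = tuple(range(1, eps_cond.ndim))            # all non-batch axes
    std_cond = eps_cond.std(axis=axes, keepdims=True)
    std_cfg = eps_cfg.std(axis=axes, keepdims=True)  # assumed non-zero here
    eps_rescaled = eps_cfg * (std_cond / std_cfg)
    # phi = 0 keeps plain CFG; phi = 1 fully matches the conditional std.
    return phi * eps_rescaled + (1.0 - phi) * eps_cfg

def guidance_at_step(step, num_steps, w, cutoff=0.8):
    """Hypothetical schedule: full guidance early, none (w = 1) after `cutoff`."""
    return w if step < cutoff * num_steps else 1.0
```

A sampler could call `guidance_at_step` once per step and pass the result to `cfg_rescaled`; when the returned scale is 1, the unconditional pass could be skipped for that step, since the result then equals the conditional prediction.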
The full model is active twice per inference step (conditional and unconditional).
The conditional and unconditional passes can be computed in parallel within one batch, but CFG operates inside the base model's sequential denoising/generation loop.
CFG is a wrapper over a diffusion/AR model — it inherits the base model's hardware profile (GPU tensor cores).
The guidance logic itself is a cheap linear tensor combination — it imposes no specific hardware requirement.