The generator G(z;θ_G) maps a noise vector z (usually drawn from 𝒩(0,I) or U(−1,1)) to a sample. The discriminator D(x;θ_D) returns the probability that x is real. Training alternates two steps: (1) D step: maximize log D(x) + log(1 − D(G(z))) on a minibatch of real and fake samples (a binary classification problem); (2) G step: minimize log(1 − D(G(z))) or, in practice, maximize log D(G(z)) (the non-saturating loss, which gives better gradients). In the G step, gradients flow through D into G while θ_D is held fixed. Variants change the loss and regularization: WGAN (Wasserstein distance + weight clipping), WGAN-GP (gradient penalty), LSGAN (least squares), hinge loss, spectral normalization. Architectural variants: DCGAN (convolutions), conditional GAN (condition c fed into both networks), Pix2Pix/CycleGAN (image-to-image), StyleGAN (style-based generator with a latent mapping z → w), BigGAN (large scale + self-attention). Training is delicate: it requires balancing the power of G and D.
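The two alternating steps above correspond to the standard minimax value function, written out here for reference. The symbols p_data (data distribution) and p_z (noise prior) are notation assumed for this sketch and do not appear in the text above.

```latex
\min_{\theta_G} \max_{\theta_D} V(D, G)
  = \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x;\theta_D)\big]
  + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z;\theta_G);\theta_D)\big)\big]

% Non-saturating G step used in practice:
\max_{\theta_G} \; \mathbb{E}_{z \sim p_z}\big[\log D(G(z;\theta_G);\theta_D)\big]
```

In practice both expectations are estimated on minibatches, one of real samples and one of generated samples.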
Earlier generative models (VAE) produced blurry samples due to the averaging nature of reconstruction losses, and explicit-density models were computationally expensive. GANs bypass explicit density modeling — learning the distribution implicitly via the discriminator signal — leading to sharp, realistic samples and fast single-pass generation.
A network transforming a latent vector z into a sample G(z). Trained to fool the discriminator. In StyleGAN preceded by a mapping network z → w.
A binary classifier (or a critic in WGAN returning a scalar) providing the learning signal to the generator. Usually discarded after training.
The loss defining the game: vanilla (BCE), non-saturating, Wasserstein, least squares, hinge. The choice strongly affects stability.
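The loss families listed above can be compared side by side in a few lines. The following is a minimal sketch in plain Python, assuming the discriminator returns raw scores (logits for the BCE-based losses, unbounded critic values otherwise); the function names `d_loss` and `g_loss` are hypothetical, and the least-squares targets (0 for fake, 1 for real) are one common choice rather than the only one.

```python
import math

def softplus(a):
    # Numerically stable log(1 + exp(a)).
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))

def mean(xs):
    xs = list(xs)
    return sum(xs) / len(xs)

def d_loss(kind, real, fake):
    """Discriminator/critic loss (to minimize) from raw scores on real and fake batches."""
    if kind in ("vanilla", "non_saturating"):
        # BCE on logits: -log sigmoid(r) - log(1 - sigmoid(f))
        return mean(softplus(-r) for r in real) + mean(softplus(f) for f in fake)
    if kind == "wasserstein":
        return mean(fake) - mean(real)
    if kind == "least_squares":
        return 0.5 * mean((r - 1.0) ** 2 for r in real) + 0.5 * mean(f ** 2 for f in fake)
    if kind == "hinge":
        return mean(max(0.0, 1.0 - r) for r in real) + mean(max(0.0, 1.0 + f) for f in fake)
    raise ValueError(f"unknown loss: {kind}")

def g_loss(kind, fake):
    """Generator loss (to minimize) from raw discriminator scores on fake samples."""
    if kind == "vanilla":
        # log(1 - sigmoid(f)) = -softplus(f)
        return mean(-softplus(f) for f in fake)
    if kind == "non_saturating":
        # -log sigmoid(f) = softplus(-f)
        return mean(softplus(-f) for f in fake)
    if kind in ("wasserstein", "hinge"):
        return -mean(fake)
    if kind == "least_squares":
        return 0.5 * mean((f - 1.0) ** 2 for f in fake)
    raise ValueError(f"unknown loss: {kind}")
```

For example, `d_loss("hinge", [2.0], [-2.0])` is zero because both scores already sit outside the margin, which is why the hinge loss stops pushing a discriminator that is already confident.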
The input distribution (usually 𝒩(0,I)). In StyleGAN mapped into a style space W with better disentanglement properties.
The generator produces limited (or single-mode) sample diversity, ignoring parts of the data distribution.
The minimax game may not converge — losses oscillate and quality fluctuates; the G/D power balance is delicate.
When the discriminator is too strong, D(G(z)) → 0, so log(1 − D(G(z))) saturates near 0 and the generator gradient vanishes.
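The saturation can be checked numerically. The sketch below assumes D(G(z)) = sigmoid(a) for a discriminator logit a (an assumption about the output layer, not stated above) and compares the gradient with respect to a of the saturating loss log(1 − D(G(z))) with that of the non-saturating loss −log D(G(z)).

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def grad_saturating(a):
    # d/da log(1 - sigmoid(a)) = -sigmoid(a)
    return -sigmoid(a)

def grad_non_saturating(a):
    # d/da [-log sigmoid(a)] = -(1 - sigmoid(a))
    return -(1.0 - sigmoid(a))

# A more negative logit means a more confident discriminator on the fake sample.
for a in (0.0, -5.0, -10.0):
    print(f"logit={a:6.1f}  D(G(z))={sigmoid(a):.5f}  "
          f"saturating={grad_saturating(a):.5f}  "
          f"non_saturating={grad_non_saturating(a):.5f}")
```

As the logit becomes more negative, the saturating gradient shrinks toward zero while the non-saturating gradient approaches −1 in magnitude, which is the motivation for the non-saturating loss.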
The lack of explicit likelihood complicates evaluation; metrics (FID, IS) are imperfect and sensitive.
Goodfellow et al. introduce the two-network minimax game as a new generative paradigm.
Radford et al. establish architectural patterns enabling stable image GAN training.
Arjovsky et al. and Gulrajani et al. introduce the Wasserstein objective and the gradient penalty, improving training stability and mitigating mode collapse.
Isola et al. and Zhu et al. enable paired and unpaired image translation.
Karras et al. (progressive growing) and Brock et al. (large scale + self-attention) reach high-resolution, photorealistic samples.
Karras et al. introduce the style space W and feature control, setting SoTA in face generation.
Dhariwal & Nichol show diffusion models surpass GANs in quality and diversity, ending the GAN-dominance era.
Adversarial losses remain a component of the VAEs used in latent diffusion models; GANs still dominate audio vocoders and single-pass super-resolution, and adversarial objectives are used in diffusion distillation (e.g. adversarial distillation).
Vanilla/non-saturating / Wasserstein / LSGAN / hinge — critical for stability.
Dimension of the noise vector z (typically 100-512).
Number of discriminator steps per generator step (WGAN-GP uses 5:1).
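As a minimal sketch of how this ratio shapes the training loop, the hypothetical helper below (`update_schedule` and its `n_critic` parameter are names chosen for illustration) lists which network is updated at each step.

```python
def update_schedule(total_steps, n_critic=5):
    """Return a list of "D"/"G" labels with n_critic D updates before each G update."""
    schedule = []
    for step in range(total_steps):
        # Every (n_critic + 1)-th update goes to the generator.
        is_g_step = (step + 1) % (n_critic + 1) == 0
        schedule.append("G" if is_g_step else "D")
    return schedule

print(" ".join(update_schedule(12, n_critic=5)))
```

In a real training loop each "D" entry would draw a fresh minibatch of real and generated samples, and each "G" entry would backpropagate through the current discriminator.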
Spectral norm, gradient penalty, R1, weight clipping — stabilize training.
DCGAN / StyleGAN / BigGAN / Pix2Pix / CycleGAN — determines capabilities and cost.
During training, both the generator and the discriminator are fully active; after training, inference uses only the generator (single forward pass).
Inference is a single generator forward pass — fully parallel and fast (no iterative denoising like diffusion). Training is batch-parallel, but the alternating G/D steps introduce a sequential dependency between updates.
Convolutional G and D map well onto tensor cores. StyleGAN/BigGAN are trained on GPU clusters; generator inference is a fast single forward pass.
BigGAN was trained on TPU; convolutions and attention map well.