GPU Tensor Cores (PRIMARY)
The matrix operations that dominate attention (Q·Kᵀ and attention·V) and the FFN layers are GEMMs, which GPU tensor cores (NVIDIA A100/H100/H200) execute with high efficiency. The Transformer architecture is the de facto benchmark driving tensor core requirements in modern GPUs.
NVIDIA designs its tensor cores and libraries (cuBLAS, cuDNN, FlashAttention CUDA kernels) around the operations that dominate Transformer workloads, and large-model training runs almost exclusively on A100/H100-class GPUs or newer.
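As a minimal sketch of the point above (the function name, shapes, and dtypes are illustrative assumptions, not from the source), attention reduces to two GEMMs that XLA can lower to tensor-core instructions when the inputs are bfloat16:

```python
import math

import jax
import jax.numpy as jnp

def sdpa(q, k, v):
    """Scaled dot-product attention written as two GEMMs.

    q, k, v: [heads, seq, d_head] in bfloat16 so XLA can lower both
    einsum contractions to tensor-core matrix instructions.
    """
    scale = 1.0 / math.sqrt(q.shape[-1])
    scores = jnp.einsum("hqd,hkd->hqk", q, k) * scale              # GEMM 1: Q·Kᵀ
    weights = jax.nn.softmax(scores.astype(jnp.float32), axis=-1)  # softmax in fp32
    return jnp.einsum("hqk,hkd->hqd", weights.astype(q.dtype), v)  # GEMM 2: attn·V

qk, kk, vk = jax.random.split(jax.random.PRNGKey(0), 3)
q = jax.random.normal(qk, (8, 128, 64), dtype=jnp.bfloat16)
k = jax.random.normal(kk, (8, 128, 64), dtype=jnp.bfloat16)
v = jax.random.normal(vk, (8, 128, 64), dtype=jnp.bfloat16)
out = jax.jit(sdpa)(q, k, v)  # [8, 128, 64]
```

Production kernels such as FlashAttention fuse these two GEMMs with the softmax to avoid materializing the full seq×seq score matrix; the sketch above is the naive formulation.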
TPU (GOOD)
Google's TPU v4/v5 are built around systolic-array matrix units (MXUs) aimed at exactly these Transformer matrix operations. Several flagship models (PaLM, Gemini) were trained exclusively on TPUs.
XLA compilation on TPU requires static tensor shapes, and every new shape triggers a recompilation. This constrains dynamic padding and variable sequence lengths, which are typically handled by padding inputs to a small fixed set of bucket lengths.
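A hedged sketch of that bucketing pattern (the bucket sizes and the pad_to_bucket / masked_mean names are illustrative assumptions): each sequence is padded to the nearest bucket length and carries a mask, so jax.jit compiles at most one XLA program per bucket rather than one per sequence length.

```python
import jax
import jax.numpy as jnp
import numpy as np

BUCKETS = (128, 256, 512)  # illustrative fixed bucket lengths

def pad_to_bucket(tokens: np.ndarray):
    """Pad a 1-D token array to the smallest bucket length >= its length."""
    target = next(b for b in BUCKETS if b >= len(tokens))
    padded = np.zeros(target, dtype=tokens.dtype)
    padded[: len(tokens)] = tokens
    mask = np.arange(target) < len(tokens)  # True on real tokens, False on padding
    return padded, mask

@jax.jit
def masked_mean(emb, mask):
    # emb: [seq, d], mask: [seq]. Shapes are static within a bucket, so XLA
    # compiles at most len(BUCKETS) variants instead of one per sequence length.
    m = mask[:, None].astype(emb.dtype)
    return (emb * m).sum(axis=0) / m.sum()

tokens, mask = pad_to_bucket(np.arange(200, dtype=np.int32))  # pads 200 -> 256
emb = jnp.ones((tokens.shape[0], 16), dtype=jnp.float32)
pooled = masked_mean(emb, jnp.asarray(mask))  # mean over the 200 real positions
```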