GPU Tensor Cores (PRIMARY)
An MLLM consists of Transformer components (visual encoder, connector, LLM backbone), all of which rely on matrix multiplication (GEMM), the operation accelerated by GPU Tensor Cores (e.g., NVIDIA A100, H100). In practice, training and inference of MLLMs require GPUs with large HBM capacity (40–80 GB).
Training large MLLMs (>7B parameters) requires multi-GPU setups using tensor parallelism or pipeline parallelism; inference for 7–13B models is feasible on a single 24–40 GB GPU with 4-bit quantization.
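A back-of-envelope memory estimate makes these thresholds concrete. The sketch below uses common rule-of-thumb assumptions (mixed-precision Adam keeps roughly 16 bytes per parameter; inference counts weights only, ignoring activations and KV cache), which are illustrative rather than exact:

```python
# Rough per-parameter memory budgets (illustrative assumptions):
#   training with mixed-precision Adam: fp16 weight (2) + fp16 grad (2)
#     + fp32 master weight (4) + Adam m (4) + Adam v (4) = 16 bytes/param
#   inference: weights only (activations and KV cache ignored)
BYTES_PER_PARAM = {
    "train_adam_fp16": 16,
    "infer_fp16": 2,
    "infer_int4": 0.5,
}

def model_memory_gb(n_params: float, mode: str) -> float:
    """Approximate memory footprint in GB for a model of n_params parameters."""
    return n_params * BYTES_PER_PARAM[mode] / 1e9

for n, name in [(7e9, "7B"), (13e9, "13B")]:
    print(
        name,
        f"train≈{model_memory_gb(n, 'train_adam_fp16'):.0f} GB",
        f"fp16 infer≈{model_memory_gb(n, 'infer_fp16'):.1f} GB",
        f"int4 infer≈{model_memory_gb(n, 'infer_int4'):.1f} GB",
    )
```

Under these assumptions, training a 7B model already needs ~112 GB of state, which exceeds a single 80 GB GPU and motivates multi-GPU parallelism, while a 13B model quantized to 4 bits (~6.5 GB of weights) fits comfortably in 24 GB.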
TPU (GOOD)
TPU v4/v5 are used to train MLLMs at Google (e.g., Gemini). They offer high throughput for GEMM operations and efficient scaling via TPU Pods. TPU-friendly implementations require XLA-based frameworks such as JAX; Flamingo and Gemini were trained on TPUs.
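The JAX/XLA path mentioned above can be sketched minimally. This is an assumed example (requires `jax` installed); the same code runs on CPU, GPU, or TPU depending on the available backend, with XLA lowering the dot product to the backend's matrix unit:

```python
# Minimal JAX sketch of a jit-compiled GEMM, the core operation XLA maps
# onto TPU matrix units (MXUs). Shapes and dtypes here are arbitrary choices.
import jax
import jax.numpy as jnp

@jax.jit
def gemm(a, b):
    # XLA compiles this dot for the active backend (CPU/GPU/TPU).
    return jnp.dot(a, b)

key = jax.random.PRNGKey(0)
a = jax.random.normal(key, (128, 256), dtype=jnp.bfloat16)
b = jax.random.normal(key, (256, 64), dtype=jnp.bfloat16)
out = gemm(a, b)
print(out.shape)
```

Note the use of bfloat16, the native matrix-unit precision on TPUs; the same `jax.jit` function needs no code changes to scale out across a TPU Pod with JAX's sharding APIs.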