GPU (Tensor Cores): PRIMARY
Most production LLM and diffusion model deployments on Inference Endpoints use NVIDIA GPU instances (A10G, L4, A100, H100); GPU inference is effectively required for practical throughput on large transformer models.
Available GPU instances vary by cloud provider and region. Pricing starts at $0.50 per GPU-hour.
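As a minimal sketch of how the choice between these tiers might be automated, the helper below maps a model's parameter count to a hardware tier. The thresholds and the instance identifiers (`nvidia-a10g`, `nvidia-a100`) are illustrative assumptions based on the GPUs named above, not official sizing guidance.

```python
def pick_instance(param_count_b: float) -> str:
    """Map a model's parameter count (in billions) to a hardware tier.

    Thresholds are illustrative assumptions: small models run fine on
    CPU, mid-sized LLMs fit on a 24 GB A10G, larger ones need an A100.
    """
    if param_count_b < 1:
        return "cpu"          # classifiers, embedding models
    if param_count_b <= 13:
        return "nvidia-a10g"  # ~24 GB VRAM, fits 7-13B in fp16/int8
    return "nvidia-a100"      # larger LLMs need more VRAM


# Example: a 7B-parameter model lands on the mid-tier GPU.
print(pick_instance(7.0))
```

A real deployment would pass the chosen instance type to the Inference Endpoints API or UI; this sketch only captures the sizing decision itself.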
CPU (AVX): GOOD
CPU instances are supported and suitable for smaller models (classification, embeddings, and other NLP tasks under roughly 1B parameters) where GPU cost is not justified. Pricing starts at $0.032 per CPU core per hour.
CPU-based endpoints use Intel Xeon or comparable processors with AVX instructions for accelerated matrix operations.
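To make the cost trade-off above concrete, the sketch below compares hourly cost at the starting prices quoted in this document ($0.50 per GPU-hour, $0.032 per CPU core per hour). Actual prices vary by instance type, provider, and region; the function and its parameters are illustrative.

```python
# Starting prices quoted above; real prices vary by instance and region.
GPU_HOUR = 0.50        # USD per GPU per hour
CPU_CORE_HOUR = 0.032  # USD per CPU core per hour


def hourly_cost(gpus: int = 0, cpu_cores: int = 0) -> float:
    """Estimate the hourly cost of an endpoint at the quoted starting rates."""
    return gpus * GPU_HOUR + cpu_cores * CPU_CORE_HOUR


# An 8-core CPU endpoint costs roughly half as much per hour as one GPU,
# which is why CPU instances can make sense for sub-1B-parameter models.
print(f"1 GPU:       ${hourly_cost(gpus=1):.3f}/hr")
print(f"8 CPU cores: ${hourly_cost(cpu_cores=8):.3f}/hr")
```

The comparison only covers raw instance cost; throughput per dollar still favors GPUs heavily for large transformer models, as noted above.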
TPU: POSSIBLE
Google Cloud TPU v5e support for LLM inference (Gemma, Llama, Mistral) via Optimum TPU was introduced in 2024, but TPU availability on Inference Endpoints has since been suspended pending further updates.
Current TPU availability should be verified in the Hugging Face documentation.