An Inference Endpoint is built from a model hosted on the Hugging Face Hub. The user selects a model, task, cloud provider, region, instance type, and security settings, while Hugging Face manages the full lifecycle of the container running the model. The endpoint exposes an HTTP API and can be configured for autoscaling, scale-to-zero, private networking, analytics, and version updates. Integration with the huggingface_hub library also enables programmatic creation and management of endpoints.
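The programmatic path mentioned above can be sketched as follows. This is a minimal illustration, assuming the `create_inference_endpoint` function from `huggingface_hub`; the endpoint name, model repository, region, and instance values are illustrative placeholders, and the helper names `build_endpoint_config` and `deploy` are hypothetical. Calling `deploy` requires network access and a valid access token, so the sketch only builds the configuration by default.

```python
from typing import Any, Dict, Optional


def build_endpoint_config(
    name: str,
    repository: str,
    revision: Optional[str] = None,
) -> Dict[str, Any]:
    """Collect the settings the text lists: model, task, provider, region, instance, security."""
    config: Dict[str, Any] = {
        "name": name,
        "repository": repository,        # model repo on the Hub
        "framework": "pytorch",
        "task": "text-generation",
        "vendor": "aws",                 # cloud provider (illustrative)
        "region": "us-east-1",           # illustrative
        "accelerator": "cpu",
        "instance_size": "x2",           # illustrative
        "instance_type": "intel-icl",    # illustrative
        "type": "protected",             # security level
        "min_replica": 0,                # 0 enables scale-to-zero
        "max_replica": 1,
    }
    if revision is not None:
        config["revision"] = revision    # optionally pin a commit
    return config


def deploy(config: Dict[str, Any]) -> str:
    """Create the endpoint and wait until it is running (needs network and a token)."""
    from huggingface_hub import create_inference_endpoint  # imported lazily

    endpoint = create_inference_endpoint(**config)
    endpoint.wait()  # blocks until the endpoint is deployed
    return endpoint.url


config = build_endpoint_config("my-endpoint-name", "gpt2")
# url = deploy(config)  # uncomment to actually create a (billable) endpoint
```

Keeping the configuration separate from the API call makes it easy to review or version the settings before anything billable is created.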
Inference Endpoints address the complexity and cost of deploying AI models to production. Without such a service, teams must independently manage containers, environment configuration, scaling, security, monitoring, and infrastructure selection. Inference Endpoints simplify this process by providing a ready-made, managed inference layer for models from the Hub, enabling teams to deploy models faster and serve production traffic without building their own serving platform from scratch.
The trained model parameters and associated files, stored and versioned in a Hugging Face Hub repository. They are downloaded securely at endpoint startup and loaded by the inference engine. Users can optionally pin a specific commit revision.
The software runtime that loads the model and handles inference requests. Packaged as a Docker container. Hugging Face provides prebuilt containers for popular engines: Text Generation Inference (TGI), vLLM, SGLang, and Text Embeddings Inference (TEI). Users can also specify a custom Docker image.
Monitors hardware utilization (CPU or GPU, default threshold 80% over 1 minute) or pending request count and adds or removes replicas accordingly. Scaling up occurs every minute, scaling down every 2 minutes with a 300-second stabilization window. Scale-to-zero reduces replicas to 0 after a configurable inactivity period (default 1 hour), with a cold-start delay on the next request.
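The scaling behavior described above can be illustrated with a toy decision function. This is a simplified sketch, not the service's actual algorithm: it assumes one replica is added or removed per evaluation, treats any utilization at or below the threshold as eligible for scale-down, and models the stabilization window as "no scale-down until 300 seconds have passed since the last high-load reading". The names `ScalerState` and `next_replica_count` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ScalerState:
    replicas: int
    last_high_load_at: float = 0.0  # time (s) of the last above-threshold reading


def next_replica_count(
    state: ScalerState,
    avg_utilization: float,
    now: float,
    min_replicas: int = 1,
    max_replicas: int = 4,
    threshold: float = 0.80,        # default threshold from the text
    stabilization_s: float = 300.0,  # stabilization window from the text
) -> int:
    """Return the replica count after one autoscaler evaluation (toy model)."""
    if avg_utilization > threshold:
        # High load: remember when it happened and add one replica.
        state.last_high_load_at = now
        state.replicas = min(state.replicas + 1, max_replicas)
    elif now - state.last_high_load_at >= stabilization_s:
        # Load has stayed low for the whole window: remove one replica.
        state.replicas = max(state.replicas - 1, min_replicas)
    return state.replicas


state = ScalerState(replicas=1)
readings = [(0, 0.95), (60, 0.90), (120, 0.30), (360, 0.30), (480, 0.30), (600, 0.30)]
history = [next_replica_count(state, util, now=t) for t, util in readings]
print(history)
```

In this run the replica count rises while utilization is above 80%, holds during the stabilization window, and then steps back down to the minimum.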
Enforces one of three access levels: Public (no authentication required), Protected (requires a valid Hugging Face access token), or Private (accessible only via a private VPC connection using AWS PrivateLink or the Azure equivalent, and not exposed to the public internet). Data in transit is encrypted with TLS/SSL.
The URL exposed to clients for inference requests. Accepts HTTP requests with payloads in supported content types (application/json, image/*, audio/*, text/*). The API format follows the inference engine's conventions (e.g., OpenAI-compatible chat completions for TGI/vLLM LLM endpoints).
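A request to a Protected endpoint might be constructed as in the sketch below, using only the Python standard library. The URL and token are hypothetical placeholders, the helper name `build_inference_request` is invented for illustration, and the `{"inputs": ...}` payload shape is an assumption; the actual body must follow the conventions of the engine serving the model.

```python
import json
import urllib.request
from typing import Optional


def build_inference_request(
    endpoint_url: str,
    payload: dict,
    token: Optional[str] = None,
) -> urllib.request.Request:
    """Build a JSON POST request; attach a bearer token for Protected endpoints."""
    headers = {"Content-Type": "application/json"}
    if token is not None:
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        endpoint_url,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


# Hypothetical URL and token, for illustration only.
request = build_inference_request(
    "https://example.invalid/my-endpoint",
    {"inputs": "Inference Endpoints are"},
    token="hf_dummy_token",
)
# To send it: urllib.request.urlopen(request).read()
```

A Public endpoint would use the same request without the `Authorization` header.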
When an endpoint is configured with scale-to-zero and receives a request after the inactivity period, it must restart from 0 replicas. During initialization, the proxy returns HTTP 503. Cold start time depends on model size and instance type and can range from tens of seconds to several minutes for large LLMs.
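Clients can absorb a cold start by retrying while the proxy returns 503. The following is a minimal sketch of that pattern, not an official client: `send` is a hypothetical callable that performs one request and returns a status code and body, and the backoff values and five-minute budget are assumptions chosen for illustration.

```python
import time
from typing import Callable, Tuple


def call_with_cold_start_retry(
    send: Callable[[], Tuple[int, bytes]],
    max_wait_s: float = 300.0,
    initial_delay_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Retry while the endpoint answers 503, doubling the delay up to 30 s."""
    delay = initial_delay_s
    waited = 0.0
    while True:
        status, body = send()
        if status != 503:
            if status >= 400:
                raise RuntimeError(f"request failed with HTTP {status}")
            return body
        if waited + delay > max_wait_s:
            raise TimeoutError("endpoint did not become ready in time")
        sleep(delay)
        waited += delay
        delay = min(delay * 2, 30.0)


# Simulated endpoint: two 503 responses during initialization, then success.
responses = iter([(503, b""), (503, b""), (200, b"ok")])
sleeps = []
result = call_with_cold_start_retry(lambda: next(responses), sleep=sleeps.append)
print(result, sleeps)
```

Injecting `sleep` keeps the sketch testable without real waiting; production code would leave the default `time.sleep`.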
The autoscaler evaluates CPU/GPU utilization once per minute before adding replicas, and scale-down is subject to a 300-second stabilization window. Because new replicas take time to start, sudden traffic spikes may cause temporary request queuing or errors before they become available.
Deploying a large model on an instance with insufficient VRAM or RAM causes the container to fail to load the model. Conversely, over-provisioning an unnecessarily large GPU instance increases cost without performance benefit.
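A rough sizing check can catch an undersized instance before deployment. The sketch below uses a common rule of thumb (parameter count times bytes per parameter) and is only an approximation: the 20% overhead for activations and cache is an assumed figure, real requirements depend on the engine, batch size, and context length, and the function names are hypothetical.

```python
def estimate_weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone, in GB (2 bytes per parameter for fp16/bf16)."""
    return num_params * bytes_per_param / 1e9


def fits_in_memory(
    num_params: float,
    available_gb: float,
    bytes_per_param: int = 2,
    overhead_fraction: float = 0.2,  # assumed headroom, not a measured value
) -> bool:
    """Return True if weights plus the assumed overhead fit in the given memory."""
    needed = estimate_weight_memory_gb(num_params, bytes_per_param) * (1 + overhead_fraction)
    return needed <= available_gb


print(estimate_weight_memory_gb(7e9))          # 7B parameters in 16-bit precision
print(fits_in_memory(7e9, available_gb=24))    # e.g. a 24 GB GPU
print(fits_in_memory(70e9, available_gb=24))
```

Under these assumptions a 7B-parameter model in 16-bit precision needs about 14 GB for weights and fits on a 24 GB GPU, while a 70B-parameter model does not.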
If the auto-detected task or container type is incorrect for the deployed model, the endpoint may start but return incorrect results or fail requests. Custom models or non-standard architectures may not be auto-recognized.
Hugging Face launched Inference Endpoints in October 2022 as a dedicated, managed inference serving product, replacing the paid tier of the Serverless Inference API. It initially supported AWS and Azure, with CPU and GPU instances.
Hugging Face added Google Cloud TPU v5e support to Inference Endpoints in partnership with Google Cloud, enabling cost-effective inference for LLMs including Gemma, Llama, and Mistral via Optimum TPU and TGI.
Inference Endpoints extended built-in support to multiple open-source inference engines (vLLM, SGLang, Text Embeddings Inference), in addition to TGI, allowing users to choose the engine best suited to their model and workload.
Inference Endpoints are not a model architecture but a managed inference layer. Their computational characteristics depend on the selected hardware, the number of replicas, the autoscaling settings, and the type of model being served.
Inference Endpoints is not a benchmark or modeling technique. Evaluation focuses on operational parameters such as latency, throughput, cost, availability, autoscaling effectiveness, and ease of deploying models to production.
The cloud infrastructure provider on which the endpoint runs. Determines available instance types and supported regions.
The compute hardware used to run the endpoint. Options include CPU instances and GPU instances of various sizes (e.g., NVIDIA A10G, L4, A100). Determines throughput, latency, and cost.
Defines the lower and upper bounds for the autoscaler. Setting min replicas to 0 enables scale-to-zero; min replicas ≥ 2 is recommended for high-availability production workloads.
The Docker container or inference engine used to serve the model. Auto-selected by Hugging Face based on model type; can be overridden.
Controls network access and authentication requirements for the endpoint.
Each inference request is processed by a full model forward pass on the assigned replica. There is no conditional routing or sparse activation at the serving layer.
Individual inference requests are independent HTTP calls, each handled by a single replica. Multiple replicas run in parallel to handle concurrent traffic. Within each replica, the inference engine may use batching and parallel GPU execution.
Most production LLM and diffusion model deployments on Inference Endpoints use NVIDIA GPU instances (A10G, L4, A100, H100). GPU inference is required for practical throughput on large transformer models.
CPU instances are supported and suitable for smaller models (classification, embeddings, NLP tasks under ~1B parameters) where GPU cost is not justified. Pricing starts from $0.032 per CPU core/hr.
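The per-core price translates into a monthly figure with simple arithmetic. The sketch below assumes an always-on endpoint (no scale-to-zero) and an average month of 730 hours; the function name is hypothetical, and actual billing depends on the instance chosen and the time the endpoint is actually running.

```python
def monthly_cpu_cost_usd(
    cores: int,
    replicas: int = 1,
    price_per_core_hour: float = 0.032,  # starting price quoted in the text
    hours_per_month: float = 730.0,      # assumed average month, always on
) -> float:
    """Estimate the monthly cost of an always-on CPU endpoint."""
    return cores * replicas * price_per_core_hour * hours_per_month


cost = monthly_cpu_cost_usd(cores=2)
print(f"${cost:.2f} per month")
```

For a single 2-core replica this comes to roughly $46.72 per month; scale-to-zero would reduce it in proportion to idle time.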
Google Cloud TPU v5e support was added in 2024 for LLM inference (Gemma, Llama, Mistral) via Optimum TPU. As of 2024, TPU availability on Inference Endpoints has been suspended pending further updates.