An application client calls the gateway instead of the model provider directly, usually through an OpenAI-compatible interface. The gateway authenticates the request with a virtual key bound to a team or application and checks budget and limits. It then selects the target model and provider per its routing configuration (preferred model, A/B test, cost, availability). It checks the cache (deterministic keys or semantic embedding match) and returns the stored result on a hit. On a miss, it issues the upstream call with retry and timeout; on error or rate-limit, it switches to a fallback. Optionally it runs input guardrails (prompt filters, PII redaction) and output guardrails (content filters, schema validation). It logs the request and response with token counts, cost, latency and a trace ID, exporting metrics and traces to observability systems.
Applications consuming AI models must integrate with many providers with different SDKs, request formats, auth schemes, rate-limit policies and cost models. Without a gateway, this logic โ together with retries, fallback, caching, rate limiting, PII redaction and observability โ is duplicated in every application, and cost and security control is scattered.
Selects the target model and provider for a request based on rules (preference, cost, availability, A/B, load balancing).
Translates the unified request (usually OpenAI-compatible) into the target provider's native format, and the response back.
Official
Authenticates the client with a virtual key and maps it to real provider keys, including budget and permissions.
Stores LLM responses keyed deterministically (prompt and parameter hash) or semantically (embedding comparison).
Official
Enforces per-key/team limits and switches the request to an alternative model or provider on error or quota breach.
Input and output filters: PII redaction, prompt blocks, response schema validation, content filtering.
Official
Collects logs, metrics (tokens, cost, latency, errors) and distributed traces for every request.
Automatically switching to a weaker model on primary provider error can silently lower response quality.
Too low a similarity threshold causes the semantic cache to hit on different intents and return misleading answers.
A gateway outage halts all AI traffic in the organization, even when upstream providers are healthy.
Full request/response logging without redaction can persist PII, secrets and customer IP.
Each hop adds latency; a gateway in a different region than the provider can noticeably worsen time-to-first-token.
Classic API Gateway popularized in microservice architectures as a single entry point with auth, rate limiting and routing.
Cloudflare launches AI Gateway (September 2023) as a proxy for LLM calls with analytics, caching and rate limiting; an AI-dedicated gateway becomes a product.
LiteLLM (BerriAI) and Portkey popularize an OpenAI-compatible proxy to many providers with fallback, virtual keys and caching.
Kong adds native AI plugins (ai-proxy, ai-prompt-guard, ai-rate-limiting), bringing LLM logic into mature L7 gateways.
How a model/provider is chosen: pinned, weighted, cost-based, latency-based, or a fallback chain.
No cache, deterministic cache (hash) or semantic cache (embeddings with similarity threshold).
Limits per virtual key, team, model โ in requests/minute and tokens/minute, plus daily/monthly budgets.
Chain of alternative models/providers triggered by errors, timeouts or rate-limit breaches.
Enabled input/output filters: PII redaction, prompt blockers, content filters, schema validation.
Per-request routing based on policy (preferred model, cost, latency, availability, A/B, fallback).
Stateless proxy โ instances scale horizontally; the bottleneck is upstream provider limits and the shared cache.
The gateway is a lightweight, I/O-bound proxy; it runs on commodity CPUs without accelerators.