1. Introduction — Why Use an LLM Gateway?
The LLM Gateway is a self-hosted, intelligent proxy that sits between your applications and LLM providers. Instead of scattering provider-specific code across your codebase, you route all requests through a single endpoint and let the gateway handle the rest.
Single API, many providers
One endpoint for OpenAI, Anthropic, Google, DeepInfra, Groq, Azure OpenAI, and local models. Your application sends an OpenAI-format or Anthropic-format request; the gateway translates and routes it to any configured provider. Switching providers requires zero code changes.
Automatic failover
If a provider goes down, requests transparently route to the next-best model in the failover chain. Circuit breakers detect outages and stop sending traffic to unhealthy providers. Retry with decorrelated jitter handles transient errors. Your users see an error only when every candidate in the chain has failed.
Cost optimization
Intelligent routing considers cost, quality, and latency to pick the best model for each request. Approximate token counting enforces budget ceilings before requests are sent. Response caching avoids duplicate LLM calls. Cost is tracked per request with full visibility.
Security & compliance
Prompt sanitization strips PII and credentials before they reach providers. An output compliance gate scans every response for leaked secrets and sensitive data. Data classification enforcement ensures classified content only reaches authorized providers. The gateway itself runs on your infrastructure, so no additional intermediary sits between your applications and the providers you configure.
Observability
Prometheus-compatible metrics at /metrics, structured JSON logs with one line per request, and full routing audit trails. You always know which provider handled each request, how long it took, and what it cost.
Rate limiting & quotas
Per-tenant RPM/TPM/RPD/concurrent limits protect against abuse. Provider quota tracking avoids 429 storms by deprioritizing providers at 90% utilization and blocking at 100%.
Format translation
Send an OpenAI-format request and have it routed to Anthropic (or vice versa) with automatic format translation — including streaming SSE events, tool calls with ID preservation, and image content.
Model management
Model aliases let you use friendly names like fast or best. Deprecation handling issues warnings before sunset and auto-redirects to replacements after. Conversation-sticky routing keeps multi-turn conversations on the same model.
Self-hosted
Runs on your infrastructure. No vendor lock-in. MIT licensed. A single binary with no runtime dependencies beyond an optional PostgreSQL database.
2. Installation
A) Binary Download (recommended)
Download the latest release for your platform using the GitHub CLI. These commands always fetch the most recent version.
Windows (PowerShell)
gh release download --repo quantum-intelligence-group/llm-gateway --pattern 'llm-gateway-windows-x64.zip' --dir $env:TEMP
Expand-Archive "$env:TEMP\llm-gateway-windows-x64.zip" -DestinationPath "$env:USERPROFILE\.llm-gateway" -Force
$env:PATH += ";$env:USERPROFILE\.llm-gateway"
macOS (ARM / Apple Silicon)
gh release download --repo quantum-intelligence-group/llm-gateway --pattern 'llm-gateway-macos-arm64.tar.gz' --dir /tmp
tar -xzf /tmp/llm-gateway-macos-arm64.tar.gz -C /usr/local/bin
chmod +x /usr/local/bin/llm-gateway
Linux (x64)
gh release download --repo quantum-intelligence-group/llm-gateway --pattern 'llm-gateway-linux-x64.tar.gz' --dir /tmp
tar -xzf /tmp/llm-gateway-linux-x64.tar.gz -C /usr/local/bin
chmod +x /usr/local/bin/llm-gateway
B) Docker
Single container
docker pull ghcr.io/quantum-intelligence-group/llm-gateway:latest
docker run -p 8930:8930 -v ./config.toml:/etc/llm-gateway/config.toml ghcr.io/quantum-intelligence-group/llm-gateway:latest
Docker Compose (with Postgres)
docker compose up
The repo includes a docker-compose.yml that builds from source and provisions Postgres.
Note: set server.host = "0.0.0.0" in your config.toml so the gateway binds to all interfaces and is reachable through the port mapping. The default 127.0.0.1 only listens inside the container.
C) Build from Source
Prerequisites
- Rust 1.90+ — install from https://rustup.rs
- A working C linker (MSVC toolchain on Windows, Xcode CLI tools on macOS, build-essential on Linux)
git clone https://github.com/quantum-intelligence-group/llm-gateway.git
cd llm-gateway
cargo build --release
The compiled binary will be at target/release/llm-gateway (or target\release\llm-gateway.exe on Windows).
On Windows (PowerShell), set $env:CARGO_INCREMENTAL = "0" and optionally redirect the target directory to local disk with $env:CARGO_TARGET_DIR = "C:\temp\llm-gateway-target".
Updating
Re-run the same download command — it always fetches the latest release. Or use the built-in self-update:
# Check for updates without applying
llm-gateway update --check
# Download and apply the latest release
llm-gateway update
You can enable automatic update checks at startup in your config:
[update]
mode = "auto" # "auto" checks on startup; "manual" (default) requires explicit command
After an update is applied, restart the gateway process to use the new version.
Quick Start
After building from source, follow these steps to get the gateway running.
1. Create a minimal config
Copy the example config and add at least one provider API key:
cp config.example.toml config.toml
Edit config.toml and add your provider credentials under [[providers]]. At minimum you need one provider:
[server]
host = "127.0.0.1"
port = 8930
[[providers]]
id = "openai"
provider_type = "openai"
api_key = "sk-..."
2. Start the gateway
# Start with config.toml in the current directory
llm-gateway
# Or specify a config file and port explicitly
llm-gateway --config /path/to/config.toml --port 9000
On success, the gateway logs the listen address and configured providers to the console.
3. Verify it's running
curl http://localhost:8930/health
Expected response:
{
"status": "ok",
"version": "0.3.1",
"uptime_secs": 5,
"port": 8930,
"providers": ["openai"]
}
4. Send your first request
curl http://localhost:8930/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o",
"messages": [{"role": "user", "content": "Hello!"}]
}'
The gateway proxies the request to the configured provider, applies routing and compliance rules, and returns the response in the same format. See the API Reference for the full list of endpoints and headers.
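The same request can be issued from application code. The following is a minimal sketch using only the Python standard library; it assumes the default address from the Quick Start and uses the x-routing-profile and x-data-classification headers described later in this guide. The helper names build_chat_request and send are hypothetical.

```python
import json
import urllib.request

# Default address from the Quick Start; adjust to your [server] settings.
GATEWAY_URL = "http://localhost:8930/v1/chat/completions"


def build_chat_request(prompt, model="gpt-4o", profile=None, classification=None):
    """Build (but do not send) an OpenAI-format chat request for the gateway."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    headers = {"Content-Type": "application/json"}
    if profile:
        headers["x-routing-profile"] = profile
    if classification:
        headers["x-data-classification"] = classification
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(GATEWAY_URL, data=data, headers=headers, method="POST")


def send(request, timeout=30):
    """Send the request and decode the JSON response (needs a running gateway)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)


req = build_chat_request("Hello!", profile="interactive", classification="internal")
```

Calling send(req) against a running gateway returns the parsed response body.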
3. Configuration
Configuration is loaded from a TOML file (default: config.toml). Every section is optional and has sensible defaults, although the gateway needs at least one [[providers]] entry to serve requests. You can override the config path with --config <PATH> and the port with --port <PORT>.
[server]
| Key | Type | Default | Description |
|---|---|---|---|
| host | string | "127.0.0.1" | Bind address. Use "0.0.0.0" for Docker or external access. |
| port | u16 | 8930 | Listen port. |
[database]
| Key | Type | Default | Description |
|---|---|---|---|
| url | string? | none | PostgreSQL connection URL. If omitted, records are stored in-memory only. |
| max_connections | u32 | 5 | Maximum database connection pool size. |
| store_bodies | bool | true | Whether to persist full request/response bodies. |
[update]
| Key | Type | Default | Description |
|---|---|---|---|
| mode | string | "manual" | "auto" checks for updates on startup. "manual" requires the llm-gateway update command. |
[sanitization]
| Key | Type | Default | Description |
|---|---|---|---|
| enabled | bool | false | Enable the input sanitization pipeline. |
| scan_credentials | bool | true | Scan prompts for API keys, tokens, and connection strings. |
| redact_pii | bool | false | Redact personally-identifiable information (emails, phone numbers, SSNs). |
| compliance_gate | bool | false | Enable the output compliance gate that scans responses. |
| max_classification | string? | none | Maximum allowed data classification level (public, internal, cui, itar, classified). |
[resilience]
[resilience.default_retry_policy]
| Key | Type | Default | Description |
|---|---|---|---|
| max_attempts | u32 | 3 | Maximum number of attempts (including the initial request). |
| base_delay_ms | u64 | 200 | Base delay in milliseconds for the first retry. |
| max_delay_ms | u64 | 5000 | Maximum delay cap between retries. |
| jitter | bool | true | Use decorrelated jitter (AWS-style) to spread retry load. |
| retryable_statuses | [u16] | [408, 429, 500, 502, 503, 504] | HTTP status codes that trigger a retry. |
| honor_retry_after | bool | true | Respect the Retry-After header from providers. |
[resilience.timeout]
| Key | Type | Default | Description |
|---|---|---|---|
| connect_ms | u64 | 5000 | TCP connection timeout in milliseconds. |
| default_total_ms | u64 | 300000 | Total request timeout (5 minutes). Override per-request via x-gateway-timeout. |
[resilience.default_circuit_breaker]
| Key | Type | Default | Description |
|---|---|---|---|
| failure_threshold | u32 | 5 | Failures within the window that trigger the circuit to open. |
| failure_window_secs | u64 | 60 | Sliding window for counting failures. |
| open_duration_secs | u64 | 30 | How long the circuit stays open before transitioning to half-open. |
| half_open_max_concurrent | u32 | 1 | Number of probe requests allowed in half-open state. |
[resilience.hedge]
| Key | Type | Default | Description |
|---|---|---|---|
| enabled | bool | false | Enable hedged requests for the RealTime routing profile. |
| hedge_delay_ms | u64 | 300 | Delay before firing the hedged (second) request. |
[cache]
| Key | Type | Default | Description |
|---|---|---|---|
| enabled | bool | false | Enable exact-match response caching. |
| max_entries | usize | 1000 | Maximum number of cached responses (LRU eviction). |
| ttl_secs | u64 | 3600 | Time-to-live for cached entries in seconds. |
| temperature_threshold | f64 | 0.3 | Only cache responses when temperature <= threshold. Higher temperatures produce non-deterministic output. |
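To make the interaction of these four settings concrete, here is a minimal sketch of an exact-match cache with LRU eviction, a TTL, and a temperature gate. It is not the gateway's implementation: the class name ResponseCache, the SHA-256 cache key, and treating a missing temperature as 1.0 are all assumptions made for illustration.

```python
import hashlib
import json
import time
from collections import OrderedDict


class ResponseCache:
    def __init__(self, max_entries=1000, ttl_secs=3600,
                 temperature_threshold=0.3, clock=time.monotonic):
        self.max_entries = max_entries
        self.ttl_secs = ttl_secs
        self.temperature_threshold = temperature_threshold
        self.clock = clock
        self._entries = OrderedDict()  # key -> (stored_at, response)

    @staticmethod
    def key_for(request):
        # Exact match: hash a canonical JSON encoding of the whole request.
        payload = json.dumps(request, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def cacheable(self, request):
        # Assumption: a request without a temperature is treated as 1.0.
        return request.get("temperature", 1.0) <= self.temperature_threshold

    def get(self, request):
        key = self.key_for(request)
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, response = entry
        if self.clock() - stored_at > self.ttl_secs:
            del self._entries[key]  # entry has expired
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return response

    def put(self, request, response):
        if not self.cacheable(request):
            return False
        key = self.key_for(request)
        self._entries[key] = (self.clock(), response)
        self._entries.move_to_end(key)
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict least recently used
        return True
```

Injecting the clock keeps the TTL behavior easy to test without sleeping.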
[rate_limit]
| Key | Type | Default | Description |
|---|---|---|---|
| enabled | bool | false | Enable per-tenant rate limiting. |
| default_limits.rpm | u32 | 60 | Requests per minute. |
| default_limits.tpm | u32 | 100000 | Tokens per minute. |
| default_limits.rpd | u32 | 10000 | Requests per day. |
| default_limits.concurrent | u32 | 10 | Maximum concurrent in-flight requests. |
Per-tenant overrides use the [rate_limit.tenant_limits.<tenant_id>] section with the same fields.
[provider_quota]
| Key | Type | Default | Description |
|---|---|---|---|
| enabled | bool | false | Enable provider-level quota tracking. |
| deprioritize_threshold | f64 | 0.9 | Fraction of quota at which the provider is deprioritized in routing (0.9 = 90%). |
| provider_limits.<id>.rpm | u32 | 1000 | Per-provider requests per minute limit. |
| provider_limits.<id>.tpm | u32 | 1000000 | Per-provider tokens per minute limit. |
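The threshold logic can be summarized in a few lines. This sketch assumes utilization is simply usage divided by the configured limit within the current window; the function name quota_state and the returned labels are hypothetical.

```python
def quota_state(used, limit, deprioritize_threshold=0.9):
    """Classify a provider by how much of its quota is consumed."""
    utilization = used / limit
    if utilization >= 1.0:
        return "blocked"        # quota exhausted: stop routing to this provider
    if utilization >= deprioritize_threshold:
        return "deprioritized"  # still usable, but ranked below healthier providers
    return "ok"
```

With the default rpm limit of 1000, a provider becomes deprioritized at 900 requests in the minute and blocked at 1000.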
[model_routing]
| Key | Type | Default | Description |
|---|---|---|---|
| sticky_ttl_secs | u64 | 900 | How long (in seconds) a conversation sticks to the same model. Conversations are identified by the x-gateway-conversation-id request header. |
[model_routing.aliases]
A key-value map of alias names to real model IDs:
[model_routing.aliases]
fast = "groq/llama-3.3-70b"
best = "anthropic/claude-opus-4"
code = "anthropic/claude-sonnet-4"
[[model_routing.deprecations]]
A list of deprecated models with sunset dates and optional replacements:
[[model_routing.deprecations]]
model = "anthropic/claude-3-opus"
sunset_date = "2026-06-01"
replacement = "anthropic/claude-opus-4"
Before the sunset date, the gateway adds an x-gateway-deprecation response header as a warning. After the sunset date, requests are automatically redirected to the replacement model.
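The alias and deprecation rules can be read as a small resolution function. The sketch below is illustrative only: it assumes aliases are expanded before deprecations are checked, that the redirect starts on the sunset date itself, and that the header value is free-form text. None of these details are confirmed by this guide.

```python
from datetime import date

ALIASES = {"fast": "groq/llama-3.3-70b", "best": "anthropic/claude-opus-4"}
DEPRECATIONS = {
    "anthropic/claude-3-opus": {
        "sunset_date": date(2026, 6, 1),
        "replacement": "anthropic/claude-opus-4",
    },
}


def resolve_model(requested, today):
    """Return (model_to_use, extra_response_headers)."""
    model = ALIASES.get(requested, requested)  # assumption: aliases resolve first
    entry = DEPRECATIONS.get(model)
    if entry is None:
        return model, {}
    if today < entry["sunset_date"]:
        # Before sunset: serve the requested model and attach a warning header.
        # The header value format here is hypothetical.
        warning = f"{model} sunsets on {entry['sunset_date'].isoformat()}"
        return model, {"x-gateway-deprecation": warning}
    # On or after sunset: redirect if a replacement is configured.
    return entry.get("replacement") or model, {}
```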
[[providers]]
| Key | Type | Required | Description |
|---|---|---|---|
| id | string | yes | Unique identifier for this provider instance (e.g., "anthropic", "openai"). |
| provider_type | string | yes | One of: anthropic, openai, google, deep_infra, groq, azure_openai, local. |
| api_key | string? | no | API key for the provider. Not needed for local models. |
| endpoint | string? | no | Override the provider's default API endpoint. |
| deployment | string? | no | Azure OpenAI deployment name. |
| api_version | string? | no | Azure OpenAI API version. |
| enabled | bool | no | Default: true. Set to false to disable without removing. |
| retry_policy | object? | no | Override the default retry policy for this provider. |
| circuit_breaker_config | object? | no | Override the default circuit breaker for this provider. |
Default endpoints by provider type:
| Provider Type | Default Endpoint |
|---|---|
| anthropic | https://api.anthropic.com |
| openai | https://api.openai.com/v1 |
| google | https://generativelanguage.googleapis.com/v1beta/openai |
| deep_infra | https://api.deepinfra.com/v1/openai |
| groq | https://api.groq.com/openai/v1 |
| azure_openai | Must be specified (includes deployment) |
| local | Must be specified |
[key_map.*]
Virtual API key mapping for multi-tenant setups. Each entry maps a virtual key to provider-specific keys:
[key_map.team-alpha-key]
anthropic_key = "sk-ant-..."
openai_key = "sk-..."
label = "Team Alpha"
[key_map.team-beta-key]
anthropic_key = "sk-ant-..."
openai_key = "sk-..."
label = "Team Beta"
Clients authenticate with the virtual key in the Authorization header. The gateway resolves the correct provider key at proxy time and uses the virtual key as the tenant ID for rate limiting.
Full Example Configuration
[server]
host = "127.0.0.1"
port = 8930
[database]
url = "postgres://llm_gateway:llm_gateway@localhost/llm_gateway"
max_connections = 5
store_bodies = true
[update]
mode = "manual"
[sanitization]
enabled = true
scan_credentials = true
redact_pii = true
compliance_gate = true
max_classification = "cui"
[resilience.default_retry_policy]
max_attempts = 3
base_delay_ms = 200
max_delay_ms = 5000
jitter = true
retryable_statuses = [408, 429, 500, 502, 503, 504]
honor_retry_after = true
[resilience.timeout]
connect_ms = 5000
default_total_ms = 300000
[resilience.default_circuit_breaker]
failure_threshold = 5
failure_window_secs = 60
open_duration_secs = 30
half_open_max_concurrent = 1
[resilience.hedge]
enabled = true
hedge_delay_ms = 300
[cache]
enabled = true
max_entries = 1000
ttl_secs = 3600
temperature_threshold = 0.3
[rate_limit]
enabled = true
[rate_limit.default_limits]
rpm = 60
tpm = 100000
rpd = 10000
concurrent = 10
[rate_limit.tenant_limits.premium-team]
rpm = 300
tpm = 500000
rpd = 50000
concurrent = 25
[provider_quota]
enabled = true
deprioritize_threshold = 0.9
[provider_quota.provider_limits.openai]
rpm = 500
tpm = 2000000
[provider_quota.provider_limits.anthropic]
rpm = 400
tpm = 1000000
[model_routing]
sticky_ttl_secs = 900
[model_routing.aliases]
fast = "groq/llama-3.3-70b"
best = "anthropic/claude-opus-4"
code = "anthropic/claude-sonnet-4"
[[model_routing.deprecations]]
model = "anthropic/claude-3-opus"
sunset_date = "2026-06-01"
replacement = "anthropic/claude-opus-4"
[[providers]]
id = "anthropic"
provider_type = "anthropic"
api_key = "sk-ant-your-key-here"
[[providers]]
id = "openai"
provider_type = "openai"
api_key = "sk-your-key-here"
[[providers]]
id = "google"
provider_type = "google"
api_key = "AIza-your-key-here"
[[providers]]
id = "groq"
provider_type = "groq"
api_key = "gsk_your-key-here"
[[providers]]
id = "deepinfra"
provider_type = "deep_infra"
api_key = "your-deepinfra-key"
[[providers]]
id = "local-ollama"
provider_type = "local"
endpoint = "http://localhost:11434/v1"
[key_map.team-alpha-key]
anthropic_key = "sk-ant-alpha-key"
openai_key = "sk-alpha-key"
label = "Team Alpha"
4. Routing Engine
The routing engine selects the best model for each request by running a 5-layer pipeline. Each layer narrows the candidate pool until a ranked list of up to 3 models is returned as a failover chain.
5-Layer Routing Pipeline
| Layer | Name | Purpose |
|---|---|---|
| 1 | Data Classification Filter | Enforces the x-data-classification header. Models whose authorized classification levels do not meet the request's classification are eliminated. Classification levels from lowest to highest: public, internal, cui, itar, classified. |
| 2 | Capability Filter | Removes models that cannot handle the request's requirements. Checks include: tool use, vision, JSON mode, extended thinking, streaming, and any required capabilities specified by the routing profile. Prevents silent downgrades. |
| 3 | Quality Scoring | Weights surviving models by quality signals (coding score, reasoning score, instruction following, arena ELO) according to the profile's quality weight (low=0.1, medium=0.3, high=0.6, very_high=0.9). |
| 4 | Cost Scoring | Weights models by price per token. Lower-cost models score higher. The profile's cost weight controls the tradeoff between cost and quality. |
| 5 | Latency Filter | Applies TTFT (Time to First Token) thresholds based on the profile's latency tolerance. realtime=500ms, interactive=2000ms, background and flexible=no limit. Models with median TTFT above the threshold are eliminated. |
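The five layers in the table can be sketched as a chain of filters followed by a sort. This is a simplified illustration, not the gateway's code: the Model fields, the normalized 0 to 1 quality and cost values, and the numeric cost_weight are assumptions, and the score uses only the quality and cost terms.

```python
from dataclasses import dataclass

LEVELS = ["public", "internal", "cui", "itar", "classified"]  # lowest to highest
QUALITY_WEIGHTS = {"low": 0.1, "medium": 0.3, "high": 0.6, "very_high": 0.9}


@dataclass(frozen=True)
class Model:
    name: str
    max_classification: str  # highest level the model is authorized for
    capabilities: frozenset
    quality: float           # assumed normalized to 0..1
    cost: float              # assumed normalized to 0..1 (higher = pricier)
    median_ttft_ms: int


def route(models, classification, required, quality_weight, cost_weight,
          ttft_limit_ms=None):
    """Return the names of up to 3 candidates, best first."""
    need = LEVELS.index(classification)
    required = frozenset(required)
    # Layer 1: data classification filter.
    pool = [m for m in models if LEVELS.index(m.max_classification) >= need]
    # Layer 2: capability filter (every required capability must be present).
    pool = [m for m in pool if required <= m.capabilities]
    # Layer 5: latency filter (None means no TTFT limit).
    if ttft_limit_ms is not None:
        pool = [m for m in pool if m.median_ttft_ms <= ttft_limit_ms]

    # Layers 3 and 4: reward quality, penalize cost.
    def score(m):
        return m.quality * QUALITY_WEIGHTS[quality_weight] - m.cost * cost_weight

    ranked = sorted(pool, key=score, reverse=True)
    return [m.name for m in ranked[:3]]
```

Because filtering does not change the relative order of the survivors, applying the latency filter before or after scoring yields the same chain.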
Routing Profiles
Routing profiles control how the pipeline weights quality, cost, and latency. Select a profile per-request with the x-routing-profile header, or let the gateway use the default (balanced).
| Profile | Quality | Cost | Latency | Best For |
|---|---|---|---|---|
| realtime | Medium | Low | Realtime (≤500ms TTFT) | Autocomplete, inline suggestions, chat with sub-second response |
| interactive | High | Medium | Interactive (≤2s TTFT) | Standard chat, Q&A, moderate-complexity tasks |
| batch | Medium | High | Background (no limit) | Bulk processing, data extraction, overnight jobs |
| cost_optimized | Low | Very High | Flexible | High-volume, cost-sensitive workloads |
| quality_optimized | Very High | Low | Background | Complex reasoning, legal/medical analysis, critical outputs |
| balanced | High | Medium | Interactive | General-purpose default |
| reasoning | Very High | Low | Background | Math, logic, multi-step reasoning (prefers extended thinking) |
| creative | High | Medium | Interactive | Writing, brainstorming, content generation |
| code | Very High | Medium | Interactive | Code generation, review, debugging |
Failover Chains
The routing engine does not return a single model. It returns a ranked list of up to 3 candidates forming a failover chain. The proxy handler walks the chain in order:
- Try candidate #1 with the full retry policy (up to max_attempts).
- If all retries fail (or the circuit breaker is open), move to candidate #2.
- If candidate #2 also fails, try candidate #3.
- If the entire chain is exhausted, return the last error to the client.
Models with an open circuit breaker are automatically filtered out of the chain, so traffic is never sent to a known-unhealthy provider.
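The chain walk described above might look like the following sketch. It is deliberately simplified: every ProviderError is treated as retryable and there is no delay between attempts, whereas the gateway consults retryable_statuses and the retry policy. ProviderError and walk_chain are hypothetical names.

```python
class ProviderError(Exception):
    def __init__(self, status):
        super().__init__(f"provider returned {status}")
        self.status = status


def walk_chain(chain, call, max_attempts=3):
    """Try each candidate in order; return (model, response) or raise the last error."""
    last_error = None
    for model in chain[:3]:                # a chain holds at most 3 candidates
        for _attempt in range(max_attempts):
            try:
                return model, call(model)
            except ProviderError as err:
                last_error = err           # remember it and retry or fall through
    if last_error is None:
        raise ValueError("empty failover chain")
    raise last_error                       # chain exhausted
```

Here call is any function that takes a model name and either returns a response or raises ProviderError.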
Routing Simulator
The portal includes an interactive Routing Simulator tab that lets you visualize how the 5-layer pipeline processes a request. Open the portal at http://localhost:8930 and click the Simulator tab.
Using the Simulator
- Select a profile — choose a predefined routing profile (e.g., Coding, Architecture) to pre-fill quality weight, cost weight, latency tolerance, and required capabilities. Select “Custom” to configure each parameter manually.
- Set data classification — choose the data sensitivity level (Public, Internal, CUI, ITAR, Classified). Models not authorized for the selected level are eliminated in Layer 1.
- Check required capabilities — select capabilities the model must support (tool use, streaming, extended thinking, vision, JSON mode, citations, embeddings, batch API). Models lacking any checked capability are eliminated in Layer 2.
- Adjust weights — tune quality weight, cost weight, latency tolerance, and reasoning mode to control how the pipeline scores and ranks surviving models.
- Set max cost (optional) — enter a dollar amount per request. Models whose estimated cost exceeds this threshold are eliminated in Layer 4.
- Click “Run Simulation” — the simulator sends your parameters plus the live model catalog to the POST /v1/route endpoint and displays the results.
Reading the Results
The results panel shows each routing layer as a step:
- Layer 1 — Compliance Filter: models authorized for the selected data classification. Eliminated models are shown with strikethrough.
- Layer 2 — Capability Filter: models with all required capabilities. If all models are filtered, the engine falls back to the Layer 1 set (shown as an orange warning).
- Layer 3 — Quality & Latency Scoring: surviving models ranked by quality score, with models exceeding the TTFT deadline eliminated.
- Layer 4 — Cost Scoring: models re-ranked by composite score: (quality × quality_weight) − (cost × cost_weight) + cache_bonus.
- Layer 5 — Adaptive Signals: final ranking adjusted by real-time performance data (if available). When no signals exist, the ranking from Layer 4 is preserved.
At the bottom, the Failover Chain shows the top 3 models that would be tried in order: primary, fallback 1, and fallback 2. Each model chip shows its provider badge and display name, with rank numbers indicating position.
Analytics Dashboard
The Analytics tab provides usage, cost, and quality insights across all gateway traffic. It requires a PostgreSQL database to be configured — without one, the tab shows a graceful "no data" state.
Time Period Selection
Use the period buttons at the top right to filter data: 24h, 7 Days, 30 Days, or All Time. The dashboard auto-selects the appropriate time bucket (hourly for 24h, daily for longer periods).
Summary Cards
Five cards show aggregate metrics for the selected period:
- Requests — total number of proxy requests
- Total Cost — sum of estimated cost across all requests
- Tokens — combined input and output token count
- Avg Latency — mean end-to-end latency
- Error Rate — percentage of requests that returned errors
Usage Charts
- Usage Over Time — bar chart showing request volume per time bucket. Hover over a bar to see the exact request count, cost, and average latency.
- Usage by Model — horizontal bar chart ranking models by request count (top 10).
- Cost by Provider — horizontal bar chart ranking providers by total cost (top 10).
- Top Projects — table with per-project breakdown of requests, tokens, cost, latency, and error rate.
Model Quality
The quality table aggregates adaptive routing signals reported via POST /v1/signals. Metrics per model include:
- Error Rate — percentage of error signals (green ≤ 5%, orange ≤ 15%, red > 15%)
- Avg Latency — mean latency signal (green ≤ 1s, orange ≤ 3s, red > 3s)
- Acceptance — acceptance rate (green ≥ 80%, orange ≥ 50%, red < 50%)
- Parse Success — structured output parse success rate
- Gov Violations — governance violation rate
Natural-Language Query
At the bottom of the Analytics tab, the Ask a Question section lets you query your usage data in natural language. Type a question like “What are the top 5 most expensive models this week?” and click Ask.
Behind the scenes, the gateway sends your question to its own LLM proxy, which generates a SQL query against the proxy_records and quality_signals tables. The query is validated for safety (SELECT-only, table whitelist, no destructive keywords) and executed in a read-only transaction with a 5-second timeout. Results are displayed as a table below the generated SQL.
Requirements: A PostgreSQL database must be configured, and at least one LLM provider must be available to generate the SQL.
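The safety checks named above (SELECT-only, table whitelist, no destructive keywords) could be approximated with a lexical validator like the sketch below. It is intentionally conservative and hypothetical: it rejects some valid queries (for example CTEs, or keywords that appear inside string literals), and it complements rather than replaces the read-only transaction. The keyword list is an assumption.

```python
import re

ALLOWED_TABLES = {"proxy_records", "quality_signals"}
FORBIDDEN = {"insert", "update", "delete", "drop", "alter",
             "truncate", "create", "grant", "copy"}


def validate_sql(sql):
    """Return (ok, reason) for a generated SQL string."""
    stripped = sql.strip().rstrip(";").strip().lower()
    if ";" in stripped:
        return False, "multiple statements"
    tokens = re.findall(r"[a-z_][a-z0-9_]*", stripped)
    if not tokens or tokens[0] != "select":
        return False, "only SELECT is allowed"
    bad = FORBIDDEN.intersection(tokens)
    if bad:
        return False, f"forbidden keyword: {sorted(bad)[0]}"
    # Every identifier following FROM or JOIN must be a whitelisted table.
    tables = re.findall(r"\b(?:from|join)\s+([a-z_][a-z0-9_.]*)", stripped)
    unknown = [t for t in tables if t not in ALLOWED_TABLES]
    if unknown:
        return False, f"table not allowed: {unknown[0]}"
    return True, "ok"
```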
5. Resilience Features
Retry Policy
Every provider request is wrapped in a configurable retry loop:
- Decorrelated jitter (AWS-style): each retry delay is randomized between base_delay and min(max_delay, prev_delay * 3). This spreads retry traffic and avoids thundering herds.
- Retry-After support: when a provider returns a Retry-After header (common with 429 responses), the gateway respects it instead of using the computed delay.
- Per-provider override: each [[providers]] entry can specify its own retry_policy that overrides the global default.
- Configurable retryable statuses: by default, only 408, 429, 500, 502, 503, and 504 trigger retries.
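The delay rule from the first two bullets can be written out directly. This is a sketch under the defaults from [resilience.default_retry_policy]; the function names are hypothetical, and seeding the first retry with prev_delay equal to base_delay is an assumption.

```python
import random


def next_delay_ms(prev_delay_ms, base_delay_ms=200, max_delay_ms=5000,
                  retry_after_ms=None, rng=random):
    """Delay before the next retry, in milliseconds."""
    if retry_after_ms is not None:
        return float(retry_after_ms)  # a provider's Retry-After takes precedence
    upper = min(max_delay_ms, prev_delay_ms * 3)
    return rng.uniform(base_delay_ms, upper)  # decorrelated jitter


def retry_schedule(retries, base_delay_ms=200, max_delay_ms=5000, rng=random):
    """Delays for a run of consecutive retries."""
    delays, prev = [], base_delay_ms  # assumption: the first retry starts from base
    for _ in range(retries):
        prev = next_delay_ms(prev, base_delay_ms, max_delay_ms, rng=rng)
        delays.append(prev)
    return delays
```

With max_attempts = 3 there are at most two retries, so the first delay falls between 200 ms and 600 ms and the second between 200 ms and at most 1800 ms.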
Circuit Breaker
A 3-state circuit breaker tracks health per provider per model:
| State | Behavior |
|---|---|
| Closed (normal) | Requests flow through. Failures are counted in a sliding window. |
| Open | All requests are immediately rejected (no network call). Entered when failures reach the threshold within the window. Lasts for open_duration_secs. |
| Half-Open | After the open duration, a limited number of probe requests (half_open_max_concurrent) are allowed through. If probes succeed, the circuit closes. If they fail, it re-opens. |
Defaults: 5 failures in 60 seconds opens the circuit for 30 seconds, with 1 concurrent probe in half-open state.
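The three states and their transitions can be modeled as a small state machine. The sketch below uses the default values above and an injectable clock; it is an illustration of the pattern rather than the gateway's implementation, and the method names are hypothetical.

```python
import time
from collections import deque


class CircuitBreaker:
    def __init__(self, failure_threshold=5, failure_window_secs=60,
                 open_duration_secs=30, half_open_max_concurrent=1,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.failure_window_secs = failure_window_secs
        self.open_duration_secs = open_duration_secs
        self.half_open_max_concurrent = half_open_max_concurrent
        self.clock = clock
        self.state = "closed"
        self._failures = deque()  # timestamps of recent failures
        self._opened_at = 0.0
        self._probes = 0

    def allow_request(self):
        now = self.clock()
        if self.state == "open":
            if now - self._opened_at < self.open_duration_secs:
                return False              # reject without a network call
            self.state = "half_open"      # open duration elapsed: start probing
            self._probes = 0
        if self.state == "half_open":
            if self._probes >= self.half_open_max_concurrent:
                return False
            self._probes += 1
        return True

    def record_success(self):
        if self.state == "half_open":     # probe succeeded: close the circuit
            self.state = "closed"
            self._failures.clear()

    def record_failure(self):
        now = self.clock()
        if self.state == "open":
            return                        # late failure while already open
        if self.state == "half_open":     # probe failed: re-open
            self._open(now)
            return
        self._failures.append(now)
        while self._failures and now - self._failures[0] > self.failure_window_secs:
            self._failures.popleft()      # drop failures outside the window
        if len(self._failures) >= self.failure_threshold:
            self._open(now)

    def _open(self, now):
        self.state = "open"
        self._opened_at = now
        self._failures.clear()
```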
Failover Chains
As described in the Routing Engine section, every routing decision produces a ranked chain of up to 3 candidates. The proxy walks the chain with full retry per step. Circuit-open models are filtered from chains before they are tried.
Hedged Requests
For the realtime routing profile only, the gateway can fire a hedged (second) request to the next candidate in the failover chain after a configurable delay. The first response to arrive wins; the other is discarded.
This is implemented using tokio::select! to race both futures. The hedge fires only when:
- Hedging is enabled in [resilience.hedge]
- The routing profile is realtime
- There is at least a second candidate in the failover chain
Default hedge delay: 300ms.
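The racing pattern is easy to demonstrate outside of Rust. The following asyncio sketch is a Python analogue of the behavior described above, not the gateway's tokio::select! code; the function name hedged and its parameters are hypothetical, and it returns whichever request finishes first even if that result is an error.

```python
import asyncio


async def hedged(primary, secondary, hedge_delay_s=0.3):
    """Race a primary call against a delayed hedge; the first to finish wins.

    primary and secondary are zero-argument coroutine functions.
    """
    first = asyncio.create_task(primary())
    done, _ = await asyncio.wait({first}, timeout=hedge_delay_s)
    if done:
        return first.result()  # primary answered before the hedge delay
    second = asyncio.create_task(secondary())  # fire the hedged request
    done, pending = await asyncio.wait(
        {first, second}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()          # discard the loser
    await asyncio.gather(*pending, return_exceptions=True)
    return done.pop().result()
```

If the primary responds within the hedge delay, the second request is never sent, so hedging adds load only for slow responses.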
Timeout Tiers
| Tier | Default | Description |
|---|---|---|
| Connect | 5,000ms | TCP connection timeout to the provider. |
| TTFT | 500ms–2,000ms | Time to first token, based on the routing profile's latency tolerance. Used during routing to filter slow models. |
| Total | 300,000ms (5 min) | Total end-to-end timeout. Override per-request with x-gateway-timeout header (value in seconds). |
| Client disconnect | — | When the client closes the connection, the gateway cancels the upstream request (for non-streaming requests). |
6. Format Translation
The gateway translates between provider wire formats using a Canonical Intermediate Representation (IR). This means you can send an OpenAI-format request and have it routed to Anthropic, or vice versa, with no code changes.
Canonical IR
Internally, every request and response passes through canonical types:
- CanonicalRequest — provider-agnostic request with model, messages, tools, temperature, max tokens, etc.
- CanonicalResponse — provider-agnostic response with content blocks, usage, stop reason, and model info.
- CanonicalContent — text, tool use, tool result, and image content blocks.
- CanonicalEvent — streaming event types (content delta, tool call delta, message start/stop, usage).
Translation Pipeline
- Inbound: The incoming request (OpenAI or Anthropic format) is parsed into a CanonicalRequest.
- Routing: The routing engine selects the best model/provider.
- Outbound: The CanonicalRequest is serialized into the target provider's wire format.
- Response: The provider's response is parsed back into a CanonicalResponse and serialized into the client's expected format.
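To illustrate the inbound and outbound steps, here is a minimal text-only sketch that parses an OpenAI-format request into a canonical form and serializes it for Anthropic. The field layout of CanonicalRequest shown here is an assumption, as is the default_max_tokens fallback. What the sketch relies on is that the Anthropic Messages API takes the system prompt as a top-level system field and requires max_tokens.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CanonicalRequest:
    model: str
    messages: list                    # chat turns only, no system entries
    system: Optional[str] = None
    max_tokens: Optional[int] = None
    temperature: Optional[float] = None


def from_openai(body):
    """Inbound: OpenAI Chat Completions body -> canonical request."""
    system_parts = [m["content"] for m in body["messages"] if m["role"] == "system"]
    chat = [m for m in body["messages"] if m["role"] != "system"]
    return CanonicalRequest(
        model=body["model"],
        messages=chat,
        system="\n".join(system_parts) or None,
        max_tokens=body.get("max_tokens"),
        temperature=body.get("temperature"),
    )


def to_anthropic(req, target_model, default_max_tokens=1024):
    """Outbound: canonical request -> Anthropic Messages body."""
    out = {
        "model": target_model,
        "messages": [{"role": m["role"], "content": m["content"]} for m in req.messages],
        # Anthropic requires max_tokens; the fallback value is an assumption.
        "max_tokens": req.max_tokens or default_max_tokens,
    }
    if req.system:
        out["system"] = req.system    # system prompt moves to a top-level field
    if req.temperature is not None:
        out["temperature"] = req.temperature
    return out
```

Tool calls, images, and streaming events need additional canonical types and are omitted here.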
Streaming SSE Translation
For streaming requests, each SSE event from the provider is parsed into a CanonicalEvent, then emitted in the client's expected SSE format. The translation is bidirectional across OpenAI, Anthropic, and Google formats.
Tool Call Translation
Tool calls and tool results are translated across all supported formats with tool call ID preservation. A tool call made in OpenAI format can be correctly resolved when the response comes from Anthropic, and vice versa.
Capability Gates
Each provider adapter reports its capabilities (tool use, vision, JSON mode, streaming, etc.). The routing engine uses these to prevent silent downgrades — if a request requires tool use, models that do not support it are eliminated before routing.
7. Security & Compliance
Input Sanitization
When sanitization.enabled = true, every prompt is scanned before it reaches the provider:
- Credential scanning (scan_credentials): detects API keys, tokens, connection strings, and other secrets. Matches are redacted with [REDACTED].
- PII redaction (redact_pii): detects email addresses, phone numbers, Social Security numbers, and similar PII patterns.
Output Compliance Gate
When sanitization.compliance_gate = true, every response (including streaming chunks) is scanned for:
- Leaked credentials (API keys, tokens, connection strings)
- PII in model output
- Content that violates the configured classification level
When a violation is detected, the gateway takes one of three actions based on the ViolationAction:
| Action | Behavior |
|---|---|
| Block | The entire response is blocked and a compliance error is returned to the client. |
| Redact | The offending content is replaced with [REDACTED] and the response is delivered. |
| LogOnly | The violation is logged and recorded in the routing trace, but the response is delivered unmodified. |
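The three actions can be sketched as a single function over scanned text. The regular expressions below are simplified stand-ins chosen for illustration and are not the gateway's detection rules; apply_gate is a hypothetical name.

```python
import re

# Illustrative patterns only; real detectors are broader and stricter.
PATTERNS = {
    "api_key": re.compile(r"\bsk-[A-Za-z0-9_-]{16,}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def apply_gate(text, action):
    """Return (text_to_deliver_or_None, findings) for a model response."""
    findings = [name for name, pat in PATTERNS.items() if pat.search(text)]
    if not findings or action == "LogOnly":
        return text, findings          # deliver unmodified; caller logs findings
    if action == "Block":
        return None, findings          # caller returns a compliance error
    if action == "Redact":
        for pat in PATTERNS.values():
            text = pat.sub("[REDACTED]", text)
        return text, findings
    raise ValueError(f"unknown action: {action}")
```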
Data Classification Enforcement
The x-data-classification request header declares the sensitivity level of the data being sent. The routing engine ensures the request is only routed to providers and models authorized for that classification level.
Classification levels (lowest to highest):
- public — publicly available data
- internal — internal business data (default)
- cui — Controlled Unclassified Information
- itar — International Traffic in Arms Regulations data
- classified — classified information
Streaming Buffer
For streaming responses, the compliance gate buffers SSE events and scans in real-time, ensuring that sensitive content is caught before it reaches the client.
8. Observability
Prometheus Metrics
Available at GET /metrics in Prometheus exposition format. Key metrics include:
| Metric | Type | Description |
|---|---|---|
| gateway_requests_total | Counter | Total requests by provider, model, and status code. |
| gateway_request_duration_seconds | Histogram | End-to-end request latency by provider. |
| gateway_tokens_total | Counter | Token counts by direction (input/output) and provider. |
| gateway_cost_dollars | Counter | Estimated cost in USD by provider and model. |
| gateway_cache_hits_total | Counter | Cache hit count. |
| gateway_cache_misses_total | Counter | Cache miss count. |
| gateway_failovers_total | Counter | Failover events by source and target provider. |
Structured JSON Logging
Every request produces a single structured JSON log line (RequestLogEntry) containing:
- Request ID, timestamp, duration
- Provider and model selected
- Token counts (input, output)
- Estimated cost
- HTTP status code
- Routing profile used
- Whether cache was hit
- Number of failover attempts
- Data classification level
- Tenant ID (if applicable)
Logging uses the standard tracing framework. Configure the log level via the RUST_LOG environment variable (e.g., RUST_LOG=info).
Routing Audit Trail
Every request generates a full RoutingTrace that records:
- Which models entered and survived each routing pipeline layer
- Why specific models were eliminated
- The failover chain that was constructed
- Each provider attempt (status, latency, errors)
- Which model ultimately handled the request
- Any compliance findings
Query traces via GET /v1/audit (paginated list) and GET /v1/audit/{request_id} (full detail).
9. API Reference
Endpoints
| Method | Path | Description |
|---|---|---|
| POST | /v1/messages | Anthropic-format proxy. Send Anthropic Messages API requests; the gateway routes to the best provider. |
| POST | /v1/chat/completions | OpenAI-format proxy. Send OpenAI Chat Completions requests; the gateway routes to the best provider. |
| GET | /v1/models | List available models (OpenAI-compatible format). |
| GET | /v1/catalog | Full model catalog with detailed model cards (capabilities, pricing, quality signals). Reflects the live runtime catalog: startup seed plus any sync results (B-042). |
| POST | /v1/catalog/sync | Trigger an immediate catalog sync against every enabled provider's /models endpoint. Returns per-provider discovered/added/retired/unchanged counts and errors (B-042). |
| GET | /v1/catalog/sync-status | Returns last_synced (RFC 3339), model_count, enabled_providers, and configured sync_interval_secs. Used by the portal Last-synced display (B-042). |
| GET | /v1/routing-profiles | List all available routing profiles with their configuration. |
| GET | /v1/providers | List configured providers and their status. |
| POST | /v1/providers | Add a new provider at runtime (does not persist to config file). |
| DELETE | /v1/providers/{id} | Remove a provider at runtime. |
| GET | /v1/projects | List active projects (groupings of requests). |
| POST | /v1/route | Dry-run routing decision. Returns the routing output without making a provider call. |
| GET | /v1/audit | List recent routing traces (paginated). |
| GET | /v1/audit/{request_id} | Full routing trace for a specific request. |
| GET | /v1/stats | Gateway statistics (total requests, tokens, cost, provider breakdown). |
| GET | /v1/requests | Paginated request history. |
| GET | /v1/requests/{id} | Single request detail with full request/response bodies. |
| GET | /health | Health check. Returns 200 OK with gateway status. |
| GET | /metrics | Prometheus-format metrics. |
| GET | /v1/analytics | Usage analytics with time-series, per-model, per-provider, and per-project breakdowns. Query params: from, to, bucket, model, provider, project, source. |
| GET | /v1/quality | Per-model quality metrics aggregated from routing signals. |
| GET | /v1/quality/{model_card_id} | Time-series quality metrics for a specific model. |
| POST | /v1/nl-query | Natural-language query. Accepts {"question":"..."}, generates and executes a read-only SQL query. |
| POST | /v1/signals | Record a quality signal. Body: {"model_card_id","task_type","signal_type","value"}. |
| GET | /v1/signals | Drain and return all pending quality signals. |
| POST | /v1/alerts | Check for model degradation alerts from signal aggregations. |
| POST | /v1/catalog/{model_id}/scores | Update quality scores for a model. Body: {"coding_score","reasoning_score","document_score","classification_score","instruction_following","arena_elo"}. Returns previous and updated scores with rollback history. |
| GET | /v1/analytics/errors | Error dashboard with customer-impact classification. Returns errors classified as customer-impacting (request failed) or mitigated (failover succeeded). Query params: from, to, model, provider, source. |
GET | /v1/analytics/requesters | Per-requester analytics with success rates and model breakdown. Requesters identified by x-source header. Query params: from, to, model, provider, source. |
DELETE | /v1/analytics | Clear all analytics data (proxy records, quality signals, shadow results). |
GET | /docs | Interactive API documentation (Scalar viewer). |
GET | /openapi.json | OpenAPI 3.1 specification (machine-readable). |
GET | /pp/version | PPF protocol version info and negotiation. |
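As a hedged sketch of calling two of the POST endpoints above, the code below builds an OpenAI-format chat request and a natural-language query request without sending them. The base URL, the helper names, and the model ID "example-model" are assumptions for illustration. The chat payload follows the general OpenAI Chat Completions shape, and the /v1/nl-query body uses the {"question": "..."} form given in the table.

```python
import json
from urllib.request import Request

# Assumed address of a locally running gateway; adjust to your deployment.
BASE_URL = "http://localhost:8080"


def post_json(path: str, payload: dict, base_url: str = BASE_URL) -> Request:
    """Build a POST request carrying a JSON body."""
    return Request(
        f"{base_url}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat_completion_request(model: str, user_text: str) -> Request:
    """Build an OpenAI-format chat completion request for the gateway."""
    payload = {"model": model, "messages": [{"role": "user", "content": user_text}]}
    return post_json("/v1/chat/completions", payload)


def nl_query_request(question: str) -> Request:
    """Build a natural-language query request."""
    return post_json("/v1/nl-query", {"question": question})


if __name__ == "__main__":
    req = chat_completion_request("example-model", "Hello")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Send a request with urllib.request.urlopen(req). The response headers then indicate which model and provider actually served it.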
Request Headers
| Header | Type | Description |
|---|---|---|
x-routing-profile | string | Override the routing profile for this request. One of: realtime, interactive, batch, cost_optimized, quality_optimized, balanced, reasoning, creative, code. |
x-data-classification | string | Data classification level: public, internal, cui, itar, classified. Default: internal. |
x-gateway-timeout | integer | Override the total request timeout, in seconds. |
x-gateway-failover | string | Set to disabled to disable failover for this request. Only the first candidate will be tried. |
x-gateway-conversation-id | string | Sticky routing key. Requests with the same conversation ID are routed to the same model for up to sticky_ttl_secs. |
x-ppf-protocol-version | string | PPF protocol version negotiation header. |
Idempotency-Key | string | Idempotency key for POST proxy requests. Duplicate requests with the same key and tenant return the cached response within TTL (default 24h). |
x-gateway-scope | string | Scope header for signal authentication. Include signal-submission to authorize POST /v1/signals when signal auth is required. |
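A small helper can assemble these headers and catch typos in enumerated values before a request leaves the client. The sketch below is one possible approach; the function name and parameter names are hypothetical, while the header names and allowed values are taken from the table above.

```python
from typing import Dict, Optional

# Allowed values as listed in the Request Headers table.
ROUTING_PROFILES = {
    "realtime", "interactive", "batch", "cost_optimized", "quality_optimized",
    "balanced", "reasoning", "creative", "code",
}
DATA_CLASSIFICATIONS = {"public", "internal", "cui", "itar", "classified"}


def gateway_headers(
    routing_profile: Optional[str] = None,
    data_classification: Optional[str] = None,
    timeout_secs: Optional[int] = None,
    disable_failover: bool = False,
    conversation_id: Optional[str] = None,
    idempotency_key: Optional[str] = None,
) -> Dict[str, str]:
    """Return only the gateway headers the caller asked for."""
    headers: Dict[str, str] = {}
    if routing_profile is not None:
        if routing_profile not in ROUTING_PROFILES:
            raise ValueError(f"unknown routing profile: {routing_profile!r}")
        headers["x-routing-profile"] = routing_profile
    if data_classification is not None:
        if data_classification not in DATA_CLASSIFICATIONS:
            raise ValueError(f"unknown data classification: {data_classification!r}")
        headers["x-data-classification"] = data_classification
    if timeout_secs is not None:
        headers["x-gateway-timeout"] = str(int(timeout_secs))
    if disable_failover:
        headers["x-gateway-failover"] = "disabled"
    if conversation_id is not None:
        headers["x-gateway-conversation-id"] = conversation_id
    if idempotency_key is not None:
        headers["Idempotency-Key"] = idempotency_key
    return headers


if __name__ == "__main__":
    print(gateway_headers(routing_profile="batch", data_classification="cui"))
```

Omitted arguments produce no header, so the gateway's own defaults (for example, internal for data classification) still apply.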
Response Headers
| Header | Value | Description |
|---|---|---|
x-gateway-model | string | The model ID that actually handled the request. |
x-gateway-provider | string | The provider ID that handled the request. |
x-gateway-cache | hit or semantic-hit | Present when the response was served from the exact-match or semantic cache. |
x-gateway-semantic-similarity | float | Cosine similarity score when served from semantic cache. |
x-gateway-idempotent | true | Present when the response was served from the idempotency cache. |
x-context-window-limit | integer | Present when the request was rejected (HTTP 413) or truncated due to context window limits. |
x-gateway-circuit | open | Present when the request was rejected by a circuit breaker. |
x-gateway-deprecation | date string | Present when the requested model has a deprecation date. Value is the sunset date. |
x-ppf-negotiated-version | string | PPF negotiated protocol version. |
x-ppf-deprecated | true | Present when the PPF protocol version used is deprecated. |
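On the client side, these headers can be folded into a small summary, as in the sketch below. The function name and the keys of the returned dict are hypothetical; the header names and values come from the table above. Header lookup is made case-insensitive because HTTP libraries differ in how they capitalize names.

```python
from typing import Mapping


def describe_response(headers: Mapping[str, str]) -> dict:
    """Summarize gateway response headers into a plain dict."""
    h = {name.lower(): value for name, value in headers.items()}
    similarity = h.get("x-gateway-semantic-similarity")
    return {
        "model": h.get("x-gateway-model"),
        "provider": h.get("x-gateway-provider"),
        # The cache header is only present on cache hits.
        "served_from_cache": h.get("x-gateway-cache") in ("hit", "semantic-hit"),
        "semantic_similarity": float(similarity) if similarity is not None else None,
        "idempotent_replay": h.get("x-gateway-idempotent") == "true",
        "circuit_open": h.get("x-gateway-circuit") == "open",
        # Sunset date string, if the requested model is deprecated.
        "sunset_date": h.get("x-gateway-deprecation"),
    }


if __name__ == "__main__":
    print(describe_response({
        "X-Gateway-Model": "example-model",
        "X-Gateway-Provider": "openai",
        "X-Gateway-Cache": "semantic-hit",
        "X-Gateway-Semantic-Similarity": "0.97",
    }))
```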
10. CLI Reference
llm-gateway [OPTIONS] [COMMAND]
Commands:
  serve     Start the gateway server (default if no command given)
  update    Check for and apply updates from GitHub Releases
Global Options:
  -c, --config <PATH>    Path to TOML config file [default: config.toml]
  -p, --port <PORT>      Override the listen port from config
  -h, --help             Print help information
  -V, --version          Print version
Update Options:
  --check                Only check for updates, don't apply them
Examples
# Start with default config
llm-gateway
# Explicitly start the server (same as above)
llm-gateway serve
# Start with a custom config and port
llm-gateway --config /etc/llm-gateway/production.toml --port 9000
# Check for updates without applying
llm-gateway update --check
# Download and apply the latest update
llm-gateway update
Appendix A: Release Notes
v0.5.2 (2026-05-23)
- Chat tab now shows the full dynamic catalog (B-055): GET /v1/models previously returned a hardcoded per-ProviderType list (e.g. 2 Anthropic IDs regardless of how many B-042 had discovered). It now reads from state.catalog and emits every active card matching the configured provider. Retired cards are filtered out; deprecated cards remain visible (B-030 reroutes them on send). Local providers still live-query their upstream /v1/models; Azure providers still report the configured deployment.
- 343 tests (320 library + 23 binary).
v0.5.1 (2026-05-23)
- ModelCard consumer audit + 3 silent-fallback fixes (B-052): With B-042's dynamic catalog writing cards whose numeric fields may be at Default, three silent-fallback gaps were latent and would have surfaced as soon as freshly-discovered models entered the runtime catalog. Routing Layer 4 cost now treats zero pricing as "unknown" (0.5 neutral) rather than "free" (0.0); transform_for_anthropic falls back to max_tokens: 4096 when card.max_output_tokens == 0 rather than injecting max_tokens: 0 (Anthropic 400); transform_for_openai skips the clamp when card.max_output_tokens == 0 rather than clamping client values to 0 (OpenAI 400).
- Portal Models tab: Context, Max Output, Input $/M, Output $/M cells now render — (muted) for default-zero fields instead of 0 or $0.00. Distinguishes "no data" from "literally zero".
- New audit doc: docs/architecture/MODELCARD_CONSUMERS.md documents every ModelCard reader and its behavior on default-valued fields. Captures the cross-cutting principle (D-34): every consumer treats a default-zero numeric field as "unknown," not "the literal value zero".
- 341 tests (320 library + 21 binary). Clippy clean.
v0.5.0 (2026-05-22)
- Dynamic model catalog (B-042): Per-provider /models auto-discovery for OpenAI, Anthropic, Azure, DeepInfra, Groq, Google, and Local. Daily background sync, reactive re-sync on upstream HTTP 404 (rate-limited per provider), operator overrides via [catalog.overrides."{provider_id}/{model_id}"], manual trigger via POST /v1/catalog/sync, status via GET /v1/catalog/sync-status. Portal Models tab has a Last-synced indicator and Sync-now button.
- Background auto-update checker (B-043): When [update].mode = "auto", periodic check (check_interval_secs, default 300s) downloads and applies updates, then exits cleanly for launchd/systemd to restart. Production deployments must run under a service manager.
- Linux auto-update works (B-044): Added the missing Linux x86_64 arm to platform_asset_name matching the release workflow's artifact. llm-gateway update now works on Linux.
- CI fully green (B-045): 91 pre-existing clippy errors cleared via mechanical refactor (R-2.1 preserved). Build, test, and clippy steps all green.
- 331 tests (313 library + 18 binary).
v0.4.0 (2026-05-21)
- Semantic response cache (B-037): Extends exact-match cache with embedding-based similarity search. Configurable cosine similarity threshold (default 0.95), scoped by tenant and model class. Skips requests with tool calls or image attachments. Configure via [cache.semantic].
- Adaptive routing maturation (B-038): Exploration strategies (EpsilonGreedy, ThompsonSampling, UCB1) route a configurable percentage of requests to non-top-ranked candidates. Shadow traffic duplicates requests to a challenger model for comparison. Signal authentication via Authorization header or x-gateway-scope: signal-submission. Eval-driven quality scores via POST /v1/catalog/{model_id}/scores with rollback history.
- Request idempotency (B-039): Idempotency-Key header on proxy requests caches responses for 24h (configurable). Tenant-scoped with LRU eviction.
- Context window enforcement (B-040): Rejects requests exceeding model context window (HTTP 413) with optional truncation mode.
- Capability fallback policy (B-041): Per-capability fail_closed/fail_open policy in routing Layer 2. Tool use and vision default to fail_closed; others default to fail_open.
- 309 tests (291 library + 18 binary).
Full release notes for earlier versions: see docs/reports/release-notes-v{VERSION}.html in the repository.