LLM Gateway User Manual

1. Introduction — Why Use an LLM Gateway?

The LLM Gateway is a self-hosted, intelligent proxy that sits between your applications and LLM providers. Instead of scattering provider-specific code across your codebase, you route all requests through a single endpoint and let the gateway handle the rest.

Single API, many providers

One endpoint for OpenAI, Anthropic, Google, DeepInfra, Groq, Azure OpenAI, and local models. Your application sends an OpenAI-format or Anthropic-format request; the gateway translates and routes it to any configured provider. Switching providers requires zero code changes.

Automatic failover

If a provider goes down, requests transparently route to the next-best model in the failover chain. Circuit breakers detect outages and stop sending traffic to unhealthy providers. Retry with decorrelated jitter handles transient errors. Users see an error only if every candidate in the chain fails.

Cost optimization

Intelligent routing considers cost, quality, and latency to pick the best model for each request. Approximate token counting enforces budget ceilings before requests are sent. Response caching avoids duplicate LLM calls. Cost is tracked per request with full visibility.

Security & compliance

Prompt sanitization strips PII and credentials before they reach providers. An output compliance gate scans every response for leaked secrets and sensitive data. Data classification enforcement ensures classified content only reaches authorized providers. Everything runs on your infrastructure, so your data passes through no third party other than the providers you configure.

Observability

Prometheus-compatible metrics at /metrics, structured JSON logs with one line per request, and full routing audit trails. You always know which provider handled each request, how long it took, and what it cost.

Rate limiting & quotas

Per-tenant RPM/TPM/RPD/concurrent limits protect against abuse. Provider quota tracking avoids 429 storms by deprioritizing providers at 90% utilization and blocking at 100%.

Format translation

Send an OpenAI-format request and have it routed to Anthropic (or vice versa) with automatic format translation — including streaming SSE events, tool calls with ID preservation, and image content.

Model management

Model aliases let you use friendly names like fast or best. Deprecation handling issues warnings before sunset and auto-redirects to replacements after. Conversation-sticky routing keeps multi-turn conversations on the same model.

Self-hosted

Runs on your infrastructure. No vendor lock-in. MIT licensed. A single binary with no runtime dependencies beyond an optional PostgreSQL database.

2. Installation

A) Binary Download (recommended)

Download the latest release for your platform using the GitHub CLI. These commands always fetch the most recent version.

Windows (PowerShell)

gh release download --repo quantum-intelligence-group/llm-gateway --pattern 'llm-gateway-windows-x64.zip' --dir $env:TEMP
Expand-Archive "$env:TEMP\llm-gateway-windows-x64.zip" -DestinationPath "$env:USERPROFILE\.llm-gateway" -Force
$env:PATH += ";$env:USERPROFILE\.llm-gateway"

macOS (ARM / Apple Silicon)

gh release download --repo quantum-intelligence-group/llm-gateway --pattern 'llm-gateway-macos-arm64.tar.gz' --dir /tmp
tar -xzf /tmp/llm-gateway-macos-arm64.tar.gz -C /usr/local/bin
chmod +x /usr/local/bin/llm-gateway

Linux (x64)

gh release download --repo quantum-intelligence-group/llm-gateway --pattern 'llm-gateway-linux-x64.tar.gz' --dir /tmp
tar -xzf /tmp/llm-gateway-linux-x64.tar.gz -C /usr/local/bin
chmod +x /usr/local/bin/llm-gateway

B) Docker

Single container

docker pull ghcr.io/quantum-intelligence-group/llm-gateway:latest
docker run -p 8930:8930 -v ./config.toml:/etc/llm-gateway/config.toml ghcr.io/quantum-intelligence-group/llm-gateway:latest

Docker Compose (with Postgres)

docker compose up

The repo includes a docker-compose.yml that builds from source and provisions Postgres.

Docker networking note: When running inside Docker, set server.host = "0.0.0.0" in your config.toml so the gateway binds to all interfaces and is reachable through the port mapping. The default 127.0.0.1 only listens inside the container.

C) Build from Source

Prerequisites

Rust 1.90+ — install from https://rustup.rs
A working C linker (MSVC toolchain on Windows, Xcode CLI tools on macOS, build-essential on Linux)

git clone https://github.com/quantum-intelligence-group/llm-gateway.git
cd llm-gateway
cargo build --release

The compiled binary will be at target/release/llm-gateway (or target\release\llm-gateway.exe on Windows).

Windows network-drive note: If your repo is on a network drive, incremental compilation may fail. Set $env:CARGO_INCREMENTAL = "0" and optionally redirect the target directory to local disk with $env:CARGO_TARGET_DIR = "C:\temp\llm-gateway-target".

Updating

Re-run the same download command — it always fetches the latest release. Or use the built-in self-update:

# Check for updates without applying
llm-gateway update --check

# Download and apply the latest release
llm-gateway update

You can enable automatic update checks at startup in your config:

[update]
mode = "auto"    # "auto" checks on startup; "manual" (default) requires explicit command

After an update is applied, restart the gateway process to use the new version.

Quick Start

After installing the gateway (binary download, Docker, or a build from source), follow these steps to get it running.

1. Create a minimal config

Copy the example config and add at least one provider API key:

cp config.example.toml config.toml

Edit config.toml and add your provider credentials under [[providers]]. At minimum you need one provider:

[server]
host = "127.0.0.1"
port = 8930

[[providers]]
id = "openai"
provider_type = "openai"
api_key = "sk-..."

2. Start the gateway

# Start with config.toml in the current directory
llm-gateway

# Or specify a config file and port explicitly
llm-gateway --config /path/to/config.toml --port 9000

On success, the gateway logs the listen address and configured providers to the console.

3. Verify it's running

curl http://localhost:8930/health

Expected response:

{
  "status": "ok",
  "version": "0.3.1",
  "uptime_secs": 5,
  "port": 8930,
  "providers": ["openai"]
}

4. Send your first request

curl http://localhost:8930/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

The gateway proxies the request to the configured provider, applies routing and compliance rules, and returns the response in the same format. See the API Reference for the full list of endpoints and headers.
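The same request can be issued from application code. The following is a minimal sketch using only the Python standard library; it assumes the Quick Start defaults (localhost, port 8930, no client authentication), and the helper names are illustrative rather than part of any gateway SDK.

```python
import json
import urllib.request

GATEWAY_URL = "http://localhost:8930"  # assumption: Quick Start host and port


def build_chat_request(prompt: str, model: str = "gpt-4o") -> urllib.request.Request:
    """Build (but do not send) an OpenAI-format chat completion request."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{GATEWAY_URL}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Usage (requires a running gateway):
#   reply = send(build_chat_request("Hello!"))
```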

3. Configuration

Configuration is loaded from a TOML file (default: config.toml). Every section is optional and has sensible defaults. You can override the config path with --config <PATH> and the port with --port <PORT>.

[server]

Key	Type	Default	Description
`host`	string	`"127.0.0.1"`	Bind address. Use `"0.0.0.0"` for Docker or external access.
`port`	u16	`8930`	Listen port.

[database]

Key	Type	Default	Description
`url`	string?	none	PostgreSQL connection URL. If omitted, records are stored in-memory only.
`max_connections`	u32	`5`	Maximum database connection pool size.
`store_bodies`	bool	`true`	Whether to persist full request/response bodies.

[update]

Key	Type	Default	Description
`mode`	string	`"manual"`	`"auto"` checks for updates on startup. `"manual"` requires the `llm-gateway update` command.

[sanitization]

Key	Type	Default	Description
`enabled`	bool	`false`	Enable the input sanitization pipeline.
`scan_credentials`	bool	`true`	Scan prompts for API keys, tokens, and connection strings.
`redact_pii`	bool	`false`	Redact personally-identifiable information (emails, phone numbers, SSNs).
`compliance_gate`	bool	`false`	Enable the output compliance gate that scans responses.
`max_classification`	string?	none	Maximum allowed data classification level (`public`, `internal`, `cui`, `itar`, `classified`).

[resilience]

[resilience.default_retry_policy]

Key	Type	Default	Description
`max_attempts`	u32	`3`	Maximum number of attempts (including the initial request).
`base_delay_ms`	u64	`200`	Base delay in milliseconds for the first retry.
`max_delay_ms`	u64	`5000`	Maximum delay cap between retries.
`jitter`	bool	`true`	Use decorrelated jitter (AWS-style) to spread retry load.
`retryable_statuses`	[u16]	`[408, 429, 500, 502, 503, 504]`	HTTP status codes that trigger a retry.
`honor_retry_after`	bool	`true`	Respect the `Retry-After` header from providers.

[resilience.timeout]

Key	Type	Default	Description
`connect_ms`	u64	`5000`	TCP connection timeout in milliseconds.
`default_total_ms`	u64	`300000`	Total request timeout (5 minutes). Override per-request via the `x-gateway-timeout` header (value in seconds).

[resilience.default_circuit_breaker]

Key	Type	Default	Description
`failure_threshold`	u32	`5`	Failures within the window that trigger the circuit to open.
`failure_window_secs`	u64	`60`	Sliding window for counting failures.
`open_duration_secs`	u64	`30`	How long the circuit stays open before transitioning to half-open.
`half_open_max_concurrent`	u32	`1`	Number of probe requests allowed in half-open state.

[resilience.hedge]

Key	Type	Default	Description
`enabled`	bool	`false`	Enable hedged requests for the `realtime` routing profile.
`hedge_delay_ms`	u64	`300`	Delay before firing the hedged (second) request.

[cache]

Key	Type	Default	Description
`enabled`	bool	`false`	Enable exact-match response caching.
`max_entries`	usize	`1000`	Maximum number of cached responses (LRU eviction).
`ttl_secs`	u64	`3600`	Time-to-live for cached entries in seconds.
`temperature_threshold`	f64	`0.3`	Only cache responses when `temperature <= threshold`. Higher temperatures produce non-deterministic output.
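To make the interaction of these four settings concrete, here is a small sketch of an exact-match cache with LRU eviction, a TTL, and a temperature gate. It is not the gateway's implementation: the cache key (a hash of the canonical JSON body) and the treatment of a missing temperature are assumptions made for illustration.

```python
import hashlib
import json
import time
from collections import OrderedDict


class ResponseCache:
    """Exact-match LRU cache with a TTL and a temperature gate (illustrative)."""

    def __init__(self, max_entries=1000, ttl_secs=3600, temperature_threshold=0.3):
        self.max_entries = max_entries
        self.ttl_secs = ttl_secs
        self.temperature_threshold = temperature_threshold
        self._entries = OrderedDict()  # key -> (stored_at, response)

    @staticmethod
    def key_for(request: dict) -> str:
        # Assumption: the key is a hash of the canonical JSON request body.
        canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def cacheable(self, request: dict) -> bool:
        # Assumption: a missing temperature is treated as 1.0 (not cacheable).
        return request.get("temperature", 1.0) <= self.temperature_threshold

    def get(self, request: dict, now=None):
        now = time.monotonic() if now is None else now
        key = self.key_for(request)
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, response = entry
        if now - stored_at > self.ttl_secs:
            del self._entries[key]  # entry outlived ttl_secs
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return response

    def put(self, request: dict, response: dict, now=None):
        if not self.cacheable(request):
            return
        now = time.monotonic() if now is None else now
        key = self.key_for(request)
        self._entries[key] = (now, response)
        self._entries.move_to_end(key)
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict least recently used
```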

[rate_limit]

Key	Type	Default	Description
`enabled`	bool	`false`	Enable per-tenant rate limiting.
`default_limits.rpm`	u32	`60`	Requests per minute.
`default_limits.tpm`	u32	`100000`	Tokens per minute.
`default_limits.rpd`	u32	`10000`	Requests per day.
`default_limits.concurrent`	u32	`10`	Maximum concurrent in-flight requests.

Per-tenant overrides use the [rate_limit.tenant_limits.<tenant_id>] section with the same fields.

[provider_quota]

Key	Type	Default	Description
`enabled`	bool	`false`	Enable provider-level quota tracking.
`deprioritize_threshold`	f64	`0.9`	Fraction of quota at which the provider is deprioritized in routing (0.9 = 90%).
`provider_limits.<id>.rpm`	u32	`1000`	Per-provider requests per minute limit.
`provider_limits.<id>.tpm`	u32	`1000000`	Per-provider tokens per minute limit.
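The threshold behavior can be sketched as a single classification function. This assumes, as stated in the introduction, that a provider is blocked at 100% utilization; the function name and the returned labels are hypothetical.

```python
def quota_status(used: int, limit: int, deprioritize_threshold: float = 0.9) -> str:
    """Classify quota utilization as 'ok', 'deprioritized', or 'blocked'."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    utilization = used / limit
    if utilization >= 1.0:
        return "blocked"  # at or over quota: do not route here
    if utilization >= deprioritize_threshold:
        return "deprioritized"  # still usable, but ranked lower
    return "ok"
```

With the default rpm limit of 1000, a provider would be deprioritized from its 900th request in the current minute and blocked at the 1000th.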

[model_routing]

Key	Type	Default	Description
`sticky_ttl_secs`	u64	`900`	How long (in seconds) a conversation sticks to the same model. Conversations are identified by the `x-gateway-conversation-id` request header.

[model_routing.aliases]

A key-value map of alias names to real model IDs:

[model_routing.aliases]
fast = "groq/llama-3.3-70b"
best = "anthropic/claude-opus-4"
code = "anthropic/claude-sonnet-4"

[[model_routing.deprecations]]

A list of deprecated models with sunset dates and optional replacements:

[[model_routing.deprecations]]
model = "anthropic/claude-3-opus"
sunset_date = "2026-06-01"
replacement = "anthropic/claude-opus-4"

Before the sunset date, the gateway adds an x-gateway-deprecation response header as a warning. After the sunset date, requests are automatically redirected to the replacement model.
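The before/after behavior can be sketched as follows. This is an illustration only: the header value format is invented, the sunset date itself is assumed to count as "after", and a deprecated model without a replacement is assumed to pass through unchanged.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Deprecation:
    model: str
    sunset_date: date
    replacement: Optional[str] = None


def resolve_model(requested: str, deprecations: list, today: date):
    """Return (model_to_use, extra_response_headers) for a requested model ID."""
    for dep in deprecations:
        if dep.model != requested:
            continue
        if today < dep.sunset_date:
            # Before sunset: serve the requested model and attach a warning.
            # Assumption: the real header value format is not documented here.
            note = f"{dep.model} sunsets on {dep.sunset_date.isoformat()}"
            return requested, {"x-gateway-deprecation": note}
        if dep.replacement is not None:
            # On or after sunset (assumption): redirect to the replacement.
            return dep.replacement, {}
    return requested, {}
```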

[[providers]]

Key	Type	Required	Description
`id`	string	yes	Unique identifier for this provider instance (e.g., `"anthropic"`, `"openai"`).
`provider_type`	string	yes	One of: `anthropic`, `openai`, `google`, `deep_infra`, `groq`, `azure_openai`, `local`.
`api_key`	string?	no	API key for the provider. Not needed for local models.
`endpoint`	string?	no	Override the provider's default API endpoint.
`deployment`	string?	no	Azure OpenAI deployment name.
`api_version`	string?	no	Azure OpenAI API version.
`enabled`	bool	no	Default: `true`. Set to `false` to disable without removing.
`retry_policy`	object?	no	Override the default retry policy for this provider.
`circuit_breaker_config`	object?	no	Override the default circuit breaker for this provider.

Default endpoints by provider type:

Provider Type	Default Endpoint
`anthropic`	`https://api.anthropic.com`
`openai`	`https://api.openai.com/v1`
`google`	`https://generativelanguage.googleapis.com/v1beta/openai`
`deep_infra`	`https://api.deepinfra.com/v1/openai`
`groq`	`https://api.groq.com/openai/v1`
`azure_openai`	Must be specified (includes deployment)
`local`	Must be specified

[key_map.*]

Virtual API key mapping for multi-tenant setups. Each entry maps a virtual key to provider-specific keys:

[key_map.team-alpha-key]
anthropic_key = "sk-ant-..."
openai_key = "sk-..."
label = "Team Alpha"

[key_map.team-beta-key]
anthropic_key = "sk-ant-..."
openai_key = "sk-..."
label = "Team Beta"

Clients authenticate with the virtual key in the Authorization header. The gateway resolves the correct provider key at proxy time and uses the virtual key as the tenant ID for rate limiting.

Full Example Configuration

[server]
host = "127.0.0.1"
port = 8930

[database]
url = "postgres://llm_gateway:llm_gateway@localhost/llm_gateway"
max_connections = 5
store_bodies = true

[update]
mode = "manual"

[sanitization]
enabled = true
scan_credentials = true
redact_pii = true
compliance_gate = true
max_classification = "cui"

[resilience.default_retry_policy]
max_attempts = 3
base_delay_ms = 200
max_delay_ms = 5000
jitter = true
retryable_statuses = [408, 429, 500, 502, 503, 504]
honor_retry_after = true

[resilience.timeout]
connect_ms = 5000
default_total_ms = 300000

[resilience.default_circuit_breaker]
failure_threshold = 5
failure_window_secs = 60
open_duration_secs = 30
half_open_max_concurrent = 1

[resilience.hedge]
enabled = true
hedge_delay_ms = 300

[cache]
enabled = true
max_entries = 1000
ttl_secs = 3600
temperature_threshold = 0.3

[rate_limit]
enabled = true

[rate_limit.default_limits]
rpm = 60
tpm = 100000
rpd = 10000
concurrent = 10

[rate_limit.tenant_limits.premium-team]
rpm = 300
tpm = 500000
rpd = 50000
concurrent = 25

[provider_quota]
enabled = true
deprioritize_threshold = 0.9

[provider_quota.provider_limits.openai]
rpm = 500
tpm = 2000000

[provider_quota.provider_limits.anthropic]
rpm = 400
tpm = 1000000

[model_routing]
sticky_ttl_secs = 900

[model_routing.aliases]
fast = "groq/llama-3.3-70b"
best = "anthropic/claude-opus-4"
code = "anthropic/claude-sonnet-4"

[[model_routing.deprecations]]
model = "anthropic/claude-3-opus"
sunset_date = "2026-06-01"
replacement = "anthropic/claude-opus-4"

[[providers]]
id = "anthropic"
provider_type = "anthropic"
api_key = "sk-ant-your-key-here"

[[providers]]
id = "openai"
provider_type = "openai"
api_key = "sk-your-key-here"

[[providers]]
id = "google"
provider_type = "google"
api_key = "AIza-your-key-here"

[[providers]]
id = "groq"
provider_type = "groq"
api_key = "gsk_your-key-here"

[[providers]]
id = "deepinfra"
provider_type = "deep_infra"
api_key = "your-deepinfra-key"

[[providers]]
id = "local-ollama"
provider_type = "local"
endpoint = "http://localhost:11434/v1"

[key_map.team-alpha-key]
anthropic_key = "sk-ant-alpha-key"
openai_key = "sk-alpha-key"
label = "Team Alpha"

4. Routing Engine

The routing engine selects the best model for each request by running a 5-layer pipeline. Each layer narrows the candidate pool until a ranked list of up to 3 models is returned as a failover chain.

5-Layer Routing Pipeline

Layer	Name	Purpose
1	Data Classification Filter	Enforces the `x-data-classification` header. Models whose authorized classification levels do not meet the request's classification are eliminated. Classification levels from lowest to highest: `public`, `internal`, `cui`, `itar`, `classified`.
2	Capability Filter	Removes models that cannot handle the request's requirements. Checks include: tool use, vision, JSON mode, extended thinking, streaming, and any required capabilities specified by the routing profile. Prevents silent downgrades.
3	Quality Scoring	Weights surviving models by quality signals (coding score, reasoning score, instruction following, arena ELO) according to the profile's quality weight (`low`=0.1, `medium`=0.3, `high`=0.6, `very_high`=0.9).
4	Cost Scoring	Weights models by price per token. Lower-cost models score higher. The profile's cost weight controls the tradeoff between cost and quality.
5	Latency Filter	Applies TTFT (Time to First Token) thresholds based on the profile's latency tolerance. `realtime`=500ms, `interactive`=2000ms, `background` and `flexible`=no limit. Models with median TTFT above the threshold are eliminated.
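The two elimination layers with fixed rules (Layers 1 and 5) can be sketched directly from the table. The model record fields `max_classification` and `median_ttft_ms` are hypothetical names chosen for this example; the real catalog schema may differ.

```python
LEVELS = ["public", "internal", "cui", "itar", "classified"]  # lowest to highest
TTFT_LIMIT_MS = {"realtime": 500, "interactive": 2000, "background": None, "flexible": None}


def classification_filter(models, request_level):
    """Layer 1: keep models authorized for at least the request's level."""
    needed = LEVELS.index(request_level)
    return [m for m in models if LEVELS.index(m["max_classification"]) >= needed]


def latency_filter(models, tolerance):
    """Layer 5: drop models whose median TTFT exceeds the tolerance's limit."""
    limit = TTFT_LIMIT_MS[tolerance]
    if limit is None:  # background and flexible impose no limit
        return list(models)
    return [m for m in models if m["median_ttft_ms"] <= limit]
```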

Routing Profiles

Routing profiles control how the pipeline weights quality, cost, and latency. Select a profile per-request with the x-routing-profile header, or let the gateway use the default (balanced).

Profile	Quality	Cost	Latency	Best For
`realtime`	Medium	Low	Realtime (≤500ms TTFT)	Autocomplete, inline suggestions, chat with sub-second response
`interactive`	High	Medium	Interactive (≤2s TTFT)	Standard chat, Q&A, moderate-complexity tasks
`batch`	Medium	High	Background (no limit)	Bulk processing, data extraction, overnight jobs
`cost_optimized`	Low	Very High	Flexible	High-volume, cost-sensitive workloads
`quality_optimized`	Very High	Low	Background	Complex reasoning, legal/medical analysis, critical outputs
`balanced`	High	Medium	Interactive	General-purpose default
`reasoning`	Very High	Low	Background	Math, logic, multi-step reasoning (prefers extended thinking)
`creative`	High	Medium	Interactive	Writing, brainstorming, content generation
`code`	Very High	Medium	Interactive	Code generation, review, debugging

Failover Chains

The routing engine does not return a single model. It returns a ranked list of up to 3 candidates forming a failover chain. The proxy handler walks the chain in order:

Try candidate #1 with the full retry policy (up to max_attempts).
If all retries fail (or the circuit breaker is open), move to candidate #2.
If candidate #2 also fails, try candidate #3.
If the entire chain is exhausted, return the last error to the client.

Models with an open circuit breaker are automatically filtered out of the chain, so traffic is never sent to a known-unhealthy provider.
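The chain walk described above can be sketched as a nested loop. `send`, `is_circuit_open`, and `ProviderError` are hypothetical stand-ins for the proxy internals, and the delay between retries is omitted to keep the example short.

```python
class ProviderError(Exception):
    """Raised by the hypothetical send() when a provider call fails."""


def call_with_failover(chain, send, max_attempts=3, is_circuit_open=lambda model: False):
    """Try each candidate in order; return (model, response) from the first success."""
    last_error = None
    for model in chain[:3]:  # the chain holds at most 3 candidates
        if is_circuit_open(model):
            continue  # never send traffic to a known-unhealthy model
        for _attempt in range(max_attempts):
            try:
                return model, send(model)
            except ProviderError as err:
                last_error = err  # remember it in case the whole chain fails
    if last_error is None:
        raise ProviderError("no candidate available")
    raise last_error  # chain exhausted: surface the last error
```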

Routing Simulator

The portal includes an interactive Routing Simulator tab that lets you visualize how the 5-layer pipeline processes a request. Open the portal at http://localhost:8930 and click the Simulator tab.

Using the Simulator

Select a profile — choose a predefined routing profile (e.g., Coding, Architecture) to pre-fill quality weight, cost weight, latency tolerance, and required capabilities. Select “Custom” to configure each parameter manually.
Set data classification — choose the data sensitivity level (Public, Internal, CUI, ITAR, Classified). Models not authorized for the selected level are eliminated in Layer 1.
Check required capabilities — select capabilities the model must support (tool use, streaming, extended thinking, vision, JSON mode, citations, embeddings, batch API). Models lacking any checked capability are eliminated in Layer 2.
Adjust weights — tune quality weight, cost weight, latency tolerance, and reasoning mode to control how the pipeline scores and ranks surviving models.
Set max cost (optional) — enter a dollar amount per request. Models whose estimated cost exceeds this threshold are eliminated in Layer 4.
Click “Run Simulation” — the simulator sends your parameters plus the live model catalog to the POST /v1/route endpoint and displays the results.

Reading the Results

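The results panel shows each routing layer as a step. The step names and ordering shown in the simulator differ slightly from the pipeline table earlier in this section: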

Layer 1 — Compliance Filter: models authorized for the selected data classification. Eliminated models are shown with strikethrough.
Layer 2 — Capability Filter: models with all required capabilities. If all models are filtered, the engine falls back to the Layer 1 set (shown as an orange warning).
Layer 3 — Quality & Latency Scoring: surviving models ranked by quality score, with models exceeding the TTFT deadline eliminated.
Layer 4 — Cost Scoring: models re-ranked by composite score: (quality × quality_weight) − (cost × cost_weight) + cache_bonus.
Layer 5 — Adaptive Signals: final ranking adjusted by real-time performance data (if available). When no signals exist, the ranking from Layer 4 is preserved.

At the bottom, the Failover Chain shows the top 3 models that would be tried in order: primary, fallback 1, and fallback 2. Each model chip shows its provider badge and display name, with rank numbers indicating position.
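The composite score shown for Layer 4 can be reproduced with a few lines. This sketch assumes quality and cost are already normalized to the 0 to 1 range and that the cost weight uses the same scale as the quality weight; neither detail is specified above.

```python
def composite_score(quality, cost, quality_weight, cost_weight, cache_bonus=0.0):
    """(quality x quality_weight) - (cost x cost_weight) + cache_bonus."""
    return quality * quality_weight - cost * cost_weight + cache_bonus


def failover_chain(models, quality_weight, cost_weight, size=3):
    """Rank models by composite score and return the IDs of the top `size`."""
    ranked = sorted(
        models,
        key=lambda m: composite_score(
            m["quality"], m["cost"], quality_weight, cost_weight, m.get("cache_bonus", 0.0)
        ),
        reverse=True,
    )
    return [m["id"] for m in ranked[:size]]
```

For example, with a quality weight of 0.6 and a cost weight of 0.3, a model with quality 0.7 and cost 0.2 scores 0.36 and outranks a model with quality 0.9 and cost 0.8, which scores 0.30.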

Analytics Dashboard

The Analytics tab provides usage, cost, and quality insights across all gateway traffic. It requires a PostgreSQL database to be configured — without one, the tab shows a graceful "no data" state.

Time Period Selection

Use the period buttons at the top right to filter data: 24h, 7 Days, 30 Days, or All Time. The dashboard auto-selects the appropriate time bucket (hourly for 24h, daily for longer periods).

Summary Cards

Five cards show aggregate metrics for the selected period:

Requests — total number of proxy requests
Total Cost — sum of estimated cost across all requests
Tokens — combined input and output token count
Avg Latency — mean end-to-end latency
Error Rate — percentage of requests that returned errors

Usage Charts

Usage Over Time — bar chart showing request volume per time bucket. Hover over a bar to see the exact request count, cost, and average latency.
Usage by Model — horizontal bar chart ranking models by request count (top 10).
Cost by Provider — horizontal bar chart ranking providers by total cost (top 10).
Top Projects — table with per-project breakdown of requests, tokens, cost, latency, and error rate.

Model Quality

The quality table aggregates adaptive routing signals reported via POST /v1/signals. Metrics per model include:

Error Rate — percentage of error signals (green ≤ 5%, orange ≤ 15%, red > 15%)
Avg Latency — mean latency signal (green ≤ 1s, orange ≤ 3s, red > 3s)
Acceptance — acceptance rate (green ≥ 80%, orange ≥ 50%, red < 50%)
Parse Success — structured output parse success rate
Gov Violations — governance violation rate

Natural-Language Query

At the bottom of the Analytics tab, the Ask a Question section lets you query your usage data in natural language. Type a question like “What are the top 5 most expensive models this week?” and click Ask.

Behind the scenes, the gateway sends your question to its own LLM proxy, which generates a SQL query against the proxy_records and quality_signals tables. The query is validated for safety (SELECT-only, table whitelist, no destructive keywords) and executed in a read-only transaction with a 5-second timeout. Results are displayed as a table below the generated SQL.

Requirements: A PostgreSQL database must be configured, and at least one LLM provider must be available to generate the SQL.
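The safety checks can be pictured with a deliberately naive sketch. It uses regular expressions rather than a SQL parser, so it will reject some valid queries (for example, ones with a keyword inside a string literal) and does not handle comma-separated table lists; the forbidden-keyword list is an assumption, and the gateway's actual validator is likely stricter.

```python
import re

ALLOWED_TABLES = {"proxy_records", "quality_signals"}
# Assumption: an illustrative set of destructive keywords.
FORBIDDEN = {"insert", "update", "delete", "drop", "alter", "truncate", "create", "grant", "copy"}


def is_safe_query(sql: str) -> bool:
    """Return True if the query looks like a single SELECT over whitelisted tables."""
    statement = sql.strip().rstrip(";").strip()
    if ";" in statement:
        return False  # more than one statement
    lowered = statement.lower()
    if not re.match(r"select\b", lowered):
        return False  # SELECT-only
    if set(re.findall(r"[a-z_]+", lowered)) & FORBIDDEN:
        return False  # destructive keyword present
    tables = re.findall(r"\b(?:from|join)\s+([a-z_][a-z0-9_]*)", lowered)
    return bool(tables) and all(t in ALLOWED_TABLES for t in tables)
```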

5. Resilience Features

Retry Policy

Every provider request is wrapped in a configurable retry loop:

Decorrelated jitter (AWS-style): each retry delay is randomized between base_delay and min(max_delay, prev_delay * 3). This spreads retry traffic and avoids thundering herds.
Retry-After support: when a provider returns a Retry-After header (common with 429 responses), the gateway respects it instead of using the computed delay.
Per-provider override: each [[providers]] entry can specify its own retry_policy that overrides the global default.
Configurable retryable statuses: by default, only 408, 429, 500, 502, 503, and 504 trigger retries.
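The delay rule above can be sketched as follows, using the default policy values. The function names are illustrative, and the first retry is assumed to start from `prev_delay = base_delay`.

```python
import random


def next_delay_ms(prev_delay_ms, base_delay_ms=200, max_delay_ms=5000, rng=random):
    """Decorrelated jitter: uniform between base and min(max, prev * 3)."""
    upper = min(max_delay_ms, prev_delay_ms * 3)
    upper = max(upper, base_delay_ms)  # guard against max_delay < base_delay
    return rng.uniform(base_delay_ms, upper)


def retry_delays(max_attempts=3, base_delay_ms=200, max_delay_ms=5000, rng=random):
    """Delays before each retry; the initial attempt has no delay."""
    delays = []
    prev = base_delay_ms  # assumption: the first retry starts from the base delay
    for _ in range(max_attempts - 1):
        prev = next_delay_ms(prev, base_delay_ms, max_delay_ms, rng)
        delays.append(prev)
    return delays


def choose_delay_ms(computed_ms, retry_after_secs=None, honor_retry_after=True):
    """Prefer the provider's Retry-After value over the computed delay."""
    if honor_retry_after and retry_after_secs is not None:
        return retry_after_secs * 1000.0
    return computed_ms
```

With the defaults, the first retry waits between 200ms and 600ms, and no delay ever exceeds 5000ms.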

Circuit Breaker

A 3-state circuit breaker tracks health per provider per model:

State	Behavior
Closed (normal)	Requests flow through. Failures are counted in a sliding window.
Open	All requests are immediately rejected (no network call). Entered when the failure count reaches `failure_threshold` within the window. Lasts for `open_duration_secs`.
Half-Open	After the open duration, a limited number of probe requests (`half_open_max_concurrent`) are allowed through. If probes succeed, the circuit closes. If they fail, it re-opens.

Defaults: 5 failures in 60 seconds opens the circuit for 30 seconds, with 1 concurrent probe in half-open state.
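The state transitions can be sketched as a small class. This is an illustrative model, not the gateway's code: it assumes the circuit opens when the failure count reaches the threshold, and that a single successful probe is enough to close it.

```python
import time
from collections import deque


class CircuitBreaker:
    """Three-state breaker: closed -> open -> half_open -> closed (or open again)."""

    def __init__(self, failure_threshold=5, failure_window_secs=60,
                 open_duration_secs=30, half_open_max_concurrent=1,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.failure_window_secs = failure_window_secs
        self.open_duration_secs = open_duration_secs
        self.half_open_max_concurrent = half_open_max_concurrent
        self._clock = clock
        self.state = "closed"
        self._failures = deque()  # timestamps of recent failures
        self._opened_at = 0.0
        self._probes_in_flight = 0

    def allow_request(self) -> bool:
        now = self._clock()
        if self.state == "open":
            if now - self._opened_at < self.open_duration_secs:
                return False  # still open: reject without a network call
            self.state = "half_open"
            self._probes_in_flight = 0
        if self.state == "half_open":
            if self._probes_in_flight >= self.half_open_max_concurrent:
                return False  # probe budget used up
            self._probes_in_flight += 1
        return True

    def record_success(self):
        if self.state == "half_open":
            # Assumption: one successful probe closes the circuit.
            self.state = "closed"
            self._probes_in_flight = 0
            self._failures.clear()

    def record_failure(self):
        now = self._clock()
        if self.state == "half_open":
            self._open(now)  # failed probe: re-open
            return
        self._failures.append(now)
        while self._failures and now - self._failures[0] > self.failure_window_secs:
            self._failures.popleft()  # slide the window
        if len(self._failures) >= self.failure_threshold:
            self._open(now)

    def _open(self, now):
        self.state = "open"
        self._opened_at = now
        self._failures.clear()
```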

Failover Chains

As described in the Routing Engine section, every routing decision produces a ranked chain of up to 3 candidates. The proxy walks the chain with full retry per step. Circuit-open models are filtered from chains before they are tried.

Hedged Requests

For the realtime routing profile only, the gateway can fire a hedged (second) request to the next candidate in the failover chain after a configurable delay. The first response to arrive wins; the other is discarded.

This is implemented using tokio::select! to race both futures. The hedge fires only when:

Hedging is enabled in [resilience.hedge]
The routing profile is realtime
There is at least a second candidate in the failover chain

Default hedge delay: 300ms.
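The gateway itself races the two requests with tokio::select! in Rust. The same idea can be illustrated in Python with asyncio; this sketch is an analogy rather than a port, and `primary` and `hedge` are hypothetical coroutine functions standing in for calls to the first and second candidates.

```python
import asyncio


async def hedged_call(primary, hedge, hedge_delay_s=0.3):
    """Start primary(); if it has not finished after the delay, race it against hedge()."""
    primary_task = asyncio.ensure_future(primary())
    done, _ = await asyncio.wait({primary_task}, timeout=hedge_delay_s)
    if done:
        return primary_task.result()  # primary answered before the hedge fired

    hedge_task = asyncio.ensure_future(hedge())
    done, pending = await asyncio.wait(
        {primary_task, hedge_task}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # discard the slower request
    winner = primary_task if primary_task in done else hedge_task
    return winner.result()
```

Note that this sketch returns (or raises) whatever the first finished task produced; it does not fall back to the other request if the winner failed.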

Timeout Tiers

Tier	Default	Description
Connect	5,000ms	TCP connection timeout to the provider.
TTFT	500ms–2,000ms	Time to first token, based on the routing profile's latency tolerance. Used during routing to filter slow models.
Total	300,000ms (5 min)	Total end-to-end timeout. Override per-request with `x-gateway-timeout` header (value in seconds).
Client disconnect	—	When the client closes the connection, the gateway cancels the upstream request (for non-streaming requests).

6. Format Translation

The gateway translates between provider wire formats using a Canonical Intermediate Representation (IR). This means you can send an OpenAI-format request and have it routed to Anthropic, or vice versa, with no code changes.

Canonical IR

Internally, every request and response passes through canonical types:

CanonicalRequest — provider-agnostic request with model, messages, tools, temperature, max tokens, etc.
CanonicalResponse — provider-agnostic response with content blocks, usage, stop reason, and model info.
CanonicalContent — text, tool use, tool result, and image content blocks.
CanonicalEvent — streaming event types (content delta, tool call delta, message start/stop, usage).

Translation Pipeline

Inbound: The incoming request (OpenAI or Anthropic format) is parsed into a CanonicalRequest.
Routing: The routing engine selects the best model/provider.
Outbound: The CanonicalRequest is serialized into the target provider's wire format.
Response: The provider's response is parsed back into a CanonicalResponse and serialized into the client's expected format.
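As a rough illustration of the inbound and outbound steps, the sketch below converts a simple OpenAI-format body into an Anthropic-format body through a plain dictionary that stands in for CanonicalRequest. It handles only string message content; the dictionary shape, function names, and default max_tokens value are assumptions. It relies on two known format differences: OpenAI carries system prompts as messages with role "system", while the Anthropic Messages API takes a top-level system field and requires max_tokens.

```python
def openai_to_canonical(body: dict) -> dict:
    """Inbound: parse an OpenAI-format body into a provider-agnostic dict."""
    system_parts = [m["content"] for m in body["messages"] if m["role"] == "system"]
    return {
        "model": body["model"],
        "system": "\n".join(system_parts) or None,
        "messages": [m for m in body["messages"] if m["role"] != "system"],
        "temperature": body.get("temperature"),
        "max_tokens": body.get("max_tokens"),
    }


def canonical_to_anthropic(canonical: dict, target_model: str,
                           default_max_tokens: int = 1024) -> dict:
    """Outbound: serialize the canonical dict into an Anthropic-format body."""
    out = {
        "model": target_model,
        "messages": canonical["messages"],
        # Assumption: fall back to a default when the client gave no limit.
        "max_tokens": canonical["max_tokens"] or default_max_tokens,
    }
    if canonical["system"]:
        out["system"] = canonical["system"]
    if canonical["temperature"] is not None:
        out["temperature"] = canonical["temperature"]
    return out
```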

Streaming SSE Translation

For streaming requests, each SSE event from the provider is parsed into a CanonicalEvent, then emitted in the client's expected SSE format. The translation is bidirectional across OpenAI, Anthropic, and Google formats.

Tool Call Translation

Tool calls and tool results are translated across all supported formats with tool call ID preservation. A tool call made in OpenAI format can be correctly resolved when the response comes from Anthropic, and vice versa.

Capability Gates

Each provider adapter reports its capabilities (tool use, vision, JSON mode, streaming, etc.). The routing engine uses these to prevent silent downgrades — if a request requires tool use, models that do not support it are eliminated before routing.

7. Security & Compliance

Input Sanitization

When sanitization.enabled = true, every prompt is scanned before it reaches the provider:

Credential scanning (scan_credentials): detects API keys, tokens, connection strings, and other secrets. Matches are redacted with [REDACTED].
PII redaction (redact_pii): detects email addresses, phone numbers, Social Security numbers, and similar PII patterns.
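A pattern-based scanner of this kind can be sketched in a few lines. The three regular expressions below are simplified examples chosen for illustration and are not the gateway's actual detection rules.

```python
import re

# Assumption: simplified example patterns, far less thorough than a real scanner.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "api_key": re.compile(r"\bsk-[A-Za-z0-9_-]{16,}\b"),
}


def sanitize(text: str):
    """Replace every match with [REDACTED]; return (clean_text, finding_names)."""
    findings = []
    for name, pattern in PATTERNS.items():
        text, count = pattern.subn("[REDACTED]", text)
        findings.extend([name] * count)
    return text, findings
```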

Output Compliance Gate

When sanitization.compliance_gate = true, every response (including streaming chunks) is scanned for:

Leaked credentials (API keys, tokens, connection strings)
PII in model output
Content that violates the configured classification level

When a violation is detected, the gateway takes one of three actions based on the ViolationAction:

Action	Behavior
`Block`	The entire response is blocked and a compliance error is returned to the client.
`Redact`	The offending content is replaced with `[REDACTED]` and the response is delivered.
`LogOnly`	The violation is logged and recorded in the routing trace, but the response is delivered unmodified.
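The three actions can be sketched as a small dispatch function. The return shape, the `log` callback, and representing violations as literal substrings are all assumptions made for this example; the error returned to the client on `Block` is not specified above.

```python
from enum import Enum


class ViolationAction(Enum):
    BLOCK = "Block"
    REDACT = "Redact"
    LOG_ONLY = "LogOnly"


def apply_action(action, response_text, violations, log=print):
    """Return (delivered, text). `violations` is a list of offending substrings."""
    if not violations:
        return True, response_text
    if action is ViolationAction.BLOCK:
        return False, None  # caller returns a compliance error instead
    if action is ViolationAction.REDACT:
        for fragment in violations:
            response_text = response_text.replace(fragment, "[REDACTED]")
        return True, response_text
    log(f"compliance violation logged: {len(violations)} finding(s)")
    return True, response_text  # LogOnly: deliver unmodified
```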

Data Classification Enforcement

The x-data-classification request header declares the sensitivity level of the data being sent. The routing engine ensures the request is only routed to providers and models authorized for that classification level.

Classification levels (lowest to highest):

public — publicly available data
internal — internal business data (default)
cui — Controlled Unclassified Information
itar — International Traffic in Arms Regulations data
classified — classified information

Streaming Buffer

For streaming responses, the compliance gate buffers SSE events and scans them in real time, so that sensitive content is caught before it reaches the client.

8. Observability

Prometheus Metrics

Available at GET /metrics in Prometheus exposition format. Key metrics include:

Metric	Type	Description
`gateway_requests_total`	Counter	Total requests by provider, model, and status code.
`gateway_request_duration_seconds`	Histogram	End-to-end request latency by provider.
`gateway_tokens_total`	Counter	Token counts by direction (input/output) and provider.
`gateway_cost_dollars`	Counter	Estimated cost in USD by provider and model.
`gateway_cache_hits_total`	Counter	Cache hit count.
`gateway_cache_misses_total`	Counter	Cache miss count.
`gateway_failovers_total`	Counter	Failover events by source and target provider.

Structured JSON Logging

Every request produces a single structured JSON log line (RequestLogEntry) containing:

Request ID, timestamp, duration
Provider and model selected
Token counts (input, output)
Estimated cost
HTTP status code
Routing profile used
Whether cache was hit
Number of failover attempts
Data classification level
Tenant ID (if applicable)

Logging uses the standard tracing framework. Configure the log level via the RUST_LOG environment variable (e.g., RUST_LOG=info).

Routing Audit Trail

Every request generates a full RoutingTrace that records:

Which models entered and survived each routing pipeline layer
Why specific models were eliminated
The failover chain that was constructed
Each provider attempt (status, latency, errors)
Which model ultimately handled the request
Any compliance findings

Query traces via GET /v1/audit (paginated list) and GET /v1/audit/{request_id} (full detail).

9. API Reference

Endpoints

Method	Path	Description
`POST`	`/v1/messages`	Anthropic-format proxy. Send Anthropic Messages API requests; the gateway routes to the best provider.
`POST`	`/v1/chat/completions`	OpenAI-format proxy. Send OpenAI Chat Completions requests; the gateway routes to the best provider.
`GET`	`/v1/models`	List available models (OpenAI-compatible format).
`GET`	`/v1/catalog`	Full model catalog with detailed model cards (capabilities, pricing, quality signals). Reflects the live runtime catalog: startup seed plus any sync results.
`POST`	`/v1/catalog/sync`	Trigger an immediate catalog sync against every enabled provider's `/models` endpoint. Returns per-provider `discovered`/`added`/`retired`/`unchanged` counts and errors.
`GET`	`/v1/catalog/sync-status`	Returns `last_synced` (RFC 3339), `model_count`, `enabled_providers`, and configured `sync_interval_secs`. Used by the portal Last-synced display.
`GET`	`/v1/routing-profiles`	List all available routing profiles with their configuration.
`GET`	`/v1/providers`	List configured providers and their status.
`POST`	`/v1/providers`	Add a new provider at runtime (does not persist to config file).
`DELETE`	`/v1/providers/{id}`	Remove a provider at runtime.
`GET`	`/v1/projects`	List active projects (groupings of requests).
`POST`	`/v1/route`	Dry-run routing decision. Returns the routing output without making a provider call.
`GET`	`/v1/audit`	List recent routing traces (paginated).
`GET`	`/v1/audit/{request_id}`	Full routing trace for a specific request.
`GET`	`/v1/stats`	Gateway statistics (total requests, tokens, cost, provider breakdown).
`GET`	`/v1/requests`	Paginated request history.
`GET`	`/v1/requests/{id}`	Single request detail with full request/response bodies.
`GET`	`/health`	Health check. Returns `200 OK` with gateway status.
`GET`	`/metrics`	Prometheus-format metrics.
`GET`	`/v1/analytics`	Usage analytics with time-series, per-model, per-provider, and per-project breakdowns. Query params: `from`, `to`, `bucket`, `model`, `provider`, `project`, `source`.
`GET`	`/v1/quality`	Per-model quality metrics aggregated from routing signals.
`GET`	`/v1/quality/{model_card_id}`	Time-series quality metrics for a specific model.
`POST`	`/v1/nl-query`	Natural-language query. Accepts `{"question":"..."}`, generates and executes a read-only SQL query.
`POST`	`/v1/signals`	Record a quality signal. JSON body fields: `model_card_id`, `task_type`, `signal_type`, `value`.
`GET`	`/v1/signals`	Drain and return all pending quality signals.
`POST`	`/v1/alerts`	Check for model degradation alerts from signal aggregations.
`POST`	`/v1/catalog/{model_id}/scores`	Update quality scores for a model. JSON body fields: `coding_score`, `reasoning_score`, `document_score`, `classification_score`, `instruction_following`, `arena_elo`. Returns the previous and updated scores with rollback history.
`GET`	`/v1/analytics/errors`	Error dashboard with customer-impact classification. Returns errors classified as customer-impacting (request failed) or mitigated (failover succeeded). Query params: `from`, `to`, `model`, `provider`, `source`.
`GET`	`/v1/analytics/requesters`	Per-requester analytics with success rates and model breakdown. Requesters identified by `x-source` header. Query params: `from`, `to`, `model`, `provider`, `source`.
`DELETE`	`/v1/analytics`	Clear all analytics data (proxy records, quality signals, shadow results).
`GET`	`/docs`	Interactive API documentation (Scalar viewer).
`GET`	`/openapi.json`	OpenAPI 3.1 specification (machine-readable).
`GET`	`/pp/version`	PPF protocol version info and negotiation.
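
As an illustration of the OpenAI-format proxy endpoint, the sketch below builds a `POST /v1/chat/completions` request with Python's standard library. The base URL, the model name, and the helper name `build_chat_request` are placeholders, not values defined by the gateway. Any authentication your deployment requires is not shown.

```python
import json
import urllib.request


def build_chat_request(base_url, model, user_text, extra_headers=None):
    """Build (but do not send) an OpenAI-format chat completion request."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
    }
    headers = {"Content-Type": "application/json"}
    headers.update(extra_headers or {})
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


# Usage against a running gateway (not executed here):
#   req = build_chat_request("http://localhost:8080", "example-model", "Hello")
#   with urllib.request.urlopen(req, timeout=60) as resp:
#       reply = json.load(resp)
```

Building the request separately from sending it keeps the payload easy to inspect or log before any provider call is made.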

Request Headers

Header	Type	Description
`x-routing-profile`	string	Override the routing profile for this request. One of: `realtime`, `interactive`, `batch`, `cost_optimized`, `quality_optimized`, `balanced`, `reasoning`, `creative`, `code`.
`x-data-classification`	string	Data classification level: `public`, `internal`, `cui`, `itar`, `classified`. Default: `internal`.
`x-gateway-timeout`	integer	Override the total request timeout, in seconds.
`x-gateway-failover`	string	Set to `disabled` to disable failover for this request. Only the first candidate will be tried.
`x-gateway-conversation-id`	string	Sticky routing key. Requests with the same conversation ID are routed to the same model for up to `sticky_ttl_secs`.
`x-ppf-protocol-version`	string	PPF protocol version negotiation header.
`Idempotency-Key`	string	Idempotency key for POST proxy requests. Duplicate requests with the same key and tenant return the cached response within TTL (default 24h).
`x-gateway-scope`	string	Scope header for signal authentication. Include `signal-submission` to authorize `POST /v1/signals` when signal auth is required.
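
The request headers above can be assembled with a small helper such as the following sketch. The function name and its parameters are hypothetical. The allowed values are copied from the table, and the helper validates them client-side so that a typo fails before the request is sent.

```python
ROUTING_PROFILES = {
    "realtime", "interactive", "batch", "cost_optimized", "quality_optimized",
    "balanced", "reasoning", "creative", "code",
}
DATA_CLASSIFICATIONS = {"public", "internal", "cui", "itar", "classified"}


def gateway_headers(profile=None, classification=None, timeout_secs=None,
                    failover=True, conversation_id=None, idempotency_key=None):
    """Return a dict of gateway request headers, validating enumerated values."""
    headers = {}
    if profile is not None:
        if profile not in ROUTING_PROFILES:
            raise ValueError(f"unknown routing profile: {profile!r}")
        headers["x-routing-profile"] = profile
    if classification is not None:
        if classification not in DATA_CLASSIFICATIONS:
            raise ValueError(f"unknown data classification: {classification!r}")
        headers["x-data-classification"] = classification
    if timeout_secs is not None:
        headers["x-gateway-timeout"] = str(int(timeout_secs))  # whole seconds
    if not failover:
        headers["x-gateway-failover"] = "disabled"
    if conversation_id is not None:
        headers["x-gateway-conversation-id"] = conversation_id
    if idempotency_key is not None:
        headers["Idempotency-Key"] = idempotency_key
    return headers
```

Headers that are left unset are simply omitted, so the gateway's own defaults (for example `internal` for data classification) still apply.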

Response Headers

Header	Value	Description
`x-gateway-model`	string	The model ID that actually handled the request.
`x-gateway-provider`	string	The provider ID that handled the request.
`x-gateway-cache`	`hit` | `semantic-hit`	Present when the response was served from the exact-match or semantic cache.
`x-gateway-semantic-similarity`	float	Cosine similarity score when served from semantic cache.
`x-gateway-idempotent`	`true`	Present when the response was served from the idempotency cache.
`x-context-window-limit`	integer	Present when the request was rejected (HTTP 413) or truncated because of context window limits.
`x-gateway-circuit`	`open`	Present when the request was rejected by a circuit breaker.
`x-gateway-deprecation`	date string	Present when the requested model has a deprecation date. Value is the sunset date.
`x-ppf-negotiated-version`	string	PPF negotiated protocol version.
`x-ppf-deprecated`	`true`	Present when the PPF protocol version used is deprecated.
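
A client can fold the response headers above into a single summary for logging. The sketch below is one way to do that. The function name and the keys of the returned dict are illustrative choices, not part of the gateway API.

```python
def summarize_gateway_headers(headers):
    """Summarize gateway response headers from any mapping of header name to value."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    h = {k.lower(): v for k, v in headers.items()}
    similarity = h.get("x-gateway-semantic-similarity")
    return {
        "model": h.get("x-gateway-model"),
        "provider": h.get("x-gateway-provider"),
        "cache": h.get("x-gateway-cache"),  # "hit", "semantic-hit", or None
        "semantic_similarity": float(similarity) if similarity is not None else None,
        "idempotent_replay": h.get("x-gateway-idempotent") == "true",
        "circuit_open": h.get("x-gateway-circuit") == "open",
        "deprecation_date": h.get("x-gateway-deprecation"),
        "ppf_deprecated": h.get("x-ppf-deprecated") == "true",
    }
```

With `urllib.request`, the `resp.headers` object of a response can be passed in directly, since it provides an `items()` method.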

10. CLI Reference

llm-gateway [OPTIONS] [COMMAND]

Commands:
  serve    Start the gateway server (default if no command given)
  update   Check for and apply updates from GitHub Releases

Global Options:
  -c, --config <PATH>   Path to TOML config file [default: config.toml]
  -p, --port <PORT>     Override the listen port from config
  -h, --help            Print help information
  -V, --version         Print version

Update Options:
  --check               Only check for updates, don't apply them

Examples

# Start with default config
llm-gateway

# Explicitly start the server (same as above)
llm-gateway serve

# Start with a custom config and port
llm-gateway --config /etc/llm-gateway/production.toml --port 9000

# Check for updates without applying
llm-gateway update --check

# Download and apply the latest update
llm-gateway update

Appendix A: Release Notes

v0.5.2 (2026-05-23)

Chat tab now shows the full dynamic catalog (B-055): GET /v1/models previously returned a hardcoded per-ProviderType list (e.g. 2 Anthropic IDs regardless of how many B-042 had discovered). It now reads from state.catalog and emits every active card matching the configured provider. Retired cards are filtered out; deprecated cards remain visible (B-030 reroutes them on send). Local providers still live-query their upstream /v1/models; Azure providers still report the configured deployment.
343 tests (320 library + 23 binary).

v0.5.1 (2026-05-23)

ModelCard consumer audit + 3 silent-fallback fixes (B-052): B-042's dynamic catalog can write cards whose numeric fields are still at their Default values, which left three latent silent-fallback gaps that would have surfaced as soon as freshly discovered models entered the runtime catalog. Routing Layer 4 cost now treats zero pricing as "unknown" (0.5 neutral) rather than "free" (0.0); transform_for_anthropic falls back to max_tokens: 4096 when card.max_output_tokens == 0 rather than injecting max_tokens: 0 (which Anthropic rejects with HTTP 400); transform_for_openai skips the clamp when card.max_output_tokens == 0 rather than clamping client values to 0 (which OpenAI rejects with HTTP 400).
Portal Models tab: Context, Max Output, Input $/M, Output $/M cells now render — (muted) for default-zero fields instead of 0 or $0.00. Distinguishes "no data" from "literally zero".
New audit doc: docs/architecture/MODELCARD_CONSUMERS.md documents every ModelCard reader and its behavior on default-valued fields. Captures the cross-cutting principle (D-34): every consumer treats a default-zero numeric field as "unknown," not "the literal value zero".
341 tests (320 library + 21 binary). Clippy clean.
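
The "default zero means unknown" principle (D-34) from the v0.5.1 notes can be illustrated with a short Python sketch. This is not the gateway's source code. The function names and the price normalization are assumptions; only the neutral score of 0.5, the 4096 fallback, and the skipped clamp come from the notes above.

```python
NEUTRAL_COST_SCORE = 0.5          # used when pricing is unknown
FALLBACK_MAX_TOKENS = 4096        # used when the card has no output limit


def cost_score(price_per_mtok, highest_price_per_mtok):
    """Normalized cost in [0, 1], lower is cheaper (normalization is an assumption)."""
    if price_per_mtok == 0 or highest_price_per_mtok <= 0:
        return NEUTRAL_COST_SCORE  # zero price means "unknown", not "free"
    return min(price_per_mtok / highest_price_per_mtok, 1.0)


def injected_max_tokens(card_max_output_tokens):
    """Value to inject when the client sent no max_tokens."""
    if card_max_output_tokens == 0:
        return FALLBACK_MAX_TOKENS  # never inject max_tokens: 0
    return card_max_output_tokens


def clamp_max_tokens(client_max_tokens, card_max_output_tokens):
    """Clamp the client's value to the card limit, unless the limit is unknown."""
    if card_max_output_tokens == 0:
        return client_max_tokens    # skip the clamp instead of clamping to 0
    return min(client_max_tokens, card_max_output_tokens)
```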

v0.5.0 (2026-05-22)

Dynamic model catalog (B-042): Per-provider /models auto-discovery for OpenAI, Anthropic, Azure, DeepInfra, Groq, Google, and Local. Daily background sync, reactive re-sync on upstream HTTP 404 (rate-limited per provider), operator overrides via [catalog.overrides."{provider_id}/{model_id}"], manual trigger via POST /v1/catalog/sync, status via GET /v1/catalog/sync-status. Portal Models tab has a Last-synced indicator and Sync-now button.
Background auto-update checker (B-043): When [update].mode = "auto", a periodic check (check_interval_secs, default 300s) downloads and applies updates, then the gateway exits cleanly so that launchd/systemd can restart it. Production deployments must therefore run under a service manager.
Linux auto-update works (B-044): Added the missing Linux x86_64 match arm to platform_asset_name, matching the release workflow's artifact. llm-gateway update now works on Linux.
CI fully green (B-045): 91 pre-existing clippy errors cleared via mechanical refactor (R-2.1 preserved). Build, test, and clippy steps all green.
331 tests (313 library + 18 binary).
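
One way to use the sync-status fields introduced in v0.5.0 is a monitoring check that flags a stale catalog. The sketch below assumes the response is a JSON object containing `last_synced` and `sync_interval_secs` as described in the API reference. The function name, the `slack` factor, and the treatment of a missing `last_synced` as stale are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone


def is_catalog_stale(status, now=None, slack=2.0):
    """Return True when the last sync is older than slack * sync_interval_secs."""
    now = now or datetime.now(timezone.utc)
    last = status.get("last_synced")
    if not last:
        return True  # assumption: a missing or null value means "never synced"
    # fromisoformat() on older Python versions does not accept a trailing "Z".
    last_synced = datetime.fromisoformat(last.replace("Z", "+00:00"))
    max_age = timedelta(seconds=status["sync_interval_secs"] * slack)
    return now - last_synced > max_age
```

The `status` dict would come from decoding the body of `GET /v1/catalog/sync-status`. A stale result could then trigger `POST /v1/catalog/sync`.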

v0.4.0 (2026-05-21)

Semantic response cache (B-037): Extends exact-match cache with embedding-based similarity search. Configurable cosine similarity threshold (default 0.95), scoped by tenant and model class. Skips requests with tool calls or image attachments. Configure via [cache.semantic].
Adaptive routing maturation (B-038): Exploration strategies (EpsilonGreedy, ThompsonSampling, UCB1) route a configurable percentage of requests to non-top-ranked candidates. Shadow traffic duplicates requests to a challenger model for comparison. Signal authentication via Authorization header or x-gateway-scope: signal-submission. Eval-driven quality scores via POST /v1/catalog/{model_id}/scores with rollback history.
Request idempotency (B-039): Idempotency-Key header on proxy requests caches responses for 24h (configurable). Tenant-scoped with LRU eviction.
Context window enforcement (B-040): Rejects requests that exceed the model's context window with HTTP 413, or truncates them when the optional truncation mode is enabled.
Capability fallback policy (B-041): Per-capability fail_closed/fail_open policy in routing Layer 2. Tool use and vision default to fail_closed; others default to fail_open.
309 tests (291 library + 18 binary).
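
To make the semantic cache lookup from v0.4.0 concrete, here is a minimal Python sketch of a threshold-based cosine similarity match scoped by tenant and model class. It is an illustration under assumptions, not the gateway's implementation: the entry layout, the function names, and the linear scan are all hypothetical, and the rule that skips requests with tool calls or images is not modeled.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_lookup(entries, tenant, model_class, query_embedding, threshold=0.95):
    """Return (response, similarity) for the best in-scope match, or None."""
    best, best_sim = None, threshold
    for entry in entries:
        # Entries from other tenants or model classes are never candidates.
        if entry["tenant"] != tenant or entry["model_class"] != model_class:
            continue
        sim = cosine_similarity(query_embedding, entry["embedding"])
        if sim >= best_sim:
            best, best_sim = entry, sim
    return (best["response"], best_sim) if best is not None else None
```

A higher threshold lowers the chance of returning a cached answer to a question that only looks similar, at the cost of fewer cache hits.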

Full release notes for earlier versions: see docs/reports/release-notes-v{VERSION}.html in the repository.