1. Introduction — Why Use an LLM Gateway?

The LLM Gateway is a self-hosted, intelligent proxy that sits between your applications and LLM providers. Instead of scattering provider-specific code across your codebase, you route all requests through a single endpoint and let the gateway handle the rest.

Single API, many providers

One endpoint for OpenAI, Anthropic, Google, DeepInfra, Groq, Azure OpenAI, and local models. Your application sends an OpenAI-format or Anthropic-format request; the gateway translates and routes it to any configured provider. Switching providers requires zero code changes.

Automatic failover

If a provider goes down, requests transparently route to the next-best model in the failover chain. Circuit breakers detect outages and stop sending traffic to unhealthy providers. Retry with decorrelated jitter handles transient errors. Users see an error only when every candidate in the failover chain has failed.

Cost optimization

Intelligent routing considers cost, quality, and latency to pick the best model for each request. Approximate token counting enforces budget ceilings before requests are sent. Response caching avoids duplicate LLM calls. Cost is tracked per request with full visibility.

Security & compliance

Prompt sanitization strips PII and credentials before they reach providers. An output compliance gate scans every response for leaked secrets and sensitive data. Data classification enforcement ensures classified content only reaches authorized providers. Everything runs on your infrastructure, so requests pass only between your gateway and the providers you configure, with no intermediary service in between.

Observability

Prometheus-compatible metrics at /metrics, structured JSON logs with one line per request, and full routing audit trails. You always know which provider handled each request, how long it took, and what it cost.

Rate limiting & quotas

Per-tenant RPM/TPM/RPD/concurrent limits protect against abuse. Provider quota tracking avoids 429 storms by deprioritizing providers at 90% utilization and blocking at 100%.

Format translation

Send an OpenAI-format request and have it routed to Anthropic (or vice versa) with automatic format translation — including streaming SSE events, tool calls with ID preservation, and image content.

Model management

Model aliases let you use friendly names like fast or best. Deprecation handling issues warnings before sunset and auto-redirects to replacements after. Conversation-sticky routing keeps multi-turn conversations on the same model.

Self-hosted

Runs on your infrastructure. No vendor lock-in. MIT licensed. A single binary with no runtime dependencies beyond an optional PostgreSQL database.

2. Installation

A) Binary Download (recommended)

Download the latest release for your platform using the GitHub CLI. These commands always fetch the most recent version.

Windows (PowerShell)

gh release download --repo quantum-intelligence-group/llm-gateway --pattern 'llm-gateway-windows-x64.zip' --dir $env:TEMP
Expand-Archive "$env:TEMP\llm-gateway-windows-x64.zip" -DestinationPath "$env:USERPROFILE\.llm-gateway" -Force
$env:PATH += ";$env:USERPROFILE\.llm-gateway"

macOS (ARM / Apple Silicon)

gh release download --repo quantum-intelligence-group/llm-gateway --pattern 'llm-gateway-macos-arm64.tar.gz' --dir /tmp
tar -xzf /tmp/llm-gateway-macos-arm64.tar.gz -C /usr/local/bin
chmod +x /usr/local/bin/llm-gateway

Linux (x64)

gh release download --repo quantum-intelligence-group/llm-gateway --pattern 'llm-gateway-linux-x64.tar.gz' --dir /tmp
tar -xzf /tmp/llm-gateway-linux-x64.tar.gz -C /usr/local/bin
chmod +x /usr/local/bin/llm-gateway

B) Docker

Single container

docker pull ghcr.io/quantum-intelligence-group/llm-gateway:latest
docker run -p 8930:8930 -v ./config.toml:/etc/llm-gateway/config.toml ghcr.io/quantum-intelligence-group/llm-gateway:latest

Docker Compose (with Postgres)

docker compose up

The repo includes a docker-compose.yml that builds from source and provisions Postgres.

Docker networking note: When running inside Docker, set server.host = "0.0.0.0" in your config.toml so the gateway binds to all interfaces and is reachable through the port mapping. The default 127.0.0.1 only listens inside the container.

C) Build from Source

Prerequisites: Git and a Rust toolchain with cargo, both used by the commands below.

git clone https://github.com/quantum-intelligence-group/llm-gateway.git
cd llm-gateway
cargo build --release

The compiled binary will be at target/release/llm-gateway (or target\release\llm-gateway.exe on Windows).

Windows network-drive note: If your repo is on a network drive, incremental compilation may fail. Set $env:CARGO_INCREMENTAL = "0" and optionally redirect the target directory to local disk with $env:CARGO_TARGET_DIR = "C:\temp\llm-gateway-target".

Updating

Re-run the same download command — it always fetches the latest release. Or use the built-in self-update:

# Check for updates without applying
llm-gateway update --check

# Download and apply the latest release
llm-gateway update

You can enable automatic update checks at startup in your config:

[update]
mode = "auto"    # "auto" checks on startup; "manual" (default) requires explicit command

After an update is applied, restart the gateway process to use the new version.

Quick Start

After installing the gateway by any of the methods above, follow these steps to get it running.

1. Create a minimal config

Copy the example config and add at least one provider API key:

cp config.example.toml config.toml

Edit config.toml and add your provider credentials under [[providers]]. At minimum you need one provider:

[server]
host = "127.0.0.1"
port = 8930

[[providers]]
id = "openai"
provider_type = "openai"
api_key = "sk-..."

2. Start the gateway

# Start with config.toml in the current directory
llm-gateway

# Or specify a config file and port explicitly
llm-gateway --config /path/to/config.toml --port 9000

On success, the gateway logs the listen address and configured providers to the console.

3. Verify it's running

curl http://localhost:8930/health

Expected response:

{
  "status": "ok",
  "version": "0.3.1",
  "uptime_secs": 5,
  "port": 8930,
  "providers": ["openai"]
}

4. Send your first request

curl http://localhost:8930/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

The gateway proxies the request to the configured provider, applies routing and compliance rules, and returns the response in the same format. See the API Reference for the full list of endpoints and headers.
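
The same request can be built from application code. The following is a minimal Python sketch using only the standard library, assuming the gateway listens on the default http://localhost:8930. The helper names build_chat_request and send are illustrative and not part of the gateway.

```python
import json
import urllib.request

# Assumption: the default host and port from the minimal config above.
GATEWAY_URL = "http://localhost:8930"


def build_chat_request(model, user_text, base_url=GATEWAY_URL):
    """Build (but do not send) an OpenAI-format chat completion request."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout_secs=30):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout_secs) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the gateway running, send(build_chat_request("gpt-4o", "Hello!")) should return the decoded response body.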

3. Configuration

Configuration is loaded from a TOML file (default: config.toml). Every section is optional and has sensible defaults. You can override the config path with --config <PATH> and the port with --port <PORT>.

[server]

Key   Type    Default      Description
host  string  "127.0.0.1"  Bind address. Use "0.0.0.0" for Docker or external access.
port  u16     8930         Listen port.

[database]

Key              Type     Default  Description
url              string?  none     PostgreSQL connection URL. If omitted, records are stored in-memory only.
max_connections  u32      5        Maximum database connection pool size.
store_bodies     bool     true     Whether to persist full request/response bodies.

[update]

Key   Type    Default   Description
mode  string  "manual"  "auto" checks for updates on startup. "manual" requires the llm-gateway update command.

[sanitization]

Key                 Type     Default  Description
enabled             bool     false    Enable the input sanitization pipeline.
scan_credentials    bool     true     Scan prompts for API keys, tokens, and connection strings.
redact_pii          bool     false    Redact personally-identifiable information (emails, phone numbers, SSNs).
compliance_gate     bool     false    Enable the output compliance gate that scans responses.
max_classification  string?  none     Maximum allowed data classification level (public, internal, cui, itar, classified).

[resilience]

[resilience.default_retry_policy]

Key                 Type   Default                         Description
max_attempts        u32    3                               Maximum number of attempts (including the initial request).
base_delay_ms       u64    200                             Base delay in milliseconds for the first retry.
max_delay_ms        u64    5000                            Maximum delay cap between retries.
jitter              bool   true                            Use decorrelated jitter (AWS-style) to spread retry load.
retryable_statuses  [u16]  [408, 429, 500, 502, 503, 504]  HTTP status codes that trigger a retry.
honor_retry_after   bool   true                            Respect the Retry-After header from providers.

[resilience.timeout]

Key               Type  Default  Description
connect_ms        u64   5000     TCP connection timeout in milliseconds.
default_total_ms  u64   300000   Total request timeout (5 minutes). Override per-request via x-gateway-timeout.

[resilience.default_circuit_breaker]

Key                       Type  Default  Description
failure_threshold         u32   5        Failures within the window that trigger the circuit to open.
failure_window_secs       u64   60       Sliding window for counting failures.
open_duration_secs        u64   30       How long the circuit stays open before transitioning to half-open.
half_open_max_concurrent  u32   1        Number of probe requests allowed in half-open state.

[resilience.hedge]

Key             Type  Default  Description
enabled         bool  false    Enable hedged requests for the RealTime routing profile.
hedge_delay_ms  u64   300      Delay before firing the hedged (second) request.

[cache]

Key                    Type   Default  Description
enabled                bool   false    Enable exact-match response caching.
max_entries            usize  1000     Maximum number of cached responses (LRU eviction).
ttl_secs               u64    3600     Time-to-live for cached entries in seconds.
temperature_threshold  f64    0.3      Only cache responses when temperature <= threshold. Higher temperatures produce non-deterministic output.
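
The behavior these settings describe can be sketched as follows. This is a minimal Python illustration, not the gateway's implementation: the ResponseCache class, the SHA-256 key over canonical JSON, and the choice to treat a missing temperature as 1.0 are all assumptions made for the example.

```python
import hashlib
import json
import time
from collections import OrderedDict


class ResponseCache:
    """Exact-match cache with LRU eviction, a TTL, and a temperature gate."""

    def __init__(self, max_entries=1000, ttl_secs=3600,
                 temperature_threshold=0.3, clock=time.monotonic):
        self.max_entries = max_entries
        self.ttl_secs = ttl_secs
        self.temperature_threshold = temperature_threshold
        self.clock = clock
        self._entries = OrderedDict()  # key -> (stored_at, response)

    def cacheable(self, request):
        # Assumption: a request without an explicit temperature is treated as 1.0.
        return request.get("temperature", 1.0) <= self.temperature_threshold

    @staticmethod
    def key_for(request):
        # Assumption: the key is a hash of the request body in canonical JSON form.
        canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def get(self, request):
        key = self.key_for(request)
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, response = entry
        if self.clock() - stored_at > self.ttl_secs:
            del self._entries[key]  # expired
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return response

    def put(self, request, response):
        if not self.cacheable(request):
            return
        key = self.key_for(request)
        self._entries[key] = (self.clock(), response)
        self._entries.move_to_end(key)
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict the least recently used entry
```

The clock parameter exists only so the TTL can be exercised in tests without waiting.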

[rate_limit]

Key                        Type  Default  Description
enabled                    bool  false    Enable per-tenant rate limiting.
default_limits.rpm         u32   60       Requests per minute.
default_limits.tpm         u32   100000   Tokens per minute.
default_limits.rpd         u32   10000    Requests per day.
default_limits.concurrent  u32   10       Maximum concurrent in-flight requests.

Per-tenant overrides use the [rate_limit.tenant_limits.<tenant_id>] section with the same fields.

[provider_quota]

Key                       Type  Default  Description
enabled                   bool  false    Enable provider-level quota tracking.
deprioritize_threshold    f64   0.9      Fraction of quota at which the provider is deprioritized in routing (0.9 = 90%).
provider_limits.<id>.rpm  u32   1000     Per-provider requests per minute limit.
provider_limits.<id>.tpm  u32   1000000  Per-provider tokens per minute limit.
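
As a rough illustration of the thresholds above, the sketch below classifies a provider by its quota utilization: deprioritized at the configured fraction and blocked at 100%, as described in the introduction. The function names and the rule that the stricter of the RPM and TPM states wins are assumptions for this example.

```python
def quota_state(used, limit, deprioritize_threshold=0.9):
    """Classify usage against one limit for the current window."""
    utilization = used / limit
    if utilization >= 1.0:
        return "blocked"
    if utilization >= deprioritize_threshold:
        return "deprioritized"
    return "available"


def provider_state(rpm_used, tpm_used, rpm_limit=1000, tpm_limit=1_000_000,
                   deprioritize_threshold=0.9):
    """Combine RPM and TPM states; assumption: the stricter state wins."""
    severity = ["available", "deprioritized", "blocked"]
    states = [
        quota_state(rpm_used, rpm_limit, deprioritize_threshold),
        quota_state(tpm_used, tpm_limit, deprioritize_threshold),
    ]
    return max(states, key=severity.index)
```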

[model_routing]

Key              Type  Default  Description
sticky_ttl_secs  u64   900      How long (in seconds) a conversation sticks to the same model. Conversations are identified by the x-gateway-conversation-id request header.

[model_routing.aliases]

A key-value map of alias names to real model IDs:

[model_routing.aliases]
fast = "groq/llama-3.3-70b"
best = "anthropic/claude-opus-4"
code = "anthropic/claude-sonnet-4"

[[model_routing.deprecations]]

A list of deprecated models with sunset dates and optional replacements:

[[model_routing.deprecations]]
model = "anthropic/claude-3-opus"
sunset_date = "2026-06-01"
replacement = "anthropic/claude-opus-4"

Before the sunset date, the gateway adds an x-gateway-deprecation response header as a warning. After the sunset date, requests are automatically redirected to the replacement model.
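
The alias and deprecation rules can be sketched in a few lines of Python. This is an illustration only: resolve_model is a hypothetical helper, and two details are assumptions because the text does not specify them: requests on the sunset date itself are redirected, and a sunset model without a replacement is left unchanged.

```python
from datetime import date

# Values copied from the examples above.
ALIASES = {
    "fast": "groq/llama-3.3-70b",
    "best": "anthropic/claude-opus-4",
    "code": "anthropic/claude-sonnet-4",
}
DEPRECATIONS = [
    {
        "model": "anthropic/claude-3-opus",
        "sunset_date": "2026-06-01",
        "replacement": "anthropic/claude-opus-4",
    },
]


def resolve_model(requested, today, aliases=ALIASES, deprecations=DEPRECATIONS):
    """Return (model_to_use, extra_response_headers) for a requested model name."""
    model = aliases.get(requested, requested)
    headers = {}
    for entry in deprecations:
        if entry["model"] != model:
            continue
        sunset = date.fromisoformat(entry["sunset_date"])
        if today < sunset:
            # Before sunset: serve the model but warn the client.
            headers["x-gateway-deprecation"] = entry["sunset_date"]
        elif entry.get("replacement"):
            # On or after sunset (assumption): redirect to the replacement.
            model = entry["replacement"]
    return model, headers
```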

[[providers]]

Key                     Type     Required  Description
id                      string   yes       Unique identifier for this provider instance (e.g., "anthropic", "openai").
provider_type           string   yes       One of: anthropic, openai, google, deep_infra, groq, azure_openai, local.
api_key                 string?  no        API key for the provider. Not needed for local models.
endpoint                string?  no        Override the provider's default API endpoint.
deployment              string?  no        Azure OpenAI deployment name.
api_version             string?  no        Azure OpenAI API version.
enabled                 bool     no        Default: true. Set to false to disable without removing.
retry_policy            object?  no        Override the default retry policy for this provider.
circuit_breaker_config  object?  no        Override the default circuit breaker for this provider.

Default endpoints by provider type:

Provider Type  Default Endpoint
anthropic      https://api.anthropic.com
openai         https://api.openai.com/v1
google         https://generativelanguage.googleapis.com/v1beta/openai
deep_infra     https://api.deepinfra.com/v1/openai
groq           https://api.groq.com/openai/v1
azure_openai   Must be specified (includes deployment)
local          Must be specified

[key_map.*]

Virtual API key mapping for multi-tenant setups. Each entry maps a virtual key to provider-specific keys:

[key_map.team-alpha-key]
anthropic_key = "sk-ant-..."
openai_key = "sk-..."
label = "Team Alpha"

[key_map.team-beta-key]
anthropic_key = "sk-ant-..."
openai_key = "sk-..."
label = "Team Beta"

Clients authenticate with the virtual key in the Authorization header. The gateway resolves the correct provider key at proxy time and uses the virtual key as the tenant ID for rate limiting.

Full Example Configuration

[server]
host = "127.0.0.1"
port = 8930

[database]
url = "postgres://llm_gateway:llm_gateway@localhost/llm_gateway"
max_connections = 5
store_bodies = true

[update]
mode = "manual"

[sanitization]
enabled = true
scan_credentials = true
redact_pii = true
compliance_gate = true
max_classification = "cui"

[resilience.default_retry_policy]
max_attempts = 3
base_delay_ms = 200
max_delay_ms = 5000
jitter = true
retryable_statuses = [408, 429, 500, 502, 503, 504]
honor_retry_after = true

[resilience.timeout]
connect_ms = 5000
default_total_ms = 300000

[resilience.default_circuit_breaker]
failure_threshold = 5
failure_window_secs = 60
open_duration_secs = 30
half_open_max_concurrent = 1

[resilience.hedge]
enabled = true
hedge_delay_ms = 300

[cache]
enabled = true
max_entries = 1000
ttl_secs = 3600
temperature_threshold = 0.3

[rate_limit]
enabled = true

[rate_limit.default_limits]
rpm = 60
tpm = 100000
rpd = 10000
concurrent = 10

[rate_limit.tenant_limits.premium-team]
rpm = 300
tpm = 500000
rpd = 50000
concurrent = 25

[provider_quota]
enabled = true
deprioritize_threshold = 0.9

[provider_quota.provider_limits.openai]
rpm = 500
tpm = 2000000

[provider_quota.provider_limits.anthropic]
rpm = 400
tpm = 1000000

[model_routing]
sticky_ttl_secs = 900

[model_routing.aliases]
fast = "groq/llama-3.3-70b"
best = "anthropic/claude-opus-4"
code = "anthropic/claude-sonnet-4"

[[model_routing.deprecations]]
model = "anthropic/claude-3-opus"
sunset_date = "2026-06-01"
replacement = "anthropic/claude-opus-4"

[[providers]]
id = "anthropic"
provider_type = "anthropic"
api_key = "sk-ant-your-key-here"

[[providers]]
id = "openai"
provider_type = "openai"
api_key = "sk-your-key-here"

[[providers]]
id = "google"
provider_type = "google"
api_key = "AIza-your-key-here"

[[providers]]
id = "groq"
provider_type = "groq"
api_key = "gsk_your-key-here"

[[providers]]
id = "deepinfra"
provider_type = "deep_infra"
api_key = "your-deepinfra-key"

[[providers]]
id = "local-ollama"
provider_type = "local"
endpoint = "http://localhost:11434/v1"

[key_map.team-alpha-key]
anthropic_key = "sk-ant-alpha-key"
openai_key = "sk-alpha-key"
label = "Team Alpha"

4. Routing Engine

The routing engine selects the best model for each request by running a 5-layer pipeline. Each layer narrows the candidate pool until a ranked list of up to 3 models is returned as a failover chain.

5-Layer Routing Pipeline

Layer  Name                        Purpose
1      Data Classification Filter  Enforces the x-data-classification header. Models whose authorized classification levels do not meet the request's classification are eliminated. Classification levels from lowest to highest: public, internal, cui, itar, classified.
2      Capability Filter           Removes models that cannot handle the request's requirements. Checks include: tool use, vision, JSON mode, extended thinking, streaming, and any required capabilities specified by the routing profile. Prevents silent downgrades.
3      Quality Scoring             Weights surviving models by quality signals (coding score, reasoning score, instruction following, arena ELO) according to the profile's quality weight (low=0.1, medium=0.3, high=0.6, very_high=0.9).
4      Cost Scoring                Weights models by price per token. Lower-cost models score higher. The profile's cost weight controls the tradeoff between cost and quality.
5      Latency Filter              Applies TTFT (Time to First Token) thresholds based on the profile's latency tolerance. realtime=500ms, interactive=2000ms, background and flexible=no limit. Models with median TTFT above the threshold are eliminated.
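
The three elimination layers (1, 2, and 5) can be sketched as plain filters. This Python example is illustrative only: the ModelCard shape is hypothetical, it assumes each model is authorized up to a single maximum classification level, and it leaves out the scoring layers (3 and 4) because their exact formula is not given here.

```python
from dataclasses import dataclass, field

CLASSIFICATION_LEVELS = ["public", "internal", "cui", "itar", "classified"]
# TTFT thresholds per latency tolerance; None means no limit.
TTFT_LIMIT_MS = {"realtime": 500, "interactive": 2000, "background": None, "flexible": None}


@dataclass
class ModelCard:  # hypothetical shape, for illustration only
    model_id: str
    max_classification: str
    capabilities: set = field(default_factory=set)
    median_ttft_ms: int = 0


def filter_candidates(models, classification, required_capabilities, latency_tolerance):
    """Apply the classification, capability, and latency filters in order."""
    required_rank = CLASSIFICATION_LEVELS.index(classification)
    ttft_limit = TTFT_LIMIT_MS[latency_tolerance]
    survivors = []
    for model in models:
        # Layer 1: the model must be authorized for the request's level or higher.
        if CLASSIFICATION_LEVELS.index(model.max_classification) < required_rank:
            continue
        # Layer 2: every required capability must be present.
        if not set(required_capabilities) <= model.capabilities:
            continue
        # Layer 5: median TTFT must not exceed the threshold, if there is one.
        if ttft_limit is not None and model.median_ttft_ms > ttft_limit:
            continue
        survivors.append(model)
    return survivors
```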

Routing Profiles

Routing profiles control how the pipeline weights quality, cost, and latency. Select a profile per-request with the x-routing-profile header, or let the gateway use the default (balanced).

Profile            Quality    Cost       Latency                 Best For
realtime           Medium     Low        Realtime (≤500ms TTFT)  Autocomplete, inline suggestions, chat with sub-second response
interactive        High       Medium     Interactive (≤2s TTFT)  Standard chat, Q&A, moderate-complexity tasks
batch              Medium     High       Background (no limit)   Bulk processing, data extraction, overnight jobs
cost_optimized     Low        Very High  Flexible                High-volume, cost-sensitive workloads
quality_optimized  Very High  Low        Background              Complex reasoning, legal/medical analysis, critical outputs
balanced           High       Medium     Interactive             General-purpose default
reasoning          Very High  Low        Background              Math, logic, multi-step reasoning (prefers extended thinking)
creative           High       Medium     Interactive             Writing, brainstorming, content generation
code               Very High  Medium     Interactive             Code generation, review, debugging

Failover Chains

The routing engine does not return a single model. It returns a ranked list of up to 3 candidates forming a failover chain. The proxy handler walks the chain in order:

  1. Try candidate #1 with the full retry policy (up to max_attempts).
  2. If all retries fail (or the circuit breaker is open), move to candidate #2.
  3. If candidate #2 also fails, try candidate #3.
  4. If the entire chain is exhausted, return the last error to the client.

Models with an open circuit breaker are automatically filtered out of the chain, so traffic is never sent to a known-unhealthy provider.
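
The chain walk above can be sketched as a small loop. This is a simplified Python illustration under stated assumptions: call_with_failover, send, and circuit_open are hypothetical names, every exception is treated as retryable, and the backoff delay between attempts is omitted.

```python
def call_with_failover(chain, send, max_attempts=3, circuit_open=lambda model: False):
    """Walk a failover chain of up to 3 candidates, retrying each one in turn.

    chain: ranked list of model IDs. send(model) returns a response or raises.
    Returns (model, response) for the first success, else raises the last error.
    """
    last_error = None
    for model in chain[:3]:
        if circuit_open(model):
            continue  # never send traffic to a candidate with an open circuit
        for _attempt in range(max_attempts):
            try:
                return model, send(model)
            except Exception as error:  # simplification: all errors are retried
                last_error = error
    if last_error is None:
        raise RuntimeError("no candidate available: every circuit is open")
    raise last_error
```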

Routing Simulator

The portal includes an interactive Routing Simulator tab that lets you visualize how the 5-layer pipeline processes a request. Open the portal at http://localhost:8930 and click the Simulator tab.

Using the Simulator

  1. Select a profile — choose a predefined routing profile (e.g., Coding, Architecture) to pre-fill quality weight, cost weight, latency tolerance, and required capabilities. Select “Custom” to configure each parameter manually.
  2. Set data classification — choose the data sensitivity level (Public, Internal, CUI, ITAR, Classified). Models not authorized for the selected level are eliminated in Layer 1.
  3. Check required capabilities — select capabilities the model must support (tool use, streaming, extended thinking, vision, JSON mode, citations, embeddings, batch API). Models lacking any checked capability are eliminated in Layer 2.
  4. Adjust weights — tune quality weight, cost weight, latency tolerance, and reasoning mode to control how the pipeline scores and ranks surviving models.
  5. Set max cost (optional) — enter a dollar amount per request. Models whose estimated cost exceeds this threshold are eliminated in Layer 4.
  6. Click “Run Simulation” — the simulator sends your parameters plus the live model catalog to the POST /v1/route endpoint and displays the results.

Reading the Results

The results panel shows each routing layer as a step:

At the bottom, the Failover Chain shows the top 3 models that would be tried in order: primary, fallback 1, and fallback 2. Each model chip shows its provider badge and display name, with rank numbers indicating position.

Analytics Dashboard

The Analytics tab provides usage, cost, and quality insights across all gateway traffic. It requires a PostgreSQL database to be configured — without one, the tab shows a graceful "no data" state.

Time Period Selection

Use the period buttons at the top right to filter data: 24h, 7 Days, 30 Days, or All Time. The dashboard auto-selects the appropriate time bucket (hourly for 24h, daily for longer periods).

Summary Cards

Five cards show aggregate metrics for the selected period:

Usage Charts

Model Quality

The quality table aggregates adaptive routing signals reported via POST /v1/signals. Metrics per model include:

Natural-Language Query

At the bottom of the Analytics tab, the Ask a Question section lets you query your usage data in natural language. Type a question like “What are the top 5 most expensive models this week?” and click Ask.

Behind the scenes, the gateway sends your question to its own LLM proxy, which generates a SQL query against the proxy_records and quality_signals tables. The query is validated for safety (SELECT-only, table whitelist, no destructive keywords) and executed in a read-only transaction with a 5-second timeout. Results are displayed as a table below the generated SQL.

Requirements: A PostgreSQL database must be configured, and at least one LLM provider must be available to generate the SQL.
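
The safety checks named above (SELECT-only, table whitelist, no destructive keywords) could look roughly like the following. This Python sketch is an assumption about the shape of such a validator, not the gateway's code: it uses keyword and regex matching rather than a real SQL parser, so it is deliberately conservative and may reject harmless queries. The read-only transaction and the 5-second timeout are enforced at execution time and are not shown.

```python
import re

ALLOWED_TABLES = {"proxy_records", "quality_signals"}
# Assumption: an illustrative list of destructive keywords.
FORBIDDEN_KEYWORDS = {
    "insert", "update", "delete", "drop", "alter",
    "truncate", "create", "grant", "revoke",
}


def is_safe_query(sql):
    """Return True only for a single SELECT over whitelisted tables."""
    statement = sql.strip().rstrip(";").strip()
    if ";" in statement:
        return False  # more than one statement
    lowered = statement.lower()
    if not lowered.startswith("select"):
        return False
    words = set(re.findall(r"[a-z_]+", lowered))
    if words & FORBIDDEN_KEYWORDS:
        return False
    # Every table named after FROM or JOIN must be on the whitelist.
    tables = re.findall(r"\b(?:from|join)\s+([a-z_][a-z0-9_.]*)", lowered)
    return bool(tables) and all(table in ALLOWED_TABLES for table in tables)
```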

5. Resilience Features

Retry Policy

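Every provider request is wrapped in a configurable retry loop. With the defaults from [resilience.default_retry_policy], the gateway makes up to 3 attempts (including the initial request), waits 200 ms before the first retry with decorrelated jitter, caps the delay between retries at 5,000 ms, retries on HTTP 408, 429, 500, 502, 503, and 504, and honors the provider's Retry-After header.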
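
A minimal Python sketch of the delay calculation, assuming the commonly described decorrelated-jitter formula (each delay is drawn uniformly between the base delay and three times the previous delay, then capped). The function names are illustrative, and the way Retry-After overrides the jittered delay is an assumption. Only the numeric-seconds form of Retry-After is handled.

```python
import random

RETRYABLE_STATUSES = {408, 429, 500, 502, 503, 504}


def retry_delays_ms(max_attempts=3, base_delay_ms=200, max_delay_ms=5000, rng=None):
    """Return the waits (in ms) before each retry.

    max_attempts includes the initial request, so max_attempts - 1 delays are produced.
    """
    rng = rng or random.Random()
    delays = []
    previous = base_delay_ms
    for _ in range(max(0, max_attempts - 1)):
        # Decorrelated jitter: uniform between base and 3x the previous delay, capped.
        previous = min(max_delay_ms, rng.uniform(base_delay_ms, previous * 3))
        delays.append(previous)
    return delays


def next_delay_ms(status, jitter_delay_ms, retry_after_secs=None, honor_retry_after=True):
    """Return the wait before the next attempt, or None if the status is not retryable."""
    if status not in RETRYABLE_STATUSES:
        return None
    if honor_retry_after and retry_after_secs is not None:
        # Assumption: a provider-supplied Retry-After replaces the jittered delay.
        return retry_after_secs * 1000
    return jitter_delay_ms
```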

Circuit Breaker

A 3-state circuit breaker tracks health per provider per model:

State            Behavior
Closed (normal)  Requests flow through. Failures are counted in a sliding window.
Open             All requests are immediately rejected (no network call). Entered when failures reach the threshold within the window. Lasts for open_duration_secs.
Half-Open        After the open duration, a limited number of probe requests (half_open_max_concurrent) are allowed through. If probes succeed, the circuit closes. If they fail, it re-opens.

Defaults: 5 failures in 60 seconds opens the circuit for 30 seconds, with 1 concurrent probe in half-open state.
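
The state machine can be sketched as a small class. This Python example is a simplified, single-threaded illustration using the default values above; the class and method names are hypothetical, and it assumes a single successful probe is enough to close the circuit.

```python
import time
from collections import deque


class CircuitBreaker:
    """Closed -> Open -> Half-Open breaker with a sliding failure window."""

    def __init__(self, failure_threshold=5, failure_window_secs=60,
                 open_duration_secs=30, half_open_max_concurrent=1,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.failure_window_secs = failure_window_secs
        self.open_duration_secs = open_duration_secs
        self.half_open_max_concurrent = half_open_max_concurrent
        self.clock = clock
        self.state = "closed"
        self.failures = deque()  # timestamps of recent failures
        self.opened_at = None
        self.probes_in_flight = 0

    def allow_request(self):
        now = self.clock()
        if self.state == "open":
            if now - self.opened_at < self.open_duration_secs:
                return False  # reject without a network call
            self.state = "half_open"
            self.probes_in_flight = 0
        if self.state == "half_open":
            if self.probes_in_flight >= self.half_open_max_concurrent:
                return False
            self.probes_in_flight += 1
        return True

    def record_success(self):
        if self.state == "half_open":
            # Assumption: one successful probe closes the circuit.
            self.state = "closed"
            self.probes_in_flight = 0
            self.failures.clear()

    def record_failure(self):
        now = self.clock()
        if self.state == "half_open":
            self._open(now)  # a failed probe re-opens the circuit
            return
        self.failures.append(now)
        while self.failures and now - self.failures[0] > self.failure_window_secs:
            self.failures.popleft()  # drop failures outside the window
        if len(self.failures) >= self.failure_threshold:
            self._open(now)

    def _open(self, now):
        self.state = "open"
        self.opened_at = now
        self.failures.clear()
```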

Failover Chains

As described in the Routing Engine section, every routing decision produces a ranked chain of up to 3 candidates. The proxy walks the chain with full retry per step. Circuit-open models are filtered from chains before they are tried.

Hedged Requests

For the realtime routing profile only, the gateway can fire a hedged (second) request to the next candidate in the failover chain after a configurable delay. The first response to arrive wins; the other is discarded.

This is implemented using tokio::select! to race both futures. The hedge fires only when:

Default hedge delay: 300ms.
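
The race can be illustrated in Python with asyncio, as an analogue of the tokio::select! approach mentioned above rather than a copy of it. The hedged_call helper and its arguments are hypothetical. The sketch returns whichever request finishes first, including a failure, which a production implementation might handle differently.

```python
import asyncio


async def hedged_call(primary, secondary, hedge_delay_ms=300):
    """Race primary() against secondary(), starting secondary only after the hedge delay."""
    first = asyncio.ensure_future(primary())
    done, _ = await asyncio.wait({first}, timeout=hedge_delay_ms / 1000)
    if done:
        return first.result()  # primary finished before the hedge fired

    second = asyncio.ensure_future(secondary())
    done, pending = await asyncio.wait(
        {first, second}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # discard the slower request
    await asyncio.gather(*pending, return_exceptions=True)
    winner = first if first in done else second
    return winner.result()
```

Here primary and secondary are zero-argument coroutine functions, for example wrappers that call the first and second candidates in the failover chain.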

Timeout Tiers

Tier               Default            Description
Connect            5,000ms            TCP connection timeout to the provider.
TTFT               500ms–2,000ms      Time to first token, based on the routing profile's latency tolerance. Used during routing to filter slow models.
Total              300,000ms (5 min)  Total end-to-end timeout. Override per-request with x-gateway-timeout header (value in seconds).
Client disconnect  n/a                When the client closes the connection, the gateway cancels the upstream request (for non-streaming requests).

6. Format Translation

The gateway translates between provider wire formats using a Canonical Intermediate Representation (IR). This means you can send an OpenAI-format request and have it routed to Anthropic, or vice versa, with no code changes.

Canonical IR

Internally, every request and response passes through canonical types, including CanonicalRequest, CanonicalResponse, and CanonicalEvent (for streaming).

Translation Pipeline

  1. Inbound: The incoming request (OpenAI or Anthropic format) is parsed into a CanonicalRequest.
  2. Routing: The routing engine selects the best model/provider.
  3. Outbound: The CanonicalRequest is serialized into the target provider's wire format.
  4. Response: The provider's response is parsed back into a CanonicalResponse and serialized into the client's expected format.
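
Steps 1 and 3 can be sketched for the simplest case, a text-only OpenAI-format request sent to an Anthropic-format provider. This Python example is an illustration under stated assumptions: the fields of CanonicalRequest are hypothetical, tool calls, images, and streaming are ignored, and the default max_tokens value is arbitrary. It relies on two known differences between the formats: Anthropic takes the system prompt as a top-level field and requires max_tokens.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CanonicalRequest:  # hypothetical fields, for illustration only
    model: str
    messages: list
    system: Optional[str] = None
    max_tokens: Optional[int] = None
    temperature: Optional[float] = None


def from_openai(body):
    """Inbound: parse an OpenAI Chat Completions body into the canonical form."""
    system_parts = [m["content"] for m in body["messages"] if m["role"] == "system"]
    turns = [
        {"role": m["role"], "content": m["content"]}
        for m in body["messages"]
        if m["role"] != "system"
    ]
    return CanonicalRequest(
        model=body["model"],
        messages=turns,
        system="\n\n".join(system_parts) or None,
        max_tokens=body.get("max_tokens"),
        temperature=body.get("temperature"),
    )


def to_anthropic(request, target_model, default_max_tokens=1024):
    """Outbound: serialize the canonical form as an Anthropic Messages body."""
    body = {
        "model": target_model,
        "messages": request.messages,
        # Anthropic requires max_tokens; the fallback value is an assumption.
        "max_tokens": request.max_tokens or default_max_tokens,
    }
    if request.system is not None:
        body["system"] = request.system
    if request.temperature is not None:
        body["temperature"] = request.temperature
    return body
```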

Streaming SSE Translation

For streaming requests, each SSE event from the provider is parsed into a CanonicalEvent, then emitted in the client's expected SSE format. The translation is bidirectional across OpenAI, Anthropic, and Google formats.

Tool Call Translation

Tool calls and tool results are translated across all supported formats with tool call ID preservation. A tool call made in OpenAI format can be correctly resolved when the response comes from Anthropic, and vice versa.

Capability Gates

Each provider adapter reports its capabilities (tool use, vision, JSON mode, streaming, etc.). The routing engine uses these to prevent silent downgrades — if a request requires tool use, models that do not support it are eliminated before routing.

7. Security & Compliance

Input Sanitization

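When sanitization.enabled = true, every prompt is scanned before it reaches the provider. The scan_credentials setting controls scanning for API keys, tokens, and connection strings, and redact_pii controls redaction of emails, phone numbers, and SSNs.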

Output Compliance Gate

When sanitization.compliance_gate = true, every response (including streaming chunks) is scanned for leaked secrets and sensitive data.

When a violation is detected, the gateway takes one of three actions based on the ViolationAction:

ActionBehavior
BlockThe entire response is blocked and a compliance error is returned to the client.
RedactThe offending content is replaced with [REDACTED] and the response is delivered.
LogOnlyThe violation is logged and recorded in the routing trace, but the response is delivered unmodified.
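
The three actions can be sketched as follows. This Python example is illustrative only: apply_compliance is a hypothetical helper, and the single regular expression stands in for whatever detectors the gateway actually uses.

```python
import re

# Assumption: one toy detector for key-like strings; real detection is broader.
SECRET_PATTERN = re.compile(r"sk-[A-Za-z0-9_-]{8,}")


def apply_compliance(text, action):
    """Return (delivered_text, violations); delivered_text is None when blocked."""
    violations = SECRET_PATTERN.findall(text)
    if not violations:
        return text, []
    if action == "Block":
        return None, violations  # caller returns a compliance error instead
    if action == "Redact":
        return SECRET_PATTERN.sub("[REDACTED]", text), violations
    if action == "LogOnly":
        return text, violations  # delivered unmodified; caller logs the violations
    raise ValueError(f"unknown ViolationAction: {action}")
```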

Data Classification Enforcement

The x-data-classification request header declares the sensitivity level of the data being sent. The routing engine ensures the request is only routed to providers and models authorized for that classification level.

Classification levels (lowest to highest):

  1. public — publicly available data
  2. internal — internal business data (default)
  3. cui — Controlled Unclassified Information
  4. itar — International Traffic in Arms Regulations data
  5. classified — classified information

Streaming Buffer

For streaming responses, the compliance gate buffers SSE events and scans them in real time, so that sensitive content is caught before it reaches the client.

8. Observability

Prometheus Metrics

Available at GET /metrics in Prometheus exposition format. Key metrics include:

Metric                            Type       Description
gateway_requests_total            Counter    Total requests by provider, model, and status code.
gateway_request_duration_seconds  Histogram  End-to-end request latency by provider.
gateway_tokens_total              Counter    Token counts by direction (input/output) and provider.
gateway_cost_dollars              Counter    Estimated cost in USD by provider and model.
gateway_cache_hits_total          Counter    Cache hit count.
gateway_cache_misses_total        Counter    Cache miss count.
gateway_failovers_total           Counter    Failover events by source and target provider.

Structured JSON Logging

Every request produces a single structured JSON log line (RequestLogEntry) containing:

Logging uses the standard tracing framework. Configure the log level via the RUST_LOG environment variable (e.g., RUST_LOG=info).

Routing Audit Trail

Every request generates a full RoutingTrace that records:

Query traces via GET /v1/audit (paginated list) and GET /v1/audit/{request_id} (full detail).

9. API Reference

Endpoints

Method  Path                           Description
POST    /v1/messages                   Anthropic-format proxy. Send Anthropic Messages API requests; the gateway routes to the best provider.
POST    /v1/chat/completions           OpenAI-format proxy. Send OpenAI Chat Completions requests; the gateway routes to the best provider.
GET     /v1/models                     List available models (OpenAI-compatible format).
GET     /v1/catalog                    Full model catalog with detailed model cards (capabilities, pricing, quality signals). Reflects the live runtime catalog: startup seed plus any sync results (B-042).
POST    /v1/catalog/sync               Trigger an immediate catalog sync against every enabled provider's /models endpoint. Returns per-provider discovered/added/retired/unchanged counts and errors (B-042).
GET     /v1/catalog/sync-status        Returns last_synced (RFC 3339), model_count, enabled_providers, and configured sync_interval_secs. Used by the portal Last-synced display (B-042).
GET     /v1/routing-profiles           List all available routing profiles with their configuration.
GET     /v1/providers                  List configured providers and their status.
POST    /v1/providers                  Add a new provider at runtime (does not persist to config file).
DELETE  /v1/providers/{id}             Remove a provider at runtime.
GET     /v1/projects                   List active projects (groupings of requests).
POST    /v1/route                      Dry-run routing decision. Returns the routing output without making a provider call.
GET     /v1/audit                      List recent routing traces (paginated).
GET     /v1/audit/{request_id}         Full routing trace for a specific request.
GET     /v1/stats                      Gateway statistics (total requests, tokens, cost, provider breakdown).
GET     /v1/requests                   Paginated request history.
GET     /v1/requests/{id}              Single request detail with full request/response bodies.
GET     /health                        Health check. Returns 200 OK with gateway status.
GET     /metrics                       Prometheus-format metrics.
GET     /v1/analytics                  Usage analytics with time-series, per-model, per-provider, and per-project breakdowns. Query params: from, to, bucket, model, provider, project, source.
GET     /v1/quality                    Per-model quality metrics aggregated from routing signals.
GET     /v1/quality/{model_card_id}    Time-series quality metrics for a specific model.
POST    /v1/nl-query                   Natural-language query. Accepts {"question":"..."}, generates and executes a read-only SQL query.
POST    /v1/signals                    Record a quality signal. Body: {"model_card_id","task_type","signal_type","value"}.
GET     /v1/signals                    Drain and return all pending quality signals.
POST    /v1/alerts                     Check for model degradation alerts from signal aggregations.
POST    /v1/catalog/{model_id}/scores  Update quality scores for a model. Body: {"coding_score","reasoning_score","document_score","classification_score","instruction_following","arena_elo"}. Returns previous and updated scores with rollback history.
GET     /v1/analytics/errors           Error dashboard with customer-impact classification. Returns errors classified as customer-impacting (request failed) or mitigated (failover succeeded). Query params: from, to, model, provider, source.
GET     /v1/analytics/requesters       Per-requester analytics with success rates and model breakdown. Requesters identified by x-source header. Query params: from, to, model, provider, source.
DELETE  /v1/analytics                  Clear all analytics data (proxy records, quality signals, shadow results).
GET     /docs                          Interactive API documentation (Scalar viewer).
GET     /openapi.json                  OpenAPI 3.1 specification (machine-readable).
GET     /pp/version                    PPF protocol version info and negotiation.

Request Headers

Header                     Type     Description
x-routing-profile          string   Override the routing profile for this request. One of: realtime, interactive, batch, cost_optimized, quality_optimized, balanced, reasoning, creative, code.
x-data-classification      string   Data classification level: public, internal, cui, itar, classified. Default: internal.
x-gateway-timeout          integer  Override the total request timeout, in seconds.
x-gateway-failover         string   Set to disabled to disable failover for this request. Only the first candidate will be tried.
x-gateway-conversation-id  string   Sticky routing key. Requests with the same conversation ID are routed to the same model for up to sticky_ttl_secs.
x-ppf-protocol-version     string   PPF protocol version negotiation header.
Idempotency-Key            string   Idempotency key for POST proxy requests. Duplicate requests with the same key and tenant return the cached response within TTL (default 24h).
x-gateway-scope            string   Scope header for signal authentication. Include signal-submission to authorize POST /v1/signals when signal auth is required.
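
A client can assemble these headers with a small helper. The Python sketch below is illustrative: gateway_headers is a hypothetical function, and the client-side validation simply mirrors the allowed values listed in the table.

```python
ROUTING_PROFILES = {
    "realtime", "interactive", "batch", "cost_optimized", "quality_optimized",
    "balanced", "reasoning", "creative", "code",
}
CLASSIFICATIONS = {"public", "internal", "cui", "itar", "classified"}


def gateway_headers(profile=None, classification=None, timeout_secs=None,
                    conversation_id=None, failover=True):
    """Build request headers for a proxy call, validating the documented values."""
    headers = {"Content-Type": "application/json"}
    if profile is not None:
        if profile not in ROUTING_PROFILES:
            raise ValueError(f"unknown routing profile: {profile}")
        headers["x-routing-profile"] = profile
    if classification is not None:
        if classification not in CLASSIFICATIONS:
            raise ValueError(f"unknown data classification: {classification}")
        headers["x-data-classification"] = classification
    if timeout_secs is not None:
        headers["x-gateway-timeout"] = str(int(timeout_secs))  # value is in seconds
    if conversation_id is not None:
        headers["x-gateway-conversation-id"] = conversation_id
    if not failover:
        headers["x-gateway-failover"] = "disabled"
    return headers
```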

Response Headers

Header                         Value               Description
x-gateway-model                string              The model ID that actually handled the request.
x-gateway-provider             string              The provider ID that handled the request.
x-gateway-cache                hit | semantic-hit  Present when the response was served from exact-match or semantic cache.
x-gateway-semantic-similarity  float               Cosine similarity score when served from semantic cache.
x-gateway-idempotent           true                Present when the response was served from the idempotency cache.
x-context-window-limit         integer             Present when the request was rejected or truncated due to context window limits (HTTP 413).
x-gateway-circuit              open                Present when the request was rejected by a circuit breaker.
x-gateway-deprecation          date string         Present when the requested model has a deprecation date. Value is the sunset date.
x-ppf-negotiated-version       string              PPF negotiated protocol version.
x-ppf-deprecated               true                Present when the PPF protocol version used is deprecated.

10. CLI Reference

llm-gateway [OPTIONS] [COMMAND]

Commands:
  serve    Start the gateway server (default if no command given)
  update   Check for and apply updates from GitHub Releases

Global Options:
  -c, --config <PATH>   Path to TOML config file [default: config.toml]
  -p, --port <PORT>     Override the listen port from config
  -h, --help            Print help information
  -V, --version         Print version

Update Options:
  --check               Only check for updates, don't apply them

Examples

# Start with default config
llm-gateway

# Explicitly start the server (same as above)
llm-gateway serve

# Start with a custom config and port
llm-gateway --config /etc/llm-gateway/production.toml --port 9000

# Check for updates without applying
llm-gateway update --check

# Download and apply the latest update
llm-gateway update

Appendix A: Release Notes

v0.5.2 (2026-05-23)

v0.5.1 (2026-05-23)

v0.5.0 (2026-05-22)

v0.4.0 (2026-05-21)

Full release notes for earlier versions: see docs/reports/release-notes-v{VERSION}.html in the repository.
