
Overview

The routing system automatically selects the best provider and model for every request based on the features your request requires, real-time metrics, and your optimization preferences. It supports:
  • Auto model selection — set model: "auto" and let the system choose
  • Provider routing — specify a model and let the system pick the best provider
  • Multi-provider fallback — if one provider fails, the system retries on the next
  • Feature degradation — if no provider supports all requested features, less important features are gracefully stripped
  • ZDR enforcement — when Zero Data Retention is enabled, only ZDR-supporting providers are considered

Basic Usage

Auto Model Selection

Set model: "auto" to let the system select both the model and provider:
curl https://api.concentrate.ai/v1/responses \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "auto",
    "input": "Explain machine learning"
  }'
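The same request can be built from Python. This is a minimal sketch using only the standard library; `build_request` and `send` are hypothetical helper names, while the endpoint and headers are copied from the curl example above. The request is constructed without being sent, so it can be inspected first.

```python
import json
import urllib.request

API_URL = "https://api.concentrate.ai/v1/responses"  # endpoint from the curl example


def build_request(payload: dict, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the JSON payload."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and return the parsed JSON response body."""
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)


# model "auto" asks the router to pick both the model and the provider.
req = build_request({"model": "auto", "input": "Explain machine learning"}, "YOUR_API_KEY")
```

Calling `send(req)` performs the network call and returns the response as a dictionary.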

Provider Routing (Pinned Model)

Specify a model without a provider prefix and the system routes to the best provider:
{
  "model": "gpt-4o",
  "input": "Hello",
  "routing": {
    "metric": "cost"
  }
}
This selects the cheapest provider for GPT-4o (e.g., OpenAI vs Azure) based on provider pricing.

Pinned Provider

Use provider/model format to pin a specific provider. Routing still provides fallback to other providers for the same model if the pinned one fails:
{
  "model": "openai/gpt-4o",
  "input": "Hello"
}

Routing Configuration

Control routing behavior with the routing parameter:
routing.metric
string
default:"performance"
How providers are sorted and selected.
Static Metrics:
  • "cost" — Sort by provider pricing (cheapest first)
  • "performance" — Sort by quality/revenue-share tier (best first)
Live Metrics (from Redis):
  • "avg_latency" — Average response time
  • "min_latency", "max_latency" — Min/max response time
  • "p50_latency", "p90_latency", "p99_latency" — Percentile latencies
  • "avg_e2e_latency", "min_e2e_latency", "max_e2e_latency" — End-to-end latency including overhead
  • Any percentile from p0 to p100: "p75_latency", "p95_e2e_latency", etc.
  • "uptime" — Provider availability
  • "throughput" — Requests per second
  • "total_requests" — Total request volume
  • "input_tokens", "output_tokens", "total_tokens" — Average token counts
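A client can check a metric name before sending a request. The sketch below is a hypothetical client-side validator, not part of the API: the names `classify_metric`, `STATIC_METRICS`, and `LIVE_METRICS` are illustrative, and the regular expression assumes percentiles are written as whole numbers from 0 to 100 without leading zeros.

```python
import re

STATIC_METRICS = {"cost", "performance"}
LIVE_METRICS = {
    "uptime", "throughput", "total_requests",
    "input_tokens", "output_tokens", "total_tokens",
}
# avg/min/max or pN (N in 0..100), optionally end-to-end, then "_latency".
LATENCY_RE = re.compile(r"(?:avg|min|max|p(?:100|[1-9]?\d))(?:_e2e)?_latency")


def classify_metric(name: str) -> str:
    """Return "static" or "live" for a listed metric name, else raise ValueError."""
    if name in STATIC_METRICS:
        return "static"
    if name in LIVE_METRICS or LATENCY_RE.fullmatch(name):
        return "live"
    raise ValueError(f"unknown routing metric: {name!r}")
```

Knowing whether a metric is static or live is useful because routing.interval only affects live metrics.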
routing.interval
string
default:"15 minutes"
Time window for live metric calculation. Only applies when using live metrics (latency, uptime, etc.). Static metrics (cost, performance) ignore this.
Format: "number unit" or "number<shorthand>"
  • "15 minutes" or "15m" (default, minimum)
  • "1 hour" or "1h"
  • "24 hours" or "24h"
  • "7 days" or "7d"
Valid units: minutes/m, hours/h, days/d, weeks/w, years/y
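The interval format can be validated on the client before a request is sent. The following is a sketch under stated assumptions: `parse_interval` is a hypothetical helper, a week is taken as 7 days and a year as 365 days, and optional whitespace between the number and a shorthand unit is accepted for leniency. The server's own parsing rules may differ.

```python
import re
from datetime import timedelta

# Seconds per shorthand unit. Week and year lengths are assumptions.
UNIT_SECONDS = {"m": 60, "h": 3600, "d": 86400, "w": 7 * 86400, "y": 365 * 86400}
LONG_UNITS = {"minute": "m", "hour": "h", "day": "d", "week": "w", "year": "y"}
MINIMUM = timedelta(minutes=15)  # documented minimum window

INTERVAL_RE = re.compile(r"(\d+)\s*([a-z]+)")


def parse_interval(text: str) -> timedelta:
    """Parse strings such as "15 minutes", "1 hour", or "7d" into a timedelta."""
    match = INTERVAL_RE.fullmatch(text.strip().lower())
    if match is None:
        raise ValueError(f"unrecognized interval: {text!r}")
    count, unit = int(match.group(1)), match.group(2)
    if unit in UNIT_SECONDS:
        short = unit
    else:
        singular = unit[:-1] if unit.endswith("s") else unit
        short = LONG_UNITS.get(singular)
    if short is None:
        raise ValueError(f"unknown unit in interval: {text!r}")
    window = timedelta(seconds=count * UNIT_SECONDS[short])
    if window < MINIMUM:
        raise ValueError(f"interval {text!r} is below the 15 minute minimum")
    return window
```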
routing.models
array
Fallback models tried after the primary model’s providers are exhausted. Accepts model slugs, provider/model format, and "auto".
{
  "model": "gpt-4o",
  "routing": {
    "models": ["claude-sonnet-4-20250514", "auto"]
  }
}
The system tries gpt-4o first (across all suitable providers), then claude-sonnet-4-20250514, then auto-selects from remaining models.
routing.providers
array
Whitelist of providers to consider. When set, only these providers are used for routing. Omit to allow all providers.
{
  "model": "gpt-4o",
  "routing": {
    "providers": ["openai", "azure"]
  }
}
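The four routing options above can be assembled programmatically. This sketch uses a hypothetical helper, `build_body`, that includes only the routing keys the caller actually set, so that server-side defaults apply to everything else. The key names come from this page; the function itself is illustrative.

```python
from typing import Optional


def build_body(
    model: str,
    input_text: str,
    metric: Optional[str] = None,
    interval: Optional[str] = None,
    models: Optional[list] = None,
    providers: Optional[list] = None,
) -> dict:
    """Return a request body that contains only the routing keys that were set."""
    options = {
        "metric": metric,
        "interval": interval,
        "models": models,
        "providers": providers,
    }
    routing = {key: value for key, value in options.items() if value is not None}
    body = {"model": model, "input": input_text}
    if routing:
        body["routing"] = routing
    return body


body = build_body("gpt-4o", "Hello", metric="cost", providers=["openai", "azure"])
```

Here `body` carries `metric` and `providers` under `routing`, while `interval` and `models` are omitted entirely.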

How It Works

1. Feature Detection

When your request arrives, the routing plugin scans it and builds a set of required features based on what you’re using:
You send                           | Required feature
stream: true                       | stream
tools with functions               | tools.function_calling
tools with web search              | tools.web_search
tool_choice: "required"            | tool_choice.required
text.format.type: "json_schema"    | text.format.json_schema
reasoning.effort: "high"           | reasoning.effort.high
temperature                        | temperature
These features are encoded into a bitmask for fast matching against a pre-built index of every model/provider combination and their capabilities.
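To illustrate the idea of bitmask matching, here is a minimal sketch. It is not the routing plugin's actual code: the `Feature` flag names, the `required_features` and `supports` helpers, and the tool type strings ("function", "web_search") are all assumptions made for this example.

```python
from enum import IntFlag, auto


class Feature(IntFlag):
    """One bit per feature from the table above (bit positions are arbitrary)."""
    STREAM = auto()
    FUNCTION_CALLING = auto()
    WEB_SEARCH = auto()
    TOOL_CHOICE_REQUIRED = auto()
    JSON_SCHEMA = auto()
    REASONING_HIGH = auto()
    TEMPERATURE = auto()


def required_features(body: dict) -> Feature:
    """Scan a request body and OR together the features it uses."""
    mask = Feature(0)
    if body.get("stream") is True:
        mask |= Feature.STREAM
    for tool in body.get("tools", []):
        # Tool type names are assumptions for this sketch.
        if tool.get("type") == "function":
            mask |= Feature.FUNCTION_CALLING
        elif tool.get("type") == "web_search":
            mask |= Feature.WEB_SEARCH
    if body.get("tool_choice") == "required":
        mask |= Feature.TOOL_CHOICE_REQUIRED
    if body.get("text", {}).get("format", {}).get("type") == "json_schema":
        mask |= Feature.JSON_SCHEMA
    if body.get("reasoning", {}).get("effort") == "high":
        mask |= Feature.REASONING_HIGH
    if "temperature" in body:
        mask |= Feature.TEMPERATURE
    return mask


def supports(capabilities: Feature, required: Feature) -> bool:
    """A provider matches when every required bit is also set in its capabilities."""
    return (capabilities & required) == required
```

The match is a single AND plus a comparison per model/provider entry, which is why a bitmask suits a pre-built index.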

2. Provider Selection

Providers matching all required features are sorted by your chosen metric. For live metrics (latency, uptime, etc.), the system also factors in:
  • Prompt cache affinity — providers where you have active cached tokens are prioritized
  • Feature uptime — providers whose success rate for any required feature drops below 90% are excluded
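The selection rules above can be sketched as a filter followed by a sort. This is an illustration only: the `Candidate` type and `rank` function are hypothetical, the metric is assumed to be "lower is better" (as with latency), cache affinity is modeled as strict precedence over the metric, and providers with no recorded success rate for a feature are assumed to pass. The real system may weigh these factors differently.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    name: str
    metric_value: float               # e.g. p50 latency in ms; lower is better here
    has_cached_tokens: bool = False   # prompt cache affinity
    feature_success: dict = field(default_factory=dict)  # feature -> success rate (0..1)


def rank(candidates, required_features, min_success=0.90):
    """Drop providers below the feature-uptime threshold, then sort the rest."""
    eligible = [
        c for c in candidates
        if all(c.feature_success.get(f, 1.0) >= min_success for f in required_features)
    ]
    # Providers with cached tokens first, then ascending by the chosen metric.
    return sorted(eligible, key=lambda c: (not c.has_cached_tokens, c.metric_value))


providers = [
    Candidate("a", 120, feature_success={"stream": 0.99}),
    Candidate("b", 80, feature_success={"stream": 0.85}),   # below 90%, excluded
    Candidate("c", 200, has_cached_tokens=True, feature_success={"stream": 0.95}),
]
order = [c.name for c in rank(providers, ["stream"])]
```

In this example `order` is `["c", "a"]`: provider b is excluded for falling below 90%, and provider c leads because of its cached tokens despite the higher latency.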

3. Fallback & Retry

If a provider fails, the system automatically tries the next provider in the sorted list:
Request -> Provider A (fails) -> Provider B (fails) -> Provider C (success)
This happens transparently. For non-streaming requests, any error triggers a retry. For streaming requests, retries only happen during the connection phase — once streaming begins, the request is committed.
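The retry behavior can be pictured with a short sketch. The names `ProviderError`, `call_with_fallback`, and `stream_with_fallback` are hypothetical and stand in for whatever the routing layer uses internally; `send` and `connect` are caller-supplied functions.

```python
class ProviderError(Exception):
    """Raised by a provider call that failed (hypothetical error type)."""


def call_with_fallback(providers, send):
    """Try each provider in order; return (provider, result) from the first success.

    If every provider fails, re-raise the last error, mirroring the
    "last provider's error" behavior described under Error Handling.
    """
    last_error = None
    for provider in providers:
        try:
            return provider, send(provider)
        except ProviderError as exc:
            last_error = exc
    if last_error is None:
        raise ValueError("no providers to try")
    raise last_error


def stream_with_fallback(providers, connect):
    """Retry only while connecting; once a stream is open, errors propagate."""
    _, stream = call_with_fallback(providers, connect)
    yield from stream
```

In the streaming variant, `connect` may fail and trigger a fallback, but an exception raised while chunks are being yielded is not retried, which matches the "committed once streaming begins" rule.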

4. Feature Degradation

If no provider supports all requested features, the system gracefully strips less important features to find a match. Features are stripped in the following order, least important first:
  1. Cache identity (prompt_cache_key, prompt_cache_retention)
  2. Output verbosity control (text.verbosity)
  3. Response metadata includes (include.*)
  4. Custom tools (tools.custom_tools)
  5. Parallel tool calls (parallel_tool_calls)
  6. Sampling parameters (top_p, temperature)
  7. Reasoning effort (reasoning.effort.*)
  8. Web search (tools.web_search)
  9. Tool choice controls (tool_choice.*)
  10. Structured output (text.format.json_schema, text.format.json_object)
  11. Function calling (tools.function_calling)
  12. Streaming (stream)
Core capabilities like streaming and function calling are stripped last, meaning the system will exhaust all other options before degrading these.
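The stripping order can be expressed as a loop that removes one group at a time until some provider covers what remains. This is a sketch, not the actual implementation: `STRIP_ORDER`, `in_group`, and `degrade` are hypothetical names, each numbered item is assumed to be stripped as a whole group, wildcard entries such as reasoning.effort.* are modeled as name prefixes, and the first matching provider wins (metric sorting is left out).

```python
# Groups from the list above, least important first. Entries act as prefixes.
STRIP_ORDER = [
    ("prompt_cache_key", "prompt_cache_retention"),
    ("text.verbosity",),
    ("include",),
    ("tools.custom_tools",),
    ("parallel_tool_calls",),
    ("top_p", "temperature"),
    ("reasoning.effort",),
    ("tools.web_search",),
    ("tool_choice",),
    ("text.format.json_schema", "text.format.json_object"),
    ("tools.function_calling",),
    ("stream",),
]


def in_group(feature: str, group: tuple) -> bool:
    """True if the feature equals a group entry or sits beneath it (a.b under a)."""
    return any(feature == p or feature.startswith(p + ".") for p in group)


def degrade(required: set, provider_capabilities: dict):
    """Return (provider, kept, stripped) for the first match, or None."""
    remaining = set(required)
    stripped = []
    for group in [()] + STRIP_ORDER:  # the empty first group strips nothing
        dropped = {f for f in remaining if in_group(f, group)}
        if group and not dropped:
            continue  # nothing changed, no need to re-check providers
        remaining -= dropped
        stripped.extend(sorted(dropped))
        for name, caps in provider_capabilities.items():
            if remaining <= caps:
                return name, remaining, stripped
    return None


caps = {"p1": {"stream", "tools.function_calling"}, "p2": {"stream", "temperature"}}
wanted = {"stream", "tools.function_calling", "temperature", "text.verbosity"}
result = degrade(wanted, caps)
```

Here `result` is `("p1", {"stream", "tools.function_calling"}, ["text.verbosity", "temperature"])`: verbosity control is dropped first, then sampling parameters, at which point p1 covers the remaining features.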

Examples

Cost-Optimized with Fallbacks

For high-volume workloads where you want the cheapest option with resilience:
{
  "model": "auto",
  "input": "Summarize this text",
  "routing": {
    "metric": "cost",
    "models": ["gpt-4o-mini", "gemini-2.5-flash"]
  }
}

Performance-Optimized

For complex reasoning or code generation:
{
  "model": "auto",
  "input": "Design a scalable microservices architecture",
  "routing": {
    "metric": "performance"
  }
}

Latency-Optimized

For real-time chat or interactive applications:
{
  "model": "auto",
  "input": "Quick translation: Hello -> Spanish",
  "routing": {
    "metric": "p50_latency",
    "interval": "15 minutes"
  }
}

Provider-Restricted

Limit routing to specific providers (e.g., for compliance):
{
  "model": "gpt-4o",
  "input": "Process this data",
  "routing": {
    "metric": "performance",
    "providers": ["openai", "azure"]
  }
}

Response Information

The response includes which provider and model were selected:
{
  "id": "resp_xyz789",
  "created_at": 1702934400,
  "status": "completed",
  "model": "anthropic/claude-haiku-4-5",
  "output": [...],
  "usage": {...}
}
Log the model field in responses to understand routing decisions over time.

Error Handling

When all providers are exhausted, the API returns the last provider’s error. Common scenarios:
Status | Meaning
424    | All providers failed (provider errors)
429    | All providers rate-limited
422    | ZDR enabled but no ZDR-supporting providers available for the requested features
If Zero Data Retention is enabled on your API key but no providers support ZDR for the required model/features, you will receive a 422 Unprocessable Entity error rather than falling back to a non-ZDR provider.
To improve reliability:
  • Add fallback models via routing.models
  • Use broader provider pools (don’t restrict routing.providers unnecessarily)
  • Use metric: "performance" (default) for the most stable behavior — it uses static ranking and doesn’t depend on live metrics availability
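On the client side, the three statuses above call for different reactions. The mapping below is one reasonable policy, offered as a sketch; the `Action` enum and `suggest_action` function are hypothetical and nothing here is mandated by the API.

```python
from enum import Enum


class Action(Enum):
    RETRY_LATER = "retry after a backoff"
    ADD_FALLBACKS = "add fallback models or widen the provider pool"
    CHANGE_REQUEST = "change the model or features; retrying unchanged will not help"
    RAISE = "not a routing failure listed above; surface to the caller"


def suggest_action(status: int) -> Action:
    """Map a final HTTP status to a suggested client reaction (an assumed policy)."""
    policy = {
        429: Action.RETRY_LATER,     # all providers rate-limited
        424: Action.ADD_FALLBACKS,   # all providers failed
        422: Action.CHANGE_REQUEST,  # ZDR enabled, no ZDR provider for these features
    }
    return policy.get(status, Action.RAISE)
```

The distinction worth keeping is that a 422 caused by ZDR enforcement is deterministic for a given request, so repeating the same request is unlikely to succeed, whereas a 429 may clear with time.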

Best Practices

Choose the metric that matches the workload:
  • "cost": Content generation, summarization, simple Q&A
  • "performance" (default): Complex reasoning, code generation, analysis
  • "p50_latency": Real-time chat, interactive applications
  • "uptime": Mission-critical production workloads
Choose the interval based on how quickly routing should react (live metrics only):
  • 15 minutes (default): Most reactive to provider issues
  • 1 hour: Good balance of stability and responsiveness
  • 24 hours: Stable, long-term patterns
  • 7 days+: Historical trends, less reactive to spikes
Add routing.models for critical workloads. If the primary model’s providers all fail, the system automatically tries fallbacks:
{
  "model": "gpt-4o",
  "routing": {
    "models": ["claude-sonnet-4-20250514", "gemini-2.5-pro", "auto"]
  }
}
Placing "auto" last gives the system maximum flexibility as a final fallback.
Track which providers are being selected over time:
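import json
import urllib.request
from collections import Counter

def make_request(payload):
    # POST one request body to the Responses endpoint and parse the JSON reply.
    req = urllib.request.Request(
        "https://api.concentrate.ai/v1/responses",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json", "Authorization": "Bearer YOUR_API_KEY"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payloads = [{"model": "auto", "input": "Explain machine learning"}]  # your request bodies
selected_providers = Counter(make_request(p)["model"] for p in payloads)  # e.g. "openai/gpt-4o"
print(selected_providers)
# e.g. Counter({'anthropic/claude-haiku-4-5': 45, 'openai/gpt-4o-mini': 32, ...})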
Set token limits to control costs even with auto routing:
{
  "model": "auto",
  "input": "Explain quantum physics",
  "routing": { "metric": "cost" },
  "max_output_tokens": 500
}
