Routing

Overview

The routing system automatically selects the best provider and model for every request based on the features your request requires, real-time metrics, and your optimization preferences. It supports:

Auto model selection — set model: "auto" and let the system choose
Provider routing — specify a model and let the system pick the best provider
Multi-provider fallback — if one provider fails, the system retries on the next
Feature degradation — if no provider supports all requested features, less important features are gracefully stripped
ZDR enforcement — when Zero Data Retention is enabled, only ZDR-supporting providers are considered

Basic Usage

Auto Model Selection

Set model: "auto" to let the system select both the model and provider:

curl https://api.concentrate.ai/v1/responses \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "auto",
    "input": "Explain machine learning"
  }'
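The same request can be assembled from Python. The sketch below uses only the standard library and takes the endpoint and headers from the curl example; `build_request` is a hypothetical helper name, and the request is built but not sent.

```python
import json
import urllib.request

API_URL = "https://api.concentrate.ai/v1/responses"  # endpoint from the curl example


def build_request(api_key, payload):
    """Build (but do not send) a POST request for the responses endpoint."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


request = build_request(
    "YOUR_API_KEY",
    {"model": "auto", "input": "Explain machine learning"},
)
```

Passing `request` to `urllib.request.urlopen` would send it; the response body is JSON and can be decoded with `json.load`.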

Provider Routing (Pinned Model)

Specify a model without a provider prefix and the system routes to the best provider:

{
  "model": "gpt-4o",
  "input": "Hello",
  "routing": {
    "metric": "cost"
  }
}

This selects the cheapest provider for GPT-4o (e.g., OpenAI vs. Azure) based on provider pricing.

Pinned Provider

Use the provider/model format to pin a specific provider. If the pinned provider fails, routing still falls back to other providers for the same model:

{
  "model": "openai/gpt-4o",
  "input": "Hello"
}

Routing Configuration

Control routing behavior with the routing parameter:

routing.metric

string, default: "performance"

How providers are sorted and selected:

Static Metrics:

"cost" — Sort by provider pricing (cheapest first)
"performance" — Sort by quality/revenue-share tier (best first)

Live Metrics (from Redis):

"avg_latency" — Average response time
"min_latency", "max_latency" — Min/max response time
"p50_latency", "p90_latency", "p99_latency" — Percentile latencies
"avg_e2e_latency", "min_e2e_latency", "max_e2e_latency" — End-to-end latency including overhead
Any percentile from p0 to p100: "p75_latency", "p95_e2e_latency", etc.
"uptime" — Provider availability
"throughput" — Requests per second
"total_requests" — Total request volume
"input_tokens", "output_tokens", "total_tokens" — Average token counts

routing.interval

string, default: "15 minutes"

Time window for live metric calculation. Only applies when using live metrics (latency, uptime, etc.). Static metrics (cost, performance) ignore this.

Format: "number unit" or "number<shorthand>"

"15 minutes" or "15m" (default, minimum)
"1 hour" or "1h"
"24 hours" or "24h"
"7 days" or "7d"

Valid units: minutes/m, hours/h, days/d, weeks/w, years/y
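A client can sanity-check routing.metric and routing.interval before sending a request. The following is a minimal sketch based only on the names and formats listed above; the function names are hypothetical, the singular unit spellings are inferred from "1 hour", and treating a year as 365 days is an assumption. The API itself remains the authority on what is accepted.

```python
import re
from datetime import timedelta

STATIC_METRICS = {"cost", "performance"}
NAMED_LIVE_METRICS = {
    "avg_latency", "min_latency", "max_latency",
    "avg_e2e_latency", "min_e2e_latency", "max_e2e_latency",
    "uptime", "throughput", "total_requests",
    "input_tokens", "output_tokens", "total_tokens",
}
# Unit spellings mapped to minutes (a year is assumed to be 365 days).
UNIT_MINUTES = {
    "m": 1, "minute": 1, "minutes": 1,
    "h": 60, "hour": 60, "hours": 60,
    "d": 1440, "day": 1440, "days": 1440,
    "w": 10080, "week": 10080, "weeks": 10080,
    "y": 525600, "year": 525600, "years": 525600,
}
MINIMUM_INTERVAL = timedelta(minutes=15)


def classify_metric(name):
    """Return "static" or "live" for a recognized metric name, else None."""
    if name in STATIC_METRICS:
        return "static"
    if name in NAMED_LIVE_METRICS:
        return "live"
    # Percentiles p0 through p100, with or without the e2e variant.
    match = re.fullmatch(r"p(\d{1,3})(_e2e)?_latency", name)
    if match and int(match.group(1)) <= 100:
        return "live"
    return None


def parse_interval(text):
    """Parse "15 minutes" or "15m" into a timedelta, enforcing the minimum."""
    match = re.fullmatch(r"(\d+)\s*([a-z]+)", text.strip().lower())
    if not match or match.group(2) not in UNIT_MINUTES:
        raise ValueError(f"unrecognized interval: {text!r}")
    interval = timedelta(minutes=int(match.group(1)) * UNIT_MINUTES[match.group(2)])
    if interval < MINIMUM_INTERVAL:
        raise ValueError(f"interval {text!r} is below the 15-minute minimum")
    return interval


def effective_interval(routing):
    """Return the window that applies, or None when a static metric ignores it."""
    kind = classify_metric(routing.get("metric", "performance"))
    if kind is None:
        raise ValueError(f"unknown metric: {routing.get('metric')!r}")
    if kind == "static":
        return None
    return parse_interval(routing.get("interval", "15 minutes"))
```

For example, `effective_interval({"metric": "uptime", "interval": "7d"})` yields a seven-day window, while any configuration with a static metric yields `None` because the interval is ignored.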

routing.models

array

Fallback models tried after the primary model’s providers are exhausted. Accepts model slugs, provider/model format, and "auto".

{
  "model": "gpt-4o",
  "routing": {
    "models": ["claude-sonnet-4-20250514", "auto"]
  }
}

The system tries gpt-4o first (across all suitable providers), then claude-sonnet-4-20250514, then auto-selects from remaining models.
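The ordering described above can be pictured as a flat list of models to attempt. This is only an illustrative sketch: `model_attempt_order` and the catalog are hypothetical, and how "auto" actually ranks the remaining models is not specified here, so the sketch simply keeps catalog order.

```python
def model_attempt_order(primary, fallbacks, all_models):
    """List models in the order they would be attempted.

    "auto" stands for every model in `all_models` not already listed.
    """
    order = []
    for model in [primary, *fallbacks]:
        candidates = all_models if model == "auto" else [model]
        for candidate in candidates:
            if candidate not in order:
                order.append(candidate)
    return order


# Hypothetical catalog; the real one is maintained by the routing system.
catalog = ["gpt-4o", "claude-sonnet-4-20250514", "gpt-4o-mini"]
order = model_attempt_order("gpt-4o", ["claude-sonnet-4-20250514", "auto"], catalog)
```

Each entry in `order` would itself be tried across all of its suitable providers before the next entry is considered.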

routing.providers

array

Whitelist of providers to consider. When set, only these providers are used for routing. Omit to allow all providers.

{
  "model": "gpt-4o",
  "routing": {
    "providers": ["openai", "azure"]
  }
}

How It Works

1. Feature Detection

When your request arrives, the routing plugin scans it and builds a set of required features based on what you’re using:

You send	Required feature
`stream: true`	`stream`
`tools` with functions	`tools.function_calling`
`tools` with web search	`tools.web_search`
`tool_choice: "required"`	`tool_choice.required`
`text.format.type: "json_schema"`	`text.format.json_schema`
`reasoning.effort: "high"`	`reasoning.effort.high`
`temperature`	`temperature`

These features are encoded into a bitmask for fast matching against a pre-built index of every model/provider combination and their capabilities.
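To make the bitmask idea concrete, here is a minimal sketch of how the table above might translate into code. It is not the routing plugin's implementation: the `Feature` enum, the function names, and the tool `type` strings ("function", "web_search") are assumptions made for illustration.

```python
from enum import IntFlag, auto


class Feature(IntFlag):
    STREAM = auto()
    FUNCTION_CALLING = auto()
    WEB_SEARCH = auto()
    TOOL_CHOICE_REQUIRED = auto()
    JSON_SCHEMA = auto()
    REASONING_HIGH = auto()
    TEMPERATURE = auto()


def required_features(request):
    """Derive the required-feature bitmask from a request body."""
    mask = Feature(0)
    if request.get("stream") is True:
        mask |= Feature.STREAM
    for tool in request.get("tools", []):
        # The tool "type" strings are assumptions of this sketch.
        if tool.get("type") == "function":
            mask |= Feature.FUNCTION_CALLING
        elif tool.get("type") == "web_search":
            mask |= Feature.WEB_SEARCH
    if request.get("tool_choice") == "required":
        mask |= Feature.TOOL_CHOICE_REQUIRED
    if request.get("text", {}).get("format", {}).get("type") == "json_schema":
        mask |= Feature.JSON_SCHEMA
    if request.get("reasoning", {}).get("effort") == "high":
        mask |= Feature.REASONING_HIGH
    if "temperature" in request:
        mask |= Feature.TEMPERATURE
    return mask


def supports(capabilities, required):
    """A provider matches when every required bit is set in its capabilities."""
    return (capabilities & required) == required
```

The appeal of this representation is that matching a request against one model/provider entry is a single AND and compare, regardless of how many features are involved.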

2. Provider Selection

Providers matching all required features are sorted by your chosen metric. For live metrics (latency, uptime, etc.), the system also factors in:

Prompt cache affinity — providers where you have active cached tokens are prioritized
Feature uptime — providers whose success rate for any required feature drops below 90% are excluded
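The selection step for live metrics can be sketched as a filter followed by a sort. The provider records and field names below (`has_cached_tokens`, `feature_uptime`, `metrics`) are hypothetical, missing uptime data is assumed to count as healthy, and placing cache-affine providers strictly first is one possible reading of "prioritized".

```python
UPTIME_FLOOR = 0.90  # the 90% threshold described above


def select_providers(providers, required, metric, lower_is_better=True):
    """Drop providers with unreliable features, then sort with cache affinity first."""
    eligible = [
        p for p in providers
        if all(p["feature_uptime"].get(f, 1.0) >= UPTIME_FLOOR for f in required)
    ]
    sign = 1 if lower_is_better else -1
    # Tuples sort element by element: False (has cached tokens) sorts before True.
    return sorted(
        eligible,
        key=lambda p: (not p["has_cached_tokens"], sign * p["metrics"][metric]),
    )


providers = [
    {"name": "a", "has_cached_tokens": False,
     "metrics": {"p50_latency": 300}, "feature_uptime": {"stream": 0.99}},
    {"name": "b", "has_cached_tokens": True,
     "metrics": {"p50_latency": 450}, "feature_uptime": {"stream": 0.97}},
    {"name": "c", "has_cached_tokens": False,
     "metrics": {"p50_latency": 200}, "feature_uptime": {"stream": 0.85}},
]
ranked = [p["name"] for p in select_providers(providers, ["stream"], "p50_latency")]
```

Here provider "c" has the lowest latency but is excluded because its streaming success rate is under 90%, and "b" ranks ahead of "a" because of its cached tokens.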

3. Fallback & Retry

If a provider fails, the system automatically tries the next provider in the sorted list:

Request -> Provider A (fails) -> Provider B (fails) -> Provider C (success)

This happens transparently. For non-streaming requests, any error triggers a retry. For streaming requests, retries only happen during the connection phase — once streaming begins, the request is committed.
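The fallback chain amounts to a loop over the sorted provider list. The sketch below simulates it with a fake transport; `ProviderError`, `call_with_fallback`, and `flaky_send` are hypothetical names, not part of the API.

```python
class ProviderError(Exception):
    """Hypothetical error raised when a provider call fails."""


def call_with_fallback(providers, send):
    """Try each provider in order; re-raise the last error if all fail."""
    last_error = None
    for provider in providers:
        try:
            return provider, send(provider)
        except ProviderError as error:
            last_error = error  # move on to the next provider
    if last_error is None:
        raise ValueError("no providers to try")
    raise last_error


def flaky_send(provider):
    # Simulated transport: providers A and B fail, C succeeds.
    if provider in ("A", "B"):
        raise ProviderError(f"{provider} unavailable")
    return {"status": "completed"}


used, response = call_with_fallback(["A", "B", "C"], flaky_send)
```

For a streaming request, `send` would cover only the connection phase; once the stream has started, an error would no longer move on to the next provider.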

4. Feature Degradation

If no provider supports all requested features, the system gracefully strips less important features to find a match. Features are stripped in the following order (least important first):

Cache identity (prompt_cache_key, prompt_cache_retention)
Output verbosity control (text.verbosity)
Response metadata includes (include.*)
Custom tools (tools.custom_tools)
Parallel tool calls (parallel_tool_calls)
Sampling parameters (top_p, temperature)
Reasoning effort (reasoning.effort.*)
Web search (tools.web_search)
Tool choice controls (tool_choice.*)
Structured output (text.format.json_schema, text.format.json_object)
Function calling (tools.function_calling)
Streaming (stream)

Core capabilities like streaming and function calling are stripped last, meaning the system will exhaust all other options before degrading these.
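One way to picture degradation is as a loop that removes one feature group at a time until some provider can serve what is left. This sketch assumes stripping is cumulative and uses one simplified label per group from the list above; the labels, the `degrade` function, and the capability sets are hypothetical.

```python
# Least important first, following the list above (one label per group).
STRIP_ORDER = [
    "prompt_cache", "text.verbosity", "include", "tools.custom_tools",
    "parallel_tool_calls", "sampling", "reasoning.effort", "tools.web_search",
    "tool_choice", "text.format", "tools.function_calling", "stream",
]


def degrade(required, provider_capabilities):
    """Strip features in STRIP_ORDER until some provider supports the rest.

    Returns (provider_name, remaining_features, stripped_features),
    or None if nothing matches even with every listed feature stripped.
    """
    remaining = set(required)
    stripped = []
    to_strip = [f for f in STRIP_ORDER if f in remaining]
    while True:
        for name, capabilities in provider_capabilities.items():
            if remaining <= capabilities:  # subset test: provider covers everything left
                return name, remaining, stripped
        if not to_strip:
            return None
        feature = to_strip.pop(0)
        remaining.discard(feature)
        stripped.append(feature)


capabilities = {
    "p1": {"stream", "tools.function_calling"},
    "p2": {"stream", "sampling"},
}
result = degrade(
    {"stream", "tools.function_calling", "text.verbosity", "sampling"},
    capabilities,
)
```

In this example verbosity control and sampling parameters are dropped, after which "p1" can serve the request with streaming and function calling intact.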

Examples

Cost-Optimized with Fallbacks

For high-volume workloads where you want the cheapest option with resilience:

{
  "model": "auto",
  "input": "Summarize this text",
  "routing": {
    "metric": "cost",
    "models": ["gpt-4o-mini", "gemini-2.0-flash"]
  }
}

Performance-Optimized

For complex reasoning or code generation:

{
  "model": "auto",
  "input": "Design a scalable microservices architecture",
  "routing": {
    "metric": "performance"
  }
}

Latency-Optimized

For real-time chat or interactive applications:

{
  "model": "auto",
  "input": "Quick translation: Hello -> Spanish",
  "routing": {
    "metric": "p50_latency",
    "interval": "15 minutes"
  }
}

Provider-Restricted

Limit routing to specific providers (e.g., for compliance):

{
  "model": "gpt-4o",
  "input": "Process this data",
  "routing": {
    "metric": "performance",
    "providers": ["openai", "azure"]
  }
}

Response Information

The response includes which provider and model were selected:

{
  "id": "resp_xyz789",
  "created_at": 1702934400,
  "status": "completed",
  "model": "anthropic/claude-haiku-4-5",
  "output": [...],
  "usage": {...}
}

Log the model field in responses to understand routing decisions over time.
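When logging, it can help to separate the provider from the model. A small sketch, assuming the field follows the provider/model format shown in the example response; `split_model_field` is a hypothetical helper.

```python
def split_model_field(value):
    """Split "provider/model" into (provider, model); provider is None if absent."""
    provider, separator, model = value.partition("/")
    if not separator:
        return None, value
    return provider, model


provider, model = split_model_field("anthropic/claude-haiku-4-5")
```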

Error Handling

When all providers are exhausted, the API returns the last provider’s error. Common scenarios:

Status	Meaning
`424`	All providers failed (provider errors)
`429`	All providers rate-limited
`422`	ZDR enabled but no ZDR-supporting providers available for the requested features

If Zero Data Retention is enabled on your API key but no providers support ZDR for the required model/features, you will receive a 422 Unprocessable Entity error rather than falling back to a non-ZDR provider.

To improve reliability:

Add fallback models via routing.models
Use broader provider pools (don’t restrict routing.providers unnecessarily)
Use metric: "performance" (default) for the most stable behavior — it uses static ranking and doesn’t depend on live metrics availability
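On the client side, the status codes in the table above can be mapped to a reaction. The meanings come from the table; the suggested actions are an assumption of this sketch, not documented API behavior, and `explain_status` is a hypothetical helper.

```python
# status -> (meaning from the table above, suggested client reaction)
ROUTING_ERRORS = {
    424: ("all providers failed", "retry_or_add_fallbacks"),
    429: ("all providers rate-limited", "retry_with_backoff"),
    422: ("no ZDR-supporting provider for the requested features", "change_request"),
}


def explain_status(status):
    """Return (meaning, suggested_action) for a final routing status code."""
    return ROUTING_ERRORS.get(status, ("not a routing failure listed above", "inspect"))


meaning, action = explain_status(429)
```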

Best Practices

Match metric to use case

"cost": Content generation, summarization, simple Q&A
"performance" (default): Complex reasoning, code generation, analysis
"p50_latency": Real-time chat, interactive applications
"uptime": Mission-critical production workloads

Use appropriate intervals

15 minutes (default): Most reactive to provider issues
1 hour: Good balance of stability and responsiveness
24 hours: Stable, long-term patterns
7 days+: Historical trends, less reactive to spikes

Configure fallback models

Add routing.models for critical workloads. If the primary model’s providers all fail, the system automatically tries fallbacks:

{
  "model": "gpt-4o",
  "routing": {
    "models": ["claude-sonnet-4-20250514", "gemini-2.5-pro", "auto"]
  }
}

Placing "auto" last gives the system maximum flexibility as a final fallback.

Monitor selected providers

Track which providers are being selected over time:

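from collections import Counter

selected_providers = Counter()

# `pending_requests` and `make_request` are placeholders for your own
# request payloads and API client call.
for payload in pending_requests:
    response = make_request(payload)
    # The "model" field has the form "provider/model", e.g. "openai/gpt-4o"
    selected_providers[response["model"]] += 1

print(dict(selected_providers))
# {'anthropic/claude-haiku-4-5': 45, 'openai/gpt-4o-mini': 32, ...}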

Combine with max_output_tokens

Set token limits to control costs even with auto routing:

{
  "model": "auto",
  "input": "Explain quantum physics",
  "routing": { "metric": "cost" },
  "max_output_tokens": 500
}

Related Documentation

Create Response

Main endpoint documentation

Supported Models

View all available models

Error Handling

Handle routing failures

Request Parameters

Full parameter reference
