Overview
The routing system automatically selects the best provider and model for every request based on the features your request requires, real-time metrics, and your optimization preferences. It supports:

- Auto model selection — set `model: "auto"` and let the system choose
- Provider routing — specify a model and let the system pick the best provider
- Multi-provider fallback — if one provider fails, the system retries on the next
- Feature degradation — if no provider supports all requested features, less important features are gracefully stripped
- ZDR enforcement — when Zero Data Retention is enabled, only ZDR-supporting providers are considered
Basic Usage
Auto Model Selection
Set `model: "auto"` to let the system select both the model and provider:
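As a minimal sketch, the request body might look like this (only `model: "auto"` is documented on this page; the `input` field is an assumption based on an OpenAI-style Responses API):

```json
{
  "model": "auto",
  "input": "Summarize the latest routing metrics."
}
```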
Provider Routing (Pinned Model)
Specify a model without a provider prefix and the system routes to the best provider:
Pinned Provider
Use `provider/model` format to pin a specific provider. Routing still provides fallback to other providers for the same model if the pinned one fails:
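A sketch of a pinned-provider request (the `openai` provider slug and the `input` field are illustrative assumptions; `gpt-4o` is a model slug used elsewhere on this page):

```json
{
  "model": "openai/gpt-4o",
  "input": "Draft a release note."
}
```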
Routing Configuration
Control routing behavior with the `routing` parameter:
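For illustration, a request combining the `routing` fields described below might look like this (the model slugs come from this page; the `input` field is an assumption):

```json
{
  "model": "auto",
  "input": "Classify this support ticket.",
  "routing": {
    "metric": "p50_latency",
    "interval": "1 hour",
    "models": ["gpt-4o", "claude-sonnet-4-20250514", "auto"]
  }
}
```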
`metric`
How providers are sorted and selected:
Static Metrics:
- `"cost"` — Sort by provider pricing (cheapest first)
- `"performance"` — Sort by quality/revenue-share tier (best first)
Live Metrics:
- `"avg_latency"` — Average response time
- `"min_latency"`, `"max_latency"` — Min/max response time
- `"p50_latency"`, `"p90_latency"`, `"p99_latency"` — Percentile latencies
- `"avg_e2e_latency"`, `"min_e2e_latency"`, `"max_e2e_latency"` — End-to-end latency including overhead
- Any percentile from p0 to p100: `"p75_latency"`, `"p95_e2e_latency"`, etc.
- `"uptime"` — Provider availability
- `"throughput"` — Requests per second
- `"total_requests"` — Total request volume
- `"input_tokens"`, `"output_tokens"`, `"total_tokens"` — Average token counts
`interval`
Time window for live metric calculation. Only applies when using live metrics (latency, uptime, etc.). Static metrics (`cost`, `performance`) ignore this.
Format: `"number unit"` or the shorthand `"number<unit>"`:
- `"15 minutes"` or `"15m"` (default, minimum)
- `"1 hour"` or `"1h"`
- `"24 hours"` or `"24h"`
- `"7 days"` or `"7d"`
`models`
Fallback models tried after the primary model's providers are exhausted. Accepts model slugs, `provider/model` format, and `"auto"`. With fallbacks configured, the system tries `gpt-4o` first (across all suitable providers), then `claude-sonnet-4-20250514`, then auto-selects from remaining models.
`providers`
Whitelist of providers to consider. When set, only these providers are used for routing. Omit to allow all providers.
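The fallback chain described above (`gpt-4o`, then `claude-sonnet-4-20250514`, then auto-select) might be configured like this (a sketch; only the `model` and `routing.models` fields are documented here):

```json
{
  "model": "gpt-4o",
  "routing": {
    "models": ["claude-sonnet-4-20250514", "auto"]
  }
}
```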
How It Works
1. Feature Detection
When your request arrives, the routing plugin scans it and builds a set of required features based on what you're using:

| You send | Required feature |
|---|---|
| `stream: true` | `stream` |
| `tools` with functions | `tools.function_calling` |
| `tools` with web search | `tools.web_search` |
| `tool_choice: "required"` | `tool_choice.required` |
| `text.format.type: "json_schema"` | `text.format.json_schema` |
| `reasoning.effort: "high"` | `reasoning.effort.high` |
| `temperature` | `temperature` |
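For instance, per the table above, a request like the following would require `stream`, `tools.function_calling`, and `tool_choice.required` (the tool definition shape is an illustrative assumption):

```json
{
  "model": "auto",
  "stream": true,
  "tool_choice": "required",
  "tools": [
    {
      "type": "function",
      "name": "get_weather",
      "parameters": {
        "type": "object",
        "properties": { "city": { "type": "string" } }
      }
    }
  ]
}
```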
2. Provider Selection
Providers matching all required features are sorted by your chosen metric. For live metrics (latency, uptime, etc.), the system also factors in:
- Prompt cache affinity — providers where you have active cached tokens are prioritized
- Feature uptime — providers whose success rate for any required feature drops below 90% are excluded
3. Fallback & Retry
If a provider fails, the system automatically tries the next provider in the sorted list.
4. Feature Degradation
If no provider supports all requested features, the system gracefully strips less important features to find a match. Features are stripped in this priority (least important first):

- Cache identity (`prompt_cache_key`, `prompt_cache_retention`)
- Output verbosity control (`text.verbosity`)
- Response metadata includes (`include.*`)
- Custom tools (`tools.custom_tools`)
- Parallel tool calls (`parallel_tool_calls`)
- Sampling parameters (`top_p`, `temperature`)
- Reasoning effort (`reasoning.effort.*`)
- Web search (`tools.web_search`)
- Tool choice controls (`tool_choice.*`)
- Structured output (`text.format.json_schema`, `text.format.json_object`)
- Function calling (`tools.function_calling`)
- Streaming (`stream`)
Core capabilities like streaming and function calling are stripped last, meaning the system will exhaust all other options before degrading these.
Examples
Cost-Optimized with Fallbacks
For high-volume workloads where you want the cheapest option with resilience:
Performance-Optimized
For complex reasoning or code generation:
Latency-Optimized
For real-time chat or interactive applications:
Provider-Restricted
Limit routing to specific providers (e.g., for compliance):
Response Information
The response includes which provider and model were selected:
Error Handling
When all providers are exhausted, the API returns the last provider's error. Common scenarios:

| Status | Meaning |
|---|---|
| 424 | All providers failed (provider errors) |
| 429 | All providers rate-limited |
| 422 | ZDR enabled but no ZDR-supporting providers available for the requested features |
To make routing more resilient:

- Add fallback models via `routing.models`
- Use broader provider pools (don't restrict `routing.providers` unnecessarily)
- Use `metric: "performance"` (default) for the most stable behavior — it uses static ranking and doesn't depend on live metrics availability
Best Practices
Match metric to use case
"cost": Content generation, summarization, simple Q&A"performance"(default): Complex reasoning, code generation, analysis"p50_latency": Real-time chat, interactive applications"uptime": Mission-critical production workloads
Use appropriate intervals
- 15 minutes (default): Most reactive to provider issues
- 1 hour: Good balance of stability and responsiveness
- 24 hours: Stable, long-term patterns
- 7 days+: Historical trends, less reactive to spikes
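Following the guidance above, a sketch pairing a live metric with a longer window:

```json
{
  "model": "auto",
  "routing": {
    "metric": "p90_latency",
    "interval": "24h"
  }
}
```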
Configure fallback models
Add `routing.models` for critical workloads. If the primary model's providers all fail, the system automatically tries fallbacks. Placing `"auto"` last gives the system maximum flexibility as a final fallback.
Monitor selected providers
Track which providers are being selected over time:
Combine with max_output_tokens
Set token limits to control costs even with auto routing:
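For example (a sketch; `max_output_tokens` and `routing` appear on this page, while the `input` field is an assumption):

```json
{
  "model": "auto",
  "max_output_tokens": 512,
  "routing": { "metric": "cost" },
  "input": "Summarize this article."
}
```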
Related Documentation
Create Response
Main endpoint documentation
Supported Models
View all available models
Error Handling
Handle routing failures
Request Parameters
Full parameter reference