Overview

The routing system automatically selects the best provider and model for every request based on the features your request requires, real-time metrics, and your optimization preferences. It supports:
  • Auto model selection — set model: "auto" and let the system choose
  • Provider routing — specify a model and let the system pick the best provider
  • Multi-provider fallback — if one provider fails, the system retries on the next
  • Feature degradation — if no provider supports all requested features, less important features are gracefully stripped
  • ZDR enforcement — when Zero Data Retention is enabled, only ZDR-supporting providers are considered

Basic Usage

Auto Model Selection

Set model: "auto" to let the system select both the model and provider:
curl https://api.concentrate.ai/v1/responses \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "auto",
    "input": "Explain machine learning"
  }'
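The same request in Python, as a minimal sketch using the third-party requests library (the library choice is an assumption, not part of the API):
import requests

# Sketch: POST the same auto-selection request shown in the curl example.
resp = requests.post(
    "https://api.concentrate.ai/v1/responses",
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer YOUR_API_KEY",
    },
    json={"model": "auto", "input": "Explain machine learning"},
)
print(resp.json()["model"])  # the provider/model the router selected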

Provider Routing (Pinned Model)

Specify a model without a provider prefix and the system routes to the best provider:
{
  "model": "gpt-4o",
  "input": "Hello",
  "routing": {
    "metric": "cost"
  }
}
This selects the cheapest provider for GPT-4o (e.g., OpenAI vs Azure) based on real-time data.

Pinned Provider

Use provider/model format to pin a specific provider. Routing still provides fallback to other providers for the same model if the pinned one fails:
{
  "model": "openai/gpt-4o",
  "input": "Hello"
}

Routing Configuration

Control routing behavior with the routing parameter:
routing.metric
string
default:"performance"
How providers are sorted and selected.
Static Metrics:
  • "cost" — Sort by provider pricing (cheapest first)
  • "performance" — Sort by quality/revenue-share tier (best first)
Live Metrics (from Redis):
  • "avg_latency" — Average response time
  • "min_latency", "max_latency" — Min/max response time
  • "p50_latency", "p90_latency", "p99_latency" — Percentile latencies
  • "avg_e2e_latency", "min_e2e_latency", "max_e2e_latency" — End-to-end latency including overhead
  • Any percentile from p0 to p100: "p75_latency", "p95_e2e_latency", etc.
  • "uptime" — Provider availability
  • "throughput" — Requests per second
  • "total_requests" — Total request volume
  • "input_tokens", "output_tokens", "total_tokens" — Average token counts
routing.interval
string
default:"15 minutes"
Time window for live metric calculation. Only applies when using live metrics (latency, uptime, etc.); static metrics (cost, performance) ignore this.
Format: "number unit" or "number<shorthand>"
  • "15 minutes" or "15m" (default, minimum)
  • "1 hour" or "1h"
  • "24 hours" or "24h"
  • "7 days" or "7d"
Valid units: minutes/m, hours/h, days/d, weeks/w, years/y
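To make the format concrete, here is a small illustrative parser for these interval strings (a sketch only; the service's actual parsing is internal):
import re
from datetime import timedelta

# Illustrative parser for routing.interval strings such as "15 minutes",
# "15m", "1 hour", or "7 days". Not the service's actual implementation.
UNITS = {
    "m": "minutes", "minute": "minutes", "minutes": "minutes",
    "h": "hours", "hour": "hours", "hours": "hours",
    "d": "days", "day": "days", "days": "days",
    "w": "weeks", "week": "weeks", "weeks": "weeks",
}

def parse_interval(value):
    match = re.fullmatch(r"(\d+)\s*([a-zA-Z]+)", value.strip())
    if match is None:
        raise ValueError(f"invalid interval: {value!r}")
    number, unit = int(match.group(1)), match.group(2).lower()
    if unit in ("y", "year", "years"):
        return timedelta(days=365 * number)  # rough approximation of years
    if unit not in UNITS:
        raise ValueError(f"unknown unit: {unit!r}")
    return timedelta(**{UNITS[unit]: number})

print(parse_interval("15m"))       # 0:15:00
print(parse_interval("24 hours"))  # 1 day, 0:00:00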
routing.models
array
Fallback models tried after the primary model’s providers are exhausted. Accepts model slugs, provider/model format, and "auto".
{
  "model": "gpt-4o",
  "routing": {
    "models": ["claude-sonnet-4-20250514", "auto"]
  }
}
The system tries gpt-4o first (across all suitable providers), then claude-sonnet-4-20250514, then auto-selects from remaining models.
routing.providers
array
Whitelist of providers to consider. When set, only these providers are used for routing. Omit to allow all providers.
{
  "model": "gpt-4o",
  "routing": {
    "providers": ["openai", "azure"]
  }
}

How It Works

1. Feature Detection

When your request arrives, the routing plugin scans it and builds a set of required features based on what you’re using:
You send                          Required feature
stream: true                      stream
tools with functions              tools.function_calling
tools with web search             tools.web_search
tool_choice: "required"           tool_choice.required
text.format.type: "json_schema"   text.format.json_schema
reasoning.effort: "high"          reasoning.effort.high
temperature                       temperature
These features are encoded into a bitmask for fast matching against a pre-built index of every model/provider combination and their capabilities.
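As a rough illustration of that matching step (the bit assignments here are hypothetical; the real layout is internal to the plugin):
# Hypothetical bit assignments for the features listed above.
FEATURE_BITS = {
    "stream": 1 << 0,
    "tools.function_calling": 1 << 1,
    "tools.web_search": 1 << 2,
    "tool_choice.required": 1 << 3,
    "text.format.json_schema": 1 << 4,
    "reasoning.effort.high": 1 << 5,
    "temperature": 1 << 6,
}

def encode(features):
    mask = 0
    for name in features:
        mask |= FEATURE_BITS[name]
    return mask

required = encode(["stream", "tools.function_calling"])
provider = encode(["stream", "tools.function_calling", "temperature"])

# A provider qualifies when it supports every required feature,
# i.e. all required bits are present in its capability mask.
print((required & provider) == required)  # True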

2. Provider Selection

Providers matching all required features are sorted by your chosen metric. For live metrics (latency, uptime, etc.), the system also factors in:
  • Prompt cache affinity — providers where you have active cached tokens are prioritized
  • Feature uptime — providers whose success rate for any required feature drops below 90% are excluded
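A simplified sketch of this selection step (the data shapes and field names are hypothetical; the 90% threshold comes from the rule above):
def select_providers(candidates, metric, required_mask, cached_providers):
    # Hypothetical shape: each candidate has a capability bitmask,
    # per-required-feature success rates, and live metric values.
    eligible = [
        p for p in candidates
        if (p["features"] & required_mask) == required_mask          # all features
        and all(rate >= 0.90 for rate in p["feature_uptime"].values())  # uptime
    ]
    # Prompt cache affinity first (False sorts before True), then the
    # chosen metric; lower is better for latency-style metrics.
    return sorted(
        eligible,
        key=lambda p: (p["name"] not in cached_providers, p["metrics"][metric]),
    )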

3. Fallback & Retry

If a provider fails, the system automatically tries the next provider in the sorted list:
Request -> Provider A (fails) -> Provider B (fails) -> Provider C (success)
This happens transparently. For non-streaming requests, any error triggers a retry. For streaming requests, retries only happen during the connection phase — once streaming begins, the request is committed.
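The retry loop behaves roughly like this sketch (illustrative only, not the gateway's actual implementation):
class ProviderError(Exception):
    """Stand-in for a single provider's failure."""

def route(request, sorted_providers):
    last_error = None
    for provider in sorted_providers:  # pre-sorted by the chosen metric
        try:
            # For streaming requests, only failures before the stream
            # starts reach this handler; once streaming begins, the
            # request is committed to that provider.
            return provider.send(request)
        except ProviderError as err:
            last_error = err
    raise last_error  # all providers exhausted: surface the last error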

4. Feature Degradation

If no provider supports all requested features, the system gracefully strips less important features to find a match. Features are stripped in this priority (least important first):
  1. Cache identity (prompt_cache_key, prompt_cache_retention)
  2. Output verbosity control (text.verbosity)
  3. Response metadata includes (include.*)
  4. Custom tools (tools.custom_tools)
  5. Parallel tool calls (parallel_tool_calls)
  6. Sampling parameters (top_p, temperature)
  7. Reasoning effort (reasoning.effort.*)
  8. Web search (tools.web_search)
  9. Tool choice controls (tool_choice.*)
  10. Structured output (text.format.json_schema, text.format.json_object)
  11. Function calling (tools.function_calling)
  12. Streaming (stream)
Core capabilities like streaming and function calling are stripped last, meaning the system will exhaust all other options before degrading these.
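In pseudocode terms, degradation walks that list and re-checks for a match after each strip (an illustrative sketch with shortened feature names):
# Shortened, illustrative names for the strip order listed above.
STRIP_ORDER = [
    "cache_identity", "text.verbosity", "include", "tools.custom_tools",
    "parallel_tool_calls", "sampling", "reasoning.effort",
    "tools.web_search", "tool_choice", "text.format",
    "tools.function_calling", "stream",
]

def degrade(required_features, has_matching_provider):
    """Strip least-important features until some provider matches."""
    remaining = set(required_features)
    for feature in STRIP_ORDER:
        if has_matching_provider(remaining):
            return remaining  # a provider supports everything that's left
        remaining.discard(feature)  # drop the next least-important feature
    return remaining if has_matching_provider(remaining) else None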

Examples

Cost-Optimized with Fallbacks

For high-volume workloads where you want the cheapest option with resilience:
{
  "model": "auto",
  "input": "Summarize this text",
  "routing": {
    "metric": "cost",
    "models": ["gpt-4o-mini", "gemini-2.0-flash"]
  }
}

Performance-Optimized

For complex reasoning or code generation:
{
  "model": "auto",
  "input": "Design a scalable microservices architecture",
  "routing": {
    "metric": "performance",
    "interval": "1 hour"
  }
}

Latency-Optimized

For real-time chat or interactive applications:
{
  "model": "auto",
  "input": "Quick translation: Hello -> Spanish",
  "routing": {
    "metric": "p50_latency",
    "interval": "15 minutes"
  }
}

Provider-Restricted

Limit routing to specific providers (e.g., for compliance):
{
  "model": "gpt-4o",
  "input": "Process this data",
  "routing": {
    "metric": "performance",
    "providers": ["openai", "azure"]
  }
}

Response Information

The response includes which provider and model were selected:
{
  "id": "resp_xyz789",
  "created_at": 1702934400,
  "status": "completed",
  "model": "anthropic/claude-haiku-4-5",
  "output": [...],
  "usage": {...}
}
Log the model field in responses to understand routing decisions over time.
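Because the field uses provider/model format, the provider is easy to split off for logging:
# The model field is "provider/model", so the provider splits off cleanly:
provider, model_name = "anthropic/claude-haiku-4-5".split("/", 1)
print(provider)    # anthropic
print(model_name)  # claude-haiku-4-5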

Error Handling

When all providers are exhausted, the API returns the last provider’s error. Common scenarios:
Status   Meaning
424      All providers failed (provider errors)
429      All providers rate-limited
422      ZDR enabled but no ZDR-supporting providers available for the requested features
If Zero Data Retention is enabled on your API key but no providers support ZDR for the required model/features, you will receive a 422 Unprocessable Entity error rather than falling back to a non-ZDR provider.
To improve reliability:
  • Add fallback models via routing.models
  • Use broader provider pools (don’t restrict routing.providers unnecessarily)
  • Use metric: "performance" (default) for the most stable behavior — it uses static ranking and doesn’t depend on live metrics availability
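A minimal sketch of reacting to these exhaustion statuses, assuming Python's requests library (the handling shown is one reasonable policy, not a prescribed one):
import requests

resp = requests.post(
    "https://api.concentrate.ai/v1/responses",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={"model": "gpt-4o", "input": "Hello"},
)
if resp.status_code == 424:
    pass  # all providers failed: add routing.models fallbacks and retry
elif resp.status_code == 429:
    pass  # all providers rate-limited: back off before retrying
elif resp.status_code == 422:
    pass  # ZDR unsatisfiable: relax features or widen the provider pool
else:
    resp.raise_for_status()  # raise on any other error status
    data = resp.json()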

Best Practices

Choosing a metric:
  • "cost": Content generation, summarization, simple Q&A
  • "performance" (default): Complex reasoning, code generation, analysis
  • "p50_latency": Real-time chat, interactive applications
  • "uptime": Mission-critical production workloads
Choosing an interval:
  • 15 minutes (default): Most reactive to provider issues
  • 1 hour: Good balance of stability and responsiveness
  • 24 hours: Stable, long-term patterns
  • 7 days+: Historical trends, less reactive to spikes
Add routing.models for critical workloads. If the primary model’s providers all fail, the system automatically tries fallbacks:
{
  "model": "gpt-4o",
  "routing": {
    "models": ["claude-sonnet-4-20250514", "gemini-2.5-pro", "auto"]
  }
}
Placing "auto" last gives the system maximum flexibility as a final fallback.
Track which providers are being selected over time:
selected_providers = {}

for request in requests:
    # make_request is a placeholder that POSTs to /v1/responses and
    # returns the parsed JSON response body.
    response = make_request(request)
    model = response["model"]  # e.g. "openai/gpt-4o"
    selected_providers[model] = selected_providers.get(model, 0) + 1

print(selected_providers)
# {'anthropic/claude-haiku-4-5': 45, 'openai/gpt-4o-mini': 32, ...}
Set token limits to control costs even with auto routing:
{
  "model": "auto",
  "input": "Explain quantum physics",
  "routing": { "metric": "cost" },
  "max_output_tokens": 500
}

Related pages:
  • Create Response: main endpoint documentation
  • Supported Models: view all available models
  • Error Handling: handle routing failures
  • Request Parameters: full parameter reference