
Overview

This page provides a comprehensive reference for all parameters you can use when creating responses. Parameters are organized by category for easy navigation.

Required Parameters

model
string, required

The AI model to use for generating the response.
Format Options:
  • Model name only: "gpt-5.2" - Automatic provider routing
  • Provider-prefixed: "openai/gpt-5.2" - Specific provider
  • Auto routing: "auto" - Intelligent selection based on criteria
Examples:
{
  "model": "gpt-5.2" // Automatic provider routing
}
{
  "model": "anthropic/claude-opus-4-5" // Specific provider
}
{
  "model": "auto", // Auto routing
  "routing": {
    "strategy": "min",
    "metric": "cost"
  }
}
See the Model Fortress in the app for the full list of available models.
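
For orientation, the sketch below shows one way to post these parameters from Python. The /v1/responses path comes from this reference, but the base URL, Bearer auth header, and environment variable name are illustrative assumptions; adjust them to match your account.

# Minimal create-response sketch. Base URL, auth scheme, and env var name are
# assumptions for illustration; the request body fields match this reference.
import os
import requests

API_URL = "https://api.concentrate.ai/v1/responses"  # hypothetical base URL
HEADERS = {
    "Authorization": f"Bearer {os.environ['CONCENTRATE_API_KEY']}",  # assumed auth scheme
    "Content-Type": "application/json",
}

payload = {
    "model": "gpt-5.2",                        # automatic provider routing
    "input": "What is the capital of France?",
    "max_output_tokens": 200,                  # cap output to avoid surprise costs
}

resp = requests.post(API_URL, headers=HEADERS, json=payload, timeout=60)
resp.raise_for_status()
print(resp.json())  # full response object; see the Create Response page for its shape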

input
string | array, required

The input to send to the model. Can be either a simple string or an array of message/tool objects for conversations.
String Format:
{
  "input": "What is the capital of France?"
}
Conversation Format:
{
  "input": [
    {
      "role": "system",
      "content": "You are a helpful assistant specialized in geography."
    },
    {
      "role": "user",
      "content": "What is the capital of France?"
    }
  ]
}
Array Item Types:
The input array can contain the following types of objects:

1. Message Objects
Standard conversation messages:
  • type (optional): “message” (default)
  • role (required): “user”, “assistant”, “system”, or “developer”
  • content (required): String or array of content blocks (e.g., [{ "type": "input_text", "text": "..." }] or [{ "type": "input_image", "image_url": "..." }]). See Multi-Modal Inputs for image support.
  • cache_control (optional): Cache control settings (see Prompt Caching)
2. Function Call Objects
Used when the model calls a tool and you need to continue the conversation:
{
  "type": "function_call",
  "call_id": "call_abc123",
  "name": "get_weather",
  "arguments": "{\"location\": \"San Francisco, CA\"}",
  "status": "completed"
}
Properties:
  • type (required): “function_call”
  • call_id (required): Unique identifier for this function call
  • name (required): Function name that was called
  • arguments (required): JSON string of the function arguments
  • status (optional): “completed”, “in_progress”, or “incomplete”
  • cache_control (optional): Cache control settings
3. Function Call Output Objects
Used to send the result of a function call back to the model:
{
  "type": "function_call_output",
  "call_id": "call_abc123",
  "output": "{\"temperature\": 72, \"conditions\": \"sunny\"}",
  "is_error": false
}
Properties:
  • type (required): “function_call_output”
  • call_id (required): Must match the call_id from the function_call
  • output (required): String or array containing the function result
  • is_error (optional): Boolean indicating if the function execution failed
Multi-Turn Tool Calling Example:
{
  "model": "gpt-5.2",
  "input": [
    {
      "role": "user",
      "content": "What's the weather in San Francisco?"
    },
    {
      "type": "function_call",
      "call_id": "call_abc123",
      "name": "get_weather",
      "arguments": "{\"location\": \"San Francisco, CA\"}"
    },
    {
      "type": "function_call_output",
      "call_id": "call_abc123",
      "output": "{\"temperature\": 72, \"conditions\": \"sunny\"}"
    }
  ],
  "tools": [...]
}
See Tool Calling Guide for complete workflow examples.
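
The request above is the second turn of a tool-calling exchange. Sketched end to end in Python (same assumed base URL and auth header as earlier; the response is assumed to expose tool calls as output items of type "function_call", so confirm the exact shape in the Tool Calling Guide):

import json
import os
import requests

API_URL = "https://api.concentrate.ai/v1/responses"  # hypothetical base URL
HEADERS = {"Authorization": f"Bearer {os.environ['CONCENTRATE_API_KEY']}"}  # assumed auth scheme

tools = [{
    "type": "function",
    "name": "get_weather",
    "description": "Get current weather for a location",
    "parameters": {
        "type": "object",
        "properties": {"location": {"type": "string"}},
        "required": ["location"],
    },
}]

conversation = [{"role": "user", "content": "What's the weather in San Francisco?"}]
first = requests.post(API_URL, headers=HEADERS,
                      json={"model": "gpt-5.2", "input": conversation, "tools": tools}).json()

# Assumed response shape: tool calls surface as output items of type "function_call".
for item in first.get("output", []):
    if item.get("type") == "function_call":
        args = json.loads(item["arguments"])
        result = {"temperature": 72, "conditions": "sunny"}   # in practice: get_weather(**args)
        # Echo the call, then append its result, exactly as documented above.
        conversation.append({"type": "function_call", "call_id": item["call_id"],
                             "name": item["name"], "arguments": item["arguments"]})
        conversation.append({"type": "function_call_output", "call_id": item["call_id"],
                             "output": json.dumps(result)})

final = requests.post(API_URL, headers=HEADERS,
                      json={"model": "gpt-5.2", "input": conversation, "tools": tools}).json()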

Output Control Parameters

text
object

Configure the format of the model's text output, including structured output.
Properties:
  • format (required): Object controlling the output format
    • type (required): "text" | "json_schema" | "json_object"
    • name (required for json_schema): Schema name
    • schema (required for json_schema): JSON Schema object
    • description (optional): Description of the expected output
    • strict (optional): Enable strict schema enforcement
Example:
{
  "model": "gpt-5.2",
  "input": "Extract the person's name and age",
  "text": {
    "format": {
      "type": "json_schema",
      "name": "person",
      "schema": {
        "type": "object",
        "properties": {
          "name": { "type": "string" },
          "age": { "type": "integer" }
        },
        "required": ["name", "age"],
        "additionalProperties": false
      }
    }
  }
}
See Structured Output for complete documentation and examples.
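
Built from Python, the same request looks like the sketch below (assumed base URL and auth as before). The model's text output is a JSON string that matches the schema; where that string lives in the response object is covered on the Create Response page, so the parsing step is left as a comment.

import json
import os
import requests

API_URL = "https://api.concentrate.ai/v1/responses"  # hypothetical base URL
HEADERS = {"Authorization": f"Bearer {os.environ['CONCENTRATE_API_KEY']}"}  # assumed auth scheme

person_schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
    "additionalProperties": False,
}

payload = {
    "model": "gpt-5.2",
    "input": "Extract the person's name and age: Ada Lovelace, age 36.",
    "text": {
        "format": {
            "type": "json_schema",
            "name": "person",
            "schema": person_schema,
            "strict": True,          # reject outputs that drift from the schema
        }
    },
}

resp = requests.post(API_URL, headers=HEADERS, json=payload, timeout=60).json()
# Locate the output text in resp (see Create Response), then parse it with json.loads().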

max_output_tokens
integer

Maximum number of tokens to generate in the response.
Important Notes:
  • If not specified, uses the model’s default limit or your credit limit (whichever is lower)
  • Highly recommended to set this to avoid unexpectedly long and expensive responses
  • Different models have different maximum output token limits
Examples:
{
  "model": "gpt-5.2",
  "input": "Write a short story",
  "max_output_tokens": 500
}
Model Limits:

Model                Max Output Tokens
GPT-5.2              16,384
Claude Opus 4.5      16,384
Claude Sonnet 4.5    16,384
Gemini 2.5 Pro       65,536
o1                   100,000
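
Since these ceilings vary widely, one defensive pattern is to clamp max_output_tokens client-side against the table above. The dict below simply restates that table; the model slugs are illustrative, so check the Model Fortress for the exact names.

# Output ceilings copied from the table above; slugs are illustrative.
MAX_OUTPUT_TOKENS = {
    "gpt-5.2": 16_384,
    "claude-opus-4-5": 16_384,
    "claude-sonnet-4-5": 16_384,
    "gemini-2.5-pro": 65_536,
    "o1": 100_000,
}

def clamp_output_tokens(model: str, requested: int, fallback: int = 16_384) -> int:
    """Never ask a model for more output tokens than it can produce."""
    return min(requested, MAX_OUTPUT_TOKENS.get(model, fallback))

payload = {
    "model": "gpt-5.2",
    "input": "Write a short story",
    "max_output_tokens": clamp_output_tokens("gpt-5.2", 500),  # stays at 500 here
}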

Sampling Parameters

These parameters control the randomness and creativity of model outputs.

temperature
number

Controls randomness in the output. Range: 0.0 to 2.0
Values:
  • 0.0 - 0.3: Very focused and deterministic
    • Use for: Code generation, factual tasks, data extraction
  • 0.4 - 0.7: Balanced creativity and coherence
    • Use for: General conversation, Q&A, explanations
  • 0.8 - 1.2: Creative and varied
    • Use for: Creative writing, brainstorming, storytelling
  • 1.3 - 2.0: Highly random and experimental
    • Use for: Highly creative tasks, unconventional ideas
Examples:
// Factual, deterministic output
{
  "model": "gpt-5.2",
  "input": "Write a function to sort an array",
  "temperature": 0.2
}
// Creative writing
{
  "model": "claude-opus-4-5",
  "input": "Write a short story about a robot",
  "temperature": 0.9
}
Temperatures above 1.5 can produce incoherent or nonsensical outputs. Use with caution.

top_p
number

Nucleus sampling parameter. Range: 0.0 to 1.0
How it works:
  • Controls diversity by limiting token selection to the top probability mass
  • Alternative to temperature for controlling randomness
  • Lower values = more focused, higher values = more diverse
Recommended Usage:
  • 0.1 - 0.3: Very focused outputs
  • 0.4 - 0.7: Balanced outputs
  • 0.8 - 1.0: Diverse outputs
Example:
{
  "model": "gpt-5.2",
  "input": "Suggest product names",
  "top_p": 0.9 // More diverse suggestions
}
Generally, adjust only one of temperature or top_p at a time, not both. If you specify both, behavior varies by model; temperature usually takes precedence.

Streaming

stream
boolean, default: false

Enable real-time streaming of the response using Server-Sent Events (SSE).
When to use:
  • ✅ Chat interfaces
  • ✅ Long-form content generation
  • ✅ When user experience matters
  • ✅ Progressive display of results
When not to use:
  • ❌ Batch processing
  • ❌ API integrations where full response is needed
  • ❌ Simple programmatic tasks
Example:
{
  "model": "gpt-5.2",
  "input": "Write a long essay on AI",
  "stream": true
}
See Streaming Documentation for complete implementation details.
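
A rough Python sketch of consuming the stream (same assumed base URL and auth as earlier; the exact event payloads and end-of-stream sentinel are defined in the Streaming documentation, so treat the parsing below as a placeholder):

import json
import os
import requests

API_URL = "https://api.concentrate.ai/v1/responses"  # hypothetical base URL
HEADERS = {"Authorization": f"Bearer {os.environ['CONCENTRATE_API_KEY']}"}  # assumed auth scheme

payload = {"model": "gpt-5.2", "input": "Write a long essay on AI", "stream": True}

with requests.post(API_URL, headers=HEADERS, json=payload, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        # SSE frames arrive as "data: {...}" lines separated by blank lines.
        if not line or not line.startswith("data:"):
            continue
        data = line[len("data:"):].strip()
        if data == "[DONE]":          # assumed end-of-stream sentinel; confirm in the Streaming docs
            break
        event = json.loads(data)
        print(event)                  # handle delta events as described in the Streaming docs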

Advanced Features

reasoning
object

Enable and configure reasoning mode for models that support it (e.g., o1, command-a-reasoning).
Properties:
  • effort (required): “low” | “medium” | “high” - Amount of reasoning effort to apply
Example:
{
  "model": "openai/o1",
  "input": "Solve this complex math problem: ...",
  "reasoning": {
    "effort": "high"
  }
}
Effort Levels:
  • low: Basic reasoning, faster, lower cost
  • medium: Balanced reasoning and speed
  • high: Deep reasoning, slower, higher cost (more reasoning tokens)
Reasoning tokens are counted separately in usage statistics and may be priced differently than regular output tokens.

routing
object

Routing configuration for provider selection, fallback models, and optimization.
Properties:
routing.metric
string, default: "performance"

How providers are sorted and selected:
Static Metrics:
  • "cost" — Sort by provider pricing (cheapest first)
  • "performance" — Sort by quality/revenue-share tier (best first, default)
Live Metrics (from Redis over the configured interval):
Latency:
  • "avg_latency" — Average response time
  • "min_latency", "max_latency" — Min/max response time
  • "p50_latency", "p90_latency", "p99_latency" — Percentile latencies
  • "avg_e2e_latency", "min_e2e_latency", "max_e2e_latency" — End-to-end latency including overhead
You can use any percentile from p0 to p100 for both latency and e2e_latency:
  • Format: "p75_latency", "p85_latency", "p50_e2e_latency", "p99_e2e_latency", etc.
Reliability & Volume:
  • "uptime" — Provider availability percentage
  • "throughput" — Requests per second
  • "total_requests" — Total request volume
Token Metrics:
  • "input_tokens", "output_tokens", "total_tokens" — Average token counts
routing.interval
string, default: "15 minutes"

Time window for live metric calculation. Ignored for static metrics (cost, performance).
Format: "number unit" (e.g., "15 minutes") or shorthand "number + unit letter" (e.g., "15m")
Valid Units:
  • minutes or m — “15 minutes”, “30 minutes”, “15m”, “30m”
  • hours or h — “1 hour”, “6 hours”, “24 hours”, “1h”, “6h”, “24h”
  • days or d — “7 days”, “30 days”, “7d”, “30d”
  • weeks or w — “1 week”, “4 weeks”, “1w”, “4w”
  • years or y — “1 year”, “1y”
Minimum interval is 15 minutes.
routing.models
array
Fallback models tried after the primary model’s providers are exhausted. Accepts model slugs, provider/model format, and "auto".
{ "routing": { "models": ["claude-sonnet-4-20250514", "auto"] } }
routing.providers
array
Whitelist of providers. When set, only these providers are considered. Omit to allow all.
{ "routing": { "providers": ["openai", "azure"] } }
Complete Examples:
// Optimize for cost with fallback models
{
  "model": "gpt-4o",
  "input": "Summarize this text",
  "routing": {
    "metric": "cost",
    "models": ["gemini-2.0-flash"]
  }
}
// Performance-optimized (default behavior)
{
  "model": "auto",
  "input": "Complex analysis task",
  "routing": {
    "metric": "performance",
    "interval": "1 hour"
  }
}
// Low-latency with provider restriction
{
  "model": "auto",
  "input": "Quick question",
  "routing": {
    "metric": "p99_latency",
    "interval": "15 minutes",
    "providers": ["openai", "anthropic"]
  }
}
See Routing Documentation for the full guide.

Guardrails (API Key Policy)

Guardrails are configured at the API key level, not in the /v1/responses request body.
Configure them on your API key from the Guardrails page in the dashboard UI; no additional request parameter is needed.
See Guardrails & Redaction for setup and behavior.

Tool Calling

tools
array

Array of tools the model can call. Each tool is a function definition with a JSON Schema.
Tool Definition:
  • type (required): “function” - Type of tool
  • name (required): string - Function name (alphanumeric, underscores, dots, hyphens)
  • description (optional): string - What the function does
  • parameters (required): object - JSON Schema for function parameters
  • strict (optional): boolean - Enable strict schema validation (default: true)
  • cache_control (optional): object - Cache this tool definition (ephemeral, 5m or 1h TTL)
Example:
{
  "tools": [
    {
      "type": "function",
      "name": "get_weather",
      "description": "Get current weather for a location",
      "parameters": {
        "type": "object",
        "properties": {
          "location": {
            "type": "string",
            "description": "City and state, e.g. San Francisco, CA"
          },
          "unit": {
            "type": "string",
            "enum": ["celsius", "fahrenheit"]
          }
        },
        "required": ["location"]
      }
    }
  ]
}
See Tool Calling Guide for complete examples.

tool_choice
string | object

Control which tools the model uses.
Modes:
  • "none" - Don’t use any tools
  • "auto" - Let model decide (default)
  • "required" - Force model to use at least one tool
  • { "type": "function", "name": "tool_name" } - Force specific tool
  • { "type": "allowed_tools", "mode": "auto", "tools": [...] } - Limit to specific tools
Examples:
// Auto mode (default)
{ "tool_choice": "auto" }

// Force specific tool
{
  "tool_choice": {
    "type": "function",
    "name": "get_weather"
  }
}

// Allowed tools
{
  "tool_choice": {
    "type": "allowed_tools",
    "mode": "required",
    "tools": [
      { "type": "function", "name": "get_weather" },
      { "type": "function", "name": "get_forecast" }
    ]
  }
}

parallel_tool_calls
boolean

Enable the model to call multiple tools in parallel in a single response.
  • true - Model can call multiple tools simultaneously (faster for independent operations)
  • false - Model calls one tool at a time (default for some providers)
When to enable:
  • Multiple independent tool calls (e.g., get weather for multiple cities)
  • No dependencies between tool calls
When to disable:
  • Sequential operations where order matters
  • Tool calls depend on each other’s results
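
Because every function_call item carries its own call_id, handling a parallel batch is just a matter of answering each call individually. A sketch, reusing the assumed response shape from the tool-calling example above:

import json

tools = [{
    "type": "function",
    "name": "get_weather",
    "description": "Get current weather for a location",
    "parameters": {"type": "object",
                   "properties": {"location": {"type": "string"}},
                   "required": ["location"]},
}]

payload = {
    "model": "gpt-5.2",
    "input": [{"role": "user", "content": "Compare the weather in Paris, Tokyo, and NYC"}],
    "tools": tools,
    "parallel_tool_calls": True,    # allow one response to request all three lookups at once
}

def answer_calls(calls, run_tool):
    """Turn a batch of function_call items into matching function_call_output items."""
    items = []
    for call in calls:              # each call: {"call_id": ..., "name": ..., "arguments": ...}
        result = run_tool(call["name"], json.loads(call["arguments"]))
        items.append({"type": "function_call_output",
                      "call_id": call["call_id"],
                      "output": json.dumps(result)})
    return items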

Prompt Caching

cache_control
object

Enable prompt caching for specific messages to reduce costs on repeated prefixes.
Currently supported by:
  • Anthropic provider (Claude models via Anthropic API)
  • AWS Bedrock provider (Claude models via AWS Bedrock)
All other providers will ignore cache_control settings.
Properties:
  • type (required): “ephemeral” - Type of cache
  • ttl (required): “5m” | “1h” - Time-to-live for the cache
How it works:
  • Mark messages that should be cached
  • Subsequent requests with the same prefix will use cached tokens
  • Cached tokens are significantly cheaper than regular input tokens
  • Cache expires after the specified TTL
Example:
{
  "model": "anthropic/claude-opus-4-5",
  "input": [
    {
      "role": "system",
      "content": "Very long system prompt with documentation...",
      "cache_control": {
        "type": "ephemeral",
        "ttl": "5m"
      }
    },
    {
      "role": "user",
      "content": "Question based on the documentation"
    }
  ]
}
Cost Savings:
  • Regular input tokens: Full price
  • Cache write: ~25% more than input tokens (one-time cost)
  • Cache read: ~90% cheaper than input tokens
Best Practices:
  • Only cache substantial prefixes (e.g., over 1000 tokens)
  • Use for repeated system prompts or context
  • Choose TTL based on your usage pattern:
    • "5m" for rapid successive requests
    • "1h" for regular usage over longer periods

Related Pages

  • Create Response: Main endpoint documentation
  • Auto Routing: Intelligent model selection
  • Streaming: Real-time response streaming
  • Multi-Modal: Send images to vision models
  • Structured Output: Force JSON responses matching a schema
  • Prompt Caching: Reduce costs with caching
  • Guardrails & Redaction: API-key-level redaction controls