Create Response
Start a stateful conversation with your model of choice
Overview
The main endpoint for generating AI responses. Supports both streaming and non-streaming modes, with automatic normalization across all providers.
Guardrails
Redaction guardrails are configured on your API key (not in this endpoint body). When enabled, they are applied automatically for requests made with that key. See Guardrails & Redaction.
Body
Text, image, or file inputs to the model, used to generate a response.
Model identifier. Use /v1/models to list all available models. Supports canonical names (e.g. gpt-5.2, claude-opus-4-6), aliases, and provider-prefixed formats (e.g. openai/gpt-5.2). Use "auto" for automatic model selection.
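As a rough illustration of the two parameters described so far, the sketch below builds a minimal request body without sending it. The field names `model` and `input` are assumptions inferred from the descriptions above, since this page does not spell out the parameter names; check them against the live reference before relying on them.

```python
import json


def build_request(model: str, user_text: str) -> dict:
    """Return a minimal request body. Field names are assumptions, not confirmed by this page."""
    return {
        # Canonical name, alias, provider-prefixed name, or "auto" for automatic selection.
        "model": model,
        # Assumed name of the field carrying the text input.
        "input": user_text,
    }


body = build_request("auto", "Summarize the release notes in two sentences.")
print(json.dumps(body, indent=2))
```

The same helper accepts a provider-prefixed identifier such as `openai/gpt-5.2` in place of `"auto"`.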
Specify additional output data to include in the model response.
web_search_call.results, web_search_call.action.sources, message.output_text.logprobs, message.input_image.image_url, reasoning.encrypted_content, file_search_call.results, computer_call_output.output.image_url, code_interpreter_call.outputs
A system (or developer) message inserted into the model's context. When used along with previous_response_id, the instructions from a previous response are not carried over to the next response. This makes it simple to swap out system (or developer) messages in new responses.
An upper bound for the number of tokens that can be generated for a response, including visible output tokens and reasoning tokens.
0 < x <= 9007199254740991
Set of up to 16 key-value pairs that can be attached to an object. This can be useful for storing additional information about the object in a structured format, and querying for objects via API or the dashboard. Keys are strings with a maximum length of 64 characters. Values are strings with a maximum length of 512 characters.
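The metadata limits above (16 pairs, 64-character keys, 512-character values) can be checked on the client before a request is sent. The helper below is a minimal sketch; `validate_metadata` is a hypothetical name, not part of any SDK, and it measures length in Python string characters.

```python
MAX_PAIRS = 16        # maximum number of key-value pairs
MAX_KEY_LEN = 64      # maximum key length in characters
MAX_VALUE_LEN = 512   # maximum value length in characters


def validate_metadata(metadata: dict) -> list:
    """Return a list of human-readable problems; an empty list means the metadata fits the limits."""
    problems = []
    if len(metadata) > MAX_PAIRS:
        problems.append(f"too many pairs: {len(metadata)} > {MAX_PAIRS}")
    for key, value in metadata.items():
        if not isinstance(key, str) or len(key) > MAX_KEY_LEN:
            problems.append(f"invalid key: {key!r}")
        if not isinstance(value, str) or len(value) > MAX_VALUE_LEN:
            problems.append(f"invalid value for key {key!r}")
    return problems


print(validate_metadata({"ticket": "SUP-1042", "team": "billing"}))
```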
Configuration options for reasoning models.
If set to true, the model response data will be streamed to the client as it is generated using server-sent events.
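When streaming is enabled, the client has to split the server-sent event stream into individual events. The sketch below parses the generic SSE wire format (`event:` and `data:` fields, with a blank line ending each event). The event name `example.delta` and the `text` payload field are made up for illustration; this page does not list the actual event types.

```python
import json
from typing import Iterable, Iterator, Tuple


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event_name, data) pairs from an iterable of SSE text lines."""
    event, data = "message", []          # "message" is the SSE default event name
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":                   # blank line: dispatch the buffered event
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
            continue
        if line.startswith(":"):         # comment line, often used as a keep-alive
            continue
        field, _, value = line.partition(":")
        if value.startswith(" "):        # a single leading space after the colon is dropped
            value = value[1:]
        if field == "event":
            event = value
        elif field == "data":
            data.append(value)


# Hypothetical stream contents, for illustration only.
sample = [
    "event: example.delta\n",
    'data: {"text": "Hel"}\n',
    "\n",
    ": keep-alive\n",
    "event: example.delta\n",
    'data: {"text": "lo"}\n',
    "\n",
]
events = list(parse_sse(sample))
text = "".join(json.loads(payload)["text"] for _, payload in events)
print(text)
```

In practice the lines would come from an HTTP client that iterates over the response body as it arrives, rather than from a list.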
What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
0 <= x <= 2
Configuration options for a text response from the model. Can be plain text or structured JSON data.
An array of tools the model may call while generating a response. You can specify which tool to use by setting the tool_choice parameter.
An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
0 <= x <= 1
Whether to allow the model to run tool calls in parallel.
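To make the top_p description above concrete, the sketch below applies the textbook form of nucleus filtering to a toy distribution: tokens are ranked by probability and kept until their cumulative mass reaches top_p. The token probabilities are invented, and individual providers may implement the cutoff slightly differently.

```python
def nucleus(probs: dict, top_p: float) -> list:
    """Return the highest-probability tokens whose cumulative mass first reaches top_p."""
    if not 0 <= top_p <= 1:
        raise ValueError("top_p must satisfy 0 <= x <= 1")
    kept, mass = [], 0.0
    for token, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append(token)
        mass += p
        if mass >= top_p:    # stop once the kept tokens cover the requested mass
            break
    return kept


# Toy next-token distribution, for illustration only.
toy = {"the": 0.5, "a": 0.3, "an": 0.15, "this": 0.05}
print(nucleus(toy, 0.1))   # only the single most likely token survives
print(nucleus(toy, 0.9))   # three tokens are needed to cover 90% of the mass
```

With a low top_p the candidate set collapses to the most likely token, which is why lowering top_p and lowering temperature have overlapping effects and are usually not adjusted together.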
The unique ID of the previous response to the model. Use this to create multi-turn conversations. Cannot be used in conjunction with conversation. Concentrate enables this for all models. Request logging must be enabled for this parameter to work.
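A multi-turn exchange chains requests through previous_response_id, and because instructions are not carried over, they have to be resent on every turn where they should apply. The sketch below only builds request bodies. The field names `model`, `input`, and `instructions` are assumptions inferred from the descriptions on this page, and `resp_example_123` is a placeholder ID.

```python
from typing import Optional


def build_turn(
    user_text: str,
    previous_response_id: Optional[str] = None,
    instructions: Optional[str] = None,
    conversation: Optional[str] = None,
) -> dict:
    """Build a request body for one turn. Field names other than previous_response_id are assumed."""
    if previous_response_id is not None and conversation is not None:
        # The reference states that these two parameters cannot be combined.
        raise ValueError("previous_response_id cannot be combined with conversation")
    body = {"model": "auto", "input": user_text}
    if previous_response_id is not None:
        body["previous_response_id"] = previous_response_id
    if instructions is not None:
        # Instructions are not inherited from the previous response, so resend them.
        body["instructions"] = instructions
    return body


first = build_turn("What is nucleus sampling?", instructions="Answer in one sentence.")
second = build_turn(
    "Give an example.",
    previous_response_id="resp_example_123",  # placeholder for the ID returned by the first call
    instructions="Answer in one sentence.",
)
print(second)
```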
Used to cache responses for similar requests to optimize your cache hit rates. Replaces the user field. If prompt_cache_key or user is not set, Concentrate will automatically add a prompt cache key based on your API key.
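The fallback order described above can be pictured as a small resolution function: an explicit prompt_cache_key, then the user field, then a key derived from the API key. This is only a sketch of the documented behavior. Giving prompt_cache_key precedence over user when both are set is an assumption, and the SHA-256 derivation is a stand-in, since the page does not say how Concentrate derives its automatic key.

```python
import hashlib


def resolve_prompt_cache_key(body: dict, api_key: str) -> str:
    """Pick the cache key for a request, mirroring the documented fallback order."""
    if body.get("prompt_cache_key"):
        return body["prompt_cache_key"]
    if body.get("user"):
        # The user field is used as the cache key when no prompt_cache_key is given.
        return body["user"]
    # Hypothetical derivation: the real service's method is not documented here.
    return "auto-" + hashlib.sha256(api_key.encode("utf-8")).hexdigest()[:16]


print(resolve_prompt_cache_key({"prompt_cache_key": "support-bot-v2"}, "sk-example"))
print(resolve_prompt_cache_key({"user": "tenant-42"}, "sk-example"))
```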
The retention policy for the prompt cache. Set to 24h to enable extended prompt caching, which keeps cached prefixes active for longer, up to a maximum of 24 hours. Has no effect on explicit caching, which must be set through cache_control.
in-memory, in_memory, 24h
An integer between 0 and 20 specifying the number of most likely tokens to return at each token position, each with an associated log probability.
0 <= x <= 20
Whether to run the model response in the background. Unsupported, but included for compatibility.
Configuration for how the model's context window is managed during a response, such as automatic compaction of older turns. Currently unsupported.
The conversation that this response belongs to. Items from this conversation are prepended to input_items for this response request. Cannot be used in conjunction with previous_response_id. Currently unsupported, but included for compatibility. Use previous_response_id instead.
The maximum number of total calls to built-in tools that can be processed in a response. This maximum number applies across all built-in tool calls, not per individual tool. Any further attempts to call a tool by the model will be ignored.
0 < x <= 9007199254740991
Reference to a prompt template and its variables. Currently unsupported, but included for compatibility.
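The limit on built-in tool calls described above is a single budget shared by all tools, not one budget per tool. The sketch below models that rule on a list of attempted calls; `apply_tool_call_limit` is a hypothetical helper that illustrates the rule and does not reflect how the service implements it.

```python
from typing import List, Tuple


def apply_tool_call_limit(attempted: List[str], max_tool_calls: int) -> Tuple[List[str], List[str]]:
    """Split attempted built-in tool calls into (processed, ignored) under one shared budget."""
    if max_tool_calls <= 0:
        raise ValueError("max_tool_calls must be greater than 0")
    # The budget counts every call regardless of which tool it targets.
    return attempted[:max_tool_calls], attempted[max_tool_calls:]


attempted = ["web_search", "file_search", "web_search", "code_interpreter"]
processed, ignored = apply_tool_call_limit(attempted, 3)
print(processed)
print(ignored)
```

Here the fourth call is ignored even though it is the first use of that tool, because the three earlier calls already exhausted the shared budget.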
A stable identifier used to help detect users of your application that may be violating usage policies. Each ID should be a string that uniquely identifies one user. We recommend hashing the username or email address to avoid sending any identifying information. Unsupported, as Concentrate reserves this field.
Specifies the processing type used for serving the request. Determines the pricing and performance tier used to process the request. When not set, the default behavior is auto. Currently unsupported, but included for compatibility.
auto, default, flex, scale, priority
Whether to store the generated model response for later retrieval via API.
Options for streaming responses. Only set this when you set stream: true.
The truncation strategy to use for the model response. auto: if the input exceeds the model's context window size, the model truncates the response by dropping items from the beginning of the conversation. disabled (default): if the input size exceeds the context window size for a model, the request fails with a 400 error. Currently unsupported, but included for compatibility.
auto, disabled
This field is being replaced by safety_identifier and prompt_cache_key. We recommend using prompt_cache_key instead to maintain caching optimizations. A stable identifier for your end-users. Used to boost cache hit rates by better bucketing similar requests and to help detect and prevent abuse. Using this as a safety identifier has no effect, but if provided, this value is used for prompt_cache_key instead of a value based on your API key.
Concentrate routing configuration controlling how requests are routed across models and providers. Learn more about routing.
Response
Default Response
response
Set of up to 16 key-value pairs that can be attached to an object. This can be useful for storing additional information about the object in a structured format, and querying for objects via API or the dashboard. Keys are strings with a maximum length of 64 characters. Values are strings with a maximum length of 512 characters.
What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p but not both.
0 <= x <= 2
An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. We generally recommend altering this or temperature but not both.
0 <= x <= 1
-9007199254740991 <= x <= 9007199254740991
-9007199254740991 <= x <= 9007199254740991
-9007199254740991 <= x <= 9007199254740991
-9007199254740991 <= x <= 9007199254740991
Reference to a prompt template and its variables. Currently unsupported, but included for compatibility.
The retention policy for the prompt cache. Set to 24h to enable extended prompt caching, which keeps cached prefixes active for longer, up to a maximum of 24 hours. Has no effect on explicit caching, which must be set through cache_control.
in-memory, in_memory, 24h
Configuration options for reasoning models.
Specifies the processing type used for serving the request. Determines the pricing and performance tier used to process the request. When not set, the default behavior is auto. Currently unsupported, but included for compatibility.
auto, default, flex, scale, priority
The status of the item. One of in_progress, completed, or incomplete.
completed, in_progress, incomplete
Configuration options for a text response from the model. Can be plain text or structured JSON data.
An integer between 0 and 20 specifying the number of most likely tokens to return at each token position, each with an associated log probability.
0 <= x <= 20
The truncation strategy to use for the model response. auto: if the input exceeds the model's context window size, the model truncates the response by dropping items from the beginning of the conversation. disabled (default): if the input size exceeds the context window size for a model, the request fails with a 400 error. Currently unsupported, but included for compatibility.
auto, disabled