Overview

Prompt caching allows you to cache portions of your prompts that are reused across multiple requests, significantly reducing costs and improving performance. This is especially valuable for applications with large system prompts, extensive context, or documentation.
Availability: Currently supported by:
  • Anthropic provider (Claude models via Anthropic API)
  • AWS Bedrock provider (Claude models via AWS Bedrock)
All other providers will ignore cache control settings.

How It Works

Prompt caching works by storing specific message content and reusing it across requests:
  1. First Request: You mark messages with cache_control settings
  2. Cache Write: The API writes those messages to a cache
  3. Subsequent Requests: Identical cached content is retrieved instead of processed
  4. Cache Expiry: Cache expires after the specified time-to-live (TTL)
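As a concrete illustration of this lifecycle, the sketch below builds two request bodies that share the exact same cached system message; only the user message changes, so the second request can be served from the cache written by the first (assuming it is sent before the TTL expires). The helper function is illustrative, not part of the API.

// Illustrative only: two request bodies that reuse the same cached prefix.
// The first request writes the cache; the second, sent within the TTL,
// reads it because the marked system message is byte-for-byte identical.
const cachedSystemMessage = {
  role: "system",
  content: "Very long system prompt with detailed instructions...",
  cache_control: { type: "ephemeral", ttl: "5m" },
};

function buildRequestBody(userMessage) {
  return {
    model: "anthropic/claude-opus-4-5",
    // Reusing the exact same system message object keeps the cached prefix identical.
    input: [cachedSystemMessage, { role: "user", content: userMessage }],
  };
}

const firstRequest = buildRequestBody("First question"); // triggers a cache write
const secondRequest = buildRequestBody("Follow-up question"); // served as a cache read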

Benefits

  • Cache read tokens: ~90% cheaper than regular input tokens
  • Ideal for: Large system prompts, documentation, repeated context
  • Break-even: Typically after 2-3 requests with the same prefix
  • Reduced processing time for cached content
  • Faster response generation
  • Lower latency for requests with cached prefixes
  • Ensures identical system prompts across requests
  • Maintains consistency in multi-turn conversations
  • Simplifies prompt management

Basic Usage

Add cache_control to any message in your input array:
{
  "model": "anthropic/claude-opus-4-5",
  "input": [
    {
      "role": "system",
      "content": "Very long system prompt with detailed instructions...",
      "cache_control": {
        "type": "ephemeral",
        "ttl": "5m"
      }
    },
    {
      "role": "user",
      "content": "User question here"
    }
  ]
}
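To confirm the cache is being used, you can send the request and inspect the response's usage field. The sketch below assumes a standard Authorization: Bearer header and an environment variable named CONCENTRATE_API_KEY, and reads the cached_tokens field referenced in the Troubleshooting section; treat the auth scheme and response shape as assumptions to verify against your own responses.

// Sketch: send the request above and check whether the prefix was cached.
// The Bearer auth header and CONCENTRATE_API_KEY variable are assumptions,
// and usage.cached_tokens should be verified against your actual responses.
const response = await fetch("https://api.concentrate.ai/v1/responses", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${process.env.CONCENTRATE_API_KEY}`,
  },
  body: JSON.stringify({
    model: "anthropic/claude-opus-4-5",
    input: [
      {
        role: "system",
        content: "Very long system prompt with detailed instructions...",
        cache_control: { type: "ephemeral", ttl: "5m" },
      },
      { role: "user", content: "User question here" },
    ],
  }),
});

const data = await response.json();
// Expected to be 0 on the first request (cache write) and greater than 0 on
// an identical follow-up request sent within the TTL (cache read).
console.log("Cached tokens:", data.usage?.cached_tokens ?? 0);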

Cache Control Parameters

cache_control.type
string
required
Type of cache to use. Currently only "ephemeral" is supported, which means:
  • Cache is temporary and will expire
  • Not persisted across API restarts
  • Shared across your API key’s requests
cache_control.ttl
string
required
Time-to-live for the cached content. Options: "5m" (5 minutes) or "1h" (1 hour).
Use "5m" for rapid successive requests and real-time conversations (e.g., chat applications).
Use "1h" for regular usage patterns and batch processing (e.g., document analysis sessions).
After the TTL expires, the next request will perform a cache write again.

Cost Analysis

Pricing Comparison

For Anthropic Claude models, typical pricing is:
Token Type      Relative Cost   Example (per 1M tokens)
Regular Input   1x              $3.00
Cache Write     1.25x           $3.75
Cache Read      0.1x            $0.30
Exact pricing varies by model. Check the Model Fortress page for current rates.
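As a rough worked example using the example rates above, and counting only the cached system-prompt portion of each request: a 10,000-token system prompt processed as regular input costs about $0.03 per request, so three requests cost about $0.09. With caching, the first request writes the cache for about $0.0375 and the next two read it for about $0.003 each, roughly $0.0435 in total. At these example rates, caching already pays for itself on the second request, consistent with the 2-3 request break-even noted above.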

Advanced Patterns

Hybrid Caching

Cache different parts with different TTLs:
{
  "model": "anthropic/claude-opus-4-5",
  "input": [
    {
      "role": "system",
      "content": "Static company policies...",
      "cache_control": {
        "type": "ephemeral",
        "ttl": "1h" // Long TTL for static content
      }
    },
    {
      "role": "user",
      "content": "Recent conversation context...",
      "cache_control": {
        "type": "ephemeral",
        "ttl": "5m" // Short TTL for dynamic context
      }
    },
    {
      "role": "user",
      "content": "Current question"
    }
  ]
}

Conditional Caching

Only cache for specific use cases:
function buildInput(systemPrompt, userMessage, enableCaching) {
  const systemMessage = {
    role: "system",
    content: systemPrompt,
  };

  // Only add cache_control if beneficial
  if (enableCaching && systemPrompt.length > 1000) {
    systemMessage.cache_control = {
      type: "ephemeral",
      ttl: "1h",
    };
  }

  return [systemMessage, { role: "user", content: userMessage }];
}
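A possible call site (the policy text and question below are illustrative): cache_control is only attached when the system prompt is long enough to be worth the cache write.

// Illustrative usage of buildInput. The repeated string stands in for a real
// long document; because it exceeds 1000 characters and caching is enabled,
// buildInput attaches cache_control. Shorter prompts are sent uncached.
const policyDocument = "Policy clause text. ".repeat(200); // ~4000 characters
const input = buildInput(policyDocument, "What is our refund policy?", true);
// `input` can then be used as the `input` field of a request body.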

Rotating Caches

For applications with multiple contexts, rotate caches strategically:
// Example: Customer support with different department contexts
const departments = ["sales", "support", "billing"];

async function handleCustomerQuery(department, query) {
  const departmentContext = await loadDepartmentContext(department);

  return await fetch("https://api.concentrate.ai/v1/responses", {
    method: "POST",
    headers: {
      /* ... */
    },
    body: JSON.stringify({
      model: "anthropic/claude-opus-4-5",
      input: [
        {
          role: "system",
          content: `Department: ${department}\n\n${departmentContext}`,
          cache_control: {
            type: "ephemeral",
            ttl: "1h", // Each department gets its own cache
          },
        },
        {
          role: "user",
          content: query,
        },
      ],
    }),
  });
}
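A possible call site (the department and queries are illustrative); repeated queries for the same department within the 1-hour TTL reuse that department's cached context:

// Illustrative usage: both calls use the "billing" context, so the second,
// sent within the 1h TTL, can read the cache written by the first.
const res = await handleCustomerQuery("billing", "Why was I charged twice?");
const followUp = await handleCustomerQuery("billing", "Can I get a refund for the duplicate charge?");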

Limitations

Current Limitations:
  • Only supported by Anthropic and AWS Bedrock providers (Claude models)
  • Maximum cache size varies by model
  • Caches are not guaranteed (infrastructure changes can invalidate them)
  • No cross-user caching (caches are per API key)

Troubleshooting

Cache not being used?
Check:
  • Is the prefix content exactly identical to previous requests?
  • Has the TTL expired?
  • Are you using a supported provider (Anthropic or AWS Bedrock with Claude models)?
  • Is the cached content at the start of the input array?
Verify by checking cached_tokens in the response usage field.
Higher costs than expected?
Possible causes:
  • The first request incurs the cache write cost (1.25x the regular input cost)
  • Cache misses due to changing content
  • A TTL that is too short, causing frequent cache writes
Solutions:
  • Use longer TTLs for stable content
  • Ensure content consistency
  • Monitor cache hit rates (see the sketch at the end of this section)
Remember:
  • Caches are per API key, not global
  • Content must be exactly identical (including whitespace)
  • Caches can be invalidated by infrastructure changes
Always design your application to work without caching.
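Where cache hit rates need to be monitored, a minimal sketch might look like the following. It assumes, as in the verification step earlier in this section, that cache activity is exposed via a cached_tokens field on the response usage object; the field name should be checked against your actual responses.

// Minimal sketch for tracking cache hit rates across requests.
// Assumes usage.cached_tokens reports cache reads; verify the field name
// against real responses before relying on it.
const stats = { requests: 0, cacheHits: 0 };

function recordUsage(usage) {
  stats.requests += 1;
  if ((usage?.cached_tokens ?? 0) > 0) {
    stats.cacheHits += 1;
  }
}

function cacheHitRate() {
  return stats.requests === 0 ? 0 : stats.cacheHits / stats.requests;
}

// After each response: recordUsage(data.usage);
// Periodically: console.log(`Cache hit rate: ${(cacheHitRate() * 100).toFixed(1)}%`);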

Request Parameters

Complete parameter reference

Create Response

Main endpoint documentation