Overview

Prompt caching allows you to cache portions of your prompts that are reused across multiple requests, significantly reducing costs and improving performance. This is especially valuable for applications with large system prompts, extensive context, or documentation.
Availability: Currently supported by:
  • Anthropic provider (Claude models via Anthropic API)
  • AWS Bedrock provider (Claude models via AWS Bedrock)
All other providers will ignore cache control settings.

How It Works

Prompt caching works by storing specific message content and reusing it across requests:
  1. First Request: You mark messages with cache_control settings
  2. Cache Write: The API writes those messages to a cache
  3. Subsequent Requests: Identical cached content is retrieved instead of processed
  4. Cache Expiry: Cache expires after the specified time-to-live (TTL)
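As a concrete illustration of this lifecycle, the sketch below builds two request bodies that share the exact same cached system message; only the user message changes, so the second request can be served from the cache written by the first (assuming it is sent before the TTL expires). The helper function is illustrative, not part of the API.

// Illustrative only: two request bodies that reuse the same cached prefix.
// The first request writes the cache; the second, sent within the TTL,
// reads it because the marked system message is byte-for-byte identical.
const cachedSystemMessage = {
  role: "system",
  content: "Very long system prompt with detailed instructions...",
  cache_control: { type: "ephemeral", ttl: "5m" },
};

function buildRequestBody(userMessage) {
  return {
    model: "anthropic/claude-opus-4-5",
    // Reusing the exact same system message object keeps the cached prefix identical.
    input: [cachedSystemMessage, { role: "user", content: userMessage }],
  };
}

const firstRequest = buildRequestBody("First question"); // triggers a cache write
const secondRequest = buildRequestBody("Follow-up question"); // served as a cache read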

Benefits

  • Cache read tokens: ~90% cheaper than regular input tokens
  • Ideal for: Large system prompts, documentation, repeated context
  • Break-even: Typically after 2-3 requests with the same prefix
  • Reduced processing time for cached content
  • Faster response generation
  • Lower latency for requests with cached prefixes
  • Ensures identical system prompts across requests
  • Maintains consistency in multi-turn conversations
  • Simplifies prompt management

Basic Usage

Add cache_control to any message in your input array:
{
  "model": "anthropic/claude-opus-4-5",
  "input": [
    {
      "role": "system",
      "content": "Very long system prompt with detailed instructions...",
      "cache_control": {
        "type": "ephemeral",
        "ttl": "5m"
      }
    },
    {
      "role": "user",
      "content": "User question here"
    }
  ]
}
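To confirm the cache is being used, you can send the request and inspect the response's usage field. The sketch below assumes a standard Authorization: Bearer header and an environment variable named CONCENTRATE_API_KEY, and reads the cached_tokens field referenced in the Troubleshooting section; treat the auth scheme and response shape as assumptions to verify against your own responses.

// Sketch: send the request above and check whether the prefix was cached.
// The Bearer auth header and CONCENTRATE_API_KEY variable are assumptions,
// and usage.cached_tokens should be verified against your actual responses.
const response = await fetch("https://api.concentrate.ai/v1/responses", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${process.env.CONCENTRATE_API_KEY}`,
  },
  body: JSON.stringify({
    model: "anthropic/claude-opus-4-5",
    input: [
      {
        role: "system",
        content: "Very long system prompt with detailed instructions...",
        cache_control: { type: "ephemeral", ttl: "5m" },
      },
      { role: "user", content: "User question here" },
    ],
  }),
});

const data = await response.json();
// Expected to be 0 on the first request (cache write) and greater than 0 on
// an identical follow-up request sent within the TTL (cache read).
console.log("Cached tokens:", data.usage?.cached_tokens ?? 0);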

Cache Control Parameters

cache_control.type
string
required
Type of cache to use. Currently only "ephemeral" is supported, which means:
  • Cache is temporary and will expire
  • Not persisted across API restarts
  • Shared across your API key’s requests
cache_control.ttl
string
required
Time-to-live for the cached content. Options: "5m" (5 minutes) or "1h" (1 hour).
Use "5m" for rapid successive requests and real-time conversations (e.g., chat applications).
Use "1h" for regular usage patterns and batch processing (e.g., document analysis sessions).
After the TTL expires, the next request will perform a cache write again.

Cost Analysis

Pricing Comparison

For Anthropic Claude models, typical pricing is:
Token Type      Relative Cost   Example (per 1M tokens)
Regular Input   1x              $3.00
Cache Write     1.25x           $3.75
Cache Read      0.1x            $0.30
Exact pricing varies by model. Check the Model Fortress page for current rates.
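As a rough worked example using the example rates above, and counting only the cached system-prompt portion of each request: a 10,000-token system prompt processed as regular input costs about $0.03 per request, so three requests cost about $0.09. With caching, the first request writes the cache for about $0.0375 and the next two read it for about $0.003 each, roughly $0.0435 in total. At these example rates, caching already pays for itself on the second request, consistent with the 2-3 request break-even noted above.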

Advanced Patterns

Hybrid Caching

Cache different parts with different TTLs:
{
  "model": "anthropic/claude-opus-4-5",
  "input": [
    {
      "role": "system",
      "content": "Static company policies...",
      "cache_control": {
        "type": "ephemeral",
        "ttl": "1h" // Long TTL for static content
      }
    },
    {
      "role": "user",
      "content": "Recent conversation context...",
      "cache_control": {
        "type": "ephemeral",
        "ttl": "5m" // Short TTL for dynamic context
      }
    },
    {
      "role": "user",
      "content": "Current question"
    }
  ]
}

Conditional Caching

Only cache for specific use cases:
function buildInput(systemPrompt, userMessage, enableCaching) {
  const systemMessage = {
    role: "system",
    content: systemPrompt,
  };

  // Only add cache_control if beneficial
  if (enableCaching && systemPrompt.length > 1000) {
    systemMessage.cache_control = {
      type: "ephemeral",
      ttl: "1h",
    };
  }

  return [systemMessage, { role: "user", content: userMessage }];
}
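A possible call site (the policy text and question below are illustrative): cache_control is only attached when the system prompt is long enough to be worth the cache write.

// Illustrative usage of buildInput. The repeated string stands in for a real
// long document; because it exceeds 1000 characters and caching is enabled,
// buildInput attaches cache_control. Shorter prompts are sent uncached.
const policyDocument = "Policy clause text. ".repeat(200); // ~4000 characters
const input = buildInput(policyDocument, "What is our refund policy?", true);
// `input` can then be used as the `input` field of a request body.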

Rotating Caches

For applications with multiple contexts, rotate caches strategically:
// Example: Customer support with different department contexts
const departments = ["sales", "support", "billing"];

async function handleCustomerQuery(department, query) {
  const departmentContext = await loadDepartmentContext(department);

  return await fetch("https://api.concentrate.ai/v1/responses", {
    method: "POST",
    headers: {
      /* ... */
    },
    body: JSON.stringify({
      model: "anthropic/claude-opus-4-5",
      input: [
        {
          role: "system",
          content: `Department: ${department}\n\n${departmentContext}`,
          cache_control: {
            type: "ephemeral",
            ttl: "1h", // Each department gets its own cache
          },
        },
        {
          role: "user",
          content: query,
        },
      ],
    }),
  });
}
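A possible call site (the department and queries are illustrative); repeated queries for the same department within the 1-hour TTL reuse that department's cached context:

// Illustrative usage: both calls use the "billing" context, so the second,
// sent within the 1h TTL, can read the cache written by the first.
const res = await handleCustomerQuery("billing", "Why was I charged twice?");
const followUp = await handleCustomerQuery("billing", "Can I get a refund for the duplicate charge?");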

Limitations

Current Limitations:
  • Only supported by Anthropic and AWS Bedrock providers (Claude models)
  • Maximum cache size varies by model
  • Caches are not guaranteed (infrastructure changes can invalidate them)
  • No cross-user caching (caches are per API key)

Troubleshooting

Cache not being used?
Check:
  • Is the prefix content exactly identical to previous requests?
  • Has the TTL expired?
  • Are you using a supported provider (Anthropic or AWS Bedrock with Claude models)?
  • Is the cached content at the start of the input array?
Verify by checking cached_tokens in the response usage field.
Higher costs than expected?
Possible causes:
  • The first request incurs the cache write cost (1.25x the regular input cost)
  • Cache misses due to changing content
  • A TTL that is too short, causing frequent cache writes
Solutions:
  • Use longer TTLs for stable content
  • Ensure content consistency
  • Monitor cache hit rates (see the sketch at the end of this section)
Remember:
  • Caches are per API key, not global
  • Content must be exactly identical (including whitespace)
  • Caches can be invalidated by infrastructure changes
Always design your application to work without caching.
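Where cache hit rates need to be monitored, a minimal sketch might look like the following. It assumes, as in the verification step earlier in this section, that cache activity is exposed via a cached_tokens field on the response usage object; the field name should be checked against your actual responses.

// Minimal sketch for tracking cache hit rates across requests.
// Assumes usage.cached_tokens reports cache reads; verify the field name
// against real responses before relying on it.
const stats = { requests: 0, cacheHits: 0 };

function recordUsage(usage) {
  stats.requests += 1;
  if ((usage?.cached_tokens ?? 0) > 0) {
    stats.cacheHits += 1;
  }
}

function cacheHitRate() {
  return stats.requests === 0 ? 0 : stats.cacheHits / stats.requests;
}

// After each response: recordUsage(data.usage);
// Periodically: console.log(`Cache hit rate: ${(cacheHitRate() * 100).toFixed(1)}%`);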

Request Parameters

Complete parameter reference

Create Response

Main endpoint documentation