Overview
Prompt caching allows you to cache portions of your prompts that are reused across multiple requests, significantly reducing costs and improving performance. This is especially valuable for applications with large system prompts, extensive context, or documentation.
Availability: Currently supported by:
- Anthropic provider (Claude models via Anthropic API)
- AWS Bedrock provider (Claude models via AWS Bedrock)
How It Works
Prompt caching works by storing specific message content and reusing it across requests:
- First Request: You mark messages with cache_control settings
- Cache Write: The API writes those messages to a cache
- Subsequent Requests: Identical cached content is retrieved instead of being processed again
- Cache Expiry: Cache expires after the specified time-to-live (TTL)
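From the client's perspective, the lifecycle looks roughly like the sketch below. The payload shape and the send helper are illustrative assumptions, not the exact API schema; send stands in for a call to the Create Response endpoint.
```python
# Sketch of the cache lifecycle; send() is a hypothetical stand-in for a call
# to the Create Response endpoint.
def send(input_messages: list[dict]) -> dict:
    ...  # POST the messages and return the parsed response (elided)

cached_prefix = {
    "role": "system",
    "content": "...large, stable instructions...",
    "cache_control": {"type": "ephemeral", "ttl": "5m"},  # mark content as cacheable
}

send([cached_prefix, {"role": "user", "content": "Question 1"}])
# First request -> cache write: the marked prefix is stored (billed at the write rate).

send([cached_prefix, {"role": "user", "content": "Question 2"}])
# Subsequent request -> cache read: the identical prefix is retrieved, not reprocessed.

# If no identical request arrives within the TTL ("5m" here), the cache expires
# and the next request performs a cache write again.
```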
Benefits
Cost Savings
- Cache read tokens: ~90% cheaper than regular input tokens
- Ideal for: Large system prompts, documentation, repeated context
- Break-even: Typically after 2-3 requests with the same prefix
Improved Performance
- Reduced processing time for cached content
- Faster response generation
- Lower latency for requests with cached prefixes
Consistent Context
- Ensures identical system prompts across requests
- Maintains consistency in multi-turn conversations
- Simplifies prompt management
Basic Usage
Add cache_control to any message in your input array:
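Here is a minimal sketch of such a request. The endpoint URL, auth header, model id, and exact field names below are illustrative assumptions, not the authoritative schema; see the Request Parameters reference for the exact shapes.
```python
import os

import requests

# Illustrative endpoint and auth; substitute your actual base URL and API key.
API_URL = os.environ.get("API_URL", "https://api.example.com/v1/responses")
API_KEY = os.environ["API_KEY"]

payload = {
    "model": "<your Claude model id>",  # any Claude model available via Anthropic or Bedrock
    "input": [
        {
            "role": "system",
            "content": "You are a support assistant. <large, stable instructions>",
            # Written to the cache on the first request, read (~90% cheaper)
            # on identical follow-up requests within the TTL.
            "cache_control": {"type": "ephemeral", "ttl": "5m"},
        },
        {"role": "user", "content": "Summarize our refund policy."},
    ],
}

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=60,
)
response.raise_for_status()
print(response.json().get("usage"))  # cached_tokens should appear on repeat requests
```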
Cache Control Parameters
type
Type of cache to use. Currently only "ephemeral" is supported, which means:
- Cache is temporary and will expire
- Not persisted across API restarts
- Shared across your API key's requests
ttl
Time-to-live for the cached content. Options: "5m" (5 minutes) or "1h" (1 hour).
Use "5m" for rapid successive requests and real-time conversations (e.g., chat applications).
Use "1h" for regular usage patterns and batch processing (e.g., document analysis sessions).
After the TTL expires, the next request will perform a cache write again.
Cost Analysis
Pricing Comparison
For Anthropic Claude models, typical pricing is:
| Token Type | Relative Cost | Example (per 1M tokens) |
|---|---|---|
| Regular Input | 1x | $3.00 |
| Cache Write | 1.25x | $3.75 |
| Cache Read | 0.1x | $0.30 |
Exact pricing varies by model. Check the Model Fortress page for current rates.
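As a rough worked example using the illustrative rates in the table above (a sketch, not a quote): with a 100K-token cached prefix, the first request pays the 1.25x write rate and each identical follow-up pays the 0.1x read rate.
```python
PREFIX_TOKENS = 100_000                    # tokens in the cached prefix
REGULAR, WRITE, READ = 3.00, 3.75, 0.30    # $ per 1M tokens, from the table above

def prefix_cost(num_requests: int, cached: bool) -> float:
    millions = PREFIX_TOKENS / 1_000_000
    if not cached:
        return num_requests * millions * REGULAR
    # One cache write, then cache reads for every subsequent identical request.
    return millions * WRITE + (num_requests - 1) * millions * READ

for n in (1, 2, 3, 5):
    print(n, round(prefix_cost(n, cached=False), 3), round(prefix_cost(n, cached=True), 3))
# n requests -> uncached $, cached $:
#   1 -> 0.3, 0.375   2 -> 0.6, 0.405   3 -> 0.9, 0.435   5 -> 1.5, 0.495
# Caching pulls ahead by the second identical request in this example.
```
This is consistent with the break-even of 2-3 requests noted above; shorter prefixes or frequently changing content push the break-even point later.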
Advanced Patterns
Hybrid Caching
Cache different parts with different TTLs:
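For example (a sketch reusing the illustrative field names from the Basic Usage example): cache a stable system prompt for an hour while caching session-specific context for five minutes.
```python
LONG_STABLE_INSTRUCTIONS = "...large system prompt shared by every request..."
session_document = "...document being analysed in the current session..."
user_question = "What changed in section 3?"

input_messages = [
    {
        # Stable for hours across all traffic: use the long 1h TTL.
        "role": "system",
        "content": LONG_STABLE_INSTRUCTIONS,
        "cache_control": {"type": "ephemeral", "ttl": "1h"},
    },
    {
        # Stable only within this session: use the short 5m TTL.
        "role": "user",
        "content": f"Reference document:\n{session_document}",
        "cache_control": {"type": "ephemeral", "ttl": "5m"},
    },
    # Changes on every request, so it is left uncached.
    {"role": "user", "content": user_question},
]
```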
Conditional Caching
Only cache for specific use cases:
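A sketch of one way to do this; build_message is a hypothetical helper, and the size threshold is only an example heuristic.
```python
def build_message(content: str, role: str = "system", cache: bool = False) -> dict:
    """Attach cache_control only when caching is expected to pay off."""
    message = {"role": role, "content": content}
    if cache:
        message["cache_control"] = {"type": "ephemeral", "ttl": "5m"}
    return message

system_prompt = "...instructions plus a large knowledge base..."

# Heuristic: only cache prompts that are large and will be reused, since the
# first request costs ~1.25x and break-even typically takes 2-3 requests.
system_message = build_message(system_prompt, cache=len(system_prompt) > 4_000)
```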
Rotating Caches
For applications with multiple contexts, rotate caches strategically:
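A sketch of the idea: keep one byte-for-byte stable prefix per context so each context maintains its own warm cache instead of repeatedly overwriting a single one. The helper and field names are illustrative.
```python
CONTEXT_PROMPTS = {
    "billing": "...large billing knowledge base...",
    "support": "...large support knowledge base...",
}

def messages_for(context: str, question: str) -> list[dict]:
    return [
        {
            "role": "system",
            # Identical content per context keeps that context's cache warm.
            "content": CONTEXT_PROMPTS[context],
            "cache_control": {"type": "ephemeral", "ttl": "1h"},
        },
        {"role": "user", "content": question},
    ]

# Requests for the same context hit that context's cache; switching contexts
# writes or refreshes a different cache rather than invalidating this one.
requests_batch = [
    messages_for("billing", "Why was I charged twice?"),
    messages_for("support", "How do I reset my password?"),
]
```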
Limitations
Troubleshooting
Cache not being used
Check:
- Is the prefix content exactly identical to previous requests?
- Has the TTL expired?
- Are you using a supported provider (Anthropic or AWS Bedrock with Claude models)?
- Is the cached content at the start of the input array?
To verify cache hits, check cached_tokens in the response usage field.
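For example, a small helper for checking this (the usage field names are illustrative; confirm them against the Create Response documentation):
```python
def report_cache_usage(response_json: dict) -> None:
    """Print whether a response was served from the prompt cache."""
    usage = response_json.get("usage", {})
    cached = usage.get("cached_tokens", 0)
    if cached:
        print(f"cache hit: {cached} input tokens were read from the cache")
    else:
        print("cache miss: no cached_tokens; check prefix identity, TTL, and provider")
```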
Higher costs than expected
Possible causes:
- First request incurs cache write cost (1.25x input cost)
- Cache misses due to changing content
- TTL too short, causing frequent cache writes
Solutions:
- Use longer TTLs for stable content
- Ensure content consistency
- Monitor cache hit rates
Unexpected cache behavior
Remember:
- Caches are per API key, not global
- Content must be exactly identical (including whitespace)
- Cache can be invalidated by infrastructure changes
Related Documentation
Request Parameters
Complete parameter reference
Create Response
Main endpoint documentation