Why caching matters
Imagine you are building a customer service bot. Every conversation carries an 8000-token system prompt — company info, FAQs, product catalog.
Without caching:
- Each conversation input: 8000 tokens × ¥0.015 / 1K = ¥0.12
- 10,000 conversations per day: ¥1200 spent purely on the system prompt
That is clearly too expensive. Anthropic's prompt caching is built for exactly this scenario.
The three token types
With caching enabled, Anthropic splits your request into three kinds of tokens:
| Type | Meaning | Rate (vs normal input) |
|---|---|---|
| Regular input | Uncached portion, usually the latest user question | 1x (normal price) |
| cache_creation | First-time write into the cache, usually the system prompt | 1.25x (slightly more expensive) |
| cache_read | Subsequent hits pulled from the cache | 0.1x (one-tenth of normal) |
The billing formula
Tokensmart charges you exactly as Anthropic prices:
total_cost =
regular_input × input_price +
cache_read × cache_read_price +
cache_creation × cache_creation_price +
output × output_price
Back to the example
System prompt = 8000 tokens, user question = 100 tokens, response = 300 tokens:
First conversation (cold cache):
- cache_creation: 8000 × ¥0.01875 / 1K = ¥0.15
- regular_input: 100 × ¥0.015 / 1K = ¥0.0015
- output: 300 × ¥0.075 / 1K = ¥0.0225
Conversations 2 through 10,000 (cache hit):
- cache_read: 8000 × ¥0.0015 / 1K = ¥0.012 ← one-tenth of normal
- regular_input: 100 × ¥0.015 / 1K = ¥0.0015
- output: 300 × ¥0.075 / 1K = ¥0.0225
Daily total for the system prompt drops from ¥1200 to around ¥120 — a flat 10x savings.
Gotchas
- 5-minute TTL: Anthropic's cache lives for 5 minutes of idle time. Only high-frequency traffic benefits
- 1024-token minimum: System prompts shorter than 1024 tokens will not be cached
- Must opt in explicitly: Add
cache_control: { type: "ephemeral" }to your request, otherwise caching is off
How to see it in Tokensmart logs
Open API logs. Every row has dedicated cache_read and cache_creation columns, and the hover tooltip shows the token × rate × subtotal breakdown.