Token costs in production AI systems are almost never optimized. Teams launch with a rough understanding of API pricing, watch costs grow as usage scales, and start cutting corners (shorter outputs, smaller models) without diagnosing the root cause. Most of the time, the root cause is a context package with significant inefficiencies that no one measured.
This guide covers token budget management as a deliberate engineering practice: how to estimate costs before deployment, how to allocate the context window across components, how to use caching to reduce costs structurally, and how to monitor per-task cost to find where the budget is going.
Why tokens are both a cost and a quality concern
The obvious framing: tokens cost money. Input tokens are billed before the model generates anything. A 100k-token context at $3 per million input tokens (Claude Sonnet 4.6 pricing as of 2026) costs $0.30 per API call, before counting output tokens. For an agentic system making 20 API calls to complete a task, that's $6 in input tokens alone.
The less obvious framing: token allocation shapes quality. Every token in the context window competes for the model's attention. Attention research consistently shows that models distribute attention non-uniformly across long contexts: content at the beginning and end receives more weight than content in the middle. A 200k-token context window is not 200k tokens of equal attention; it's a budget that gets less effective as it fills with low-signal content.
High token count on low-signal content (outdated conversation turns, verbose tool schemas, irrelevant retrieved documents) both costs more and degrades quality. Token budget management is simultaneously cost engineering and quality engineering.
Pricing reference for 2026
Understanding your budget requires knowing the prices. As of mid-2026, the major frontier model APIs price tokens as follows (approximate; check provider documentation for current pricing):
| Model | Input (per M tokens) | Cached input (per M tokens) | Output (per M tokens) |
|---|---|---|---|
| Claude Sonnet 4.6 | $3.00 | $0.30 | $15.00 |
| Claude Haiku 4.5 | $0.80 | $0.08 | $4.00 |
| GPT-5.6 Terra | $2.50 | $1.25 | $10.00 |
| GPT-5.6 Luna (mini) | $0.40 | $0.10 | $1.60 |
| Gemini 3.5 Flash | $0.15 | $0.04 | $0.60 |
| Gemini 3.5 Pro | $1.25 | $0.31 | $5.00 |
Two patterns are immediately visible:
- Output tokens cost 4-5x as much as input tokens. This means controlling output length is often more valuable than optimizing input. But input tokens still matter at scale, especially for agentic systems with large context packages and many API calls per task.
- Cached input tokens cost 50-90% less than uncached input, depending on the provider (90% less for the Claude models in the table). This is the single largest cost reduction available without changing model behavior, and it takes little more than structuring your context correctly.
The context budget model
Before building a system, model its token budget. This is the equivalent of capacity planning: you want to know where tokens go before you start seeing unexpectedly large bills.
Fixed costs (per session)
These don't change across API calls within a session:
| Component | Typical range |
|---|---|
| System prompt | 200–800 tokens |
| Tool definitions (per tool) | 100–400 tokens |
| Safety instructions | 50–200 tokens |
| Persistent context (CLAUDE.md equivalent) | 500–2,000 tokens |
Example: A system with a 400-token system prompt, 8 tools averaging 200 tokens each, and 200 tokens of safety instructions has fixed costs of 400 + 1,600 + 200 = 2,200 tokens. These are natural cache candidates.
Variable costs (per API call)
These change with each call:
| Component | Typical range |
|---|---|
| Conversation history | 0–50,000+ tokens (grows with session) |
| Retrieved documents (per chunk) | 300–1,000 tokens |
| Tool outputs (per call) | 100–5,000 tokens |
| Current user message | 50–500 tokens |
Example: At turn 15 of an agent session, with 5 retrieved chunks of 500 tokens each, 10,000 tokens of accumulated history, and a 200-token user message, variable costs are 2,500 + 10,000 + 200 = 12,700 tokens. Total context: 2,200 (fixed) + 12,700 (variable) = 14,900 tokens.
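Across 15 turns, input token spend per session is roughly 15 × (2,200 fixed + the average variable tokens per call). Because history grows during the session, the average call is smaller than the turn-15 call; budgeting every call at the turn-15 size gives an upper bound of 15 × 14,900 = 223,500 tokens, so roughly 200,000 tokens is a conservative planning figure for a complete session. At $3/million, that's $0.60 per session in input tokens. Caching the 2,200 fixed tokens moves most of the 15 × 2,200 = 33,000 fixed tokens to the cache read rate, which saves about $0.08 and brings the session to roughly $0.52. In this example the growing history, not the fixed prefix, is the larger share of the bill.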
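The session arithmetic can be checked with a small estimator. The sketch below is illustrative only: `session_input_cost` is a hypothetical helper, the linear history growth is an assumption about the workload, and the 1.25x cache write multiplier is an assumed default. Prices are USD per million tokens, taken from the pricing table.

```python
def session_input_cost(fixed_tokens, variable_per_call, input_price,
                       cache_read_price=None, cache_write_multiplier=1.25):
    """Estimate input-token cost in USD for one session.

    variable_per_call holds the variable token count of each API call.
    Prices are USD per million tokens. If cache_read_price is given, the
    fixed prefix is written once (at a premium) and read on later calls.
    """
    calls = len(variable_per_call)
    if calls == 0:
        return 0.0
    variable_cost = sum(variable_per_call) * input_price / 1e6
    if cache_read_price is None:
        fixed_cost = calls * fixed_tokens * input_price / 1e6
    else:
        write = fixed_tokens * input_price * cache_write_multiplier / 1e6
        reads = (calls - 1) * fixed_tokens * cache_read_price / 1e6
        fixed_cost = write + reads
    return fixed_cost + variable_cost


# Assumed workload: 15 calls, history growing linearly from 0 to 10,000
# tokens, plus 2,500 tokens of retrieved chunks and a 200-token message.
variable = [2_500 + 200 + (10_000 * turn) // 14 for turn in range(15)]

uncached = session_input_cost(2_200, variable, input_price=3.00)
cached = session_input_cost(2_200, variable, input_price=3.00,
                            cache_read_price=0.30)
print(f"uncached ${uncached:.3f}, cached ${cached:.3f}")
```

Under these assumptions the session lands near 150,000 input tokens, and caching the fixed prefix saves about eight cents. Swapping in a different growth curve for `variable` shows how quickly history comes to dominate the total.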
The budget allocation target
Set a budget target for each component as a percentage of the total context window. A common allocation for a 32k-token context:
| Component | Allocation | Tokens |
|---|---|---|
| System prompt + tools | 15% | ~4,800 |
| Retrieved context | 25% | ~8,000 |
| Conversation history | 30% | ~9,600 |
| Tool outputs (current turn) | 20% | ~6,400 |
| User message + output buffer | 10% | ~3,200 |
Adjust this allocation based on your task type. Heavy retrieval tasks allocate more to retrieved context. Conversational tasks with long turn sequences allocate more to history.
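One way to make the allocation enforceable is to turn the percentages into per-component token budgets and compare measured usage against them. This is a minimal sketch; the component names and the helpers `component_budgets` and `over_budget` are hypothetical, and the usage figures are made up for illustration.

```python
# Shares from the allocation table, as whole percentages.
ALLOCATION_PCT = {
    "system_and_tools": 15,
    "retrieved_context": 25,
    "history": 30,
    "tool_outputs": 20,
    "user_and_output_buffer": 10,
}


def component_budgets(context_window, allocation_pct=ALLOCATION_PCT):
    """Turn percentage shares into per-component token budgets."""
    if sum(allocation_pct.values()) != 100:
        raise ValueError("allocation must sum to 100%")
    return {name: context_window * pct // 100
            for name, pct in allocation_pct.items()}


def over_budget(usage, budgets):
    """Return how many tokens each component exceeds its budget by."""
    return {name: used - budgets[name]
            for name, used in usage.items() if used > budgets[name]}


budgets = component_budgets(32_000)
# Illustrative usage for one call; only history is above its share here.
usage = {"system_and_tools": 2_200, "retrieved_context": 2_500,
         "history": 12_000, "tool_outputs": 1_000,
         "user_and_output_buffer": 200}
print(over_budget(usage, budgets))
```

A check like this could run on every call in staging, so that a component drifting past its share shows up before it shows up on the bill.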
Prompt caching: the highest-leverage optimization
Prompt caching is the most impactful token cost optimization available in 2026, and it requires no change to model behavior, only to how you structure the context.
How prompt caching works
You mark a prefix of your context as cacheable. On the first call, the full context is processed normally. On subsequent calls with the same prefix, the provider reuses its cached computation for that portion, and you're billed at the cache read rate (10-50% of the full input price, per the pricing table above) instead of the full input token price.
The key constraint: the cached prefix must be byte-for-byte identical across calls. Any change to the prefix invalidates the cache and triggers a cache write (usually billed at a small premium for the write itself).
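A simple way to see whether a prefix really is identical across calls is to fingerprint it locally and log the hash. The sketch below is a diagnostic aid only, not how any provider computes cache keys; `prefix_fingerprint` and the sample prompt text are hypothetical.

```python
import hashlib


def prefix_fingerprint(*blocks):
    """Hash the text blocks that make up the cacheable prefix, in order."""
    digest = hashlib.sha256()
    for block in blocks:
        data = block.encode("utf-8")
        # Length-prefix each block so moving text between blocks changes the hash.
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()


system_prompt = "You are a support agent. Answer concisely."
tool_text = "search_orders(order_id): look up an order"

stable_a = prefix_fingerprint(system_prompt, tool_text)
stable_b = prefix_fingerprint(system_prompt, tool_text)
# Interpolating per-call data (here a timestamp) changes the prefix bytes.
dynamic = prefix_fingerprint(system_prompt + " Now: 2026-06-01T12:00:05", tool_text)

print(stable_a == stable_b, stable_a == dynamic)
```

If the logged fingerprint changes between calls in the same session, something dynamic has leaked into the prefix and the cache is likely being rewritten on each call.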
What to cache
Cache the stable prefix, meaning everything before the variable content:
- System prompt: stable across the session, ideal cache candidate
- Tool definitions: stable across the session, often the largest fixed cost
- Static reference content: project documentation, coding guidelines, product specs
Do not try to cache retrieved documents (they change per query) or conversation history (it accumulates).
Implementation with Anthropic's API
```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# SYSTEM_PROMPT, tool_definitions_as_text, conversation_history, and
# user_message are defined elsewhere in your application.
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=4096,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # Cache this block
        },
        {
            "type": "text",
            "text": tool_definitions_as_text,
            "cache_control": {"type": "ephemeral"},  # Cache this block too
        },
    ],
    messages=conversation_history + [{"role": "user", "content": user_message}],
)
```
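Each cache_control marker sets a cache breakpoint: the API caches the prefix up to and including the marked block. Keep stable content (system prompt, tool definitions) ahead of the breakpoints and dynamic content (conversation history, the current user message) after them, so that per-call changes never touch the cached prefix.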
Cache TTL and cost
Anthropic's ephemeral cache has a 5-minute TTL. For agentic sessions where API calls happen within 5 minutes of each other (the typical case), cache hits are near-100%. For batch processing with long delays between calls, cache entries can expire between calls, causing more cache misses.
The cache write fee (typically a 25% premium over the base input price, paid when the prefix is first written) amortizes over the session. If you make 10+ API calls with the same system prompt and tools, the cache write cost is negligible compared to the savings.
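The amortization can be made concrete with a short calculation. The sketch assumes a 1.25x write multiplier and a 0.10x read multiplier relative to the base input price, one cache write on the first call, and cache hits on every later call; `cached_prefix_cost_ratio` is a hypothetical helper.

```python
def cached_prefix_cost_ratio(calls, write_multiplier=1.25, read_multiplier=0.10):
    """Cost of a cached prefix over `calls` calls, relative to no caching.

    Multipliers are fractions of the base input price. Assumes one cache
    write on the first call and a cache hit on every later call.
    """
    if calls < 1:
        raise ValueError("calls must be at least 1")
    cached = write_multiplier + (calls - 1) * read_multiplier
    uncached = calls * 1.0
    return cached / uncached


for n in (1, 2, 10, 20):
    print(n, round(cached_prefix_cost_ratio(n), 3))
```

With these assumed multipliers a single call costs 25% more with caching than without, and the cached prefix is already cheaper by the second call. If the TTL expires between calls, each expiry adds another write, which is why call spacing matters for batch workloads.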
Monitoring per-task token cost
Cost per API call is an incomplete metric. An agent task that completes in 5 API calls and 30,000 tokens is more efficient than one that completes in 20 API calls and 150,000 tokens, even if the per-call costs look reasonable.
What to log
For every agent task, log:
- Total input tokens across all API calls in the task
- Total output tokens across all API calls in the task
- Total API calls made
- Cache hit rate (cached tokens vs. uncached input tokens)
- Task outcome (success, partial success, failure)
Calculate: total cost = (uncached_input_tokens × input_price) + (cached_input_tokens × cache_price) + (output_tokens × output_price)
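The logged fields and the cost formula fit naturally into a small per-task accumulator. This is a sketch under stated assumptions: `TaskCostLog` and its field names are hypothetical, prices are USD per million tokens, and, like the formula above, it does not model the cache write premium.

```python
from dataclasses import dataclass


@dataclass
class TaskCostLog:
    """Accumulates token usage across all API calls in one agent task."""
    task_id: str
    uncached_input_tokens: int = 0
    cached_input_tokens: int = 0
    output_tokens: int = 0
    api_calls: int = 0
    outcome: str = "unknown"  # e.g. "success", "partial", or "failure"

    def record_call(self, uncached_input, cached_input, output):
        """Add the usage reported for a single API call."""
        self.uncached_input_tokens += uncached_input
        self.cached_input_tokens += cached_input
        self.output_tokens += output
        self.api_calls += 1

    def cache_hit_rate(self):
        """Share of input tokens that were served from the cache."""
        total_input = self.uncached_input_tokens + self.cached_input_tokens
        return self.cached_input_tokens / total_input if total_input else 0.0

    def total_cost(self, input_price, cache_price, output_price):
        """Total task cost in USD; prices are USD per million tokens."""
        return (self.uncached_input_tokens * input_price
                + self.cached_input_tokens * cache_price
                + self.output_tokens * output_price) / 1e6


log = TaskCostLog(task_id="demo-task")
log.record_call(uncached_input=14_900, cached_input=0, output=500)
log.record_call(uncached_input=12_700, cached_input=2_200, output=500)
log.outcome = "success"
print(log.api_calls, round(log.cache_hit_rate(), 3),
      round(log.total_cost(3.00, 0.30, 15.00), 4))
```

In practice the per-call numbers would come from the usage data the provider returns with each response; the record is written once, when the task finishes, together with its outcome.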
Warning signals
Rising cost per task without rising task complexity. If an unchanged benchmark task costs 2x more to complete in month 2 than in month 1, something in the context package is growing. The usual causes are accumulated conversation history, tool definitions that have grown more verbose over time, or a growing system prompt.
Low cache hit rate on fixed costs. If your system prompt and tool definitions aren't caching effectively, check whether they're actually identical across calls (string interpolation or dynamic generation will break the cache).
High token cost on failed tasks. If your agent spends 50,000 tokens attempting a task before failing, those tokens cost the same as a successful task. Track failure cost separately: high failure cost often indicates a context package that leads the model into dead ends.
Disproportionate output token cost. If output tokens are 3-4x your input tokens, the model is generating very long responses. This is often a system prompt issue β add explicit output length constraints.
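These four signals can be turned into an automated check over aggregated metrics. The sketch below is one possible shape, not a recommended set of thresholds: `warning_signals` is hypothetical, the 2x cost growth and 3x output ratio echo the figures in the text, and the cache hit rate and failure cost limits are placeholder assumptions to tune per system.

```python
def warning_signals(cost_per_task, baseline_cost_per_task, cache_hit_rate,
                    failed_cost_share, input_tokens, output_tokens,
                    max_cost_growth=2.0, min_cache_hit_rate=0.5,
                    max_failed_cost_share=0.25, max_output_ratio=3.0):
    """Return the names of the warning signals that fire for one period.

    cache_hit_rate is measured on the fixed prefix; failed_cost_share is
    the fraction of total spend that went to tasks that ended in failure.
    """
    signals = []
    if (baseline_cost_per_task > 0
            and cost_per_task / baseline_cost_per_task >= max_cost_growth):
        signals.append("rising_cost_per_task")
    if cache_hit_rate < min_cache_hit_rate:
        signals.append("low_cache_hit_rate")
    if failed_cost_share > max_failed_cost_share:
        signals.append("high_failure_cost")
    if input_tokens > 0 and output_tokens / input_tokens >= max_output_ratio:
        signals.append("disproportionate_output")
    return signals


# Cost per task has more than doubled and the prefix is barely caching.
print(warning_signals(cost_per_task=1.30, baseline_cost_per_task=0.60,
                      cache_hit_rate=0.10, failed_cost_share=0.05,
                      input_tokens=10_000, output_tokens=2_000))
```

The baseline should come from a fixed benchmark task, so that a rise in cost reflects growth in the context package and not a change in workload.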
Practical optimizations in order of impact
1. Enable prompt caching (immediate, no quality change)
The highest-ROI optimization. Adds a few lines of code; reduces input token cost on stable prefixes by up to 90% (50-90% across the providers in the pricing table). Do this first.
2. Implement conversation history management
Without summarization or a sliding window, conversation history grows unboundedly and becomes the largest cost driver in long sessions. Implement summarization for sessions expected to exceed 15-20 turns.
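A sliding window is the simpler of the two strategies. The sketch below keeps the newest messages that fit a token budget and returns the older ones separately so they can be summarized; `trim_history` and `estimate_tokens` are hypothetical helpers, and the 4-characters-per-token estimate is a rough assumption that a real tokenizer count should replace.

```python
def estimate_tokens(message):
    """Rough token estimate: about 4 characters per token (an assumption)."""
    return max(1, len(message["content"]) // 4)


def trim_history(messages, max_tokens, count_tokens=estimate_tokens):
    """Split history into (dropped, kept), keeping the newest messages that fit."""
    kept = []
    used = 0
    # Walk backwards from the newest message until the budget is exhausted.
    for message in reversed(messages):
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()
    dropped = messages[:len(messages) - len(kept)]
    return dropped, kept


# Ten messages of roughly 100 tokens each, trimmed to a 350-token budget.
history = [{"role": "user" if i % 2 == 0 else "assistant", "content": "x" * 400}
           for i in range(10)]
dropped, kept = trim_history(history, max_tokens=350)
print(len(dropped), len(kept))
```

A summarization step would take `dropped` and replace it with a single short message at the front of `kept`. Depending on the API, the trimmed history may also need to start with a user turn, which this sketch does not enforce.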
3. Minimize tool surface
Every tool definition that's not needed for the current task is wasted tokens. Route tasks to tool subsets based on task type. For a system with 20 tools, exposing 5 relevant tools per task type rather than all 20 reduces tool definition token cost by 75%.
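Routing can be as simple as a lookup from task type to tool names. In the sketch below the tool registry, the task types, and the 200-token cost per definition are all made up for illustration; `tools_for_task` is a hypothetical helper.

```python
# Hypothetical registry: 20 tools, each definition assumed to cost 200 tokens.
ALL_TOOLS = {f"tool_{i}": {"name": f"tool_{i}", "tokens": 200} for i in range(20)}

# Hypothetical mapping from task type to the tools that task actually needs.
TOOLSETS = {
    "billing": ["tool_0", "tool_1", "tool_2", "tool_3", "tool_4"],
    "search": ["tool_5", "tool_6", "tool_7"],
}


def tools_for_task(task_type, all_tools=ALL_TOOLS, toolsets=TOOLSETS):
    """Return only the tool definitions relevant to the task type.

    Unknown task types fall back to the full tool set.
    """
    names = toolsets.get(task_type, list(all_tools))
    return [all_tools[name] for name in names]


def definition_tokens(tools):
    """Total token cost of a list of tool definitions."""
    return sum(tool["tokens"] for tool in tools)


full = definition_tokens(tools_for_task("unknown"))
subset = definition_tokens(tools_for_task("billing"))
print(full, subset, 1 - subset / full)
```

Keeping each subset fixed and in a stable order per task type matters if the tool definitions are part of a cached prefix, since a different tool list is a different prefix.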
4. Optimize retrieval
Reduce k (number of retrieved chunks), raise the relevance threshold to drop low-scoring results, and trim chunk sizes for high-density documents. Test whether retrieval quality degrades: reducing k from 10 to 5 often loses nothing in quality while halving retrieval token cost.
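The three knobs (k, relevance threshold, chunk size) can sit in one selection function applied after the retriever returns scored results. This is a minimal sketch: `select_chunks` is hypothetical, scores are assumed to be similarity values where higher is better, and truncating by characters is a crude stand-in for token-aware trimming.

```python
def select_chunks(scored_chunks, k=5, min_score=0.0, max_chars=None):
    """Pick at most k chunks scoring at least min_score, best first.

    scored_chunks is a list of (score, text) pairs; higher scores are better.
    max_chars optionally truncates each selected chunk.
    """
    ranked = sorted(scored_chunks, key=lambda pair: pair[0], reverse=True)
    selected = [(score, text) for score, text in ranked if score >= min_score][:k]
    if max_chars is not None:
        selected = [(score, text[:max_chars]) for score, text in selected]
    return selected


# Made-up retrieval results for illustration.
candidates = [(0.91, "refund policy ..."), (0.42, "office hours ..."),
              (0.88, "refund exceptions ..."), (0.65, "shipping times ..."),
              (0.30, "company history ...")]
print(select_chunks(candidates, k=3, min_score=0.6))
```

Running an evaluation set through several `k` and `min_score` settings is the way to confirm that a cheaper setting does not cost answer quality.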
5. Tighten the system prompt
Review your system prompt line by line. Remove background narrative (the model doesn't need history lessons; it needs instructions). Remove redundant statements of the same constraint. Remove examples that don't improve behavior measurably. Well-engineered system prompts are often 30-50% shorter than first drafts with no quality loss.
6. Right-size the model
For tasks that don't require frontier model capability, run on a faster, cheaper model. Claude Haiku 4.5 is ~3.75x cheaper than Sonnet 4.6 on input and output. For retrieval, summarization, and classification sub-tasks within an agentic workflow, Haiku often matches Sonnet at a fraction of the cost. Use a tiered model strategy: route simple sub-tasks to the cheaper model, complex reasoning to the more capable one.
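A tiered strategy can start as a static routing rule. The sketch below uses the model identifiers and prices from the pricing table; the set of "simple" sub-task types and the helpers `pick_model` and `call_cost` are assumptions for illustration, and real routing would be validated against quality measurements per sub-task.

```python
# Input/output prices in USD per million tokens, from the pricing table.
PRICES = {
    "claude-haiku-4-5": (0.80, 4.00),
    "claude-sonnet-4-6": (3.00, 15.00),
}

# Assumed split: sub-task types considered simple enough for the cheaper tier.
SIMPLE_SUBTASKS = {"retrieval", "summarization", "classification"}


def pick_model(subtask_type):
    """Route simple sub-tasks to the cheaper model, everything else to the larger one."""
    if subtask_type in SIMPLE_SUBTASKS:
        return "claude-haiku-4-5"
    return "claude-sonnet-4-6"


def call_cost(model, input_tokens, output_tokens):
    """Cost in USD of one call, ignoring caching."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1e6


for subtask in ("classification", "planning"):
    model = pick_model(subtask)
    print(subtask, model, call_cost(model, 10_000, 1_000))
```

Defaulting unknown sub-task types to the more capable model keeps the failure mode on the side of higher cost, not lower quality.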
Token budget checklist before deployment
Before launching a production agentic system:
- Modeled the token budget for each context component (fixed and variable)
- Set allocation targets (% per component) for a target context window
- Prompt caching enabled and verified (check cache hit rate in logs)
- Tool surface limited to task-relevant tools per task type
- Conversation history management strategy implemented and tested
- Output length constrained in system prompt
- Cost per task logged (not just cost per call)
- Alert threshold set for cost per task (to detect regressions)
- Model tiering considered for sub-tasks that don't need frontier capability
Token budget management is not a one-time setup; it's an ongoing monitoring practice. Costs drift as systems evolve: system prompts grow, tool definitions get added, history management lapses. Treat cost per task as a quality metric and watch it the same way you watch latency and error rate.