Token budget planning and execution: how to manage context costs in production AI systems in 2026

Token budget management is the financial planning layer of context engineering. This guide covers how to estimate, allocate, and monitor token budgets across context window components (system prompt, retrieval, history, and tool definitions) and how to optimize cost per task completion rather than cost per token.

Jun 28, 2026 · 10 min read · Yash Thakker
Context engineering · LLM cost optimization · AI agents · Token budget · Developer tools

Token costs in production AI systems are almost never optimized. Teams launch with a rough understanding of API pricing, watch costs grow as usage scales, and start cutting corners (shorter outputs, smaller models) without diagnosing the root cause. Most of the time, the root cause is a context package with significant inefficiencies that no one measured.

This guide covers token budget management as a deliberate engineering practice: how to estimate costs before deployment, how to allocate the context window across components, how to use caching to reduce costs structurally, and how to monitor per-task cost to find where the budget is going.


Why tokens are both a cost and a quality concern

The obvious framing: tokens cost money. Input tokens are billed before the model generates anything. A 100k-token context window at $3 per million tokens (Claude claude-sonnet-4-6 pricing as of 2026) costs $0.30 per API call, before counting output tokens. For an agentic system making 20 API calls to complete a task, that's $6 in input tokens alone.

The less obvious framing: token allocation shapes quality. Every token in the context window competes for the model's attention. Attention research consistently shows that models distribute attention non-uniformly across long contexts: content at the beginning and end receives more weight than content in the middle. A 200k-token context window is not 200k tokens of equal attention; it's a budget that gets less effective as it fills with low-signal content.

High token count on low-signal content (outdated conversation turns, verbose tool schemas, irrelevant retrieved documents) both costs more and degrades quality. Token budget management is simultaneously cost engineering and quality engineering.


Pricing reference for 2026

Understanding your budget requires knowing the prices. As of mid-2026, the major frontier model APIs price input, cached input, and output tokens as follows (approximate; check provider documentation for current pricing):

| Model | Input (per M tokens) | Cached input (per M tokens) | Output (per M tokens) |
| --- | --- | --- | --- |
| Claude claude-sonnet-4-6 | $3.00 | $0.30 | $15.00 |
| Claude claude-haiku-4-5 | $0.80 | $0.08 | $4.00 |
| GPT-5.6 Terra | $2.50 | $1.25 | $10.00 |
| GPT-5.6 Luna (mini) | $0.40 | $0.10 | $1.60 |
| Gemini 3.5 Flash | $0.15 | $0.04 | $0.60 |
| Gemini 3.5 Pro | $1.25 | $0.31 | $5.00 |

Two patterns are immediately visible:

  1. Output tokens cost 4-5x as much as input tokens. This means controlling output length is often more valuable than optimizing input. But input tokens still matter at scale, especially for agentic systems with large context packages and many API calls per task.

  2. Cached input tokens cost 50-90% less than uncached, depending on the provider (90% less for the Claude models, 50% less for GPT-5.6 Terra). This is the single largest cost reduction available without changing model behavior, and it requires only that you structure your context correctly.
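Both ratios can be computed directly from the table. A minimal sketch, using the approximate prices listed above; the `PRICES` dictionary, its lowercase keys for the GPT and Gemini rows, and the function names are illustrative and not part of any SDK:

```python
# Approximate prices from the table above, in USD per million tokens.
# Dictionary keys are illustrative labels, not guaranteed API model IDs.
PRICES = {
    "claude-sonnet-4-6": {"input": 3.00, "cached": 0.30, "output": 15.00},
    "claude-haiku-4-5": {"input": 0.80, "cached": 0.08, "output": 4.00},
    "gpt-5.6-terra": {"input": 2.50, "cached": 1.25, "output": 10.00},
    "gpt-5.6-luna": {"input": 0.40, "cached": 0.10, "output": 1.60},
    "gemini-3.5-flash": {"input": 0.15, "cached": 0.04, "output": 0.60},
    "gemini-3.5-pro": {"input": 1.25, "cached": 0.31, "output": 5.00},
}


def output_to_input_ratio(model: str) -> float:
    """Price of one output token divided by the price of one input token."""
    p = PRICES[model]
    return p["output"] / p["input"]


def cache_discount(model: str) -> float:
    """Fraction saved on an input token when it is read from cache."""
    p = PRICES[model]
    return 1 - p["cached"] / p["input"]


for name in PRICES:
    print(f"{name}: output/input = {output_to_input_ratio(name):.1f}x, "
          f"cache discount = {cache_discount(name):.0%}")
```

Running this per model makes it easy to see how much the caching discount varies by provider before committing to one.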


The context budget model

Before building a system, model its token budget. This is the equivalent of capacity planning: you want to know where tokens go before you start seeing unexpectedly large bills.

Fixed costs (per session)

These don't change across API calls within a session:

| Component | Typical range |
| --- | --- |
| System prompt | 200–800 tokens |
| Tool definitions (per tool) | 100–400 tokens |
| Safety instructions | 50–200 tokens |
| Persistent context (CLAUDE.md equivalent) | 500–2,000 tokens |

Example: A system with a 400-token system prompt, 8 tools averaging 200 tokens each, and 200 tokens of safety instructions has fixed costs of 400 + 1,600 + 200 = 2,200 tokens. These are natural cache candidates.

Variable costs (per API call)

These change with each call:

| Component | Typical range |
| --- | --- |
| Conversation history | 0–50,000+ tokens (grows with session) |
| Retrieved documents (per chunk) | 300–1,000 tokens |
| Tool outputs (per call) | 100–5,000 tokens |
| Current user message | 50–500 tokens |

Example: At turn 15 of an agent session, with 5 retrieved chunks of 500 tokens each, 10,000 tokens of accumulated history, and a 200-token user message, variable costs are 2,500 + 10,000 + 200 = 12,700 tokens. Total context: 2,200 (fixed) + 12,700 (variable) = 14,900 tokens.

Across 15 turns, with this model, input token spend per session approaches 15 × (2,200 fixed + growing variable average), roughly 200,000 tokens for a complete session. At $3/million, that's $0.60 per session in input tokens. Caching applies only to the 2,200 fixed tokens, which add up to 33,000 tokens (about $0.10) across the 15 calls. Reading them at the cached rate, after a one-time cache write, drops the session to roughly $0.52.
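The session arithmetic can be reproduced with a small estimator. This is a sketch under stated assumptions: only the fixed prefix is cached, the first call pays a cache write at 1.25x the input price (an assumed premium), and every later call reads the prefix at the cached rate. The function and parameter names are illustrative.

```python
from typing import Optional


def session_input_cost(total_input_tokens: int, fixed_tokens: int, calls: int,
                       input_price: float, cached_price: Optional[float] = None,
                       write_multiplier: float = 1.25) -> float:
    """Estimated input-token cost in USD for one session.

    Prices are USD per million tokens. If cached_price is None, no caching
    is assumed. write_multiplier is an assumed cache-write premium.
    """
    if cached_price is None:
        return total_input_tokens * input_price / 1_000_000
    # Everything that is not the fixed prefix is billed at the full rate.
    variable_tokens = total_input_tokens - fixed_tokens * calls
    variable_cost = variable_tokens * input_price
    write_cost = fixed_tokens * input_price * write_multiplier  # first call only
    read_cost = fixed_tokens * (calls - 1) * cached_price       # remaining calls
    return (variable_cost + write_cost + read_cost) / 1_000_000


uncached = session_input_cost(200_000, 2_200, 15, input_price=3.00)
cached = session_input_cost(200_000, 2_200, 15, input_price=3.00, cached_price=0.30)
print(f"uncached: ${uncached:.2f}, cached prefix: ${cached:.2f}")
```

Because the fixed prefix is a small share of a long session, the estimator also shows how quickly accumulated history, not the prefix, comes to dominate input cost.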

The budget allocation target

Set a budget target for each component as a percentage of the total context window. A common allocation for a 32k-token context:

| Component | Allocation | Tokens |
| --- | --- | --- |
| System prompt + tools | 15% | ~4,800 |
| Retrieved context | 25% | ~8,000 |
| Conversation history | 30% | ~9,600 |
| Tool outputs (current turn) | 20% | ~6,400 |
| User message + output buffer | 10% | ~3,200 |

Adjust this allocation based on your task type. Heavy retrieval tasks allocate more to retrieved context. Conversational tasks with long turn sequences allocate more to history.
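An allocation like this is easy to encode and check at request-assembly time. The following is a minimal sketch, not a prescribed implementation; the component names and the `over_budget` helper are hypothetical.

```python
# Percentage shares per component (integers, so the arithmetic stays exact).
ALLOCATION_PCT = {
    "system_prompt_and_tools": 15,
    "retrieved_context": 25,
    "conversation_history": 30,
    "tool_outputs": 20,
    "user_message_and_output_buffer": 10,
}


def allocate_budget(context_window: int, shares_pct: dict) -> dict:
    """Convert percentage shares into per-component token budgets."""
    if sum(shares_pct.values()) != 100:
        raise ValueError("shares must sum to 100")
    return {name: context_window * pct // 100 for name, pct in shares_pct.items()}


def over_budget(actual: dict, budget: dict) -> dict:
    """Return the overage in tokens for each component that exceeds its budget."""
    return {
        name: actual[name] - limit
        for name, limit in budget.items()
        if actual.get(name, 0) > limit
    }


budget = allocate_budget(32_000, ALLOCATION_PCT)
print(budget)
print(over_budget({"conversation_history": 12_000, "retrieved_context": 6_000}, budget))
```

A non-empty result from `over_budget` is a natural trigger for trimming history or dropping retrieved chunks before the call is sent.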


Prompt caching: the highest-leverage optimization

Prompt caching is the most impactful token cost optimization available in 2026, and it requires no change to model behavior, only to how you structure the context.

How prompt caching works

You mark a prefix of your context as cacheable. On the first call, the full context is processed normally. On subsequent calls with the same cached prefix, the provider reuses its stored computation for that portion, and you're billed at the cache read rate (10-50% of the full input price in the table above, depending on the provider) instead of the full input token price.

The key constraint: the cached prefix must be byte-for-byte identical across calls. Any change to the prefix invalidates the cache and triggers a new cache write (usually billed at a small premium over the normal input price).

What to cache

Cache the stable prefix, meaning everything before the variable content:

  1. System prompt: stable across the session, ideal cache candidate
  2. Tool definitions: stable across the session, often the largest fixed cost
  3. Static reference content: project documentation, coding guidelines, product specs

Do not try to cache retrieved documents (they change per query) or conversation history (it accumulates).
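Because the prefix must stay byte-for-byte identical, it can help to fingerprint it and log the fingerprint with every call. The sketch below is one possible approach, assuming the stable prefix is a system prompt plus a list of tool definitions; `prefix_fingerprint` and the sample tool are hypothetical.

```python
import hashlib
import json


def prefix_fingerprint(system_prompt: str, tool_definitions: list) -> str:
    """SHA-256 over the serialized stable prefix; any byte change alters it.

    Keys are deliberately not sorted, so a change in field order also
    changes the fingerprint, mirroring a byte-level comparison.
    """
    serialized = json.dumps(
        {"system": system_prompt, "tools": tool_definitions},
        ensure_ascii=False,
    )
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()


tools = [{"name": "search_docs", "description": "Search product docs."}]
static_a = prefix_fingerprint("You are a support agent.", tools)
static_b = prefix_fingerprint("You are a support agent.", tools)
# Interpolating a date into the prompt changes the prefix on every new day.
dynamic = prefix_fingerprint("You are a support agent. Today is 2026-06-28.", tools)

print(static_a == static_b)  # True
print(static_a == dynamic)   # False
```

If the logged fingerprint changes between calls in the same session, something dynamic has leaked into the prefix and the cache is being rewritten rather than read.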

Implementation with Anthropic's API

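import anthropic

# Placeholder inputs for illustration; replace with your application's values.
SYSTEM_PROMPT = "You are a support agent. Answer using the provided tools."
tool_definitions_as_text = "Tools: search_docs(query), get_order(order_id)"
conversation_history = []  # prior turns as {"role": ..., "content": ...} dicts
user_message = "Where is my order?"

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=4096,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # breakpoint: cache the prefix through this block
        },
        {
            "type": "text",
            "text": tool_definitions_as_text,
            "cache_control": {"type": "ephemeral"},  # second breakpoint: cache through this block too
        },
    ],
    # History and the new user message come after the cached prefix.
    messages=conversation_history + [{"role": "user", "content": user_message}],
)

# The usage object reports how many input tokens were written to or read from cache.
print(response.usage.cache_creation_input_tokens, response.usage.cache_read_input_tokens)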

Each cache_control marker sets a cache breakpoint: the prefix up to and including the marked block is cacheable. Content after the last breakpoint (here, the conversation history and the new user message) is processed at the full input price on every call.

Cache TTL and cost

Anthropic's ephemeral cache has a 5-minute TTL, which is refreshed each time the cached prefix is read. For agentic sessions where API calls happen within 5 minutes of each other (the typical case), cache hits are near-100%. For batch processing with long delays between calls, the cache entry may expire, causing more cache misses.

The cache write fee (typically a 25% premium over the standard input price, paid once per write) amortizes over the session. If you make 10+ API calls with the same system prompt and tools, the cache write cost is negligible compared to the savings.
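The amortization can be worked through numerically. This sketch assumes a cache write billed at 1.25x the normal input price and reads billed at 0.10x (the Claude ratio in the pricing table), with every call landing inside the TTL; both multipliers are parameters and the function names are illustrative.

```python
def relative_prefix_cost(calls: int, write_multiplier: float = 1.25,
                         read_multiplier: float = 0.10) -> float:
    """Cost of the cached prefix over `calls` calls, in units of one uncached send."""
    if calls < 1:
        raise ValueError("calls must be at least 1")
    # One write on the first call, then one read per subsequent call.
    return write_multiplier + read_multiplier * (calls - 1)


def savings_fraction(calls: int, write_multiplier: float = 1.25,
                     read_multiplier: float = 0.10) -> float:
    """Fraction saved on the prefix versus sending it uncached on every call."""
    cached = relative_prefix_cost(calls, write_multiplier, read_multiplier)
    return 1 - cached / calls


for n in (1, 2, 5, 10, 20):
    print(f"{n:>2} calls: {savings_fraction(n):+.1%} on the cached prefix")
```

Under these assumptions a single call costs more with caching than without, the second call already more than recovers the write premium, and the savings approach the read discount as the call count grows.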


Monitoring per-task token cost

Cost per API call is an incomplete metric. An agent task that completes in 5 API calls and 30,000 tokens is more efficient than one that completes in 20 API calls and 150,000 tokens, even if the per-call costs look reasonable.

What to log

For every agent task, log:

  • Total input tokens across all API calls in the task
  • Total output tokens across all API calls in the task
  • Total API calls made
  • Cache hit rate (cached tokens vs. uncached input tokens)
  • Task outcome (success, partial success, failure)

Calculate: total cost = (uncached_input_tokens × input_price) + (cached_input_tokens × cache_price) + (output_tokens × output_price)
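A per-task accumulator keeps these fields together and applies the formula above. This is a minimal sketch; the `TaskUsage` class and its field names are hypothetical, the prices are passed in as USD per million tokens, and the hit rate is computed as cached tokens over all input tokens.

```python
from dataclasses import dataclass


@dataclass
class TaskUsage:
    """Accumulates token usage across every API call in one agent task."""
    uncached_input_tokens: int = 0
    cached_input_tokens: int = 0
    output_tokens: int = 0
    api_calls: int = 0
    outcome: str = "unknown"  # e.g. "success", "partial", "failure"

    def record_call(self, uncached_input: int, cached_input: int, output: int) -> None:
        self.uncached_input_tokens += uncached_input
        self.cached_input_tokens += cached_input
        self.output_tokens += output
        self.api_calls += 1

    @property
    def cache_hit_rate(self) -> float:
        """Cached input tokens as a share of all input tokens."""
        total_input = self.uncached_input_tokens + self.cached_input_tokens
        return self.cached_input_tokens / total_input if total_input else 0.0

    def total_cost(self, input_price: float, cache_price: float, output_price: float) -> float:
        """Cost in USD; prices are USD per million tokens."""
        return (
            self.uncached_input_tokens * input_price
            + self.cached_input_tokens * cache_price
            + self.output_tokens * output_price
        ) / 1_000_000


task = TaskUsage()
task.record_call(uncached_input=12_700, cached_input=2_200, output=400)
task.record_call(uncached_input=12_700, cached_input=2_200, output=400)
task.outcome = "success"
cost = task.total_cost(input_price=3.00, cache_price=0.30, output_price=15.00)
print(task.api_calls, f"{task.cache_hit_rate:.1%}", f"${cost:.4f}")
```

Writing one such record per task, rather than per call, is what makes cost per task comparable across releases.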

Warning signals

Rising cost per task without rising task complexity. If a benchmark task costs 2x more to complete at month 2 than month 1 (without changing the task), something in the context package is growing. Usually: conversation history accumulation, increasingly verbose tool definitions added over time, or growing system prompts.

Low cache hit rate on fixed costs. If your system prompt and tool definitions aren't caching effectively, check whether they're actually identical across calls (string interpolation or dynamic generation will break the cache).

High token cost on failed tasks. If your agent spends 50,000 tokens attempting a task before failing, those tokens cost the same as a successful task. Track failure cost separately: high failure cost often indicates a context package that leads the model into dead ends.

Disproportionate output token cost. If output tokens are 3-4x your input tokens, the model is generating very long responses. This is often a system prompt issue, so add explicit output length constraints.
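The four signals above can be turned into automated checks on each logged task. The following is a sketch only: the `TaskRecord` fields, the signal names, and the default thresholds (in particular the 0.8 minimum cache ratio) are assumptions to tune against your own baseline.

```python
from dataclasses import dataclass


@dataclass
class TaskRecord:
    cost_usd: float
    input_tokens: int          # cached + uncached input tokens
    cached_input_tokens: int
    output_tokens: int
    succeeded: bool


def warning_signals(task: TaskRecord, baseline_cost_usd: float,
                    expected_cached_tokens: int,
                    cost_growth_limit: float = 2.0,
                    min_cache_ratio: float = 0.8,
                    max_output_ratio: float = 3.0) -> list:
    """Return the names of the warning signals this task triggers."""
    signals = []
    # Rising cost per task relative to a benchmark run of the same task.
    if task.cost_usd >= cost_growth_limit * baseline_cost_usd:
        signals.append("cost_regression")
    # Fewer cached tokens than the fixed prefix should have produced.
    if expected_cached_tokens and task.cached_input_tokens < min_cache_ratio * expected_cached_tokens:
        signals.append("low_cache_hit_rate")
    # Failed tasks are flagged so their cost can be tracked separately.
    if not task.succeeded:
        signals.append("failed_task_cost")
    # Output volume far above input volume.
    if task.input_tokens and task.output_tokens >= max_output_ratio * task.input_tokens:
        signals.append("disproportionate_output")
    return signals


healthy = TaskRecord(0.10, 30_000, 20_000, 1_500, True)
unhealthy = TaskRecord(0.25, 10_000, 0, 40_000, False)
print(warning_signals(healthy, baseline_cost_usd=0.09, expected_cached_tokens=22_000))
print(warning_signals(unhealthy, baseline_cost_usd=0.10, expected_cached_tokens=8_000))
```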


Practical optimizations in order of impact

1. Enable prompt caching (immediate, no quality change)

The highest-ROI optimization. Adds a few lines of code; reduces input token cost on stable prefixes by 50-90%, depending on the provider's cached-input rate. Do this first.

2. Implement conversation history management

Without summarization or a sliding window, conversation history grows unboundedly and becomes the largest cost driver in long sessions. Implement summarization for sessions expected to exceed 15-20 turns.
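A sliding window is the simpler of the two strategies. The sketch below keeps the most recent messages that fit a token budget; it assumes plain-string message content and uses a rough four-characters-per-token heuristic, which is an approximation and should be replaced by the provider's token counter in practice. Summarization is not shown.

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic (about 4 characters per token); not a real tokenizer."""
    return max(1, len(text) // 4)


def trim_history(messages: list, budget_tokens: int) -> list:
    """Keep the most recent messages whose estimated size fits the budget."""
    kept = []
    used = 0
    # Walk backwards from the newest message and stop at the first overflow.
    for message in reversed(messages):
        cost = estimate_tokens(message["content"])
        if used + cost > budget_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = [
    {"role": "user", "content": "a" * 400},       # ~100 tokens
    {"role": "assistant", "content": "b" * 400},  # ~100 tokens
    {"role": "user", "content": "c" * 400},       # ~100 tokens
]
trimmed = trim_history(history, budget_tokens=250)
print(len(trimmed), trimmed[0]["role"])
```

Some chat APIs expect the history to begin with a user turn, so a production version may need to drop a leading assistant message after trimming.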

3. Minimize tool surface

Every tool definition that's not needed for the current task is wasted tokens. Route tasks to tool subsets based on task type. For a system with 20 tools, exposing 5 relevant tools per task type rather than all 20 reduces tool definition token cost by 75%.
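Routing to a tool subset can be as simple as a lookup table keyed by task type. This is a sketch with a hypothetical registry; the tool names, descriptions, and task types are placeholders, and real definitions would carry full input schemas.

```python
# Hypothetical tool registry; names and descriptions are placeholders.
TOOL_REGISTRY = {
    "search_docs": {"name": "search_docs", "description": "Search product documentation."},
    "get_order": {"name": "get_order", "description": "Fetch an order by ID."},
    "refund_order": {"name": "refund_order", "description": "Issue a refund for an order."},
    "run_sql": {"name": "run_sql", "description": "Run a read-only SQL query."},
}

# Which tools each task type is allowed to see.
TOOLSETS = {
    "support": ["search_docs", "get_order"],
    "billing": ["get_order", "refund_order"],
    "analytics": ["run_sql"],
}

DEFAULT_TOOLSET = ["search_docs"]


def tools_for_task(task_type: str) -> list:
    """Return only the tool definitions relevant to this task type."""
    names = TOOLSETS.get(task_type, DEFAULT_TOOLSET)
    return [TOOL_REGISTRY[name] for name in names]


print([tool["name"] for tool in tools_for_task("billing")])
print([tool["name"] for tool in tools_for_task("unknown")])
```

One design consideration: if tool definitions are part of the cached prefix, each task type produces a different prefix and therefore its own cache entry, so keeping the number of distinct tool subsets small helps cache hit rates.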

4. Optimize retrieval

Reduce k (number of retrieved chunks), raise the relevance threshold to drop low-scoring results, and trim chunk sizes for high-density documents. Test whether retrieval quality degrades: often reducing k from 10 to 5 loses nothing in quality while halving retrieval token cost.
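The three knobs (k, relevance threshold, chunk size) can be applied in a single post-retrieval step. A minimal sketch, assuming the retriever returns `(score, text)` pairs with higher scores meaning more relevant; `select_chunks` and the character-based trim are illustrative choices rather than a specific library's API.

```python
def select_chunks(scored_chunks: list, k: int, min_score: float, max_chars: int) -> list:
    """Keep at most k chunks scoring at least min_score, each trimmed to max_chars."""
    ranked = sorted(scored_chunks, key=lambda pair: pair[0], reverse=True)
    return [text[:max_chars] for score, text in ranked[:k] if score >= min_score]


chunks = [
    (0.91, "A" * 1000),
    (0.42, "B" * 1000),
    (0.77, "C" * 1000),
    (0.65, "D" * 1000),
]
loose = select_chunks(chunks, k=3, min_score=0.6, max_chars=500)
strict = select_chunks(chunks, k=3, min_score=0.7, max_chars=500)
print(len(loose), len(strict), len(loose[0]))
```

Comparing answer quality between the loose and strict settings on a fixed evaluation set is one way to find the smallest configuration that does not hurt results.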

5. Tighten the system prompt

Review your system prompt line by line. Remove background narrative (the model doesn't need history lessons; it needs instructions). Remove redundant statements of the same constraint. Remove examples that don't improve behavior measurably. Well-engineered system prompts are often 30-50% shorter than first drafts with no quality loss.

6. Right-size the model

For tasks that don't require frontier model capability, run on a faster, cheaper model. Claude Haiku 4.5 is ~3.75x cheaper than Sonnet 4.6 on input and output. For retrieval, summarization, and classification sub-tasks within an agentic workflow, Haiku often matches Sonnet at a fraction of the cost. Use a tiered model strategy: route simple sub-tasks to the cheaper model, complex reasoning to the more capable one.
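A tiered strategy can start as a static routing rule. The sketch below assumes sub-tasks are already labeled with a type; the set of "simple" sub-task types comes from the paragraph above, while the `pick_model` helper and the label strings are hypothetical.

```python
CHEAP_MODEL = "claude-haiku-4-5"
CAPABLE_MODEL = "claude-sonnet-4-6"

# Sub-task types assumed not to need frontier capability.
SIMPLE_SUBTASKS = {"retrieval", "summarization", "classification"}


def pick_model(subtask_type: str) -> str:
    """Route simple sub-tasks to the cheaper model, everything else to the capable one."""
    return CHEAP_MODEL if subtask_type in SIMPLE_SUBTASKS else CAPABLE_MODEL


print(pick_model("classification"))
print(pick_model("multi_step_planning"))
```

Defaulting unknown sub-task types to the more capable model trades some cost for safety; the routing table can be tightened as per-task quality data accumulates.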


Token budget checklist before deployment

Before launching a production agentic system:

  • Modeled the token budget for each context component (fixed and variable)
  • Set allocation targets (% per component) for a target context window
  • Prompt caching enabled and verified (check cache hit rate in logs)
  • Tool surface limited to task-relevant tools per task type
  • Conversation history management strategy implemented and tested
  • Output length constrained in system prompt
  • Cost per task logged (not just cost per call)
  • Alert threshold set for cost per task (to detect regressions)
  • Model tiering considered for sub-tasks that don't need frontier capability

Token budget management is not a one-time setup; it's an ongoing monitoring practice. Costs drift as systems evolve: system prompts grow, tool definitions get added, history management lapses. Treat cost per task as a quality metric and watch it the same way you watch latency and error rate.
