If you have ever read a doc that says “32k context” or “$2.50 per million input tokens” and only half-trusted your mental model, this article is the missing layer: what a token is, why providers count them, and how that connects to limits, bills, and rate limits.
Scope: this is a concepts guide. For dollar math, prompt caching, and agent pipelines, read Caveman skill: token economics and API pricing next.
Tokens are not the same as words
In daily language we count words. Under the hood, a large language model consumes a sequence of tokens: integer IDs from a fixed vocabulary, produced by a tokenizer (families you will see in papers include BPE, WordPiece, and vendor-specific schemes).
- A token can be a short whole word (e.g. “hello” might be one token).
- A token can be a subword — long or rare strings are often split into several pieces.
- Punctuation, spaces, and code are also encoded as one or more tokens. Code and JSON are often longer in token count than a casual glance suggests, because braces, semicolons, and indentation are all billed like anything else.
Why it matters: a “short” line in the editor can still be thousands of tokens once the app attaches system instructions, open files, tool schemas, and prior turns.
Heuristics (English prose, ballpark only): people often use ~4 characters per token, or roughly ¾ of a word per token (about 75 words per 100 tokens). Do not use heuristics for billing—use the provider’s tokenizer or usage dashboard for the model you run.
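A minimal sketch of that heuristic (the function name and default are ours, not any provider’s API; never use this for billing):

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Ballpark token estimate for English prose: ~4 characters per token.
    For billing, use the provider's tokenizer or usage dashboard instead."""
    return max(1, round(len(text) / chars_per_token))

# ~8 tokens for a 33-character sentence
print(estimate_tokens("Tokens are not the same as words."))  # 8
```

Expect the estimate to be much worse for code, JSON, and non-English text, where the characters-per-token ratio shifts considerably.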
Input vs output tokens
| Kind | What counts | Intuition |
|---|---|---|
| Input (prompt) tokens | System prompt, your message, full chat history the client sends, retrieved documents, tool parameters and tool results, images (often a separate budget), etc. | Everything the model must read to respond. |
| Output (completion) tokens | The model’s generated text (and sometimes separate billed fields, depending on product). | Everything the model writes. |
Two common surprises:
- “I only typed one sentence.” The service may still include all prior turns and in-scope files in the request—input can be huge compared to your last line.
- Long replies compound: output tokens in turn become input on the next turn, so verbosity in chat and agent loops can inflate both sides of the ledger.
On frontier models, output is often priced higher per token than input—see each vendor’s rate card (e.g. OpenAI, Anthropic).
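To see the compounding, here is an illustrative loop (all token counts and per-million rates are invented for the example) in which each turn’s output is resent as input on the next call:

```python
# Illustrative only: token counts and prices below are made up.
PRICE_IN = 2.50 / 1_000_000    # $ per input token (example rate)
PRICE_OUT = 10.00 / 1_000_000  # $ per output token (example rate)

system_tokens = 1_500  # system prompt + tool schemas, resent on every call
history = []           # (user_tokens, output_tokens) per completed turn
total_cost = 0.0

for turn in range(5):
    user_tokens = 50
    # Input = system block + every prior user and assistant turn + new message.
    input_tokens = system_tokens + sum(u + o for u, o in history) + user_tokens
    output_tokens = 400  # a fairly verbose reply
    history.append((user_tokens, output_tokens))
    total_cost += input_tokens * PRICE_IN + output_tokens * PRICE_OUT
    print(f"turn {turn + 1}: input={input_tokens}, output={output_tokens}")

print(f"total: ${total_cost:.4f}")
```

Note that input grows every turn even though the user types the same 50 tokens: by turn 5 the request carries 3,350 input tokens, most of them history.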
Context window: how many tokens fit in one go
The context window (e.g. 128k or 1M in marketing tables) is the maximum combined token budget the model is built to process in a single request: your input plus the room reserved for the reply. Exactly how that split is defined depends on the API, so read the spec for your model.
- If you exceed the limit, the system may error, truncate early content, or summarize—behavior is not uniform across products.
- A larger window is not a free pass: it means bigger prompts are possible, which can mean higher API cost or faster burn through subscription credits if the app sends whole trees or long histories by default.
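A sketch of the budget arithmetic a client has to do, assuming a simple drop-the-oldest-turn strategy (function names and numbers are ours, not any SDK’s):

```python
def fits_context(history_tokens: list[int], new_input: int,
                 reserved_output: int, context_window: int) -> bool:
    """Does input plus reserved reply room fit in one request?"""
    return sum(history_tokens) + new_input + reserved_output <= context_window

def trim_oldest(history_tokens: list[int], new_input: int,
                reserved_output: int, context_window: int) -> list[int]:
    """Drop oldest turns until the request fits (one common client strategy)."""
    history = list(history_tokens)
    while history and not fits_context(history, new_input,
                                       reserved_output, context_window):
        history.pop(0)  # forget the oldest turn
    return history

trimmed = trim_oldest([4000, 3000, 2000, 1000], 500, 1000, 8000)
print(trimmed)  # [3000, 2000, 1000]
```

Real products vary here: some error instead of trimming, some summarize old turns, and some truncate silently, which is exactly why the bullet above says behavior is not uniform.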
Why billing uses tokens (not pages or words)
- The model is literally trained and served as a function over token sequences—that is the native interface to the stack.
- Token count tracks compute and memory use more consistently than “words” across languages, markup, and code.
- Vendors can publish a single table—$/million input and $/million output—that scales with workload size.
You can still plan in paragraphs and files; the invoice will still speak in tokens.
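The invoice math from such a table is one line. A sketch using example rates (not any vendor’s actual prices):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Cost of one request from a $/million-token rate card."""
    return (input_tokens * usd_per_m_input
            + output_tokens * usd_per_m_output) / 1_000_000

# Example rates only; check your vendor's current rate card.
cost = request_cost(input_tokens=12_000, output_tokens=800,
                    usd_per_m_input=2.50, usd_per_m_output=10.00)
print(f"${cost:.4f}")  # $0.0380
```

Notice that the 800 output tokens here cost almost as much as the 12,000 input tokens, which is the asymmetric pricing mentioned earlier.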
“Cached” input (one paragraph)
Some APIs discount long unchanged prefixes of a prompt when they qualify for cached or reused input (rules differ by provider). The idea: if most of an agent’s prompt is a stable system block plus tool definitions, you pay less for that slice on the next call when caching hits. For production patterns, see the Caveman post and your vendor’s prompt caching documentation.
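As a toy model of that discount (the function, discount factor, and hit/miss flag are all hypothetical; real caching rules and rates differ by provider):

```python
def cost_with_cache(prefix_tokens: int, fresh_tokens: int, output_tokens: int,
                    usd_per_m_input: float, usd_per_m_output: float,
                    cache_discount: float = 0.1, cache_hit: bool = True) -> float:
    """Toy caching model: on a hit, the stable prefix is billed at a
    fraction of the normal input rate. Real provider rules vary."""
    prefix_rate = usd_per_m_input * (cache_discount if cache_hit else 1.0)
    return (prefix_tokens * prefix_rate
            + fresh_tokens * usd_per_m_input
            + output_tokens * usd_per_m_output) / 1_000_000

# Same call, with and without a cache hit (example rates only).
miss = cost_with_cache(10_000, 500, 300, 2.50, 10.00, cache_hit=False)
hit = cost_with_cache(10_000, 500, 300, 2.50, 10.00, cache_hit=True)
print(f"miss ${miss:.5f} vs hit ${hit:.5f}")
```

The point of the sketch: when the stable prefix dominates the prompt, as in agent loops with big system blocks and tool definitions, a cache hit changes the input bill substantially.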
Subscriptions vs APIs
- Chat and IDE products often show “messages” or a single usage meter. Underneath, that still maps to model calls and token-like budgets you may not see line by line.
- API usage pages usually show per-request or per-month token totals, which is closer to marginal cost modeling for an app you ship.
Either way, the scarce resource in aggregate is tokens over time (and provider capacity), which is where rate limits and plan tiers come from.
Practical habits
- Measure with your real stack: provider usage APIs, IDE panels, or token counters in CI.
- Trim what you add to every turn: large readmes and logs belong behind retrieval or on-demand file reads, not in global context by default, unless you truly need them every time.
- Prefer structured, reusable instructions (agent skills and templates) over pasting the same long preamble each session.
Read next
- Caveman skill: token economics, API pricing, and cutting verbose output
- What is a context window? (2026 model snapshot)
- What are parameters in an LLM?
- What is MCP? Model Context Protocol
- Claude Opus 4.7: models, limits, and pricing
Tokenizer behavior and plan limits are vendor- and model-specific; always read the current documentation for the product you use.