What is a token in an LLM?

A token is a chunk of text from the model’s vocabulary—often a subword piece produced by a tokenizer (such as BPE). The model does not “see” characters or words directly; it sees a sequence of token IDs. A short word might be one token; a long or rare word might be several. Punctuation, spaces, and code symbols are also split into one or more tokens.

How many tokens are in a word?

There is no fixed ratio. In English, a rough rule of thumb is about 0.75 words per token (so roughly 1.3 tokens per word on average), but it varies by language, style, and whether the text is code, JSON, or prose. The only reliable approach is to count with the tokenizer your provider or app exposes.

What is the difference between input tokens and output tokens?

Input (or prompt) tokens are everything you send: system prompt, user message, tool definitions, prior turns, and retrieved documents. Output (or completion) tokens are what the model generates. Providers often price output higher per token than input: output tokens are generated one at a time, whereas the prompt can be processed in parallel, and the higher price also discourages needlessly long answers.

What is a context window or context length?

The context window is the maximum number of tokens the model can consider in a single request: the input plus the space reserved for the reply. If your conversation plus instructions exceed the limit, you must trim, summarize, or start a new thread. Larger windows are not “free”; they increase compute and are often priced or tiered by plan.

Why do apps say I am “out of usage” if tokens are cheap?

Consumer plans usually wrap an allowance (messages, credits, or soft caps) that maps internally to model calls and token budgets. A single long session with a big file or many tool calls can consume many tokens quickly. For production, APIs bill tokens (and sometimes tools) explicitly—see our token economics follow-up: /blog/caveman-token-compression.

What are tokens? A plain guide to how LLMs count (and charge for) text | explainx.ai Blog

If you have ever read a doc that says “32k context” or “$2.50 per million input tokens” and only half-trusted your mental model, this article fills in the missing layer: what a token is, why providers count them, and how that connects to context limits, bills, and rate limits.

Scope: this is a concepts guide. For dollar math, prompt caching, and agent pipelines, read “Caveman skill: token economics and API pricing” next.

Tokens are not the same as words

In daily language we count words. Under the hood, a large language model consumes a sequence of tokens: integer IDs from a fixed vocabulary, produced by a tokenizer (families you will see in papers include BPE, WordPiece, and vendor-specific schemes).

A token can be a short whole word (e.g. hello might be one token).
A token can be a subword — long or rare strings are often split into several pieces.
Punctuation, spaces, and code are also encoded as one or more tokens. Code and JSON are often longer in token count than a casual glance suggests, because braces, semicolons, and indentation are all billed like anything else.
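To make the three cases above concrete, here is a minimal sketch of a subword split. It is not BPE or any vendor's tokenizer: it uses a plain greedy longest-match lookup against a tiny vocabulary, and both `TOY_VOCAB` and `toy_tokenize` are hypothetical names invented for illustration.

```python
# Hypothetical toy vocabulary (an assumption for illustration only).
# Real vocabularies hold tens of thousands of entries.
TOY_VOCAB = {
    "hello": 0, "token": 1, "iz": 2, "ation": 3, " ": 4, "!": 5,
    "{": 6, "}": 7, ";": 8,
}


def toy_tokenize(text, vocab=TOY_VOCAB):
    """Split text into vocabulary pieces using greedy longest-match."""
    pieces = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, then shrink it one character at a time.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in vocab:
                pieces.append(piece)
                pos = end
                break
        else:
            # No vocabulary entry starts at this position.
            raise ValueError(f"no vocabulary entry covers {text[pos]!r}")
    return pieces


def to_ids(pieces, vocab=TOY_VOCAB):
    """Map pieces to the integer IDs the model would actually receive."""
    return [vocab[p] for p in pieces]


print(toy_tokenize("hello"))          # ['hello']
print(toy_tokenize("tokenization"))   # ['token', 'iz', 'ation']
print(to_ids(toy_tokenize("{ };")))   # [6, 4, 7, 8]
```

The short word maps to one token, the longer word to three pieces, and every brace, space, and semicolon costs a token of its own.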

Why it matters: a “short” line in the editor can still be thousands of tokens once the app attaches system instructions, open files, tool schemas, and prior turns.

Heuristics (English prose, ballpark only): people often use ~4 characters per token, or on the order of one token per ¾ of a word. Do not use heuristics for billing—use the provider’s tokenizer or usage dashboard for the model you run.
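The two heuristics above can be written down as a quick estimator. This is a ballpark sketch for English prose only; `estimate_tokens` is a hypothetical helper, and its output should not be mistaken for a real tokenizer count.

```python
def estimate_tokens(text: str) -> dict:
    """Rough token estimates from the two rules of thumb (English prose only)."""
    chars = len(text)
    words = len(text.split())
    return {
        "by_chars": round(chars / 4),     # ~4 characters per token
        "by_words": round(words / 0.75),  # ~0.75 words per token
    }


print(estimate_tokens("The quick brown fox jumps over the lazy dog."))
# {'by_chars': 11, 'by_words': 12}
```

The two rules rarely agree exactly, which is a useful reminder that both are approximations.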

Input vs output tokens

Kind	What counts	Intuition
Input (prompt) tokens	System prompt, your message, full chat history the client sends, retrieved documents, tool parameters and tool results, images (often a separate budget), etc.	Everything the model must read to respond.
Output (completion) tokens	The model’s generated text (and sometimes separate billed fields, depending on product).	Everything the model writes.

Two common surprises:

“I only typed one sentence.” The service may still include all prior turns and in-scope files in the request—input can be huge compared to your last line.
Long replies compound: each reply’s output tokens become input on the next turn, so verbosity in chat and agent loops can inflate both sides of the ledger.
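Both surprises come from the same mechanism, which a few lines of arithmetic can show. The sketch below assumes the client resends the full history on every turn with no trimming or caching; `cumulative_input_tokens` and the token counts are hypothetical.

```python
def cumulative_input_tokens(system_tokens, turns):
    """Return the input tokens sent on each turn.

    turns: list of (user_tokens, reply_tokens) pairs, oldest first.
    Assumes the whole history is resent every turn (no trimming, no caching).
    """
    history = system_tokens
    per_turn_input = []
    for user_tokens, reply_tokens in turns:
        history += user_tokens          # the new user message joins the prompt
        per_turn_input.append(history)  # everything so far is sent as input
        history += reply_tokens         # the reply becomes input on the next turn
    return per_turn_input


# A 500-token system prompt, then three turns of a 20-token question
# and a 300-token answer.
print(cumulative_input_tokens(500, [(20, 300), (20, 300), (20, 300)]))
# [520, 840, 1160]
```

The user typed 60 tokens in total, yet the three requests carried 2,520 input tokens between them.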

On frontier models, output is often priced higher per token than input—see each vendor’s rate card (e.g. OpenAI, Anthropic).

Context window: how many tokens fit in one go

The context window (e.g. 128k or 1M in marketing tables) is the maximum combined budget the model is built to process in a single request: your input plus the room reserved for the reply (how the split is defined depends on the API—read the spec for your model).

If you exceed the limit, the system may error, truncate early content, or summarize—behavior is not uniform across products.
A larger window is not a free pass: it means bigger prompts are possible, which can mean higher API cost or faster burn through subscription credits if the app sends whole trees or long histories by default.
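One common client-side response to the limit is to drop the oldest messages until the rest fits. The sketch below shows that policy only; `trim_to_window` is a hypothetical helper, the numbers are made up, and real products may instead return an error or summarize.

```python
def trim_to_window(message_tokens, context_window, reserved_output):
    """Drop the oldest messages until the remainder fits the input budget.

    message_tokens: per-message token counts, oldest first.
    reserved_output: tokens set aside for the reply.
    """
    budget = context_window - reserved_output
    if budget < 0:
        raise ValueError("reserved_output exceeds context_window")
    kept = list(message_tokens)
    while kept and sum(kept) > budget:
        kept.pop(0)  # discard the oldest message first
    return kept


# An 8,000-token window with 1,000 tokens reserved for the reply
# leaves 7,000 tokens for input.
print(trim_to_window([4000, 3000, 2000, 1000], 8000, 1000))
# [3000, 2000, 1000]
```

A real client would usually pin the system prompt rather than let it be dropped with the oldest messages; that detail is left out here for brevity.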

Why billing uses tokens (not pages or words)

The model is literally trained and served as a function over token sequences—that is the native interface to the stack.
Token count tracks compute and memory use more consistently than “words” across languages, markup, and code.
Vendors can publish a single table—$/million input and $/million output—that scales with workload size.

You can still plan in paragraphs and files; the invoice will still speak in tokens.
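The two-column rate card translates into a one-line formula. In the sketch below the $2.50 input price echoes the example figure from the introduction, while the $10.00 output price is an assumed placeholder, not any vendor's actual rate.

```python
def request_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Dollar cost of one request, given prices per million tokens."""
    input_cost = (input_tokens / 1_000_000) * input_price_per_m
    output_cost = (output_tokens / 1_000_000) * output_price_per_m
    return input_cost + output_cost


# 12,000 input tokens and 800 output tokens at the assumed prices.
cost = request_cost(12_000, 800, input_price_per_m=2.50, output_price_per_m=10.00)
print(f"${cost:.4f}")  # $0.0380
```

Here the input side accounts for $0.03 of the $0.038 total even though output is priced four times higher per token, because the prompt is so much longer than the reply.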

“Cached” input (one paragraph)

Some APIs discount long unchanged prefixes of a prompt when they qualify for cached or reused input (rules differ by provider). The idea: if most of an agent’s prompt is a stable system block plus tool definitions, you pay less for that slice on the next call when caching hits. For production patterns, see the Caveman post and your vendor’s prompt caching documentation.
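As a rough illustration of why caching matters for agents, the sketch below prices the stable prefix at a discount. The 50% discount and the helper `input_cost_with_cache` are assumptions for illustration; actual discounts, minimum prefix lengths, and expiry rules vary by provider.

```python
def input_cost_with_cache(cached_tokens, fresh_tokens, input_price_per_m, cached_discount):
    """Input cost when part of the prompt is billed at a cached rate.

    cached_discount: fraction taken off the input price for cached tokens
    (0.5 means half price). This value is an assumption, not a vendor rate.
    """
    if not 0.0 <= cached_discount <= 1.0:
        raise ValueError("cached_discount must be between 0 and 1")
    cached_price = input_price_per_m * (1.0 - cached_discount)
    total = cached_tokens * cached_price + fresh_tokens * input_price_per_m
    return total / 1_000_000


# A 10,000-token stable prefix plus 2,000 fresh tokens at $2.50 per million.
with_cache = input_cost_with_cache(10_000, 2_000, 2.50, cached_discount=0.5)
no_cache = input_cost_with_cache(0, 12_000, 2.50, cached_discount=0.5)
print(f"${with_cache:.4f} vs ${no_cache:.4f}")  # $0.0175 vs $0.0300
```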

Subscriptions vs APIs

Chat and IDE products often show “messages” or a single usage meter. Underneath, that still maps to model calls and token-like budgets you may not see line by line.
API usage pages usually show per-request or per-month token totals, which is closer to marginal cost modeling for an app you ship.

Either way, the scarce resource in aggregate is tokens over time (and provider capacity), which is where rate limits and plan tiers come from.
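A tokens-per-minute limit can be pictured as a sliding window over recent requests. The sketch below is one possible model of such a limit, not how any particular provider enforces it; the class `TokenRateLimiter`, the 60-second window, and the 1,000-token limit are all hypothetical.

```python
from collections import deque


class TokenRateLimiter:
    """Sliding-window limiter: at most `tokens_per_minute` tokens per window."""

    def __init__(self, tokens_per_minute, window_seconds=60.0):
        self.limit = tokens_per_minute
        self.window = window_seconds
        self.events = deque()  # (timestamp, tokens) pairs, oldest first
        self.used = 0

    def try_spend(self, tokens, now):
        """Record the request and return True if it fits, else return False."""
        # Forget requests that have aged out of the window.
        while self.events and now - self.events[0][0] >= self.window:
            _, old_tokens = self.events.popleft()
            self.used -= old_tokens
        if self.used + tokens > self.limit:
            return False
        self.events.append((now, tokens))
        self.used += tokens
        return True


limiter = TokenRateLimiter(tokens_per_minute=1000)
print(limiter.try_spend(600, now=0.0))   # True
print(limiter.try_spend(600, now=10.0))  # False: 1,200 would exceed the limit
print(limiter.try_spend(600, now=60.0))  # True: the first request has aged out
```

Timestamps are passed in explicitly so the behavior is easy to follow; a real client would read them from a clock.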

Practical habits

Measure with your real stack: provider usage APIs, IDE panels, or token counters in CI.
Trim what you add to every turn—large readmes and logs belong behind retrieval or on-demand file reads, not by default in global context, unless you truly need them every time.
Prefer structured, reusable instructions (agent skills and templates) over pasting the same long preamble each session.
