What is a context window in a large language model?

The context window (or context length) is the maximum number of tokens the model can use in a single request for all inputs plus the new completion—subject to the API’s exact split between prompt budget and max output. It is a hard cap for 'what fits in one forward pass' for that model, not a measure of how many parameters the model has.

Is the context window the same as the maximum output length?

Not exactly. 'Context window' is often stated as a total (e.g. 1M tokens) that must include your prompt, prior turns, tool payloads, and the assistant’s new tokens up to a max output limit. The exact accounting depends on the provider; always read the model page for that API.

Why do providers charge more or throttle above certain context sizes?

Long sequences increase memory and compute for attention; many vendors use tiered rate limits, priority lanes, or different pricing for very long requests. As of 2026, OpenAI documents separate handling for very long GPT-5.4 prompts. Model size in billions of parameters is a different question—see /blog/llm-model-parameters-billions-explained.

What should I do when my conversation exceeds the window?

Summarization, retrieval (RAG) instead of pasting full corpora, structured tool outputs, or vendor features like message compaction. Anthropic documents context management strategies in Claude API docs under context windows; OpenAI documents compaction for long agent trajectories in GPT-5.4 guidance.

Where is the relationship between tokens and billing explained?

See /blog/what-are-llm-tokens for what tokens are; /blog/caveman-token-compression for how output-heavy agents and caching change bills.

What is a context window? LLM 'working memory' and a 2026 snapshot of top models | explainx.ai Blog

The context window (context length) is the token budget a single request can use: the full prompt you send (system text, user message, prior turns, tool definitions, retrievals, tool results) plus the space allowed for the new completion, up to a max output cap. It is a limit on how much the model can attend to at once, not the same as parameter count or how usage is metered in tokens—though all three appear on the same pricing pages.

If the request does not fit, the system may return an error, truncate older content, or compress it—behavior depends on the client and API.

How vendors talk about it

“1M context” (typical 2026 flagship claim) is usually a ceiling on what you can pass in for one call, while max new tokens in the response may still be smaller (for example 64k–128k on sync APIs for some front-tier models).
“Max output tokens” is the cap on the assistant portion of the response for that call.
Long-context products often have separate rate limits or price for very long sessions. OpenAI’s GPT-5.4 guide discusses 272k-scale thresholds for rate limits and pricing; re-read the current pricing page before sizing a workload.

The table below is a 2026 documentation snapshot—always re-check the model page in your environment (direct API, Bedrock, Vertex, Foundry, etc.).

Snapshot: context and max output (per published docs)

Model (line)	Context window (tokens)	Max output (typical)	Where to confirm
Claude Opus 4.7	1,000,000 (1M)	128,000 (sync Messages)	Anthropic models overview
Claude Sonnet 4.6	1,000,000 (1M)	64,000	same
Claude Haiku 4.5	200,000	64,000	same
GPT-5.4 (API)	1M long-context; see guide for 272K vs 1M behavior	up to 128,000 (see model page)	Latest model, GPT-5.4
Gemini 3.1 Pro (Preview)	1,000,000 input (1M)	on the order of 64k output (varies)	Google AI dev docs, DeepMind
Llama 4 Maverick (open)	1,000,000 (1M) in model card	server-dependent	Meta model card
Llama 4 Scout (open)	10,000,000 (10M) in model card	server-dependent	same

Caveats: a large on-paper window does not force you to fill it; latency and cost usually grow with effective length. Multimodal inputs (images, video) use additional or separate budgets on many APIs. Hosted marketplaces can impose tighter caps than the base model card.

Why “million-token” class models are expensive to run

Standard transformer attention is heavy in memory for long sequences, so providers invest in kernels, sparsity, chunking, and user-facing compaction to make 1M-class products usable. That engineering is a major reason for long-context tiers and pricing rules—not only marketing.

Working inside a finite window

RAG and tools: pass pointers and retrieved chunks instead of whole corpora in every turn when possible (MCP patterns help).
Vendor compaction: Anthropic context window and OpenAI GPT-5.4 guidance both emphasize managing state rather than infinite raw history.
Shorter high-signal output: Caveman / brevity in agents reduces how fast the next turn’s input balloons.

What is a context window? LLM 'working memory' and a 2026 snapshot of top models

How vendors talk about it

Snapshot: context and max output (per published docs)

Why “million-token” class models are expensive to run

Working inside a finite window

Read next

Related posts

What are parameters in a large language model? Billions, MoE, and what 2026 model cards really say

What are tokens? A plain guide to how LLMs count (and charge for) text

Anthropic Project Deal: Claude AI Agents Negotiate 186 Deals in Office Marketplace Experiment