The context window (context length) is the token budget for a single request: the full prompt you send (system text, user message, prior turns, tool definitions, retrievals, tool results) plus the space reserved for the new completion, itself subject to a max-output cap. It limits how much the model can attend to at once; it is not the same thing as parameter count or token-metered billing, even though all three appear on the same pricing pages.
If the request does not fit, the system may return an error, truncate older content, or compress it—behavior depends on the client and API.
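The fit check above can be sketched as simple arithmetic. This is a minimal illustration, not any vendor's API: the limits are made-up placeholder numbers, and `fits` is a hypothetical helper.

```python
# Sketch: checking whether a request fits the context window.
# Limits are illustrative; real values come from the model's docs.
CONTEXT_WINDOW = 200_000   # total token budget for one request
MAX_OUTPUT = 64_000        # separate cap on the completion

def fits(prompt_tokens: int, requested_output: int) -> bool:
    """A request fits when prompt + completion stay inside the window,
    and the completion also respects the max-output cap."""
    return (requested_output <= MAX_OUTPUT
            and prompt_tokens + requested_output <= CONTEXT_WINDOW)

print(fits(150_000, 40_000))   # fits: 190k total, output under cap
print(fits(150_000, 60_000))   # fails: 210k exceeds the window
```

Real clients vary in what happens when this check fails: some reject the call outright, others silently truncate or compress the oldest content.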
How vendors talk about it
- “1M context” (typical 2026 flagship claim) is usually a ceiling on what you can pass in for one call, while max new tokens in the response may still be smaller (for example 64k–128k on sync APIs for some frontier models).
- “Max output tokens” is the cap on the assistant portion of the response for that call.
- Long-context products often have separate rate limits or pricing for very long sessions. OpenAI’s GPT-5.4 guide discusses 272k-scale thresholds for rate limits and pricing; re-read the current pricing page before sizing a workload.
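The distinction in the list above, a ceiling on total tokens versus a separate output cap, means the completion you can actually request is the smaller of two remaining budgets. A hedged sketch with made-up figures (not vendor quotes):

```python
# Sketch: the usable completion is bounded both by the max-output cap
# and by whatever the prompt leaves of the context window.
def usable_output(context_window: int, max_output_cap: int,
                  prompt_tokens: int) -> int:
    return max(0, min(max_output_cap, context_window - prompt_tokens))

# A hypothetical "1M context" model with a 128k output cap:
print(usable_output(1_000_000, 128_000, 900_000))  # window-bound: 100_000
print(usable_output(1_000_000, 128_000, 200_000))  # cap-bound: 128_000
```

The practical upshot: near the top of the window, the advertised max output is unreachable, because the prompt has already spent the budget.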
The table below is a 2026 documentation snapshot—always re-check the model page in your environment (direct API, Bedrock, Vertex, Foundry, etc.).
Snapshot: context and max output (per published docs)
| Model (line) | Context window (tokens) | Max output (typical) | Where to confirm |
|---|---|---|---|
| Claude Opus 4.7 | 1,000,000 (1M) | 128,000 (sync Messages) | Anthropic models overview |
| Claude Sonnet 4.6 | 1,000,000 (1M) | 64,000 | same |
| Claude Haiku 4.5 | 200,000 | 64,000 | same |
| GPT-5.4 (API) | 1M long-context; see guide for 272k vs 1M behavior | up to 128,000 (see model page) | Latest model, GPT-5.4 |
| Gemini 3.1 Pro (Preview) | 1,000,000 input (1M) | on the order of 64k output (varies) | Google AI dev docs, DeepMind |
| Llama 4 Maverick (open) | 1,000,000 (1M) in model card | server-dependent | Meta model card |
| Llama 4 Scout (open) | 10,000,000 (10M) in model card | server-dependent | same |
Caveats: a large on-paper window does not mean you should fill it; latency and cost usually grow with effective length. Multimodal inputs (images, video) use additional or separate budgets on many APIs. Hosted marketplaces can impose tighter caps than the base model card.
Why “million-token” class models are expensive to run
Standard transformer attention scales quadratically in memory with sequence length if scores are materialized naively, and the KV cache grows linearly, so providers invest in kernels, sparsity, chunking, and user-facing compaction to make 1M-class products usable. That engineering is a major reason for long-context tiers and pricing rules, not only marketing.
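A back-of-envelope sketch of why this matters, using hypothetical model dimensions (32 heads, 48 layers, d_model 8192, fp16) that are assumptions for illustration, not any real model's configuration:

```python
# Rough memory estimates for long sequences, assuming fp16 (2 bytes/value).
def naive_attention_bytes(seq_len: int, n_heads: int,
                          bytes_per_val: int = 2) -> int:
    # naive attention materializes one seq_len x seq_len score matrix per head
    return n_heads * seq_len * seq_len * bytes_per_val

def kv_cache_bytes(seq_len: int, n_layers: int, d_model: int,
                   bytes_per_val: int = 2) -> int:
    # keys + values, one set per layer, grows only linearly with length
    return 2 * n_layers * seq_len * d_model * bytes_per_val

for n in (8_000, 128_000, 1_000_000):
    print(n, "tokens:",
          round(naive_attention_bytes(n, n_heads=32) / 1e9), "GB of scores,",
          round(kv_cache_bytes(n, n_layers=48, d_model=8192) / 1e9, 1),
          "GB of KV cache")
```

The quadratic score term is why production kernels (FlashAttention-style) never materialize the full matrix; even so, the linear KV cache alone is substantial at 1M tokens.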
Working inside a finite window
- RAG and tools: pass pointers and retrieved chunks instead of whole corpora in every turn when possible (MCP patterns help).
- Vendor compaction: Anthropic context window and OpenAI GPT-5.4 guidance both emphasize managing state rather than infinite raw history.
- Shorter, high-signal output: brevity in agent responses (the “Caveman” pattern) slows how fast the next turn’s input balloons.
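The compaction idea above can be sketched as keeping the system prompt plus the newest turns until a budget is hit. `count_tokens` here is a crude whitespace stand-in and `trim_history` a hypothetical helper; a real client would use the provider's own tokenizer and a smarter summarization step.

```python
# Sketch: drop the oldest turns once a token budget is exceeded,
# always preserving the system prompt and the most recent context.
def count_tokens(text: str) -> int:
    return len(text.split())  # crude stand-in for a real tokenizer

def trim_history(system: str, turns: list[str], budget: int) -> list[str]:
    kept: list[str] = []
    used = count_tokens(system)
    for turn in reversed(turns):        # walk newest -> oldest
        cost = count_tokens(turn)
        if used + cost > budget:
            break                       # oldest turns fall off first
        kept.append(turn)
        used += cost
    return list(reversed(kept))         # restore chronological order

turns = ["old question", "old answer", "new question here"]
print(trim_history("be brief", turns, budget=7))
```

Vendor compaction features do something more sophisticated (summarizing rather than dropping), but the invariant is the same: bounded state per call, not infinite raw history.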
Read next
- What are tokens?
- What are parameters?
- Caveman: token economics in agents
- Claude Opus 4.7: full Anthropic comparison
Vendor numbers change; re-check the model and pricing page for the exact route you use.