
What is a context window? LLM 'working memory' and a 2026 snapshot of top models

The context window is how many tokens a model can condition on in one request—input plus the budget reserved for a reply. Here is a plain definition, how it differs from parameter count, and a comparison table for flagship 2026 models (GPT-5.4, Claude 4.7 family, Gemini 3.1 Pro, Meta Llama 4) with links to the canonical docs.

3 min read · ExplainX Team
LLM basics · Context window · Anthropic · OpenAI · Google Gemini · Meta Llama · 2026 models



The context window (context length) is the token budget a single request can use: the full prompt you send (system text, user message, prior turns, tool definitions, retrievals, tool results) plus the space reserved for the new completion, up to a max output cap. It limits how much the model can attend to at once; it is not the same thing as parameter count or token-metered billing, even though all three show up on the same pricing pages.

If the request does not fit, the system may return an error, truncate older content, or compress it—behavior depends on the client and API.
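
To make the limit concrete, here is a minimal pre-flight fit check. It assumes tiktoken's cl100k_base encoding as a stand-in tokenizer (each model ships its own, so counts are approximate) and borrows the 200k/64k figures from the table below as illustrative limits:

```python
# Minimal sketch of a pre-flight fit check. cl100k_base is a stand-in
# tokenizer; real models ship their own, so counts are approximate.
import tiktoken

CONTEXT_WINDOW = 200_000  # e.g. a 200k-window model, per the table below
MAX_OUTPUT = 64_000       # reply budget reserved for this call

enc = tiktoken.get_encoding("cl100k_base")

def fits(prompt: str, reply_budget: int = MAX_OUTPUT) -> bool:
    """True if prompt tokens plus the reserved reply fit in the window."""
    return len(enc.encode(prompt)) + reply_budget <= CONTEXT_WINDOW

print(fits("System text, user message, prior turns, tool results ..."))  # True
```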


How vendors talk about it

  • “1M context” (the typical 2026 flagship claim) is usually a ceiling on what you can pass in for one call, while max new tokens in the response may still be smaller (for example, 64k–128k on sync APIs for some frontier models).
  • “Max output tokens” is the cap on the assistant portion of the response for that call; the sketch after this list shows how the two limits interact.
  • Long-context products often carry separate rate limits or pricing for very long sessions. OpenAI’s GPT-5.4 guide discusses 272k-scale thresholds for rate limits and pricing; re-read the current pricing page before sizing a workload.
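
How the ceiling and the output cap interact is easiest to see in code. Below is a minimal sketch; reply_budget is an illustrative helper, not any vendor's API, and the numbers are placeholders:

```python
# Hedged sketch: how context window, max output, and prompt size interact
# for one call. Illustrative values only; check your provider's model page.
def reply_budget(context_window: int, max_output: int, prompt_tokens: int) -> int:
    """Largest completion you can ask for: leftover window, capped by max output."""
    return max(0, min(context_window - prompt_tokens, max_output))

# A 1M-window model with a 128k output cap, fed a 950k-token prompt,
# can only return 50k tokens: the window, not the output cap, binds.
print(reply_budget(1_000_000, 128_000, 950_000))  # 50000
```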

The table below is a 2026 documentation snapshot—always re-check the model page in your environment (direct API, Bedrock, Vertex, Foundry, etc.).


Snapshot: context and max output (per published docs)

| Model (line) | Context window (tokens) | Max output (typical) | Where to confirm |
| --- | --- | --- | --- |
| Claude Opus 4.7 | 1,000,000 (1M) | 128,000 (sync Messages) | Anthropic models overview |
| Claude Sonnet 4.6 | 1,000,000 (1M) | 64,000 | same |
| Claude Haiku 4.5 | 200,000 | 64,000 | same |
| GPT-5.4 (API) | 1M long-context; see guide for 272K vs 1M behavior | up to 128,000 (see model page) | Latest model, GPT-5.4 |
| Gemini 3.1 Pro (Preview) | 1,000,000 input (1M) | on the order of 64k output (varies) | Google AI dev docs, DeepMind |
| Llama 4 Maverick (open) | 1,000,000 (1M) in model card | server-dependent | Meta model card |
| Llama 4 Scout (open) | 10,000,000 (10M) in model card | server-dependent | same |

Caveats: a large on-paper window does not force you to fill it; latency and cost usually grow with effective length. Multimodal inputs (images, video) use additional or separate budgets on many APIs. Hosted marketplaces can impose tighter caps than the base model card.
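
If you automate against several routes, a snapshot like this works best as local config that fails fast before a call. Here is a sketch with the table's numbers hard-coded (they go stale, so re-check the docs) and model-ID strings invented for illustration:

```python
# Snapshot of the table above as config. The numbers go stale and the ID
# strings are invented for illustration; verify both against vendor docs.
LIMITS = {
    "claude-opus-4.7":   {"context": 1_000_000, "max_output": 128_000},
    "claude-sonnet-4.6": {"context": 1_000_000, "max_output": 64_000},
    "claude-haiku-4.5":  {"context": 200_000,   "max_output": 64_000},
    "gpt-5.4":           {"context": 1_000_000, "max_output": 128_000},
    "gemini-3.1-pro":    {"context": 1_000_000, "max_output": 64_000},
}

def assert_fits(model: str, prompt_tokens: int, want_output: int) -> None:
    """Raise before sending a request the published limits would reject."""
    lim = LIMITS[model]
    if want_output > lim["max_output"]:
        raise ValueError(f"{model}: output cap is {lim['max_output']:,} tokens")
    if prompt_tokens + want_output > lim["context"]:
        raise ValueError(f"{model}: exceeds the {lim['context']:,}-token window")

assert_fits("claude-haiku-4.5", prompt_tokens=150_000, want_output=40_000)  # ok
```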


Why “million-token” class models are expensive to run

Standard transformer attention costs grow quadratically with sequence length, and the per-request KV cache grows linearly but is already enormous at million-token scale, so providers invest in kernels, sparsity, chunking, and user-facing compaction to make 1M-class products usable. That engineering is a major reason for long-context tiers and pricing rules, not only marketing.
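
A back-of-envelope KV-cache estimate shows the memory side. The layer, head, and dimension numbers below are assumptions for illustration, not any vendor's published architecture:

```python
# Rough KV-cache memory for one sequence in fp16. The config is a
# hypothetical GQA setup, assumed for illustration only; real 1M-class
# models cut this further with paging, sparsity, and quantization.
layers, kv_heads, head_dim = 80, 8, 128
bytes_per_value = 2  # fp16

def kv_cache_bytes(seq_len: int) -> int:
    # K and V tensors per layer, each seq_len x kv_heads x head_dim values
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

for n in (8_000, 128_000, 1_000_000):
    print(f"{n:>9,} tokens -> {kv_cache_bytes(n) / 2**30:7.1f} GiB")
# ~0.31 MiB per token here: about 2.4 GiB at 8k, 39 GiB at 128k, 305 GiB at 1M
```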


Working inside a finite window

  • RAG and tools: pass pointers and retrieved chunks instead of whole corpora in every turn when possible (MCP patterns help).
  • Vendor compaction: Anthropic’s context-window docs and OpenAI’s GPT-5.4 guidance both emphasize managing state rather than shipping unbounded raw history.
  • Shorter, high-signal output: brevity in agents (see the Caveman post) slows how fast the next turn’s input balloons; a minimal truncation sketch follows this list.
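
The sketch below shows client-side truncation only: keep the system prompt plus the newest turns that fit a budget. count_tokens is a crude characters-per-token stand-in for a real tokenizer, and this does not replicate vendor-side summarization:

```python
# Minimal client-side compaction sketch. count_tokens is a crude stand-in
# (~4 chars/token); swap in your model's real tokenizer in practice.
def count_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def trim_history(system: str, turns: list[str], budget: int) -> list[str]:
    """Keep the system prompt plus the newest turns that fit the budget."""
    kept: list[str] = []
    used = count_tokens(system)
    for turn in reversed(turns):  # walk newest -> oldest
        cost = count_tokens(turn)
        if used + cost > budget:
            break                 # drop this turn and everything older
        kept.append(turn)
        used += cost
    return [system] + kept[::-1]  # restore chronological order

msgs = trim_history("You are terse.", ["old turn ...", "mid turn ...", "new turn ..."], budget=8)
# -> ["You are terse.", "new turn ..."]; older turns fell out of the budget
```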


Vendor numbers change; re-check the model and pricing page for the exact route you use.
