
What is a context window? LLM 'working memory' and a 2026 snapshot of top models

The context window is how many tokens a model can condition on in one request—input plus the budget reserved for a reply. Here is a plain definition, how it differs from parameter count, and a comparison table for flagship 2026 models (GPT-5.4, Claude 4.7 family, Gemini 3.1 Pro, Meta Llama 4) with links to the canonical docs.

3 min read · Yash Thakker
LLM basics · Context window · Anthropic · OpenAI · Google Gemini · Meta Llama · 2026 models

The context window (context length) is the token budget a single request can use: the full prompt you send (system text, user message, prior turns, tool definitions, retrievals, tool results) plus the space allowed for the new completion, up to a max output cap. It limits how much the model can attend to at once. It is not the same thing as parameter count (the size of the model's weights) or as token-metered usage (what you are billed for), though all three appear on the same pricing pages.

If the request does not fit, the system may return an error, truncate older content, or compress it; which of these happens depends on the client and API.
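The budget rule above can be sketched as a simple pre-flight check. This is a minimal illustration, assuming the provider counts input and reserved output against one shared window; the `Limits` class, the `fits` helper, and the example numbers are hypothetical and not any vendor's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    """Hypothetical per-model limits; copy real values from the vendor's model page."""
    context_window: int  # total tokens per request (input + reserved output)
    max_output: int      # cap on new tokens in one response


def fits(input_tokens: int, reserved_output: int, limits: Limits) -> bool:
    """Return True if the request stays inside both the output cap and the window."""
    if reserved_output > limits.max_output:
        return False
    return input_tokens + reserved_output <= limits.context_window


def max_input_budget(reserved_output: int, limits: Limits) -> int:
    """Tokens left for the prompt after reserving space for the reply."""
    return limits.context_window - min(reserved_output, limits.max_output)


# Example numbers only (a 200k window with a 64k output cap).
example = Limits(context_window=200_000, max_output=64_000)
print(fits(150_000, 64_000, example))     # False: 214,000 tokens exceed the window
print(max_input_budget(64_000, example))  # 136000
```

Running a check like this before sending a request makes the failure mode explicit in your own code instead of leaving it to whatever the client or API does by default.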


How vendors talk about it

  • “1M context” (a typical 2026 flagship claim) is usually a ceiling on what you can pass in for one call, while the maximum number of new tokens in the response may still be much smaller (for example 64k–128k on sync APIs for some frontier models).
  • “Max output tokens” is the cap on the assistant portion of the response for that call.
  • Long-context products often have separate rate limits or pricing for very long requests. OpenAI’s GPT-5.4 guide discusses 272k-scale thresholds for rate limits and pricing; re-read the current pricing page before sizing a workload.

The table below is a 2026 documentation snapshot—always re-check the model page in your environment (direct API, Bedrock, Vertex, Foundry, etc.).


Snapshot: context and max output (per published docs)

| Model (line) | Context window (tokens) | Max output (typical) | Where to confirm |
|---|---|---|---|
| Claude Opus 4.7 | 1,000,000 (1M) | 128,000 (sync Messages) | Anthropic models overview |
| Claude Sonnet 4.6 | 1,000,000 (1M) | 64,000 | same |
| Claude Haiku 4.5 | 200,000 | 64,000 | same |
| GPT-5.4 (API) | 1M long-context; see guide for 272K vs 1M behavior | up to 128,000 (see model page) | Latest model, GPT-5.4 |
| Gemini 3.1 Pro (Preview) | 1,000,000 input (1M) | on the order of 64k output (varies) | Google AI dev docs, DeepMind |
| Llama 4 Maverick (open) | 1,000,000 (1M) in model card | server-dependent | Meta model card |
| Llama 4 Scout (open) | 10,000,000 (10M) in model card | server-dependent | same |

Caveats: a large on-paper window does not force you to fill it; latency and cost usually grow with effective length. Multimodal inputs (images, video) use additional or separate budgets on many APIs. Hosted marketplaces can impose tighter caps than the base model card.


Why “million-token” class models are expensive to run

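Standard transformer attention compares every token with every other token, so its cost grows roughly quadratically with sequence length, and the cached keys and values grow with every token kept in context. Providers therefore invest in kernels, sparsity, chunking, and user-facing compaction to make 1M-class products usable. That engineering, not only marketing, is a major reason for long-context tiers and pricing rules.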
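To get a feel for the scaling, here is a back-of-the-envelope sketch. It assumes a naive implementation that materializes the full score matrix (one value per token pair) in 16-bit precision, for a single attention head in a single layer. Production kernels are built specifically to avoid storing this matrix, so treat the numbers as an illustration of why that work is necessary, not as what any provider actually allocates.

```python
def naive_attention_score_bytes(seq_len: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to store a full seq_len x seq_len score matrix.

    Assumes one head, one layer, and 2 bytes per value (16-bit floats).
    """
    return seq_len * seq_len * bytes_per_value


for n in (8_000, 128_000, 1_000_000):
    gib = naive_attention_score_bytes(n) / 2**30
    print(f"{n:>9,} tokens -> {gib:,.1f} GiB per head per layer")
```

Doubling the sequence length quadruples the figure, and at one million tokens the naive matrix would need about 2 TB for one head in one layer, which is why long-context serving depends on techniques that never hold it in memory.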


Working inside a finite window

  • RAG and tools: pass pointers and retrieved chunks instead of whole corpora in every turn when possible (MCP patterns help).
  • Vendor compaction: Anthropic's context window documentation and OpenAI's GPT-5.4 guidance both emphasize managing state rather than keeping unbounded raw history.
  • Shorter high-signal output: Caveman / brevity in agents reduces how fast the next turn’s input balloons.
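One simple way to manage state on the client side is to keep only the most recent turns that fit an input budget. The sketch below is an illustration under stated assumptions: `estimate_tokens` is a crude stand-in (roughly four characters per token) for a real tokenizer, and `trim_history` is a hypothetical helper, not part of any vendor SDK.

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate (assumption: about 4 characters per token).

    Use the provider's tokenizer or token-counting endpoint for real sizing.
    """
    return max(1, len(text) // 4)


def trim_history(system: str, turns: list[str], input_budget: int) -> list[str]:
    """Keep the newest turns that fit in input_budget after the system text."""
    remaining = input_budget - estimate_tokens(system)
    kept: list[str] = []
    for turn in reversed(turns):  # walk from newest to oldest
        cost = estimate_tokens(turn)
        if cost > remaining:
            break  # stop at the first turn that does not fit
        kept.append(turn)
        remaining -= cost
    kept.reverse()  # restore chronological order
    return kept


history = ["x" * 400, "y" * 200, "z" * 80]  # about 100, 50, and 20 tokens
kept = trim_history(system="a" * 40, turns=history, input_budget=100)
print(len(kept))  # 2: the oldest turn is dropped
```

Dropping whole turns from the oldest end is the bluntest option; summarizing older turns or replacing them with retrieved snippets preserves more signal per token, at the cost of extra calls.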


Vendor numbers change; re-check the model and pricing page for the exact route you use.
