What is a context window? LLM 'working memory' and a 2026 snapshot of top models
The context window is how many tokens a model can condition on in one request—input plus the budget reserved for a reply. Here is a plain definition, how it differs from parameter count, and a comparison table for flagship 2026 models (GPT-5.4, Claude 4.7 family, Gemini 3.1 Pro, Meta Llama 4) with links to the canonical docs.
The context window (context length) is the token budget a single request can use: the full prompt you send (system text, user message, prior turns, tool definitions, retrievals, tool results) plus the space allowed for the new completion, up to a max output cap. It is a limit on how much the model can attend to at once, not the same as parameter count or how usage is metered in tokens—though all three appear on the same pricing pages.
If the request does not fit, the system may return an error, truncate older content, or compress it—behavior depends on the client and API.
Context window vs training data vs “memory”
Three ideas often collide in vendor marketing:
Concept
Meaning
Context window
Max tokens one forward pass can attend to right now
Training data
Corpus used during pre-training—not fully visible in any single prompt
Product “memory”
Vendor features (projects, compaction, LLM wikis) that persist facts across sessions
A model “knowing Python” is weights, not context. Pasting your entire repo into chat is context. Confusing the two leads to under-pasting (missing files) or over-pasting (blowing the window on turn three).
Estimating tokens before you send
Rough planning (English prose):
~4 characters per token for ballpark math on technical text.
Code often tokenizes denser than prose—measure with your provider’s tokenizer when billing matters.
Tool definitions in agent prompts consume window every turn unless cached—budget them explicitly.
When in doubt, log usage.prompt_tokens from API responses for a week and build an internal table of “typical task sizes.”
Agent builders: context budget worksheet
Before shipping an agent, fill in:
Bucket
Your estimate
Notes
System + developer instructions
Include tool JSON schemas
Retrieved chunks (RAG)
Top-k × chunk size
Conversation history
Turns × avg turn size
Tool results (last turn)
Often underestimated
Headroom for output
Max completion tokens
If the sum approaches the model window, plan compaction or external memory before launch—not after users hit truncation errors.
FAQ recap (body)
Is context the same as memory? No—memory products persist facts; context is per-request unless you re-inject stored state.
Does a bigger window replace RAG? Rarely. Retrieval scales to corpora larger than any window and keeps citations fresh.
What breaks first on long threads? Usually cost and latency, not the theoretical token cap—monitor both when enabling 1M-class models.
Treat context window as a hard budget in agent design reviews—the same way you cap cloud spend. Teams that skip this step discover truncation only when a long-running support thread silently drops the system prompt.
Long-context pricing reality
Even when a model advertises 1M tokens, vendors may charge premium rates above 272k input tokens or route long requests to slower queues. Budget owners should model p95 prompt size × price per million × daily requests before enabling “paste entire repo” workflows in internal tools.
Bottom line: Context window is the ceiling on one request’s working set—design agents assuming you will hit that ceiling weekly, not once a year.
When comparing Claude, GPT-5.4, and Gemini 3.1 Pro for an agent project, build a spreadsheet with your real prompts—not vendor maxima—as the input column.
For coding agents, tool JSON and MCP server definitions often consume tens of thousands of tokens before the user speaks—budget them in the same worksheet as chat history (MCP guide).
Summary
Context window = max tokens one request can use (prompt + completion budget). It is not parameter count, not training data, not vendor marketing alone. Size agents with real prompts, tool overhead, and compaction plans before you promise “unlimited chat” to users.
Re-check vendor model pages when upgrading—context and max output limits change independently of model name.
Vendor numbers change; confirm on the model page for your API route.
How vendors talk about it
“1M context” (typical 2026 flagship claim) is usually a ceiling on what you can pass in for one call, while max new tokens in the response may still be smaller (for example 64k–128k on sync APIs for some front-tier models).
“Max output tokens” is the cap on the assistant portion of the response for that call.
Long-context products often have separate rate limits or price for very long sessions. OpenAI’s GPT-5.4 guide discusses 272k-scale thresholds for rate limits and pricing; re-read the currentpricing page before sizing a workload.
Tokens define both cost and context window size — this explainer covers both.
The table below is a 2026 documentation snapshot—always re-check the model page in your environment (direct API, Bedrock, Vertex, Foundry, etc.).
Snapshot: context and max output (per published docs)
Caveats: a large on-paper window does not force you to fill it; latency and cost usually grow with effective length. Multimodal inputs (images, video) use additional or separate budgets on many APIs. Hosted marketplaces can impose tighter caps than the base model card.
Why “million-token” class models are expensive to run
Standard transformerattention is heavy in memory for long sequences, so providers invest in kernels, sparsity, chunking, and user-facing compaction to make 1M-class products usable. That engineering is a major reason for long-contexttiers and pricing rules—not only marketing.
Worked example: budgeting one request
Suppose you call Claude Opus 4.7 with a 1M context window and 128k max output (sync API, per Anthropic docs):
Component
Tokens (example)
System prompt + tools
8,000
Retrieved RAG chunks (5 × 2k)
10,000
Prior conversation (20 turns)
120,000
User message
2,000
Subtotal input
140,000
Remaining for output
up to 128,000 (capped by max output, not “860k unused”)
You rarely “use the whole million.” Effective length drives latency and bill—see LLM tokens for pricing math.
Multi-turn trap: each turn re-sends prior messages (unless the client compacts). A 50-turn support thread can exceed a 200k window even if each message looked small in isolation.
RAG vs long context: decision guide
Situation
Prefer
Why
Static corpus (docs, wiki)
RAG + small window
Cheaper; fresher index without re-prompting everything
1M window on Opus/Sonnet but sync max output still capped—read Messages API limits
OpenAI
GPT-5.4 documents 272k vs 1M behavior tiers for rate limits and pricing
Google
Gemini preview models change token accounting for multimodal inputs
Meta (Llama 4)
Model card lists 10M for Scout—your host may impose a lower cap
Bedrock / Vertex
Marketplace wrappers often below vendor maximum
Always size workloads on the route you actually call, not the headline blog number.
Compaction vs truncation
When threads overflow, clients may truncate (drop oldest turns) or compact (summarize). Truncation silently loses facts; compaction loses nuance if summaries are poor. Prefer explicit compaction with stored summaries in an LLM wiki over blind tail dropping.
Vendor numbers change; re-check the model and pricing page for the exact route you use.
Summary
The context window caps how much text (and often multimodal input) a model can condition on in one request—distinct from parameter count and token billing. 2026 flagships advertise 1M-class windows, but max output, hosting route, and compaction behavior still bound real workflows.
Practical rule: prefer RAG + tools + compaction for unbounded corpora; use long context when you need cross-chunk reasoning on a single artifact and can afford latency and cost. Re-read vendor docs before sizing agent memory or support thread limits.