Every message exchange in an agentic session adds tokens to the context window. A user question, an assistant response, a tool call, a tool output — all of it accumulates. After 20 turns, the conversation history is often the largest single component in the context window, consuming budget that could go to retrieved documents, tool definitions, or the current task.
The problem isn't just token cost. As history grows, critical early instructions (the task definition and hard constraints stated in early turns) end up far from the most recent tokens, often in the middle of the context window, where models tend to attend least reliably. The model starts drifting from its original instructions, following recent context more than early setup. Conversation history management is context engineering for the temporal dimension of agent sessions.
Why history accumulates so fast in agentic systems
Single-turn chatbots don't have this problem. Each interaction is isolated; the context window resets between calls.
Agentic systems are different. An agent working on a software debugging task might:
- Receive the initial task description (300 tokens)
- Call `read_file` and get the file contents (2,000 tokens in tool output)
- Call `run_tests` and get the test results (500 tokens)
- Propose a fix and wait for user confirmation (200 tokens)
- Call `edit_file` to apply the fix (300 tokens)
- Call `run_tests` again to verify (500 tokens)
- Summarize the fix and close the task (200 tokens)
After 7 turns and 4 tool calls, the history contains ~4,000 tokens — and this is a simple debugging task. Complex multi-step tasks can easily accumulate 20,000-50,000 tokens of history across 40+ turns, plus the tool outputs from each step.
Without management, this hits context limits (typically 32k-200k tokens depending on model), degrades quality, and drives up cost per turn linearly as the session extends.
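To make the arithmetic concrete, here is a minimal sketch that replays the debugging example above. It assumes, purely for illustration, that every step triggers a model call that re-reads the full history so far; `step_tokens` and `cumulative_history` are hypothetical names, not part of any library.

```python
from itertools import accumulate

# Token counts for the seven steps of the debugging example above
step_tokens = [300, 2000, 500, 200, 300, 500, 200]

def cumulative_history(tokens_per_step: list[int]) -> list[int]:
    """History size after each step, assuming nothing is ever dropped."""
    return list(accumulate(tokens_per_step))

history_sizes = cumulative_history(step_tokens)

# Assumption: each step re-sends the whole history as model input
total_input_tokens = sum(history_sizes)

print(history_sizes[-1])    # 4000 tokens of history after the last step
print(total_input_tokens)   # 19500 input tokens processed across the session
```

The history itself grows linearly, but the tokens processed over the whole session grow roughly quadratically, because each new call re-reads everything that came before it.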
The four history management strategies
Strategy 1: Full retention
Keep all conversation turns in the context window until the limit is hit, then either truncate (drop the oldest turns) or error out.
When to use: Short sessions (< 10 turns), or systems where complete history is genuinely critical for correctness (e.g., a negotiation assistant that needs verbatim memory of all commitments).
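Token profile: Linear growth in history size. Every call re-sends the whole history, so turn 30 processes far more input tokens than turn 1, and the total tokens processed over a session grow roughly quadratically.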
Failure mode: At some point, the context window fills. Naive truncation (dropping oldest turns) loses the system prompt and early task definition — exactly the highest-value content. You need either a hard session length limit or a fallback strategy.
Implementation:
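
    CONTEXT_LIMIT = 128_000  # assumed context size in tokens; set this for your model
    messages = []

    def estimate_tokens(msgs: list) -> int:
        # Rough heuristic (~4 characters per token); use a real tokenizer in production
        return sum(len(m["content"]) for m in msgs) // 4

    def add_turn(role: str, content: str) -> None:
        messages.append({"role": role, "content": content})
        # Check if we're approaching the limit
        if estimate_tokens(messages) > 0.85 * CONTEXT_LIMIT:
            # Trigger a different strategy here (placeholder, defined elsewhere)
            apply_summarization_or_pruning(messages)
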
Full retention is a reasonable starting point for low-turn systems. Treat it as the default that you replace with a smarter strategy once session lengths grow.
Strategy 2: Sliding window
Keep only the N most recent turns. Older turns are dropped entirely.
When to use: Sessions where recency matters more than total history (customer service conversations, interactive coding sessions where each step supersedes the last).
Token profile: Stable. After the window fills, adding a new turn drops the oldest turn; total history tokens remain roughly constant.
Failure mode: Critical context from early turns gets lost. If the user specified an important constraint in turn 2 and you're now at turn 30, that constraint has dropped out of the window. The model has no way to know it existed.
Mitigation: Never let the system prompt or task definition slide out of the window. Many implementations keep the system prompt pinned at the front and apply the sliding window only to the user/assistant exchange:
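
    system_content = "You are a debugging assistant."  # placeholder system prompt
    SYSTEM_PROMPT = [{"role": "system", "content": system_content}]
    MAX_HISTORY_TURNS = 10

    def build_context(history: list, new_message: str) -> list:
        # `history` holds only user/assistant/tool messages, never the system prompt
        recent_history = history[-MAX_HISTORY_TURNS:]
        # Pin the system prompt at the front, ahead of the windowed history
        return SYSTEM_PROMPT + recent_history + [{"role": "user", "content": new_message}]
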
The window size N is a token budget decision, not an arbitrary number. Estimate the average tokens per turn in your system and set N so the history block stays within your history budget (typically 30-40% of the total context budget, leaving room for retrieval and the current task).
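As a rough illustration of that sizing rule, the sketch below derives N from a history budget. The function name `choose_window_size` and the example numbers (a 32k context, a 35% history share, 600 tokens per turn) are assumptions for illustration; measure the average turn size in your own system.

```python
def choose_window_size(context_limit: int, history_fraction: float,
                       avg_tokens_per_turn: int) -> int:
    """Largest N such that N average-sized turns fit in the history budget."""
    if avg_tokens_per_turn <= 0:
        raise ValueError("avg_tokens_per_turn must be positive")
    history_budget = int(context_limit * history_fraction)
    # Always keep at least one turn, even if the budget is tiny
    return max(1, history_budget // avg_tokens_per_turn)

n = choose_window_size(context_limit=32_000, history_fraction=0.35,
                       avg_tokens_per_turn=600)
print(n)  # 18
```

Because this works from an average, a few very long tool outputs can still overflow the budget; treat the result as a starting value rather than a guarantee.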
Strategy 3: Summarization
Periodically convert old conversation turns into a compact summary, then drop the original turns. The summary replaces the dropped turns in the context.
When to use: Long sessions where you need to preserve semantic content from early turns without paying full token cost: research assistants, coding agents working on large tasks, and any session expected to exceed 20 turns.
Token profile: Stepped. Grows until a summarization threshold, then drops back down to summary size + recent turns, then grows again until the next summarization.
Failure mode: The summarization model (usually the same LLM you're using for the task, in a separate call) loses nuance and specificity. A verbatim tool output showing a stack trace becomes "there was an error in the authentication module" — semantically correct but losing the specific line number and error type that might matter later.
Implementation pattern:

    SUMMARIZE_THRESHOLD = 15  # turns
    KEEP_RECENT = 5           # turns to keep verbatim
    SUMMARY_TAG = "[CONVERSATION SUMMARY"

    def maybe_summarize(messages: list, system_prompt: str) -> list:
        # Exclude the pinned system prompt, but keep any earlier summary (also stored
        # with role "system") so it is folded into the new summary instead of lost
        non_system = [
            m for m in messages
            if m["role"] != "system" or m["content"].startswith(SUMMARY_TAG)
        ]
        if len(non_system) < SUMMARIZE_THRESHOLD:
            return messages

        # Keep recent turns verbatim
        to_summarize = non_system[:-KEEP_RECENT]
        recent = non_system[-KEEP_RECENT:]

        # Generate summary; call_llm is a placeholder for your own model client
        summary_text = call_llm(
            system="Summarize the following conversation, capturing all decisions made, "
                   "constraints established, tool outputs that revealed important information, "
                   "and the current state of the task. Be specific and preserve numbers, "
                   "names, and error messages verbatim.",
            messages=to_summarize,
        )

        summary_message = {
            "role": "system",
            "content": f"{SUMMARY_TAG} - replaces {len(to_summarize)} earlier turns]\n{summary_text}",
        }
        return [{"role": "system", "content": system_prompt}] + [summary_message] + recent

How the summary is injected matters: label it explicitly as a summary so the model treats it as a compressed record of earlier context, not as a new instruction.
What to preserve in a summary
Not all history content is equally valuable to preserve. Prioritize:
- Decisions made — user confirmed approach X, model proposed Y and user accepted, constraint Z was established
- Numerical specifics — error codes, line numbers, counts, amounts (don't summarize "there was an error" — preserve the error)
- Tool outputs that changed the task state — file contents read, test results that revealed a bug, API responses with specific data
- Corrections — if the user corrected the model on something, that correction needs to survive the summary
De-prioritize:
- Transitional acknowledgments ("Got it, let me look at that")
- Unsuccessful attempts that led to the current approach ("I tried X but it failed, now trying Y" → just record the current approach)
- Tool calls that produced empty or error results (unless the error was informative)
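One way to keep those priorities from getting lost is to have the summarizer fill a fixed structure instead of free text. The sketch below is an assumption about how that could look, not a prescribed format: `SessionSummary` and its field names are hypothetical, and the example values are made up for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class SessionSummary:
    """Container mirroring the 'prioritize' list above."""
    decisions: list[str] = field(default_factory=list)
    specifics: list[str] = field(default_factory=list)      # error codes, line numbers, amounts
    tool_findings: list[str] = field(default_factory=list)  # outputs that changed task state
    corrections: list[str] = field(default_factory=list)    # user corrections to the model

    def render(self) -> str:
        sections = [
            ("Decisions", self.decisions),
            ("Specifics (verbatim)", self.specifics),
            ("Tool findings", self.tool_findings),
            ("Corrections", self.corrections),
        ]
        lines = []
        for title, items in sections:
            if items:  # skip empty sections to save tokens
                lines.append(f"{title}:")
                lines.extend(f"- {item}" for item in items)
        return "\n".join(lines)

# Hypothetical example values
example = SessionSummary(
    decisions=["User approved approach X"],
    specifics=["KeyError: 'exp' at line 88"],
)
print(example.render())
```

Transitional acknowledgments and abandoned attempts simply have no field to land in, which makes the de-prioritized content easy to leave out.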
Strategy 4: Selective pruning
Score each conversation turn by its current relevance to the active task. Drop low-relevance turns; keep high-relevance turns regardless of recency.
When to use: Complex multi-phase tasks where different turns are important at different phases. In a coding task with requirements-gathering, design, and implementation phases, for example, the design turns need to stay in context while implementation is in progress.
Token profile: Variable. Drops when a phase completes and its turns become less relevant. Grows during phases with many relevant outputs.
Failure mode: Scoring complexity. Getting relevance scoring right requires task-aware logic — a general-purpose relevance scorer that doesn't understand your task structure will prune the wrong things.
Implementation sketch:

    # contains_constraint, is_data_still_referenced, and is_transitional are
    # task-specific helpers that you supply.
    def score_turn_relevance(turn: dict, current_task_state: dict) -> float:
        """
        Score a turn's relevance to the current task.
        Returns 0.0 (irrelevant, safe to prune) to 1.0 (critical, never prune).
        """
        content = turn.get("content") or ""  # tool-call turns may carry no text
        # System prompt: never prune
        if turn["role"] == "system":
            return 1.0
        # Turns containing user-specified constraints: always keep
        if contains_constraint(content):
            return 1.0
        # Tool outputs that produced data still in use: keep
        if turn["role"] == "tool" and is_data_still_referenced(turn, current_task_state):
            return 0.9
        # Acknowledgment turns without substantive content: prune
        if is_transitional(content):
            return 0.1
        # Turns from completed phases: medium relevance
        # (.get avoids a KeyError on turns that were never tagged with a phase)
        if turn.get("phase") != current_task_state["phase"]:
            return 0.4
        return 0.7  # default: keep turns from the current phase

    PRUNE_THRESHOLD = 0.3

    def build_pruned_context(messages: list, task_state: dict) -> list:
        return [m for m in messages if score_turn_relevance(m, task_state) > PRUNE_THRESHOLD]

Selective pruning is the most powerful strategy but requires the most investment. It's appropriate for production systems with well-defined task structures where the simpler strategies (sliding window, summarization) result in measurable quality losses.
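The helper predicates in the scoring sketch carry most of the task-specific logic. As a starting point only, here is a deliberately naive, keyword-based version of `is_transitional` and `contains_constraint`. The phrase lists and the `max_words` cutoff are assumptions; a production system would more likely use a classifier or explicit task-state tracking.

```python
import re

# Hypothetical phrase lists; tune them against real transcripts
TRANSITIONAL_PATTERN = re.compile(
    r"^(got it|ok(ay)?|sure|thanks|thank you|let me (look|check)|on it)\b",
    re.IGNORECASE,
)
CONSTRAINT_PATTERN = re.compile(
    r"\b(must( not)?|never|always|do not|don't|at most|at least|no more than)\b",
    re.IGNORECASE,
)

def is_transitional(content: str, max_words: int = 12) -> bool:
    """True for a short message that opens with an acknowledgment phrase."""
    text = content.strip()
    return len(text.split()) <= max_words and bool(TRANSITIONAL_PATTERN.match(text))

def contains_constraint(content: str) -> bool:
    """True if the message contains constraint-like wording."""
    return bool(CONSTRAINT_PATTERN.search(content))
```

Keyword matching will both miss constraints phrased unusually and flag harmless sentences, so it is worth logging what gets pruned and reviewing a sample before trusting the scores.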
Hybrid strategies
The four strategies aren't mutually exclusive. Most production systems combine them:
Summarization + recency window: Summarize turns older than the recency window, keep recent turns verbatim. The most common production pattern.
Selective pruning + summarization: Score turns for relevance, keep high-relevance turns verbatim, summarize medium-relevance turns, drop low-relevance turns. More complex, but appropriate for long multi-phase tasks.
Dynamic window based on token budget: Instead of a fixed N-turn window, keep as many recent turns as fit within a target token budget (e.g., 40% of context). The window shrinks as turns get longer (verbose tool outputs) and expands as turns get shorter.
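The dynamic window can be sketched as a walk backwards from the newest turn, stopping when the budget is spent. This is a minimal illustration: `window_by_budget` and `rough_token_count` are hypothetical names, and the 4-characters-per-token estimate is a crude stand-in for your model's real tokenizer.

```python
def rough_token_count(text: str) -> int:
    """Crude estimate (~4 characters per token); swap in a real tokenizer."""
    return max(1, len(text) // 4)

def window_by_budget(history: list[dict], token_budget: int) -> list[dict]:
    """Keep the longest suffix of `history` whose estimated tokens fit the budget."""
    kept = []
    used = 0
    for message in reversed(history):   # walk from newest to oldest
        cost = rough_token_count(message["content"])
        if used + cost > token_budget:
            break                       # stop at the first turn that does not fit
        kept.append(message)
        used += cost
    kept.reverse()                      # restore chronological order
    return kept
```

Stopping at the first turn that does not fit keeps the window contiguous. Skipping over a large turn to include older, smaller ones would leave a gap in the conversation that the model cannot see.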
Multi-agent history management
In multi-agent systems, history management operates at two levels:
Sub-agent level: Each sub-agent maintains its own conversation history for its sub-task. This history is typically short — sub-agents are given focused tasks that complete in few turns. Apply full retention or a small sliding window.
Orchestrator level: The orchestrator doesn't see sub-agent history verbatim. It receives structured outputs or summaries from sub-agents. The orchestrator's "history" is a record of tasks assigned, sub-agents spawned, and outcomes received — not the internal dialogue of each sub-agent.
This architecture is important: if sub-agent history flowed into the orchestrator verbatim, the orchestrator context would fill rapidly with implementation details it doesn't need. The sub-agent history management is an implementation concern; the orchestrator's context stays focused on coordination.
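A minimal sketch of that boundary, under the assumption that each sub-agent returns one compact result object: the class and field names (`SubAgentResult`, `OrchestratorLog`, `context_block`) are hypothetical, chosen only to show what does and does not cross into the orchestrator's context.

```python
from dataclasses import dataclass, field

@dataclass
class SubAgentResult:
    task: str      # what the sub-agent was asked to do
    status: str    # e.g. "success" or "failed"
    summary: str   # short outcome written by the sub-agent

@dataclass
class OrchestratorLog:
    results: list[SubAgentResult] = field(default_factory=list)

    def record(self, result: SubAgentResult) -> None:
        # Only the compact result is stored; the sub-agent's own
        # message history never reaches the orchestrator.
        self.results.append(result)

    def context_block(self) -> str:
        """Render the coordination record for the orchestrator's prompt."""
        return "\n".join(f"[{r.status}] {r.task}: {r.summary}" for r in self.results)

log = OrchestratorLog()
log.record(SubAgentResult("run test suite", "failed", "2 of 40 tests fail in auth module"))
log.record(SubAgentResult("patch auth module", "success", "null check added; tests pass"))
print(log.context_block())
```

Each sub-task costs the orchestrator one line of context, regardless of how many turns the sub-agent needed internally.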
Measuring history management effectiveness
Three metrics indicate whether your history management is working:
Cost per task completion. Track total tokens (or spend) per completed task alongside cost per turn. If cost per turn rises linearly without history management and stays roughly flat with summarization, summarization is working. Track this for your specific session length distribution.
Quality drift. Run the same benchmark task with fresh context vs. context that includes 20+ turns of accumulated history. If quality degrades substantially with history, your management strategy isn't preserving the right content.
Repair turns. Count how often the model asks clarifying questions or reverses earlier decisions mid-session. This signals that important context from early turns has been lost. If repair turns increase as sessions lengthen, your history management is losing load-bearing context.
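The repair-turn metric is straightforward to compute once sessions are labeled. The sketch below assumes each session record already carries a turn count and a repair-turn count; how a repair turn is detected (manual labels, a classifier, keyword rules) is left open, and `repair_rate_by_length` is a hypothetical name.

```python
from collections import defaultdict

def repair_rate_by_length(sessions: list[dict], bucket_size: int = 10) -> dict[int, float]:
    """Repair turns per turn, grouped by session-length bucket.

    Each session is a dict with "turns" (int) and "repair_turns" (int).
    Keys of the result are bucket lower bounds: 0, 10, 20, ...
    """
    turns = defaultdict(int)
    repairs = defaultdict(int)
    for s in sessions:
        bucket = (s["turns"] // bucket_size) * bucket_size
        turns[bucket] += s["turns"]
        repairs[bucket] += s["repair_turns"]
    return {b: repairs[b] / turns[b] for b in sorted(turns) if turns[b] > 0}
```

If the rate climbs from one bucket to the next, longer sessions are losing context that shorter ones retain, which is the signal described above.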
Summary: history management decision guide
| Session length | Recommended strategy |
|---|---|
| < 5 turns | Full retention |
| 5-15 turns | Sliding window (pin system prompt) |
| 15-30 turns | Summarization with 5-turn recency window |
| 30+ turns | Summarization + selective pruning |
| Multi-agent, sub-tasks | Per-sub-agent sliding window; orchestrator receives summaries |
Start with the simplest strategy that works for your expected session length. Add complexity only when you can measure a quality or cost problem that the simpler strategy doesn't solve. The overhead of selective pruning is only worth it if you have data showing that summarization or sliding windows are losing decision-critical content.
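For teams that want the guide as a default in code, it can be written as a small lookup. This is only a sketch: the table's ranges share their boundary values (15 and 30 each appear in two rows), so the function below assumes a boundary belongs to the longer-session row, and `recommend_strategy` is a hypothetical name.

```python
def recommend_strategy(expected_turns: int) -> str:
    """Map an expected session length to the guide's starting recommendation."""
    if expected_turns < 5:
        return "full retention"
    if expected_turns < 15:
        return "sliding window (pin system prompt)"
    if expected_turns < 30:
        return "summarization with 5-turn recency window"
    return "summarization + selective pruning"

print(recommend_strategy(12))  # sliding window (pin system prompt)
```

Treat the output as the starting strategy only; the measurements from the previous section decide whether to move up a level.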
One more thing: history is not only about tokens
The instinct with history management is to optimize for tokens — fewer tokens, lower cost, less risk of hitting context limits. But the deeper reason to manage history is attention quality. A well-pruned 8,000-token history where every turn is load-bearing is not just cheaper than a 30,000-token history with redundant content — it's higher quality. The model allocates more relative attention to each turn when the history is tight and relevant. History management is as much about signal-to-noise ratio as it is about token cost.
History is context. Manage it deliberately.