Your agent's context window is filling up. The conversation started fine: crisp responses, accurate tool selection, instructions followed perfectly. Ten turns in, something shifts. The agent starts ignoring instructions from earlier in the conversation. It picks the wrong tool for a task it handled correctly five turns ago. It produces shorter, less detailed responses. These are the symptoms of context overflow, and they are the most common failure mode in long-running agent sessions.
The context window is everything the model can see in a single turn: the system prompt, the full conversation history, all tool results, any injected context. It has a hard size limit measured in tokens. Once the conversation approaches that limit, the model starts losing information, not gracefully, but in unpredictable ways. The fix is not a bigger context window (though that helps temporarily). The fix is a strategy for deciding what stays, what gets compressed, and what gets dropped.
We will build that strategy in three steps: understand the hierarchy of forgetting, implement a compaction pipeline that applies cheap interventions before expensive ones, and build a budget tracker that triggers compaction at the right time.
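The examples below all lean on a `count_tokens` helper. Production code should use the model's actual tokenizer, but for a self-contained sketch, a rough heuristic (about four characters per token for English text) is close enough to drive compaction decisions:

```python
def count_tokens(messages: list[dict]) -> int:
    """Approximate token usage of a message list.

    Heuristic stand-in (~4 chars/token for English); swap in the
    model's real tokenizer for production accuracy.
    """
    total_chars = sum(len(str(m.get("content", ""))) for m in messages)
    return total_chars // 4
```

Because compaction thresholds are percentages of the window, a consistent approximation is usually sufficient; the exact count matters less than the trend.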
Recognize the Problem
Context overflow does not produce an error message. It produces degraded behavior that looks like the model getting dumber. Here are the specific symptoms and what causes each:
The agent ignores system prompt instructions. As the message list grows, the system prompt (which sits at the very beginning) gets pushed further from the model's attention. Instructions that worked in turn 1 stop working in turn 15, even though nothing about the prompt changed.
Tool selection degrades. The model starts calling the wrong tool or calling tools it does not need. This happens because tool schemas compete with conversation history for attention. When the history is large, the model allocates less attention to the tool descriptions.
Responses get shorter and less detailed. The model produces less output when it is processing more input. This is not a bug. It is a consequence of fixed compute budgets. More input tokens means fewer resources available for output generation.
The model "forgets" things it said three turns ago. Information from early in the conversation drops out of the model's effective attention, even if it is technically still in the context window. Long context does not mean equally-attended context.
If you see any of these symptoms in a long-running session, context management is the fix.
The Hierarchy of Forgetting
Not all information in the context window is equally valuable. The hierarchy of forgetting is a framework for deciding what to keep and what to discard, organized from highest fidelity to lowest:
1. In-context (message list). Perfect fidelity. The full conversation history, tool results, and system prompt. This is what the model sees. It grows every turn.
2. Summary (compressed digest). LLM-generated condensation of older conversation segments. Loses exact phrasing and sequential detail. Saves significant space.
3. Long-term storage (fact files). Structured facts persisted between sessions. User preferences, project decisions, explicit corrections. Survives session end.
4. Forgotten. Information that was in-context but discarded without preservation. Zero cost, zero fidelity.
The key insight: different categories of information belong at different levels proactively, not as a fallback. Ephemeral tool results (the contents of a file the agent read for a one-off check) belong at level 4. Drop them early and aggressively. User corrections and explicit preferences belong at level 3, so extract them to storage before they get compressed away. The current working context (the plan the agent is executing, the files it is actively editing) belongs at level 1.
Matching information to its appropriate level is the core skill. Do not wait until the context window is full to start thinking about this.
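One way to make that matching concrete is a small classifier that routes each message to a target level. The rules below are illustrative assumptions, not canonical (real agents would use richer signals than string matching):

```python
from enum import IntEnum

class Level(IntEnum):
    IN_CONTEXT = 1   # keep verbatim in the message list
    SUMMARY = 2      # fold into a compressed digest
    LONG_TERM = 3    # extract to persistent fact storage
    FORGOTTEN = 4    # drop without preservation

def target_level(message: dict) -> Level:
    """Hypothetical routing rules: bulk tool output is ephemeral,
    user preferences deserve persistence, everything else stays."""
    role = message.get("role")
    text = str(message.get("content", ""))
    if role == "tool" and len(text) > 2000:
        return Level.FORGOTTEN   # ephemeral bulk tool output
    if role == "user" and ("always" in text or "never" in text):
        return Level.LONG_TERM   # likely a preference or correction
    return Level.IN_CONTEXT      # active working context stays put
```

Running this classifier proactively, before pressure builds, is what keeps compaction from destroying information that should have been extracted to storage.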
Implement Compaction
When context pressure builds, the instinct is to call the LLM and summarize the conversation. That instinct is expensive and usually wrong. Most context pressure is resolvable without any LLM calls at all.
The principle is cheap interventions first. The following pipeline applies three strategies in ascending cost order:
```
async function maybe_compact(messages: list, window_size: int) -> list:
    usage = count_tokens(messages)
    headroom = window_size * 0.15
    if usage < window_size - headroom:
        return messages  # no action needed, plenty of room

    # Strategy 1: trim oversized tool results (zero LLM cost)
    messages = trim_large_tool_results(messages, max_chars=5000)
    if count_tokens(messages) < window_size - headroom:
        return messages

    # Strategy 2: drop oldest messages (zero LLM cost)
    messages = drop_oldest_messages(messages, keep_recent=10)
    if count_tokens(messages) < window_size - headroom:
        return messages

    # Strategy 3: summarize older turns (one LLM call, expensive)
    split = len(messages) // 2
    summary = await llm.summarize(messages[:split])
    return [summary_message(summary)] + messages[split:]
```

Why this ordering matters: tool results are the most common cause of context bloat. A single verbose file read or search result can consume thousands of tokens while contributing nothing to the agent's working memory after the turn it was used. Trimming tool results to a size cap costs nothing and often frees enough space to avoid any further intervention.
Dropping old messages is next. The first ten turns of a long conversation are usually safe to drop once their content has been acted on. The agent already incorporated that information into its decisions, so the raw messages are redundant.
LLM-driven summarization is the last resort. It costs an API call, it takes time, and it loses information. Use it only when cheaper strategies are insufficient.
Tip: Compact tool results aggressively. They are often 10x larger than conversation turns. A single `search_files` result returning 200 matches can consume as many tokens as the previous 20 conversation turns combined.
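A minimal sketch of the `trim_large_tool_results` step from the pipeline, assuming messages are dicts with a `role` and a string `content` field (the exact message shape depends on your API):

```python
def trim_large_tool_results(messages: list[dict], max_chars: int = 5000) -> list[dict]:
    """Cap oversized tool results at max_chars, keeping the head of each
    and noting how much was cut. Zero LLM cost; originals are not mutated."""
    trimmed = []
    for m in messages:
        content = m.get("content", "")
        if (m.get("role") == "tool"
                and isinstance(content, str)
                and len(content) > max_chars):
            cut = len(content) - max_chars
            m = {**m, "content": content[:max_chars] + f"\n[... {cut} chars trimmed]"}
        trimmed.append(m)
    return trimmed
```

Keeping the head of the result (rather than the tail or a random slice) works because most tool output front-loads the relevant material; adjust if your tools behave differently.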
Build a Budget Tracker
Compaction is reactive: it fires when the window is nearly full. A budget tracker is proactive: it monitors token usage continuously and triggers compaction at the right threshold, before the model starts degrading.
The following implements a budget tracker that monitors usage and fires compaction automatically:
```
class ContextBudgetTracker:
    window_size: int
    compact_threshold: float = 0.80   # compact at 80% usage
    critical_threshold: float = 0.95  # emergency at 95%
    consecutive_failures: int = 0
    max_failures: int = 3

    function check(self, messages: list) -> CompactionAction:
        usage = count_tokens(messages)
        ratio = usage / self.window_size
        if ratio < self.compact_threshold:
            return NO_ACTION
        if self.consecutive_failures >= self.max_failures:
            return NO_ACTION  # circuit breaker: stop retrying
        if ratio >= self.critical_threshold:
            return EMERGENCY_COMPACT  # drop aggressively, skip LLM summary
        return STANDARD_COMPACT

    function record_success(self):
        self.consecutive_failures = 0

    function record_failure(self):
        self.consecutive_failures += 1
```

Two design decisions in this tracker deserve explanation:
Two thresholds, not one. The standard threshold (80%) gives the compaction pipeline room to work. It can try cheap strategies first because there is still headroom. The critical threshold (95%) triggers emergency compaction that skips the LLM summarization step and goes straight to dropping messages. At 95%, there is no time for an expensive API call.
A circuit breaker. If compaction fails three times in a row (because the context is irrecoverably over limit, or perhaps a single tool result exceeds the entire window), stop retrying. Without this guard, every subsequent turn triggers a doomed compaction attempt that burns an API call and accomplishes nothing.
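To see both decisions in action, here is a runnable Python rendering of the tracker. The action constants are hypothetical names, and `check` takes a token count directly rather than a message list, purely to keep the sketch self-contained:

```python
from dataclasses import dataclass

NO_ACTION, STANDARD_COMPACT, EMERGENCY_COMPACT = "none", "standard", "emergency"

@dataclass
class ContextBudgetTracker:
    window_size: int
    compact_threshold: float = 0.80   # compact at 80% usage
    critical_threshold: float = 0.95  # emergency at 95%
    consecutive_failures: int = 0
    max_failures: int = 3

    def check(self, used_tokens: int) -> str:
        ratio = used_tokens / self.window_size
        if ratio < self.compact_threshold:
            return NO_ACTION
        if self.consecutive_failures >= self.max_failures:
            return NO_ACTION  # circuit breaker: stop retrying
        if ratio >= self.critical_threshold:
            return EMERGENCY_COMPACT  # skip the LLM summary, just drop
        return STANDARD_COMPACT

    def record_success(self):
        self.consecutive_failures = 0

    def record_failure(self):
        self.consecutive_failures += 1
```

At 50% usage nothing fires, at 85% the standard pipeline runs with headroom to spare, and at 97% the tracker demands an emergency drop; after three recorded failures, every subsequent check returns no action until a success resets the breaker.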
Wire It Into the Agent Loop
The budget tracker integrates into the agent loop as a check at the start of each turn, before the LLM call:
```
async function agent_loop(question: str, max_turns: int = 20) -> str:
    messages = [system_message(prompt), user_message(question)]
    budget = ContextBudgetTracker(window_size=128_000)

    for turn in range(max_turns):
        # Check budget before every LLM call
        action = budget.check(messages)
        if action == STANDARD_COMPACT:
            messages = await maybe_compact(messages, budget.window_size)
            budget.record_success()
        elif action == EMERGENCY_COMPACT:
            messages = emergency_compact(messages, keep_recent=5)
            budget.record_success()

        response = await llm.call(messages)
        if response.tool_calls is empty:
            return response.text

        messages.append(response)
        for call in response.tool_calls:
            result = await dispatch_tool(call.name, call.args)
            messages.append(tool_result(call.id, result))

    raise RuntimeError("agent exceeded max_turns")
```

The check runs before every LLM call, not after. This ensures the model always receives a context-managed message list, even if the previous turn produced a massive tool result.
What to Compact First
When you need to reclaim space, apply this priority (drop the least valuable first):
1. Old tool results, especially large ones. The agent already used them. Trim to a summary or drop entirely.
2. System cache entries. Cached context that can be re-fetched if needed later.
3. Old conversation turns. The first few turns of a long session. The agent has already incorporated their content into later decisions.
4. Compacted summaries. If you have nested summaries (a summary of a summary), the older one can go.
5. Recent conversation turns. Drop these only as a last resort. They are the agent's active working memory.
Never drop the system prompt. Never split a tool-call/tool-result pair. The API rejects conversations with a tool result that has no matching tool call.
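Both rules can be enforced mechanically. The sketch below drops old turns while pinning the system prompt and walking the cut point backwards past any orphaned tool results, under the assumption that each assistant message issuing tool calls is immediately followed by its results:

```python
def drop_oldest_messages(messages: list[dict], keep_recent: int = 10) -> list[dict]:
    """Drop old turns, always keeping the system prompt and never
    separating a tool result from the call that produced it."""
    system = [m for m in messages if m.get("role") == "system"]
    rest = [m for m in messages if m.get("role") != "system"]
    kept = rest[-keep_recent:]
    # If the cut landed on orphaned tool results, extend the kept slice
    # backwards until it starts on a clean boundary (the calling message).
    while kept and len(kept) < len(rest) and kept[0].get("role") == "tool":
        kept.insert(0, rest[-len(kept) - 1])
    return system + kept
```

Note that the slice can grow slightly past `keep_recent` to swallow a whole tool-call/tool-result group; that is the point. A conversation that is a few messages over budget is recoverable, but one with a dangling tool result is rejected outright.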
Related
- Memory and Context. The full memory architecture: the hierarchy of forgetting, the compaction pipeline internals, fact extraction to long-term storage, and circuit breakers for failed compaction.
- Agent Loop. The loop that grows the message list every turn: why context pressure builds, and where compaction integrates.
- Prompt Architecture. How the system prompt is structured for cache efficiency, and why prompt design affects context budget.