
Memory and Context

How agent systems manage what the LLM sees, covering compaction pipeline internals, fact extraction subsystem, long-term storage with a closed taxonomy, and production insights for running memory management at scale.

The model can only act on what it sees. Every piece of context (the conversation history, tool results, retrieved facts, system instructions) must fit inside a finite context window. Once the window fills, something has to give: you either compress the old content, move it to external storage, or discard it entirely. There is no fourth option.

This makes memory management one of the most consequential design decisions in an agentic system. Get it wrong and you either run out of context mid-task, spend a fortune on unnecessary LLM calls to summarize content, or lose information the agent needs to finish the job. The mental model that makes this tractable is the hierarchy of forgetting: four levels of memory, each trading fidelity for space. Understanding which level is right for which information, and in what order to apply the cheaper options first, is what separates robust agents from fragile ones.

The Hierarchy of Forgetting

Memory exists at four levels. Each level down the hierarchy trades fidelity for space:

  1. In-context (message list): Perfect fidelity. The full conversation history, tool results, and injected context, all of which the model can see this turn. Cost: tokens. Grows unboundedly if you let it.

  2. Summary (compressed digest): LLM-generated condensation of old conversation segments. Loses sequential detail and exact phrasing. Saves significant space. Cost: one LLM call to generate.

  3. Long-term storage (fact files): Structured facts persisted between sessions, including user preferences, project decisions, feedback, and references. Survives session end. Loses sequential context entirely because facts are extracted, not archived.

  4. Forgotten: Information that was in-context but discarded without preservation. Zero cost, zero fidelity.

In-context (message list) → Summary (compressed digest) → Long-term storage (fact files) → Forgotten

The important insight is that each level is a design choice, not a fallback. Developers tend to treat compaction as something that happens when things go wrong. The better framing: different categories of information belong at different levels proactively. Ephemeral tool results (the contents of a file you just read for a one-off check) belong at level 4. Drop them early. User corrections and explicit preferences belong at level 3. Extract them to long-term storage before they get compressed away. Active working context (the current subtask, the plan the agent is executing) belongs at level 1. Matching information to its appropriate level is the core skill.
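The proactive-routing idea can be sketched as a small lookup. The category names and the `MemoryLevel` enum below are illustrative, not a prescribed schema; the mapping just encodes the guidance above (active work stays in context, preferences go to long-term storage, ephemeral tool results are dropped).

```python
from enum import IntEnum

class MemoryLevel(IntEnum):
    IN_CONTEXT = 1   # full fidelity, costs tokens every turn
    SUMMARY = 2      # compressed digest, loses sequential detail
    LONG_TERM = 3    # extracted facts, survives the session
    FORGOTTEN = 4    # discarded without preservation

# Hypothetical category names; the mapping mirrors the guidance above.
ROUTING = {
    "active_plan": MemoryLevel.IN_CONTEXT,           # current subtask, executing plan
    "old_conversation": MemoryLevel.SUMMARY,         # history already acted on
    "user_preference": MemoryLevel.LONG_TERM,        # corrections, explicit preferences
    "ephemeral_tool_result": MemoryLevel.FORGOTTEN,  # one-off file reads
}

def route(category: str) -> MemoryLevel:
    """Choose the memory level for a category, defaulting to in-context."""
    return ROUTING.get(category, MemoryLevel.IN_CONTEXT)
```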

The Compaction Pipeline

When context pressure builds, the instinct is to call the LLM and summarize the conversation. That instinct is expensive and usually wrong. Most context pressure is resolvable without any LLM calls at all.

Run interventions in cost order, cheapest first and most expensive last:

function maybe_compact(messages, context_window_size):
  usage = count_tokens(messages)
  headroom_needed = context_window_size * 0.15

  if usage < context_window_size - headroom_needed:
    return messages                              # no action needed

  # Level 1: trim large tool results (zero LLM cost)
  messages = trim_oversized_tool_results(messages)
  if tokens_ok(messages): return messages

  # Level 2: drop oldest messages (zero LLM cost)
  messages = drop_oldest_messages(messages)
  if tokens_ok(messages): return messages

  # Level 3: session memory compact (zero LLM cost if memory available)
  result = try_session_memory_compact(messages)
  if result: return result

  # Level 4: LLM-driven summarization (expensive, last resort)
  half = len(messages) // 2
  summary = llm.summarize(messages[:half])
  return [summary_message(summary)] + messages[half:]

Why this ordering matters: large tool results are the most common cause of context bloat, and trimming them costs nothing. A single verbose file read or search result can consume thousands of tokens while contributing nothing to the agent's working memory after the turn it was used. Dropping old messages is next because the first ten turns of a long conversation are usually safe to drop once their content has been acted on. Session memory compaction comes before full LLM summarization: if structured session memory is available and non-empty, it can serve as the summary with zero LLM cost. Only when all cheaper strategies are exhausted should you pay the cost of an LLM summarization call.
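The cheapest intervention, trimming oversized tool results, can be sketched as pure string slicing with no LLM involvement. The message shape (`"type"` and string `"content"` fields) and the 4,000-character cap are assumptions for illustration; adjust both to your message schema and budget.

```python
MAX_TOOL_RESULT_CHARS = 4000  # assumed cap; tune per deployment

def trim_oversized_tool_results(messages: list[dict]) -> list[dict]:
    """Replace the body of oversized tool results with a truncation marker.

    Zero LLM cost: pure string slicing. Assumes each message is a dict
    with a "type" field and a string "content"; shapes are illustrative.
    """
    trimmed = []
    for msg in messages:
        content = msg.get("content", "")
        if msg.get("type") == "tool_result" and len(content) > MAX_TOOL_RESULT_CHARS:
            msg = dict(msg)  # don't mutate the caller's history
            msg["content"] = (
                content[:MAX_TOOL_RESULT_CHARS]
                + f"\n[truncated: original result was {len(content)} chars]"
            )
        trimmed.append(msg)
    return trimmed
```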

The autocompact circuit breaker. When autocompaction fails, naive implementations retry on the next turn. If the context is irrecoverably over the limit (a single turn's tool results exceed the entire window), every retry attempt fails, burning API calls indefinitely. The fix is a consecutive failure counter:

type AutoCompactTrackingState = {
  compacted: bool
  turn_counter: int
  consecutive_failures: int    # reset to 0 on success
}

MAX_CONSECUTIVE_FAILURES = 3

function auto_compact_if_needed(messages, tracking):
  # Circuit breaker: stop retrying after N consecutive failures.
  # Without this, sessions where context is irrecoverably over limit
  # hammer the API with doomed compaction attempts on every turn.
  if tracking.consecutive_failures >= MAX_CONSECUTIVE_FAILURES:
    return {was_compacted: false}

  if not should_compact(messages):
    return {was_compacted: false}

  # Try session memory compact first (zero LLM cost)
  result = try_session_memory_compact(messages)
  if result:
    tracking.consecutive_failures = 0
    return {was_compacted: true, result: result}

  try:
    result = compact_conversation(messages)
    tracking.consecutive_failures = 0
    return {was_compacted: true, result: result}
  catch error:
    tracking.consecutive_failures += 1
    return {was_compacted: false}

Image stripping before compaction. Compaction sends the message history to an LLM to summarize it. If the history contains images or embedded documents, that compaction call can itself hit the prompt-too-long error, the very problem you were trying to solve. Strip images from message history before any compaction API call, replacing them with text markers ([image], [document]). This is especially common in sessions where users frequently attach screenshots or files.
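A minimal stripping pass might look like the following. It assumes each message's `"content"` is a list of typed blocks, loosely following common chat-API shapes; the field names are illustrative and should be adapted to your provider's schema.

```python
def strip_images(messages: list[dict]) -> list[dict]:
    """Replace image/document content blocks with text markers before
    the compaction API call, so compaction itself cannot overflow."""
    markers = {"image": "[image]", "document": "[document]"}
    stripped = []
    for msg in messages:
        blocks = msg.get("content")
        if not isinstance(blocks, list):
            stripped.append(msg)  # plain-string content: nothing to strip
            continue
        new_blocks = [
            {"type": "text", "text": markers[b["type"]]}
            if b.get("type") in markers else b
            for b in blocks
        ]
        stripped.append({**msg, "content": new_blocks})
    return stripped
```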

Tool-use/tool-result pair preservation. When choosing how many messages to keep after compaction, never split a tool-use/tool-result pair. If a kept message contains a tool result, its matching tool-use request must also be in the kept range. The API rejects conversations with dangling tool results (no matching tool-use). After choosing the compaction boundary, scan backwards to include any tool-use messages whose results are in the kept range.
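The backward scan can be sketched as a boundary adjustment loop. It assumes messages carry a `"type"` of `"tool_use"` or `"tool_result"` and a shared `"tool_id"` linking each pair; both field names are illustrative.

```python
def adjust_boundary_for_tool_pairs(messages: list[dict], boundary: int) -> int:
    """Walk the compaction boundary backwards until no kept tool_result
    is missing its matching tool_use."""
    while boundary > 0:
        kept = messages[boundary:]
        kept_use_ids = {m["tool_id"] for m in kept if m.get("type") == "tool_use"}
        dangling = [
            m for m in kept
            if m.get("type") == "tool_result" and m["tool_id"] not in kept_use_ids
        ]
        if not dangling:
            return boundary
        boundary -= 1  # pull one more message into the kept range
    return 0
```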

Long-Term Memory

In-context and summary memory are session-scoped. They disappear when the session ends. Long-term memory persists.

The way long-term memory degenerates is predictable: if you save everything, you save nothing. A memory store with no taxonomy becomes a junk drawer, where facts about user preferences sit next to transient error messages and one-time project notes. The signal-to-noise ratio collapses, and retrieval becomes unreliable.

A closed taxonomy fixes this. Define a finite set of memory types that the extraction process is allowed to create:

  • User: role, goals, responsibilities, knowledge. Always private, never shared across sessions or users.
  • Feedback: corrections AND confirmations. Save both, not just corrections. Recording only corrections causes drift toward over-caution because the model never learns what it got right.
  • Project: ongoing work, decisions, deadlines, mostly team scope. Convert relative dates to absolute on extraction ("next Tuesday" becomes the actual date), or they become meaningless later.
  • Reference: pointers to external systems such as dashboards, project trackers, Slack channels, docs.

The critical constraint: don't save anything that already has an authoritative source. If it's in the codebase, it's in the codebase. If it's in git history, it's in git history. Long-term memory is for things that have no other home: preferences, decisions, and feedback that exist only in conversation.
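One way to make the taxonomy genuinely closed is to encode it as an enum and reject anything else at extraction time. This is a sketch, not a prescribed schema; the enum values simply mirror the four types above.

```python
from enum import Enum

class MemoryType(Enum):
    """Closed taxonomy: extraction may only create these four types."""
    USER = "user"            # role, goals; always private
    FEEDBACK = "feedback"    # corrections AND confirmations
    PROJECT = "project"      # decisions, deadlines; dates made absolute
    REFERENCE = "reference"  # pointers to dashboards, trackers, docs

def validate_memory_type(memory_type: str) -> MemoryType:
    """Reject anything outside the taxonomy rather than inventing new types."""
    try:
        return MemoryType(memory_type)
    except ValueError:
        raise ValueError(f"'{memory_type}' is not in the closed taxonomy")
```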

Fact Extraction

Background extraction is not just "run a sub-agent after each turn." It's a complete subsystem with its own agent architecture, concurrency model, and failure recovery strategy.

The forked-agent pattern. A background sub-agent runs after each turn to extract facts. This agent shares the parent's prompt cache (so it's cheap, with no cache creation cost), has a hard turn budget (5 turns), and operates with a restricted tool set that only allows reading and writing to the memory directory. It cannot invoke other tools, spawn sub-agents, or perform any action that could interfere with the main loop.

The efficient parallel strategy: on turn 1, issue all read calls in parallel for every memory file that might need updating; on turn 2, issue all write calls in parallel. Never interleave reads and writes. This completes well-behaved extractions in 2-4 turns, keeping extraction cheap even in long sessions.
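The read-then-write phasing can be sketched with `asyncio.gather`. The two tool functions below are stand-ins for the forked agent's memory-directory tool calls, not a real API; the point is the strict phase separation.

```python
import asyncio

# Stand-ins for the forked agent's memory-directory tool calls.
async def read_memory_file(path: str) -> str:
    await asyncio.sleep(0)  # simulate tool-call latency
    return f"contents of {path}"

async def write_memory_file(path: str, content: str) -> str:
    await asyncio.sleep(0)
    return path

async def extraction_turns(paths: list[str]) -> list[str]:
    # Turn 1: all reads in parallel -- never interleaved with writes.
    contents = await asyncio.gather(*(read_memory_file(p) for p in paths))
    # (The agent decides what to update based on `contents` here.)
    # Turn 2: all writes in parallel.
    written = await asyncio.gather(
        *(write_memory_file(p, c) for p, c in zip(paths, contents))
    )
    return list(written)
```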

function extract_memories(messages, last_cursor_uuid):
  # Mutual exclusion: if main agent already wrote memories this turn,
  # skip extraction and advance cursor past the write range.
  if has_memory_writes_since(messages, last_cursor_uuid):
    last_cursor_uuid = messages.last().uuid
    return  # done, no background extraction needed this turn

  new_messages = messages_since(messages, last_cursor_uuid)
  memory_manifest = scan_memory_files(memory_dir)

  result = run_forked_agent(
    prompt: build_extract_prompt(len(new_messages), memory_manifest),
    can_use_tool: memory_dir_only_tool_filter,  # restricted tool set
    max_turns: 5,  # hard cap, prevents rabbit-holes
    shares_parent_prompt_cache: true,  # cheap, no cache creation cost
  )

  # Advance cursor ONLY after success. On failure, reconsider same messages next time
  if result.succeeded:
    last_cursor_uuid = messages.last().uuid

Mutual exclusion guard. If the main agent already wrote to memory files this turn, the background extractor skips its run and advances its cursor past those messages. The main agent and the background extractor are mutually exclusive per turn. They never both write to memory in the same turn.

The extraction cursor. The cursor is a UUID identifying the last message that was successfully processed by the extractor. After each successful extraction run, the cursor advances to the most recent message. On failure, the cursor does NOT advance. Those messages are reconsidered on the next run. This gives you at-least-once extraction semantics: a failure never permanently loses messages.

Edge case: if the cursor UUID is missing from the message list (compaction removed the message the cursor was pointing to), fall back to counting all model-visible messages. Never disable extraction permanently because the cursor was lost. The fallback makes it recoverable.

The entrypoint index. A single manifest file lists all memory files. Cap this file at a line limit (200 lines) and a byte limit (25KB). Without caps, a single long-line memory file can cause the manifest to consume a disproportionate share of the context budget. When either cap is exceeded, append a warning rather than silently truncating. The warning signals that memory is getting large and should be cleaned up.
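Enforcing both caps with an explicit warning might look like the following sketch. The warning text is an assumption; the limits come from the figures above.

```python
MAX_LINES = 200
MAX_BYTES = 25 * 1024
WARNING = "[warning: memory manifest truncated -- clean up memory files]"

def cap_manifest(manifest: str) -> str:
    """Enforce the line and byte caps; append a warning rather than
    silently truncating, so growth is visible."""
    lines = manifest.splitlines()
    truncated = False
    if len(lines) > MAX_LINES:
        lines = lines[:MAX_LINES]
        truncated = True
    text = "\n".join(lines)
    if len(text.encode("utf-8")) > MAX_BYTES:
        # The byte cap catches pathologically long lines the line cap misses.
        text = text.encode("utf-8")[:MAX_BYTES].decode("utf-8", errors="ignore")
        truncated = True
    if truncated:
        text += "\n" + WARNING
    return text
```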

The Context Budget

The context window is not all input space. You cannot fill it to 100% capacity with input.

The model needs space to write its response. If you allow the input to fill the window completely, the model will truncate its own output mid-response, a failure mode that is silent, confusing, and hard to debug. Reserve a response buffer (commonly 13,000 to 15,000 tokens for models with 200K+ context windows) before triggering compaction.

This means the effective context budget is:

effective_budget = context_window_size - response_headroom
compaction_trigger = effective_budget * 0.85   # trigger at 85%, not 100%

Trigger compaction when you approach the effective budget, not when you hit it. Compaction itself takes tokens (function calls, results). If you wait until you're at the limit, the compaction process may push you over before it finishes.
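The trigger check reduces to a few lines. The 13,000-token headroom is one value from the range suggested above; both constants are deployment-specific.

```python
RESPONSE_HEADROOM = 13_000  # tokens reserved for the model's reply
TRIGGER_FRACTION = 0.85     # compact at 85% of the effective budget

def should_compact(usage_tokens: int, context_window_size: int) -> bool:
    """True once usage crosses 85% of the budget left after headroom."""
    effective_budget = context_window_size - RESPONSE_HEADROOM
    return usage_tokens >= effective_budget * TRIGGER_FRACTION
```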

Production Considerations

The circuit breaker saves real money at scale. Three consecutive compaction failures is the right cutoff because most transient failures (API timeouts, temporary overload) resolve within 1-2 retries. If you've failed 3 times, the session context is almost certainly irrecoverably over the limit, and retrying does not help. Without this circuit breaker, sessions in this state generate one failed compaction attempt per turn for the duration of the session. At scale, this pattern burns hundreds of thousands of API calls per day across a user base.

MEMORY.md index truncation is not optional. The memory manifest is injected into the prompt every turn. A 197KB manifest file packed into fewer than 200 lines (all long lines) will consume a substantial portion of your effective context budget before the conversation even begins. The byte cap catches this: when a memory file is pathologically long per-line, the byte cap triggers before the line cap, preventing prompt hijacking via long memory entries.

Image stripping is a compaction prerequisite. In sessions where users attach images or screenshots, the accumulated message history can contain megabytes of image data. The compaction API call that's supposed to summarize this history will hit prompt-too-long before it can complete, making it impossible to compact the very content that's causing the overflow. Strip images first, then compact. The text markers ([image], [document]) preserve conversation structure without the token cost.

Cursor UUID fallback prevents permanent extraction failure. Compaction can remove old messages from the message list. If the cursor was pointing to a removed message, a naive implementation would conclude "cursor not found, something went wrong" and stop extracting. The fallback (count model-visible messages) means extraction continues even when compaction has cleared the anchor message. Without this fallback, long-running sessions that compact frequently would permanently stop extracting facts after the first compaction that hits the cursor.

Thinking block co-location at compaction boundaries. When compaction chooses its boundary, it operates on messages. But some LLM responses emit multiple content blocks (a thinking block and a tool-use block) that share the same message identifier but are emitted as separate streaming events. If the compaction boundary falls between these blocks, the kept range contains a tool-use with no associated thinking block, which can cause message normalization to fail. After choosing the boundary, check whether any message at or near the boundary has a sibling thinking block, and include it.
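The co-location check can be sketched as one more backward walk over a flat list of content blocks, where a shared `"message_id"` marks siblings. The field name is illustrative.

```python
def include_sibling_blocks(blocks: list[dict], boundary: int) -> int:
    """Walk the boundary back so content blocks sharing a message id
    (e.g. a thinking block emitted before its tool_use) stay together."""
    if boundary >= len(blocks):
        return boundary
    while (boundary > 0
           and blocks[boundary - 1]["message_id"] == blocks[boundary]["message_id"]):
        boundary -= 1  # pull the sibling into the kept range
    return boundary
```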

Best Practices

  • DO run compaction in cost order: trim tool results, drop old messages, session memory compact, LLM summarize
  • DO use a closed taxonomy for long-term memory with four types: user, feedback, project, reference
  • DO track consecutive compaction failures and circuit-break after 3
  • DO strip images and documents from message history before compaction API calls
  • DO record both corrections and confirmations in feedback memory. Corrections-only causes drift toward over-caution.
  • DO convert relative dates to absolute during extraction ("next Tuesday" becomes the ISO date)
  • DO cap the memory manifest at a line limit and byte limit
  • DON'T save facts that have an authoritative source elsewhere (git, codebase, external system)
  • DON'T split tool-use/tool-result pairs when choosing compaction boundaries
  • DON'T trigger compaction at 100% capacity. Trigger at 85% of effective budget to leave room for the compaction process itself.
  • DON'T permanently disable extraction when the cursor UUID is missing. Use the count-based fallback.
  • DON'T give the background extractor unrestricted tool access. Restrict it to memory directory operations only.
Related Pages

  • Agent Loop Architecture: The agent loop produces the message list that grows across turns and eventually causes the context pressure this page teaches you to manage.

  • Prompt Architecture: The prompt fills the static portion of the context window before any conversation begins. Memory content is injected into the dynamic zone each turn, so prompt structure and memory size share the same budget.

  • Tool System: Tool results are the primary driver of context bloat. Understanding tool result sizing and trimming connects directly to the compaction pipeline.

  • Error Recovery: The circuit breaker pattern here (consecutive failure counter that disables retries) mirrors the circuit breaker pattern for API retries. Both protect against runaway retry loops, and both use the same failure-count-with-success-reset structure.

  • Observability and Debugging: Compaction events, context budget metrics, and cost-per-model tracking all feed into the observability layer. When compaction misfires or context pressure spikes unexpectedly, the session tracing and event log covered on that page are the primary debugging surface.

  • Pattern Index: All patterns from this page in one searchable list, with context tags and links back to the originating section.

  • Glossary: Definitions for all domain terms used on this page, from agent loop primitives to memory system concepts.