# ClaudePedia - Full Knowledge Base

Source: https://claudepedia.dev
Documents: 24

---

# Build Your First Agent

Source: https://claudepedia.dev/docs/quickstart
Section: Guides

Go from zero to a working agent mental model (the loop, one tool, and termination) in a single page.

A chatbot responds once. An agent loops until the task is done. That one difference, a loop, is what makes an LLM capable of reading files, calling APIs, running computations, and delivering answers grounded in real information rather than training knowledge alone.

The core idea: each turn, the model receives the full conversation history and decides what to do next. If it needs more information, it calls a tool. The result comes back, gets appended to the history, and the model decides again. When the model has enough to answer, it responds without calling any tool, and that's the signal to stop.

We'll build this in two steps: first the bare loop, then one tool wired in. By the end, you'll have a clear mental model of how every agent system works, and a foundation for the depth that follows.

## Step 1: The Loop

The agent loop is a function that takes a question and returns an answer. Here's the complete structure:

```python
async def agent_loop(question: str, max_turns: int = 20) -> str:
    messages = [
        system_message("You are a helpful assistant."),
        user_message(question),
    ]
    for turn in range(max_turns):
        response = await llm.call(messages)
        if not response.tool_calls:
            return response.text
        messages.append(response)
        for call in response.tool_calls:
            result = await dispatch_tool(call.name, call.args)
            messages.append(tool_result(call.id, result))
    raise RuntimeError("agent exceeded max_turns without answering")
```

Three things make this work:

**The messages list is the agent's memory.** The model sees the entire list on every turn: the original question, all previous responses, all tool results. This is how it maintains context across multiple steps. The list grows as the loop runs.
A simple task might finish in two turns, while a complex one might take a dozen.

**Termination is the absence of tool calls.** When the model decides it has enough information to answer, it produces a response with no `tool_calls`. That empty field is the signal: return the answer and stop. The agent doesn't announce "I'm done." It simply stops asking for tools.

**`max_turns` is a safety net, not a budget.** A well-designed agent should terminate naturally long before hitting 20 turns. Exceeding `max_turns` is an error condition: the model didn't converge, and you should surface that failure explicitly rather than returning a partial or empty answer. Without this guard, a confused model runs forever.

## Step 2: Add a Tool

Tools are how the agent acts on the world. A tool is a function plus metadata: a name, a description, and a parameter schema. The LLM never sees your function code. It only sees the schema, which tells it what arguments to pass.

Here's a `read_file` tool:

```python
tools = [
    {
        "name": "read_file",
        "description": "Read the contents of a file at the given path",
        "parameters": {
            "path": {
                "type": "string",
                "description": "Absolute path to the file",
            }
        },
    }
]

# The implementation is separate. The LLM never sees this.
def read_file_impl(path: str) -> str:
    return filesystem.read(path)
```

The metadata (the name, description, and parameter types) is what the model reasons about when deciding whether to call this tool and what arguments to pass. Your function body is invisible. This is the key insight: **the schema is the interface, the implementation is the plumbing.**

Wiring the tool into the loop takes two changes:

```python
response = await llm.call(messages, tools=tools)  # pass tools here
```

And a dispatch function to route calls to implementations:

```python
async def dispatch_tool(name: str, args: dict):
    if name == "read_file":
        return read_file_impl(args["path"])
    raise ValueError(f"unknown tool: {name}")
```

That's it. The loop from Step 1 is unchanged.
The model now knows `read_file` exists, calls it when it needs file contents, and the dispatcher runs the real function. In a production system, the dispatcher becomes a registry that maps names to implementations automatically. See [Tool System](/docs/tool-system) for the full pattern.

> **Note:** As the agent reads more files and accumulates tool results, the messages list grows. For long-running tasks or large files, this becomes a context management problem. See [Memory and Context](/docs/memory-and-context) for compaction strategies.

## Where to Go From Here

Every concept this page introduces has a deeper Core Systems page with production-grade detail, failure modes, and non-obvious insights.

- **[Agent Loop Architecture](/docs/agent-loop).** The full lifecycle: async generator patterns for streaming, three distinct abort paths, token budget termination, and why loop cleanup order matters under cancellation.
- **[Tool System](/docs/tool-system).** Registration, dispatch, concurrency classes (so read-only tools run in parallel), behavioral flags, schema flattening for LLM APIs, and fail-closed defaults.
- **[Memory and Context](/docs/memory-and-context).** What happens when the messages list grows too long: compaction strategies, LLM-driven fact extraction, session memory budgets, and context pruning triggers.
- **[Prompt Architecture](/docs/prompt-architecture).** How to structure the system message for cache efficiency, multi-section composition, volatile section registration, and behavioral calibration.
- **[Error Recovery](/docs/error-recovery).** When the LLM call fails, when a tool crashes, and the four-rung escalation ladder: retry, fallback, partial result, escalate.
- **[Safety and Permissions](/docs/safety-and-permissions).** Controlling what tools can do: the six-source permission cascade, graduated trust levels, and bypass-immune safety checks that hold even in auto mode.
- **[Multi-Agent Coordination](/docs/multi-agent-coordination).** When one loop isn't enough: spawning backends, file-based mailbox communication, tool partitioning between coordinator and workers, and session reconnection for resumed agents.
- **[Streaming and Events](/docs/streaming-and-events).** Delivering results as they happen: typed event streams, priority-based dispatch, capture/bubble phases, and screen-diffing output models.
- **[Pattern Index](/docs/pattern-index).** All patterns mentioned in this quickstart in one searchable list, with links to the deeper Core Systems pages where each pattern is explained.
- **[Glossary](/docs/glossary).** Definitions for every term introduced on this page, from agent loop to tool dispatch.

---

# Design a Custom Tool

Source: https://claudepedia.dev/docs/design-a-custom-tool
Section: Guides

How to build a production-ready agent tool from scratch: schema, concurrency class, behavioral flags, and dispatch wiring.

Your agent can talk, but it cannot act until you give it tools. The quickstart showed how to wire a simple `read_file` tool into the agent loop. This guide goes deeper: we will build a database query tool from scratch, making every design decision explicit along the way. By the end, you will know how to define a schema the model can reason about, choose a concurrency class that keeps dispatch safe, set behavioral flags for cross-cutting concerns, and register the finished tool so the dispatcher can find it.

The running example is a `query_database` tool that accepts a SQL query and returns rows. We chose this because it hits every interesting design surface: the schema needs careful descriptions, the concurrency class depends on whether the query mutates state, the tool needs permission checks, and the result can be large enough to require size limits.

## Define the Schema

The schema is the interface between the model and your tool. The model never sees your implementation code.
It sees the schema (the name, description, and typed parameters) and uses that to decide whether to call the tool and what arguments to pass. This makes schema design the single highest-leverage activity in tool building.

A schema has three parts: a name the model can reference, a description that explains what the tool does and when to use it, and a set of typed parameters with their own descriptions. The following defines the schema for our database query tool:

```python
tool_schema = {
    "name": "query_database",
    "description": (
        "Execute a read-only SQL query against the application database "
        "and return matching rows. Use this when you need to look up data, "
        "check record counts, or verify database state. Do not use for "
        "INSERT, UPDATE, or DELETE operations. Use write_database for mutations."
    ),
    "parameters": {
        "query": {
            "type": "string",
            "description": (
                "A read-only SQL SELECT statement. Must not contain "
                "INSERT, UPDATE, DELETE, DROP, or ALTER."
            ),
        },
        "max_rows": {
            "type": "integer",
            "description": (
                "Maximum number of rows to return. Defaults to 100. "
                "Use a lower value when you only need to check existence."
            ),
            "default": 100,
        },
    },
}
```

Three things to notice about this schema:

**The description says when to use it and when not to.** "Use this when you need to look up data" gives the model positive guidance. "Do not use for INSERT, UPDATE, or DELETE" gives it a clear boundary. Models make better tool selection decisions when the description includes both cases.

**Parameter descriptions carry constraints.** The `query` parameter description says "Must not contain INSERT, UPDATE, DELETE, DROP, or ALTER." This is not enforced by the schema itself. It is guidance for the model. The actual enforcement happens in validation (next section). But stating the constraint in the description means the model is less likely to generate a violating input in the first place.
**The `max_rows` parameter has a default.** This means the model can call `query_database(query="SELECT * FROM users")` without specifying `max_rows`, and the system fills in 100. Defaults reduce the number of decisions the model has to make per call, which reduces error rates.

> **Tip:** Schema descriptions matter more than names. The model reads the description to decide when to call your tool, so invest your design effort there. A tool named `qdb` with a great description outperforms a tool named `query_database` with a vague one.

## Choose a Concurrency Class

The concurrency class tells the dispatcher whether it is safe to run your tool in parallel with other tools. When the model requests multiple tool calls in a single response, the dispatcher uses concurrency classes to decide which calls can run simultaneously and which must be serialized. There are three classes:

- **`READ_ONLY`:** The tool only reads. Multiple instances can run concurrently without interference. File searches, API lookups, and database SELECT queries are read-only.
- **`WRITE_EXCLUSIVE`:** The tool writes to shared state. It must run serially, meaning no other tool runs at the same time. File writes, database mutations, and email sends are write-exclusive.
- **`UNSAFE`:** The tool has side effects that are hard to bound. It runs in isolation, often in a subprocess or sandbox. Arbitrary shell execution is the canonical example.

For our `query_database` tool, the class is `READ_ONLY` because it executes SELECT statements and does not modify state. But here is a subtlety: **concurrency class can depend on the input.** A more general `execute_sql` tool that accepts any SQL statement cannot be statically classified. A SELECT is read-only, an UPDATE is write-exclusive, and a DROP TABLE is unsafe.
In that case, the tool implements a runtime check:

```python
def is_concurrency_safe(parsed_input: QueryInput) -> bool:
    query_upper = parsed_input.query.strip().upper()
    if query_upper.startswith("SELECT"):
        return True
    return False  # conservative: anything non-SELECT serializes
```

The dispatcher calls this function with the parsed input before deciding on parallel execution. If the function throws (because the input failed to parse, for instance), the dispatcher defaults to `False`. It never optimistically assumes concurrent dispatch is safe. For our focused `query_database` tool, we hardcode `READ_ONLY` because the schema and validation already constrain it to SELECT queries. The runtime check is for tools with broader input surfaces.

## Set Behavioral Flags

Behavioral flags declare cross-cutting concerns as data rather than embedding them in the function body. They tell the rest of the system (the permission layer, the dispatcher, the logging system) how to handle this tool without reading its implementation. The following attaches behavioral flags to our tool:

```python
tool_config = {
    "schema": tool_schema,
    "concurrency_class": "READ_ONLY",
    "behavioral_flags": {
        "is_destructive": False,
        "requires_permission": True,
        "interrupt_behavior": "block",
        "max_result_size_chars": 50_000,
        "timeout_seconds": 30,
        "retry_on_failure": True,
        "max_retries": 2,
    },
}
```

Each flag serves a specific downstream consumer:

- **`is_destructive`:** Tells the permission system whether this tool modifies state that cannot be undone. Our query tool is not destructive. A `delete_records` tool would be.
- **`requires_permission`:** Tells the permission cascade to check before execution. Even read-only tools might need permission if they access sensitive data. Database queries can expose PII, so we set this to `True`.
- **`interrupt_behavior`:** Tells the dispatcher what to do if the user cancels mid-execution. `"block"` means wait for the current call to finish before stopping.
  `"abort"` would cancel immediately. Database queries are safe to let finish.
- **`max_result_size_chars`:** Caps the result size. A SELECT without a LIMIT on a large table can return megabytes. Capping at 50,000 characters prevents a single tool result from consuming the entire context window.
- **`timeout_seconds`:** Prevents a slow query from blocking the agent indefinitely. 30 seconds is generous for a database query, so adjust based on your workload.
- **`retry_on_failure`** and **`max_retries`:** Transient database errors (connection reset, lock timeout) are worth retrying. Two retries before giving up is a reasonable default.

> **Tip:** Default to `requires_permission: True` for any tool you have not explicitly classified. Fail-closed is the only safe default, and you can always relax it later. See [Safety and Permissions](/docs/safety-and-permissions) for how the permission cascade evaluates these flags.

## Implement the Tool

The implementation is the function that does the actual work. It receives parsed, validated input and returns a result. The schema and behavioral flags handle everything upstream. By the time your function runs, the input has passed schema validation, semantic validation, and permission checks.
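The implementation below uses a few supporting types (`QueryInput`, `ToolResult`) and helpers (`success_result`, `error_result`) without defining them. These names are this guide's conventions, not any standard library; a minimal sketch of what they might look like:

```python
from dataclasses import dataclass


@dataclass
class QueryInput:
    """Parsed arguments for query_database, mirroring the schema."""
    query: str
    max_rows: int = 100  # matches the schema default


@dataclass
class ToolResult:
    """What the dispatcher appends to the conversation as a tool_result."""
    ok: bool
    content: str


def success_result(content: str) -> ToolResult:
    return ToolResult(ok=True, content=content)


def error_result(message: str) -> ToolResult:
    # Error results flow back to the model as ordinary tool results,
    # so the message text is what the model will read and react to.
    return ToolResult(ok=False, content=message)
```

The exact fields are an assumption; what matters is that errors and successes share one result type, so the loop can append either to the history uniformly.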
The following implements the query tool with input validation:

```python
import re

async def query_database_impl(input: QueryInput, context: ToolContext) -> ToolResult:
    # Semantic validation: reject mutations even if they got past schema parsing.
    # Match on word boundaries so a column named "updated_at" does not
    # falsely trip the UPDATE check.
    forbidden = ["INSERT", "UPDATE", "DELETE", "DROP", "ALTER", "TRUNCATE"]
    query_upper = input.query.strip().upper()
    for keyword in forbidden:
        if re.search(rf"\b{keyword}\b", query_upper):
            return error_result(f"Rejected: query contains forbidden keyword '{keyword}'")

    # Execute the query with timeout protection
    try:
        rows = await context.database.execute(
            input.query,
            max_rows=input.max_rows,
            timeout=context.tool_config.timeout_seconds,
        )
    except TimeoutError:
        return error_result(
            f"Query timed out after {context.tool_config.timeout_seconds} seconds"
        )
    except ConnectionError:
        return error_result("Database connection failed")

    # Format and size-cap the result
    formatted = format_rows_as_table(rows)
    if len(formatted) > context.tool_config.max_result_size_chars:
        formatted = formatted[: context.tool_config.max_result_size_chars]
        formatted += f"\n[Truncated - {len(rows)} rows total, showing first portion]"
    return success_result(formatted)
```

Two implementation patterns are worth calling out:

**Double validation.** The semantic validation (checking for forbidden keywords) runs even though the schema description already tells the model not to send mutations. Defense in depth: the model might ignore the description, or a future schema change might weaken the constraint. Never rely on the model following instructions as your only enforcement layer.

**Error results, not exceptions.** The function returns `error_result(...)` instead of throwing. This is critical: the agent loop needs a `tool_result` message for every `tool_call` in the conversation history. If your tool throws an exception, the dispatcher catches it and converts it to an error result anyway. But returning error results explicitly gives you control over the error message the model sees.
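The implementation calls `format_rows_as_table`, which this guide leaves undefined. A minimal sketch, assuming each row is a dict keyed by column name (both the name and the plain-text table layout are illustrative choices, not a fixed API):

```python
def format_rows_as_table(rows: list[dict]) -> str:
    """Render query rows as an aligned plain-text table."""
    if not rows:
        return "(no rows)"
    columns = list(rows[0].keys())
    # Column width: the widest of the header and every value in that column.
    widths = {
        col: max(len(col), *(len(str(row.get(col, ""))) for row in rows))
        for col in columns
    }
    header = " | ".join(col.ljust(widths[col]) for col in columns)
    separator = "-+-".join("-" * widths[col] for col in columns)
    lines = [header, separator]
    for row in rows:
        lines.append(
            " | ".join(str(row.get(col, "")).ljust(widths[col]) for col in columns)
        )
    return "\n".join(lines)
```

A compact aligned table is a good default here because the model reads the result as text; the tighter the formatting, the fewer tokens each row costs against `max_result_size_chars`.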
## Wire Into Dispatch

The final step is registering the tool so the dispatcher can find it. The registry pattern maps tool names to their configurations and implementations:

```python
# Registry: maps tool names to configs and implementations
tool_registry = ToolRegistry()
tool_registry.register(
    name="query_database",
    config=tool_config,
    implementation=query_database_impl,
)

# Dispatch: the agent loop calls this for every tool call
async def dispatch_tool(name: str, args: dict, context: ToolContext) -> ToolResult:
    tool = tool_registry.get(name)
    if tool is None:
        return error_result(f"Unknown tool: {name}")

    # Schema validation (phase 1)
    parsed = tool.config.schema.parse(args)
    if not parsed.success:
        return error_result(f"Invalid arguments: {parsed.error}")

    # Permission check (uses behavioral flags)
    if tool.config.behavioral_flags.requires_permission:
        permission = await check_permission(name, parsed.data, context)
        if permission.denied:
            return error_result(f"Permission denied: {permission.reason}")

    # Execute
    return await tool.implementation(parsed.data, context)
```

The dispatcher does not know anything about databases, SQL, or query formatting. It knows how to validate schemas, check permissions, and call implementations. Adding a new tool means adding a registry entry. The dispatch logic stays untouched.

In a production system, the registry also handles tool listing for the LLM (extracting schemas into the format the model API expects), dynamic tool sets (tools that appear or disappear based on context), and tool aliases for backward compatibility. See [Tool System](/docs/tool-system) for the full registry pattern.

## Putting It All Together

Here is the complete tool, from schema to registration, in one view:

```python
# 1. Schema - what the model sees
schema = {
    "name": "query_database",
    "description": "Execute a read-only SQL SELECT query and return rows.",
    "parameters": {
        "query": {
            "type": "string",
            "description": "A read-only SELECT statement.",
        },
        "max_rows": {
            "type": "integer",
            "description": "Max rows to return.",
            "default": 100,
        },
    },
}

# 2. Config - what the system sees
config = ToolConfig(
    schema=schema,
    concurrency_class="READ_ONLY",
    behavioral_flags=Flags(
        is_destructive=False,
        requires_permission=True,
        max_result_size_chars=50_000,
        timeout_seconds=30,
    ),
)

# 3. Implementation - what actually runs
async def query_database(input, context) -> ToolResult:
    rows = await context.database.execute(input.query, max_rows=input.max_rows)
    return success_result(format_rows_as_table(rows))

# 4. Registration - wire it in
registry.register("query_database", config, query_database)
```

Four decisions, four pieces of code, one tool. The schema is the interface. The config is the metadata. The implementation is the plumbing. The registration is the wiring. Every production tool follows this pattern.

## Related

- **[Tool System](/docs/tool-system).** The full tool architecture: concurrency partitioning, the dispatch algorithm, two-phase validation, dynamic tool sets, and schema flattening for LLM APIs.
- **[Safety and Permissions](/docs/safety-and-permissions).** How the permission cascade evaluates `requires_permission` and `is_destructive` flags before your tool runs.
- **[Agent Loop](/docs/agent-loop).** The loop that calls dispatch, and how tool results feed back into the conversation history.

---

# Manage Conversation Context

Source: https://claudepedia.dev/docs/manage-conversation-context
Section: Guides

How to keep your agent's context window from overflowing. Covers recognizing the symptoms, applying cheap trims before expensive summarization, and tracking token budgets.

Your agent's context window is filling up. The conversation started fine: crisp responses, accurate tool selection, instructions followed perfectly. Ten turns in, something shifts. The agent starts ignoring instructions from earlier in the conversation. It picks the wrong tool for a task it handled correctly five turns ago.
It produces shorter, less detailed responses. These are the symptoms of context overflow, and they are the most common failure mode in long-running agent sessions.

The context window is everything the model can see in a single turn: the system prompt, the full conversation history, all tool results, any injected context. It has a hard size limit measured in tokens. Once the conversation approaches that limit, the model starts losing information, not gracefully, but in unpredictable ways.

The fix is not a bigger context window (though that helps temporarily). The fix is a strategy for deciding what stays, what gets compressed, and what gets dropped. We will build that strategy in three steps: understand the hierarchy of forgetting, implement a compaction pipeline that applies cheap interventions before expensive ones, and build a budget tracker that triggers compaction at the right time.

## Recognize the Problem

Context overflow does not produce an error message. It produces degraded behavior that looks like the model getting dumber. Here are the specific symptoms and what causes each:

**The agent ignores system prompt instructions.** As the message list grows, the system prompt (which sits at the very beginning) gets pushed further from the model's attention. Instructions that worked in turn 1 stop working in turn 15, even though nothing about the prompt changed.

**Tool selection degrades.** The model starts calling the wrong tool or calling tools it does not need. This happens because tool schemas compete with conversation history for attention. When the history is large, the model allocates less attention to the tool descriptions.

**Responses get shorter and less detailed.** The model produces less output when it is processing more input. This is not a bug. It is a consequence of fixed compute budgets. More input tokens mean fewer resources available for output generation.
**The model "forgets" things it said three turns ago.** Information from early in the conversation drops out of the model's effective attention, even if it is technically still in the context window. Long context does not mean equally-attended context.

If you see any of these symptoms in a long-running session, context management is the fix.

## The Hierarchy of Forgetting

Not all information in the context window is equally valuable. The hierarchy of forgetting is a framework for deciding what to keep and what to discard, organized from highest fidelity to lowest:

1. **In-context (message list).** Perfect fidelity. The full conversation history, tool results, and system prompt. This is what the model sees. It grows every turn.
2. **Summary (compressed digest).** LLM-generated condensation of older conversation segments. Loses exact phrasing and sequential detail. Saves significant space.
3. **Long-term storage (fact files).** Structured facts persisted between sessions. User preferences, project decisions, explicit corrections. Survives session end.
4. **Forgotten.** Information that was in-context but discarded without preservation. Zero cost, zero fidelity.

The key insight: **different categories of information belong at different levels proactively, not as a fallback.** Ephemeral tool results (the contents of a file the agent read for a one-off check) belong at level 4. Drop them early and aggressively. User corrections and explicit preferences belong at level 3, so extract them to storage before they get compressed away. The current working context (the plan the agent is executing, the files it is actively editing) belongs at level 1. Matching information to its appropriate level is the core skill. Do not wait until the context window is full to start thinking about this.

## Implement Compaction

When context pressure builds, the instinct is to call the LLM and summarize the conversation. That instinct is expensive and usually wrong.
Most context pressure is resolvable without any LLM calls at all. The principle is **cheap interventions first**, ordered by cost. The following implements a compaction pipeline that applies three strategies in ascending cost order:

```python
async def maybe_compact(messages: list, window_size: int) -> list:
    usage = count_tokens(messages)
    headroom = window_size * 0.15
    if usage < window_size - headroom:
        return messages  # no action needed, plenty of room

    # Strategy 1: trim oversized tool results (zero LLM cost)
    messages = trim_large_tool_results(messages, max_chars=5000)
    if count_tokens(messages) < window_size - headroom:
        return messages

    # Strategy 2: drop oldest messages (zero LLM cost)
    messages = drop_oldest_messages(messages, keep_recent=10)
    if count_tokens(messages) < window_size - headroom:
        return messages

    # Strategy 3: summarize older turns (one LLM call, expensive)
    split = len(messages) // 2
    summary = await llm.summarize(messages[:split])
    return [summary_message(summary)] + messages[split:]
```

Why this ordering matters: **tool results are the most common cause of context bloat.** A single verbose file read or search result can consume thousands of tokens while contributing nothing to the agent's working memory after the turn it was used. Trimming tool results to a size cap costs nothing and often frees enough space to avoid any further intervention.

Dropping old messages is next. The first ten turns of a long conversation are usually safe to drop once their content has been acted on. The agent already incorporated that information into its decisions, so the raw messages are redundant.

LLM-driven summarization is the last resort. It costs an API call, it takes time, and it loses information. Use it only when cheaper strategies are insufficient.

> **Tip:** Compact tool results aggressively. They are often 10x larger than conversation turns.
> A single `search_files` result returning 200 matches can consume as many tokens as the previous 20 conversation turns combined.

## Build a Budget Tracker

Compaction is reactive: it fires when the window is nearly full. A budget tracker is proactive: it monitors token usage continuously and triggers compaction at the right threshold, before the model starts degrading. The following implements a budget tracker that monitors usage and fires compaction automatically:

```python
from dataclasses import dataclass

@dataclass
class ContextBudgetTracker:
    window_size: int
    compact_threshold: float = 0.80   # compact at 80% usage
    critical_threshold: float = 0.95  # emergency at 95%
    consecutive_failures: int = 0
    max_failures: int = 3

    def check(self, messages: list) -> CompactionAction:
        usage = count_tokens(messages)
        ratio = usage / self.window_size
        if ratio < self.compact_threshold:
            return NO_ACTION
        if self.consecutive_failures >= self.max_failures:
            return NO_ACTION  # circuit breaker: stop retrying
        if ratio >= self.critical_threshold:
            return EMERGENCY_COMPACT  # drop aggressively, skip LLM summary
        return STANDARD_COMPACT

    def record_success(self):
        self.consecutive_failures = 0

    def record_failure(self):
        self.consecutive_failures += 1
```

Two design decisions in this tracker deserve explanation:

**Two thresholds, not one.** The standard threshold (80%) gives the compaction pipeline room to work. It can try cheap strategies first because there is still headroom. The critical threshold (95%) triggers emergency compaction that skips the LLM summarization step and goes straight to dropping messages. At 95%, there is no time for an expensive API call.

**A circuit breaker.** If compaction fails three times in a row (because the context is irrecoverably over limit, or perhaps a single tool result exceeds the entire window), stop retrying. Without this guard, every subsequent turn triggers a doomed compaction attempt that burns an API call and accomplishes nothing.
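Both the compaction pipeline and the tracker lean on a `count_tokens` helper that this guide does not define. Exact counts require the provider's tokenizer; for driving thresholds, a common cheap approximation is roughly four characters per token of English text. A sketch under that assumption:

```python
def count_tokens(messages: list[dict]) -> int:
    """Approximate token usage of a message list.

    Uses the rough heuristic of ~4 characters per token, plus a small
    per-message overhead for role markers and formatting. Good enough
    to drive compaction thresholds; not a substitute for the real
    tokenizer when you need exact numbers.
    """
    CHARS_PER_TOKEN = 4
    PER_MESSAGE_OVERHEAD = 4  # rough cost of role/formatting tokens
    total_chars = sum(len(str(m.get("content", ""))) for m in messages)
    return total_chars // CHARS_PER_TOKEN + PER_MESSAGE_OVERHEAD * len(messages)
```

Because the thresholds fire at 80% and 95% rather than at the hard limit, a few percent of estimation error is tolerable; if you tune the constants, err toward overestimating.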
## Wire It Into the Agent Loop

The budget tracker integrates into the agent loop as a check at the start of each turn, before the LLM call:

```python
async def agent_loop(question: str, max_turns: int = 20) -> str:
    messages = [system_message(prompt), user_message(question)]
    budget = ContextBudgetTracker(window_size=128_000)

    for turn in range(max_turns):
        # Check budget before every LLM call
        action = budget.check(messages)
        if action == STANDARD_COMPACT:
            messages = await maybe_compact(messages, budget.window_size)
            budget.record_success()
        elif action == EMERGENCY_COMPACT:
            messages = emergency_compact(messages, keep_recent=5)
            budget.record_success()

        response = await llm.call(messages)
        if not response.tool_calls:
            return response.text
        messages.append(response)
        for call in response.tool_calls:
            result = await dispatch_tool(call.name, call.args)
            messages.append(tool_result(call.id, result))

    raise RuntimeError("agent exceeded max_turns")
```

The check runs before every LLM call, not after. This ensures the model always receives a context-managed message list, even if the previous turn produced a massive tool result.

## What to Compact First

When you need to reclaim space, apply this priority (drop the least valuable first):

1. **Old tool results,** especially large ones. The agent already used them. Trim to a summary or drop entirely.
2. **System cache entries.** Cached context that can be re-fetched if needed later.
3. **Old conversation turns.** The first few turns of a long session. The agent has already incorporated their content into later decisions.
4. **Compacted summaries.** If you have nested summaries (a summary of a summary), the older one can go.
5. **Recent conversation turns.** Drop these only as a last resort. They are the agent's active working memory.

Never drop the system prompt. Never split a tool-call/tool-result pair. The API rejects conversations with a tool result that has no matching tool call.
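Both of those invariants can be enforced mechanically when dropping old turns. A sketch, assuming messages are dicts with a `"role"` key and that tool results carry the role `"tool"` (the exact message shape varies by provider, so treat this as illustrative):

```python
def drop_oldest_safely(messages: list[dict], keep_recent: int = 10) -> list[dict]:
    """Drop old messages while keeping the system prompt and recent turns.

    Never lets the kept suffix begin with a tool result, which would
    orphan it from the assistant message that requested the call.
    """
    if len(messages) <= keep_recent + 1:
        return messages  # nothing worth dropping
    system, tail = messages[:1], messages[-keep_recent:]
    # Advance past any leading tool results so no tool_result survives
    # without its matching tool_call.
    while tail and tail[0]["role"] == "tool":
        tail = tail[1:]
    return system + tail
```

Walking the cut point forward (keeping slightly fewer messages) is the safe direction: walking it backward to include the matching tool call would grow the context you were trying to shrink.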
## Related

- **[Memory and Context](/docs/memory-and-context).** The full memory architecture: the hierarchy of forgetting, the compaction pipeline internals, fact extraction to long-term storage, and circuit breakers for failed compaction.
- **[Agent Loop](/docs/agent-loop).** The loop that grows the message list every turn, why context pressure builds, and where compaction integrates.
- **[Prompt Architecture](/docs/prompt-architecture).** How the system prompt is structured for cache efficiency, and why prompt design affects context budget.

---

# Add MCP to Your Agent

Source: https://claudepedia.dev/docs/add-mcp-to-your-agent
Section: Guides

How to connect an external MCP server and make its tools available to your agent. Covers transport selection, tool bridging, and dispatch integration.

Your agent's built-in tools are limited to what you have implemented. When you need to connect to an external service (a code repository, a database dashboard, a deployment system), you have two options: write a custom tool for each integration, or use a protocol that lets external services expose tools in a standard format your agent already understands.

The Model Context Protocol (MCP) is that standard format. An MCP server advertises its capabilities: names, schemas, and behavioral hints. Your agent connects, discovers what the server offers, and constructs tool objects that are structurally identical to your built-in tools. From that point on, the agent loop dispatches MCP tools and built-in tools through exactly the same code path. The loop never knows the difference.

This guide walks through connecting an MCP server from scratch: choosing a transport, establishing the connection, bridging the server's tools into your agent's format, and wiring them into your existing dispatch. By the end, your agent will have access to tools it did not ship with.

## What MCP Gives You

Without MCP, every external integration is a custom adapter.
You write the API client, handle authentication, translate the response format, implement error recovery, and manage the connection lifecycle. Each new service is a new tool with its own maintenance cost.

With MCP, the external service owns all of that. The server implements the API client, handles auth, formats responses, and manages its own lifecycle. Your agent's responsibility is narrow: connect to the server, discover its tools, and route calls through the existing dispatch pipeline.

The result is that adding a new external capability (say, a payment processing service or a monitoring dashboard) requires zero new tool implementations on your side. You add a server configuration entry, and the tools appear.

## Choose a Transport

MCP servers connect via a transport layer that handles the raw communication. The two most common options are:

**stdio.** The server runs as a local subprocess. Your agent spawns it, communicates via stdin/stdout, and kills it when done. This is the simplest transport and the right default for servers you run locally. Startup cost is one process spawn. No network configuration needed.

**HTTP (Streamable HTTP).** The server runs remotely. Communication happens over HTTP with bidirectional support. Use this for remote services, shared team servers, or cloud-hosted integrations. Requires network access and potentially authentication.

The transport is selected at configuration time, not at runtime. Once selected, all communication goes through the same `request()` / `notify()` interface regardless of which transport is underneath. The tool bridge pattern works identically for both.
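Concretely, configuration entries for the two transports might look like this. A minimal sketch: the field names (`type`, `command`, `args`, `env`, `url`, `headers`) follow the examples used elsewhere in this guide, and the server names are hypothetical:

```python
# Hypothetical server entries: one local (stdio), one remote (http).
stdio_config = {
    "type": "stdio",
    "command": "npx",
    "args": ["-y", "@mcp/github-server"],
    "env": {},
}

http_config = {
    "type": "http",
    "url": "https://mcp.example.com/server",
    "headers": {"Authorization": "Bearer <token>"},
}

def transport_kind(config: dict) -> str:
    """Dispatch on the 'type' field, failing loudly on unknown transports."""
    if config["type"] in ("stdio", "http"):
        return config["type"]
    raise ValueError(f"Unknown transport type: {config['type']}")
```

Keeping the discriminator in a single `type` field is what lets the selection function below stay a flat branch rather than a guessing game over which keys are present.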
Here is a transport selection function:

```python
function create_transport(config: ServerConfig) -> Transport:
    if config.type == "stdio":
        return StdioTransport(
            command=config.command,
            args=config.args,
            env=config.env
        )
    elif config.type == "http":
        return HttpTransport(
            url=config.url,
            headers=config.headers
        )
    else:
        raise Error(f"Unknown transport type: {config.type}")
```

> **Tip:** Start with `stdio` for local development and testing. Switch to `http` when you need to share the server across multiple agents or deploy it remotely. The tool bridge code does not change when you switch transports.

## Connect and Discover

The connection lifecycle has three steps: create the transport, establish the connection, and discover what the server offers. The server advertises its capabilities in response to a `tools/list` request.

The following connects to an MCP server and retrieves its tool list:

```python
async function connect_mcp_server(name: str, config: ServerConfig) -> McpConnection:
    # Step 1: Create transport
    transport = create_transport(config)

    # Step 2: Connect
    client = McpClient()
    await client.connect(transport)

    # Step 3: Discover capabilities
    response = await client.request("tools/list")

    return McpConnection(
        name=name,
        client=client,
        raw_tools=response.tools
    )
```

The `tools/list` response contains an array of tool definitions, each with a name, description, input schema (JSON Schema format), and optional annotations. The annotations carry behavioral hints: whether the tool is read-only, whether it is destructive, and other metadata that maps to your agent's concurrency and permission system.

**What can go wrong at this stage:** The server might not start (bad command, missing binary), the connection might fail (network error, auth required), or `tools/list` might return tools with malformed schemas. Handle all three cases before proceeding to the bridge.
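Startup and network failures are often transient, so a retry wrapper around the connect step is a common defense. A minimal sketch — `connect_fn` is a placeholder for whatever actually establishes the connection, and the backoff constants are purely illustrative:

```python
import time

def connect_with_retry(connect_fn, attempts: int = 3, base_delay: float = 0.5):
    """Call connect_fn(), retrying on failure with exponential backoff.

    Any exception from connect_fn is treated as transient until the
    attempt budget is exhausted; the last error is chained for debugging.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return connect_fn()
        except Exception as error:
            last_error = error
            # 0.5s, 1s, 2s, ... between attempts (illustrative values)
            time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError(f"connection failed after {attempts} attempts") from last_error
```

Malformed schemas are deliberately not retried here: a bad tool definition is not transient, and the bridge below handles it by skipping the tool instead.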
The connection has a state that determines whether tools are available:

```python
function get_tools_for_connection(connection: McpConnection) -> list:
    if connection.state == "connected":
        return connection.bridged_tools
    return []  # failed, pending, or disabled: return empty
```

All non-connected states return an empty tool list. The agent loop gets a consistent interface regardless of server health. A server can fail, require auth, or be disabled, and the loop never sees an error. It simply has fewer tools available.

## Bridge the Tools

The tool bridge is the core pattern. It takes the MCP server's tool definitions and constructs tool objects that are structurally identical to your built-in tools. The agent loop and dispatcher cannot tell the difference.

The following bridges MCP tools into your agent's internal format:

```python
function bridge_mcp_tools(connection: McpConnection) -> list:
    bridged = []
    for mcp_tool in connection.raw_tools:
        # Validate the schema before bridging
        if not is_valid_json_schema(mcp_tool.input_schema):
            log_warning(f"Skipping {mcp_tool.name}: invalid schema")
            continue

        agent_tool = {
            name: f"mcp__{connection.name}__{mcp_tool.name}",
            description: truncate(mcp_tool.description, max_chars=2048),
            input_schema: mcp_tool.input_schema,
            concurrency_class: "READ_ONLY" if mcp_tool.annotations.read_only_hint else "WRITE_EXCLUSIVE",
            behavioral_flags: {
                is_destructive: mcp_tool.annotations.destructive_hint or False,
                requires_permission: not mcp_tool.annotations.read_only_hint,
            },
            call: create_mcp_call(connection.client, mcp_tool.name)
        }
        bridged.append(agent_tool)
    return bridged
```

Four things happen in this bridge:

**Namespacing.** The tool name becomes `mcp__{server}__{tool}`. This prevents collisions across multiple servers and makes tool ownership traceable in logs. When you see `mcp__payments__refund_transaction` in a trace, you know immediately which server and which operation.
**Description truncation.** MCP servers control their own descriptions, and some are verbose. A tool description that consumes 10,000 tokens poisons the context window. The model wastes attention on that one description at the expense of everything else. Truncating to 2,048 characters is a safety measure, not a limitation.

**Annotation passthrough.** The server's hints (`read_only_hint`, `destructive_hint`) map directly to your concurrency and permission system. A tool marked read-only can run concurrently with other read-only tools, while a tool marked destructive triggers the permission cascade.

**Schema validation.** If the server returns a tool with a malformed schema, skip it rather than crashing. A single bad tool definition should not prevent the other tools from being available.

> **Tip:** Always validate MCP tool schemas at connection time. A malformed schema does not fail loudly. It fails at dispatch time, when the model tries to call the tool and the input cannot be parsed. By then, the model has already committed to a plan that includes that tool. Validate early, skip bad tools, and log the problem.

## Wire Into Dispatch

The final step is merging bridged MCP tools into your existing dispatch pipeline. The dispatcher already knows how to route built-in tools. Adding MCP tools means extending the tool list, not changing the dispatch logic.

The following shows how to combine built-in tools with bridged MCP tools:

```python
function build_tool_list(builtin_tools: list, mcp_connections: list) -> list:
    all_tools = list(builtin_tools)
    for connection in mcp_connections:
        bridged = bridge_mcp_tools(connection)
        all_tools.extend(bridged)
    return all_tools

# The dispatch function is unchanged. It handles all tools identically.
async function dispatch_tool(name: str, args: dict, context: ToolContext) -> ToolResult:
    tool = find_tool(name, context.all_tools)
    if tool is None:
        return error_result(f"Unknown tool: {name}")

    parsed = tool.input_schema.parse(args)
    if not parsed.success:
        return error_result(f"Invalid arguments: {parsed.error}")

    return await tool.call(parsed.data, context)
```

The dispatch function does not distinguish between built-in tools and MCP tools. Both are tool objects with a name, schema, and `call` function. The only difference is that an MCP tool's `call` function routes through the MCP client to the remote server, while a built-in tool's `call` function runs local code. This is the power of the bridge pattern. The abstraction boundary absorbs the complexity.

## Putting It Together

Here is the complete flow from configuration to a working dispatch with MCP tools:

```python
# 1. Configure the MCP server
server_config = {
    name: "github",
    type: "stdio",
    command: "npx",
    args: ["-y", "@mcp/github-server"]
}

# 2. Connect and discover
connection = await connect_mcp_server("github", server_config)

# 3. Bridge tools
connection.bridged_tools = bridge_mcp_tools(connection)

# 4. Merge with built-in tools
all_tools = build_tool_list(builtin_tools, [connection])

# 5. Pass to the agent loop. Dispatch handles everything.
response = await agent_loop(question, tools=all_tools)
```

Five steps: configure, connect, bridge, merge, run. The agent now has access to every tool the MCP server exposes, dispatched through the same pipeline as built-in tools.

When you add a second MCP server (say, a deployment service), the change is one new configuration entry and one new connection. The bridge, merge, and dispatch steps handle it automatically. This is how MCP turns the cost of integration from linear (one custom tool per service) to constant (one bridge pattern for all services).
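The linear-to-constant claim is easy to see in a standalone sketch. Below, connections are stubbed as plain objects carrying only a name and a pre-bridged tool list (hypothetical shapes; the real objects also carry clients and schemas):

```python
from dataclasses import dataclass, field

@dataclass
class FakeConnection:
    # Stub standing in for McpConnection: just a name and pre-bridged tools.
    name: str
    bridged_tools: list = field(default_factory=list)

def build_tool_list(builtin_tools: list, connections: list) -> list:
    """Merge built-in tools with every connection's bridged tools."""
    all_tools = list(builtin_tools)
    for connection in connections:
        all_tools.extend(connection.bridged_tools)
    return all_tools

github = FakeConnection("github", [{"name": "mcp__github__search_issues"}])
deploy = FakeConnection("deploy", [{"name": "mcp__deploy__rollback"}])
builtin = [{"name": "read_file"}]

# Adding the second server is one more list entry; the merge logic is unchanged.
tools = build_tool_list(builtin, [github, deploy])
```

Note that the merge loop never inspects what kind of tool it is handling; that indifference is the whole point of the bridge.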
## Related

- **[MCP Integration](/docs/mcp-integration).** The full MCP architecture: transport details, connection state machines, batched startup for many servers, config scope hierarchy, auth caching, and session expiry handling.
- **[Tool System](/docs/tool-system).** How bridged tools integrate with the dispatch algorithm, concurrency partitioning, and behavioral flag composition.
- **[Safety and Permissions](/docs/safety-and-permissions).** How the permission cascade applies to MCP tools, including the `destructive_hint` and `read_only_hint` annotations.

---

# Build a Multi-Agent System

Source: https://claudepedia.dev/docs/build-a-multi-agent-system
Section: Guides

How to split a single agent into a coordinator and specialized workers. Covers the delegation pattern, context isolation, tool partitioning, and result synthesis.

Some tasks are too complex for a single agent. A research task might need five sources investigated simultaneously. A coding task might need file search, code generation, and test execution happening in parallel.

The instinct is to build agents as equal peers that split the work: a flat team where each agent takes a slice. That model fails in practice. The failure mode is predictable: without a coordinator, agents duplicate work, produce conflicting outputs, and leave no one responsible for assembling a coherent answer. You end up with five partial answers that the user has to reconcile themselves.

Production multi-agent systems use **delegation, not distribution**. One coordinator decides WHAT needs doing. Specialized workers decide HOW to do their assigned piece. The coordinator synthesizes the results into a coherent final answer.

This guide walks through building that pattern with a concrete example: a coordinator that delegates to a research agent and a writing agent.

## When to Split

Not every task needs multiple agents. A single agent with the right tools handles most workflows.
Split only when you can name distinct responsibilities that require different tool sets. The heuristic: **if two subtasks need different tools and can run independently, they are candidates for separate agents.**

A research subtask needs search and web browsing tools. A writing subtask needs file write and formatting tools. Giving both tool sets to a single agent works, but the agent's tool selection degrades as the tool count grows. More tools means more competition for the model's attention in the schema.

Do not split when:

- The subtasks are sequential (each depends on the previous result). A single agent with a plan handles this better than a coordinator waiting on one worker at a time.
- The subtasks share state heavily. If worker A needs to read what worker B just wrote, you need synchronization, and synchronization between agents is expensive and error-prone.
- You are optimizing prematurely. Start with one agent. Measure where it struggles. Split the specific capability that is causing problems.

> **Tip:** Start with one agent. Split only when you can name distinct responsibilities that require different tool sets. Premature splitting adds coordination overhead without adding capability.

## The Coordinator Pattern

The coordinator receives the user's task, plans the subtasks, delegates to workers, waits for results, and synthesizes a final answer. It does not execute subtasks directly. It only plans and synthesizes.

Here is the coordinator loop:

```python
async function coordinator(task: str, worker_configs: list) -> str:
    # Step 1: Plan - break the task into subtasks
    plan = await llm.call(
        messages=[
            system_message("You are a coordinator. Break this task into independent subtasks."),
            user_message(task)
        ],
        tools=[plan_subtasks_tool]
    )

    # Step 2: Delegate - spawn workers in parallel
    worker_tasks = []
    for subtask in plan.subtasks:
        worker = spawn_worker(
            task=subtask.description,
            context=subtask.required_context,
            tools=subtask.required_tools
        )
        worker_tasks.append(worker)

    # Step 3: Gather - wait for all workers
    results = await gather_all(worker_tasks)

    # Step 4: Synthesize - combine results into a coherent answer
    synthesis = await llm.call(
        messages=[
            system_message("Synthesize these worker results into one coherent answer."),
            user_message(f"Original task: {task}"),
            *[assistant_message(f"Worker '{r.name}': {r.output}") for r in results]
        ]
    )
    return synthesis.text
```

Three design decisions in this structure are critical:

**The coordinator does not have domain tools.** It has only coordination tools: plan, spawn, and synthesize. If the coordinator had file search or web browsing tools, it would use them directly instead of delegating, and you would lose all the parallelism and specialization you designed for. Tool partitioning is what enforces the delegation model.

**Workers run in parallel.** The `gather_all` call dispatches all workers concurrently and waits for all to finish. This parallelism is the primary cost justification for multi-agent architecture. If your workers run sequentially, you have added coordination overhead without adding throughput.

**Synthesis is an LLM call, not concatenation.** The coordinator does not pass raw worker output to the user. It makes another LLM call that combines results, resolves conflicts, fills gaps, and produces an integrated answer. Skip the synthesis step and you have built an expensive router that makes the user do the assembly work.

## Context Isolation

Each worker starts with a fresh message history. Workers cannot read each other's conversations. The coordinator sees only what workers explicitly return, not their full internal reasoning.
The following shows how a worker is initialized with isolated context:

```python
async function spawn_worker(task: str, context: str, tools: list) -> WorkerResult:
    # Fresh message history, no parent conversation leaking in
    messages = [
        system_message(f"You are a specialist. Complete this task:\n{task}"),
        user_message(context)  # only the context this worker needs
    ]

    # Standard agent loop: each worker is a self-contained agent
    for turn in range(max_turns):
        response = await llm.call(messages, tools=tools)
        if response.tool_calls is empty:
            return WorkerResult(name=task, output=response.text)

        messages.append(response)
        for call in response.tool_calls:
            result = await dispatch_tool(call.name, call.args)
            messages.append(tool_result(call.id, result))

    return WorkerResult(name=task, output="Worker exceeded max turns")
```

Isolation is not a limitation. It is a design choice with four properties that make multi-agent systems tractable:

**Prevents error cascades.** A worker that encounters bad data or goes down a wrong reasoning path cannot infect other workers. Its error is contained within its own context. The coordinator sees a failed result and handles it without the error spreading.

**Enables parallel execution.** Workers share no mutable state. There is no shared message history to lock, no race condition on who appends first. Isolation is what makes `gather_all` safe to parallelize without synchronization.

**Makes debugging tractable.** Each worker's conversation is a self-contained artifact. When something goes wrong, you read one worker's message history in isolation and understand exactly what it saw and concluded.

**Provides security isolation.** A worker processing untrusted input (user-uploaded documents, scraped web content) cannot leak that content into the coordinator's context or into other workers. If the content contains a prompt injection, it affects only that worker's narrow task.

## Tool Partitioning

The coordinator and workers have different tools.
This partition is what makes delegation work. If the coordinator has all the tools, it will use them directly instead of delegating.

The following shows how tools are assigned:

```python
# Coordinator tools: only coordination capabilities
coordinator_tools = [
    plan_subtasks_tool,    # break a task into subtasks
    spawn_worker_tool,     # create a worker agent
    send_message_tool,     # communicate with an active worker
    terminate_worker_tool  # stop a worker that is stuck
]

# Research worker tools: only research capabilities
research_tools = [
    web_search_tool,
    fetch_url_tool,
    extract_text_tool
]

# Writing worker tools: only writing capabilities
writing_tools = [
    write_file_tool,
    read_file_tool,
    format_document_tool
]
```

The partition follows a simple rule: **each agent gets only the tools it needs for its specific responsibility.** The coordinator needs to plan and delegate, so it gets coordination tools. The research worker needs to find information, so it gets search tools. The writing worker needs to produce documents, so it gets file tools.

If you find that a worker needs a tool from another worker's set, that is a signal that either the subtask boundaries are wrong or you need a third worker. Do not share tool sets across workers. It defeats the purpose of specialization.

## Handle Partial Failure

Not every worker will succeed. A research worker might fail to find a source. A writing worker might exceed its turn budget. The coordinator must decide what to do with partial results.
The following shows a coordinator that handles worker failures:

```python
async function gather_with_fallback(worker_tasks: list) -> list:
    results = []
    for task in worker_tasks:
        try:
            result = await task
            results.append(result)
        except WorkerError as error:
            results.append(WorkerResult(
                name=task.name,
                output=None,
                error=str(error)
            ))

    # Separate successes from failures
    successes = [r for r in results if r.output is not None]
    failures = [r for r in results if r.output is None]

    if len(successes) == 0:
        raise AllWorkersFailedError(failures)

    if failures:
        log_warning(f"{len(failures)} workers failed: {[f.name for f in failures]}")

    return successes
```

The decision of whether to proceed with partial results or fail entirely depends on the task. For research tasks where three of five sources are sufficient, partial success is fine. For tasks where every subtask is critical (all tests must pass), any failure should fail the whole operation. Make this decision explicit in your coordinator, not implicit in the error handling.

## A Complete Example

Here is a coordinator that delegates a documentation task to a research agent and a writing agent:

```python
# The coordinator receives "Write a technical summary of microservices patterns"
async function documentation_coordinator(task: str) -> str:
    # Plan: one research subtask, one writing subtask
    subtasks = [
        Subtask(
            name="research",
            description="Find key patterns, trade-offs, and production considerations for microservices.",
            tools=research_tools,
            context="Focus on service discovery, circuit breakers, and data consistency."
        ),
        Subtask(
            name="writing",
            description="Write a structured technical summary from the research findings.",
            tools=writing_tools,
            context=""  # will be populated with research results
        )
    ]

    # Phase 1: Research runs independently
    research_result = await spawn_worker(
        task=subtasks[0].description,
        context=subtasks[0].context,
        tools=subtasks[0].tools
    )

    # Phase 2: Writing depends on research (sequential, not parallel)
    writing_result = await spawn_worker(
        task=subtasks[1].description,
        context=research_result.output,  # research output becomes writing input
        tools=subtasks[1].tools
    )

    # Synthesize
    return await llm.synthesize(
        original_task=task,
        worker_results=[research_result, writing_result]
    )
```

This example shows a mixed pattern: the research worker runs first (independently), and the writing worker runs second (depends on research output). Not every multi-agent system is fully parallel. The key structural property is still present: each worker has isolated context, specialized tools, and a defined responsibility.

## Related

- **[Multi-Agent Coordination](/docs/multi-agent-coordination).** The full multi-agent architecture: spawning backends, file-based mailbox communication, session reconnection, progressive synthesis, and production trade-offs.
- **[Tool System](/docs/tool-system).** How tool partitioning works at the dispatch level, and why giving the coordinator domain tools defeats the delegation model.
- **[Agent Loop](/docs/agent-loop).** Each worker runs its own agent loop. Understanding the loop is prerequisite to understanding how workers execute.

---

# Implement Permission Controls

Source: https://claudepedia.dev/docs/implement-permission-controls
Section: Guides

How to add a permission layer to your agent. Covers the cascade model, permission modes, bypass-immune checks, and session grants.

Your agent runs tools. Right now, it runs every tool the model asks for, no questions asked. That works during development. In production, it is a liability.
An agent without permission controls can delete files, send emails, execute arbitrary shell commands, and modify database records, all because the model decided to. The model is making tool selection decisions based on what it thinks the user wants, and it is often right. But "often right" is not "always right," and the cost of a wrong tool call that deletes a production database is not recoverable.

Permission controls are the layer between "the model wants to call this tool" and "the tool actually runs." This guide walks through adding that layer to an existing agent: a permission cascade that evaluates multiple policy sources, permission modes that provide global overrides, bypass-immune checks that hold even when the user says "allow everything," and session grants that let users approve tools for the duration of a session.

## The Problem

Without permission controls, your agent's dispatch function looks like this:

```python
async function dispatch_tool(name: str, args: dict) -> ToolResult:
    tool = registry.get(name)
    return await tool.implementation(args)
```

Every tool call the model generates goes straight to execution. The model asks for `delete_file`, and the file is deleted. The model asks for `send_email`, and the email is sent. There is no check, no confirmation, no audit trail.

The consequences scale with the tool set:

- An agent with `write_file` can overwrite configuration files, breaking the deployment.
- An agent with `execute_command` can run `rm -rf /` if the model generates that input.
- An agent with `send_api_request` can make unauthorized calls to external services.

The fix is not to remove dangerous tools. The agent needs them to do its job. The fix is to control when and how they run.

## The Permission Cascade

A permission check needs to evaluate multiple sources. The project might have a policy that denies shell execution. The user's personal settings might allow file writes. A CLI flag might grant extra permissions for this session.
These sources can disagree, and when they do, you need a deterministic resolution.

The cascade model resolves this: evaluate sources in priority order, and the first source with an opinion wins. If no source has an opinion, the default is **deny**. Fail-closed.

The following implements a permission cascade with three sources (expand to six for a full production system):

```python
SOURCES = ["project_policy", "user_settings", "session_grants"]

function evaluate_permission(tool_name: str, context: PermissionContext) -> Decision:
    for source in SOURCES:
        rules = context.get_rules(source)

        # Check deny rules first: most restrictive wins within a source
        for rule in rules.deny:
            if rule.matches(tool_name):
                return Decision(action="DENY", source=source, reason=rule.reason)

        # Check allow rules
        for rule in rules.allow:
            if rule.matches(tool_name):
                return Decision(action="ALLOW", source=source, reason=rule.reason)

    # No source had an opinion, fail-closed
    return Decision(action="DENY", source="default", reason="no matching rule")
```

Three properties of this cascade are worth naming:

**First-match wins.** Each source returns ALLOW, DENY, or no opinion. The cascade stops at the first non-abstaining source. Given any tool call and context, you can trace exactly which source made the decision. No ambiguity.

**Deny beats allow within a source.** Deny rules are checked before allow rules. Within any single source, the most restrictive opinion wins. A project policy that denies shell execution cannot be overridden by an allow rule in the same policy file.

**Every decision is logged.** The audit trail carries the matched source and the reason. When debugging an unexpected denial, you can see exactly which rule fired and from which configuration level.

> **Tip:** Default to deny for any tool you have not explicitly classified. Fail-closed is the only safe default. An unclassified tool that runs by default is a security hole waiting to be discovered.
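The `rule.matches` call above is abstract. One common concrete choice is glob-style patterns over tool names. A minimal runnable sketch of the same cascade, assuming glob rules (the source names and rule tables are illustrative):

```python
from fnmatch import fnmatch

# Illustrative rule tables: each source carries deny and allow glob patterns.
SOURCES = [
    ("project_policy", {"deny": ["execute_*"], "allow": []}),
    ("user_settings",  {"deny": [], "allow": ["read_*", "write_file"]}),
]

def evaluate_permission(tool_name: str) -> tuple:
    """First source with an opinion wins; deny beats allow within a source."""
    for source, rules in SOURCES:
        for pattern in rules["deny"]:
            if fnmatch(tool_name, pattern):
                return ("DENY", source)
        for pattern in rules["allow"]:
            if fnmatch(tool_name, pattern):
                return ("ALLOW", source)
    return ("DENY", "default")  # fail-closed: no source had an opinion
```

Glob patterns keep policy files readable (`execute_*` denies every shell variant at once), at the cost of making it possible to write an over-broad rule; audit your deny patterns the same way you audit the tools themselves.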
## Permission Modes

The cascade handles per-tool decisions. Permission modes provide a global override that changes the default behavior for entire categories of tools. Modes are useful when you want the agent to operate in a specific posture without configuring every tool individually.

The five modes:

```python
type PermissionMode = "default" | "plan" | "accept_edits" | "bypass" | "silent_deny"

function apply_mode(mode: PermissionMode, tool: ToolConfig) -> Decision or None:
    if mode == "plan":
        # Read-only mode: all write tools auto-denied
        if tool.has_write_effect:
            return DENY("plan mode: write tools blocked")
        return None  # reads fall through to cascade

    if mode == "accept_edits":
        # File edits auto-approved, everything else falls through
        if tool.is_file_edit:
            return ALLOW
        return None

    if mode == "bypass":
        # Everything auto-approved (requires explicit opt-in)
        return ALLOW

    if mode == "silent_deny":
        # Everything not explicitly allowed is silently denied
        return DENY("silent deny mode")

    # Default mode: fall through to cascade for all tools
    return None
```

**Plan mode** is the most common in practice. When the agent is planning (deciding what to do before doing it), you want it to be able to read files and search, but not write, execute, or send. Plan mode enforces this without touching the cascade configuration.

**Bypass mode** auto-approves everything. Use it only in controlled environments (CI pipelines, automated testing) where the tool set is already constrained. Never expose bypass mode as a user-facing option without understanding the implications.

Modes are evaluated before the cascade. If the mode has an opinion, the cascade is skipped entirely for that tool call.

## Bypass-Immune Checks

Some safety rules must hold regardless of the permission mode or cascade configuration. If the agent is scoped to a specific directory, it must not write files outside that directory, even if bypass mode is active.
If the agent has a budget limit, it must not exceed it, even if every tool is allowed.

Bypass-immune checks run before the cascade and before mode evaluation. They cannot be overridden by any policy source:

```python
function check_bypass_immune(tool_name: str, args: dict, context: PermissionContext) -> Decision or None:
    # Scope check: agent must not act outside its designated directory
    if is_file_operation(tool_name):
        target_path = extract_path(args)
        if not is_within_scope(target_path, context.allowed_directories):
            return DENY("scope violation: path outside allowed directories")

    # Budget check: agent must not exceed cost limit
    if context.budget_remaining <= 0:
        return DENY("budget exceeded")

    # Network check: agent must not access blocked domains
    if is_network_operation(tool_name):
        target = extract_host(args)
        if target in context.blocked_hosts:
            return DENY(f"blocked host: {target}")

    return None  # no bypass-immune violation, proceed to mode/cascade
```

The critical design property: **bypass-immune checks are not policy.** They are invariants. If they were inside the cascade, a sufficiently privileged policy source could override them. Running them before the cascade makes them unconditional.
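The `is_within_scope` helper deserves care: naive string-prefix checks are defeated by `..` segments. A sketch using path resolution — a plausible implementation under that assumption, not the one this guide prescribes:

```python
from pathlib import Path

def is_within_scope(target: str, allowed_directories: list) -> bool:
    """True if target resolves to a location inside an allowed directory.

    Resolving first collapses `..` traversal; a bare prefix check on the
    raw string would wrongly pass "/workspace/../etc/passwd".
    """
    resolved = Path(target).resolve()
    for directory in allowed_directories:
        root = Path(directory).resolve()
        if resolved == root or root in resolved.parents:
            return True
    return False
```

A remaining gap in this sketch is symlinks created mid-session: `resolve()` follows links at check time, but a link swapped in between the check and the write can still escape. Sandboxing at the OS level is the stronger answer for hostile environments.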
## Wire Into Dispatch

The permission layer integrates into the dispatch function between argument parsing and execution:

```python
async function dispatch_tool(name: str, args: dict, context: ToolContext) -> ToolResult:
    tool = registry.get(name)
    if tool is None:
        return error_result(f"Unknown tool: {name}")

    # Parse arguments
    parsed = tool.schema.parse(args)
    if not parsed.success:
        return error_result(f"Invalid arguments: {parsed.error}")

    # Permission check: bypass-immune -> mode -> cascade
    immune_check = check_bypass_immune(name, parsed.data, context.permissions)
    if immune_check is not None:
        log_permission(name, immune_check)
        return error_result(f"Permission denied: {immune_check.reason}")

    mode_check = apply_mode(context.permissions.mode, tool.config)
    if mode_check is not None:
        if mode_check.action == "DENY":
            log_permission(name, mode_check)
            return error_result(f"Permission denied: {mode_check.reason}")
    else:
        cascade_check = evaluate_permission(name, context.permissions)
        if cascade_check.action == "DENY":
            log_permission(name, cascade_check)
            return error_result(f"Permission denied: {cascade_check.reason}")

    # Permission granted, execute
    return await tool.implementation(parsed.data, context)
```

The order is fixed: bypass-immune checks first, then mode, then cascade. Each layer can short-circuit. If bypass-immune denies, mode and cascade never evaluate. This ordering ensures that the strongest constraints are always checked first.

## Session Grants

Users do not want to approve every tool call individually. After approving `write_file` once, they want it to run freely for the rest of the session.
Session grants provide this UX:

```python
class SessionGrantStore:
    grants: dict = {}  # tool_name -> GrantDecision

    function grant(self, tool_name: str, scope: str = "session"):
        self.grants[tool_name] = GrantDecision(
            action="ALLOW",
            scope=scope,
            granted_at=now()
        )

    function check(self, tool_name: str) -> Decision or None:
        grant = self.grants.get(tool_name)
        if grant is not None:
            return Decision(action="ALLOW", source="session", reason="session grant")
        return None
```

Session grants are the lowest priority in the cascade. They can be overridden by project policy, user settings, or any higher source. They expire when the session ends. They cannot override bypass-immune checks.

The typical UX flow: the model requests a tool call, the cascade returns ASK (no allow or deny rule matched), the user is prompted, the user approves and checks "allow for this session," and a session grant is recorded. Subsequent calls to the same tool skip the prompt.

## Related

- **[Safety and Permissions](/docs/safety-and-permissions).** The full permission architecture: six cascade sources, five permission modes, denial tracking with escalation thresholds, shadow rule detection, and multi-agent permission forwarding.
- **[Tool System](/docs/tool-system).** How tools carry permission metadata alongside their schema, and how behavioral flags like `is_destructive` feed into the permission cascade.
- **[Multi-Agent Coordination](/docs/multi-agent-coordination).** How permissions are forwarded when a coordinator spawns workers, and why workers inherit a restricted permission scope.

---

# Debug Your Agent

Source: https://claudepedia.dev/docs/debug-your-agent
Section: Guides

How to find out what your agent is doing wrong. Covers structured event logging, cost tracking, session tracing, and the signatures of common failure patterns.

Your agent is doing something wrong. Maybe it is calling the wrong tool. Maybe it is looping endlessly.
Maybe it produces a plausible-looking answer that is quietly incorrect. Standard debugging techniques (breakpoints, stack traces, print statements) are not enough. Agent failures are non-deterministic (the same input can produce different behavior), context-dependent (the failure depends on what happened five turns ago), and often silent (the agent does not crash, it just gives a wrong answer with confidence). Debugging an agent requires a different approach: structured observability that lets you reconstruct what happened, what it cost, and where time went. This guide builds that observability in three layers, then shows you the signatures of the most common failure patterns so you know what to look for. ## Layer 1: Structured Event Logging The first question when debugging is "what happened?" The answer should come from a structured event log: a timeline of named events with typed metadata that you can query after the fact. The key word is **structured**. A log line like `"Tool called successfully"` tells you nothing useful. A structured event like `tool_called duration_ms=450 success=true retry_count=0` tells you exactly what happened, how long it took, and whether it retried. 
The following sets up a structured event logger:

```python
class EventLogger:
    events: list = []

    function log(self, event_name: str, metadata: dict):
        entry = { timestamp: now_ms(), event: event_name, **metadata }
        self.events.append(entry)

    function query(self, event_name: str) -> list:
        return [e for e in self.events if e.event == event_name]

logger = EventLogger()
```

Wire the logger into every decision point in the agent loop:

```python
async function agent_loop_with_logging(question: str, tools: list) -> str:
    messages = [system_message(prompt), user_message(question)]
    logger.log("session_start", { turn_count: 0 })

    for turn in range(max_turns):
        logger.log("turn_start", { turn: turn, message_count: len(messages) })
        response = await llm.call(messages, tools=tools)
        logger.log("llm_response", {
            turn: turn,
            has_tool_calls: len(response.tool_calls) > 0,
            token_count: response.usage.total_tokens
        })

        if response.tool_calls is empty:
            logger.log("session_complete", { turns_used: turn + 1 })
            return response.text

        messages.append(response)  # commit the model response before tool results
        for call in response.tool_calls:
            start = now_ms()
            result = await dispatch_tool(call.name, call.args)
            logger.log("tool_called", {
                turn: turn,
                tool: call.name,
                duration_ms: now_ms() - start,
                success: not result.is_error,
                result_size: len(str(result))
            })
            messages.append(tool_result(call.id, result))

    logger.log("session_timeout", { turns_used: max_turns })
    raise RuntimeError("agent exceeded max_turns")
```

Every significant event (session start, turn start, LLM response, tool call, session end) gets a structured log entry with typed metadata. When debugging, you can query the log to answer specific questions: "Which tool took the longest?" "How many tokens did turn 5 use?" "Did any tool call fail?"

> **Tip:** Log the model's tool selection reasoning alongside the call when available. When the agent picks the wrong tool, the reasoning tells you why, and the fix is usually in the tool schema descriptions, not in the tool implementation.
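The pseudocode above maps almost directly onto runnable Python. A minimal sketch (the class shape and event names mirror the examples in this guide; the `timestamp_ms` field stands in for `now_ms()`):

```python
import time

class EventLogger:
    """A minimal structured event log: named events with typed metadata."""

    def __init__(self):
        self.events = []

    def log(self, event_name, **metadata):
        self.events.append({
            "timestamp_ms": int(time.time() * 1000),  # wall-clock timestamp
            "event": event_name,
            **metadata,
        })

    def query(self, event_name):
        return [e for e in self.events if e["event"] == event_name]

logger = EventLogger()
logger.log("tool_called", tool="read_file", duration_ms=450, success=True)
logger.log("tool_called", tool="search_files", duration_ms=1200, success=False)
logger.log("turn_start", turn=0, message_count=2)

# Post-hoc queries: which calls failed, and which was slowest?
failed = [e for e in logger.query("tool_called") if not e["success"]]
slowest = max(logger.query("tool_called"), key=lambda e: e["duration_ms"])
```

Because every entry carries typed fields rather than a formatted string, questions like "which tool took the longest?" become one-line queries instead of log-grepping.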
## Layer 2: Cost Tracking

Agents are expensive. Each LLM call costs tokens, and tokens cost money. Cost tracking serves two purposes: real-time budget awareness during a session, and post-session analysis to understand where money went. The following tracks costs at three scopes (per call, per session, and per model):

```python
class CostTracker:
    total_cost_usd: float = 0.0
    total_input_tokens: int = 0
    total_output_tokens: int = 0
    total_cache_read_tokens: int = 0
    per_model: dict = {}
    per_turn: list = []

    function record(self, model: str, usage: TokenUsage):
        cost = calculate_cost(model, usage)
        self.total_cost_usd += cost
        self.total_input_tokens += usage.input_tokens
        self.total_output_tokens += usage.output_tokens
        self.total_cache_read_tokens += usage.cache_read_tokens

        if model not in self.per_model:
            self.per_model[model] = ModelCost(cost=0.0, calls=0)
        self.per_model[model].cost += cost
        self.per_model[model].calls += 1

        self.per_turn.append(TurnCost(model=model, cost=cost, tokens=usage))

    function is_runaway(self) -> bool:
        if len(self.per_turn) < 5:
            return False
        recent = self.per_turn[-5:]
        # Runaway signal: cost per turn is increasing consistently
        return all(recent[i].cost > recent[i-1].cost for i in range(1, len(recent)))
```

The `is_runaway` method detects a specific failure pattern: the agent is in a loop where each iteration is more expensive than the last. This happens when the agent keeps accumulating context (tool results, conversation turns) without compacting, and each subsequent LLM call processes a larger input. An increasing cost trajectory over five consecutive turns is a strong signal that something is wrong.

**Track the three token types separately.** Input tokens and output tokens are billed at standard rates. Cache-read tokens (served from the provider's prompt cache) are billed at a fraction of the input rate. If you aggregate all tokens into a single counter, you cannot tell whether prompt caching is working.
A high ratio of cache-read tokens to input tokens means caching is effective. A low ratio means you are paying full price for prompts that should be cached. ## Layer 3: Session Tracing Event logging tells you what happened. Cost tracking tells you what it cost. Tracing tells you where time went and how operations relate to each other. A trace is a tree of spans, each representing a timed operation: ```python session trace |-- llm_request (2.3s) |-- tool: search_files (0.4s) |-- llm_request (1.8s) |-- tool: read_file (0.1s) |-- tool: write_file (0.2s) | |-- permission_check (0.05s) |-- llm_request (1.5s) ``` The following implements trace context propagation using async-local storage: ```python class TraceContext: spans: list = [] active_span: Span or None = None function start_span(self, name: str, parent: Span or None = None) -> Span: span = Span( name=name, start_time=now_ms(), parent=parent or self.active_span, children=[] ) if span.parent: span.parent.children.append(span) self.spans.append(span) self.active_span = span return span function end_span(self, span: Span): span.end_time = now_ms() span.duration_ms = span.end_time - span.start_time self.active_span = span.parent ``` The trace context propagates automatically through async operations. When the agent loop starts a `tool` span, any sub-operations (permission checks, file I/O) automatically become children of that span. The resulting trace tree shows exactly where time was spent and how operations nested. Use traces to answer questions like: "Why did turn 3 take 8 seconds?" (the tool call inside it hit a timeout and retried). "Why is the session so expensive?" (one LLM call is processing 100K tokens because context was not compacted). ## What Are the Most Common Agent Failure Patterns? 
Once you have structured logs, cost tracking, and traces, you can identify the most common agent failure patterns by their signatures: **Infinite tool loop.** The agent calls the same tool repeatedly with the same or similar arguments. Signature in logs: the same tool name appears in consecutive turns, the cost tracker shows an increasing trajectory, and the turn count approaches `max_turns`. ```python # Signature: same tool called repeatedly turn 12: tool=search_files args={pattern: "config.yml"} turn 13: tool=search_files args={pattern: "config.yml"} turn 14: tool=search_files args={pattern: "config.yaml"} turn 15: tool=search_files args={pattern: "config.yml"} ``` The fix is usually in the tool's error response. If the tool returns an unhelpful error ("not found"), the model retries. Return a specific, actionable error ("no file matching 'config.yml' exists in /project. Available config files: settings.json, env.toml") and the model moves on. **Context overflow.** The agent's responses degrade: shorter, less accurate, ignoring earlier instructions. Signature: the `message_count` in turn logs is high (30+), the `token_count` per turn is near the context window limit, and the `result_size` of tool calls is large (10K+ characters). The fix is compaction. See [Manage Conversation Context](/docs/manage-conversation-context) for the compaction pipeline. **Wrong tool selection.** The agent picks a tool that does not match the task. Signature: a tool call fails with a semantic error (not a schema error), or the tool succeeds but the result is irrelevant to the conversation. Look at the tool schema descriptions. If the description is vague, the model cannot distinguish between similar tools. The fix is better schema descriptions. Add explicit guidance on when to use and when not to use each tool. The model reads descriptions to make selection decisions. **Silent wrong answers.** The agent produces a confident answer that is factually incorrect. 
This is the hardest failure to detect because there is no error signal. The signature is absence: no tool errors, no timeout, no loop, just a wrong answer delivered smoothly. The fix requires verification hooks or human review. Add a post-completion check that validates the answer against the tool results that were used to produce it. If the agent claims "the file contains X" but the `read_file` result shows something different, flag the discrepancy. ## Related - **[Observability and Debugging](/docs/observability-and-debugging).** The full observability architecture: sink queue pattern for event logging, metadata type restrictions for PII safety, per-model cost tracking with cache separation, span hierarchy with async-local propagation, and orphan span cleanup. - **[Error Recovery](/docs/error-recovery).** Once you have identified the failure, the escalation ladder (retry, fallback, degrade, fail) tells you what to do next. - **[Agent Loop](/docs/agent-loop).** Understanding the loop lifecycle helps you interpret traces and identify which phase of the loop is causing problems. --- # Stream Agent Responses Source: https://claudepedia.dev/docs/stream-agent-responses Section: Guides How to add streaming to your agent loop so users see output as it generates. Covers typed events, the producer-consumer pipeline, and backpressure handling. A user asks your agent a question. The agent makes an LLM call, waits for the full response, dispatches a tool, waits again, makes another LLM call, waits again, and finally returns the answer. The user stares at a blank screen for 15 seconds. They wonder if the agent is stuck. They consider reloading the page. Streaming fixes this. Instead of waiting for the complete response, the agent yields events as they happen: text tokens as they generate, tool calls as they dispatch, results as they arrive. The user sees text appearing, tools being called, progress being made. 
Perceived latency drops from seconds to milliseconds, and the user can read the answer while it is still being generated. But naive streaming (printing raw tokens to the screen) misses the architectural opportunity. A well-designed event system makes agent output composable, observable, and safe under load. This guide builds that system: a typed event model, a streaming agent loop, a consumer that renders events progressively, and backpressure handling for when the consumer cannot keep up. ## The Event Model The first decision is what to stream. Raw text chunks are not enough. The consumer needs to know whether it is receiving agent text, a tool call notification, a tool result, or a completion signal. Without this distinction, the consumer has to guess what each chunk means, and guessing leads to rendering bugs. The solution is a **typed event model**: a set of named event types that the agent loop produces and consumers handle: ```python type AgentEvent = | TextDelta { turn_id: str, text: str } | ToolDispatch { turn_id: str, tool: str, args: dict } | ToolResult { turn_id: str, tool: str, result: str, success: bool } | Complete { turn_id: str, final_text: str } | ErrorEvent { turn_id: str, error: str } ``` Each event carries a `turn_id` that groups events from the same agent turn. A consumer can use the `turn_id` to associate a `ToolResult` with the `ToolDispatch` that started it, or to know which `TextDelta` events belong to the final answer versus an intermediate reasoning step. The event model makes streaming **composable**. Any number of consumers can subscribe to the same event stream: a UI renderer, a logging system, a supervisor agent. Each consumer handles the event types it cares about and ignores the rest. Adding a new consumer requires zero changes to the agent loop. ## Modify the Agent Loop The standard agent loop from the quickstart returns a string. 
A streaming agent loop returns an async generator that yields events: ```python async function* streaming_agent_loop(question: str, tools: list) -> AsyncIterator[AgentEvent]: messages = [system_message(prompt), user_message(question)] for turn in range(max_turns): turn_id = generate_id() # Stream the LLM response: yield text deltas as they arrive full_response = empty_response() async for chunk in llm.stream(messages, tools=tools): if chunk.text: yield TextDelta(turn_id=turn_id, text=chunk.text) full_response = merge(full_response, chunk) # Check termination if full_response.tool_calls is empty: yield Complete(turn_id=turn_id, final_text=full_response.text) return # Dispatch tools and yield events for each messages.append(full_response) for call in full_response.tool_calls: yield ToolDispatch(turn_id=turn_id, tool=call.name, args=call.args) result = await dispatch_tool(call.name, call.args) yield ToolResult( turn_id=turn_id, tool=call.name, result=truncate(str(result), max_chars=500), success=not result.is_error ) messages.append(tool_result_message(call.id, result)) yield ErrorEvent(turn_id="overflow", error="agent exceeded max_turns") ``` Two changes from the standard loop: **`llm.stream()` instead of `llm.call()`.** The streaming API returns an async iterator of chunks instead of a complete response. Each chunk may contain a text fragment, and we yield a `TextDelta` event for each one. The chunks are also merged into `full_response` so we can check for tool calls after the stream completes. **`yield` instead of `return`.** The function is an async generator (`async function*`). It yields events throughout execution and only returns (implicitly, at the end) when the agent completes. The caller consumes events as they arrive rather than waiting for the final result. ## Build a Consumer A consumer processes the event stream and does something useful with each event: rendering to a UI, logging to disk, or forwarding over a network connection. 
The following renders events to a terminal UI: ```python async function render_to_terminal(event_stream: AsyncIterator[AgentEvent]): async for event in event_stream: match event: TextDelta: terminal.write(event.text) # append text as it arrives ToolDispatch: terminal.write_line(f"\n Calling {event.tool}...") ToolResult: if event.success: terminal.write_line(f" {event.tool} completed.") else: terminal.write_line(f" {event.tool} failed.") Complete: terminal.write_line("\n---\nDone.") ErrorEvent: terminal.write_line(f"\nError: {event.error}") ``` The consumer does not know how the events were produced. It does not know about the agent loop, the LLM, or the tools. It only knows the event types and what to do with each one. This decoupling is the value of the typed event model. You can have multiple consumers processing the same stream. A logging consumer records every event to disk. A supervisor consumer watches for anomalies (too many tool calls, increasing cost). A network consumer serializes events to a WebSocket or SSE connection. Each consumer subscribes independently: ```python async function fan_out(event_stream: AsyncIterator[AgentEvent], consumers: list): async for event in event_stream: for consumer in consumers: await consumer.handle(event) ``` ## Handle Backpressure The producer (agent loop) can generate events faster than the consumer can process them. A `TextDelta` event arrives every few milliseconds during streaming. A slow UI renderer or a network consumer with latency can fall behind. This is **backpressure**, the consumer pushing back against a fast producer. Three strategies exist: **No buffer (blocking producer).** The generator suspends at each `yield` until the consumer calls `next()`. The producer never runs ahead of the consumer. Maximum safety, but the producer is throttled to the consumer's speed. **Bounded buffer.** The producer runs ahead up to N events, then blocks. 
This absorbs consumer jitter (a consumer that processes events in bursts). The buffer size is the explicit trade-off: larger means smoother throughput, more memory use, and more events potentially lost if the consumer crashes.

**Unbounded buffer.** The producer never blocks. All events are queued immediately. Maximum throughput but unbounded memory use. Safe only when the consumer is reliably faster than the producer.

For UI streaming, the standard choice is a **bounded buffer**:

```python
async function* buffered_consumer(event_stream: AsyncIterator[AgentEvent], buffer_size: int = 20):
    buffer = AsyncQueue(maxsize=buffer_size)

    async function fill():
        async for event in event_stream:
            await buffer.put(event)  # blocks if buffer is full
        await buffer.put(SENTINEL)

    spawn_background(fill)

    while True:
        event = await buffer.get()
        if event is SENTINEL:
            return
        yield event
```

A buffer of 20 events handles the bursty render patterns of a typical UI without risking memory exhaustion. The producer blocks if it generates more than 20 events before the consumer processes any, which means a slow consumer throttles the producer naturally instead of letting the buffer grow without bound.

> **Tip:** Buffer `ToolDispatch` events until the corresponding `ToolResult` arrives. Showing "calling search_files..." and then immediately "search_files failed" is worse UX than waiting a moment and showing the complete outcome. Batch tool lifecycle events in the consumer, not the producer.

## Progressive Disclosure

Streaming is not just about speed. It is about building trust. A user who can see the agent working trusts the agent more than one who sees a loading spinner. Progressive disclosure means showing useful intermediate state:

- **Text tokens as they generate.** The user starts reading before the response is complete.
- **Tool calls as they dispatch.** The user sees which tools the agent is using and can assess whether the approach is reasonable.
- **Partial results before completion.** If the agent is synthesizing from multiple sources, show each source's contribution as it arrives. The agent loop already yields events at the right granularity. The consumer decides how to render them progressively. A simple consumer might show all events in order. A sophisticated consumer might group tool events, batch rapid `TextDelta` events for smoother rendering, and show a summary line for each completed tool rather than the full result. ## Error Events Errors during streaming must not break the event contract. If the LLM call fails or a tool throws an exception, the consumer needs to know, but it should receive an `ErrorEvent`, not a raw exception that terminates the stream. The streaming agent loop wraps errors and yields them as events: ```python async function* safe_streaming_loop(question: str, tools: list) -> AsyncIterator[AgentEvent]: try: async for event in streaming_agent_loop(question, tools): yield event except LLMError as error: yield ErrorEvent(turn_id="error", error=f"LLM call failed: {error}") except ToolError as error: yield ErrorEvent(turn_id="error", error=f"Tool error: {error}") except Exception as error: yield ErrorEvent(turn_id="error", error=f"Unexpected error: {error}") ``` The consumer handles `ErrorEvent` like any other event type: displaying an error message, logging the failure, or triggering a retry. The stream contract is preserved regardless of what goes wrong inside the agent loop. ## Related - **[Streaming and Events](/docs/streaming-and-events).** The full streaming architecture: the complete event type system, priority-based dispatch for terminal UIs, capture/bubble phases, screen-diffing output models, and the generator connection pattern. - **[Agent Loop](/docs/agent-loop).** The base loop pattern that the streaming loop modifies. Understanding how `llm.call()` becomes `llm.stream()` and why the loop structure stays the same. 
- **[Tool System](/docs/tool-system).** How tool dispatch integrates with the streaming loop, and why tool results are yielded as events. --- # Extend with Hooks Source: https://claudepedia.dev/docs/extend-with-hooks Section: Guides How to add lifecycle hooks to your agent without modifying core loop code. Covers the four execution modes, condition-based filtering, and error isolation. You need to add audit logging to every file-write tool call. Or validate tool arguments against a security policy before execution. Or send a webhook notification every time the agent completes a task. The naive approach is to add this logic directly into each tool's implementation: an `if` block here, a logging call there, a webhook POST at the end. That approach does not scale. When you have 20 tools and want to add audit logging to all of them, you edit 20 files. When the audit format changes, you edit 20 files again. When you want to add a new cross-cutting concern (cost tracking, content moderation, input sanitization), you edit all 20 files a third time. The core loop and tool implementations become entangled with concerns they should not know about. Hooks solve this by providing **lifecycle extension points**: named events that fire at defined moments in the agent's execution. You register a hook for an event, optionally with a condition, and the hook runner invokes it at the right time. The agent loop does not know your hook exists. The tools do not know your hook exists. You add observability, validation, and transformation without touching a single line of existing code. ## The Four Execution Modes Every hook has an execution mode that determines how it runs. The mode is the primary design decision. It sets the cost, capability, and latency of your hook: **Command.** Runs a shell script or binary. Fast, deterministic, and cheap. Use for validation logic you can express in a script: linting, format checking, file existence checks, policy enforcement via external tools. 
**Prompt.** Makes a single LLM call with a small, fast model. Use for classification that requires language understanding: "Is this tool input requesting a destructive operation?" "Does this user prompt contain sensitive information?" The LLM call adds latency but handles cases that string matching cannot. **Agent.** Spawns a full multi-turn sub-agent with access to tools. Use for complex verification that requires exploration: "Read the test output and verify that all tests pass." "Check that the implementation matches the specification." Expensive, so use sparingly. **HTTP.** Sends an HTTP POST to an external endpoint. Use for webhooks, audit logs, and third-party integrations. Can run asynchronously (fire-and-forget) to avoid adding latency to the agent loop. The following shows how mode selection maps to common use cases: ```python # Command mode: validate TypeScript files with a linter before write hook_config = { event: "PreToolUse", matcher: "Write", mode: "command", command: "lint-check --file $tool_input_path", condition: "Write(src/**/*.ts)", timeout: 10 } # Prompt mode: classify user input for content moderation hook_config = { event: "UserPromptSubmit", mode: "prompt", prompt: "Is this user input requesting something harmful? Respond YES or NO.", timeout: 5 } # HTTP mode: send audit event on every tool completion (fire-and-forget) hook_config = { event: "PostToolUse", mode: "http", url: "http://localhost:9000/audit", async_mode: True # fire-and-forget, do not block the agent } ``` > **Tip:** Start with `command` mode. If your validation logic can be expressed as a script or CLI tool, it runs faster and more predictably than an LLM call. Reach for `prompt` and `agent` modes only when you need language understanding or multi-step reasoning. ## Register Hooks A hook declaration binds an execution mode to a lifecycle event. The event determines when the hook fires. The optional condition determines whether it fires for this specific invocation. 
The key lifecycle events: - **PreToolUse.** Fires before a tool executes. Can modify arguments or block execution. - **PostToolUse.** Fires after a tool succeeds. Can modify or augment the output. - **Stop.** Fires when the agent decides to stop. Use for end-of-task verification. - **SessionStart / SessionEnd.** Fire at session boundaries. Use for setup and cleanup. - **UserPromptSubmit.** Fires before the user's prompt reaches the agent. Use for input validation. The following registers a pre/post hook pair that audits file writes: ```python class HookRegistry: hooks: dict = {} # event_name -> list of hook configs function register(self, event: str, hook: HookConfig): if event not in self.hooks: self.hooks[event] = [] self.hooks[event].append(hook) function get_hooks(self, event: str) -> list: return self.hooks.get(event, []) registry = HookRegistry() # Pre-hook: log the write attempt before it happens registry.register("PreToolUse", HookConfig( matcher="Write", mode="command", command="echo 'AUDIT: Write attempt to $tool_input_path' >> /var/log/agent-audit.log", condition="Write(src/**/*)" )) # Post-hook: verify the written file is valid registry.register("PostToolUse", HookConfig( matcher="Write", mode="command", command="file-validator $tool_input_path", condition="Write(src/**/*.ts)", timeout=5 )) ``` ## Condition Syntax The condition field gates when a hook fires. Without a condition, the hook fires for every invocation of the matched event. With a condition, it fires only when the condition matches. Conditions use a simple pattern syntax: ```python # Match any Write to TypeScript files in src/ condition: "Write(src/**/*.ts)" # Match any Bash command starting with 'git push' condition: "Bash(git push*)" # No condition: matches every invocation of the matched event condition: None ``` The condition is evaluated before the hook executor is spawned. A non-matching condition means no process, no LLM call, no HTTP request. 
This makes conditions a cost-saving mechanism. A `PreToolUse` hook that fires for every Write call to every file wastes resources on build artifacts, logs, and temporary files that you do not care about. Write the most specific condition you can. ## Error Isolation A hook that crashes, times out, or produces an error must not crash the main agent loop. Error isolation is the property that makes hooks safe to add in production. You can register a hook that occasionally fails without worrying about destabilizing the agent. Every hook execution produces one of four outcomes: - **Success.** Hook ran and returned a result. - **Blocking.** Hook explicitly blocked the operation (e.g., a pre-tool hook that rejects the arguments). - **Non-blocking error.** Hook failed but execution continues. The error is logged. - **Cancelled.** Hook was aborted (timeout expired, parent operation cancelled). The hook runner aggregates results from multiple hooks on the same event: ```python async function run_hooks(event: str, context: HookContext) -> HookResult: hooks = registry.get_hooks(event) results = [] for hook in hooks: # Check condition first, skip if no match if hook.condition and not matches(hook.condition, context): continue try: result = await execute_hook(hook, context, timeout=hook.timeout) results.append(result) except TimeoutError: results.append(HookOutcome(status="cancelled", hook=hook.name)) except Exception as error: results.append(HookOutcome(status="non_blocking_error", error=str(error))) # Aggregation: any blocking result blocks the operation blocking = [r for r in results if r.status == "blocking"] if blocking: return HookResult(blocked=True, reasons=[r.reason for r in blocking]) # Non-blocking errors are logged, execution continues errors = [r for r in results if r.status == "non_blocking_error"] if errors: for error in errors: log_warning(f"Hook error (non-blocking): {error}") return HookResult(blocked=False) ``` Two aggregation rules: **A blocking result from 
any hook blocks the entire operation.** If one pre-tool hook says "this argument is dangerous," the tool does not run, regardless of what other hooks returned. Blocking is a veto, not a vote. **Non-blocking errors are logged but do not stop execution.** A hook that occasionally times out is not a production incident. It logs, and the agent continues. This safety property is what makes it reasonable to use `prompt` and `agent` hooks in production paths where the underlying LLM might occasionally be slow. ## Wire Into the Agent Loop The hook runner integrates into the dispatch pipeline at two points: before tool execution (PreToolUse) and after (PostToolUse): ```python async function dispatch_with_hooks(name: str, args: dict, context: ToolContext) -> ToolResult: tool = registry.get(name) # Pre-tool hooks: can block execution or modify arguments pre_result = await run_hooks("PreToolUse", HookContext( tool_name=name, tool_input=args, context=context )) if pre_result.blocked: return error_result(f"Blocked by hook: {pre_result.reasons}") # Execute the tool result = await tool.implementation(args, context) # Post-tool hooks: can modify output (fire-and-forget for async hooks) await run_hooks("PostToolUse", HookContext( tool_name=name, tool_input=args, tool_output=result, context=context )) return result ``` The dispatch function does not know what the hooks do. It knows when to call the hook runner (before and after execution) and how to handle the result (block if blocked, continue otherwise). Adding a new hook means adding a registry entry. The dispatch logic is unchanged. 
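The two aggregation rules and the error-isolation guarantee are easy to verify in miniature. A runnable sketch (the hook callables and the simplified `HookOutcome` shape are illustrative stand-ins for the configs above):

```python
from dataclasses import dataclass, field

@dataclass
class HookOutcome:
    status: str          # "success" | "blocking" | "non_blocking_error"
    reason: str = ""

@dataclass
class HookRegistry:
    hooks: dict = field(default_factory=dict)  # event name -> list of callables

    def register(self, event, fn):
        self.hooks.setdefault(event, []).append(fn)

def run_hooks(registry, event, context):
    """Run every hook for an event. Any blocking outcome vetoes the
    operation; a crashing hook becomes a non-blocking error."""
    outcomes = []
    for hook in registry.hooks.get(event, []):
        try:
            outcomes.append(hook(context))
        except Exception as error:  # error isolation: hook crash != agent crash
            outcomes.append(HookOutcome("non_blocking_error", str(error)))
    blocking = [o for o in outcomes if o.status == "blocking"]
    return {"blocked": bool(blocking), "reasons": [o.reason for o in blocking]}

registry = HookRegistry()
registry.register("PreToolUse", lambda ctx: HookOutcome("success"))
registry.register(
    "PreToolUse",
    lambda ctx: HookOutcome("blocking", "write outside the workspace")
    if ctx["path"].startswith("/etc") else HookOutcome("success"),
)
registry.register("PreToolUse", lambda ctx: 1 / 0)  # a hook that always crashes

denied = run_hooks(registry, "PreToolUse", {"path": "/etc/passwd"})
allowed = run_hooks(registry, "PreToolUse", {"path": "/project/src/app.ts"})
```

Note that the crashing hook never surfaces as an exception to the caller: it is recorded as a non-blocking error and execution continues, which is exactly the property that makes hooks safe to ship incrementally.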
## A Complete Example: Audit Hook Here is a complete audit hook that logs every file-write operation to an external service: ```python # Register an audit hook for all Write operations registry.register("PostToolUse", HookConfig( matcher="Write", mode="http", url="http://localhost:9000/audit", async_mode=True, # fire-and-forget payload_template={ "event": "file_written", "path": "$tool_input_path", "timestamp": "$timestamp", "agent_session": "$session_id" } )) ``` This hook fires after every successful Write operation, sends a JSON payload to the audit service, and does not wait for a response (fire-and-forget). If the audit service is down, the hook produces a non-blocking error (logged, but the agent continues writing files). The audit hook adds observability without adding latency or fragility. > **Tip:** Keep hooks stateless. If a hook needs state across invocations (counting calls, accumulating data), use a separate store rather than storing state in the hook itself. Hooks that accumulate state become invisible sources of bugs. The state is not visible in the hook configuration and not tracked by the hook runner. ## Related - **[Hooks and Extensions](/docs/hooks-and-extensions).** The full hook system: 27+ lifecycle events, the hook response protocol, condition syntax matching structure, SSRF protection for HTTP hooks, and agent hook sandboxing. - **[Tool System](/docs/tool-system).** Hooks intercept the tool dispatch pipeline. Understanding dispatch is prerequisite to understanding where hooks fire. - **[Command and Plugin Systems](/docs/command-and-plugin-systems).** Hooks integrate with the command registry, and the same condition syntax applies to both. --- # Agent Loop Architecture Source: https://claudepedia.dev/docs/agent-loop Section: Patterns / core-systems How the core two-state machine drives every agent system, from the generator pattern through graceful cancellation, termination strategies, and error recovery. 
The agent loop is what makes an LLM *do things* instead of just talk. Each turn, the model decides: call a tool, or stop. That decision, and the code that enforces it, is the beating heart of every agent system. Without it, you have a chatbot. With it, you have an agent that can read files, call APIs, run computations, and deliver results the model couldn't produce on its own. Understanding the agent loop isn't optional. It's the foundation everything else rests on. Every other pattern in this knowledge base (tool dispatch, context compaction, multi-agent coordination) only makes sense once you have a clear mental model of how the loop works and why it's structured the way it is. The agent loop is a **two-state machine** with exactly two states: 1. **Awaiting model response**: we send the full message history to the model and wait. 2. **Dispatching tools**: the model returned tool calls, so we run them and append the results. If the model returns tool calls, we stay in the loop. If it returns no tool calls, we exit. That's the whole model. ```mermaid flowchart TD A[Start Turn] --> B[Send messages to LLM] B --> C{Response contains tool calls?} C -->|Yes - stay in loop| D[Dispatch tool calls] D --> E[Append results to messages] E --> B C -->|No - done| F[Return final response] ``` This structure makes tool-using agents deterministic to reason about. The model's intent (call a tool or stop) is explicit in every response. There's no hidden state, no implicit routing. Just two alternating phases driven by what the model decides to do next. Think of the message list as a **shared ledger** between you and the model. Every turn, the model reads the full ledger and writes its response. When we run tools, we add the results to the ledger before the next turn. The model can only act on what it can see, so the ledger is the agent's entire working memory. ## The Generator Pattern The right implementation structure for an agent loop is an **async generator**. 
Generators yield each completed turn as it happens, letting callers stream progress without coupling to the loop's internals. A caller can process each turn however it wants (stream events to a UI, wire in a supervisor, or just collect results) without the loop changing at all. Here is a minimal but complete agent loop in pseudocode:

```python
async def agent_loop(messages: list[Message],
                     tools: dict[str, Tool],
                     max_turns: int = 20) -> AsyncIterator[Response]:
    try:
        for turn_number in range(max_turns):
            response = await llm.call(messages, available_tools=tools)
            yield response  # stream progress: caller observes each turn
            if not response.tool_calls:
                return  # model is done, no tool calls means final answer
            messages.append(response)  # commit model response before dispatching
            for call in response.tool_calls:
                result = await tools[call.name].run(call.arguments)
                messages.append(tool_result(id=call.id, content=result))
        raise LoopError("exceeded max_turns without completion")
    finally:
        await cleanup()  # always runs, regardless of how the caller exits
```

Three decisions embedded in this snippet are worth making explicit:

**Why yield?** Yielding each `response` means callers can observe the agent's reasoning as it unfolds, not just the final answer. This is how you build streaming UIs, per-turn logging, and supervisor agents that intervene mid-run. If we returned only the final response, all intermediate state would be invisible. Generators compose naturally: a caller can iterate over turns without the loop caring how they're consumed.

**Why `max_turns` is a correctness requirement, not a preference.** Without a hard limit, a tool that keeps returning data prompting more tool calls will run forever. The limit is the circuit breaker. Twenty turns is generous for most tasks. When an agent exceeds it, that usually signals a task design problem, not a reason to raise the limit.
**Why append the response before dispatching tools.** The model's response must be in the message history before we add tool results. If we skip this step, the model's next turn sees tool results with no preceding assistant message, which most providers reject as a malformed conversation. Order of operations is a correctness invariant, not a convention.

**Generator vs. callback.** Generators compose naturally: one generator can yield from another, building pipelines of agents without coupling layers to each other. Callbacks scatter event handling across multiple sites and make error propagation fragile. The trade-off is that async generator semantics are less familiar than callbacks. Stack traces are less obvious, and the generator lifecycle has a subtle edge case: generators don't guarantee finalization on garbage collection. Always use `try/finally` inside the generator body to ensure cleanup runs regardless of how the caller exits.

**`try/finally` cleanup ordering** matters more in async generators than in regular functions because the generator may be abandoned mid-execution. The caller can break out of the iteration loop at any point. The `finally` block runs when the caller explicitly closes the generator, and only best-effort at garbage collection, since async generator finalization at collection time is not guaranteed. Without it, database connections, file handles, and abort controllers leak. The rule: if your generator opens a resource, the `finally` closes it, unconditionally.

## The State Struct Pattern

Production agent loops don't use simple iteration variables. They carry all mutable state in a **typed struct** that gets replaced wholesale at every continue site. This is the key insight that separates a loop that's easy to reason about from one that accumulates subtle bugs. Each iteration destructures the current state at the top, making all names read-only within the turn. When the loop needs to continue, it constructs a fresh state object with the updated values and a typed reason for why the loop continued.
No mutation happens inside a turn. Every state change is explicit. ```python type LoopState = { messages: list[Message] turn_count: int recovery_count: int transition: { reason: str } | None } state = LoopState(messages=initial_messages, turn_count=0, recovery_count=0, transition=None) while True: messages, turn_count, recovery_count = state.messages, state.turn_count, state.recovery_count response = await call_model(messages) if response.is_max_output_tokens and recovery_count < MAX_RECOVERY: state = LoopState( messages=[*messages, *response.messages, continuation_prompt()], turn_count=turn_count, recovery_count=recovery_count + 1, transition={ "reason": "max_output_tokens_recovery" } ) continue if not response.has_tool_calls: return Terminal(reason="completed") results = await dispatch_tools(response.tool_calls) if turn_count + 1 > max_turns: return Terminal(reason="max_turns") state = LoopState( messages=[*messages, *response.messages, *results], turn_count=turn_count + 1, recovery_count=0, transition={ "reason": "next_turn" } ) ``` The benefits compound: - **Avoids stack growth.** A loop that recurses (calling itself for each turn) accumulates stack frames across many turns. The iterative state struct pattern uses constant stack depth regardless of turn count. - **Makes continuation reasons auditable.** When you log `state.transition.reason`, you know exactly why the loop continued: `next_turn`, `max_output_tokens_recovery`, `stop_hook_blocking`, `reactive_compact_retry`. Debugging a misbehaving loop becomes a matter of reading the log, not tracing execution. - **Prevents accidental state bleed.** If you mutate shared variables across iterations, a bug in one turn silently infects the next. Wholesale replacement makes state boundaries explicit. The `recovery_count` field is a good example of why this matters: it tracks how many max-output-tokens recoveries have happened *in a row*. When the loop continues normally (`next_turn`), it resets to zero. 
When it continues because of token overflow, it increments. You can't track this correctly with simple mutation across iterations.

## Graceful Cancellation

Cancellation is not a cleanup problem. It's a **message history correctness problem**. When the abort signal fires, any `tool_use` blocks that were emitted in the assistant message but haven't yet received a `tool_result` leave the conversation in a malformed state. Most API providers reject subsequent calls with a 400 error if there are unmatched `tool_use`/`tool_result` pairs. The loop's abort handler must emit **synthetic error tool results** for every outstanding tool call before exiting.

There are three distinct points in the loop where the abort signal can fire, each requiring different cleanup:

**Path A: mid-streaming, before any tool_use blocks arrive.** The model's response stream is in progress. No tool_use blocks have been seen yet. The loop can exit cleanly because there are no outstanding tool calls to match. Yield an interruption signal and return.

**Path B: streaming complete, tool_use blocks collected, before tool dispatch.** The model's response is fully collected and contains tool_use blocks, but tool execution hasn't started. Every tool_use block must receive a synthetic error tool_result before the loop exits.

**Path C: mid-tool-execution.** Some tools have started running. Some tool_use blocks have received results, and others haven't. The dispatcher must emit synthetic error results for the in-progress tools that didn't complete.

```python
async def run_loop(messages: list[Message], abort_signal: AbortSignal) -> Terminal:
    pending_tool_uses: set[str] = set()
    while True:
        # Path A check: before starting a new API call
        if abort_signal.aborted:
            emit_missing_tool_results(pending_tool_uses, "Interrupted by user")
            return Terminal(reason="aborted_streaming")

        async for event in call_model_streaming(messages, signal=abort_signal):
            if event.type == "tool_use":
                pending_tool_uses.add(event.id)
            # ... collect message blocks

        # Path B check: after streaming, before tool dispatch
        if abort_signal.aborted:
            emit_missing_tool_results(pending_tool_uses, "Interrupted by user")
            return Terminal(reason="aborted_streaming")

        for call_id in list(pending_tool_uses):  # copy: don't mutate the set while iterating
            result = await run_tool(call_id, signal=abort_signal)
            pending_tool_uses.discard(call_id)
            # Each tool's run() checks abort_signal and returns synthetic error if aborted

        # Path C check: after tool dispatch
        if abort_signal.aborted:
            emit_missing_tool_results(pending_tool_uses, "Interrupted mid-execution")
            return Terminal(reason="aborted_tools")
```

The `emit_missing_tool_results` helper is the critical piece. It takes the set of tool_use IDs that haven't received results and emits a `tool_result` with `is_error: true` for each one. Without this, the next API call will fail with a validation error, leaving the user to see a cryptic error instead of a clean abort.

**Pass the abort signal everywhere.** The signal should flow into every async operation the loop initiates: the model API call, each tool's execution, and any hook invocations. If you only check it at the top of each iteration, you get coarse-grained cancellation where the loop finishes the current turn before responding to an abort. For truly responsive cancellation, every awaited operation needs to observe the signal.

## Termination Strategies

The loop terminates in more ways than `max_turns`. Understanding all the paths, and what state they leave behind, is essential for building loops that behave predictably under all conditions.

**Normal completion.** The model returns a response with no tool calls. Yield the final response and return cleanly. Turn count can be anywhere from 1 to `max_turns`.

**max_turns exceeded.** The turn counter reaches the limit. Return a `Terminal` with reason `max_turns`. The message history is complete and consistent, with all tool_use/tool_result pairs matched. The model simply didn't finish within the budget.
**Aborted streaming.** The abort signal fired before or during the model's response stream. Return `Terminal(reason='aborted_streaming')`. Synthetic tool results may have been emitted, and the history is well-formed. **Aborted after tools.** The abort signal fired after tool dispatch completed but before the next model call. Return `Terminal(reason='aborted_tools')`. The history includes all tool results from this turn. **Stop hook prevented continuation.** A stop hook ran and signaled that the loop should not continue. This is a deliberate external halt: the hook has evaluated the assistant's response and decided the agent is done (or has exceeded a policy limit). Return `Terminal(reason='stop_hook_prevented')`. **Stop hook blocking re-entry.** A stop hook returned a `blockingError`, a user message that should be injected into the conversation and trigger another turn. The loop does *not* terminate. Instead, it re-enters with the injected message appended. The transition reason is `stop_hook_blocking`. ```python # After stop hooks complete if stop_hook_result.blocking_errors: state = LoopState( messages=[*messages, *stop_hook_result.blocking_errors], turn_count=turn_count, # don't increment: this is a re-injection, not a new turn recovery_count=recovery_count, # preserve: see Production Considerations transition={ "reason": "stop_hook_blocking" } ) continue if stop_hook_result.prevent_continuation: return Terminal(reason="stop_hook_prevented", messages=messages) ``` **Token budget exhaustion.** When the loop is running with a token budget (a maximum total output token count across all turns), a `BudgetTracker` monitors progress. The logic is not a simple percentage check. 
It uses **diminishing-returns detection**: ```python type BudgetTracker = { continuation_count: int last_delta_tokens: int last_global_turn_tokens: int } def check_token_budget(tracker: BudgetTracker, budget: int, total_turn_tokens: int) -> Decision: COMPLETION_THRESHOLD = 0.9 DIMINISHING_THRESHOLD = 500 delta = total_turn_tokens - tracker.last_global_turn_tokens is_diminishing = ( tracker.continuation_count >= 3 and delta < DIMINISHING_THRESHOLD and tracker.last_delta_tokens < DIMINISHING_THRESHOLD ) if not is_diminishing and total_turn_tokens < budget * COMPLETION_THRESHOLD: tracker.continuation_count += 1 tracker.last_delta_tokens = delta tracker.last_global_turn_tokens = total_turn_tokens return Continue(nudge_message=budget_nudge(total_turn_tokens, budget)) return Stop(diminishing_returns=is_diminishing) ``` The loop continues as long as progress is being made (each continuation adds at least 500 tokens). If three consecutive continuations each produce less than 500 new tokens, the loop stops even if the budget percentage hasn't been reached (90% threshold). This prevents the model from indefinitely generating tiny fragments when it has effectively completed the task. **Prompt too long.** The model returns a context-overflow error instead of a response. The loop can attempt reactive compaction (summarizing old messages) once per session. If compaction has already been attempted, the loop returns `Terminal(reason='prompt_too_long')`. **Reliable loop continuation signal.** The API's `stop_reason === 'tool_use'` field is not a reliable signal for whether to continue the loop. Track a boolean `needs_follow_up` that is set to `true` whenever a `tool_use` content block is detected during streaming. Use this flag, not the API's stop reason, to decide whether to dispatch tools and loop again. ## Error Handling Within the Loop The agent loop must handle errors without corrupting the message history. Each error type has a different recovery strategy. 
**LLM call errors: max_output_tokens.** When the model hits its per-response output limit before finishing, the API returns a partial response with a `max_output_tokens` stop reason. The loop can recover by appending a continuation prompt: "You were cut off. Please continue from where you stopped." This recovery has a limit (typically 3 attempts per turn) tracked by `recovery_count` in the state struct. ```python if response.stop_reason == "max_output_tokens" and recovery_count < MAX_RECOVERY: continuation = create_system_message("Continue from where you stopped.") state = LoopState( messages=[*messages, *response.messages, continuation], recovery_count=recovery_count + 1, transition={ "reason": "max_output_tokens_recovery" } ) continue ``` **Tool execution errors.** When a tool call fails (exception, timeout, or validation error), do not omit the tool result. Emit an error `tool_result` with `is_error: true` so the model sees the failure and can retry with different inputs or adapt its approach. Silently swallowing a tool error leaves the model with a `tool_use` that has no paired result, which causes an API error on the next call. ```python try: result = await tool.run(call.arguments) tool_results.append(tool_result(id=call.id, content=result)) except ToolError as e: tool_results.append(tool_result(id=call.id, content=str(e), is_error=True)) # Do NOT re-raise: the model needs to see the failure ``` **Malformed responses and tombstones.** When a model fallback occurs (switching to a secondary model after rate limiting or a provider error), partial `assistant` messages from the failed stream may be left in the message history. These are *orphaned messages*: they contain content bound to the original model's context (cryptographic signatures, thinking block references) that would cause API errors if replayed to a different model. The solution is to emit `tombstone` messages that mark orphaned messages for removal. 
Tombstones are stripped from the message history before the next API call, from the UI rendering, and from transcript serialization. The model effectively never sees the failed partial response. **Error API responses and stop hooks.** When the API returns an error response (not a partial output, an actual error), stop hooks must not evaluate it. Stop hooks are designed for valid model responses. An error message is not a valid response, and feeding it to stop hooks causes them to misfire. The correct path is to call a separate `stop_failure_hooks` path (a non-blocking notification-only path) and then either attempt recovery or return a terminal state. **The reactive compact guard.** When the message history grows large enough to cause a prompt-too-long error, the loop can trigger reactive compaction, which summarizes and truncates the history. This should happen at most once per session. The guard flag tracking whether compaction has been attempted must be **session-scoped, not turn-scoped**. If you reset it at the start of each turn, a loop that compacts, fails again (because the compacted history is still too large), and triggers compaction again will cycle until it runs out of API budget. ## Production Considerations These insights require production experience to know. Each one has caused real bugs in production agent loops. **Synthetic tool_result emission on abort is a hard requirement.** Any unmatched `tool_use` block in the message history causes an API 400 error on the next call. This is not a "nice to have" cleanup. It's a correctness invariant. When the abort signal fires, every outstanding `tool_use` ID must receive a corresponding `tool_result` with `is_error: true` before the loop returns. The failure mode is subtle: the loop exits cleanly, but the next time the user starts a new turn, they see a cryptic API error instead of their response. The fix is always to trace back to an unmatched `tool_use` from the previous aborted turn. 
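The invariant can be sketched as a small helper. This is an illustrative implementation, not the one referenced above: the content-block shape (`tool_use_id`, `is_error`) follows Anthropic-style message formats, and packaging the synthetic results as a single `user` message is an assumption.

```python
def emit_missing_tool_results(
    messages: list[dict],
    pending_tool_uses: set[str],
    reason: str,
) -> None:
    """Append a synthetic error tool_result for every outstanding tool_use ID.

    Hypothetical message shape: one 'user' message carrying a tool_result
    content block per unmatched tool_use, each flagged as an error.
    """
    if not pending_tool_uses:
        return  # nothing outstanding; the history is already well-formed
    results = [
        {
            "type": "tool_result",
            "tool_use_id": tool_use_id,
            "content": reason,
            "is_error": True,  # the model sees this as a failed call
        }
        for tool_use_id in sorted(pending_tool_uses)
    ]
    messages.append({"role": "user", "content": results})
    pending_tool_uses.clear()  # every tool_use is now matched
```

After this runs, every `tool_use` in the history has a paired `tool_result`, so the next API call validates cleanly.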
**Stop hook death spiral from session-scoped guard reset.** Stop hooks can inject blocking messages that cause the loop to re-enter. If that re-entered turn produces a prompt-too-long error and triggers reactive compaction, the compaction guard must not be reset. The failure mode is an infinite cycle: compact, still too long, error, stop hook fires, compact again. The guard (`has_attempted_reactive_compact`) must survive across stop hook re-entries. Treat it as session state, not turn state. Warning signs: API costs growing exponentially, loop never returning to the user. **Post-sampling hooks fire before tool dispatch and cannot affect the current turn.** There is a common misunderstanding about when post-sampling hooks execute: they run immediately after the model's response stream completes, *before* tool results are awaited. They fire as fire-and-forget. They do not block tool dispatch, and they cannot change which tools are dispatched in the current turn. They can inject system messages that affect the *next* turn, but if you're trying to gate tool dispatch on a hook's decision, you need a different hook type (pre-tool hooks, not post-sampling hooks). This mistake produces loops where hook logic appears to run but has no visible effect on tool dispatch. **Token budget diminishing-returns thresholds are empirically calibrated.** The 500-token threshold per continuation and 3-consecutive-check requirement exist because the model can continue producing tokens indefinitely in tiny fragments when it has effectively completed the task. These values are not derivable from the problem statement. They emerged from observing real loops running over-budget. If you build your own budget tracker, start with these values as calibration targets: 90% completion threshold, 500 tokens per continuation, 3 consecutive checks before stopping. Your task distribution may need different values, but these are the right starting point. 
**The `stop_reason` API field is unreliable for loop continuation.** The API contract says `stop_reason === 'tool_use'` when the model stopped because it emitted tool calls. In practice, this field is not always set correctly. Production loops that rely on it exit prematurely when tool calls are present in the response. The correct sentinel is a local boolean (`needs_follow_up`) that is set to `true` whenever a `tool_use` content block is detected during streaming. Use the local flag, not the API field, for loop continuation decisions. **Tombstoning is the correct recovery from orphaned streaming messages, not history truncation.** When a model fallback leaves partial assistant messages in the history, the temptation is to truncate the history to remove them. Truncation loses valid user and tool messages that came before the orphaned assistant message. Tombstoning is a targeted removal: mark the specific orphaned message for removal, leave everything else intact. Downstream consumers (transcript serializers, UI renderers) that understand tombstones will strip them. Those that don't will ignore them safely. ## Best Practices **Do track loop continuation with a local flag, not the API's stop_reason.** Set `needs_follow_up = True` whenever you see a `tool_use` block during streaming. Use this flag to decide whether to dispatch tools and continue. The API stop_reason is informational. Your local flag is authoritative. **Don't omit tool results for failed or aborted tools.** Every `tool_use` block must have a matching `tool_result` in the history, even if the tool failed or the loop was aborted. Emit error results with `is_error: True` and a descriptive error message. The model uses these to adapt, and the API requires them for correctness. **Do use a typed state struct at every continue site.** When the loop needs to iterate, replace the state struct wholesale with a new instance that includes a typed `transition.reason`. 
This makes continuation reasons auditable and prevents state leakage between turns. **Don't treat the reactive compact guard as per-turn state.** The flag tracking whether reactive compaction has been attempted must survive across all loop iterations including stop hook re-entries. Reset it only when the session ends, not between turns. **Do propagate the abort signal to every awaited operation.** Pass the abort signal into model API calls, tool executions, and hook invocations. Checking it only at the top of each iteration gives coarse-grained cancellation where the current turn must complete before the abort takes effect. **Don't rely on `max_turns` as the only termination condition.** Build your loop to understand all terminal states: token budget exhaustion, stop hook prevention, prompt overflow, and clean abort. Surfaces that only handle `max_turns` will appear to hang or produce cryptic errors when other termination conditions fire. **Do use `try/finally` in the generator body.** The generator may be abandoned mid-execution. The `finally` block is the only reliable place to release resources: close connections, cancel child tasks, flush logs. Without it, resources leak silently. **Don't evaluate stop hooks on API error responses.** Stop hooks are designed for valid model responses. When the API returns an error (context overflow, rate limiting, model error), call a failure notification path instead. Evaluating stop hooks on error responses causes them to misfire and can trigger the stop hook death spiral. ## Related - **[Tool System Design](/docs/tool-system)**: Tools are typed functions with metadata that makes them agent-safe. The schema tells the model what arguments to pass, and the concurrency class tells the dispatcher whether parallel execution is safe. The dispatch algorithm (how consecutive safe tools are grouped into batches and unsafe tools run serially) is the other half of what the loop does on each turn. 
- **[Memory and Context](/docs/memory-and-context)**: As the message list grows across turns, the loop faces context pressure. This page covers the hierarchy of forgetting, from in-context history to compressed summaries to long-term storage, and when reactive compaction should fire. The reactive compact guard discussed in Production Considerations above is the loop's interface with the memory system. - **[Error Recovery](/docs/error-recovery)**: The loop's error handling strategies (retry logic, circuit breakers, tiered recovery) connect to the broader error recovery patterns covered on this page. Tombstoning, synthetic tool results, and continuation prompts are specific applications of the more general error recovery framework. - **[Streaming and Events](/docs/streaming-and-events)**: The generator pattern yields streaming events that callers consume. This page covers the event type system, how partial results flow from the model through the loop to the UI, and how backpressure prevents the event stream from overwhelming consumers. - **[Prompt Architecture](/docs/prompt-architecture)**: The system prompt is assembled before each loop iteration. Understanding the two-zone model (cached static + volatile dynamic) and the prompt assembly pipeline explains how the loop's context changes across turns. - **[Multi-Agent Coordination](/docs/multi-agent-coordination)**: Multi-agent systems coordinate multiple agent loops. Spawning backends, mailbox communication, and tool partitioning are the patterns that connect individual loops into collaborative systems. - **[Observability and Debugging](/docs/observability-and-debugging)**: The span hierarchy in session tracing mirrors the agent loop lifecycle. Each loop iteration produces spans that are the primary debugging surface for understanding agent behavior. - **[Pattern Index](/docs/pattern-index)**: All patterns from this page in one searchable list, with context tags and links back to the originating section. 
- **[Glossary](/docs/glossary)**: Definitions for all domain terms used on this page, from agent loop primitives to memory system concepts. --- # Tool System Design Source: https://claudepedia.dev/docs/tool-system Section: Patterns / core-systems How tools are defined, registered, and dispatched in agent systems, covering concurrency partitioning, execution lifecycle, behavioral flags, dynamic tool sets, and the production patterns that make tool dispatch safe at scale. A tool is how an agent acts on the world. Without tools, an agent can only produce text. With tools, it can read a database, call an external API, write a file, run a subprocess, or search the web. Every action the agent takes beyond generating text goes through a tool. But a plain function isn't enough. The agent loop dispatches tool calls based on what the model requests, and the model has no idea what your code looks like. It only knows what you tell it. That means a tool is really two things: the function that does the work, and the metadata that tells the rest of the system how to handle it safely. The metadata is the design. The function body is the plumbing. A tool is a **typed function with metadata**. The metadata has three parts: - **Schema**: the typed contract between the model and the code. The model reads this to know what arguments to pass. Without a schema, the model is guessing. - **Concurrency class**: tells the dispatcher whether parallel execution is safe. Some tools read files and can run in parallel. Others write to shared state and must run serially. - **Behavioral flags**: cross-cutting concerns declared as data, not embedded in the function body. Is this tool destructive? Does it require user permission? Can it be interrupted? 
Here is a tool definition with full metadata: ```python tool search_files( pattern: string, directory: string, ) -> SearchResult[]: metadata: schema: { pattern: string, directory: string } concurrency_class: READ_ONLY max_result_size_chars: 500_000 behavioral_flags: is_destructive: false requires_permission: false interrupt_behavior: 'block' return filesystem.search(pattern, in=directory) ``` The function body is two lines. The metadata is seven. That ratio is intentional. Most of the tool design work is in the metadata, because the metadata is what makes the tool usable in an automated system where you can't watch every call. ## Concurrency Classes The concurrency class answers one question: **is it safe to run this tool in parallel with other tools?** The three classes form a spectrum: - **`READ_ONLY`**: The tool only reads. No shared state is modified, so multiple instances can run concurrently without interference. Example: searching files, fetching a URL, reading a config value. - **`WRITE_EXCLUSIVE`**: The tool writes to shared state. It must run serially: other tools must finish before this one starts (or vice versa). Example: writing a file, inserting a database row, sending an email. - **`UNSAFE`**: The tool has side effects that are hard to bound or undo. It runs serially and in isolation, in a subprocess or sandbox where it can't interfere with the agent's own state. Example: executing arbitrary shell commands, running untrusted code. One subtlety worth naming: **concurrency class is determined at dispatch time, not at registration time.** The dispatcher calls `is_concurrency_safe(parsed_input)` once per tool call, passing the actual arguments. A shell execution tool running `ls` might be concurrency-safe. The same tool running a destructive command is not. This is not a static property stamped onto the tool type. It's a runtime judgment about a specific invocation. 
If `is_concurrency_safe` throws (for example, because the input fails to parse), the conservative fallback is `false`. Schema parse failure also defaults to `false`. The system treats all ambiguous cases as unsafe. It never optimistically assumes concurrent dispatch is okay. ## The Dispatch Algorithm When the model returns multiple tool calls in a single response, the dispatcher must decide which tools to run in parallel and which to serialize. The algorithm is **partition-then-gather**: split the tool call list into consecutive batches, where each batch is either a concurrent group of safe tools or a single unsafe tool. Process batches sequentially. Within each safe batch, run tools in parallel. The partitioning rule is simple: extend the current safe batch if the current tool is safe and the previous batch is also safe. Otherwise start a new batch. The following example shows the core partition logic: ```python function partition_tool_calls(calls: ToolCall[], context) -> Batch[]: batches = [] for call in calls: tool = find_tool(call.name) try: parsed = tool.input_schema.parse(call.input) is_safe = tool.is_concurrency_safe(parsed) except: is_safe = False # conservative: treat parse failure as unsafe if is_safe and batches and batches[-1].is_safe: batches[-1].calls.append(call) # extend existing safe batch else: batches.append(Batch(calls=[call], is_safe=is_safe)) return batches ``` And the dispatch loop that processes batches: ```python async function dispatch_all(batches: Batch[], context) -> Result[]: results = [] for batch in batches: if batch.is_safe: batch_results = await gather(*[run_tool(c, context) for c in batch.calls]) else: batch_results = [] for call in batch.calls: batch_results.append(await run_tool(call, context)) results.extend(batch_results) return results ``` A critical property: **safe groups on either side of an unsafe tool are never merged.** If the model requests `[search_A, search_B, write_C, search_D, search_E]`, the batches are `[search_A, 
search_B]`, `[write_C]`, `[search_D, search_E]` (three separate batches, processed in that order). The second safe group doesn't execute until `write_C` completes. Order is preserved across the full result list.

The maximum number of tools that can run concurrently within a safe batch is capped (a default of 10, configurable via environment variable). This prevents a single model response with 50 safe tool calls from overwhelming downstream services.

## Tool Lifecycle

Every tool call goes through a fixed sequence of phases before the result is returned to the model. Understanding this lifecycle matters because each phase can fail, and each failure produces a `tool_result` with `is_error: true` that the model sees and can act on.

**Phase 1: Schema validation (shape and types)**

The raw input from the model is parsed against the tool's declared schema. If the model provided arguments of the wrong type, passed an unknown field, or omitted a required field, this phase fails and returns an error result. The error message includes the schema mismatch details, giving the model enough information to retry with corrected input.

**Phase 2: Semantic validation (business logic)**

Each tool can implement an optional `validate_input` method that runs after schema parsing succeeds. This phase validates business logic: does the file path exist? Is the command in the allow list? Are the date ranges valid? Like schema failures, semantic failures return an error result with an explanatory message. When schema validation succeeds, semantic validation always runs as well; there is no fast path where a clean schema parse skips the semantic check.
Together they form a two-layer validation approach:

```python
async function execute_tool(tool, call, context) -> ToolResult:
    # Phase 1: Schema validation (shape and types)
    parsed = tool.input_schema.safe_parse(call.input)
    if not parsed.success:
        return error_tool_result(
            call.id,
            f"InputValidationError: {parsed.error}"
        )

    # Phase 2: Semantic validation (business logic)
    validation = await tool.validate_input(parsed.data, context)
    if not validation.result:
        return error_tool_result(
            call.id,
            f"ValidationError: {validation.message}"
        )

    # Execute with validated, semantically-checked input
    return await tool.call(parsed.data, context)
```

**Phase 3: Permission check**

After both validation phases pass, the permission system evaluates whether this tool call is allowed to proceed. Permissions are checked on validated input (the parsed, semantically-checked arguments) so classifiers and permission rules see the same structured data that the tool will receive.

**Phase 4: Pre-tool hooks**

Before the tool executes, any registered pre-tool hooks run. Hooks can inspect the tool call, inject additional context, modify the input, or block execution. A hook that blocks returns an error result without ever calling the tool function.

**Phase 5: Execute**

The tool function runs with the validated and possibly hook-modified input.

**Phase 6: Post-tool hooks**

After execution completes, post-tool hooks run. They observe the result but typically cannot modify it. They're used for logging, analytics, and side effects.

Each validation failure and each hook rejection produces a properly formatted `tool_result` message, so the model always receives a complete response for every `tool_use` it issued. An incomplete response (a `tool_use` with no matching `tool_result`) causes an API error on the next request.
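The remaining phases wrap that validation core. The sketch below shows one way phases 3-6 might compose around execution; it collapses the two validation phases into a single callable and uses hypothetical names (`run_tool_call`, `check_permission`, dict-based tools) purely for illustration:

```python
from dataclasses import dataclass

@dataclass
class ToolResult:
    tool_use_id: str
    content: str
    is_error: bool = False

def error_result(call_id, message):
    return ToolResult(call_id, message, is_error=True)

def run_tool_call(tool, call, pre_hooks=(), post_hooks=()):
    # Phases 1-2: schema + semantic validation, collapsed into one callable here
    ok, value = tool["validate"](call["input"])
    if not ok:
        return error_result(call["id"], value)
    parsed = value
    # Phase 3: permission check runs on the validated input
    if not tool.get("check_permission", lambda _p: True)(parsed):
        return error_result(call["id"], "PermissionDenied")
    # Phase 4: pre-tool hooks may modify the input or block execution
    for hook in pre_hooks:
        parsed = hook(tool["name"], parsed)
        if parsed is None:  # a hook that blocks never calls the tool function
            return error_result(call["id"], "Blocked by pre-tool hook")
    # Phase 5: execute with validated, possibly hook-modified input
    result = ToolResult(call["id"], tool["run"](parsed))
    # Phase 6: post-tool hooks observe the result (logging, analytics)
    for hook in post_hooks:
        hook(tool["name"], result)
    return result
```

Note that every exit path returns a `ToolResult`, which is what guarantees the tool_use/tool_result pairing the API requires.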
**Interrupt behavior**

Every tool can declare what happens when a user submits a new message while the tool is still running:

- `'cancel'`: stop the tool immediately and discard its result
- `'block'`: keep running, and the user's new message waits until the tool finishes

The default is `'block'`. Long-running read operations typically use `'block'` because stopping mid-read could leave the agent in an inconsistent state. Destructive operations might use `'cancel'` because if the user explicitly wants to stop a delete operation, stopping it is the right call.

## Behavioral Flag Composition

The behavioral flags on a tool are declarations, not enforcement mechanisms. The dispatcher, permission system, and UI read these flags to make routing decisions. The flags themselves don't restrict anything. This separation matters because it means enforcement is centralized and auditable. The key flags and how they compose:

**`is_concurrency_safe(input)`**: Runtime function, not a boolean field. Called by the dispatcher at dispatch time with the actual parsed input. Returns `true` if it's safe to run this specific invocation alongside other concurrent tools. When in doubt, return `false`. The performance cost of unnecessary serialization is much lower than the correctness cost of unintended concurrent writes.

**`is_read_only(input)`**: Declares that the tool does not modify any persistent state. Used by the permission system to make fast-path allow decisions. A tool marked `is_read_only` may still require permission for other reasons (policy rules, explicit ask rules). It's one input to the permission decision, not a bypass.

**`is_destructive(input)`**: Declares that the tool performs an irreversible operation: delete, overwrite, send. Default is `false`. The permission system and UI use this to surface extra confirmation. Only set to `true` when the operation genuinely cannot be undone: file deletion, email sending, database record removal.
**`interrupt_behavior()`**: Declares the cancel-or-block behavior described in the lifecycle section. This flag is read by the UI layer to determine whether a running tool can be stopped by the user.

**`requires_user_interaction()`**: Declares that the tool must interact with the user directly (for example, showing a dialog). Tools with this flag should not be called in non-interactive contexts (batch mode, background agents). The dispatcher checks this flag before execution and returns an error result if the session doesn't support interaction.

**`should_defer` / `always_load`**: Two ends of the tool visibility spectrum. `should_defer` marks a tool as deferred: its schema is not included in the initial model prompt. The model must explicitly search for and load the tool before calling it. `always_load` is the opposite: the tool's schema always appears in the prompt even when tool deferral is enabled for everything else. Use `always_load` for tools the model must discover on turn 1 without a search round-trip.

The composition rule at dispatch time: the dispatcher reads `is_concurrency_safe` to partition batches. The permission system reads `is_read_only`, `is_destructive`, and `requires_user_interaction` to make per-call permission decisions. The UI reads `interrupt_behavior` and `is_destructive` to determine what controls to show the user. No single system reads all the flags. Each system reads only what it needs.

## Dynamic Tool Sets

Tools don't have to be static across the lifetime of an agent session. The tool context carries an optional `refresh_tools` callback. At the end of each loop iteration, after all tool results from that turn are complete, the loop calls `refresh_tools()` and compares the result to the current tool list. If they differ, the next iteration starts with the updated tool list.

```python
type ToolContext = {
    options: {
        tools: Tool[]
        refresh_tools: () -> Tool[] | None  # optional callback
        # ...
    }
    # ...
}

# At end of each iteration, after tool results are complete:
if context.options.refresh_tools:
    fresh_tools = context.options.refresh_tools()
    if fresh_tools != context.options.tools:
        context = context.with(options=context.options.with(tools=fresh_tools))
        # Next iteration sees the updated tool list
```

The key invariant: **tools are immutable within a single iteration, potentially different on the next one.** This is what makes dynamic tool sets safe. The model receives a consistent tool list for its entire response in a given turn. It can't be in the middle of requesting tools that are about to disappear.

This pattern enables several important capabilities:

- **MCP server connections mid-session:** When an MCP server connects after the session starts, `refresh_tools` returns the new tool list and the agent immediately has access to the new tools on the next turn.
- **Conditional tool availability:** The `is_enabled()` flag on each tool gates whether it's included in the current tool list. Tools can be temporarily unavailable based on session state.
- **Permission-based filtering:** The dispatcher filters tools at dispatch time based on permission rules. A tool that the user has blocked won't be offered to the model even if it's in the registered list.
- **Deferred tools:** Tools marked `should_defer` aren't included in the initial prompt. They only appear after the model explicitly searches for them and loads their schema. This keeps the initial prompt compact when the agent has access to hundreds of tools.

## Schema and the LLM

The schema is how the model knows what to call. When the agent loop presents the model with a list of available tools, each tool's schema becomes part of the prompt. The model reads it and decides whether calling this tool would help accomplish the current task, and if so, what arguments to pass. Schemas use JSON Schema format: type definitions, required fields, descriptions, and examples.
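As a concrete illustration, here is what a `read_file` tool's schema might look like in that format, along with a tiny helper that checks required fields the way phase-1 validation would. The `max_bytes` argument and the helper function are hypothetical, added for the example:

```python
# Hypothetical read_file schema in JSON Schema format. The model sees
# only this metadata, never the implementation behind it.
read_file_schema = {
    "name": "read_file",
    "description": "Read the contents of a file at the given path",
    "input_schema": {
        "type": "object",
        "properties": {
            "path": {
                "type": "string",
                "description": "Absolute path to the file",
            },
            "max_bytes": {
                "type": "integer",
                "description": "Optional limit on how many bytes to return",
                "minimum": 1,
            },
        },
        "required": ["path"],
        "additionalProperties": False,
    },
}

def missing_required(schema: dict, args: dict) -> list:
    """Report required fields the model omitted (one sliver of phase-1 validation)."""
    return [f for f in schema["input_schema"]["required"] if f not in args]
```

A full phase-1 validator would also check types and reject unknown fields; schema libraries handle that for you.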
The description field on each argument is especially important: it's the model's only guidance about what the argument means. Write argument descriptions as if the model is going to read them cold, with no other context. Because it will.

There's a sharp edge worth knowing: most LLM APIs that accept tool schemas do not support the full JSON Schema 2020-12 specification. In particular, they don't support `$ref`, `$defs`, or `allOf` (the composition mechanisms that JSON Schema uses to share definitions between fields). This means schemas with nested types often need post-processing before being sent to the API: all references must be inlined. If your tool schemas use Pydantic models or other schema-generating libraries, check whether they generate `$ref`-based schemas that will need flattening before dispatch.

> **Note:** The two-phase validation pattern means schema errors and semantic errors are reported separately to the model, giving it better signal for retry. A schema error means "I passed the wrong type." A semantic error means "I passed the right type but the value was invalid." The model can use these distinct signals to construct a more targeted correction.

## Fail-Closed Defaults

What happens when a tool forgets to declare a behavioral flag? In most systems, a missing value means the default is permissive: undefined behavior is allowed. In a tool system, the opposite is safer. **A missing flag should default to the most restrictive value.** A tool that doesn't declare `is_concurrency_safe` is treated as not concurrency-safe. A tool that doesn't declare `requires_permission` is treated as requiring permission. A tool that doesn't declare a concurrency class is treated as unsafe.

This is a deliberate design choice. The cost of treating a safe tool as unsafe is a small performance hit: it runs serially when it could have run in parallel. The cost of treating an unsafe tool as safe is data corruption, permission bypass, or worse.
The asymmetry is obvious once you name it.

> **Note:** Fail-closed defaults don't require special framework support. They're implemented as simple default values in the metadata schema: `is_concurrency_safe = false`, `requires_permission = true`, `is_read_only = false`. Any tool that explicitly overrides these defaults is making an affirmative claim that it's safe to relax them.

The `build_tool` factory function pattern centralizes these defaults:

```python
TOOL_DEFAULTS = {
    is_enabled: () -> True,
    is_concurrency_safe: (_input) -> False,  # fail-closed
    is_read_only: (_input) -> False,         # fail-closed
    is_destructive: (_input) -> False,
    check_permissions: (_input, _ctx) -> allow(),
}

function build_tool(definition: ToolDef) -> Tool:
    return { ...TOOL_DEFAULTS, ...definition }
```

Every tool definition goes through `build_tool`. Any field the definition omits gets the conservative default. A tool that explicitly implements `is_concurrency_safe` to return `True` for certain inputs is making an affirmative safety claim, one that will be evaluated at runtime.

## Production Considerations

**Sibling abort: one tool failing can cancel its concurrent siblings**

When tools run in a concurrent batch, one tool erroring mid-execution doesn't necessarily mean the others should finish. A child abort controller (distinct from the session-level abort controller) signals all concurrently running tools in the same batch to stop. The parent session is not aborted. Only the current tool batch is cancelled. Without this, a failing tool in a concurrent batch would allow its siblings to run to completion, wasting time and potentially producing results that will never be used.

**Result size offload prevents context window monopolization**

Every tool declares a `max_result_size_chars` field. When a tool result exceeds this limit, the content is persisted to a temporary file and the model receives a preview (the file path and a sample of the content) instead of the full result.
This prevents a single large tool result (reading a 1MB log file, for example) from consuming most of the context window and crowding out other messages. Tools that should never be offloaded (typically tools whose output is already bounded by their own limits) set `max_result_size_chars = Infinity` to opt out explicitly. This avoids a circular problem: a file-reading tool that offloads to a file would create a situation where the model reads the offload file with the same tool and risks offloading again.

**Tool aliases enable backward-compatible renames**

The tool interface supports an optional `aliases` field, a list of alternative names the tool will respond to in addition to its primary name. When the dispatcher looks up a tool by name, it checks both the primary name and the alias list. This solves a real versioning problem: old conversation transcripts reference tools by name. If you rename a tool, replaying those transcripts would generate "no such tool" errors for every call to the old name. With aliases, you can rename `KillShell` to `TaskStop` and add `KillShell` to the alias list. Old transcripts continue to work without modification.

The fallback path for alias resolution is intentionally narrow: it only activates when the alias-matching tool is found in the global base tools registry, not the current session's tool list. This prevents unauthorized alias injection, where an external party claims to have a tool that matches an alias in hopes of hijacking the resolution path.

**Input-dependent concurrency: treat parse failure as unsafe**

Because `is_concurrency_safe` is called with parsed input, it can fail if the input fails to parse. The conservative handling (treating any exception in `is_concurrency_safe` as `false`) prevents a subtle class of bugs. If the function throws, it might be because the input is malformed in a way that makes concurrency unsafe. Defaulting to serial execution in that case is the right call: the performance cost is minimal, and the safety guarantee is preserved.

The broader pattern: never let a failure in the concurrency classification function cause optimistic concurrent dispatch. The asymmetry of correctness (serializing when concurrent would have been fine) vs. incorrectness (running concurrently when serial was required) strongly favors the conservative path.

## Best Practices

**Do declare `is_concurrency_safe` as a function, not a constant.** A tool that is sometimes safe and sometimes not (depending on what it's called with) must inspect the actual input at dispatch time. A constant `is_concurrency_safe = True` that ignores input is dangerous. A `read_file` tool can return `True`. A `bash` tool must inspect the command.

**Don't ignore semantic validation.** Schema validation catches type errors. Semantic validation catches correctness errors. A tool that validates only schema will accept `delete_file(path="/")` as valid because the path is a string. Semantic validation is where you check that the path exists, is in scope, and doesn't point at something critical.

**Do keep tool schemas narrow.** The model hallucinates arguments it wasn't told about. A schema with 12 optional fields invites the model to pass fields that shouldn't be set. Design schemas with the minimum fields needed for each use case. If two use cases need genuinely different shapes, consider two tools.

**Don't use `is_destructive` as a permission bypass.** The `is_destructive` flag changes how the UI presents the operation: it triggers confirmation dialogs and highlights in the tool use view. It does not skip the permission check. A destructive tool still goes through the full permission lifecycle.

**Do set `max_result_size_chars` on every tool.** Never leave it unset (which would default to a potentially huge or zero limit). Pick a value appropriate to the expected output size.
Tools that produce bounded output (a status code, a count) can set this high. Tools that read arbitrary file content should set it to something like 100KB-500KB to avoid context overflow.

**Don't name tools vaguely.** The model reads tool names and descriptions to decide which tool to call. "process_data" tells the model nothing. "extract_csv_rows" tells the model exactly when to reach for it. Precise naming reduces hallucination and improves first-call accuracy.

**Do use `always_load` sparingly.** Every tool with `always_load` increases the size of the initial prompt. In sessions with dozens of tools, that adds up. Reserve `always_load` for tools the model genuinely needs on turn 1: startup checks, user interaction tools, tools required for the model to discover other tools.

**Don't forget to wire `interrupt_behavior`.** The default is `'block'`, which is correct for most read operations. For long-running operations that the user might want to cancel (shell commands, network requests, file uploads) decide explicitly whether `'cancel'` or `'block'` is appropriate. A tool with `'block'` will prevent the user from interrupting a runaway operation.

## Related

- **[Agent Loop Architecture](/docs/agent-loop)**: The loop is what dispatches tool calls on each turn. Understanding how the loop calls tools, when it dispatches, how it handles errors, and what triggers the next turn gives the tool system its execution context.
- **[Safety and Permissions](/docs/safety-and-permissions)**: The permission system gates every tool call between validation and execution. This page explains the full permission cascade, graduated trust, and the classifier patterns that make auto-approval safe.
- **[Error Recovery](/docs/error-recovery)**: Tool errors are one of the most common recovery scenarios in agent systems. This page covers retry strategies, circuit breakers, and how to design tools that signal recoverable vs. unrecoverable failures.
- **[Streaming and Events](/docs/streaming-and-events)**: During streaming, tool use blocks arrive incrementally. Understanding the streaming model explains why tool dispatch must handle partial inputs, why the `is_concurrency_safe` check happens at the point of full input availability, and how concurrent tool results are interleaved in the event stream.
- **[Command and Plugin Systems](/docs/command-and-plugin-systems)**: Commands are the user-facing layer built on top of the tool system. The lazy-loading registry, metadata-first registration, and multi-source merge patterns extend the same principles that govern tool registration.
- **[MCP Integration](/docs/mcp-integration)**: MCP bridges external tools into the agent's tool system. The tool bridge pattern constructs Tool objects structurally identical to built-in tools, so the dispatcher handles them with the same concurrency and permission logic.
- **[Hooks and Extensions](/docs/hooks-and-extensions)**: Hooks wrap tool execution at the `PreToolUse` and `PostToolUse` lifecycle events. Understanding the tool lifecycle phases makes hook timing and interception points clearer.
- **[Multi-Agent Coordination](/docs/multi-agent-coordination)**: In multi-agent systems, tool sets are partitioned between coordinator and workers. The coordinator gets orchestration tools while workers get domain tools: the same registry, different filtered views.
- **[Pattern Index](/docs/pattern-index)**: All patterns from this page in one searchable list, with context tags and links back to the originating section.
- **[Glossary](/docs/glossary)**: Definitions for all domain terms used on this page, from agent loop primitives to memory system concepts.
---

# Prompt Architecture

Source: https://claudepedia.dev/docs/prompt-architecture
Section: Patterns / core-systems

How the static/dynamic boundary in agent prompts affects cost, latency, and consistency, with the full assembly pipeline, five-level priority chain, memory injection, and production insights for cache management.

Most developers write their first agent prompt the way they write a chat message: pile everything in, see what works, tune from there. That approach breaks down fast. An agent prompt isn't a message you send once. It's an input that gets reconstructed every turn, sent to an expensive API, and shapes every decision the model makes for the entire session. Prompt structure is not cosmetic.

The insight that changes how you think about prompts: **they have two zones**. One zone stays identical across every user, every session, every turn, and it can be cached. The other zone changes per-session or per-turn, and it cannot. Where you draw the line between these zones determines your cost, your latency, your behavior consistency, and how easily you can test your agent in isolation. This is an architectural decision, not just a wording choice.
## The Two-Zone Model

The system prompt splits into two zones at a structural boundary:

**Static zone**: content that is identical for every user, every session, every turn:

- Identity (the opening sentence that primes the model's behavioral profile)
- Behavioral rules (scope, verbosity, safety constraints)
- Tool usage instructions (how to use the available tools)
- Numeric calibration (confidence thresholds, response length limits)

**Dynamic zone**: content that varies per-session or per-turn:

- Session context (working directory, operating system, model name)
- User memory (retrieved long-term facts from previous sessions)
- Active tool availability (if tools connect and disconnect mid-session)
- Token budget remaining (changes as the session progresses)

```python
function build_system_prompt(session: Session) -> string:
    # Static zone: identical for every user, every session, every turn
    static_sections = [
        identity_section(),          # "You are an agent that..."
        behavioral_rules_section(),  # scope, verbosity, calibration
        tool_usage_section(),        # how to use the available tools
    ]

    # --- STATIC/DYNAMIC BOUNDARY ---

    # Dynamic zone: varies per session or per turn
    dynamic_sections = [
        session_context(session),  # working directory, OS, model name
        user_memory(session),      # retrieved long-term facts
        active_tools(session),     # changes if tools connect/disconnect
    ]

    return join(static_sections + dynamic_sections)
```

Think of the static zone as a **function signature**: clear contracts that don't change between calls. Think of the dynamic zone as **function arguments**: variable inputs for this particular call. Mixing them creates the same problems as side-effectful code that reads global state. You can't test the static part without instantiating the dynamic part, you can't cache without worrying about invalidation, and small dynamic changes pollute the stable contract.

**The cache fragmentation problem.** Prefix caching works by recognizing byte-identical prefixes across API calls.
If any session-variable content appears inside the static zone (say, the working directory injected into the rules section), the cache key differs between users and sessions. With N session-variable bits interleaved in the static content, you get 2^N possible cache keys. Move all variable content below the boundary, and you get one cache key for the entire static prefix. Every call from every user on every turn hits the same cached prefix.

## Identity Design

The first sentence of the system prompt matters more than any other. The model reads identity before any rules, and identity primes the entire behavioral profile: how assertively the model speaks, how much it delegates vs. acts, how it frames uncertainty. Different operational modes need different identities. The verbs encode behavior:

```python
# Different identities prime different behaviors
assistant_identity = "You help users accomplish tasks by reading, searching, and editing files."
coordinator_identity = "You orchestrate complex tasks by delegating to specialized worker agents."
worker_identity = "You are an agent for code analysis. You read files and report findings."
```

"Helps users" means responsive, user-centric, conversational. "Orchestrates" means delegates, coordinates, doesn't do the work itself. "An agent for" means subordinate, task-scoped, narrow. These aren't soft style choices. A coordinator agent that thinks it "helps users" will try to solve problems directly instead of delegating. A worker agent that thinks it "orchestrates" will spawn sub-agents instead of executing. Identity and role must match.

Identity belongs in the static zone. It never changes between sessions because it defines what kind of agent this is, not anything about the current session.

## Calibration Through Numbers

Behavioral calibration expressed as explicit numeric constraints is more reliable than vague adjectives. The model doesn't have a shared definition of "concise" or "careful" with you.
It interpolates from training data, which varies. Numbers are unambiguous. Compare:

| Vague | Numeric |
|-------|---------|
| "Be concise" | "Maximum 3 sentences per explanation unless the user asks for more" |
| "Be careful with destructive operations" | "Only run destructive operations when confidence exceeds 0.95" |
| "Don't create too many files" | "Maximum 5 new files per task unless the user explicitly requests more" |
| "Ask clarifying questions when unsure" | "Ask at most 2 clarifying questions before attempting the task with your best interpretation" |

The numeric version is also easier to test: you can write a simple check that counts sentences, files created, or questions asked. The vague version requires subjective judgment.

Calibration belongs in the static zone. It defines the agent's operating envelope, meaning constraints that apply equally to all users and all sessions. If a particular user wants different limits, that's a session-level configuration that goes in the dynamic zone as a preference, not a modification to the static rules.

## The Prompt Assembly Pipeline

Building a prompt by concatenating strings is a recipe for fragile cache management. A better approach: a **section registry** where each piece of prompt content is registered as a named section with an explicit cache intent. The two-function API encodes cache intent at registration time:

```python
# Static zone: computed once, cached across turns (memoized)
identity_section = register_cached_section(
    name: "identity",
    compute: () => "You are an agent that helps users accomplish tasks..."
)

behavioral_rules_section = register_cached_section(
    name: "rules",
    compute: () => load_static_rules()
)

# Dynamic zone: recomputed every turn (cache-breaking).
# The verbose name forces the caller to justify why this section
# cannot be cached. It's intentional friction, not just naming style.
token_budget_section = register_volatile_section(
    name: "token_budget",
    compute: () => f"Remaining context: {get_remaining_tokens()} tokens",
    reason: "Token count changes every turn and cannot cache"
)

memory_section = register_volatile_section(
    name: "user_memory",
    compute: () => load_memory_prompt(),
    reason: "Memory file content changes between sessions"
)
```

The naming convention for the volatile variant is itself a safety mechanism. Making the cache-breaking registration intentionally verbose (requiring a `reason` argument) means developers cannot reach for it casually. Every cache-breaking section has a documented justification. When you audit prompt performance, you read the reasons, not the code.

**The five-level prompt priority chain.** When building the final effective prompt, different modes and configurations need to override or append to the base. A priority chain resolves this cleanly:

```python
function build_effective_prompt(config: PromptConfig) -> string:
    # Level 0: override prompt, replaces everything (specialized modes only)
    if config.override_prompt:
        return config.override_prompt + append_tail

    # Level 1: coordinator prompt (coordinator/orchestrator mode)
    if config.coordinator_prompt:
        base = config.coordinator_prompt

    # Level 2: agent system prompt (from agent definition)
    # Replaces or appends to default depending on agent mode
    elif config.agent_system_prompt:
        if config.agent_mode == "append":
            base = default_prompt + config.agent_system_prompt
        else:
            base = config.agent_system_prompt

    # Level 3: custom system prompt (user-provided flag)
    elif config.custom_system_prompt:
        base = default_prompt + config.custom_system_prompt

    # Level 4: default system prompt (base case)
    else:
        base = default_prompt

    # Append-tail: always appended, regardless of which level won
    return base + append_tail
```

The append-tail pattern deserves attention.
It's a safety valve that injects content (memory correction hints, team policy additions, per-session overrides) outside the priority chain. Instead of modifying a static zone section (which would fragment cache keys) or overriding the entire prompt (which would lose the default loop behavior), you append to the tail. The tail is always present, always last, and independent of which level won the priority chain.

## Memory as Dynamic Zone Content

Memory is not separate from the prompt. It is part of the prompt. Long-term facts, user preferences, and project context are injected into the dynamic zone each turn, updating the model's knowledge of the current session state.

This creates a direct coupling between the memory system and prompt structure: **the memory system must respect prompt-layer constraints**. A memory manifest that grows unboundedly will eventually consume a disproportionate share of the context budget before the conversation even begins. The manifest (a single index listing all memory files) must be capped at a line count and byte limit. Without these caps, a session with many long memory entries will exhaust its context budget before the agent can respond.

```python
function load_memory_prompt(memory_dir: str) -> str:
    manifest = read_manifest(memory_dir)

    # Cap manifest: long manifests consume context budget
    # before the conversation begins
    manifest = truncate_manifest(
        manifest,
        max_lines: 200,
        max_bytes: 25_000
    )

    facts = load_fact_files(manifest.entries)
    return format_memory_section(manifest, facts)
```

Why the byte cap in addition to the line cap? Manifests that are under the line limit but contain very long lines (long file paths, long descriptions) can still be large. The byte cap catches this failure mode.

Memory content flows into the dynamic zone through the volatile section registration path. This means memory changes are not cached. Each turn loads fresh memory content and includes it in the prompt.
This is correct behavior: memory is updated between turns (by the background extractor), so stale cached memory would defeat the purpose.

This coupling has an important implication for prompt budget planning: the memory section is not a fixed cost. It grows as the user accumulates preferences, project decisions, and feedback across sessions. A user who has interacted with the agent for six months will have significantly more memory content than a new user. Design the context budget assuming the memory section can consume 5-15% of the effective context window in active sessions. The caps are your safety valve when it grows larger than expected.

## Production Considerations

**Cache clear on compaction is a one-time but necessary cost.** When the conversation compacts, the message history changes. Any memoized prompt sections that included message-dependent content (turn count, token budget, memory references) must be invalidated and re-evaluated. The re-evaluation cost is bounded and one-time, but it's real: every cached section is recomputed after compaction. If you're using a lazy initialization pattern for heavy sections (expensive DB lookups, large file loads), factor in the re-evaluation cost when compaction occurs.

**The verbose volatile registration name is a security mechanism.** Making the cache-breaking variant require a reason argument does more than enforce documentation. It creates friction that prevents developers from accidentally using it for content that could be static. In a team setting, the `reason` field is the first thing you read in a prompt audit. If the reason says "changes per user" but the content is actually the same for all users, that's an easy fix to find. Without the reason requirement, cache fragmentation silently accumulates across the team's contributions.
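A minimal sketch of how the two-function registry might enforce the reason requirement in code. The names mirror the pseudocode earlier on this page, but the implementation (module-level list, closure-based memoization) is illustrative, not the actual one:

```python
_sections = []

def register_cached_section(name, compute):
    """Static zone: memoized, computed at most once across turns."""
    cache = {}
    def section():
        if "value" not in cache:
            cache["value"] = compute()
        return cache["value"]
    _sections.append({"name": name, "volatile": False, "reason": None})
    return section

def register_volatile_section(name, compute, reason):
    """Dynamic zone: recomputed every turn. The required `reason`
    argument is the intentional friction described above."""
    if not reason or not reason.strip():
        raise ValueError(f"volatile section {name!r} requires a non-empty reason")
    _sections.append({"name": name, "volatile": True, "reason": reason})
    return compute

def audit_volatile_sections():
    """The audit 'one-liner': every cache-breaking section and why."""
    return [(s["name"], s["reason"]) for s in _sections if s["volatile"]]
```

Because the reason is validated at registration time, a missing justification fails at import, not in a code review six months later.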
**Proactive mode composition pattern.** When an agent mode appends its instructions to the default prompt rather than replacing it, the agent gains domain-specific behavior on top of the default loop control structure. The anti-pattern is replacing the default prompt entirely, which loses the autonomous loop behavior embedded in the default identity and rules sections. Specialized agents should always append domain instructions. Only root-level loop modes should override the entire prompt.

**Cache fragmentation math is not theoretical at team scale.** With N interleaved dynamic bits in the static zone, 2^N cache keys. In a codebase maintained by a team, these bits accumulate over months: somebody adds a per-user flag here, a session-variable there. Each addition doubles the cache keyspace. The section registry with its explicit cache-intent declaration prevents this because the static/dynamic boundary is enforced structurally, not by convention.

**The two-function API also makes auditing tractable.** When you need to understand why caching is underperforming, the section registry gives you a complete inventory: every registered section, its cache intent, and its declared reason if volatile. Without the registry, tracking down cache-breaking content means reading the entire prompt assembly code path and identifying variable substitutions by hand. With the registry, it's a one-liner to list all volatile sections and their reasons.

## Best Practices

- **DO** separate prompt content into individually registered sections with explicit cache intent (cached vs volatile)
- **DO** put identity, rules, and calibration in the static zone. They're identical across users, sessions, and turns.
- **DO** require a reason argument when registering a volatile section. It creates friction that prevents casual cache-breaking.
- **DO** use the append-tail for memory hints, team policy additions, and per-session overrides
- **DO** clear the section cache on compaction and session reset. Memoized sections must re-evaluate to pick up session-level changes.
- **DO** cap memory manifest at line count and byte limit before injecting it into the dynamic zone
- **DON'T** interleave variable content in the static zone. Each bit doubles the number of cache keys.
- **DON'T** have agent modes replace the entire default prompt. Append domain instructions to preserve the loop control structure.
- **DON'T** inject unbounded memory content into the prompt. The memory system and prompt structure share the same context budget.

## Related

- **[Memory and Context](/docs/memory-and-context)**: Memory determines what fills the dynamic zone. Retrieved long-term facts and session context are assembled each turn and injected into the prompt. The memory system's size constraints directly affect available context budget.
- **[Agent Loop Architecture](/docs/agent-loop)**: The loop sends the assembled prompt at the start of every turn. The static zone is the portion the loop can cache across calls. The dynamic zone is rebuilt each turn.
- **[Tool System](/docs/tool-system)**: Tool descriptions are part of the static zone. Understanding how tool definitions are structured and how they consume static budget connects prompt architecture to tool design.
- **[Error Recovery](/docs/error-recovery)**: Retry behavior can be calibrated in the static zone via numeric constraints. Understanding what goes in the prompt vs what goes in the retry configuration clarifies the division of responsibility.
- **[Hooks and Extensions](/docs/hooks-and-extensions)**: Hooks are the primary extension mechanism for prompt assembly. The `UserPromptSubmit` and `PreToolUse` hooks can inject, modify, or gate prompt content at defined lifecycle points without touching the static/dynamic zone structure directly.
- **[Pattern Index](/docs/pattern-index)**: All patterns from this page in one searchable list, with context tags and links back to the originating section.
- **[Glossary](/docs/glossary)**: Definitions for all domain terms used on this page, from agent loop primitives to memory system concepts. --- # Memory and Context Source: https://claudepedia.dev/docs/memory-and-context Section: Patterns / core-systems How agent systems manage what the LLM sees, covering compaction pipeline internals, fact extraction subsystem, long-term storage with a closed taxonomy, and production insights for running memory management at scale. The model can only act on what it sees. Every piece of context (the conversation history, tool results, retrieved facts, system instructions) must fit inside a finite context window. Once the window fills, something has to give: you either compress the old content, move it to external storage, or discard it entirely. There is no fourth option. This makes memory management one of the most consequential design decisions in an agentic system. Get it wrong and you either run out of context mid-task, spend a fortune on unnecessary LLM calls to summarize content, or lose information the agent needs to finish the job. The mental model that makes this tractable is the **hierarchy of forgetting**: four levels of memory, each trading fidelity for space. Understanding which level is right for which information, and in what order to apply the cheaper options first, is what separates robust agents from fragile ones. ## The Hierarchy of Forgetting Memory exists at four levels. Each level down the hierarchy trades fidelity for space: 1. **In-context (message list):** Perfect fidelity. The full conversation history, tool results, and injected context, all of which the model can see this turn. Cost: tokens. Grows unboundedly if you let it. 2. **Summary (compressed digest):** LLM-generated condensation of old conversation segments. Loses sequential detail and exact phrasing. Saves significant space. Cost: one LLM call to generate. 3. 
**Long-term storage (fact files):** Structured facts persisted between sessions, including user preferences, project decisions, feedback, and references. Survives session end. Loses sequential context entirely because facts are extracted, not archived. 4. **Forgotten:** Information that was in-context but discarded without preservation. Zero cost, zero fidelity. ```mermaid flowchart TD A["In-Context (message list)"] -->|"compaction: summarize"| B["Summary (compressed digest)"] B -->|"long-term extraction"| C["Long-term Storage (fact files)"] C -->|"eviction / no-op"| D["Forgotten"] A -->|"cheap trim: drop directly"| D ``` The important insight is that each level is a **design choice, not a fallback**. Developers tend to treat compaction as something that happens when things go wrong. The better framing: different categories of information belong at different levels *proactively*. Ephemeral tool results (the contents of a file you just read for a one-off check) belong at level 4. Drop them early. User corrections and explicit preferences belong at level 3. Extract them to long-term storage before they get compressed away. Active working context (the current subtask, the plan the agent is executing) belongs at level 1. Matching information to its appropriate level is the core skill. ## The Compaction Pipeline When context pressure builds, the instinct is to call the LLM and summarize the conversation. That instinct is expensive and usually wrong. Most context pressure is resolvable without any LLM calls at all. 
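The cheapest of those interventions, trimming oversized tool results, can be written as a pure pass over the message list. A minimal runnable sketch; the message shape and the 4,000-character threshold are assumptions for illustration, not part of the source system:

```python
MAX_TOOL_RESULT_CHARS = 4_000  # illustrative threshold

def trim_oversized_tool_results(messages: list[dict]) -> list[dict]:
    """Replace oversized tool results with a short stub, preserving order."""
    trimmed = []
    for msg in messages:
        content = msg.get("content", "")
        if msg.get("role") == "tool" and len(content) > MAX_TOOL_RESULT_CHARS:
            # Keep a prefix plus a marker so the model knows the call happened.
            stub = content[:200] + f"\n[tool result trimmed: {len(content)} chars]"
            msg = {**msg, "content": stub}
        trimmed.append(msg)
    return trimmed
```

Because the pass is pure and costs no LLM calls, it can run on every turn; the marker text mirrors the `[image]`/`[document]` convention used for stripped attachments.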
Run interventions in cost order, cheapest first and most expensive last:

```python
function maybe_compact(messages, context_window_size):
    usage = count_tokens(messages)
    headroom_needed = context_window_size * 0.15
    if usage < context_window_size - headroom_needed:
        return messages  # no action needed

    # Level 1: trim large tool results (zero LLM cost)
    messages = trim_oversized_tool_results(messages)
    if tokens_ok(messages):
        return messages

    # Level 2: drop oldest messages (zero LLM cost)
    messages = drop_oldest_messages(messages)
    if tokens_ok(messages):
        return messages

    # Level 3: session memory compact (zero LLM cost if memory available)
    result = try_session_memory_compact(messages)
    if result:
        return result

    # Level 4: LLM-driven summarization (expensive, last resort)
    half = len(messages) // 2
    summary = await llm.summarize(messages[:half])
    return [summary_message(summary)] + messages[half:]
```

Why this ordering matters: large tool results are the most common cause of context bloat, and trimming them costs nothing. A single verbose file read or search result can consume thousands of tokens while contributing nothing to the agent's working memory after the turn in which it was used. Dropping old messages is next because the first ten turns of a long conversation are usually safe to drop once their content has been acted on. Session memory compaction comes before full LLM summarization: if structured session memory is available and non-empty, it can serve as the summary with zero LLM cost. Only when all cheaper strategies are exhausted should you pay the cost of an LLM summarization call.

**The autocompact circuit breaker.** When autocompaction fails, naive implementations retry on the next turn. If the context is irrecoverably over the limit (a single turn's tool results exceed the entire window), every retry attempt fails, burning API calls indefinitely.
The fix is a consecutive failure counter: ```python type AutoCompactTrackingState = { compacted: bool turn_counter: int consecutive_failures: int # reset to 0 on success } MAX_CONSECUTIVE_FAILURES = 3 function auto_compact_if_needed(messages, tracking): # Circuit breaker: stop retrying after N consecutive failures. # Without this, sessions where context is irrecoverably over limit # hammer the API with doomed compaction attempts on every turn. if tracking.consecutive_failures >= MAX_CONSECUTIVE_FAILURES: return {was_compacted: false} if not should_compact(messages): return {was_compacted: false} # Try session memory compact first (zero LLM cost) result = try_session_memory_compact(messages) if result: tracking.consecutive_failures = 0 return {was_compacted: true, result: result} try: result = compact_conversation(messages) tracking.consecutive_failures = 0 return {was_compacted: true, result: result} catch error: tracking.consecutive_failures += 1 return {was_compacted: false} ``` **Image stripping before compaction.** Compaction sends the message history to an LLM to summarize it. If the history contains images or embedded documents, that compaction call can itself hit the prompt-too-long error, the very problem you were trying to solve. Strip images from message history before any compaction API call, replacing them with text markers (`[image]`, `[document]`). This is especially common in sessions where users frequently attach screenshots or files. **Tool-use/tool-result pair preservation.** When choosing how many messages to keep after compaction, never split a tool-use/tool-result pair. If a kept message contains a tool result, its matching tool-use request must also be in the kept range. The API rejects conversations with dangling tool results (no matching tool-use). After choosing the compaction boundary, scan backwards to include any tool-use messages whose results are in the kept range. ## Long-Term Memory In-context and summary memory are session-scoped. 
They disappear when the session ends. Long-term memory persists. The way long-term memory degenerates is predictable: if you save everything, you save nothing. A memory store with no taxonomy becomes a junk drawer. Facts about user preferences sit next to transient error messages sit next to one-time project notes. The signal-to-noise ratio collapses, and retrieval becomes unreliable. A closed taxonomy fixes this. Define a finite set of memory types that the extraction process is allowed to create: - **User:** role, goals, responsibilities, knowledge. Always private, never shared across sessions or users. - **Feedback:** corrections AND confirmations. Save both, not just corrections. Recording only corrections causes drift toward over-caution because the model never learns what it got right. - **Project:** ongoing work, decisions, deadlines, mostly team scope. Convert relative dates to absolute on extraction ("next Tuesday" becomes the actual date), or they become meaningless later. - **Reference:** pointers to external systems such as dashboards, project trackers, Slack channels, docs. The critical constraint: don't save anything that already has an authoritative source. If it's in the codebase, it's in the codebase. If it's in git history, it's in git history. Long-term memory is for things that have no other home: preferences, decisions, and feedback that exist only in conversation. ## Fact Extraction Background extraction is not just "run a sub-agent after each turn." It's a complete subsystem with its own agent architecture, concurrency model, and failure recovery strategy. **The forked-agent pattern.** A background sub-agent runs after each turn to extract facts. This agent shares the parent's prompt cache (so it's cheap, with no cache creation cost), has a hard turn budget (5 turns), and operates with a restricted tool set that only allows reading and writing to the memory directory. 
It cannot invoke other tools, spawn sub-agents, or perform any action that could interfere with the main loop. The efficient parallel strategy: turn 1, issue all read calls in parallel for every memory file that might need updating. Turn 2, issue all write calls in parallel. Never interleave reads and writes. This completes well-behaved extractions in 2-4 turns, keeping extraction cheap even in long sessions. ```python function extract_memories(messages, last_cursor_uuid): # Mutual exclusion: if main agent already wrote memories this turn, # skip extraction and advance cursor past the write range. if has_memory_writes_since(messages, last_cursor_uuid): last_cursor_uuid = messages.last().uuid return # done, no background extraction needed this turn new_messages = messages_since(messages, last_cursor_uuid) memory_manifest = scan_memory_files(memory_dir) result = run_forked_agent( prompt: build_extract_prompt(len(new_messages), memory_manifest), can_use_tool: memory_dir_only_tool_filter, # restricted tool set max_turns: 5, # hard cap, prevents rabbit-holes shares_parent_prompt_cache: true, # cheap, no cache creation cost ) # Advance cursor ONLY after success. On failure, reconsider same messages next time if result.succeeded: last_cursor_uuid = messages.last().uuid ``` **Mutual exclusion guard.** If the main agent already wrote to memory files this turn, the background extractor skips its run and advances its cursor past those messages. The main agent and the background extractor are mutually exclusive per turn. They never both write to memory in the same turn. **The extraction cursor.** The cursor is a UUID identifying the last message that was successfully processed by the extractor. After each successful extraction run, the cursor advances to the most recent message. On failure, the cursor does NOT advance. Those messages are reconsidered on the next run. This gives you at-least-once extraction semantics: a failure never permanently loses messages. 
Edge case: if the cursor UUID is missing from the message list (compaction removed the message the cursor was pointing to), fall back to counting all model-visible messages. Never disable extraction permanently because the cursor was lost. The fallback makes it recoverable. **The entrypoint index.** A single manifest file lists all memory files. Cap this file at a line limit (200 lines) and a byte limit (25KB). Without caps, a single long-line memory file can cause the manifest to consume a disproportionate share of the context budget. When either cap is exceeded, append a warning rather than silently truncating. The warning signals that memory is getting large and should be cleaned up. ## The Context Budget The context window is not symmetric. You cannot fill it to 100% capacity. The model needs space to write its response. If you allow the input to fill the window completely, the model will truncate its own output mid-response, a failure mode that is silent, confusing, and hard to debug. Reserve a response buffer (commonly 13,000 to 15,000 tokens for models with 200K+ context windows) before triggering compaction. This means the effective context budget is: ```python effective_budget = context_window_size - response_headroom compaction_trigger = effective_budget * 0.85 # trigger at 85%, not 100% ``` Trigger compaction when you approach the effective budget, not when you hit it. Compaction itself takes tokens (function calls, results). If you wait until you're at the limit, the compaction process may push you over before it finishes. ## Production Considerations **The circuit breaker saves real money at scale.** Three consecutive compaction failures is the right cutoff because most transient failures (API timeouts, temporary overload) resolve within 1-2 retries. If you've failed 3 times, the session context is almost certainly irrecoverably over the limit, and retrying does not help. 
Without this circuit breaker, sessions in this state generate one failed compaction attempt per turn for the duration of the session. At scale, this pattern burns hundreds of thousands of API calls per day across a user base.

**MEMORY.md index truncation is not optional.** The memory manifest is injected into the prompt every turn. A 197KB manifest that stays under the 200-line cap (because each line is very long) will consume a substantial portion of your effective context budget before the conversation even begins. The byte cap catches this: when a memory file is pathologically long per line, the byte cap triggers before the line cap, preventing prompt hijacking via long memory entries.

**Image stripping is a compaction prerequisite.** In sessions where users attach images or screenshots, the accumulated message history can contain megabytes of image data. The compaction API call that's supposed to summarize this history will hit prompt-too-long before it can complete, making it impossible to compact the very content that's causing the overflow. Strip images first, then compact. The text markers (`[image]`, `[document]`) preserve conversation structure without the token cost.

**Cursor UUID fallback prevents permanent extraction failure.** Compaction can remove old messages from the message list. If the cursor was pointing to a removed message, a naive implementation would conclude "cursor not found, something went wrong" and stop extracting. The fallback (count model-visible messages) means extraction continues even when compaction has cleared the anchor message. Without this fallback, long-running sessions that compact frequently would permanently stop extracting facts after the first compaction that hits the cursor.

**Thinking block co-location at compaction boundaries.** When compaction chooses its boundary, it operates on messages.
But some LLM responses emit multiple content blocks (a thinking block and a tool-use block) that share the same message identifier but are emitted as separate streaming events. If the compaction boundary falls between these blocks, the kept range contains a tool-use with no associated thinking block, which can cause message normalization to fail. After choosing the boundary, check whether any message at or near the boundary has a sibling thinking block, and include it. ## Best Practices - **DO** run compaction in cost order: trim tool results, drop old messages, session memory compact, LLM summarize - **DO** use a closed taxonomy for long-term memory with four types: user, feedback, project, reference - **DO** track consecutive compaction failures and circuit-break after 3 - **DO** strip images and documents from message history before compaction API calls - **DO** record both corrections and confirmations in feedback memory. Corrections-only causes drift toward over-caution. - **DO** convert relative dates to absolute during extraction ("next Tuesday" becomes the ISO date) - **DO** cap the memory manifest at a line limit and byte limit - **DON'T** save facts that have an authoritative source elsewhere (git, codebase, external system) - **DON'T** split tool-use/tool-result pairs when choosing compaction boundaries - **DON'T** trigger compaction at 100% capacity. Trigger at 85% of effective budget to leave room for the compaction process itself. - **DON'T** permanently disable extraction when the cursor UUID is missing. Use the count-based fallback. - **DON'T** give the background extractor unrestricted tool access. Restrict it to memory directory operations only. ## Related - **[Agent Loop Architecture](/docs/agent-loop)**: The agent loop produces the message list that grows across turns and eventually causes the context pressure this page teaches you to manage. 
- **[Prompt Architecture](/docs/prompt-architecture)**: The prompt fills the static portion of the context window before any conversation begins. Memory content is injected into the dynamic zone each turn, so prompt structure and memory size share the same budget. - **[Tool System](/docs/tool-system)**: Tool results are the primary driver of context bloat. Understanding tool result sizing and trimming connects directly to the compaction pipeline. - **[Error Recovery](/docs/error-recovery)**: The circuit breaker pattern here (consecutive failure counter that disables retries) mirrors the circuit breaker pattern for API retries. Both protect against runaway retry loops, and both use the same failure-count-with-success-reset structure. - **[Observability and Debugging](/docs/observability-and-debugging)**: Compaction events, context budget metrics, and cost-per-model tracking all feed into the observability layer. When compaction misfires or context pressure spikes unexpectedly, the session tracing and event log covered on that page are the primary debugging surface. - **[Pattern Index](/docs/pattern-index)**: All patterns from this page in one searchable list, with context tags and links back to the originating section. - **[Glossary](/docs/glossary)**: Definitions for all domain terms used on this page, from agent loop primitives to memory system concepts. --- # Error Recovery and Resilience Source: https://claudepedia.dev/docs/error-recovery Section: Patterns / core-systems How agent systems handle failure through a tiered escalation ladder, retryability classification, query-source partitioning, and a tool error pipeline that keeps the agent loop running even when individual tools fail. Tool calls fail. Networks drop packets, APIs return 429s, services go down, code has bugs. An agent without a recovery strategy has two failure modes: it retries forever (running up costs and blocking the task indefinitely), or it crashes on the first error and returns nothing. 
Neither is acceptable in production. The question isn't whether failures happen. They will. The question is what the system does next. The answer is an **escalation ladder**: four responses to failure, ordered by cost, applied in sequence from cheapest to most extreme. Understanding the ladder (what each rung handles, what it costs, and when to stop climbing) is the core mental model for building resilient agents. But the ladder is not the whole story. A production error recovery system also needs to know *which* errors are worth retrying at all, *who* is waiting for the result, and what happens when the failure is inside a tool rather than in the API request. This page covers all three: the escalation ladder, retryability classification, query-source partitioning, the tool error pipeline, and the production details that separate a resilient agent from one that just retries everything. ## The Escalation Ladder When a tool call fails, we have four options: 1. **Retry**: same operation, same path, wait and try again. Cost: latency. Use for transient failures: network blips, rate limits, momentarily overloaded services. 2. **Fallback**: different implementation of the same capability. Cost: reduced quality or slower path. Use when the primary path has permanently failed. 3. **Degrade**: remove the capability from this session's available tools. Cost: lost feature. Use when the fallback has also failed and the task can still complete without this capability. 4. **Fail**: stop entirely and return an error. Cost: lost task. Use when continuing would cause more harm than stopping, or when there is no way forward. 
```mermaid flowchart TD A[Tool call fails] --> B{Retries remaining?} B -->|Yes, transient failure?| C[Retry with backoff\ncost: latency] C --> D{Succeeded?} D -->|Yes| E[Continue] D -->|No, retries exhausted| F{Fallback available?} B -->|No retries left| F F -->|Yes| G[Use fallback path\ncost: reduced quality] G --> H{Succeeded?} H -->|Yes| E H -->|No| I{Task survives without this?} F -->|No fallback| I I -->|Yes| J[Degrade, remove capability\ncost: lost feature] J --> E I -->|No, unrecoverable| K[Fail, return error\ncost: lost task] ``` Here is the tiered recovery function in pseudocode: ```python function execute_with_recovery(operation, config): # Rung 1: retry (cost: latency) for attempt in range(config.max_retries): result = await try_operation(operation) if result.succeeded: return result if not result.is_retryable: break await backoff(attempt, config.base_delay_ms) # Rung 2: fallback (cost: reduced quality) if config.fallback is not None: result = await try_operation(config.fallback) if result.succeeded: return result # Rung 3: degrade (cost: lost capability) if config.is_optional: log_degradation(operation.name, "skipping for session") return DegradedResult(capability=operation.name) # Rung 4: fail (cost: lost task) raise RecoveryExhausted(operation=operation.name, last_error=result.error) ``` Each rung has concrete semantics worth naming: **Retry with backoff.** Exponential backoff with jitter is the standard implementation, not linear delay and not a fixed interval. A fixed 1-second delay under load still overwhelms a struggling service. Jitter spreads the retry storm. The full jitter formula is in Production Considerations below. Check `is_retryable` before entering the retry loop. A 400 Bad Request is generally not retryable, while a 503 Service Unavailable is. The full classification logic is in the section on which errors are retryable below. 
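As a concrete instance of retry-with-backoff, here is a full-jitter delay in runnable Python. The base and cap values are illustrative, not from the source system:

```python
import random

def full_jitter_delay_ms(attempt: int, base_ms: float = 500.0,
                         cap_ms: float = 30_000.0) -> float:
    """Delay before retry number `attempt` (0-indexed): a uniform draw
    from [0, min(cap, base * 2^attempt)]. The randomness spreads retry
    storms across time; the cap bounds worst-case latency."""
    ceiling = min(cap_ms, base_ms * (2 ** attempt))
    return random.uniform(0.0, ceiling)
```

Compared to a fixed interval, the uniform draw means concurrent clients that failed together do not retry together, which is the property that keeps a struggling service from being re-overwhelmed.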
**Fallback path.** The fallback is a different implementation of the same capability: a smaller model, a slower API, a cached result, a heuristic approximation. The fallback's contract is: it produces something useful, but not as good as the primary path. Document that degradation explicitly. If the fallback is "pretend the tool succeeded", we've hidden the failure, not recovered from it.

**Degrade gracefully.** Degradation means the agent continues the task without this capability. This is only safe when the task is designed to tolerate partial tool sets. Before deploying an agent, think through which tools are essential (no degradation path) and which are optional (degradation is acceptable). Failing to make this distinction upfront means the degradation logic will be wrong in production.

**Fail cleanly.** A clean failure is better than a confused recovery. When raising at the bottom of the ladder, include the operation name, the last error, and the number of attempts made. The caller (a coordinator, a user, or a monitoring system) needs that information to decide what to do next.

## Circuit Breakers

Tiered recovery handles individual failures. Circuit breakers handle *patterns* of failure. Without a circuit breaker, a service that's completely down will cause every call to exhaust its full retry budget before escalating. If fifty tool calls are queued, each making an initial attempt plus three retries with backoff, we've turned one failure into two hundred failed attempts and a multi-minute delay. Circuit breakers prevent this.
A circuit breaker wraps a service and maintains three states:

```python
function call_with_circuit_breaker(service, request, breaker):
    if breaker.state == OPEN:
        if now() < breaker.retry_after:
            raise CircuitOpen(service=service.name, retry_at=breaker.retry_after)
        breaker.state = HALF_OPEN  # cooldown elapsed: allow one probe through

    result = await service.call(request)

    if result.succeeded:
        breaker.record_success()
        if breaker.state == HALF_OPEN:
            breaker.close()  # service recovered, reset to closed
    else:
        breaker.record_failure()
        if breaker.state == HALF_OPEN or breaker.failure_rate > breaker.threshold:
            breaker.open(retry_after=now() + breaker.cooldown)  # reopen, restart cooldown

    return result
```

The three states:

- **Closed**: normal operation. Calls go through. Failures are counted against the threshold.
- **Open**: the service is failing. Calls are rejected immediately without attempting the service. This prevents the retry storm.
- **Half-open**: the cooldown has elapsed. One probe call is allowed through. If it succeeds, the circuit closes. If it fails, the circuit reopens and the cooldown resets.

The circuit breaker belongs *above* the tiered recovery function. It's the first thing evaluated. If the circuit is open, skip straight to fallback or degrade, without burning retry budget on a service that's known to be down.

One important distinction: a circuit breaker changes *state* (it blocks future calls until the service recovers). A model fallback, covered in Production Considerations, changes *identity*. It switches to a different model rather than blocking calls. They solve different problems and can coexist in the same system.

## Fail-Closed Defaults

The escalation ladder assumes our system is fail-closed by default. When no recovery policy is defined for a given failure, the system defaults to the most restrictive behavior.
This is the same asymmetric cost argument from tool system design: **the cost of treating a safe operation as unsafe is a small performance hit, while the cost of treating an unsafe operation as safe is data corruption or worse.** When writing recovery configuration, missing values should default to no-retry, no-fallback, mandatory failure, not to unlimited retries with permissive fallbacks. A build_tool factory that enforces this centrally is cleaner than scattering defaults across tool definitions: ```python function build_tool(name: str, handler, config: ToolConfig) -> Tool: return Tool( name: name, handler: handler, # fail-closed recovery defaults: max_retries: config.max_retries ?? 0, # default: no retries fallback: config.fallback ?? None, # default: no fallback is_optional: config.is_optional ?? false, # default: required (no degrade) require_permission: config.require_permission ?? true, # default: ask is_destructive: config.is_destructive ?? true, # default: assume harmful ) ``` When `is_optional` is absent, the tool is treated as essential. It fails rather than degrades. When `max_retries` is absent, there are no retries. This is the same fail-closed logic that makes tool registration safe: the defaults are the most restrictive choices, and configuration is opt-in. See [Tool System Design](/docs/tool-system) for the full treatment of fail-closed defaults and why the asymmetry makes them a safety requirement, not just a convention. ## Which Errors Are Retryable? Not all errors should be retried. Entering the retry loop without first asking "is this error fixable?" wastes latency budget and can cause harm. Retrying a malformed request doesn't fix the malformation. It just produces N copies of the same failure. A production retry system classifies each error before attempting recovery. 
The classification decision tree: | Status | Retry behavior | Reason | |--------|----------------|--------| | 400 (bad request) | Only if context overflow, and only after adjusting max_tokens | Other 400s are not fixable: the request is malformed | | 401 (auth) | Yes, after refreshing credentials | The API key cache may be stale. Clearing it is worth one retry. | | 408 (timeout) | Always | Transient: the service may have recovered | | 409 (conflict) | Always | Transient lock contention: retry with backoff | | 429 (rate limit) | Enterprise/PAYG only | Subscription users hit window-based limits (not retryable within the window) | | 529 (overloaded) | Foreground operations only | Background operations should fail fast (see Query-Source Partitioning) | | 5xx (server error) | Yes, unless server says no | Respect `x-should-retry: false` header. The server knows its state. | | Connection reset | Yes, after disabling keep-alive | ECONNRESET/EPIPE often indicate the connection was reused after expiry | In pseudocode, the retryability function looks like this: ```python function should_retry(error, attempt, max_retries, query_source): # Never retry beyond the limit (unless persistent mode) if attempt > max_retries: raise NonRetryableError(error) # 400: only retry if it's a context overflow we can fix if error.status == 400: overflow = parse_context_overflow(error) if overflow: adjust_max_tokens(overflow.available_context) return true # retry with adjusted max_tokens raise NonRetryableError(error) # other 400s are not fixable # 529 (overloaded): only retry foreground operations if error.status == 529: if query_source not in FOREGROUND_RETRY_SOURCES: raise NonRetryableError(error) # background: fail fast, no amplification # 429 (rate limit): check subscription tier if error.status == 429: if is_subscription_user and not is_enterprise: raise NonRetryableError(error) # subscription users have window limits # 401: refresh credentials, then retry if error.status == 401: 
refresh_credentials() return true # 5xx, 408, 409, connection errors: generally retryable if error.status in {408, 409} or error.status >= 500: if error.headers.get("x-should-retry") == "false": raise NonRetryableError(error) # server directive overrides our logic return true return false ``` **Adaptive max_tokens on context overflow.** The 400 path deserves special attention. When a context overflow error occurs, the API response includes the actual token counts: how many tokens the input contained, the model's maximum, and how much space remains. We can parse these values from the error message, compute the available output budget, and reduce `max_tokens` for the retry. This converts what looks like a fatal error into a recoverable one without needing to compact or truncate the conversation. The floor matters: if the available output budget is below ~3,000 tokens, attempting the retry would produce a response too short to be useful. At that point, fail rather than retry with an unusably small response budget. Adaptive max_tokens is a way to squeeze more life out of a long conversation. It is not a substitute for proper context management. ## The Tool Error Pipeline Tool execution has its own error model, distinct from the API retry system. The key distinction: **tool errors become messages, not exceptions.** In the API retry system, a failure causes a delay and a retry. In the tool error pipeline, a failure yields a `tool_result` message into the conversation history. The model reads that message on the next turn and can adapt: try a different tool, ask the user for input, or reformulate the request. The agent loop continues. This is the invariant that makes tool errors recoverable: **every `tool_use` must have a matching `tool_result`, even on failure.** A conversation with a dangling `tool_use` and no `tool_result` is invalid. The API will reject it. The tool error pipeline ensures that invariant is maintained regardless of what goes wrong during execution. 
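Stripped to its core, the invariant is a wrapper that can never raise and always emits a result. A minimal runnable sketch, with message shapes simplified as assumptions:

```python
def run_tool_safely(tool_call: dict, registry: dict) -> dict:
    """Return a tool_result for this tool_use no matter what goes wrong."""
    tool = registry.get(tool_call["name"])
    if tool is None:
        return {"type": "tool_result", "tool_use_id": tool_call["id"],
                "content": f"Error: No tool named '{tool_call['name']}' is available",
                "is_error": True}
    try:
        output = tool(**tool_call["input"])
        return {"type": "tool_result", "tool_use_id": tool_call["id"],
                "content": str(output), "is_error": False}
    except Exception as exc:
        # No re-raise: the failure becomes a message the model can read.
        return {"type": "tool_result", "tool_use_id": tool_call["id"],
                "content": f"Tool '{tool_call['name']}' failed: {exc}",
                "is_error": True}
```

Every return path carries the `tool_use_id`, so the conversation never contains a dangling `tool_use`; the fuller pseudocode that follows adds the abort and permission paths on top of this skeleton.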
There are four tool error paths: ```python async function run_tool(tool_call, tool, context): # Path 1: Unknown tool, yield error message, continue if not tool: yield create_tool_result( tool_use_id: tool_call.id, content: f"Error: No tool named '{tool_call.name}' is available", is_error: true ) return # loop continues: model sees the error and can adapt # Path 2: Abort, yield cancel message, return cleanly if context.abort_signal.aborted: yield create_tool_result( tool_use_id: tool_call.id, content: "Operation cancelled", is_error: false ) return # abort propagates through message history without corruption # Path 3 and 4: Permission check + execution permission = check_permission(tool, tool_call.input) if permission.denied: yield create_tool_result( # Path 3: Permission denied tool_use_id: tool_call.id, content: f"Permission denied: {permission.reason}", is_error: true ) return # model can reframe the request or ask the user try: async for result in execute_tool(tool, tool_call, context): yield result except error: yield create_tool_result( # Path 4: Execution failure tool_use_id: tool_call.id, content: f"Tool '{tool.name}' failed: {error.message}", is_error: true ) # No re-raise: the loop continues with complete message history ``` The four paths: - **Unknown tool**: the model requested a tool that doesn't exist in the current tool set. Yield an error `tool_result`. The model sees this and can adapt: try a different tool, reformulate the request, or ask the user. - **Abort**: the user or coordinator cancelled the operation mid-execution. Yield a cancel `tool_result` and return cleanly. Abort propagates through the message history without corruption. The conversation remains valid. - **Permission denied**: the permission middleware rejected the call. Yield a `tool_result` with the denial reason. The model can reframe the request or ask the user for explicit permission. This is not the same as an execution failure because the tool was never called. 
- **Execution failure**: the tool ran but threw an error. Catch at the outer boundary, yield a `tool_result` with error detail. No re-raise. The loop continues. The model has complete context for the next turn: it knows what it tried, why it failed, and can choose a different path. The unifying principle: **the agent loop's message stream must never contain a gap.** Every `tool_use` has a `tool_result`, even on failure. This is what makes tool errors recoverable by design rather than by luck. ## Query-Source Partitioning Not all operations deserve the same retry behavior during capacity events. When a rate-limit or overload error occurs, treating all operations equally (retrying everything) can amplify the failure rather than recover from it. **The problem:** background operations (title generation, confidence scoring, suggestion ranking) run in parallel with the main agent loop. If a capacity event causes 529 errors, and each background operation retries three times, and there are N operations running in parallel, the cascade doesn't self-heal. It gets N times worse. The original failure triggers N times max_retries additional requests against an already overwhelmed service. **The solution: partition operations by who is waiting for the result.** - **Foreground operations**: the user is blocking on the result. These are worth retrying on capacity errors because the user experience degrades visibly if they fail. - **Background operations**: the user never sees these results directly. These should fail fast on capacity errors. The cost of failure is invisible to the user. The cost of retrying is paid by the service under load. The partition should be explicit: a foreground allowlist. Everything not on the allowlist defaults to fail-fast. This is the fail-closed principle applied to retry policy. Don't assume an operation deserves retries. Require it to be declared. 
```python
FOREGROUND_RETRY_SOURCES = {
    "main_agent",
    "user_request",
    "coordinator_task",
}

function get_retry_policy(query_source: str) -> RetryPolicy:
    if query_source in FOREGROUND_RETRY_SOURCES:
        return RetryPolicy(
            retry_on_capacity_error: true,
            max_retries: 3,
        )
    else:
        return RetryPolicy(
            retry_on_capacity_error: false,  # fail fast, no amplification
            max_retries: 0,
        )
```

The insight here is asymmetric: the benefit of retrying a background operation is low (the user doesn't see the result), and the cost is high (it amplifies capacity events). The default for anything not explicitly in the foreground allowlist is no retry on capacity errors. Adding operations to the foreground list is a deliberate act. It means "the user is blocking on this result and deserves a retry."

## Production Considerations

**The jitter formula with real numbers.** The standard exponential backoff formula with jitter:

```python
base_delay = min(500ms * 2^(attempt - 1), 32s)
jitter = random() * 0.25 * base_delay
total_delay = base_delay + jitter
```

With 500ms base and 32s maximum: attempt 1 waits 500–625ms, attempt 2 waits 1–1.25s, attempt 3 waits 2–2.5s, eventually capping at 32–40s. The 25% jitter range spreads retries across an 8-second window at maximum delay rather than a synchronized thundering herd.

Fixed delays without jitter cause retry storms: if a hundred clients all fail at the same moment and all retry after exactly 1 second, they all hit the recovering service at exactly the same moment. The second wave is as bad as the first.

The server can override this calculation entirely. When the response includes a `retry-after` header, honor it. The server knows its own cooldown period better than any client formula does. Use the server-provided delay instead of the backoff calculation.
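Combined with the server override, the delay computation reduces to a few lines. A runnable sketch (the function and parameter names are illustrative, not a real client API):

```python
import random

BASE_MS = 500
CAP_MS = 32_000
JITTER_FRACTION = 0.25

def compute_delay_ms(attempt, retry_after_ms=None):
    """Delay in milliseconds before retry `attempt` (1-based)."""
    if retry_after_ms is not None:
        return retry_after_ms  # server directive beats the client formula
    base = min(BASE_MS * 2 ** (attempt - 1), CAP_MS)
    jitter = random.random() * JITTER_FRACTION * base
    return base + jitter

# attempt 1 lands in [500, 625), attempt 3 in [2000, 2500), attempt 8+ in [32000, 40000)
```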
The exception: the 6-hour session cap still applies in persistent retry mode (below) to prevent a pathological `retry-after` value from waiting indefinitely. **Persistent retry mode for unattended sessions.** For CI pipelines and automated sessions where no user is watching, a persistent retry mode with unlimited retries and a hard time cap (6 hours) is more appropriate than a fixed retry count. The challenge: if the next retry is 30 minutes away (service outage), simply sleeping for 30 minutes will cause the host environment to mark the session idle and terminate it. The solution: break long sleeps into short chunks (30 seconds each), and yield a heartbeat event to the host environment after each chunk. This keeps the session alive during extended waits. When the retry-after header specifies a long wait, chunk it: ```python function sleep_with_heartbeat(delay_ms: int, heartbeat_fn): chunk_ms = 30_000 # 30-second chunks remaining = delay_ms while remaining > 0: sleep(min(chunk_ms, remaining)) heartbeat_fn() # keeps the session alive remaining -= chunk_ms ``` The 6-hour cap is the critical safety valve. Without it, an automated session could wait indefinitely if the service never recovers. With the cap, the session eventually gives up and reports failure. The operator can investigate and retry manually. **Model fallback on repeated capacity errors.** After N consecutive overload errors on the primary model, trigger a fallback to a different model. This is distinct from a circuit breaker: the circuit breaker prevents calls to a failing service, while the model fallback switches to a different model identity. The fallback model still serves the same request. It just uses a different model name. The fallback trigger is a specific signal (not a generic error) because the caller needs to handle it differently: switch the model name and retry the original request, rather than entering the standard retry loop. 
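A consecutive-overload counter that raises a distinct fallback signal, rather than a generic retryable error, might be sketched like this (the class names, the threshold of 3, and the exception-based signaling are illustrative assumptions, not the real implementation):

```python
FALLBACK_THRESHOLD = 3  # consecutive 529s before switching models (assumed value)

class ModelFallbackSignal(Exception):
    """Not a retry: the caller should switch model names and resend the request."""
    def __init__(self, fallback_model):
        super().__init__(f"fall back to {fallback_model}")
        self.fallback_model = fallback_model

class OverloadTracker:
    def __init__(self, fallback_model):
        self.fallback_model = fallback_model
        self.consecutive_overloads = 0

    def record(self, status):
        if status == 529:
            self.consecutive_overloads += 1
            if self.consecutive_overloads >= FALLBACK_THRESHOLD:
                raise ModelFallbackSignal(self.fallback_model)
        else:
            self.consecutive_overloads = 0  # any non-overload response resets the count
```

The caller catches `ModelFallbackSignal` outside the normal retry loop, swaps the model name, and reissues the original request; a circuit breaker, if one is active, keeps tracking the primary model independently.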
The primary model circuit breaker and the model fallback are orthogonal. Both can be active at the same time. **Parse error messages for exact token counts.** When a context overflow error occurs, the API response includes the precise token counts: how many tokens the input contained, the model's maximum, and the available output budget. Do not guess these values. Parse them from the error message, compute the available space, and set `max_tokens` exactly for the retry. Guessing too high repeats the 400. Guessing too low wastes output budget. The API is giving us the exact answer. Use it. ## Best Practices - **DO** classify errors before retrying. Not all errors are retryable. A 400 bad request is a bug in our request, not a transient failure. - **DO** partition operations into foreground (retry on capacity errors) and background (fail fast on capacity errors). The default should be fail fast. - **DO** use jitter in exponential backoff. Fixed delays cause synchronized retry storms when many clients fail simultaneously. - **DO** yield error `tool_result` messages instead of raising exceptions from tool execution. The message stream must remain valid for the next turn. - **DO** honor the server's `retry-after` header when present. The server's timing beats our formula. - **DO** have a hard time cap on persistent retry mode (6 hours prevents unbounded waiting in unattended sessions). - **DO** break long sleeps into 30-second chunks with heartbeats in persistent mode. Host environments terminate idle sessions. - **DON'T** retry 400 errors unless we can fix the request. Parse the error, check if it's a context overflow, and only retry if max_tokens adjustment is possible. - **DON'T** retry background operations during capacity cascades. Gateway amplification turns one failure into N times max_retries failures. - **DON'T** let tool errors bubble up as exceptions. Every `tool_use` must have a matching `tool_result`, even on failure. 
- **DON'T** set the adaptive max_tokens floor too low. Below ~3,000 tokens, the response is too short to be useful. Fail rather than produce useless output. - **DON'T** conflate model fallback with circuit breaking. They solve different problems (identity vs. state) and can coexist. ## Related - **[Tool System Design](/docs/tool-system)**: Fail-closed defaults originated in tool metadata design: when safety flags are missing, the system defaults to the most restrictive interpretation. That same principle applies to recovery configuration. Also covers the tool lifecycle, which connects directly to the tool error pipeline because execution failure handling is part of the dispatch boundary. - **[Agent Loop](/docs/agent-loop)**: The agent loop must survive tool errors. The tool error pipeline keeps the loop running by maintaining a valid message stream even when tools fail. Understanding the loop's turn structure clarifies why a dangling `tool_use` without a `tool_result` breaks everything. - **[Memory and Context](/docs/memory-and-context)**: The autocompact circuit breaker uses the same consecutive-failure pattern as the API circuit breaker: after 3 consecutive compaction failures, autocompact is disabled for the session. Context overflow errors and adaptive max_tokens adjustment are the two systems' meeting point. - **[Prompt Architecture](/docs/prompt-architecture)**: Retry behavior and error thresholds can be calibrated in the system prompt. Prompt design affects how the model responds to tool errors in its message history. A well-designed identity section helps the model adapt when tools fail rather than spinning or giving up. - **[Safety and Permissions](/docs/safety-and-permissions)**: Permission denials are a common error condition that the recovery system must handle. The denial tracking threshold (3 consecutive / 20 total) triggers mode escalation, which feeds back into the error recovery pipeline. 
- **[Observability and Debugging](/docs/observability-and-debugging)**: The event log records tool errors, hook failures, and permission denials as first-class events. These logged signals are the raw data that error recovery classifies as retryable or permanent. - **[Pattern Index](/docs/pattern-index)**: All patterns from this page in one searchable list, with context tags and links back to the originating section. - **[Glossary](/docs/glossary)**: Definitions for all domain terms used on this page, from agent loop primitives to memory system concepts. --- # Safety and Permissions Source: https://claudepedia.dev/docs/safety-and-permissions Section: Patterns / core-systems How a six-source permission cascade, five permission modes, and three resolution handlers keep agents within bounds, with denial tracking, shadow rule detection, and multi-agent permission forwarding. Agents act autonomously, and that means they can cause damage autonomously. A file-writing agent that overwrites the wrong directory, an API-calling agent that sends unauthorized requests, a search agent that leaks data it shouldn't have accessed: these failures aren't hypothetical. They happen when the permission model is too simple. The naive model is a single yes/no check before each action: does this agent have permission? But "this agent" isn't a single thing. It might be operating on behalf of a user, inside a specific project, under a policy set by an organization, with permissions explicitly granted during the current session. The permission to act comes from multiple sources at once, and they don't always agree. The model that makes this tractable is a **cascade**: a prioritized sequence of six policy sources, where the first matching rule wins and no match means deny. Every decision is logged with its source and reason type, so the cascade is auditable, not just mechanical. 
Beyond the cascade sits a second layer: **permission modes** that provide a global semantic override (planning mode auto-denies all writes, bypass mode auto-approves everything, silent deny mode refuses all unlisted actions). And a third layer: **resolution handlers** that route the decision differently depending on whether you're in an interactive session, a coordinator, or a headless worker. These three layers (cascade, mode, handler) work together to make agent permission behavior both predictable and auditable. ## The Permission Cascade The cascade evaluates six ordered sources. The first source that has an opinion on an action wins. If nothing matches, the default is DENY (fail-closed). ```mermaid flowchart TD A[Agent requests action] --> B{Bypass-immune check\npasses?} B -->|No, always blocked| Z[DENY, unconditional] B -->|Yes| C{policySettings\nmatch?} C -->|DENY| Z C -->|ALLOW| Y[ALLOW] C -->|No match| D{projectSettings\nmatch?} D -->|DENY| Z D -->|ALLOW| Y D -->|No match| E{localSettings\nmatch?} E -->|DENY| Z E -->|ALLOW| Y E -->|No match| F{userSettings\nmatch?} F -->|DENY| Z F -->|ALLOW| Y F -->|No match| G{cliArg\nmatch?} G -->|DENY| Z G -->|ALLOW| Y G -->|No match| H{session\nmatch?} H -->|DENY| Z H -->|ALLOW| Y H -->|No match| I[Default: DENY\nfail-closed] ``` The six sources in priority order: 1. **policySettings**: enterprise-managed rules pushed to all users. Highest priority. Cannot be overridden by any lower source. 2. **projectSettings**: committed to version control and shared with the whole team. Narrower than policy but broader than personal settings. 3. **localSettings**: gitignored per-project personal overrides. Set by the developer for their own use on this project only. 4. **userSettings**: global personal settings (e.g., `~/.agent` config). Apply across all projects for this user. 5. **cliArg**: `--allow` / `--deny` flags passed at launch time. Convenient for one-off sessions without modifying config files. 6. 
**session**: permissions granted interactively during this conversation. Lowest priority. Expire when the session ends. The six-source cascade evaluation in pseudocode: ```python # Every matched rule carries source, behavior, and decision reason for audit RULE_SOURCES = [ "policySettings", # enterprise-managed, pushed to all users "projectSettings", # committed to git, shared with team "localSettings", # gitignored per-project, personal "userSettings", # global ~/.agent settings, personal "cliArg", # --allow / --deny at launch time "session", # granted during this conversation, expires on exit ] function evaluate_permission(action, context) -> Decision: # Bypass-immune checks first: cannot be overridden by any source if not passes_scope_bounds(action, context.scope): return DENY(source="scope_check", reason="scope_bounds_violation") # Build rule lists from all sources all_deny_rules = flatten(context[src].deny_rules for src in RULE_SOURCES) all_ask_rules = flatten(context[src].ask_rules for src in RULE_SOURCES) all_allow_rules = flatten(context[src].allow_rules for src in RULE_SOURCES) # Deny checked first: most restrictive wins within source priority for rule in all_deny_rules: if rule.matches(action): log(source=rule.source, behavior="deny", reason="deny_rule") return DENY(f"deny rule from {rule.source}") # Ask next: user will be prompted for rule in all_ask_rules: if rule.matches(action): log(source=rule.source, behavior="ask", reason="ask_rule") return ASK(reason=rule) # Allow last for rule in all_allow_rules: if rule.matches(action): log(source=rule.source, behavior="allow", reason="allow_rule") return ALLOW # No match: apply mode-based default (fail-closed if mode == "default") return mode_default_decision(context.mode, action) ``` Three design choices embedded here are worth naming explicitly: **First-match wins.** Each source returns ALLOW, DENY, or ABSTAIN (no opinion). The cascade stops at the first non-ABSTAIN decision. 
Given any action and context, you can trace exactly which source made the decision. There's no ambiguity and no averaging. **Deny beats allow within a source.** When building the rule lists, deny rules are checked before ask rules, which are checked before allow rules. Within any single source, the most restrictive opinion wins. **Bypass-immune checks are not policy.** Scope bounds checking (is the agent trying to act outside its designated directory?) runs before the cascade and cannot be overridden by any source. If scope checks were inside the cascade, a sufficiently privileged policy source could override them, making "scope" meaningless. Bypass-immune checks run unconditionally. **Every decision is logged.** The audit trail carries the matched source, the behavior (allow/deny/ask), and the reason type (rule/hook/classifier/mode/safety_check). This is what makes the cascade inspectable when debugging unexpected denials. ## Permission Modes Permission modes sit on top of the cascade and provide a global semantic override. When a mode's rule fires, it bypasses normal cascade evaluation for that action type. 
Five modes are externally addressable: ```python # The five externally-addressable permission modes type PermissionMode = | "default" # ask for all unlisted tools (standard interactive behavior) | "plan" # read-only: all tools with write effects auto-denied | "acceptEdits" # file edit tools auto-approved, everything else still asks | "bypassPermissions" # all tools auto-approved without prompting (requires explicit availability) | "dontAsk" # all unlisted tools auto-denied silently, no prompt # Mode to default decision when no cascade rule matches function get_mode_default(mode: PermissionMode, tool: Tool) -> Decision: match mode: "plan": if tool.has_write_effect: return DENY("plan mode: write tools blocked") else: return ASK # reads still ask: plan mode is not auto-approve for reads "acceptEdits": if tool.is_file_edit: return ALLOW else: return ASK "bypassPermissions": return ALLOW # all tools, no questions "dontAsk": return DENY # all unlisted tools, silently "default": return ASK # standard prompt ``` Mode cycling is deterministic: `default` to `acceptEdits` to `plan` to `bypassPermissions` to `default`. The cycling order doesn't represent a severity gradient. It's a UI cycle. `bypassPermissions` requires explicit availability (guarded by a feature gate checked at startup), so agents in restricted environments skip it in the cycle. **`dontAsk` vs `bypassPermissions`** are both "no prompt" modes with opposite defaults. `dontAsk` silently denies all unlisted tools. The agent looks stuck but won't bother the user. `bypassPermissions` silently approves everything, providing maximum automation. Confusing them has real consequences. ## Three-Handler Architecture Permission resolution follows different paths depending on execution context. There are three handler types, and the handler is determined by context, not by rule configuration. 
```python # Handler selection based on execution context function resolve_permission(tool, input, context) -> Decision: if context.is_swarm_worker: return swarm_worker_handler(tool, input, context) if context.is_coordinator_mode: return coordinator_handler(tool, input, context) return interactive_handler(tool, input, context) # Interactive handler: races multiple resolution paths concurrently function interactive_handler(tool, input, context) -> Decision: resolution = create_resolution_promise() # All four resolution paths run concurrently: first to resolve wins spawn: permission_hooks(tool, input, context) # fast, local rule evaluation spawn: classifier(tool, input, context) # LLM-based classification (slower) spawn: bridge_response(tool, input, context) # remote UI (e.g. web interface) spawn: channel_relay(tool, input, context) # external channels (e.g. Telegram) show_dialog(tool, input, context) # user interaction: always present as floor return await first_to_resolve(resolution) # Coordinator handler: sequential, no racing function coordinator_handler(tool, input, context) -> Decision: if decision = run_permission_hooks(tool, input, context): return decision if decision = run_classifier(tool, input, context): return decision return show_dialog_and_wait(tool, input, context) # Swarm worker handler: forward to leader if classifier can't decide function swarm_worker_handler(tool, input, context) -> Decision: if decision = try_classifier(tool, input, context): return decision # classifier auto-approved or auto-denied # Can't show UI in headless worker: forward to leader request = create_permission_request(tool, input) register_callback(request.id) # register BEFORE sending (race prevention) send_to_leader_mailbox(request) show_pending_indicator(tool) return await leader_response(request.id, context.abort_signal) ``` The interactive handler races four resolution paths simultaneously. 
This means a fast local rule (hooks) can approve an action before the user even sees a dialog. The 200ms grace period: if a keypress arrives within 200ms of the permission dialog appearing, it's treated as a pre-existing keystroke from a previous command, not as user interaction. The classifier is still allowed to auto-approve during that window, preventing accidental keypresses from canceling the classifier check prematurely. The coordinator handler runs sequentially: hooks, then classifier, then dialog. No racing, because coordinator sessions are designed for predictable sequential tool use. **Why does the interactive handler race its paths?** Speed. In a typical interactive session, the user is watching and fast hooks can approve a safe action (like reading a file that's already in the project) before the dialog even renders. The user sees the tool result without ever seeing a prompt. Racing the four paths achieves this without sacrificing the safety floor. The user dialog is always "in the race" and will show if no other path resolves first. The classifier is the expensive path (LLM inference), and the hooks are cheap (local rule matching). Racing them means fast paths pre-empt slow ones without needing to explicitly time-box the classifier. The swarm worker handler is covered fully in [Multi-Agent Permission Forwarding](#multi-agent-permission-forwarding). ## Bypass-Immune Checks and Mode Guards Some safety checks run before the cascade and cannot be overridden by any policy source, any mode, or any rule. **Scope bounds checking** is the canonical example. If the agent is constrained to a specific working directory, writing outside that directory is denied unconditionally, not as a policy decision but as an architectural constraint. If scope checking were inside the cascade, `bypassPermissions` mode could override it. Bypass-immune checks survive mode transitions by design. 
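Scope bounds checking for filesystem tools reduces to a path-containment test. A minimal sketch using the standard library (real implementations also have to handle symlinks, case-insensitive filesystems, and relative inputs; here the target is assumed to be an absolute path):

```python
from pathlib import Path

def is_within_scope(scope_dir: str, target_path: str) -> bool:
    """True if target_path resolves to scope_dir or somewhere inside it."""
    scope = Path(scope_dir).resolve()
    target = Path(target_path).resolve()  # collapses any '..' components
    return target == scope or scope in target.parents

print(is_within_scope("/workspace/project", "/workspace/project/src/main.py"))  # True
print(is_within_scope("/workspace/project", "/workspace/project/../secrets"))   # False
```

The `resolve()` call is the load-bearing part: comparing unresolved strings would let `..` segments escape the scope while still matching a naive prefix check.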
**`bypassPermissions` mode is itself guarded.** The mode exists for automated pipelines, but it's controlled by a feature gate checked at startup. If the gate is closed (default in most deployments), the mode is unavailable and removed from the cycle. Entering bypass mode without explicit authorization is blocked at the infrastructure level. **Dangerous pattern stripping** applies in certain permissive modes. Rules matching patterns like interpreters (python, node, ruby), shells (bash, zsh), code runners (npx, eval, exec, sudo), and similar are stripped before evaluation. This prevents a broad rule like `allow python:*` from acting as a blanket code execution bypass in modes that auto-approve unlisted tools. The invariant: bypass-immune checks protect properties that no policy source may waive. If you find yourself thinking "but with bypass mode, we could..." that's exactly the scenario bypass-immune checks exist to prevent. ## Graduated Trust Not all instructions carry the same authority. When an agent receives instructions from multiple sources (the system prompt, the user's message, a tool result, a sub-agent's output) those sources have different trust levels, and the system must enforce that hierarchy. The standard ordering: - **System prompt**: highest authority. Set by the developer who deployed the agent. Can establish and expand agent permissions. - **User turn**: medium authority. The human operator. Can use permissions within what the system prompt allows, but cannot exceed them. - **Tool results**: lower authority. External data returned by tool calls. Cannot expand permissions and can only inform decisions. - **Sub-agent output**: lowest authority in a multi-agent system. A worker agent's output should be treated as data, not as instruction. The critical invariant: **an agent cannot grant itself elevated permissions.** Trust flows downward. 
A system prompt can grant broad permission, a user can use that permission, but neither a tool result nor a sub-agent claiming to have elevated access can override the hierarchy above them. This matters especially in multi-agent systems. A coordinator that receives a message from a worker claiming "the user approved X" must not act on that claim. The worker doesn't have the authority to relay user approvals. Only instructions that arrive through the original user turn or system prompt carry that trust level. ## Multi-Agent Permission Forwarding Workers run in isolated execution contexts and can't show UI. When a worker needs a permission decision that can't be resolved locally (classifier can't auto-approve, no matching rule), it delegates to the leader via a mailbox protocol. ```python # Worker side: create request, register callback, then send function request_permission_from_leader(tool, input, context) -> Decision: request = PermissionRequest { id: generate_id(), tool: tool.name, input: input, worker_id: context.agent_id, } # CRITICAL: Register callback before sending to leader # If we send first and leader responds before we register, we lose the response callback = register_callback(request.id, context.abort_signal) # Write to leader's mailbox write_to_mailbox(context.leader_mailbox_path, request) show_pending_indicator(tool) # Wait for leader response (or abort if session ends) response = await callback if response.type == "cancel": return DENY("leader did not respond before session ended") # Leader may have modified the tool input (e.g. 
sanitized a path) if response.updated_input: input = response.updated_input return response.decision # Leader side: poll mailbox, show UI, send response function poll_and_respond_to_workers(): for request in read_mailbox(leader_mailbox_path): show_worker_permission_dialog(request) # blocks until user decides response = PermissionResponse { id: request.id, decision: user_decision, updated_input: maybe_sanitized_input, # leader can modify input } write_to_worker_mailbox(request.worker_id, response) ``` **The race condition guard** is the callback registration order. The worker registers its callback before writing to the mailbox, not after. If the sequence were reversed (write first, then register), the leader could respond in the window between the write and the registration, and the response would be permanently lost. The worker waits forever. Always register callbacks before any operation that could trigger the response. **The `updated_input` field** lets the leader modify what the tool actually runs. If a worker requests `write /tmp/sensitive-file`, the leader can respond with an approved write to a sanitized path. The worker executes the leader's modified input, not its original request. **AbortSignal fires on session end.** If the leader's session ends while a worker is waiting, the worker's abort signal fires and it resolves with a cancel decision, preventing a hung worker with no recourse. ## Shadow Rule Detection A shadowed rule is a rule that can never be reached. Shadow rules are a silent correctness problem: the developer believes a specific permission is granted, but it never fires. Two shadow types: **Deny-shadowing**: a broad deny rule blocks a specific allow rule. Example: you add `bash(ls:*)` to allow specific directory listings, but also have `bash` in the deny list. The deny list is checked before the allow list, so the specific allow never matches. The tool behaves as if always denied. 
**Ask-shadowing**: a tool-wide ask rule means the user is always prompted before the specific allow can be checked. If `bash` is in the ask list and `bash(ls:*)` is in the allow list, the ask rule fires first. The specific allow is never reached because the user is prompted every time regardless. ```python # Shadow rule detection at write time (not at evaluation time) function detect_shadowed_rules(new_rules: RuleSet) -> list[Warning]: warnings = [] for allow_rule in new_rules.allow_rules: # Check for deny-shadowing: any deny rule that would match this allow's pattern for deny_rule in new_rules.deny_rules: if deny_rule.pattern.subsumes(allow_rule.pattern): warnings.append( ShadowWarning( shadowed=allow_rule, shadower=deny_rule, type="deny_shadow", message=f"'{deny_rule.pattern}' will always deny before '{allow_rule.pattern}' is checked" ) ) # Check for ask-shadowing: any ask rule that would always prompt before this allow for ask_rule in new_rules.ask_rules: if ask_rule.pattern.subsumes(allow_rule.pattern): if not sandbox_auto_allow_enabled(new_rules): warnings.append( ShadowWarning( shadowed=allow_rule, shadower=ask_rule, type="ask_shadow", message=f"'{ask_rule.pattern}' will always prompt before '{allow_rule.pattern}' is checked" ) ) return warnings ``` Shadow rule detection runs when rules are written, not at evaluation time. This is the right place. Discovering a shadow at evaluation time means the developer has already made a decision based on a false belief about the permission state. Detecting it at write time lets you warn before the belief takes hold. **The sandbox exception**: when sandbox auto-allow is enabled, personal ask-rules don't shadow bash allow-rules. The sandbox's auto-approval mechanism bypasses the normal ask rule check, so the allow rule can fire. ## Denial Tracking and Auto-Fallback Classifier-based permission systems have a failure mode: silent infinite rejection loops. 
A classifier that fails closed (always denies on error, or always denies a particular pattern) can spin forever with no user feedback. The agent looks stuck. No dialog appears. The user doesn't know why. Denial tracking solves this by counting consecutive and total denials, then escalating to user dialog when thresholds are crossed. ```python type DenialState = { consecutive: int, total: int } LIMITS = { max_consecutive: 3, max_total: 20 } # Check before running classifier: should we skip to dialog? function should_escalate_to_dialog(state: DenialState) -> bool: return ( state.consecutive >= LIMITS.max_consecutive or state.total >= LIMITS.max_total ) # Update denial state after a classifier decision function update_denial_state(decision: Decision, state: DenialState) -> DenialState: match decision: DENY: return { consecutive: state.consecutive + 1, total: state.total + 1 } ALLOW: return { consecutive: 0, total: state.total } # reset consecutive on success ASK: return { consecutive: 0, total: state.total } # ask = escalation, also resets # Full permission resolution with denial tracking function resolve_with_tracking(tool, input, context) -> Decision: if should_escalate_to_dialog(context.denial_state): return show_dialog_and_wait(tool, input, context) # skip classifier decision = run_classifier(tool, input, context) context.denial_state = update_denial_state(decision, context.denial_state) if decision == ASK or decision == DENY_TO_DIALOG: return show_dialog_and_wait(tool, input, context) return decision ``` The thresholds (3 consecutive, 20 total) represent observed frustration thresholds, not theoretical values. After 3 consecutive denials in a row, something is wrong. After 20 total, even with intermittent successes, the classifier is denying too much. Either threshold triggers an escalation to user dialog so the human can take over. The consecutive counter resets on any successful approval. A single approved action indicates the classifier isn't stuck. 
The total counter never resets within a session, which is intentional: it catches diffuse denial patterns that the consecutive counter would miss. **Why two counters?** The consecutive threshold catches obvious stuck-classifier situations: 3 denials in a row with no approvals in between is the classifier looping on one action. The total threshold catches a more subtle pattern: the classifier that approves some things but denies a high fraction of requests overall. In a long session, 20 total denials might be spread across 5 different tools, each denied 4 times. Consecutive never hits 3, but the aggregate friction is clearly wrong. Both conditions should escalate independently. **What happens after escalation?** The user dialog appears and the user decides. If the user approves, the consecutive counter resets (approval reset). If the user denies (actively, via the dialog), that denial does NOT increment the denial tracker. Human denials are expected and correct. The tracker only counts classifier denials. After user interaction, the classifier gets another chance on subsequent requests. ## Production Considerations **Six sources, not four: the team/enterprise distinction is real.** `policySettings` (enterprise push) and `projectSettings` (git-committed) can conflict when both define rules for the same tool. Because policy has higher priority than project, a team cannot override enterprise policy via their committed settings. Developers who find "project settings not working" often don't realize a higher-priority source is shadowing them. Always inspect the full six-source cascade when debugging unexpected denials, not just the sources you configured. **Classifier-based modes need denial tracking, not just thresholds.** A classifier that fails closed will spin forever with no user feedback if denial tracking isn't implemented. The 3-consecutive / 20-total thresholds aren't arbitrary. 
They're calibrated to the difference between "the classifier is working but cautious" and "the classifier is stuck." Implementing a classifier permission system without denial tracking is a recipe for invisible agent hangs. **Bypass-immune checks must survive mode transitions.** If scope bounds checking is gated behind "is in bypassPermissions mode?", bypass mode breaks the invariant it's supposed to protect. The correct structure is: bypass-immune checks run before cascade evaluation, unconditionally, regardless of mode. The check itself is immune, not the result of the check. **Shadow rule detection is a security issue, not just UX.** A shadowed allow rule that silently fails is a correctness problem with security implications: the developer believes a narrowly-scoped permission was granted, but in practice the broad deny rule is in force. Without detection, the response to "my allow rule isn't working" is to widen the allow rule, which often means granting more access than intended. **The interactive handler's 200ms grace window prevents a specific class of accidental approvals.** In interactive sessions, a classifier might auto-approve an action before the dialog appears. If the user's previous keypress arrives within 200ms of the dialog, it could look like "user dismissed the dialog immediately," which would then be treated as user approval of the auto-approval. The grace window prevents this by deferring user-interaction detection briefly. Without it, fast typers trigger false approvals. **The register-before-send ordering in permission forwarding is a race condition, not style.** If a worker writes to the leader mailbox before registering its callback, the leader can respond in the window between the write and the registration, and the response is permanently lost. The worker waits forever. Always register callbacks before any operation that could trigger the response. 
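The register-before-send discipline can be made concrete. A minimal sketch, assuming an in-memory mailbox built on asyncio futures; the `Mailbox` and `request` names are illustrative, not from the source:

```python
import asyncio


class Mailbox:
    """Toy in-memory mailbox. The worker registers a future for a
    request id BEFORE writing the request, so even an instant response
    can never arrive in an unregistered window."""

    def __init__(self):
        self._pending: dict[str, asyncio.Future] = {}

    def register(self, request_id: str) -> asyncio.Future:
        fut = asyncio.get_running_loop().create_future()
        self._pending[request_id] = fut
        return fut

    def deliver(self, request_id: str, payload):
        # Leader side: respond to a request. If nothing were registered
        # yet, the response would be dropped -- which is exactly the race.
        fut = self._pending.pop(request_id, None)
        if fut is not None:
            fut.set_result(payload)


async def request(mailbox: Mailbox, request_id: str, send):
    # Correct ordering: register the callback first, THEN send.
    fut = mailbox.register(request_id)
    send()  # only now can the leader possibly respond
    return await fut


async def demo():
    mb = Mailbox()
    # Simulate a leader that responds the instant the request is sent.
    return await request(mb, "perm-1", lambda: mb.deliver("perm-1", "allow"))


print(asyncio.run(demo()))  # → allow
```

Inverting the two lines inside `request` reopens the race: a leader that responds synchronously would call `deliver` before `register` ran, and the worker would await forever.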
## Best Practices **Do: fail closed on no match.** The cascade's default is DENY when nothing matches. Fail-open ("if uncertain, allow") looks convenient but inverts the cost asymmetry. Wrongly allowing an action can be irreversible. Wrongly blocking one costs a prompt. **Don't: gate bypass-immune checks behind mode.** Mode controls the cascade. Bypass-immune checks run before the cascade. If you put scope checking inside the mode logic, bypass mode can defeat it. The check and the mode are at different architectural layers on purpose. **Do: implement denial tracking with escalation.** Any permission system that uses a classifier needs denial tracking. A classifier that rejects an action will keep rejecting it without escalation. Set thresholds: 3 consecutive denials or 20 total should escalate to user dialog regardless of what the classifier says. **Don't: let workers show UI directly.** Workers run in headless execution contexts. Code that calls a user-facing dialog from a worker context will either hang (no UI to render into) or throw. Route worker permission requests to the leader via mailbox. **Do: detect shadowed rules at write time.** Check new rules against existing deny and ask rules before persisting them. Warn on any allow rule that would be shadowed by a broader deny or ask. The developer making the write-time decision is in the best position to resolve the conflict. **Don't: confuse `dontAsk` and `bypassPermissions`.** Both modes skip the user prompt, but they're opposites: `dontAsk` denies everything silently, while `bypassPermissions` approves everything silently. The word "bypass" means bypassing the *prompt*, not bypassing *safety*. But in the `dontAsk` case, "bypassing the prompt" means the request is silently refused. **Do: log the full decision path.** Every permission decision should record the source that matched (which of the six), the behavior (allow/deny/ask), and the reason type (rule/classifier/mode/hook/safety_check). 
When something goes wrong in production, "permission denied" is not enough. You need to know which source made the decision and why. ## Related - **[Tool System Design](/docs/tool-system)**: The fail-closed principle governing the permission cascade originated in tool metadata design: when a tool's safety flags are missing, default to the most restrictive interpretation. The same asymmetric cost argument applies to both systems. Tool system covers how tools declare their own permission requirements. This page covers how those requirements are evaluated. - **[Multi-Agent Coordination](/docs/multi-agent-coordination)**: Covers the full worker spawning model, mailbox communication, and session reconnection. The permission forwarding protocol on this page (workers delegate to leader) is the safety-specific view. Multi-agent coordination covers the broader communication architecture and backend abstraction that makes mailbox forwarding possible. - **[Streaming and Events](/docs/streaming-and-events)**: The terminal input event system handles raw user keystrokes including permission grant/deny responses. The capture/bubble dispatch model routes permission dialog input without affecting underlying components. - **[MCP Integration](/docs/mcp-integration)**: MCP tool annotations (`destructive_hint`, `read_only_hint`) feed directly into the permission cascade. External tools from MCP servers go through the same permission checks as built-in tools. - **[Pattern Index](/docs/pattern-index)**: All patterns from this page in one searchable list, with context tags and links back to the originating section. - **[Glossary](/docs/glossary)**: Definitions for all domain terms used on this page, from agent loop primitives to memory system concepts. 
--- # Streaming and Events Source: https://claudepedia.dev/docs/streaming-and-events Section: Patterns / core-systems How event-driven streaming delivers agent results as they happen, through typed events, producer-consumer pipelines, priority-based dispatch, capture/bubble phases, and a screen-diffing output model. A user watching an agent work doesn't want to stare at a blank screen for 30 seconds. They want to see text appearing, tools being called, progress being made. Streaming turns a black-box wait into a transparent process. But naive streaming (just printing tokens as they arrive) misses the architectural opportunity. A well-designed event system makes agent output composable, observable, and safe under load. A poorly-designed one buries you in tight coupling between the agent loop and every consumer that wants to watch it. The right framing is not "how do we print faster?" It is "what is the contract between the agent and its observers?" Get that contract right, and streaming a UI, logging a supervisor, and wiring a network serializer all become the same operation: subscribe and handle the types you care about. There are two distinct event systems in a complete agent implementation. The agent event stream carries LLM and tool lifecycle events: `TextDelta`, `ToolDispatch`, `Complete`. The terminal input event system carries user interaction events: keystrokes, resize, scroll. They are separate buses. Confusing one for the other is a common architectural mistake that produces subtle, hard-to-trace bugs. ## The Event Model The event type system is the contract between the agent loop and its observers. 
Here is the full model, including type definition, producer, and consumer: ```python # Event type system: the universal interface type AgentEvent = | RequestStart { turn_id: string } | TextDelta { turn_id: string, text: string } | ToolDispatch { turn_id: string, tool: string, args: object } | ToolResult { turn_id: string, tool: string, result: object } | Complete { turn_id: string, final_text: string } | ErrorEvent { turn_id: string, error: string } # Producer: agent loop yields events as they happen async function* agent_loop(messages, tools) -> AgentEvent: response = await llm.call(messages, tools) yield RequestStart(turn_id=response.id) for delta in response.stream(): yield TextDelta(turn_id=response.id, text=delta) if response.tool_calls: for call in response.tool_calls: yield ToolDispatch(turn_id=response.id, tool=call.name, args=call.args) result = await tools[call.name].run(call.args) yield ToolResult(turn_id=response.id, tool=call.name, result=result) else: yield Complete(turn_id=response.id, final_text=response.text) # Consumer: handles the event types it cares about, ignores the rest async function render_to_ui(event_stream): async for event in event_stream: match event: TextDelta: ui.append_text(event.text) ToolDispatch: ui.show_spinner(event.tool) ToolResult: ui.show_result(event.tool, event.result) Complete: ui.finalize() ``` The event model makes streaming **composable**. A supervisor can observe an agent's event stream without the agent knowing it's being watched. A logging system can subscribe to the same stream as the UI. Any new consumer type just subscribes and handles the events it cares about, without touching the producer. Add a new consumer: zero changes to the agent loop. Remove a consumer: same. The producer and consumer are fully decoupled through the typed event contract. 
Two complementary angles build on this foundation:

**Progressive disclosure.** Users see text as it's generated, tool calls as they're dispatched, partial results before completion. The pattern is: yield partial state when it's useful to the consumer, not just at completion. An agent that shows its work is an agent users can trust.

**Producer-consumer pipeline.** The agent loop is the producer, yielding events as they happen. Consumers process them at their own pace. The pipeline can have multiple stages: agent loop to event filter to serializer to network socket to client renderer. The key systems concern is backpressure: what happens when a consumer is slower than the producer.

## Backpressure and Buffering

The producer-consumer pipeline has a fundamental tension: the producer can generate events faster than the consumer can process them. This is backpressure, the pressure from a slow consumer pushing back against a fast producer. How you handle it is a design decision with three options on a spectrum:

**No buffer (blocking producer).** The generator suspends at each `yield` until the consumer calls `next`. Maximum safety (the producer never runs ahead of the consumer) but minimum throughput if the consumer is slow. This is the natural behavior of async generators. It's the right default when consumer slowness is acceptable (batch processing, logging to disk).

**Bounded buffer.** The producer runs ahead up to N events, then blocks. Balances throughput and memory. The buffer absorbs consumer jitter: a consumer that processes events in bursts rather than steadily. The buffer size is the explicit trade-off. Larger means more throughput smoothing, more memory use, and more events potentially lost if the consumer crashes.

**Unbounded buffer.** The producer never blocks. All events are queued immediately. Maximum throughput, but unbounded memory use.
Safe only when the consumer is reliably faster than the producer over time (local in-memory consumers, fast file writes). Risky for network consumers or anything that can fall behind indefinitely. For UI streaming, the standard choice is a **bounded buffer with jitter tolerance**: a buffer of 10 to 50 events handles the bursty render patterns of a UI without risking memory exhaustion if the user's connection is slow. For supervisor agents observing a worker, blocking is usually fine because the supervisor processes every event anyway, and falling behind would mean missing critical events. ```python # Bounded buffer wrapper for a slow consumer async function consume_with_buffer(event_stream, buffer_size): buffer = AsyncQueue(maxsize=buffer_size) async function fill_buffer(): async for event in event_stream: await buffer.put(event) # blocks if buffer is full await buffer.put(DONE) spawn_background(fill_buffer) while True: event = await buffer.get() if event is DONE: return yield event ``` ## Event Priority and Scheduling Not all events are equal. In a terminal UI, a keystroke must feel instant. It maps to a discrete user action where any perceptible delay feels broken. A window resize can tolerate a frame of delay because the user doesn't feel a 16ms lag in layout reflow. If the system treats all events with equal urgency, resize events compete with keystrokes for the same dispatch slot, and the result is visible input lag under load. The solution is a priority model with three classes: **Discrete events** (`keydown`, `keyup`, `click`, `focus`, `blur`, `paste`): dispatched synchronously at the highest priority. These cannot be batched. The user expects an immediate response: a character appears, a button activates, focus shifts. Any delay above ~50ms is perceptible. **Continuous events** (`resize`, `scroll`, `mousemove`): batched at lower priority. 
These fire at high frequency during user interaction (dozens per second on a resize drag), and the final state is what matters, not each intermediate value. Batching them absorbs the burst and reduces rendering load. **Default events**: everything else. Normal scheduling, no special batching or urgency. ```python type EventPriority = "discrete" | "continuous" | "default" function get_event_priority(event_type: str) -> EventPriority: match event_type: "keydown" | "keyup" | "click" | "focus" | "blur" | "paste": return "discrete" # sync, cannot be batched: user expects instant response "resize" | "scroll" | "mousemove": return "continuous" # batched: high frequency, tolerate slight delay _: return "default" # normal scheduling function schedule_event(event: TerminalEvent) -> void: priority = get_event_priority(event.type) match priority: "discrete": dispatch_sync(event) # runs immediately, no queuing "continuous": enqueue_for_batch(event) # coalesced with other continuous events "default": enqueue_normal(event) # standard scheduler queue ``` This priority mapping is what makes keystrokes feel instant while resize events are batched. It is not a performance optimization you can defer to later. Without it, a burst of resize events during active typing creates visible input lag that users notice immediately. The priority model is a correctness requirement for interactive terminal applications, not a nicety. > **Note:** This is the terminal input event system (keyboard, mouse, resize). It is completely separate from the agent event stream (TextDelta, ToolDispatch, Complete). The two systems serve different purposes and must not be conflated. ## Capture and Bubble Phases The terminal event system implements the same two-phase dispatch model that web developers know from the browser DOM. Understanding it is important for any agent UI that uses component trees: modal dialogs, overlapping panels, nested input widgets. 
When an event is dispatched to a target node, it travels in two phases: 1. **Capture phase**: the event walks down from the root of the component tree toward the target node. Each ancestor has the opportunity to intercept the event before it reaches the target. 2. **Bubble phase**: after the target handles the event, it walks back up toward the root. Each ancestor has the opportunity to react after the target has processed it. This two-phase model enables **event delegation**: a parent component can intercept events intended for its children. A modal dialog can capture all keyboard events during capture phase and prevent them from reaching underlying components. A keyboard shortcut handler at the root can intercept `ctrl+c` before any child sees it. ```python # Dispatch an event to a target through the component tree function dispatch(target: Node, event: Event) -> void: # Build ordered listener list: capture listeners root to target, then bubble listeners target to root listeners = collect_listeners_capture_to_bubble(root, target, event.type) for listener in listeners: if event.is_propagation_stopped: break listener.handle(event) # A listener claiming an event stops ALL remaining listeners, not just bubbling function handle_keyboard_shortcut(event: KeyEvent) -> void: if event.key == "ctrl+c": abort_current_operation() event.stop_immediate_propagation() # no other handler sees this event ``` **`stopImmediatePropagation()` is stronger than `stopPropagation()`.** Standard DOM has both: `stopPropagation()` prevents further bubbling but still calls remaining listeners at the current node. `stopImmediatePropagation()` halts ALL listeners for this event, including others registered at the same node. The terminal event system exposes only `stopImmediatePropagation()`. This is a stronger guarantee: once a handler claims an event, no other handler sees it at all. Design handlers with this in mind. Claiming an event is an exclusive action. 
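A runnable sketch of the two-phase dispatch with an exclusive `stop_immediate_propagation()`, assuming a simple parent-chain component tree (the node and handler names are illustrative):

```python
class Event:
    def __init__(self, type_, key=None):
        self.type = type_
        self.key = key
        self.stopped = False

    def stop_immediate_propagation(self):
        # Exclusive claim: no further handler sees this event at all.
        self.stopped = True


class Node:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.capture_listeners = []
        self.bubble_listeners = []


def dispatch(target, event, log):
    # Build the path from root to target.
    path, node = [], target
    while node is not None:
        path.append(node)
        node = node.parent
    path.reverse()
    # Capture phase (root -> target), then bubble phase (target -> root).
    ordered = (
        [(n, n.capture_listeners) for n in path]
        + [(n, n.bubble_listeners) for n in reversed(path)]
    )
    for node, listeners in ordered:
        for handler in listeners:
            if event.stopped:
                return
            handler(node, event, log)


root = Node("root")
modal = Node("modal", parent=root)
input_box = Node("input", parent=modal)


# The modal claims Escape during capture: the input never sees it.
def modal_capture(node, event, log):
    if event.key == "escape":
        log.append("modal handled escape")
        event.stop_immediate_propagation()


def input_handler(node, event, log):
    log.append(f"input saw {event.key}")


modal.capture_listeners.append(modal_capture)
input_box.bubble_listeners.append(input_handler)

log = []
dispatch(input_box, Event("keydown", key="escape"), log)
dispatch(input_box, Event("keydown", key="a"), log)
print(log)  # → ['modal handled escape', 'input saw a']
```

The Escape key never reaches the input widget because the modal's capture-phase handler claimed it; the plain keystroke bubbles through untouched.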
The web analogy is intentional and makes the pattern portable. A developer who has used `addEventListener` with the capture flag understands this model immediately. The same mental model applies in terminal UI component trees. ## The Output Rendering Model There is a common misconception about how terminal output works in agent UIs: that the agent streams text directly to the terminal, that each `TextDelta` event causes a `write()` call and characters appear immediately. This is not how production terminal UIs work. The actual model is closer to React's virtual DOM than to a streaming write. Here is the sequence: 1. `TextDelta` events arrive from the agent loop and update an in-memory document model. 2. When a render tick occurs (frame rate controlled, not token arrival rate), the document renders to a **screen buffer**: a two-dimensional array of character cells, each with a precomputed style, width, and color. 3. The screen buffer is **diffed** against the previous frame. Only cells that changed are emitted as terminal escape codes. Cells that haven't changed produce no output. 4. The diff result is written to the terminal as a compact sequence of escape codes: cursor moves, color changes, character writes, and nothing more. ```python # Screen buffer and diff model type ScreenCell = { character: str width: int # precomputed: 1 for ASCII, 2 for wide chars (CJK, emoji) style_id: int # index into style table, avoids repeating full style spec hyperlink: str? 
    # optional hyperlink URL
}

type ScreenBuffer = list[list[ScreenCell]]  # rows x columns

function render_frame(document: Document, previous: ScreenBuffer) -> str:
    current = render_to_buffer(document)    # full layout pass
    diff = compute_diff(previous, current)  # only changed cells
    return emit_escape_codes(diff)          # cursor moves + character writes
```

The consequences of this model are non-obvious:

**Output latency is frame-rate-limited, not token-rate-limited.** Even if the LLM produces 80 tokens per second, the terminal renders at N frames per second (often 30-60). Characters that arrive between frames are batched into one render. Users see smooth, controlled output, not one character per token write. The frame rate is the throttle, not the LLM speed.

**Character widths must be precomputed per cell.** `'hello'.length === 5` is correct, but `'こんにちは'.length === 5` while the visual width is 10 columns. CJK characters and emoji occupy 2 columns. Layout algorithms that use string character count for column positions will corrupt the display for any non-ASCII content. The correct approach is grapheme segmentation: count grapheme clusters, not Unicode code points, and look up display width per cluster. Width is cached in the cell (`ScreenCell.width`) so it's computed once per unique character, not per render frame.

**Screen diffing prevents partial-render corruption during resize.** If the terminal is resized mid-render and text was being written directly, the lines written before the resize use the old column width and the lines after use the new width. The display is torn. With screen buffering, the resize triggers a full re-render of the buffer at the new dimensions, and the diff produces the complete correct frame. No tearing, no partial-render artifacts.

## The Generator Connection

The agent loop's generator pattern (yield the response, check for tool calls, loop) is the source of events.
[Agent Loop Architecture](/docs/agent-loop) covers why the loop uses an async generator for streaming intermediate turns: generators compose naturally, suspend at each yield, and let callers observe without coupling. This page covers the other half: what consumers do with those events, and how the pipeline handles load when consumers can't keep up. The two pages are complementary. The agent loop is the producer side, and this page is the consumer side and the pipeline between them. ## Production Considerations **Screen diffing means terminal output is frame-rate-limited, not token-rate-limited.** An LLM producing 80 tokens per second does not mean 80 writes per second to the terminal. The render loop fires at a controlled frame rate. Tokens arriving between frames are batched into one render. Users experience smooth, uniform output regardless of the LLM's per-token generation speed. This is a feature, not a limitation. Direct per-token writes at 80/second would produce visible flicker. **Disabling the listener limit is load-bearing, not a hack.** Most event emitter implementations warn when more than 10 listeners attach to a single event. This heuristic is designed to catch memory leaks. In agent UIs with component trees, many independent components legitimately subscribe to the same keyboard event source. The default limit triggers false warnings that pollute the terminal. The correct response is to remove the limit, but this removes the safety net for real memory leaks. To compensate, component cleanup (unlisten on unmount, teardown on component removal) must be rigorous. The listener limit removal is correct for this use case, but it trades one safety mechanism for a discipline requirement. **Precomputing character widths is not optional for international users.** Layout that uses `string.length` for column calculations will produce corrupted displays for any content containing CJK characters, emoji, or other wide Unicode. 
The correct approach (grapheme segmentation with per-cluster width lookup) must be applied at the cell level, not at the string level. Do it once per unique character (cache the result in the cell), not per render frame. This is a correctness requirement, not a performance optimization: without it, the display is wrong for a significant fraction of users. **Event priority is a correctness requirement, not a performance optimization.** Without priority dispatch, a burst of resize events (dozens per second during a window drag) competes directly with keystrokes for dispatch time. Keystrokes that arrive during a resize burst experience visible input lag. The priority model (synchronous dispatch for discrete input events, batched dispatch for continuous events) is the mechanism that keeps the UI responsive under realistic load. This is not something you can add later once you notice the lag. The architectural separation needs to be there from the start. **`stopImmediatePropagation()` is an exclusive claim. Design for it.** When a handler calls `stopImmediatePropagation()`, no other handler (including handlers at the same component) sees that event. This is the only propagation control the system exposes (there is no `stopPropagation()` for partial halt). Handlers that claim events must be designed with this exclusivity in mind: if two handlers at different components both want to handle the same key, the one that captures it first wins completely. Priority and registration order determine who gets the event. Document this in your component contract. **The two event systems must be kept architecturally separate.** The terminal input event system (keyboard, resize, scroll) and the agent event stream (TextDelta, ToolDispatch, Complete) are separate buses with different semantics. The terminal input system is synchronous, priority-dispatched, capture/bubble-routed. The agent event stream is asynchronous, yielded from a generator, consumed via async iteration. 
Mixing them (subscribing to keyboard events expecting TextDelta, or routing ToolDispatch through the terminal dispatcher) produces subtle bugs where events reach the wrong handlers or are dispatched at the wrong priority. Name them explicitly in your codebase. ## Best Practices **Do: use discriminated unions for all event types.** Don't: use string event names or untyped callbacks. Discriminated unions give you exhaustive pattern matching. A consumer that handles `TextDelta` and `Complete` but not `ToolDispatch` is a compile-time warning, not a silent miss at runtime. **Do: use bounded buffers for UI streaming.** Don't: use unbounded buffers for any consumer that can fall behind (network consumers, slow renderers). A bounded buffer is an explicit commitment about how far the producer can run ahead. **Do: precompute character widths via grapheme segmentation.** Don't: use `string.length` for layout calculations. Width is a display property of the rendered character, not a count of Unicode code points. **Do: implement frame-based rendering with screen diffing.** Don't: stream individual characters to the terminal on each token arrival. Screen diffing prevents partial-render corruption, handles resize correctly, and gives you control over output latency. **Do: separate the terminal input event system from the agent event stream.** Don't: conflate the two systems. They have different dispatch models, different scheduling semantics, and different consumers. Keep them named, typed, and routed separately. **Do: use event priorities (discrete for input, continuous for resize/scroll).** Don't: treat all terminal events with equal urgency. The priority model is the mechanism that keeps input responsive under load. **Do: register response handlers before sending messages that expect a response.** Don't: send then register. That's a race condition where the response arrives before the handler is ready. 
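As an illustration of the width rule above, a minimal sketch using the stdlib's East Asian Width lookup. This is simplified: it does not perform full grapheme clustering, which a production renderer needs for emoji sequences and combining marks:

```python
import unicodedata


def cell_width(ch: str) -> int:
    # East Asian Wide ("W") and Fullwidth ("F") characters occupy two
    # terminal columns; everything else is treated as one here.
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def visual_width(text: str) -> int:
    # Display width in columns, NOT len(text).
    return sum(cell_width(ch) for ch in text)


print(len("hello"), visual_width("hello"))          # → 5 5
print(len("こんにちは"), visual_width("こんにちは"))  # → 5 10
```

Five code points, ten columns: any layout that positions a cursor with `len()` instead of a width function like this one will tear the display on CJK content.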
## Related - **[Agent Loop Architecture](/docs/agent-loop)**: The generator pattern that produces agent events. The agent loop is the producer side of the streaming pipeline. It yields events as the LLM responds and tools execute. This page and agent-loop.md cover complementary halves: why the generator exists versus what consumers do with what it yields. - **[Multi-Agent Coordination](/docs/multi-agent-coordination)**: Event propagation through the coordination layer. When a supervisor observes worker agents, it subscribes to their event streams the same way a UI subscribes to the agent loop. Understanding the event model here makes multi-agent observability composable rather than bespoke per-system. - **[Safety and Permissions](/docs/safety-and-permissions)**: Event isolation as a security boundary. The terminal input event system handles raw user keystrokes, including permission grant/deny responses. Understanding how capture/bubble dispatch routes events through component trees is relevant to how permission dialogs intercept input without affecting underlying components. - **[Tool System Design](/docs/tool-system)**: Tools generate events as they execute. Tool dispatch events, tool result events, and abort events flow through the streaming pipeline. The concurrent dispatch of read-only tools creates interleaved event streams that the consumer must handle correctly. - **[Observability and Debugging](/docs/observability-and-debugging)**: The streaming event pipeline is the primary data source for observability. Event logging, cost tracking, and session tracing all consume the same typed event stream. Understanding the event model here makes the observability layer's span hierarchy and cost attribution directly interpretable. - **[Pattern Index](/docs/pattern-index)**: All patterns from this page in one searchable list, with context tags and links back to the originating section. 
- **[Glossary](/docs/glossary)**: Definitions for all domain terms used on this page, from agent loop primitives to memory system concepts. --- # Multi-Agent Coordination Source: https://claudepedia.dev/docs/multi-agent-coordination Section: Patterns / core-systems Why production multi-agent systems use delegation (not distribution): one coordinator decides what to do, specialist workers decide how. Covers spawning backends, file-based mailbox communication, session reconnection, and production trade-offs. Some tasks are too complex for a single agent. A research task might need five sources investigated simultaneously. A coding task might need file search, code generation, and test execution to happen in parallel. The instinct is to build agents as equal peers that split the work: a flat team where each agent takes a slice. That model fails in practice. The failure mode is predictable: without a coordinator, agents duplicate work, produce conflicting outputs, and leave no one responsible for assembling a coherent answer. You end up with five partial answers that the user has to reconcile themselves, or (worse) five agents calling each other in circles. Production multi-agent systems don't look like a group of equals. They look like a team with a lead. **Delegation, not distribution.** The coordinator decides WHAT, workers decide HOW. Synthesis happens at the coordinator. The coordinator agent receives the user's task, sees the full picture, and breaks it into subtasks. It does not execute those subtasks directly. It delegates them to worker agents. Workers each receive a narrow, well-defined task with the tools and context they need to complete it. The coordinator waits for results, then synthesizes them into a coherent final answer. This is not a router pattern. A router sends the task to the most relevant worker and passes the worker's output back to the user unchanged. 
A coordinator synthesizes: it takes multiple partial results, resolves any conflicts, fills any gaps, and produces an integrated answer. The synthesis step is an LLM call, not a concatenation. ```mermaid sequenceDiagram participant U as User participant C as Coordinator participant W1 as Worker 1 participant W2 as Worker 2 U->>C: Task C->>C: Plan subtasks par Parallel dispatch C->>W1: Subtask A (with isolated context + tools) C->>W2: Subtask B (with isolated context + tools) end W1-->>C: Result A W2-->>C: Result B C->>C: Synthesize (LLM call) C-->>U: Integrated answer ``` The diagram shows the key structural properties: parallel dispatch, isolated execution, and synthesis at the coordinator. The coordinator never touches the user again until synthesis is complete. ## The Delegation Pattern Here is the coordinator loop in pseudocode: ```python function coordinator_loop(task, available_workers): # Coordinator plans: it does not execute subtasks directly plan = await llm.plan(task) # returns list of subtasks with tool and context requirements # Dispatch to workers in parallel: each gets isolated context and tools worker_tasks = [] for subtask in plan.subtasks: worker = spawn_worker( task=subtask, context=subtask.required_context, # only what this worker needs, not the full history tools=subtask.required_tools, # only the tools for this domain ) worker_tasks.append(worker) # Gather results (parallel execution: coordinator waits for all) worker_results = await gather_all(worker_tasks) # Coordinator synthesizes: never relays raw results return await llm.synthesize( original_task=task, worker_results=worker_results, ) ``` Three design decisions embedded in this structure are worth making explicit. **Tool partitioning.** The coordinator gets only coordination tools: spawn a worker, send a message, stop. It does not get file system access, API access, or search tools. Workers get the domain tools for their specific subtask. 
This partition prevents the coordinator from bypassing workers and doing the work directly, a failure mode that collapses the delegation model back to a single agent. If the coordinator can search files, it will search files instead of delegating, and you lose all the parallelism and specialization you designed for. **Synthesis, not relay.** The coordinator doesn't pass raw worker output to the user. After all workers return, it makes another LLM call to synthesize, combining results, resolving conflicts, filling gaps, and producing an integrated answer. The synthesis step is what distinguishes a coordinator from a router. Skip it and you've built an expensive router that makes the user do the assembly work. Include it and the user gets one coherent answer regardless of how many workers contributed to it. **Parallel execution.** Workers run concurrently via async gather. The coordinator dispatches all workers and waits for all to finish (or handles partial failure, as discussed below). The parallelism is the primary cost justification for multi-agent architecture. If your workers run sequentially, you've added coordination overhead without adding throughput. ## Context Isolation Each worker agent starts with a **fresh message history**, either empty or initialized with only the context it needs for its specific subtask. Workers cannot read each other's message histories. The coordinator sees only what workers explicitly return, not their full internal conversation. This isolation is a design choice, not a limitation. It has four properties that make multi-agent systems tractable at scale: **Prevents error cascades.** A worker that gets confused by a malformed tool result, encounters unexpected data, or goes down a wrong reasoning path cannot infect other workers. Its error is contained to its own context window. The coordinator sees a failed result and can handle it (retry, skip, escalate) without the error spreading. 
**Enables parallel execution.** Workers share no mutable state. There is no shared message history to lock, no coordination overhead between workers, no race condition on who appends to the conversation first. Isolation is what makes the `gather_all` in the coordinator safe to parallelize without synchronization. **Makes debugging tractable.** Each worker's conversation is a self-contained artifact. When something goes wrong, you can read a single worker's message history in isolation and understand exactly what it saw, what it concluded, and why. Without isolation, debugging a failure means untangling one long conversation where multiple agents interleaved their reasoning. **Security isolation.** A worker that processes untrusted input (user-uploaded documents, external API responses, scraped web content) cannot leak that content into the coordinator's decision-making or into other workers. The worker's context is quarantined. If the untrusted content attempts a prompt injection attack, it affects only that worker's narrow task, not the coordinator's plan or other workers' execution. ## Spawning Strategies How the coordinator dispatches workers depends on task structure and tolerance for partial failure. **Synchronous spawning.** The coordinator spawns one worker, waits for it to finish, then spawns the next. Simple and easy to reason about, but loses all parallelism. Use when tasks are sequential (each subtask depends on the previous worker's output) or when you're debugging and want to inspect each worker's result before proceeding. **Async gather.** All workers are spawned at once, and the coordinator waits for all to finish before synthesizing. Parallel, and the right default for independent subtasks. The cost: you must decide what to do when one worker fails. Cancel the remaining workers and fail the whole task? Wait for successful workers and synthesize from partial results? 
That decision should be explicit in your coordinator design, not left as an implicit crash. **Async with progressive synthesis.** The coordinator processes results as they arrive, decides when it has enough to synthesize, and optionally cancels remaining workers. Best throughput for tasks where partial results are useful (research tasks where 3 of 5 sources are sufficient), but most complex to implement. Requires the coordinator to make an explicit "do I have enough?" decision at each result. ## Spawning Backends Workers don't run in just one environment. A production coordination system needs to work whether the agent is running in a script, a desktop terminal with multiplexer support, or a native split-pane terminal. Three runtime environments exist, all sharing the same executor interface so the coordinator doesn't need to know which one it's using. **In-process executor.** The worker runs in the same process as the coordinator, isolated via async-local storage rather than process boundaries. The worker shares the coordinator's API client and MCP connections (no startup cost for re-establishing those) but has fully isolated message history. Use when external dependencies are unavailable or when you want minimal spawning overhead. **Multiplexer-based executor.** The worker runs in a separate process, launched inside a terminal multiplexer pane (color-coded borders help distinguish agents visually). The new process is fully independent, so crashes don't affect the coordinator process. The cost is the time of a new process startup and MCP reconnection. **Native terminal executor.** The worker runs in a separate process, launched in a native terminal split pane. Functionally equivalent to the multiplexer-based executor from a coordination perspective (same process isolation, same communication model), but uses the terminal's native split pane API rather than the multiplexer command interface. 
The system selects the executor at runtime without any configuration in the coordinator: ```python type TeammateExecutor = { spawn(config: SpawnConfig) -> SpawnResult send_message(agent_id: str, message: Message) -> void terminate(agent_id: str, reason?: str) -> bool is_active(agent_id: str) -> bool } # Backend selection at runtime: same interface regardless of backend function get_executor() -> TeammateExecutor: if inside_native_terminal() and native_pane_available(): return native_terminal_executor() if inside_multiplexer() or multiplexer_available(): return multiplexer_executor() return in_process_executor() # always available, no external dependencies ``` The key insight: in-process workers share resources (API client, MCP connections) but have isolated message history. The isolation is at the conversation level, not the resource level. This means in-process workers don't pay re-connection costs, but they also can't contaminate each other's reasoning. For fast, resource-light scenarios the in-process executor is the right default. For long-running, crash-tolerant workers the process-isolated executors are better. ## Mailbox Communication All three executor backends use the same communication channel: a file-based mailbox. Not shared memory. Not pipes. Not sockets. Files, even for in-process workers. This design choice might seem surprising. Why add file I/O when in-process workers could use a channel directly? The answer is uniformity: the coordination code that sends a message to a worker doesn't need to know whether that worker is in-process or running in a separate terminal. The same send call works for all three. It also means message history is inspectable on disk at any moment, a trivial but decisive debugging advantage. The mailbox is a directory of message files. Each file represents one message. Reading is atomic: the reader acquires a lockfile, reads the message file, deletes it, and releases the lock. This is not a queue. 
There is no ordering guarantee beyond filesystem modification time, and messages are consumed exactly once. ```python # Agent ID format: agentName@teamName # Example: researcher@my-team, tester@my-team # Deterministic, human-readable, and grep-able in logs function send_message(to: str, message: Message) -> void: # to is an agent ID: "researcher@my-team" mailbox_dir = get_mailbox_path(to) message_file = mailbox_dir / f"{uuid()}.msg" message_file.write_atomic(message.serialize()) function receive_messages(agent_id: str) -> list[Message]: mailbox_dir = get_mailbox_path(agent_id) messages = [] for msg_file in mailbox_dir.list_files(): with lockfile(msg_file): if msg_file.exists(): # check again under lock messages.append(Message.deserialize(msg_file.read())) msg_file.delete() return messages ``` The agent ID format (`agentName@teamName`) is deliberate. It's deterministic and human-readable. `researcher@my-team` is immediately understandable in a log file. A UUID would be correct but opaque. When debugging a multi-agent system, being able to grep for `researcher@my-team` and see everything that agent sent and received is the difference between a 5-minute and a 50-minute debugging session. The trade-off: file I/O introduces latency. A coordinator polling at 1-second intervals will have 0 to 1 second message delay. That's acceptable for multi-step coordination tasks (research, code generation, analysis) but makes the mailbox pattern unsuitable for tight feedback loops where sub-100ms coordination is required. **One ordering rule matters:** register your message callback before sending the message. If you send first and then register, there is a race condition. The sender may respond before the receiver has registered, and the response will be missed. ## Session Reconnection Workers can crash. Processes get killed. Users close terminals. A robust multi-agent system handles this without losing team membership. 
Workers persist their team identity in the session transcript: their agent name, team name, and the ID of the leader agent. On session resume, the system reads the team file on disk, recovers the leader's identity, and re-registers the agent. A crashed worker that resumes from its transcript rejoins the team without any manual intervention and without the coordinator having to reschedule its subtask. ```python function initialize_worker_from_session(session: Session) -> TeamContext: # Worker stores team membership in its own session transcript team_entry = session.find_entry(type="team-membership") if team_entry is None: return None # not a team worker # Recover team state from the team file (coordinator writes this) team_file = get_team_file(team_entry.team_name) team_state = team_file.read() # Re-register with the team register_agent( agent_id=team_entry.agent_id, # "researcher@my-team" leader_id=team_state.leader_id, # recovered from team file ) return TeamContext( agent_id=team_entry.agent_id, leader_id=team_state.leader_id, team_name=team_entry.team_name, ) ``` The team file is the source of truth for who the leader is. Workers don't store the leader's ID only in memory. They write it to the team file so it survives restarts. The session transcript records team membership so workers know they're a team member when they resume. Together, these two persistence mechanisms make team membership durable across crashes. ## Error Recovery in Multi-Agent Systems When a worker fails, the coordinator has the same escalation options as any agent facing a tool failure: retry the worker, fall back to a different worker or strategy, degrade by proceeding without that worker's contribution, or fail the entire task. The tiered recovery ladder (retry, fallback, degrade, fail) applies at the coordination level, not just at the individual tool level. A coordinator that swallows worker failures silently will produce confident-sounding but incomplete synthesis. 
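The ladder can be sketched as a single recovery function (a hedged illustration; names and signatures are assumptions, not a real framework API):

```python
# Illustrative escalation ladder for a failed worker:
# retry -> fallback -> degrade -> fail. All names are hypothetical.
def recover_worker_result(run_worker, fallback=None, max_retries=2, required=False):
    """Run a worker, escalating one rung at a time when it fails."""
    for _ in range(max_retries):
        try:
            return ("ok", run_worker())        # rung 1: retry until success
        except Exception:
            continue
    if fallback is not None:
        try:
            return ("fallback", fallback())    # rung 2: different worker/strategy
        except Exception:
            pass
    if required:                               # rung 4: fail the whole task
        raise RuntimeError("required worker failed after retry and fallback")
    return ("degraded", None)                  # rung 3: proceed without it
```

The status tag matters: synthesis should know that a result came from a fallback or was dropped entirely, rather than receiving a silently empty contribution.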
Make the failure handling explicit. See [Error Recovery and Resilience](/docs/error-recovery) for the full escalation ladder pattern.

## Streaming Through the Coordination Layer

The event pipeline from [Streaming and Events](/docs/streaming-and-events) extends naturally through multi-agent systems. A supervisor can subscribe to a worker's event stream in real time, the same way a UI subscribes to the agent loop. The coordinator can forward relevant events (tool dispatches, intermediate text) to the user's UI before synthesis is complete, enabling progressive disclosure even in multi-agent workflows. The event model makes this composable: no coupling between the coordinator's forwarding logic and the specific consumers watching the stream.

## Production Considerations

**In-process workers share resources but have isolated conversation state. The isolation boundary matters.** When a worker is launched in-process, it shares the parent's API client and MCP connections. It does not pay the cost of re-establishing those connections. But it has its own message history. The coordinator cannot read it, and other workers cannot read it. This means in-process workers are fast to start but not fully isolated from resource exhaustion: if one worker drives up API usage, the shared client's rate limits affect all workers. Separate-process workers have independent clients and don't share rate limit state.

**File-based mailboxes make debugging trivial but introduce latency.** A mailbox is a directory of message files that requires disk I/O to read. A coordinator polling at 1-second intervals will have 0 to 1 second message latency regardless of local execution speed. This is acceptable for multi-step coordination tasks (most coordination decisions aren't time-critical), but it makes the mailbox model unsuitable for tight feedback loops requiring sub-100ms response times.
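To make the trade-off concrete, here is a minimal single-reader sketch of a file-based mailbox (directory layout and names are illustrative; the lockfile from the full pattern is omitted for brevity). Writes are atomic via write-to-temp-then-rename:

```python
# Illustrative file-based mailbox: one directory per agent, one file per
# message, consumed exactly once. Layout and naming are assumptions.
import json
import os
import tempfile
import uuid
from pathlib import Path

def send_message(mailbox: Path, message: dict) -> None:
    mailbox.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=mailbox)
    with os.fdopen(fd, "w") as f:
        json.dump(message, f)
    # rename within one filesystem is atomic: a reader never sees a
    # half-written message file
    os.rename(tmp, mailbox / f"{uuid.uuid4()}.msg")

def receive_messages(mailbox: Path) -> list[dict]:
    messages = []
    for msg_file in sorted(mailbox.glob("*.msg")):
        messages.append(json.loads(msg_file.read_text()))
        msg_file.unlink()  # consume exactly once
    return messages
```

Everything on disk stays inspectable mid-run with `ls` and `cat`, which is the debugging advantage the pattern buys.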
**The leader's permission UI shows a pending indicator, but the worker's execution is paused.** When a worker needs permission for a tool and forwards the request to the leader via mailbox, the worker's execution is blocked at that point. The leader sees a "pending worker request" indicator in its UI and can approve or deny. If the leader's session ends before responding (the user closes the terminal), the worker's abort signal fires and it resolves with a cancel decision, preventing a hung worker. This graceful abort is the mechanism that keeps workers from blocking indefinitely when the leader disappears. **Agent IDs should be human-readable and deterministic.** Using `researcher@my-team` instead of a UUID makes log correlation trivial. When debugging a multi-agent workflow, being able to grep for a specific agent name across log files is far faster than correlating UUID fragments. The `name@team` format encodes both the agent's role and its team membership in a single inspectable string. **Register callbacks before sending mailbox messages, not after.** If a coordinator sends a message and then registers a handler for the response, there is a window where the response could arrive before the handler is registered. The handler misses it. The coordinator waits indefinitely. Register first, send second. Always. **Synthesis at the coordinator requires sufficient context.** When you use async gather with progressive synthesis (process results as they arrive), the coordinator must have enough context from each worker's result to synthesize coherently. If workers return only partial outputs ("here's part of what I found" without the full artifact), synthesis quality degrades. Design worker return contracts explicitly: what minimum information must a worker return for the coordinator to synthesize without that worker's full context? 
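One way to make that contract explicit is a typed result object every worker must return (field names here are assumptions, not a prescribed schema):

```python
# Illustrative worker return contract: the minimum the coordinator needs to
# synthesize without reading the worker's full context. Fields are hypothetical.
from dataclasses import dataclass, field

@dataclass
class WorkerResult:
    agent_id: str          # e.g. "researcher@my-team"
    status: str            # "ok" | "partial" | "failed"
    summary: str           # self-contained prose the coordinator synthesizes from
    artifacts: dict = field(default_factory=dict)  # full outputs, keyed by name

    def usable_for_synthesis(self) -> bool:
        return self.status in ("ok", "partial") and bool(self.summary.strip())
```

The `summary` field is the contract: it must stand alone, because the coordinator never sees the worker's message history.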
**Executor backend selection is transparent to the coordinator, but failure modes differ.** An in-process worker that panics can corrupt shared process state. A separate-process worker that crashes leaves the coordinator's process entirely unaffected. When you're spawning workers that run untrusted or unpredictable code, the process boundary offered by multiplexer-based or native terminal executors is a real safety property, not just a UI feature. For trusted, predictable subtasks, the in-process executor's lower latency is the right choice. ## Best Practices **Do: use tool partitioning to prevent coordinator bypass.** Give the coordinator only coordination tools (spawn, send, stop). Don't give it the domain tools it delegates. A coordinator with file access will use file access directly instead of delegating. **Don't: let the coordinator relay raw worker results to the user.** The coordinator synthesizes. It makes an LLM call to combine, resolve conflicts, and produce an integrated answer. If you pass worker outputs directly to the user, you've built an expensive router that makes the user do the assembly work. **Do: use file-based mailboxes for inter-agent communication.** File-based mailboxes work across all executor types (in-process, multiplexer, native terminal) with the same interface. They make message history inspectable on disk. Don't: use shared memory, queues, or pipes. They fail silently across process boundaries. **Do: use human-readable agent IDs in `name@team` format.** Don't: use UUIDs or auto-generated identifiers. Debugging a multi-agent system with opaque agent IDs is significantly harder than with human-readable ones. **Do: register response callbacks before sending mailbox messages.** Don't: send first, then register. That's a race condition waiting for production load to trigger it. 
**Do: make partial failure handling explicit.** Decide before deployment whether a coordinator should cancel all workers on one failure, synthesize from partial results, or wait and retry. Don't: let partial failure handling be an implicit crash that produces a silent empty synthesis. **Do: persist team membership (agent ID, team name, leader ID) in the session transcript.** Don't: store team identity only in memory. A restarted worker needs to rejoin without manual intervention, and the transcript is the durable record. **Do: prefer the in-process executor for fast, trusted subtasks. Prefer process-isolated executors for long-running or untrusted workers.** The in-process executor's shared API client and zero startup latency make it ideal when spawning overhead matters. Process isolation's crash containment and independent resource limits make it the right choice when workers run unpredictable or long-running operations. Choose deliberately. Don't let the default decide for you. ## Related - **[Agent Loop Architecture](/docs/agent-loop)**: The single-agent loop that each worker runs. Every worker in a multi-agent system is itself an agent loop, the same two-state machine, with its own message history and tool dispatch cycle. Understanding the loop makes worker execution predictable: a worker terminates when its model response contains no tool calls, just like any single agent. - **[Streaming and Events](/docs/streaming-and-events)**: Event propagation through the coordination layer. Worker agents yield typed events the same way a single agent loop does. A coordinator or supervisor can subscribe to those event streams to observe worker progress in real-time, forward events to a UI, or detect failures before synthesis begins. The event model makes multi-agent observability composable. - **[Safety and Permissions](/docs/safety-and-permissions)**: Worker permission forwarding and context isolation as a safety property. 
Workers running in isolated contexts cannot show user dialogs directly. They forward permission requests to the leader via the mailbox protocol. The permissions chapter covers how to configure per-worker trust levels and how the mailbox-based forwarding protocol protects against unattended workers prompting users from the wrong context. - **[Hooks and Extensions](/docs/hooks-and-extensions)**: Team lifecycle hooks (`SessionStart`, `SessionStop`, and the 27+ hook events organized by lifecycle phase) give coordinators and workers extensibility without modifying the core coordination protocol. Hook-based audit and observability compose naturally with the mailbox communication pattern. - **[Pattern Index](/docs/pattern-index)**: All patterns from this page in one searchable list, with context tags and links back to the originating section. - **[Glossary](/docs/glossary)**: Definitions for all domain terms used on this page, from agent loop primitives to memory system concepts. --- # Command and Plugin Systems Source: https://claudepedia.dev/docs/command-and-plugin-systems Section: Patterns / advanced-patterns How to build a scalable command registry that supports 100+ commands through metadata-first registration, lazy loading, and multi-source merging. An agent starts with a handful of commands. In production, it grows to 100+. If each command wires itself into the dispatch logic at registration time, you get a maintenance nightmare. Adding a command means touching the dispatcher, the help system, the permission layer, and the availability checks. Every command knows too much about the system it lives in. The solution is to **separate declaration from implementation**. Every command is a metadata object first. The metadata tells the system what the command is, when it's available, and how to load it, but never what it does. Implementation is deferred to invocation time via dynamic imports. 
This is the metadata-first registration pattern, and it enables everything else: lazy loading, multi-source merging, and command discovery without code execution. The big idea: the registry can hold 100+ commands while loading essentially none of them at startup. A command that is never invoked never loads its module. A command that is disabled never even reaches the dispatcher. And because every source (builtins, plugins, skills, external providers) contributes the same metadata shape, the registry merges them all with a simple array concatenation. No collision resolution logic. No priority scores. The ordering of the arrays IS the priority. ## The Three Command Types The command type is a **discriminated union** on the `type` field. The dispatcher needs to know how to run each type without knowing what any specific command does: ```python type Command = BaseMetadata & ( | { type: "local", load: () -> Promise<{ call: LocalCommandFn }> } | { type: "interactive", load: () -> Promise<{ call: InteractiveCommandFn }> } | { type: "prompt", get_prompt: (args, context) -> Promise } ) ``` The three types cover all behaviors: | Type | What It Does | Example | |------|-------------|---------| | `local` | Runs a function, returns a result (text, structured data, or skip). No UI. | `/summarize`, `/compact` | | `interactive` | Renders a UI component, receives a completion callback. For commands that need user interaction. | `/settings`, `/agents` | | `prompt` | Expands into the conversation as content. The command produces messages, not side effects. | Skills, workflow commands, plugin-provided commands | Why three types instead of one? Because the dispatcher needs to route correctly. A `local` command is called and its return value is processed. An `interactive` command is rendered and the system waits for the `on_done` callback. A `prompt` command injects content blocks into the conversation and the agent loop continues. 
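The routing can be sketched as a single dispatch function keyed on the `type` field (a synchronous simplification; the handler shapes are assumptions):

```python
# Illustrative dispatcher over the discriminated union. It routes on `type`
# alone and never touches a specific command's implementation.
def dispatch(command: dict, args: str, context: dict):
    kind = command["type"]
    if kind == "local":
        module = command["load"]()           # lazy: implementation loads here
        return ("result", module["call"](args, context))
    if kind == "interactive":
        module = command["load"]()
        return ("render", module["call"])    # UI renders it, awaits completion
    if kind == "prompt":
        return ("messages", command["get_prompt"](args, context))
    raise ValueError(f"unknown command type: {kind}")
```

Adding a fourth command type means adding one branch here, never editing an existing command.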
Each dispatch path is different, but none of the dispatch logic touches any specific command's implementation. The type field is the only coupling. ## Metadata-First Registration Every command is described by a base metadata object before it is ever loaded. The following shows a complete registration. Notice that the module containing the implementation is not imported here: ```python # Each command is a metadata object: behavior is declared, not wired command = { type: "local", # one of: local, interactive, prompt name: "summarize", description: "Summarize the current conversation", is_enabled: () -> config.get("summarize_enabled", default=True), argument_hint: "", load: () -> import_module("commands.summarize") # lazy, not called at registration } # Registry just holds the metadata objects registry.add(command) ``` Each metadata field has a specific purpose in the dispatch lifecycle: - **`is_enabled()`**: evaluated fresh every time the command list is requested. NOT memoized. Why: auth state can change mid-session (for example, after a login command grants new permissions). Memoizing this check means users don't see commands they've just unlocked, a silent failure with no error message. - **`availability`**: a separate, static gate: who can ever use this command (auth requirements, provider restrictions). Evaluated independently from `is_enabled` to keep "who can use it" apart from "is it turned on right now." Conflating them into one check breaks post-login auth refreshes. - **`load()`**: the only connection to the implementation. Returns a module with a `call()` function. This is the lazy loading hook. The module is never imported at registration time. - **`argument_hint`**: what the help system shows. The help system works entirely from metadata. It never loads the command implementation to describe the command. - **`description`**: what users see in the command picker. The LLM uses this field for command suggestions. 
Neither the user interface nor the LLM ever sees the implementation code. The key insight: "The metadata describes the command completely enough for the registry, help system, permission checks, and availability filtering to work, without ever loading the command's code." ## Lazy Loading and Startup Cost Every command implementation is loaded only when the command is invoked. The pattern is a two-phase call: ```python # At registry time: just metadata, no code loaded load: () -> import_module("commands.summarize") # At invocation time: module loaded, function called module = await command.load() return module.call(args, context) ``` This matters enormously at scale. With 100+ commands, eager loading would import every command module at startup. Some commands pull in heavy dependencies. Consider a single command whose module is 113KB and 3,200 lines because it includes diff rendering and HTML generation. Loading that at startup costs 113KB of parse time, even for users who never invoke that command. With lazy loading, startup cost is O(1) regardless of how many commands exist: a 100-command registry loads 100 metadata objects, not 100 modules. **The heavy-command shim pattern** goes one step further. For especially heavy commands, even the import wrapper can be deferred. The registry entry is a thin metadata object that constructs the dynamic import call only when invoked. The import path itself is not evaluated until dispatch: ```python # Thin shim: the import path is not evaluated until load() is called heavy_command = { type: "local", name: "analyze", description: "Analyze the codebase", is_enabled: () -> feature_enabled("INSIGHTS"), load: () -> import_module("commands.analyze") # path string only, not executed } ``` This avoids even the overhead of evaluating the import path at registry build time, which is useful when registry construction is on the critical path to first response. ## Multi-Source Registry Merge The registry is not a static list. 
In a production agent, commands come from multiple sources: built-in commands shipped with the system, skills discovered from project directories, plugin commands installed by users, and workflow commands from project configuration files. All four sources must coexist with identical interfaces and identical dispatch paths. The multi-source merge loads all sources concurrently and concatenates them: ```python async function get_commands(working_dir: str) -> list[Command]: # All sources load concurrently [builtin_skills, plugin_skills, plugin_commands, workflow_commands] = await parallel([ load_skills(working_dir), load_plugin_skills(), load_plugin_commands(), load_workflow_commands(working_dir), ]) all_commands = [ ...builtin_skills, ...plugin_skills, ...plugin_commands, ...workflow_commands, ...BUILTIN_COMMANDS, # static list, always present ] # Run availability + enabled checks fresh: never memoize these return [c for c in all_commands if meets_availability(c) and c.is_enabled()] ``` The five sources and their ordering: 1. **Built-in skills** (shipped with the system): always present 2. **Plugin skills** (user-installed skill directories): project-local 3. **Plugin commands** (code-backed, from user-installed plugins): user-global 4. **Workflow commands** (file-backed, from project config): project-local 5. **Built-in commands** (the static core list): always present **Ordering IS the priority.** A plugin skill with the same name as a built-in command shadows it because it appears earlier in the array. There is no collision resolution logic, no priority score calculation. The array merge is intentionally simple. Simplicity here prevents a class of bugs where two sources both claim a command name and the winner is non-obvious. **Memoization by working directory.** The expensive parts (skill discovery, plugin loading) are memoized by `working_dir` because skills are project-local. Two projects can't share a command cache. 
A command discovered in `/project-a/.claude/skills/` is not available in project B. But the availability and `is_enabled` checks run fresh on every `get_commands()` call because those checks are cheap and auth state can change mid-session.

## Feature Flags and Dead-Code Elimination

Some commands only exist when a feature flag is active. The temptation is to use dynamic imports for these, but dynamic imports cannot be tree-shaken by bundlers:

```python
# Wrong: dynamic import, bundler cannot tree-shake this
# The module is included in every build regardless of flag state
if feature_enabled("EXPERIMENTAL_INSIGHTS"):
    module = await import_module("commands/insights")
    commands.append(module.default)
```

```python
# Correct: conditional require, bundler can tree-shake when flag is off
# When the flag is off at build time, the module is excluded entirely
raw_commands = [
    require("commands/insights") if feature_enabled("EXPERIMENTAL_INSIGHTS") else None,
    ...BUILTIN_COMMANDS,
]

# Filter out the null entries left by disabled flags
commands = [c for c in raw_commands if c is not None]
```

The reason this matters: a dynamic import path is a string that the bundler cannot evaluate at build time. The bundler must include the module in every build because the import *might* execute. A static `require()` call (or a static `import` at the top of the file) is analyzable. If the condition is false at build time, the bundler eliminates the module entirely.

This is a subtle but important production decision. Using dynamic imports for feature-flagged commands means every production build carries the dead code of every disabled experimental feature.

## Production Considerations

**1. Never memoize availability checks.** The `is_enabled()` function runs fresh on every command list request because auth state changes mid-session. Consider a login command: after it succeeds, the system must show new commands that are gated on authenticated status.
If `is_enabled()` is memoized, users don't see those commands until the next process restart. This is counterintuitive. The natural instinct is to cache for performance. The right split: memoize the expensive discovery (skill scanning, plugin loading) and keep the cheap checks (availability, enabled state) live. **2. Separate "who can use it" from "is it turned on."** Availability (a static auth/provider gate) and enabled state (a dynamic feature flag or environment check) are different concerns. A command can be available to a user but currently disabled (feature flag off). A command can be enabled but not available (requires auth the user hasn't completed). Conflating these into a single `is_available()` check makes it impossible for the system to distinguish these cases, and post-login auth refreshes will fail to surface newly-unlocked commands. **3. Memoize by working directory, not globally.** Skills are project-local. Different projects have different skill directories. A global command cache serves stale skills when switching projects. The memoization key must include the working directory. If your agent supports multiple concurrent projects, each project needs its own command cache entry. **4. Heavy commands need lazy shims, not just lazy loading.** For commands with very large implementations (tens of thousands of lines, multiple heavy dependencies), even the dynamic import wrapper should be deferred. Define a thin metadata object that constructs the import call only at invocation time. This avoids evaluating import paths at registry build time, which is relevant when registry construction is on a hot path. **5. Static imports for feature-flagged commands.** Using dynamic imports for feature-gated commands defeats dead-code elimination. The bundler can only tree-shake what it can statically analyze. Feature flags that are known at build time should use conditional `require()` or top-level static imports, not dynamic `import()` inside conditionals. 
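Points 1 and 3 combine into one split: memoize the expensive discovery keyed by working directory, and run the cheap checks live on every request. A minimal sketch (the scanning stub and all names are hypothetical):

```python
# Illustrative memoization split: discovery is cached per working directory,
# while is_enabled() runs fresh on every call. All names are hypothetical.
from functools import lru_cache

AUTH_STATE = {"logged_in": False}   # stand-in for mutable session auth state

def scan_for_command_metadata(working_dir: str) -> list[dict]:
    # Stand-in for the expensive part: skill-directory scans, plugin loading
    return [
        {"name": "summarize", "is_enabled": lambda: True},
        {"name": "deploy", "is_enabled": lambda: AUTH_STATE["logged_in"]},
    ]

@lru_cache(maxsize=None)            # keyed by working_dir: project-local cache
def discover_commands(working_dir: str) -> tuple:
    return tuple(scan_for_command_metadata(working_dir))

def get_commands(working_dir: str) -> list[dict]:
    # Live filter: commands unlocked mid-session (e.g. after login) appear
    # immediately, without invalidating the discovery cache
    return [c for c in discover_commands(working_dir) if c["is_enabled"]()]
```

Flipping the auth state changes the command list on the very next call, while the discovery cache stays warm.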
## Best Practices - **Do** separate command metadata from command implementation. The registry should never import command code at registration time. - **Do** run availability checks fresh on every request. Auth state changes mid-session. - **Do** use discriminated unions for command types. The dispatcher needs to know how to run each type without knowing what it does. - **Do** memoize expensive discovery (skill scanning, plugin loading) by working directory. - **Don't** use a single boolean for "is this command available." Separate static availability from dynamic enabled state. - **Don't** memoize the full command list including enabled checks. Only memoize the expensive parts. - **Don't** use dynamic imports for feature-flagged commands. Use static imports that bundlers can tree-shake. - **Don't** rely on ordering for correctness beyond precedence. If two commands with the same name should NOT shadow each other, that's a naming problem, not a registry problem. ## Related - [Tool System](/docs/tool-system): Commands and tools are complementary: tools are what the LLM invokes autonomously, commands are what the user invokes explicitly. Both use registry patterns and metadata-first design. - [MCP Integration](/docs/mcp-integration): MCP servers can contribute commands to the registry through the `prompt` command type, extending the command system across process boundaries. - [Hooks and Extensions](/docs/hooks-and-extensions): Another extension mechanism that modifies agent behavior at defined extension points without touching core code. Hooks are event-driven while commands are user-invoked. - [Pattern Index](/docs/pattern-index): All patterns from this page in one searchable list, with context tags and links back to the originating section. - [Glossary](/docs/glossary): Definitions for domain terms used on this page. 
--- # Hooks and Extension Points Source: https://claudepedia.dev/docs/hooks-and-extensions Section: Patterns / advanced-patterns How typed interceptors modify agent behavior at 27+ lifecycle events without touching core loop code: four execution modes, condition syntax, and error isolation. Hooks are the boundary between the agent core and everything else. If you want to validate a tool call before it runs, classify user input before it reaches the agent, audit a completed action, or trigger a webhook on every session start, you do all of that through hooks. The agent loop doesn't know any of this is happening. It calls hooks at defined extension points and continues. Most systems start with a simple before/after model: run a function before the tool, run another function after. That covers 20% of real production needs. The other 80% requires extension points at session boundaries, memory compaction events, permission decisions, file system changes, and multi-agent team coordination. A before/after model forces you to either pack all of that into the tool hooks (wrong layering) or patch the core loop (unsafe coupling). The solution is a richer event taxonomy: 27+ named lifecycle events organized by phase, each independently hookable. The mental model that makes this system work: **hooks are configured as data, not code**. You declare what kind of hook to run, which events to react to, and which conditions must be true. The agent loop interprets all of this. The hook runner evaluates conditions, spawns the right executor, and aggregates results. This separation keeps hooks safe: conditions are pattern matching, not arbitrary code evaluation. You can't accidentally inject logic into the evaluation path itself. ## The Four Execution Modes Every hook declaration specifies one of four execution modes. The modes differ in cost, capability, and latency. 
Choosing the right mode is the primary design decision for any hook: | Mode | Execution | Best For | Relative Cost | |------|-----------|----------|---------------| | `command` | Shell subprocess | Scripts, formatters, linters, validators | Low | | `prompt` | Single LLM call (small/fast model) | Classification, content moderation, intent checks | Medium | | `agent` | Full multi-turn sub-agent with tools | Complex verification that requires exploration | High | | `http` | HTTP POST with SSRF protection | External webhooks, audit logs, third-party integrations | Low to Medium | The mode is the cost/capability trade-off made explicit. A `command` hook runs fast and deterministically. It's the right choice for validation logic you can express in a script. A `prompt` hook makes one LLM call and returns a structured decision, making it right for classification that requires language understanding. An `agent` hook spawns a full sub-agent with access to tools. It can read files, run commands, and reason across multiple steps, but it's expensive. An `http` hook posts to an external endpoint and optionally waits for a response, making it right for audit systems and third-party integrations. The hook configuration structure binds a mode to an event: ```python # Hooks are configured as data: the runner interprets them hooks_config: PreToolUse: - matcher: "Write" hooks: - type: "command" command: "lint-check --file $tool_input_path" if: "Write(src/**/*.ts)" timeout: 10 Stop: - hooks: - type: "agent" prompt: "Verify that the implementation matches the requirements. Read the test output and check all tests pass." timeout: 60 PostToolUse: - matcher: "Bash" hooks: - type: "http" url: "http://localhost:9000/audit" async: true # fire-and-forget: don't block the agent ``` > **Note:** The `async: true` flag signals fire-and-forget. The hook delivers its result asynchronously. The agent doesn't wait. 
Use this for audit and logging hooks where you want observability without adding latency. ## The Hook Event Lifecycle The 27+ hook events are organized by lifecycle phase. Understanding which phase covers your use case tells you which events to hook: **Session lifecycle**: fires once per session, not per turn: - `SessionStart`: agent session begins - `SessionEnd`: agent session ends (clean or error) - `Setup`: initialization phase before the first user interaction **Per-turn**: fires around each agent turn: - `UserPromptSubmit`: user submits a prompt, before it reaches the agent - `Stop`: agent decides to stop (turn ends normally) - `StopFailure`: agent hits a termination condition due to an error **Tool lifecycle**: fires around every tool invocation: - `PreToolUse`: before a tool executes. Can modify tool arguments or block execution. - `PostToolUse`: after a successful tool execution. Can modify the tool output. - `PostToolUseFailure`: after a tool execution that produced an error - `PermissionRequest`: agent requests permission for a restricted operation - `PermissionDenied`: permission was denied **Memory**: fires around context compaction: - `PreCompact`: before the context window is compacted - `PostCompact`: after compaction completes **Multi-agent team**: fires in multi-agent coordination scenarios: - `SubagentStart`: a sub-agent is spawned - `SubagentStop`: a sub-agent completes or is cancelled - `TeammateIdle`: a coordinated agent has no pending work - `TaskCreated`: a new task is created in the task queue - `TaskCompleted`: a task completes **Notifications:** - `Notification`: the agent emits a notification event **MCP integration:** - `Elicitation`: an MCP server requests information from the user - `ElicitationResult`: the result of an elicitation is available **Configuration:** - `ConfigChange`: agent configuration is reloaded or modified - `InstructionsLoaded`: system instructions are loaded **File system:** - `CwdChanged`: working directory changes 
- `FileChanged`: a tracked file is modified - `WorktreeCreate`: a new git worktree is created - `WorktreeRemove`: a git worktree is removed The scope of this event set is the production insight: this is not a simple before/after tool hook system. It covers the entire agent operational surface, including session lifecycle, user turns, tool dispatch, memory management, multi-agent coordination, external integrations, and file system changes. Any of these events can be extended without modifying the core loop. ## Condition Syntax Each hook can declare an `if` condition that gates when it fires. The condition syntax reuses the permission-rule pattern matching syntax, the same evaluator and the same mental model as the permission system. ```python # Matches Write tool on any .ts file in src/ if: "Write(src/**/*.ts)" # Matches any Bash command starting with 'git push' if: "Bash(git push*)" # No condition field = always run for this event # (a hook with no if field matches every invocation) ``` The condition is evaluated against the tool name and the tool input before the hook executor is even spawned. A non-matching hook is never invoked, meaning no process, no LLM call, no HTTP request. This makes conditions a cost-saving mechanism, not just a filtering mechanism. The practical implication: write the most specific condition you can. A `PreToolUse` hook with `if: "Write(src/**/*.ts)"` fires only for TypeScript source files. The same hook without an `if` condition fires for every Write call to every file, including build artifacts, logs, and temporary files, most of which you don't care about. ```python # Example: condition syntax matching structure hook_matcher: matcher: "Write" # match on tool_name for PreToolUse/PostToolUse hooks: - type: "command" command: "check-types --file $tool_input_path" if: "Write(src/**/*.ts)" # pattern: tool_name(tool_input_glob) timeout: 10 ``` The matcher and the `if` condition work at two levels. 
The `matcher` field narrows which tool type triggers this hook group at all. The `if` condition within each hook further narrows by the specific tool input. Use `matcher` for broad categorization (all Write calls) and `if` for fine-grained filtering (Write calls to specific paths or patterns). ## Hook Response Protocol A hook communicates its decision back to the agent loop via a structured response. The response is a discriminated union on whether execution was synchronous or asynchronous: **Synchronous response** (hook result is available immediately): | Field | Purpose | |-------|---------| | `continue` | Whether agent execution should continue | | `suppressOutput` | Whether the hook's output should be hidden from the user | | `stopReason` | If stopping, why. This is shown to the user. | | `decision` | An explicit `block` or `allow` for hooks that make permission decisions | | `hookSpecificOutput` | Per-event structured output (see below) | **Asynchronous response** (hook delivers result later): ```python { async: true, asyncTimeout: 30 } ``` The `hookSpecificOutput` field carries per-event capabilities. For `PreToolUse` hooks, it can include: - `updatedInput`: modified tool arguments (the tool runs with these args instead of the originals) - `permissionDecision`: an explicit allow/block decision - `additionalContext`: extra context injected into the tool execution environment For `PostToolUse` hooks, it can include: - `updatedMCPToolOutput`: a replacement for the tool's output This is how hooks achieve surgical control. A `PreToolUse` hook can rewrite tool arguments (normalizing paths, sanitizing inputs, or transforming the request) without owning tool execution. A `PostToolUse` hook can transform or augment the tool output without re-running the tool. 
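The `tool_name(input_glob)` condition shape from the Condition Syntax section above can be sketched with a toy evaluator. This is a hedged approximation: the real system reuses the permission-rule pattern evaluator, and this sketch substitutes Python's `fnmatch`, whose `*` also crosses `/`, so it only approximates true glob semantics.

```python
from fnmatch import fnmatch

def condition_matches(condition, tool_name: str, tool_input: str) -> bool:
    """Approximate evaluator for conditions like "Write(src/**/*.ts)".

    A hook with no condition matches every invocation. Illustrative
    sketch only, not the production pattern grammar.
    """
    if condition is None:
        return True
    name_part, _, glob_part = condition.partition("(")
    if name_part != tool_name:
        return False
    return fnmatch(tool_input, glob_part.rstrip(")"))

assert condition_matches("Write(src/**/*.ts)", "Write", "src/app/main.ts")
assert not condition_matches("Write(src/**/*.ts)", "Write", "build/out.js")
assert not condition_matches("Write(src/**/*.ts)", "Bash", "src/app/main.ts")
assert condition_matches("Bash(git push*)", "Bash", "git push origin main")
assert condition_matches(None, "Bash", "rm -rf /tmp/x")  # no condition = always
```

Note that the evaluation happens before any executor is spawned, which is why a non-matching condition costs essentially nothing.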
**Exit code semantics for command hooks:** ```python # Exit code 2 from a command hook = blocking result # stderr goes to the model as context, and the tool call is blocked exit 2 # JSON on stdout from a command hook = rich structured response # The runner parses this and maps it to the response protocol stdout: | { "continue": false, "stopReason": "Security policy violation: destructive operation detected", "decision": "block" } # Exit code 0 with no stdout = success, no blocking, no message ``` ## Error Isolation and Aggregation Hook failures are isolated. A hook that crashes, times out, or produces an error never crashes the main agent loop. The failure is captured as one of four outcomes: - `success`: hook ran, returned a result, no issues - `blocking`: hook explicitly blocked execution (exit code 2 for command hooks, or `"continue": false` in JSON response) - `non_blocking_error`: hook failed (exception, timeout, non-zero exit that isn't 2) but execution continues. Error is logged to the transcript. - `cancelled`: hook was aborted (parent operation cancelled, or the hook's own timeout expired) When multiple hooks are registered for the same event, their results are aggregated: ```python aggregate_hook_results(results): # Any blocking result blocks: order doesn't matter blocking_results = [r for r in results if r.outcome == "blocking"] if blocking_results: return combined_blocking_error(blocking_results) # Non-blocking errors are logged but don't stop execution errors = [r for r in results if r.outcome == "non_blocking_error"] if errors: log_errors_to_transcript(errors) # execution continues past this point return success(aggregate_context(results)) ``` A blocking result from any single hook blocks the entire operation, regardless of what other hooks returned. Multiple non-blocking errors are all logged. This is the right aggregation policy: a blocking hook is a veto, not a vote. You don't want "mostly approved" to mean "proceed." 
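The exit-code and JSON conventions for command hooks, together with the four outcomes above, can be sketched as a single result-mapping function. Field names follow the response protocol tables; the function name and return shape are hypothetical, a hedged illustration rather than the actual runner.

```python
import json

def map_command_hook_result(exit_code: int, stdout: str, stderr: str) -> dict:
    """Map a command hook's process result onto the response protocol.

    Illustrative sketch of the conventions described above:
    - exit 2        -> blocking; stderr becomes context for the model
    - JSON stdout   -> parsed as a rich structured response
    - exit 0, quiet -> success with nothing to report
    - anything else -> non-blocking error; logged, execution continues
    """
    if exit_code == 2:
        return {"outcome": "blocking", "context_for_model": stderr}
    if stdout.strip():
        try:
            response = json.loads(stdout)
        except json.JSONDecodeError:
            return {"outcome": "non_blocking_error", "error": "unparseable stdout"}
        if response.get("continue") is False or response.get("decision") == "block":
            return {"outcome": "blocking", "stop_reason": response.get("stopReason")}
        return {"outcome": "success", "response": response}
    if exit_code == 0:
        return {"outcome": "success"}
    return {"outcome": "non_blocking_error", "error": f"exit code {exit_code}"}

assert map_command_hook_result(2, "", "policy violation")["outcome"] == "blocking"
assert map_command_hook_result(0, "", "")["outcome"] == "success"
assert map_command_hook_result(1, "", "boom")["outcome"] == "non_blocking_error"
blocked = map_command_hook_result(0, '{"continue": false, "decision": "block"}', "")
assert blocked["outcome"] == "blocking"
```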
The error isolation guarantee means hooks are safe to add without risk of destabilizing the main loop. A hook that occasionally fails with a non-blocking error is not a production incident. It logs, and execution continues. This safety property is what makes it reasonable to put `prompt` and `agent` hooks in production paths where the underlying LLM might occasionally time out. ## Safety: SSRF and Execution Boundaries **HTTP hooks** require SSRF protection. Without it, a misconfigured hook could be weaponized to make the agent probe your internal network. The protection blocks requests to private IP ranges (RFC 1918: 10.x.x.x, 172.16-31.x.x, 192.168.x.x), cloud metadata endpoints (169.254.x.x), and CGNAT ranges. Loopback addresses (127.x.x.x, ::1) are intentionally allowed because `http://localhost/` is the primary use case for local audit servers and policy proxies. The validation happens at DNS resolution time, not connection time. This prevents DNS rebinding attacks: a hostname that resolves to a public IP at validation time but to a private IP at connection time. IPv4-mapped IPv6 addresses are checked alongside IPv4 to prevent bypass via hex notation. **Agent hooks** are sandboxed sub-agents. They run a full multi-turn loop with access to codebase tools, but with two constraints: they cannot spawn additional sub-agents, and they must return their result via a `StructuredOutput` tool call. This enforces that agent hooks terminate with an explicit result rather than running indefinitely. The sub-agent has a 50-turn cap. If it exhausts turns without calling `StructuredOutput`, the hook returns `outcome: 'cancelled'` silently (no error shown to the user, execution continues). **Prompt hooks** use the direct LLM query path rather than the user-input processing path. This is deliberate: if a `UserPromptSubmit` hook internally called into the user-input processing pipeline, it would trigger another `UserPromptSubmit` hook, creating infinite recursion. 
Prompt hooks bypass this by querying the LLM directly, skipping the hook dispatch mechanism. ## Production Considerations **1. Hook execution order is not guaranteed for concurrent async hooks.** Hooks on the same event run in configuration order for synchronous execution. But when multiple hooks use `async: true`, they complete in arbitrary order. Design each hook to be fully independent. Don't write hook A that reads state set by hook B, even if hook B appears first in the configuration. The aggregation is commutative, not sequential. **2. Agent hooks silently cancel on the 50-turn limit.** An agent hook that doesn't reach a conclusion within 50 turns returns `outcome: 'cancelled'` with no error surfaced to the user. From the user's perspective, the hook simply didn't run. Write agent hook prompts that direct minimal, targeted exploration: "Verify the unit tests pass by reading the most recent test output file." If the goal requires open-ended reasoning without a clear termination path, use a `command` or `prompt` hook instead. **3. Three registration sources contribute to the same unified configuration.** Hooks can be registered from settings files (user-global, project-local, or session-local), from plugin callbacks (code-backed hooks registered programmatically), and from frontmatter or skill hooks (declared in skill configuration files). All three sources merge into the same unified hooks configuration structure and run through the same executor. There is no separate "plugin hook" runtime. Everything goes through the same path. **4. The `once` flag runs a hook exactly once, then deregisters it.** Some initialization scenarios require a hook that runs once and never again. For example, seeding a database on the first `SessionStart`. The `once` flag handles this: the hook fires, then removes itself from the registry. This is a registration pattern, not an ordering mechanism. Don't use `once` to control execution order between two hooks. 
If hook A must run before hook B on the same event, that's a different problem and requires a different approach. **5. SSRF protection is bypassed when a proxy mediates DNS resolution.** When a global network proxy is in use, the SSRF guard validates the proxy's IP address (typically a publicly-routable, allowed address) rather than the final destination. The guard cannot see through the proxy. In proxy-mediated or sandboxed environments, apply network-level controls (proxy allowlists, firewall rules) rather than relying solely on the application-layer SSRF guard. ## Best Practices - **Do** use typed conditions, not hook-internal filtering. The `if` condition is evaluated before the hook executor is spawned. A non-matching hook costs nothing. An unfiltered hook that inspects input inside the hook logic still costs a subprocess or LLM call for every invocation. - **Do** match the execution mode to the problem. Use `command` for deterministic validation (scripts, linters, formatters). Use `prompt` for classification and intent checks (one LLM call, fast). Use `agent` only for complex verification that genuinely requires exploration. Use `http` for external audit and webhook delivery. - **Do** use `async: true` for observability and audit hooks. Hooks that log, audit, or notify external systems don't need to block the agent. Fire-and-forget hooks add zero latency to the critical path. - **Don't** write hooks that depend on other hooks' side effects on the same event. The aggregation is parallel and order-independent. Any hook that assumes another hook's state was applied first will fail intermittently in ways that are hard to reproduce. - **Don't** use agent hooks for ambiguous or open-ended goals. A vaguely-specified agent hook prompt that says "verify the changes look good" will exhaust its turn budget reasoning in circles and silently cancel. Be specific: name the exact verification step, the file to read, and the condition to check. 
- **Do** register the analytics and audit hooks with `async: true`. The production pattern is: blocking hooks for safety and validation (command or prompt), fire-and-forget hooks for observability and audit (http or command with async). - **Don't** rely on hook execution order for correctness. If two hooks on the same event produce conflicting results, the aggregation will combine them. It won't pick one and ignore the other based on position. ## Related - [Tool System](/docs/tool-system): Hooks extend tool dispatch at the `PreToolUse` and `PostToolUse` events. Understanding the tool lifecycle makes hook timing clearer. - [Safety and Permissions](/docs/safety-and-permissions): The `if` condition syntax in hooks reuses the same pattern-matching evaluator as permission rules, so the mental model transfers directly. - [Command and Plugin Systems](/docs/command-and-plugin-systems): Hooks and commands are complementary extension mechanisms: commands are user-invoked, hooks are event-driven. Both extend behavior without modifying the core loop. - [MCP Integration](/docs/mcp-integration): MCP tools are hookable via the same `PreToolUse` and `PostToolUse` events as built-in tools. The `Elicitation` and `ElicitationResult` events are specific to MCP server interactions. - [Observability and Debugging](/docs/observability-and-debugging): Hook outcomes are logged as first-class events in the observability layer. Hook spans appear in session tracing, composing with the debugging tools to give complete visibility into hook behavior and timing. - [Pattern Index](/docs/pattern-index): All patterns from this page in one searchable list, with context tags and links back to the originating section. - [Glossary](/docs/glossary): Definitions for domain terms used on this page. 
--- # MCP Integration Source: https://claudepedia.dev/docs/mcp-integration Section: Patterns / advanced-patterns How the Model Context Protocol turns external services into agent tools: transport selection, tool bridging, and connection lifecycle management. An agent's built-in tools are capable but finite. In production, you need to connect to external services: code repositories, databases, monitoring dashboards, deployment systems. Without a standard integration layer, each connection is a custom adapter with its own auth, error handling, schema translation, and reconnection logic. The maintenance cost compounds with every new service. The Model Context Protocol standardizes this adapter layer. An MCP server advertises its capabilities through a well-defined API. The agent-side client queries those capabilities at connection time, constructs **Tool objects that are structurally identical to built-in tools**, and registers them with the agent's tool dispatcher. From that point on, the agent loop dispatches MCP tools and built-in tools through exactly the same code path. The loop never sees "this is an external tool." It sees a Tool object with a name, a schema, and a `call()` function. The key insight: **MCP handles everything the agent loop doesn't want to know about**: transport selection, auth, schema translation, reconnection, and output normalization. By connection time, an MCP tool is structurally indistinguishable from a built-in tool. The complexity is absorbed at the boundary, not distributed into the loop. ## The Tool Bridge Pattern The tool bridge is the core pattern. 
At connection time, the client queries `tools/list`, receives a schema for each tool the server exposes, and constructs a standard Tool object for each one: ```python async function connect_server(server_name: str, config: ServerConfig) -> Connection: transport = create_transport(config) # stdio, sse, http, ws: selected from config client = new Client() await client.connect(transport) # Discover what this server can do tools_response = await client.request("tools/list") # Bridge: construct agent-compatible Tool objects from MCP schema agent_tools = [] for mcp_tool in tools_response.tools: agent_tool = { name: f"mcp__{server_name}__{mcp_tool.name}", description: truncate(mcp_tool.description, max=2048), input_schema: mcp_tool.input_schema, # MCP server owns the schema is_read_only: mcp_tool.annotations?.read_only_hint ?? False, is_destructive: mcp_tool.annotations?.destructive_hint ?? False, call: (args, ctx) -> client.call_tool(mcp_tool.name, args) } agent_tools.append(agent_tool) return { client, tools: agent_tools } ``` Four things happen in this bridge: 1. **Discovery**: `tools/list` returns the server's capabilities: name, description, input schema, and annotations. 2. **Construction**: each MCP tool becomes a standard Tool object with the same interface as built-in tools. The dispatcher sees no difference. 3. **Namespacing**: `mcp__{server}__{tool}` prevents collisions across servers and makes tool ownership traceable in logs. When you see `mcp__payments__refund_transaction` in a trace, you know immediately which server and which operation. 4. **Annotation passthrough**: the server's hints (`read_only_hint`, `destructive_hint`) map directly to the concurrency and permission system. A tool marked `read_only` can run concurrently. A tool marked `destructive` triggers confirmation. See [Tool System](/docs/tool-system) for how the dispatcher uses these flags. The description truncation (`max=2048`) is not optional. See Production Considerations for why. 
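The namespacing and truncation steps of the bridge can be sketched concretely. The helper names below are hypothetical; the `mcp__{server}__{tool}` shape and the 2048-character cap follow the text above.

```python
def namespace_tool(server_name: str, tool_name: str) -> str:
    """mcp__{server}__{tool}: collision-free across servers and
    immediately traceable to its owner in logs."""
    return f"mcp__{server_name}__{tool_name}"

def truncate_description(description: str, max_len: int = 2048) -> str:
    """Hard cap on tool descriptions so one OpenAPI-generated server
    cannot exhaust the context budget on every turn."""
    if len(description) <= max_len:
        return description
    return description[:max_len - 1] + "…"

name = namespace_tool("payments", "refund_transaction")
assert name == "mcp__payments__refund_transaction"

huge = "x" * 60_000  # e.g. an OpenAPI-generated description
assert len(truncate_description(huge)) == 2048
assert truncate_description("short") == "short"
```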
## Transport Selection MCP servers connect via one of five transports, selected by the `type` field in the server configuration: | Transport | Connection | Use Case | |-----------|-----------|----------| | `stdio` | Local subprocess via stdin/stdout | Most common. Server runs as a child process. | | `sse` | Server-sent events (HTTP long-poll) | Remote servers, legacy. One-way push with POST for client messages. | | `http` | Bidirectional HTTP (POST + SSE) | The newer standard for remote servers (Streamable HTTP). | | `ws` | WebSocket | Bidirectional streaming. Low-latency remote communication. | | `sdk` | In-process, no network | Programmatic integration, testing. | The transport is selected at configuration time, not at runtime. The client creates the appropriate transport object based on the config's `type` field, then all subsequent communication goes through the same `request()` / `notify()` interface regardless of transport. The tool bridge pattern above works identically for all five transports. `client.request("tools/list")` has the same API whether the underlying channel is a subprocess pipe, an HTTP stream, or a WebSocket. **The HTTP transport** (sometimes called "Streamable HTTP") is the newest addition to the MCP specification. It replaces SSE for new remote integrations because it supports bidirectional communication without the limitations of server-sent events. SSE connections are one-way push channels. The client has to POST separately for messages to the server, which creates coordination overhead. HTTP transport handles both directions in a single channel. For new remote server integrations, prefer `http` over `sse`. **A critical SSE-specific constraint:** SSE connections are long-lived GET requests that stay open to receive events. Standard HTTP timeout wrappers (commonly set at 60 seconds) will kill these streams. Any timeout middleware must explicitly skip GET requests. 
Applying a uniform timeout to all HTTP requests is a common implementation mistake that silently breaks SSE. See Production Considerations item 1. ## Connection Lifecycle Every server connection is one of five states. All tool and resource fetching is gated on the `connected` state: ```python # Server connection states: all tool/resource fetching gates on 'connected' type ConnectionState = | "connected" # tools available: everything works normally | "failed" # connection error: return empty tool list | "needs-auth" # auth required: offer auth tool only, no data tools | "pending" # reconnecting: return empty tools until reconnection succeeds | "disabled" # manually disabled: completely silent function get_tools_for_server(connection: Connection) -> list[Tool]: if connection.state != "connected": return [] # all non-connected states return empty: agent loop never sees the error return connection.tools ``` The design principle: **all non-connected states return empty tool lists**. The agent loop gets a consistent interface regardless of server health. A server can fail, require auth, or be disabled. The loop never sees an error, it just has fewer tools available. This is the same fail-silent pattern used in circuit breakers: downstream failures are absorbed at the boundary. Each non-connected state has a different implication: - **`failed`**: connection error. Returns empty tool list. The system may attempt reconnection depending on failure type. - **`needs-auth`**: server requires authentication. The client can surface an auth command to the user, but no data tools are exposed until auth completes. - **`pending`**: reconnection in progress. Returns empty tools until the reconnection attempt resolves. - **`disabled`**: manually disabled by user or admin. Completely silent: no errors, no auth prompts. The `needs-auth` state has a caching concern: see Production Considerations item 3. 
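The caching concern for `needs-auth` can be sketched as a short-TTL cache that skips re-probing servers that recently failed auth. All names are illustrative; the 15-minute TTL follows the default suggested in Production Considerations, and a real implementation would also serialize writes to the cache file.

```python
import time

# Sketch: short-TTL cache for servers that recently failed auth, so
# startup doesn't hammer endpoints that will fail again immediately.
AUTH_FAILURE_TTL = 15 * 60  # seconds; default suggested in the text

needs_auth_cache: dict = {}  # server name -> failure timestamp

def record_auth_failure(server: str, now=None) -> None:
    needs_auth_cache[server] = time.time() if now is None else now

def should_probe(server: str, now=None) -> bool:
    """Skip re-probing inside the TTL; probe again once it expires,
    so legitimate re-authentication is still picked up."""
    now = time.time() if now is None else now
    failed_at = needs_auth_cache.get(server)
    return failed_at is None or now - failed_at > AUTH_FAILURE_TTL

record_auth_failure("slack", now=1_000.0)
assert not should_probe("slack", now=1_000.0 + 60)      # inside TTL: skip
assert should_probe("slack", now=1_000.0 + 16 * 60)     # TTL expired: probe
assert should_probe("github", now=1_000.0)              # never failed: probe
```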
## Batched Startup

When an agent connects to many MCP servers at startup, concurrency limits matter. The critical insight is that **local servers and remote servers have fundamentally different resource profiles**:

```python
# Local servers (stdio/sdk): spawn child processes
# Too many concurrent spawns cause CPU/memory contention
# Safe default: 3 concurrent connections
process_batched(local_servers, connect_fn, batch_size=3)

# Remote servers (sse/http/ws): establish network connections
# These are just TCP handshakes, so much higher concurrency is safe
# Safe default: 20 concurrent connections
process_batched(remote_servers, connect_fn, batch_size=20)
```

Local servers spawn child processes. Creating 30 simultaneous processes stresses the operating system's scheduler and process table. Memory and CPU spike at startup. The batch size of 3 is conservative to avoid this contention. Remote servers are just network connections. 20 concurrent TCP handshakes are routine and well within normal operating parameters. Using a conservative local batch size for remote connections wastes startup time unnecessarily. Both batch sizes should be configurable via environment variables for environments with unusual constraints (for example, containerized deployments with strict process limits might need `batch_size=1` for local servers).

## Config Scope Hierarchy

MCP servers are configured at six scopes, each with different reach and precedence.
From highest to lowest:

| Scope | Source | Notes |
|-------|--------|-------|
| Enterprise | Managed config (IT-controlled) | Exclusive control: when present, cloud-provided servers are not loaded |
| Local | Machine-specific settings | Per-machine overrides |
| User | User-global settings | Applies across all projects |
| Project | Project root config | Checked into the repository |
| Dynamic | Added at runtime | Via commands like `/mcp add` |
| Cloud | Fetched from provider | Lowest precedence |

**Enterprise exclusivity:** when an enterprise configuration exists, cloud-provided servers are never loaded. Enterprise has exclusive control over which external services the agent can access. This is a security boundary. It prevents users from sidestepping IT-approved tool sets by adding unapproved cloud servers.

**Deduplication by URL signature:** two servers configured at different scopes that point to the same endpoint are deduplicated. The deduplication key is the URL signature, not the server name, because names don't collide across scopes (each scope is independent) but two servers pointing to the same Slack integration absolutely would.

## Production Considerations

**1. SSE streams must skip the request timeout.** SSE connections are long-lived. The GET request stays open for the duration of the event stream. Applying a standard 60-second timeout kills the stream silently. The client receives an error that looks like a network timeout, not like a protocol issue. The fix: any timeout middleware must explicitly skip GET requests while still applying timeouts to POST requests. This is a common implementation mistake when adding timeouts globally, because "add a 60s timeout to all HTTP requests" sounds like a safe default.

**2. Batch concurrency must be split by transport type.** Local (stdio) servers spawn processes. Too many concurrent spawns cause operating system resource contention. Remote servers are just network connections.
A single batch size for both is either too aggressive for local servers or too conservative for remote ones. Production defaults that work across a wide range of environments: 3 concurrent for local, 20 concurrent for remote. Make both configurable.

**3. Cache the needs-auth state to avoid startup hammering.** Without a cache, every startup re-probes all servers that failed auth in the previous session, generating a wave of network requests to servers that will fail again immediately. A short-TTL cache (15 minutes is a reasonable default) avoids repeated failures while picking up legitimate re-authentication. Serialize writes to the cache file to prevent race conditions when multiple connections complete auth simultaneously.

**4. Truncate tool descriptions to prevent context poisoning.** MCP servers generated from OpenAPI specifications commonly produce tool descriptions of 15-60KB. Without a cap (2048 characters is a practical default), a single server can exhaust the agent's context budget on every turn. The system prompt grows by 60KB, leaving less space for conversation history and tool results. This failure mode is invisible: the agent doesn't error, it just produces increasingly poor results as context pressure mounts. Truncation must apply to both individual tool descriptions and the server-level instructions string.

**5. Session expiry requires full cache invalidation.** For HTTP-transport servers, sessions can expire server-side. The expiry signature is a specific error: HTTP 404 with a JSON-RPC error code indicating session not found. When this happens, clearing the connection cache alone is not sufficient. **All fetch caches** (tool lists, resource lists, server commands) from the expired session must be cleared. Without full invalidation, a reconnected client serves stale tool lists from the old session. The tool names in those lists may no longer exist on the server, causing every tool dispatch to fail.

**6. Memoization must be cleared on reconnect.** On any connection close and reopen, all memoized fetch results become stale. A reconnected server may have different tools, different resources, or different commands than the previous session, especially if the server was updated between sessions. Clearing only the connection object while retaining cached tool lists means the agent dispatches requests to tools that the new session doesn't expose.

## Best Practices

- **Do** use the tool bridge pattern. Construct Tool objects that are structurally identical to built-in tools so the agent loop dispatch path stays uniform.
- **Do** namespace MCP tool names (`mcp__{server}__{tool}`) to prevent collisions and make tool ownership traceable in logs.
- **Do** return empty tool lists for non-connected servers. The agent loop should never see connection errors, just fewer available tools.
- **Do** split batch concurrency by transport type. Local process spawning and remote network connections have fundamentally different resource profiles.
- **Do** truncate tool descriptions before registering. Set a hard cap (2048 chars) to prevent context budget exhaustion.
- **Don't** apply uniform timeouts to all HTTP requests. SSE GET streams are long-lived and must be excluded from standard timeout wrappers.
- **Don't** skip tool description truncation. A single OpenAPI-generated server can add 60KB to every turn's context.
- **Don't** retain memoized tool lists across reconnections. Every reconnection must start with fresh tool and resource discovery.
- **Don't** implement a single availability check for auth state. Use the `needs-auth` connection state to distinguish "not connected yet" from "requires auth" from "failed with an error."

## Related

- [Tool System](/docs/tool-system): The tool dispatch system that MCP tools integrate into. MCP tools use the same concurrency classes, permission checks, and dispatch paths as built-in tools.
- [Command and Plugin Systems](/docs/command-and-plugin-systems): MCP servers can contribute commands to the agent's command registry through the `prompt` command type.
- [Safety and Permissions](/docs/safety-and-permissions): MCP tool annotations (`destructive_hint`, `read_only_hint`) feed directly into the agent's permission system.
- [Hooks and Extensions](/docs/hooks-and-extensions): MCP tools are hookable via the same `PreToolUse` and `PostToolUse` events as built-in tools. Hooks can intercept, modify, or block MCP tool calls using the same condition syntax.
- [Pattern Index](/docs/pattern-index): All patterns from this page in one searchable list, with context tags and links back to the originating section.
- [Glossary](/docs/glossary): Definitions for domain terms used on this page.

---

# Observability and Debugging

Source: https://claudepedia.dev/docs/observability-and-debugging
Section: Patterns / advanced-patterns

Three independent observability layers for agent systems: structured event logging, per-model cost tracking, and session tracing with span hierarchy, each answering a different debugging question.

An agent failure is harder to diagnose than a typical service failure. The timeline is long, spanning dozens of LLM calls, tool executions, and permission checks. The state is large, encompassing conversation history, tool outputs, and compacted memory summaries. The cost varies: a single debugging session can cost more than a day's worth of routine usage. And the failure mode is often silent: the agent produces a plausible-looking output that is quietly wrong.

You need three distinct lenses to see into a running agent system, and each lens answers a different question:

- **"What happened?"**: Structured event logging. A discrete timeline of named events (tool calls, LLM requests, permission grants, session starts) that you can query after the fact or stream in real time.
- **"What did it cost?"**: Cost tracking per model, per session, and across sessions. Broken down by token type so you can see whether prompt caching is working.
- **"How long did it take and where?"**: Session tracing. A span hierarchy that maps every operation to its duration, parent, and cause, from the user prompt down to individual tool calls and permission checks.

These three systems are independent. You can enable or disable each one without affecting the others. In a full production deployment, all three run simultaneously and complement each other: the event log tells you *what* happened, cost tracking tells you *what it cost*, and tracing tells you *where time went*.

## Layer 1: Structured Event Logging

The foundational problem with event logging in agent systems is startup ordering: events are produced before the logging sink is ready. The agent starts processing before initialization is complete. If you drop events during this window, your logs are unreliable from the first line.

The **sink queue pattern** solves this:

```python
# Events are logged at any time, before the sink exists
log_event("startup_began", { mode: 1 })        # goes to in-memory queue
log_event("config_loaded", { success: True })  # still queued

# Later: sink attaches during initialization
attach_analytics_sink(sink)  # drains the queue via microtask scheduling

# After attachment: events route directly to the sink
log_event("session_started", { duration_ms: 250 })  # direct
```

The queue is a FIFO buffer that accumulates events before the sink is ready. When the sink attaches, it drains the queue via a microtask (not a synchronous loop) to avoid blocking the main execution path. After draining, all subsequent events go directly to the sink.

Sink attachment is **idempotent**: the first call attaches and wins, subsequent calls are silently no-ops. This prevents double-routing if initialization code is called more than once, but it also means the order of initialization matters.
If a test environment resets state between test cases, it must explicitly detach and reattach the sink, not just re-call the attach function.

**Metadata type restriction** is the other key design decision: event metadata accepts only `boolean`, `number`, and `undefined` values, not strings.

```python
# Correct: safe metadata, numbers and booleans only
log_event("tool_called", {
    duration_ms: 450,
    success: True,
    retry_count: 0,
    cache_hit: False,
})

# What the type system prevents:
# log_event("tool_called", { file_path: "/home/user/secret.txt" })
# log_event("request_done", { user_prompt: "how do I..." })
```

This is not a limitation. It is a deliberate safety contract. Strings can contain file paths, user prompts, code snippets, and other PII. By restricting metadata to numbers and booleans, accidental PII leakage into the event log becomes a compile-time error rather than a runtime incident. When a string genuinely needs to be logged, it must go through a separate verified-metadata path that carries explicit acknowledgment that the string has been reviewed for PII, a friction point that makes leakage a deliberate, auditable act rather than an accident.

**Two separate logging systems run simultaneously** and they must be independent:

- **First-party logging** routes to internal analytics infrastructure. Batched to a backend endpoint, enriched with session, environment, and deployment metadata. Subject to internal retention policies.
- **Third-party logging** routes to customer-configured telemetry backends using standard export formats. Receives only events the customer has opted into. Uses a separate logger provider instance.

The strict separation matters: internal analytics events must not appear in customer telemetry pipelines, and customer events must not leak into internal analytics.
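The isolation property can be sketched with hypothetical provider and exporter classes (not a real telemetry SDK): each provider holds a reference to exactly one exporter, so there is no code path by which an event emitted on one provider can reach the other's destination.

```python
class Exporter:
    """Hypothetical exporter that records events for one destination."""
    def __init__(self, destination: str):
        self.destination = destination
        self.exported: list[dict] = []

    def export(self, event: dict) -> None:
        self.exported.append(event)

class LoggerProvider:
    """Hypothetical provider bound to exactly one exporter at construction.

    Because the first-party and third-party providers are separate
    instances, neither holds a reference to the other's exporter, and
    cross-routing is structurally impossible rather than merely filtered.
    """
    def __init__(self, exporter: Exporter):
        self._exporter = exporter

    def emit(self, event: dict) -> None:
        self._exporter.export(event)

internal_exporter = Exporter("internal-analytics")
customer_exporter = Exporter("customer-otlp")

first_party = LoggerProvider(internal_exporter)
third_party = LoggerProvider(customer_exporter)

# Internal analytics events reach only the internal backend:
first_party.emit({"name": "session_started", "duration_ms": 250})
# Customer-opted-in events reach only the customer backend:
third_party.emit({"name": "tool_called", "success": True})
```

Contrast this with a single shared provider fanning out to two exporters behind a filter: there, one misconfigured predicate routes internal events to the customer endpoint.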
Using separate provider instances (rather than a shared provider with different destinations) is the only reliable way to enforce this boundary at the type level, not the configuration level.

**Event sampling** allows high-frequency events to be logged at rates below 100%. A sample rate is attached to each event's metadata, so downstream analysis systems can correct for the sampling when computing aggregate statistics. This prevents high-volume tool-call events from flooding the event store while still providing statistically accurate aggregates.

## Layer 2: Cost Tracking

Cost tracking serves two needs: real-time budget awareness during a session, and post-session analysis to understand where money went. Meeting both needs requires tracking at three scopes simultaneously.

**Per-API-call scope**: every LLM API response carries token usage. Accumulate immediately:

```python
# Per API call: update all three scopes
add_to_session_cost(
    cost=calculate_usd_cost(model, usage),
    usage=usage,  # input_tokens, output_tokens, cache_read, cache_creation
    model=model,  # the model that processed this call
)
```

**Per-session scope**: in-memory aggregation across all API calls in the current session:

```python
# Retrieve breakdown at any point during the session
for model, usage in get_model_usage().items():
    total = usage.input_tokens + usage.output_tokens
    print(f"{model}: ${usage.cost_usd:.4f}, "
          f"{total} tokens ({usage.cache_read_tokens} from cache)")

# Total cost across all models
session_total = get_total_cost_usd()
```

**Cross-session persistence**: written to project config, keyed by session ID:

```python
# Before session ends: persist cost state
save_current_session_costs()  # writes to project config with session_id key

# On session resume: restore state for the matching session
restore_cost_state_for_session(session_id)
# Only restores if session_id matches, so unrelated sessions don't contaminate each other
```

The **four token types** must be tracked separately because their billing differs. Input tokens and output tokens are billed at standard rates. Cache-read tokens (served from a provider's prompt cache) are billed at a fraction of the input rate. Cache-creation tokens (written into the cache on first use) may carry an additional write charge. Aggregating all of these into a single "tokens used" counter makes it impossible to tell whether prompt caching is working. A low cache-read count means you're paying full price for prompts that should be cached.

Multiple API model strings that map to the same underlying model are summed by canonical model name. This means a session that uses `fast-model-v1` and `fast-model-v1-turbo` (different API identifiers, same billing family) shows their cost as a single line item under the canonical name, which is what you want for budget tracking.

## Layer 3: Session Tracing

Session tracing maps the agent's activity to a span hierarchy, a tree of timed operations with parent-child relationships. The hierarchy mirrors the agent loop lifecycle:

```
interaction (user prompt to agent response)
├── llm_request (one LLM API call)
├── llm_request (another LLM API call if the first triggered more)
├── tool (one tool execution)
│   └── blocked_on_user (awaiting permission grant)
└── tool (another tool execution)
    └── hook (beta tracing only: hook spans are a separate opt-in)
```

The key implementation insight is **context-local propagation**: spans don't need to be passed explicitly through every function call.
An async-local storage context holds the current span, and any code that runs in that context automatically knows its parent:

```python
# Interaction span: root of the trace
interaction_span = start_interaction_span(user_prompt)
# Context-local storage now holds this span as the active context

# LLM request: auto-parented to interaction via context-local storage
llm_span = start_llm_request_span(model="fast-model")
response = await query_llm(messages)
end_llm_request_span(llm_span, metadata={
    input_tokens: response.usage.input_tokens,
    output_tokens: response.usage.output_tokens,
    success: True,
})
# Note: we pass llm_span explicitly to end it. See Production Considerations.

# Tool span: also auto-parented to interaction
tool_span = start_tool_span("bash", {"command_type": "read"})
# ... tool executes ...
end_tool_span(tool_result, result_tokens=42)

# Interaction ends: all child spans complete before this
end_interaction_span()
```

Context-local propagation means code deep in the call stack can start and end its own spans without the caller needing to thread a context parameter through every layer. The tracing system works across async boundaries. A span started in an async task is the parent of spans started in any code that awaits in that task's context.

**Three tracing backends run independently** and can all be active simultaneously:

| Backend | What It Adds | Use Case |
|---------|-------------|----------|
| Standard enhanced telemetry | Core span hierarchy (interaction, llm_request, tool, blocked_on_user) | Production observability, dashboards |
| Beta tracing | Additional attributes on existing spans and hook spans | Development debugging, hook behavior analysis |
| Performance tracing | Separate trace format for profiling tools | Performance profiling, flamegraph analysis |

Hook spans appear only in beta tracing, not in standard enhanced telemetry.
Performance tracing emits to a separate trace file optimized for profiling tools, not to the same OTLP endpoint as standard telemetry.

**Orphan span cleanup** is a non-obvious requirement. In production, aborted LLM streams and uncaught exceptions leave spans open. The code that would call `end_span()` never runs. Without cleanup, the active span collection grows indefinitely over the lifetime of a long-running agent process.

The solution is a background cleanup interval that uses weak references:

```python
# Active spans stored as weak references
active_spans: dict[span_id, WeakRef[SpanContext]] = {}

# Background cleanup: runs every 60 seconds
function cleanup_orphan_spans():
    cutoff = now() - SPAN_TTL_MS  # 30 minutes
    for span_id, ref in list(active_spans.items()):
        span = ref.deref()
        if span is None:  # GC collected it, already gone
            del active_spans[span_id]
        elif span.started_at < cutoff:  # alive but open too long
            span.end(status="abandoned")
            del active_spans[span_id]
```

The weak reference is the key: when context-local storage clears (task ends, context exits) and no other code holds a reference to the SpanContext, the GC collects it. The next cleanup pass finds a stale WeakRef and removes the entry. Spans that are still alive after 30 minutes are explicitly ended with an "abandoned" status. Without the TTL, a single crashed agent turn can leak an open span that stays in memory for the lifetime of the process.

## Debugging Agent Failures

This is the practical section: how to use the three layers to diagnose specific failure types.

**Cost spikes:** Go to Layer 2 first. The per-model breakdown shows which model and which operation type is driving the cost. Check the ratio of `cache_read_tokens` to `input_tokens`. A low cache-read ratio means prompt caching is ineffective. Either the system prompt is changing too frequently between calls (invalidating the cache), or the caching configuration is wrong.
This is one of the most common silent cost problems: everything works, it just costs 10x more than it should.

**Slow tools:** Go to Layer 3. Tool spans show exact execution duration from the point the tool was dispatched to the point its result was returned. If a tool span is slow but the tool itself is fast, look for `blocked_on_user` sub-spans. They indicate permission prompts that required human approval. The bottleneck isn't the tool. It's the permission gate. Long permission wait times are invisible without tracing because the tool reports success with correct output. Only the duration reveals the issue.

**Permission failures:** Layer 3 `blocked_on_user` spans correlate with the permission decision. High counts of `blocked_on_user` spans that resolve quickly (auto-approved or already-cached) are normal. A spike in spans where `blocked_on_user` duration is long means the system is waiting for human approval on operations that should be pre-approved. Cross-reference with the event log: if the permission system is emitting denial events, the classifier may be over-triggering on safe operations.

**Hook failures:** The Layer 1 event log captures hook outcomes. Non-blocking hook errors are logged to the session transcript, but the transcript is long, and a non-blocking error mid-session is easy to overlook. If you suspect hook misbehavior, filter the event log for hook-outcome events and look for `non_blocking_error` outcomes. Agent hooks that silently cancel show as `outcome: "cancelled"` with no associated error message. The only way to detect them is through the event log or the tracing layer.

**Missing events:** If events appear inconsistently in the log (some sessions have them, some don't), check sink attachment timing. The sink queue buffers events before the sink attaches, but the queue has a maximum capacity. If the agent produces a large burst of events during startup before the sink is ready, early events may be dropped.
Fix: ensure the sink attaches as early as possible in the initialization sequence, before any event-producing code runs.

**Incorrect cost attribution in parallel systems:** In systems that run multiple LLM requests concurrently, always pass the explicit span object returned by `start_llm_request_span()` to `end_llm_request_span()`. The tracing system has a fallback that finds "the most recently started LLM request span" when no span is provided, but in parallel execution, this may be the wrong span. Token counts can be attributed to the wrong model, making cost breakdowns appear correct while being silently wrong.

## Privacy and Redaction

Prompt content is **redacted by default**. The agent collects user prompts, system prompts, and tool results, all of which may contain PII. By default, none of this content flows into the tracing backend or the event log. Only metadata (durations, token counts, model names, success/failure indicators) is emitted.

Prompt content logging is opt-in via an environment variable. When enabled, user prompt text is included in the span attributes for the `interaction` span, and tool inputs/outputs appear in the tool span attributes. This opt-in is session-level: enable it for development debugging sessions, disable it for production.

The **verified-metadata contract** extends this principle to structured metadata. When a developer needs to log a string value (for example, a sanitized version of a file path or a hash of a prompt), they must go through a separate verified-metadata path. The type system marks these values with an explicit acknowledgment that the developer has reviewed the value for PII. This isn't bureaucratic overhead. It's a mechanism that makes PII exposure a deliberate, reviewable act rather than a typo.

The practical result: in a default configuration, you can share your event logs and traces freely. They contain no user content.
This matters in regulated industries (healthcare, finance, legal) where sharing operational logs with external monitoring vendors requires data review processes. When privacy is the default, you start from a safe baseline and opt in deliberately.

## Production Considerations

**1. Orphan span TTL prevents indefinite memory growth.** Without explicit TTL cleanup, every aborted LLM stream or uncaught exception leaves an open span. In a long-running agent process that handles hundreds of sessions, this accumulates. The 30-minute TTL with background cleanup is the minimum safe configuration. For agents with very long turns (complex multi-step tasks that run for hours), extend the TTL proportionally. Monitor the count of spans ended with `abandoned` status as a health metric. A rising count means something is failing to close spans normally.

**2. Parallel LLM requests require explicit span passing.** When multiple LLM requests run concurrently (common in warmup scenarios, parallel tool evaluation, or multi-agent coordination), the legacy "find the most recent llm_request span" fallback is unreliable. Always pass the specific span returned by `start_llm_request_span()` to `end_llm_request_span()`. The span is returned explicitly for this reason. Systems that skip this step will see correct token totals but incorrect per-model attribution in parallel-execution scenarios.

**3. Analytics sink attachment is idempotent. First attachment wins.** If your initialization code is refactored and the analytics sink gets attached in two different code paths, the second attachment silently no-ops. From the caller's perspective, the sink attached successfully, but events from any code that expected the second sink are silently routed to the first sink. In test environments that reset global state between test cases, explicit teardown (detach, then reattach) is required. Simply calling attach again will not reroute events.

**4. The event queue has a bounded capacity.** The in-memory event queue that buffers events before the sink is ready has a maximum capacity (typically on the order of a few thousand events). During a particularly busy startup (many plugins loading, many configuration events firing), the queue can fill before the sink attaches. Events beyond the capacity are dropped silently. The fix is always the same: attach the sink earlier in the initialization sequence. If startup necessarily produces many events before a sink can attach, increase the queue capacity for your workload.

**5. Dual logging systems use separate provider instances.** Running the first-party and third-party loggers on separate provider instances is not redundancy. It is isolation. A single shared provider with two exporters could route an internal analytics event to a customer OTLP endpoint if the filtering configuration has a bug. Separate instances make this impossible at the architectural level: the first-party provider cannot accidentally reach the third-party exporter because it doesn't hold a reference to it.

## Best Practices

- **Do** track cost per-model with the four token types separated. Aggregating all tokens into one counter hides prompt caching effectiveness, a critical optimization signal in production agent systems.
- **Do** pass explicit span objects to span-ending calls. Never rely on the "most recent span" fallback in concurrent systems. The explicit span is returned for a reason.
- **Do** use the verified-metadata contract for any string that might contain PII. The friction is the point. Make PII logging deliberate, not accidental.
- **Don't** log prompt content by default. Default-off is the only safe default for user data. Opt in explicitly for development sessions, and ensure opt-in is per-session, not global.
- **Don't** rely on GC alone for span cleanup. Long-running agents need a background TTL cleanup interval. Weak references plus a periodic sweep is the production pattern.
- **Do** attach the analytics sink exactly once, as early as possible in the initialization path, before any event-producing code runs. If you have two initialization code paths, pick one canonical path and route both through it.
- **Do** monitor the `blocked_on_user` span duration distribution. Unexpectedly long values mean users are waiting at permission prompts, a performance problem that looks like tool latency without tracing.
- **Don't** aggregate cost by session without also tracking by model. "This session cost $0.50" is less useful than "this session cost $0.45 in the reasoning model and $0.05 in the fast model, and 80% of the fast model cost was cache misses."

## Related

- [Agent Loop](/docs/agent-loop): The span hierarchy mirrors the agent loop lifecycle. Each loop iteration corresponds to one or more spans in the session trace.
- [Tool System](/docs/tool-system): Tool spans are the primary debugging surface for slow operations. The tool's concurrency class and dispatch behavior affect span timing.
- [Hooks and Extensions](/docs/hooks-and-extensions): Hook spans appear in beta tracing. Hook outcomes are logged as first-class events. Observability and hooks compose to give complete visibility into hook behavior and timing.
- [Error Recovery](/docs/error-recovery): The event log feeds the error recovery decision tree: logged hook failures, tool errors, and permission denials are the raw signals that error recovery classifies as retryable or permanent.
- [Streaming and Events](/docs/streaming-and-events): The streaming event pipeline is the primary data source for observability. Event logging, cost tracking, and session tracing all consume the same typed event stream that the streaming system produces.
- [Pattern Index](/docs/pattern-index): All patterns from this page in one searchable list, with context tags and links back to the originating section.
- [Glossary](/docs/glossary): Definitions for domain terms used on this page.
---

# The Advisor Strategy

Source: https://claudepedia.dev/docs/advisor-strategy
Section: Patterns / advanced-patterns

How to boost agentic task performance without running a frontier model end-to-end: pair a fast executor (Sonnet/Haiku) with an Opus advisor that intervenes only at decision points.

Running a frontier model on every step of an agentic task is expensive. Running a fast model on every step is cheap but leaves hard decisions underserved. The advisor strategy threads this needle: a capable executor model handles the task end-to-end, and a frontier advisor model enters only at the moments that require it.

The pattern inverts the usual hierarchical instinct. In a traditional multi-agent setup, a coordinator delegates to workers. Here, the executor handles everything directly — tool calls, result processing, iteration — and escalates to the advisor only when it hits a decision it cannot confidently resolve on its own. The advisor provides guidance. It never calls tools. It never generates user-facing output. When the consultation is complete, the executor continues with the advisor's guidance incorporated.

**The key insight**: frontier intelligence is most valuable at decision forks, not on routine steps. Most of what an agent does — reading files, running searches, formatting outputs — doesn't require the best model available. By targeting the advisor only at the moments that warrant it, you get frontier-level accuracy at a fraction of the cost of running a frontier model on every turn.

## Architecture

```mermaid
flowchart TD
    E["Executor\nSonnet\nRuns every turn"] -->|Tool call| A["Advisor\nOpus\nOn-demand"]
    A -.->|Sends advice| E
    E -->|Read / write| SC["Shared context\nConversation, tools, history"]
```

The entire flow happens within a single API call. No extra round-trips, no orchestration layer, no additional infrastructure. The handoff to the advisor is internal to the model invocation.
## Declaring the Advisor Tool

The advisor is declared as a tool in the API request body using the identifier `advisor_20260301`:

```python
response = llm.create_message(
    model=EXECUTOR_MODEL,  # fast, cost-efficient executor (e.g. Sonnet or Haiku)
    max_tokens=8096,
    tools=[
        {
            "type": "advisor_20260301",  # declares the advisor capability
        },
        # ... your other task tools
    ],
    messages=[
        {"role": "user", "content": task}
    ]
)
```

The executor now has access to an advisor tool alongside its regular tools. When it encounters a decision it cannot resolve — an ambiguous error, a multi-path trade-off, a high-stakes choice — it can invoke the advisor tool to consult the frontier model before proceeding.

## The max_uses Cap

A `max_uses` parameter limits how many times the advisor can be invoked per task:

```python
response = llm.create_message(
    model=EXECUTOR_MODEL,
    max_tokens=8096,
    tools=[
        {
            "type": "advisor_20260301",
            "max_uses": 3,  # advisor can be consulted at most 3 times per task
        },
        # ... other tools
    ],
    messages=[{"role": "user", "content": task}]
)
```

The `max_uses` cap serves two purposes:

**Cost control.** Advisor turns bill at frontier rates. Without a cap, a confused executor could consult the advisor on every step, eliminating the cost benefit. A cap forces the executor to reserve advisor invocations for genuine decision points.

**Behavioral framing.** Knowing the budget is limited, the executor treats the advisor as a scarce resource. It attempts to resolve ambiguities on its own before escalating. This produces more capable executor behavior than a setup where it can freely offload any uncertainty.

Typical advisor responses are short — 400–700 tokens per consultation — keeping advisor costs predictable even when the cap is reached.

## Token Billing

The API reports executor and advisor tokens separately.
This transparency is deliberate:

```python
# Token usage is reported separately per model tier
usage = response.usage
log(f"Executor ({EXECUTOR_MODEL}) tokens: "
    f"{usage.executor_input_tokens} in / {usage.executor_output_tokens} out")
log(f"Advisor ({ADVISOR_MODEL}) tokens: "
    f"{usage.advisor_input_tokens} in / {usage.advisor_output_tokens} out")

# Executor turns bill at executor rates; advisor turns bill at frontier rates
# Track both independently to validate cost assumptions per task type
```

Executor turns bill at executor rates. Advisor turns bill at frontier rates. There is no blended rate. You can track exactly what the advisor cost contributed to a task and optimize the `max_uses` cap accordingly.

## Performance Benchmarks

The advisor strategy has been benchmarked on three task categories:

**SWE-bench Multilingual (software engineering)**

| Configuration | Score | Cost per task |
|---------------|-------|---------------|
| Executor alone | baseline | baseline |
| Executor + frontier advisor | +2.7 pp | -11.9% |

The executor with a frontier advisor outperforms the executor alone while reducing cost per task. The advisor's targeted interventions resolve the ambiguous decisions that cause the executor to make suboptimal choices, without the overhead of running the frontier model on routine steps.

**BrowseComp (web research)**

| Configuration | Score | Cost vs mid-tier alone |
|---------------|-------|------------------------|
| Fast model alone | 19.7% | ~15% |
| Fast model + frontier advisor | 41.2% | ~15% |
| Mid-tier model alone | — | 100% (baseline) |

A fast model with a frontier advisor more than doubles the fast model's standalone score while costing 85% less than running a mid-tier model alone on the same tasks. This is the clearest demonstration of the strategy's cost-efficiency profile: the advisor elevates a fast, cheap model to well above what a mid-tier model can do on its own.
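Why the cost stays low is back-of-envelope arithmetic: the executor processes the bulk of the tokens at the cheap rate, and the advisor contributes only a few short consultations at the expensive rate. Every number below is an invented illustration (not published pricing or measured token volumes); substitute your own measurements:

```python
# Illustrative per-million-token rates -- assumptions, NOT real pricing
FAST_RATE = 1.0       # $/M tokens, fast executor
MID_RATE = 10.0       # $/M tokens, mid-tier model
FRONTIER_RATE = 25.0  # $/M tokens, frontier advisor

# Assumed per-task token volumes for a research-style task
executor_tokens = 200_000   # the executor runs every turn
advisor_tokens = 3 * 2_000  # at most 3 consultations, short responses
mid_tier_tokens = 200_000   # mid-tier alone processes the same volume

# Fast executor + occasional frontier advisor
fast_plus_advisor = (executor_tokens * FAST_RATE
                     + advisor_tokens * FRONTIER_RATE) / 1_000_000
# Mid-tier model end-to-end
mid_tier_alone = mid_tier_tokens * MID_RATE / 1_000_000

print(f"fast + advisor: ${fast_plus_advisor:.2f} per task")
print(f"mid-tier alone: ${mid_tier_alone:.2f} per task")
print(f"ratio: {fast_plus_advisor / mid_tier_alone:.1%}")
```

Under these made-up numbers, the combined configuration lands under a fifth of the mid-tier cost, the same ballpark as the reported ~15% figure. The structure of the arithmetic is the point: the frontier term scales with consultation count and response length, not with task length.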
**Terminal-Bench**

Improvements are also observed on Terminal-Bench (terminal-based task completion), though the exact delta has not been publicly reported.

## When the Executor Consults the Advisor

The executor decides when to invoke the advisor. Common patterns:

**Ambiguous error diagnosis.** The test output suggests multiple possible root causes. The executor has tried one fix and it didn't work. Instead of trying all combinations, it consults the advisor: "I see a null pointer exception in line 42, but the stack trace also suggests a race condition. Which should I investigate first?"

**Multi-path architectural decisions.** The task could be solved by refactoring the existing module or by introducing a new abstraction. The executor can complete either path but doesn't know which the user prefers or which is more consistent with the codebase's conventions.

**High-stakes irreversible actions.** Before deleting files, dropping a database table, or making a network request to an external service, the executor escalates to confirm the decision. This is especially valuable when the executor has been given broad tool permissions and needs a second opinion before acting destructively.

**Novel problem domains.** The executor encounters a pattern it has low confidence reasoning about — a less common programming language, an unusual API error code, a domain-specific constraint. The advisor, with higher overall capability, can reason more reliably about novel inputs.

The advisor does not decide when to be consulted. The executor decides. This is the critical structural difference from a hierarchical multi-agent setup where the coordinator assigns tasks.

## Advisor Behavior

The advisor is constrained by design:

- **No tool calls.** The advisor cannot call tools. It can only read what the executor has already gathered and return guidance in text.
- **No user-facing output.** The advisor's response goes to the executor, not to the user.
The user sees only the executor's final answer. - **Short responses.** Advisor responses are guidance, not complete solutions. They steer the executor without taking over execution. This constraint is what keeps the advisor economical. A 500-token advisor response that resolves a decision fork costs far less than running a frontier model for an entire multi-turn agentic task. ## Comparison with Other Multi-Model Patterns **vs. Full frontier model execution.** Running a frontier model end-to-end gives peak intelligence on every turn but at peak cost. The advisor strategy achieves near-frontier accuracy on hard tasks at substantially lower cost by targeting the frontier model where it matters. **vs. Coordinator/worker multi-agent.** In the coordinator pattern, a top-level agent orchestrates multiple specialized workers. The advisor pattern has only one executor. There is no task decomposition, no parallel workers, and no synthesis step. The advisor supplements the executor's reasoning; it doesn't replace its execution role. **vs. Model routing.** A router selects a model per request at the task level. The advisor strategy selects per decision point within a single task execution. Routing is coarse-grained (whole-task); the advisor is fine-grained (per-decision). **vs. Chain-of-thought prompting.** CoT prompting asks the model to reason through steps before answering. The advisor strategy invokes a different, more capable model for specific reasoning steps. CoT improves the executor's own reasoning; the advisor introduces external, higher-quality reasoning. ## Production Considerations **1. Set max_uses based on task complexity, not conservatism.** A `max_uses` of 1 may be sufficient for moderately complex tasks (the executor solves most steps independently and escalates once on the hardest decision). Tasks with multiple high-stakes branch points may need 3–5 uses. 
Measure advisor usage across real tasks before setting a hard limit — if the executor frequently hits the cap before completing, the cap is too low for the task profile. **2. The advisor cannot recover from bad tool results.** The advisor sees only what the executor passes to it. If the executor has received malformed tool results or accumulated bad state in its context, the advisor can only reason about what it's given. Design your executor's escalation logic to pass sufficient context: the question being asked, the relevant prior steps, and the specific decision fork. **3. Advisor latency adds to end-to-end task duration.** Each advisor consultation adds an additional model call's worth of latency. For latency-sensitive tasks, profile the advisor invocation delay and consider whether `max_uses=1` (reserving the advisor for the single most critical decision) is a better trade-off than `max_uses=5`. **4. The executor's model choice matters.** A fast model with a frontier advisor is most cost-efficient for research and browsing tasks. A mid-tier model with a frontier advisor is most accurate for complex software engineering tasks. The right executor depends on task type. Don't default to mid-tier + advisor for everything — measure fast model + advisor on your task category first. **5. Track advisor invocation count per task in production.** If a task type rarely invokes the advisor (< 10% of tasks), the `max_uses` cap is effectively unused — consider whether you need the advisor for that task type at all. If a task type frequently hits the cap, the executor is over-reliant on escalation and you should examine whether the executor model, system prompt, or task framing needs improvement. ## Best Practices - **Do** use the advisor strategy when tasks have clear decision forks — ambiguous choices, high-stakes actions, or domains where the executor has low confidence. - **Do** set `max_uses` deliberately based on task complexity measurements, not arbitrarily. 
- **Do** monitor executor vs. advisor token spend per task to validate that the cost profile matches expectations. - **Don't** use the advisor strategy for tasks that are uniformly complex — if every step requires frontier-level reasoning, run the frontier model end-to-end. - **Don't** over-escalate: if the executor invokes the advisor for routine decisions, tighten the escalation criteria in the system prompt. - **Don't** treat the advisor as a fallback for a poorly-tuned executor. The advisor amplifies a capable executor. It doesn't compensate for fundamental executor weaknesses. ## Related - [Multi-Agent Coordination](/docs/multi-agent-coordination): The broader family of patterns for orchestrating multiple agents. The advisor strategy is a lightweight alternative for tasks that don't need full coordinator/worker decomposition. - [Tool System](/docs/tool-system): The advisor is declared as a tool via `advisor_20260301`. Understanding the tool declaration and dispatch model clarifies how the executor-advisor handoff works mechanically. - [Safety and Permissions](/docs/safety-and-permissions): The advisor pattern is particularly useful before high-stakes or destructive tool calls, where a second opinion from a frontier model is valuable before executing irreversible actions. - [Pattern Index](/docs/pattern-index): All patterns from this page in one searchable list, with context tags and links back to the originating section. - [Glossary](/docs/glossary): Definitions for domain terms used on this page. --- # Pattern Index Source: https://claudepedia.dev/docs/pattern-index Section: Reference Complete catalog of 84 named patterns across all ClaudePedia pages, searchable by domain, tags, and keyword. This index catalogs named patterns across ClaudePedia's 14 content pages. Each entry identifies the pattern, the page that explains it in depth, a one-sentence description, and tags for filtering. 
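Because every entry carries a page id and a set of tags, the index lends itself to programmatic filtering. A minimal sketch, assuming a hypothetical `PatternEntry` record and a hand-copied sample of entries rather than the full catalog:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PatternEntry:
    name: str
    page: str        # matches the page's `id` field
    tags: frozenset

# A few illustrative entries from the index -- not the full catalog.
INDEX = [
    PatternEntry("Messages List as Memory", "quickstart",
                 frozenset({"messages", "memory", "state", "beginner"})),
    PatternEntry("Termination via Absence of Tool Calls", "quickstart",
                 frozenset({"termination", "loop", "control-flow", "beginner"})),
    PatternEntry("Circuit Breaker", "error-recovery",
                 frozenset({"circuit-breaker", "failure-rate", "protection"})),
]

def find(index, tags=frozenset(), page=None):
    """Entries matching ALL requested tags and, optionally, a page id."""
    return [e for e in index
            if tags <= e.tags and (page is None or e.page == page)]

beginner = find(INDEX, tags={"beginner"})  # both quickstart patterns match
```

Subset matching (`tags <= e.tags`) means a query with several tags returns only entries carrying all of them.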
Pages are identified by their `id` field — match that against the filename in `v2/core/`, `v2/advanced/`, or `v2/quickstart.md`.

## Quickstart

| Pattern | Page | Description | Tags |
|---------|------|-------------|------|
| Messages List as Memory | quickstart | Using the conversation messages array as the agent's working memory — the model sees the full list on every turn, so appending tool results is how the agent learns. | messages, memory, state, beginner |
| Termination via Absence of Tool Calls | quickstart | Ending the agent loop when the model produces a response with no tool_calls — the agent doesn't announce completion, it simply stops requesting tools. | termination, loop, control-flow, beginner |
| Schema as Interface | quickstart | Treating the tool's metadata (name, description, parameter types) as the contract the model reasons about, while the function body remains invisible plumbing. | schema, tools, interface, beginner |

## Agent Loop

| Pattern | Page | Description | Tags |
|---------|------|-------------|------|
| Two-State Machine | agent-loop | The core agent loop alternates between two states: awaiting a model response and dispatching tool calls. | state-machine, loop, control-flow |
| Generator Pattern | agent-loop | Using an async generator as the loop body so callers can observe each turn without coupling to the loop's internals. | async-generator, streaming, composable |
| State Struct Pattern | agent-loop | Carrying all mutable loop state in a typed struct replaced wholesale at every continue site, with auditable continuation reasons. | state, immutability, auditability |
| Three-Path Abort | agent-loop | Handling abort signals at three distinct phases: mid-streaming (Path A), post-stream pre-dispatch (Path B), and mid-execution (Path C). | abort, cancellation, correctness |
| Synthetic Tool Result Emission | agent-loop | Emitting error `tool_result` messages for every outstanding `tool_use` before aborting, to keep conversation history valid for the next API call. | abort, tool-result, correctness |
| Diminishing-Returns Budget Check | agent-loop | Stopping token budget loops when three consecutive continuations each produce fewer than 500 new tokens, not just at a percentage threshold. | token-budget, termination, production |
| Tombstone Pattern | agent-loop | Marking orphaned partial assistant messages for targeted removal from history and rendering, rather than truncating surrounding context. | orphaned-messages, history, fallback |
| Reactive Compact Guard | agent-loop | A session-scoped boolean that prevents reactive compaction from triggering more than once per session, surviving stop hook re-entries. | compaction, session-state, guard |

## Tool System

| Pattern | Page | Description | Tags |
|---------|------|-------------|------|
| Typed Function With Metadata | tool-system | Defining a tool as a function body plus a schema, concurrency class, and behavioral flags — the metadata is the design, not the function body. | schema, metadata, design |
| Concurrency Classes | tool-system | Classifying tools as READ_ONLY, WRITE_EXCLUSIVE, or UNSAFE to determine whether parallel dispatch is safe for a given invocation. | concurrency, dispatch, safety |
| Partition-Then-Gather | tool-system | Splitting tool calls into consecutive safe/unsafe batches, running batches sequentially while parallelizing within each safe batch. | dispatch, concurrency, batching |
| Two-Phase Validation | tool-system | Running schema validation (shape/types) followed by semantic validation (business logic) as two distinct, always-both-running phases. | validation, schema, correctness |
| Fail-Closed Defaults | tool-system | Defaulting every unset safety flag to the most restrictive value, so missing declarations default to safe rather than permissive. | safety, defaults, correctness |
| Dynamic Tool Sets | tool-system | Using a `refresh_tools` callback to update the available tool list between turns while keeping it immutable within a single turn. | dynamic, refresh, lifecycle |
| Sibling Abort | tool-system | Using a child abort controller to cancel all concurrently running tools in a batch when one fails, without affecting the parent session. | abort, concurrency, batch |
| Result Size Offload | tool-system | Persisting oversized tool results to a temp file and sending the model a preview, preventing large outputs from consuming the context window. | context, offload, size-limit |

## Memory and Context

| Pattern | Page | Description | Tags |
|---------|------|-------------|------|
| Hierarchy of Forgetting | memory-and-context | A four-level memory model (in-context → summary → long-term → forgotten) where each level trades fidelity for space. | memory, hierarchy, levels |
| Compaction Pipeline | memory-and-context | Running context interventions in cost order: trim tool results → drop old messages → session memory compact → LLM summarize. | compaction, cost-order, pipeline |
| Autocompact Circuit Breaker | memory-and-context | Disabling compaction after N consecutive failures to prevent infinite API hammering on irrecoverably large contexts. | circuit-breaker, compaction, resilience |
| Forked-Agent Extraction | memory-and-context | Running a background sub-agent after each turn to extract facts, sharing the parent's prompt cache but with restricted tools and a hard turn budget. | extraction, sub-agent, background |
| Closed Taxonomy | memory-and-context | Limiting long-term memory to a fixed set of four types (user, feedback, project, reference) to prevent memory from becoming a junk drawer. | taxonomy, classification, memory |
| Extraction Cursor | memory-and-context | Tracking which messages have been processed by the extractor via a UUID cursor that advances only on successful runs. | cursor, at-least-once, extraction |

## Prompt Architecture

| Pattern | Page | Description | Tags |
|---------|------|-------------|------|
| Two-Zone Model | prompt-architecture | Splitting the system prompt into a static zone (identical for all users/sessions/turns) and a dynamic zone (per-session or per-turn content). | static-dynamic, caching, zones |
| Section Registry | prompt-architecture | Registering each prompt section with explicit cache intent (cached vs volatile) rather than concatenating strings directly. | registry, caching, sections |
| Cache-Intent Two-Function API | prompt-architecture | Using `register_cached_section` and `register_volatile_section` with a mandatory reason argument to make cache intent explicit and auditable. | caching, api-design, intent |
| Five-Level Priority Chain | prompt-architecture | Resolving the effective system prompt through five ordered levels: override → coordinator → agent → custom → default. | priority, prompt-assembly, modes |
| Append-Tail Pattern | prompt-architecture | Injecting content at the end of the assembled prompt outside the priority chain, without modifying any chain level or fragmenting cache keys. | injection, tail, flexibility |
| Numeric Calibration | prompt-architecture | Using explicit numeric constraints in the system prompt (max sentences, confidence thresholds) rather than vague adjectives that the model interpolates. | calibration, numeric, precision |

## Error Recovery

| Pattern | Page | Description | Tags |
|---------|------|-------------|------|
| Escalation Ladder | error-recovery | A four-rung failure response ordered by cost: retry (latency) → fallback (quality) → degrade (capability) → fail (task). | escalation, tiered, recovery |
| Circuit Breaker | error-recovery | Tracking failure rates and blocking calls to a failing service, preventing retry storms against dependencies that are known to be down. | circuit-breaker, failure-rate, protection |
| Retryability Classification | error-recovery | Inspecting each error's status and headers before entering the retry loop to avoid retrying unfixable errors or amplifying capacity events. | classification, retry, http-status |
| Tool Error Pipeline | error-recovery | Converting every tool execution failure into a `tool_result` message with `is_error: true` rather than an exception, keeping the conversation valid. | tool-error, pipeline, messages |
| Query-Source Partitioning | error-recovery | Routing foreground operations through full retry logic while failing background operations fast during capacity events, preventing amplification. | partitioning, foreground-background, retry |
| Adaptive Max-Tokens | error-recovery | Parsing exact token counts from a context overflow error and adjusting `max_tokens` for the retry rather than guessing or giving up. | context-overflow, tokens, adaptive |
| Persistent Retry Mode | error-recovery | Replacing fixed retry counts with a time-capped (6-hour) unlimited retry loop with heartbeat sleep-chunking for unattended automation sessions. | persistent, automation, unattended |

## Safety and Permissions

| Pattern | Page | Description | Tags |
|---------|------|-------------|------|
| Six-Source Permission Cascade | safety-and-permissions | Evaluating tool permissions through six ordered policy sources (policy → project → local → user → CLI → session) with first-match wins and fail-closed default. | cascade, policy, permissions |
| Bypass-Immune Checks | safety-and-permissions | Running scope bounds and critical checks before the cascade so they cannot be overridden by any policy source, mode, or rule. | bypass-immune, scope, safety |
| Five Permission Modes | safety-and-permissions | Providing a global semantic override on the cascade: default, plan (auto-deny writes), acceptEdits, bypassPermissions, and dontAsk (silent deny). | modes, global-override, permissions |
| Graduated Trust | safety-and-permissions | Assigning authority levels to instruction sources: system prompt > user turn > tool result > sub-agent output. | trust, hierarchy, authority |
| Denial Tracking | safety-and-permissions | Counting consecutive and total classifier denials and escalating to user dialog at thresholds, preventing silent infinite rejection loops. | denial-tracking, escalation, thresholds |
| Shadow Rule Detection | safety-and-permissions | Detecting at write time that a new allow rule can never fire because a broader deny or ask rule will always be checked first. | shadow-rules, detection, correctness |
| Racing Four Resolution Paths | safety-and-permissions | In interactive sessions, running hooks, classifier, bridge response, and channel relay concurrently so fast paths pre-empt slow ones before the dialog renders. | racing, concurrency, resolution |

## Multi-Agent Coordination

| Pattern | Page | Description | Tags |
|---------|------|-------------|------|
| Delegation Pattern | multi-agent-coordination | The coordinator decides WHAT, workers decide HOW — synthesis happens at the coordinator, never relay of raw worker output to the user. | delegation, synthesis, roles |
| Synthesis Over Relay | multi-agent-coordination | Making a dedicated LLM call to combine, resolve conflicts between, and integrate worker results into a single coherent answer. | synthesis, coordinator, llm-call |
| Tool Partitioning | multi-agent-coordination | Giving the coordinator only coordination tools (spawn, send, stop) and workers only domain tools, preventing coordinator bypass. | partitioning, tools, roles |
| Context Isolation | multi-agent-coordination | Starting every worker with a fresh message history so errors, confusion, and untrusted inputs are quarantined per worker. | isolation, context, security |
| File-Based Mailbox | multi-agent-coordination | Using a directory of atomic message files as the universal inter-agent communication channel across all executor backends. | mailbox, files, communication |
| Session Reconnection | multi-agent-coordination | Persisting team identity (agent name, team name, leader ID) in the session transcript so crashed workers rejoin without manual intervention. | reconnection, persistence, resilience |
| Three Executor Backends | multi-agent-coordination | Supporting in-process, multiplexer-based, and native terminal executors behind a single interface so the coordinator is backend-agnostic. | backends, executor, abstraction |

## Streaming and Events

| Pattern | Page | Description | Tags |
|---------|------|-------------|------|
| Discriminated Union Event Model | streaming-and-events | Defining the event contract between producer and consumer as a typed discriminated union (TextDelta, ToolDispatch, Complete, etc.). | discriminated-union, typed-events, contract |
| Producer-Consumer Pipeline | streaming-and-events | The agent loop as producer yielding typed events; consumers subscribing and handling event types they care about at their own pace. | producer-consumer, decoupling, pipeline |
| Bounded Buffer | streaming-and-events | Allowing the producer to run ahead by up to N events before blocking, balancing throughput with memory use and consumer crash-safety. | buffer, backpressure, bounded |
| Event Priority Scheduling | streaming-and-events | Dispatching discrete events (keystrokes) synchronously and batching continuous events (resize, scroll) to prevent input lag under load. | priority, scheduling, latency |
| Capture-Bubble Dispatch | streaming-and-events | Two-phase event routing in component trees: capture walks root-to-target (intercept), bubble walks target-to-root (react after). | capture-bubble, event-delegation, phases |
| Screen Diffing | streaming-and-events | Rendering to a screen buffer, diffing against the previous frame, and emitting only changed cells — not streaming text directly to the terminal. | screen-diff, rendering, terminal |

## Command and Plugin Systems

| Pattern | Page | Description | Tags |
|---------|------|-------------|------|
| Metadata-First Registration | command-and-plugin-systems | Declaring every command as a metadata object before any code is loaded; implementation deferred to invocation via dynamic imports. | metadata-first, lazy, registration |
| Three Command Types | command-and-plugin-systems | Classifying commands as local (function call), interactive (UI component), or prompt (conversation injection) via a discriminated union. | discriminated-union, command-types, dispatch |
| Lazy Loading | command-and-plugin-systems | Loading a command's module only when invoked, keeping startup cost constant regardless of registry size. | lazy-loading, startup, performance |
| Multi-Source Registry Merge | command-and-plugin-systems | Concatenating commands from built-in skills, plugin skills, plugin commands, workflow commands, and builtins — array order IS the priority. | multi-source, merge, priority |
| Availability vs Enabled Separation | command-and-plugin-systems | Keeping static availability (who can ever use this) separate from dynamic enabled state (is it on now) to support post-login auth refreshes. | availability, enabled, separation |

## Hooks and Extensions

| Pattern | Page | Description | Tags |
|---------|------|-------------|------|
| Four Execution Modes | hooks-and-extensions | Choosing hook execution as command, prompt, agent, or HTTP based on explicit cost-capability trade-offs. | execution-modes, cost, trade-offs |
| 27-Plus Lifecycle Events | hooks-and-extensions | Organizing extension points across six phases — session, per-turn, tool, memory, multi-agent, file system — covering the full operational surface. | lifecycle-events, phases, taxonomy |
| Condition Syntax | hooks-and-extensions | Using permission-rule pattern matching to gate hook execution at evaluation time, so non-matching hooks cost nothing to register. | conditions, pattern-matching, cost |
| Error Isolation and Aggregation | hooks-and-extensions | Capturing hook failures as four outcomes (success, blocking, non-blocking error, cancelled) so crashes never propagate to the main loop. | isolation, aggregation, resilience |
| Fire-and-Forget Async Hooks | hooks-and-extensions | Using `async: true` for audit and observability hooks so they add zero latency to the critical execution path. | async, fire-and-forget, observability |

## MCP Integration

| Pattern | Page | Description | Tags |
|---------|------|-------------|------|
| Tool Bridge Pattern | mcp-integration | Constructing standard Tool objects from MCP server capabilities at connection time so the dispatcher treats external tools identically to built-in tools. | tool-bridge, abstraction, integration |
| MCP Namespacing | mcp-integration | Naming bridged tools as `mcp__{server}__{tool}` to prevent cross-server collisions and make tool ownership traceable in logs. | namespacing, naming, traceability |
| Five Connection States | mcp-integration | Modeling server connections as connected, failed, needs-auth, pending, or disabled — all non-connected states return empty tool lists silently. | connection-states, state-machine, resilience |
| Batched Startup | mcp-integration | Using different concurrency limits for local servers (batch=3, process spawning) vs remote servers (batch=20, TCP connections) at startup. | batching, startup, concurrency |
| Tool Description Truncation | mcp-integration | Capping tool descriptions at 2048 characters to prevent OpenAPI-generated servers from exhausting the agent's context budget on every turn. | truncation, context-budget, protection |

## Observability and Debugging

| Pattern | Page | Description | Tags |
|---------|------|-------------|------|
| Three-Layer Observability | observability-and-debugging | Running structured event logging, cost tracking, and session tracing as three independent layers, each answering a different debugging question. | three-layers, independent, observability |
| Sink Queue Pattern | observability-and-debugging | Buffering events in an in-memory FIFO before the logging sink is ready, draining via microtask on attachment to prevent startup log loss. | sink-queue, startup-ordering, buffering |
| Metadata Type Restriction | observability-and-debugging | Accepting only boolean, number, and undefined in event metadata so accidental PII logging is a compile-time error rather than a runtime incident. | pii, type-restriction, safety |
| Four Token Types | observability-and-debugging | Tracking input, output, cache-read, and cache-creation tokens separately so prompt caching effectiveness is visible and measurable. | cost-tracking, tokens, cache |
| Context-Local Span Propagation | observability-and-debugging | Using async-local storage to carry the active span so deep call stacks can create child spans without threading a context parameter explicitly. | spans, async-local, propagation |
| Orphan Span Cleanup | observability-and-debugging | Using weak references and a background TTL sweep to end spans that were never closed due to aborted streams or unhandled exceptions. | orphan-spans, weak-references, cleanup |
| Privacy-Default Redaction | observability-and-debugging | Redacting all prompt content from traces and event logs by default, requiring explicit per-session opt-in for development debugging. | privacy, redaction, default-off |

## Advisor Strategy

| Pattern | Page | Description | Tags |
|---------|------|-------------|------|
| Executor-Advisor Split | advisor-strategy | Pairing a fast executor model (Sonnet/Haiku) with a frontier advisor (Opus) that intervenes only at hard decision points, targeting frontier intelligence where it has the most impact. | executor, advisor, multi-model, cost-optimization |
| max_uses Cap | advisor-strategy | Budgeting advisor invocations per task with a `max_uses` parameter so the executor treats the advisor as a scarce resource and self-resolves before escalating. | max-uses, cost-control, budget |
| Decision-Point Escalation | advisor-strategy | The executor consulting the advisor specifically at ambiguous errors, multi-path trade-offs, high-stakes irreversible actions, and novel domains — not on routine steps. | escalation, decision-forks, routing |
| Advisor Constraints | advisor-strategy | Restricting the advisor to no tool calls, no user-facing output, and short guidance-only responses so it supplements the executor without taking over execution. | constraints, guidance, scoping |

---

# Glossary

Source: https://claudepedia.dev/docs/glossary
Section: Reference

Definitions of every domain-specific term used across ClaudePedia, with cross-references to the page that explains each term in depth.

This glossary defines domain-specific terms used across ClaudePedia. General programming vocabulary (async, function, class) is intentionally excluded — every term here is either specific to agent system design or carries a specialized meaning in the agent context that differs from general use. Each entry links to the primary page that explains the term most thoroughly.

---

### Abort Signal

A cancellation token passed into the agent loop and all async operations it initiates (model calls, tool executions, hook invocations).
When the abort signal fires, outstanding operations must complete their protocol obligations — including emitting synthetic tool results — before exiting cleanly. See: [Agent Loop Architecture](/docs/agent-loop) ### Adaptive Max-Tokens A recovery technique for context overflow errors where the exact token counts are parsed from the API error response and `max_tokens` is set precisely for the retry, rather than guessing. Prevents repeated 400 errors while avoiding the cost of unnecessarily reducing output budget. See: [Error Recovery and Resilience](/docs/error-recovery) ### Agent Event Stream The sequence of typed events yielded by the agent loop: `RequestStart`, `TextDelta`, `ToolDispatch`, `ToolResult`, `Complete`, `ErrorEvent`. Distinct from the terminal input event system (keystrokes, resize). See: [Streaming and Events](/docs/streaming-and-events) ### Append-Tail A mechanism for injecting content at the end of the assembled system prompt regardless of which priority chain level won — used for memory correction hints, team policy additions, and per-session overrides without modifying any static section. See: [Prompt Architecture](/docs/prompt-architecture) ### Backpressure The pressure a slow consumer exerts on a fast producer in a streaming pipeline. Handled through no-buffer (blocking producer), bounded buffer (producer runs ahead up to N), or unbounded buffer (producer never blocks) strategies. See: [Streaming and Events](/docs/streaming-and-events) ### Behavioral Flags Metadata fields on a tool that declare cross-cutting concerns: `is_concurrency_safe`, `is_read_only`, `is_destructive`, `interrupt_behavior`, `requires_user_interaction`. Read by different subsystems (dispatcher, permission system, UI) without the tool implementation knowing about those systems. 
See: [Tool System Design](/docs/tool-system) ### Bypass-Immune Check A safety check that runs before the permission cascade and cannot be overridden by any policy source, permission mode, or rule configuration. Scope bounds checking is the canonical example. See: [Safety and Permissions](/docs/safety-and-permissions) ### Cache Fragmentation The exponential multiplication of prefix cache keys when session-variable content is interleaved in the static zone of the system prompt. With N variable bits in the static zone, 2^N possible cache keys exist. See: [Prompt Architecture](/docs/prompt-architecture) ### Capture-Bubble Dispatch A two-phase event routing model for component trees: the capture phase walks root-to-target (ancestors can intercept before the target sees the event), and the bubble phase walks target-to-root (ancestors react after the target handles it). See: [Streaming and Events](/docs/streaming-and-events) ### Circuit Breaker A state machine wrapper around a service call that tracks failure rates and blocks calls when the failure rate exceeds a threshold, preventing retry storms against a known-down service. Transitions through Closed → Open → Half-Open states. See: [Error Recovery and Resilience](/docs/error-recovery) ### Closed Taxonomy A fixed, finite set of allowed memory types (user, feedback, project, reference) that constrains what the extraction agent is allowed to create, preventing the memory store from becoming an unstructured junk drawer. See: [Memory and Context](/docs/memory-and-context) ### Command Type One of three discriminated union variants for commands: `local` (function call returning a result), `interactive` (renders a UI component), or `prompt` (injects content blocks into the conversation). See: [Command and Plugin Systems](/docs/command-and-plugin-systems) ### Compaction The process of reducing the context window size when the message history approaches the limit. 
Runs through a cost-ordered pipeline: trim tool results → drop old messages → session memory compact → LLM summarization. See: [Memory and Context](/docs/memory-and-context) ### Concurrency Class A classification of a tool's parallelism safety: `READ_ONLY` (safe for concurrent dispatch), `WRITE_EXCLUSIVE` (must run serially), or `UNSAFE` (must run serially and in isolation). Determined at dispatch time from the actual parsed arguments, not at registration time. See: [Tool System Design](/docs/tool-system) ### Context Window The fixed-size input buffer that holds everything the model can see in a single turn: conversation history, tool results, injected facts, system instructions. Agent memory management is the art of deciding what occupies this finite space. See: [Memory and Context](/docs/memory-and-context) ### Coordinator An agent in a multi-agent system that receives the user's task, breaks it into subtasks, delegates to workers, and synthesizes the results into a single coherent answer. The coordinator decides WHAT; workers decide HOW. See: [Multi-Agent Coordination](/docs/multi-agent-coordination) ### Cursor (extraction) A UUID identifying the last message successfully processed by the background fact extraction agent. Advances only on success, so failed runs reconsider the same messages on the next attempt. See: [Memory and Context](/docs/memory-and-context) ### Denial Tracking Counting consecutive and total classifier denials per session and escalating to a user dialog when thresholds are crossed, preventing silent infinite rejection loops in classifier-based permission systems. See: [Safety and Permissions](/docs/safety-and-permissions) ### Discriminated Union A type pattern where a tagged field (e.g., `type: "TextDelta"`) determines which variant the value is, enabling exhaustive pattern matching by consumers. Used for both events and transitions throughout the agent architecture. 
See: [Streaming and Events](/docs/streaming-and-events)

### Dynamic Zone

The portion of the system prompt that changes per-session or per-turn: session context, user memory, active tools, token budget. Content in this zone breaks prefix cache keys and must be updated each turn.

See: [Prompt Architecture](/docs/prompt-architecture)

### Escalation Ladder

A four-rung failure response in error recovery: retry the same operation (cost: latency) → fall back to an alternative implementation (cost: quality) → degrade by removing the capability (cost: lost feature) → fail entirely (cost: lost task).

See: [Error Recovery and Resilience](/docs/error-recovery)

### Fact Extraction

The process of identifying and persisting domain-specific facts from conversation messages to long-term storage. Run by the background extraction agent after each turn, gated on the mutual exclusion guard.

See: [Memory and Context](/docs/memory-and-context)

### Fail-Closed Default

The design principle that any unset safety-critical configuration value defaults to the most restrictive behavior. A missing `is_concurrency_safe` defaults to `false`; a missing `requires_permission` defaults to `true`.

See: [Tool System Design](/docs/tool-system)

### Graduated Trust

A hierarchy of instruction authority: system prompt (highest) > user turn > tool result > sub-agent output (lowest). An agent cannot grant itself elevated permissions; trust flows downward and cannot be escalated by lower-authority sources.

See: [Safety and Permissions](/docs/safety-and-permissions)

### Hierarchy of Forgetting

A four-level memory model where each level down trades fidelity for space: in-context (perfect fidelity) → summary (compressed digest) → long-term storage (extracted facts) → forgotten (discarded).

See: [Memory and Context](/docs/memory-and-context)

### Hook

A typed interceptor registered against a named lifecycle event.
Hooks are configured as data — the hook runner evaluates their conditions and dispatches to the appropriate execution mode without the main loop knowing the details.

See: [Hooks and Extension Points](/docs/hooks-and-extensions)

### Interrupt Behavior

A per-tool flag declaring what happens when a user submits a new message while the tool is running: `cancel` (stop immediately) or `block` (finish before processing the new message).

See: [Tool System Design](/docs/tool-system)

### Jitter

Random variation added to exponential backoff delays to prevent synchronized retry storms when many clients fail simultaneously. Standard formula: 25% random variation applied to the computed delay.

See: [Error Recovery and Resilience](/docs/error-recovery)

### Lazy Loading

Deferring the import of a command's implementation module until the command is invoked, keeping registry startup cost constant regardless of how many commands exist.

See: [Command and Plugin Systems](/docs/command-and-plugin-systems)

### Mailbox

A directory of atomic message files used as the inter-agent communication channel across all executor backends. Provides uniformity (same interface for in-process and separate-process workers) and disk-inspectable message history.

See: [Multi-Agent Coordination](/docs/multi-agent-coordination)

### Max-Turns

A hard iteration limit on the agent loop, serving as a correctness requirement and circuit breaker. Not a preference — exceeding it signals a task design problem and should surface as an error, not a silent return.

See: [Agent Loop Architecture](/docs/agent-loop)

### MCP (Model Context Protocol)

A protocol for integrating external services into an agent's tool system via a standard capability discovery API (`tools/list`). The agent-side client bridges MCP tools into structurally identical Tool objects, making external tools invisible to the dispatcher.
See: [MCP Integration](/docs/mcp-integration)

### Metadata-First Registration

A command registry pattern where each command is described as a metadata object before any code is loaded; implementation is deferred to invocation via lazy dynamic imports.

See: [Command and Plugin Systems](/docs/command-and-plugin-systems)

### Model Fallback

Switching to a different model after repeated capacity errors on the primary model. Distinct from a circuit breaker: circuit breakers block calls to a failing service; model fallback changes which model handles the same request.

See: [Error Recovery and Resilience](/docs/error-recovery)

### Partition-Then-Gather

The tool dispatch algorithm that splits a list of tool calls into consecutive batches of safe/unsafe tools, processes batches sequentially, and parallelizes execution within each safe batch.

See: [Tool System Design](/docs/tool-system)

### Permission Cascade

The six-source evaluation chain for permission decisions: policySettings → projectSettings → localSettings → userSettings → cliArg → session. The first matching rule wins; no match means DENY.

See: [Safety and Permissions](/docs/safety-and-permissions)

### Permission Mode

A global semantic override on the permission cascade: `default` (ask for unlisted tools), `plan` (auto-deny writes), `acceptEdits` (auto-approve file edits), `bypassPermissions` (auto-approve all), `dontAsk` (silently deny unlisted).

See: [Safety and Permissions](/docs/safety-and-permissions)

### Prompt Cache

A provider-side prefix cache that recognizes byte-identical prompt prefixes across API calls and serves them at reduced cost. Effective only when the static zone is identical across users, sessions, and turns.
See: [Prompt Architecture](/docs/prompt-architecture)

### Query-Source Partitioning

Classifying operations as foreground (user is waiting — retry on capacity errors) or background (user never sees results — fail fast on capacity errors) to prevent background retries from amplifying capacity events.

See: [Error Recovery and Resilience](/docs/error-recovery)

### Screen Diffing

The rendering model where a screen buffer is computed each frame, diffed against the previous frame, and only changed cells are emitted as terminal escape codes — preventing partial-render corruption and enabling frame-rate-controlled output.

See: [Streaming and Events](/docs/streaming-and-events)

### Section Registry

A prompt assembly system where each section is registered with an explicit cache intent (cached or volatile) rather than concatenated directly. Enables auditing of cache-breaking sections and enforces the static/dynamic boundary structurally.

See: [Prompt Architecture](/docs/prompt-architecture)

### Session Memory

Structured in-memory context that can serve as a zero-LLM-cost summary during compaction, replacing the need for a full LLM summarization call if session memory is available and non-empty.

See: [Memory and Context](/docs/memory-and-context)

### Shadow Rule

A permission rule that can never be reached because a broader deny or ask rule is checked first. Shadow rules are detected at write time (when rules are added) rather than at evaluation time.

See: [Safety and Permissions](/docs/safety-and-permissions)

### Sink Queue Pattern

An in-memory FIFO buffer that accumulates events before the logging sink is ready, drained via microtask on sink attachment. Prevents log loss during the window between agent startup and sink initialization.
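A minimal sketch of the sink queue idea, with assumed names; a production version would drain asynchronously (the microtask mentioned above) rather than synchronously inside `attach`:

```python
class SinkQueue:
    def __init__(self):
        self._buffer = []   # holds events until a sink exists
        self._sink = None

    def emit(self, event):
        if self._sink is None:
            self._buffer.append(event)  # sink not ready: buffer, don't drop
        else:
            self._sink(event)

    def attach(self, sink):
        self._sink = sink
        pending, self._buffer = self._buffer, []
        for event in pending:           # drain buffered events in arrival order
            sink(event)

queue = SinkQueue()
queue.emit({"event": "startup"})        # no sink yet: buffered
received = []
queue.attach(received.append)           # attaching drains the buffer
queue.emit({"event": "first_turn"})     # forwarded directly from now on
# received holds both events, in order
```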
See: [Observability and Debugging](/docs/observability-and-debugging)

### SSRF Protection

Server-Side Request Forgery protection for HTTP hooks: blocking requests to private IP ranges, cloud metadata endpoints, and CGNAT ranges, validated at DNS resolution time to prevent DNS rebinding attacks.

See: [Hooks and Extension Points](/docs/hooks-and-extensions)

### State Struct Pattern

Carrying all mutable agent loop state in a typed struct that is replaced wholesale at every continue site, with a typed `transition.reason` field that makes continuation causes auditable.

See: [Agent Loop Architecture](/docs/agent-loop)

### Static Zone

The portion of the system prompt that is identical for every user, session, and turn: identity, behavioral rules, tool descriptions, numeric calibration. Content in this zone can be prefix-cached.

See: [Prompt Architecture](/docs/prompt-architecture)

### Stop Hook

A hook registered on the `Stop` lifecycle event that can evaluate an agent response, inject blocking messages that cause the loop to re-enter, or prevent continuation entirely.

See: [Agent Loop Architecture](/docs/agent-loop)

### Synthesis (multi-agent)

The LLM call a coordinator makes after all workers return, combining partial results, resolving conflicts, filling gaps, and producing a single integrated answer. Distinguishes a coordinator from a router.

See: [Multi-Agent Coordination](/docs/multi-agent-coordination)

### Three-Handler Architecture

The three permission resolution paths selected based on execution context: interactive handler (races four paths concurrently), coordinator handler (sequential), and swarm worker handler (forwards to leader via mailbox).

See: [Safety and Permissions](/docs/safety-and-permissions)

### Tombstone

A marker placed on an orphaned partial assistant message that flags it for removal from history, UI rendering, and transcript serialization. Targeted removal that preserves surrounding context, unlike truncation.
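As a rough sketch (field names assumed for illustration), a tombstone is a flag checked at every consumption site rather than a deletion from the list:

```python
def tombstone(message: dict) -> dict:
    # Flag the orphaned partial message; nothing else in history moves.
    return {**message, "tombstoned": True}

def visible_history(messages: list) -> list:
    # History, UI rendering, and transcript serialization all filter on
    # the flag, so removal is targeted and surrounding context survives.
    return [m for m in messages if not m.get("tombstoned")]
```

Contrast with truncation, which would also discard the valid messages around the orphaned one.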
See: [Agent Loop Architecture](/docs/agent-loop)

### Tool Bridge

The pattern of constructing standard Tool objects from MCP server capabilities at connection time, making external tools structurally identical to built-in tools from the dispatcher's perspective.

See: [MCP Integration](/docs/mcp-integration)

### Tool Error Pipeline

The invariant that every tool execution failure yields a `tool_result` message with `is_error: true` rather than raising an exception, ensuring the message stream always has a valid response for every `tool_use`.

See: [Error Recovery and Resilience](/docs/error-recovery)

### Tool Partitioning

Giving the coordinator agent only coordination tools (spawn, send, stop) and workers only domain tools, preventing the coordinator from bypassing delegation and doing the work itself.

See: [Multi-Agent Coordination](/docs/multi-agent-coordination)

### Two-Phase Validation

Running schema validation (shape and type checking) followed by semantic validation (business logic) as two separate, always-running phases in the tool execution lifecycle.

See: [Tool System Design](/docs/tool-system)

### Two-State Machine

The agent loop's core model: the loop alternates between two states — awaiting a model response and dispatching tool calls. If the model returns tool calls, stay in the loop; if not, exit.

See: [Agent Loop Architecture](/docs/agent-loop)

### Two-Zone Model

The architectural split of the system prompt into a static zone (cacheable, identical across users/sessions/turns) and a dynamic zone (per-session or per-turn, cache-breaking).

See: [Prompt Architecture](/docs/prompt-architecture)

### Verified-Metadata Contract

A logging pattern that requires explicit acknowledgment when logging string values in event metadata, making PII logging a deliberate, reviewable act rather than an accidental side effect.
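One way to sketch this contract in Python; the function name and the `allow_strings` flag are illustrative, not the actual API:

```python
def log_event(name: str, metadata: dict, allow_strings: bool = False) -> dict:
    # Numbers, booleans, and other scalars are safe to log freely.
    # String values may carry PII, so logging them requires an explicit,
    # greppable acknowledgment at the call site.
    for key, value in metadata.items():
        if isinstance(value, str) and not allow_strings:
            raise ValueError(
                f"metadata field {key!r} is a string; pass "
                "allow_strings=True to acknowledge the PII risk"
            )
    return {"event": name, "metadata": metadata}

log_event("turn_complete", {"tokens": 412, "tool_calls": 3})     # fine
log_event("file_read", {"path": "/tmp/x"}, allow_strings=True)   # deliberate
```

The point is that the risky case cannot happen by accident: every string in metadata leaves a visible `allow_strings=True` for reviewers to question.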
See: [Observability and Debugging](/docs/observability-and-debugging)

### Volatile Section

A prompt section registered with a `reason` argument indicating why it cannot be cached. The verbose registration variant creates friction that prevents accidental cache fragmentation in team codebases.

See: [Prompt Architecture](/docs/prompt-architecture)

### Worker (multi-agent)

An agent in a multi-agent system that receives a narrow, well-defined subtask from the coordinator, executes it with isolated context and domain-specific tools, and returns results for synthesis.

See: [Multi-Agent Coordination](/docs/multi-agent-coordination)

---