Tool calls fail. Networks drop packets, APIs return 429s, services go down, code has bugs. An agent without a recovery strategy has two failure modes: it retries forever (running up costs and blocking the task indefinitely), or it crashes on the first error and returns nothing. Neither is acceptable in production.
The question isn't whether failures happen. They will. The question is what the system does next. The answer is an escalation ladder: four responses to failure, ordered by cost, applied in sequence from cheapest to most extreme. Understanding the ladder (what each rung handles, what it costs, and when to stop climbing) is the core mental model for building resilient agents.
But the ladder is not the whole story. A production error recovery system also needs to know which errors are worth retrying at all, who is waiting for the result, and what happens when the failure is inside a tool rather than in the API request. This page covers the escalation ladder, retryability classification, query-source partitioning, the tool error pipeline, and the production details that separate a resilient agent from one that just retries everything.
The Escalation Ladder
When a tool call fails, we have four options:
- Retry: same operation, same path, wait and try again. Cost: latency. Use for transient failures: network blips, rate limits, momentarily overloaded services.
- Fallback: different implementation of the same capability. Cost: reduced quality or slower path. Use when the primary path has permanently failed.
- Degrade: remove the capability from this session's available tools. Cost: lost feature. Use when the fallback has also failed and the task can still complete without this capability.
- Fail: stop entirely and return an error. Cost: lost task. Use when continuing would cause more harm than stopping, or when there is no way forward.
Here is the tiered recovery function in pseudocode:
```
function execute_with_recovery(operation, config):
    # Rung 1: retry (cost: latency)
    for attempt in range(config.max_retries):
        result = await try_operation(operation)
        if result.succeeded:
            return result
        if not result.is_retryable:
            break
        await backoff(attempt, config.base_delay_ms)

    # Rung 2: fallback (cost: reduced quality)
    if config.fallback is not None:
        result = await try_operation(config.fallback)
        if result.succeeded:
            return result

    # Rung 3: degrade (cost: lost capability)
    if config.is_optional:
        log_degradation(operation.name, "skipping for session")
        return DegradedResult(capability=operation.name)

    # Rung 4: fail (cost: lost task)
    raise RecoveryExhausted(operation=operation.name, last_error=result.error)
```

Each rung has concrete semantics worth naming:
Retry with backoff. Exponential backoff with jitter is the standard implementation, not linear delay and not a fixed interval. A fixed 1-second delay under load still overwhelms a struggling service. Jitter spreads the retry storm. The full jitter formula is in Production Considerations below. Check is_retryable before entering the retry loop. A 400 Bad Request is generally not retryable, while a 503 Service Unavailable is. The full classification logic is in the Retryability Classification section.
Fallback path. The fallback is a different implementation of the same capability: a smaller model, a slower API, a cached result, a heuristic approximation. The fallback's contract is: it produces something useful, but not as good as the primary path. Document that degradation explicitly. If the fallback is "pretend the tool succeeded", we've hidden the failure, not recovered from it.
Degrade gracefully. Degradation means the agent continues the task without this capability. This is only safe when the task is designed to tolerate partial tool sets. Before deploying an agent, think through which tools are essential (no degradation path) and which are optional (degradation is acceptable). Failing to make this distinction upfront means the degradation logic will be wrong in production.
Fail cleanly. A clean failure is better than a confused recovery. When raising at the bottom of the ladder, include the operation name, the last error, and the number of attempts made. The caller (a coordinator, a user, or a monitoring system) needs that information to decide what to do next.
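The ladder above can be sketched as runnable Python. This is an illustrative sketch, not a library API: `TryResult`, `DegradedResult`, and the parameter names are invented for the example.

```python
import random
import time
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class TryResult:
    succeeded: bool
    value: object = None
    error: Optional[str] = None
    is_retryable: bool = True

@dataclass
class DegradedResult:
    capability: str  # marker: this capability was skipped for the session

class RecoveryExhausted(Exception):
    pass

def execute_with_recovery(
    operation: Callable[[], TryResult],
    name: str,
    max_retries: int = 3,
    fallback: Optional[Callable[[], TryResult]] = None,
    is_optional: bool = False,
    base_delay_s: float = 0.5,
):
    result = TryResult(succeeded=False, error="not attempted")
    # Rung 1: retry with exponential backoff + jitter (cost: latency)
    for attempt in range(max_retries):
        result = operation()
        if result.succeeded:
            return result.value
        if not result.is_retryable:
            break  # skip straight down the ladder: retrying won't help
        delay = base_delay_s * 2 ** attempt
        time.sleep(delay + random.random() * 0.25 * delay)
    # Rung 2: fallback path (cost: reduced quality)
    if fallback is not None:
        result = fallback()
        if result.succeeded:
            return result.value
    # Rung 3: degrade (cost: lost capability)
    if is_optional:
        return DegradedResult(capability=name)
    # Rung 4: fail cleanly (cost: lost task), with the context the caller needs
    raise RecoveryExhausted(f"{name} exhausted recovery: {result.error}")
```

A transient failure recovers on the second attempt; a non-retryable failure on an optional tool degrades instead of raising.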
Circuit Breakers
Tiered recovery handles individual failures. Circuit breakers handle patterns of failure.
Without a circuit breaker, a service that's completely down will cause every call to exhaust its full retry budget before escalating. If fifty tool calls are queued, each retrying three times with backoff, we've turned one outage into a couple hundred failed requests and a multi-minute delay. Circuit breakers prevent this.
A circuit breaker wraps a service and maintains three states:
```
function call_with_circuit_breaker(service, request, breaker):
    if breaker.state == OPEN:
        raise CircuitOpen(service=service.name, retry_at=breaker.retry_after)

    result = await service.call(request)

    if result.succeeded:
        breaker.record_success()
        if breaker.state == HALF_OPEN:
            breaker.close()  # service recovered, reset to closed
    else:
        breaker.record_failure()
        if breaker.failure_rate > breaker.threshold:
            breaker.open(retry_after=now() + breaker.cooldown)

    return result
```

The three states:
- Closed: normal operation. Calls go through. Failures are counted against the threshold.
- Open: the service is failing. Calls are rejected immediately without attempting the service. This prevents the retry storm.
- Half-open: the cooldown has elapsed. One probe call is allowed through. If it succeeds, the circuit closes. If it fails, the circuit reopens and the cooldown resets.
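A minimal Python sketch of the three-state breaker. To keep the example small it counts consecutive failures rather than tracking a failure rate; the class and method names are illustrative, not from any library:

```python
import time

class CircuitBreaker:
    """Three-state breaker: CLOSED -> OPEN (on failures) -> HALF_OPEN (after cooldown)."""

    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.state = "CLOSED"
        self.consecutive_failures = 0
        self.opened_at = 0.0

    def allow(self) -> bool:
        """Check before calling the service. False means reject immediately."""
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.cooldown_s:
                self.state = "HALF_OPEN"  # cooldown elapsed: let one probe through
                return True
            return False  # still open: no retry budget burned on a dead service
        return True

    def record_success(self):
        self.consecutive_failures = 0
        self.state = "CLOSED"  # probe succeeded (or normal operation continues)

    def record_failure(self):
        self.consecutive_failures += 1
        if self.state == "HALF_OPEN" or self.consecutive_failures >= self.failure_threshold:
            self.state = "OPEN"  # failed probe or threshold crossed: trip the breaker
            self.opened_at = time.monotonic()
```

The caller checks `allow()` first, then reports the outcome with `record_success()` or `record_failure()`; a failed probe in HALF_OPEN reopens the circuit and restarts the cooldown.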
The circuit breaker belongs above the tiered recovery function. It's the first thing evaluated. If the circuit is open, skip straight to fallback or degrade, without burning retry budget on a service that's known to be down.
One important distinction: a circuit breaker changes state (it blocks future calls until the service recovers). A model fallback, covered in Production Considerations, changes identity. It switches to a different model rather than blocking calls. They solve different problems and can coexist in the same system.
Fail-Closed Defaults
The escalation ladder assumes our system is fail-closed by default. When no recovery policy is defined for a given failure, the system defaults to the most restrictive behavior. This is the same asymmetric cost argument from tool system design: the cost of treating a safe operation as unsafe is a small performance hit, while the cost of treating an unsafe operation as safe is data corruption or worse. When writing recovery configuration, missing values should default to no-retry, no-fallback, mandatory failure, not to unlimited retries with permissive fallbacks.
A build_tool factory that enforces this centrally is cleaner than scattering defaults across tool definitions:
```
function build_tool(name: str, handler, config: ToolConfig) -> Tool:
    return Tool(
        name: name,
        handler: handler,
        # fail-closed recovery defaults:
        max_retries: config.max_retries ?? 0,                   # default: no retries
        fallback: config.fallback ?? None,                      # default: no fallback
        is_optional: config.is_optional ?? false,               # default: required (no degrade)
        require_permission: config.require_permission ?? true,  # default: ask
        is_destructive: config.is_destructive ?? true,          # default: assume harmful
    )
```

When is_optional is absent, the tool is treated as essential. It fails rather than degrades. When max_retries is absent, there are no retries. This is the same fail-closed logic that makes tool registration safe: the defaults are the most restrictive choices, and configuration is opt-in.
See Tool System Design for the full treatment of fail-closed defaults and why the asymmetry makes them a safety requirement, not just a convention.
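In Python, the same fail-closed posture falls out naturally from dataclass field defaults: every field defaults to the most restrictive choice, and relaxation is opt-in. A sketch with invented type names:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass(frozen=True)
class ToolConfig:
    # every default is the most restrictive choice (fail-closed)
    max_retries: int = 0                  # default: no retries
    fallback: Optional[Callable] = None   # default: no fallback
    is_optional: bool = False             # default: required (fail, don't degrade)
    require_permission: bool = True       # default: ask
    is_destructive: bool = True           # default: assume harmful

@dataclass(frozen=True)
class Tool:
    name: str
    handler: Callable
    config: ToolConfig

def build_tool(name: str, handler: Callable,
               config: Optional[ToolConfig] = None) -> Tool:
    # a missing config means maximal restriction, not permissiveness
    return Tool(name=name, handler=handler, config=config or ToolConfig())
```

A tool registered with no config gets zero retries, no fallback, mandatory permission checks, and is treated as destructive; relaxing any of these requires writing it down.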
Retryability Classification
Not all errors should be retried. Entering the retry loop without first asking "is this error fixable?" wastes latency budget and can cause harm. Retrying a malformed request doesn't fix the malformation. It just produces N copies of the same failure.
A production retry system classifies each error before attempting recovery. The classification decision tree:
| Status | Retry behavior | Reason |
|---|---|---|
| 400 (bad request) | Only if context overflow, and only after adjusting max_tokens | Other 400s are not fixable: the request is malformed |
| 401 (auth) | Yes, after refreshing credentials | The API key cache may be stale. Clearing it is worth one retry. |
| 408 (timeout) | Always | Transient: the service may have recovered |
| 409 (conflict) | Always | Transient lock contention: retry with backoff |
| 429 (rate limit) | Enterprise/PAYG only | Subscription users hit window-based limits (not retryable within the window) |
| 529 (overloaded) | Foreground operations only | Background operations should fail fast (see Query-Source Partitioning) |
| 5xx (server error) | Yes, unless server says no | Respect x-should-retry: false header. The server knows its state. |
| Connection reset | Yes, after disabling keep-alive | ECONNRESET/EPIPE often indicate the connection was reused after expiry |
In pseudocode, the retryability function looks like this:
```
function should_retry(error, attempt, max_retries, query_source):
    # Never retry beyond the limit (unless persistent mode)
    if attempt > max_retries:
        raise NonRetryableError(error)

    # 400: only retry if it's a context overflow we can fix
    if error.status == 400:
        overflow = parse_context_overflow(error)
        if overflow:
            adjust_max_tokens(overflow.available_context)
            return true  # retry with adjusted max_tokens
        raise NonRetryableError(error)  # other 400s are not fixable

    # 529 (overloaded): only retry foreground operations
    if error.status == 529:
        if query_source not in FOREGROUND_RETRY_SOURCES:
            raise NonRetryableError(error)  # background: fail fast, no amplification

    # 429 (rate limit): check subscription tier
    if error.status == 429:
        if is_subscription_user and not is_enterprise:
            raise NonRetryableError(error)  # subscription users have window limits

    # 401: refresh credentials, then retry
    if error.status == 401:
        refresh_credentials()
        return true

    # 5xx, 408, 409, connection errors: generally retryable
    if error.status in {408, 409} or error.status >= 500:
        if error.headers.get("x-should-retry") == "false":
            raise NonRetryableError(error)  # server directive overrides our logic
        return true

    return false
```

Adaptive max_tokens on context overflow. The 400 path deserves special attention. When a context overflow error occurs, the API response includes the actual token counts: how many tokens the input contained, the model's maximum, and how much space remains. We can parse these values from the error message, compute the available output budget, and reduce max_tokens for the retry. This converts what looks like a fatal error into a recoverable one without needing to compact or truncate the conversation.
The floor matters: if the available output budget is below ~3,000 tokens, attempting the retry would produce a response too short to be useful. At that point, fail rather than retry with an unusably small response budget. Adaptive max_tokens is a way to squeeze more life out of a long conversation. It is not a substitute for proper context management.
The Tool Error Pipeline
Tool execution has its own error model, distinct from the API retry system. The key distinction: tool errors become messages, not exceptions.
In the API retry system, a failure causes a delay and a retry. In the tool error pipeline, a failure yields a tool_result message into the conversation history. The model reads that message on the next turn and can adapt: try a different tool, ask the user for input, or reformulate the request. The agent loop continues.
This is the invariant that makes tool errors recoverable: every tool_use must have a matching tool_result, even on failure. A conversation with a dangling tool_use and no tool_result is invalid. The API will reject it. The tool error pipeline ensures that invariant is maintained regardless of what goes wrong during execution.
There are four tool error paths:
```
async function run_tool(tool_call, tool, context):
    # Path 1: unknown tool, yield error message, continue
    if not tool:
        yield create_tool_result(
            tool_use_id: tool_call.id,
            content: f"Error: No tool named '{tool_call.name}' is available",
            is_error: true
        )
        return  # loop continues: model sees the error and can adapt

    # Path 2: abort, yield cancel message, return cleanly
    if context.abort_signal.aborted:
        yield create_tool_result(
            tool_use_id: tool_call.id,
            content: "Operation cancelled",
            is_error: false
        )
        return  # abort propagates through message history without corruption

    # Paths 3 and 4: permission check + execution
    permission = check_permission(tool, tool_call.input)
    if permission.denied:
        yield create_tool_result(  # Path 3: permission denied
            tool_use_id: tool_call.id,
            content: f"Permission denied: {permission.reason}",
            is_error: true
        )
        return  # model can reframe the request or ask the user

    try:
        async for result in execute_tool(tool, tool_call, context):
            yield result
    except error:
        yield create_tool_result(  # Path 4: execution failure
            tool_use_id: tool_call.id,
            content: f"Tool '{tool.name}' failed: {error.message}",
            is_error: true
        )
        # No re-raise: the loop continues with complete message history
```

The four paths:
- Unknown tool: the model requested a tool that doesn't exist in the current tool set. Yield an error tool_result. The model sees this and can adapt: try a different tool, reformulate the request, or ask the user.
- Abort: the user or coordinator cancelled the operation mid-execution. Yield a cancel tool_result and return cleanly. Abort propagates through the message history without corruption. The conversation remains valid.
- Permission denied: the permission middleware rejected the call. Yield a tool_result with the denial reason. The model can reframe the request or ask the user for explicit permission. This is not the same as an execution failure because the tool was never called.
- Execution failure: the tool ran but threw an error. Catch at the outer boundary, yield a tool_result with error detail. No re-raise. The loop continues. The model has complete context for the next turn: it knows what it tried, why it failed, and can choose a different path.
The unifying principle: the agent loop's message stream must never contain a gap. Every tool_use has a tool_result, even on failure. This is what makes tool errors recoverable by design rather than by luck.
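The invariant is cheap to check mechanically. Below is a sketch of a validator, assuming Anthropic-style message shapes (content blocks with type, id, and tool_use_id fields); it is useful as a test-suite assertion or a pre-flight check before sending history back to the API:

```python
def find_dangling_tool_uses(messages: list) -> list:
    """Return the ids of tool_use blocks that have no matching tool_result."""
    used, resolved = set(), set()
    for msg in messages:
        content = msg.get("content")
        if not isinstance(content, list):
            continue  # plain-text message: no tool blocks to check
        for block in content:
            if block.get("type") == "tool_use":
                used.add(block["id"])
            elif block.get("type") == "tool_result":
                resolved.add(block["tool_use_id"])
    return sorted(used - resolved)
```

An empty return value means the invariant holds; any surviving id marks a gap that would make the conversation invalid on the next API call.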
Query-Source Partitioning
Not all operations deserve the same retry behavior during capacity events. When a rate-limit or overload error occurs, treating all operations equally (retrying everything) can amplify the failure rather than recover from it.
The problem: background operations (title generation, confidence scoring, suggestion ranking) run in parallel with the main agent loop. If a capacity event causes 529 errors, and each background operation retries three times, and there are N operations running in parallel, the cascade doesn't self-heal. It gets N times worse. The original failure triggers N times max_retries additional requests against an already overwhelmed service.
The solution: partition operations by who is waiting for the result.
- Foreground operations: the user is blocking on the result. These are worth retrying on capacity errors because the user experience degrades visibly if they fail.
- Background operations: the user never sees these results directly. These should fail fast on capacity errors. The cost of failure is invisible to the user. The cost of retrying is paid by the service under load.
The partition should be explicit: a foreground allowlist. Everything not on the allowlist defaults to fail-fast. This is the fail-closed principle applied to retry policy. Don't assume an operation deserves retries. Require it to be declared.
```
FOREGROUND_RETRY_SOURCES = {
    "main_agent",
    "user_request",
    "coordinator_task",
}

function get_retry_policy(query_source: str) -> RetryPolicy:
    if query_source in FOREGROUND_RETRY_SOURCES:
        return RetryPolicy(
            retry_on_capacity_error: true,
            max_retries: 3,
        )
    else:
        return RetryPolicy(
            retry_on_capacity_error: false,  # fail fast, no amplification
            max_retries: 0,
        )
```

The insight here is asymmetric: the benefit of retrying a background operation is low (the user doesn't see the result), and the cost is high (it amplifies capacity events). The default for anything not explicitly in the foreground allowlist is no retry on capacity errors. Adding operations to the foreground list is a deliberate act. It means "the user is blocking on this result and deserves a retry."
Production Considerations
The jitter formula with real numbers. The standard exponential backoff formula with jitter:
```
base_delay = min(500ms * 2^(attempt - 1), 32s)
jitter = random() * 0.25 * base_delay
total_delay = base_delay + jitter
```

With a 500ms base and a 32s cap: attempt 1 = 500ms plus up to 125ms of jitter, attempt 2 = 1s plus up to 250ms, attempt 3 = 2s plus up to 500ms, eventually capping at 32s plus up to 8s. The 25% jitter spreads retries across an 8-second window at the cap rather than a synchronized thundering herd. Fixed delays without jitter cause retry storms: if a hundred clients all fail at the same moment and all retry after exactly 1 second, they all hit the recovering service at exactly the same moment. The second wave is as bad as the first.
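The formula translates directly to Python; the function name and defaults are illustrative:

```python
import random

def backoff_delay_ms(attempt: int, base_ms: int = 500, cap_ms: int = 32_000,
                     jitter_fraction: float = 0.25) -> float:
    """Exponential backoff with additive jitter.

    Delay = min(base * 2^(attempt-1), cap), plus 0-25% random jitter on top.
    Attempts are 1-indexed: attempt 1 -> 500ms base, attempt 2 -> 1s, and so on.
    """
    base = min(base_ms * 2 ** (attempt - 1), cap_ms)
    return base + random.random() * jitter_fraction * base
```

Because the jitter is additive, each attempt's delay lands in the window [base, 1.25 * base]; the windows for different clients overlap but their exact delays almost never coincide.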
The server can override this calculation entirely. When the response includes a retry-after header, honor it. The server knows its own cooldown period better than any client formula does. Use the server-provided delay instead of the backoff calculation. The exception: the 6-hour session cap still applies in persistent retry mode (below) to prevent a pathological retry-after value from waiting indefinitely.
Persistent retry mode for unattended sessions. For CI pipelines and automated sessions where no user is watching, a persistent retry mode with unlimited retries and a hard time cap (6 hours) is more appropriate than a fixed retry count. The challenge: if the next retry is 30 minutes away (service outage), simply sleeping for 30 minutes will cause the host environment to mark the session idle and terminate it.
The solution: break long sleeps into short chunks (30 seconds each), and yield a heartbeat event to the host environment after each chunk. This keeps the session alive during extended waits. When the retry-after header specifies a long wait, chunk it:
```
function sleep_with_heartbeat(delay_ms: int, heartbeat_fn):
    chunk_ms = 30_000  # 30-second chunks
    remaining = delay_ms
    while remaining > 0:
        sleep(min(chunk_ms, remaining))
        heartbeat_fn()  # keeps the session alive
        remaining -= chunk_ms
```

The 6-hour cap is the critical safety valve. Without it, an automated session could wait indefinitely if the service never recovers. With the cap, the session eventually gives up and reports failure. The operator can investigate and retry manually.
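A Python version of the chunked sleep that also enforces the session cap. The names and the deadline convention (return False to signal "give up") are choices for this sketch, not a prescribed API:

```python
import time
from typing import Callable

SESSION_CAP_S = 6 * 60 * 60  # hard ceiling on total waiting in persistent mode
CHUNK_S = 30.0               # heartbeat interval: short enough to look alive

def sleep_with_heartbeat(delay_s: float, heartbeat: Callable[[], None],
                         deadline: float) -> bool:
    """Sleep in 30-second chunks, emitting a heartbeat after each chunk.

    Returns False if the session deadline (e.g. start + SESSION_CAP_S, as a
    time.monotonic() value) would be exceeded; the caller should then stop
    retrying and report failure instead of waiting forever.
    """
    remaining = delay_s
    while remaining > 0:
        if time.monotonic() >= deadline:
            return False  # cap reached: give up cleanly
        chunk = min(CHUNK_S, remaining)
        time.sleep(chunk)
        heartbeat()  # keeps the host from marking the session idle
        remaining -= chunk
    return True
```

A caller would compute `deadline = time.monotonic() + SESSION_CAP_S` once at session start and thread it through every wait, so a pathological retry-after value cannot outlast the cap.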
Model fallback on repeated capacity errors. After N consecutive overload errors on the primary model, trigger a fallback to a different model. This is distinct from a circuit breaker: the circuit breaker prevents calls to a failing service, while the model fallback switches to a different model identity. The fallback model still serves the same request. It just uses a different model name.
The fallback trigger is a specific signal (not a generic error) because the caller needs to handle it differently: switch the model name and retry the original request, rather than entering the standard retry loop. The primary model circuit breaker and the model fallback are orthogonal. Both can be active at the same time.
Parse error messages for exact token counts. When a context overflow error occurs, the API response includes the precise token counts: how many tokens the input contained, the model's maximum, and the available output budget. Do not guess these values. Parse them from the error message, compute the available space, and set max_tokens exactly for the retry. Guessing too high repeats the 400. Guessing too low wastes output budget. The API is giving us the exact answer. Use it.
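A sketch of that parsing step in Python. The error-message format varies by API, so the regex and the example message here are assumptions; the structure is the point: parse both counts, compute the budget, and enforce the ~3,000-token floor.

```python
import re
from typing import Optional

MIN_OUTPUT_TOKENS = 3_000  # below this floor, fail instead of retrying

def adjusted_max_tokens(error_message: str) -> Optional[int]:
    """Compute a new max_tokens for the retry from an overflow error message.

    Returns None when the message doesn't match or the remaining budget is
    below the usefulness floor. Assumes a message shaped roughly like:
    'input length 195000 tokens exceeds context limit 200000' (hypothetical).
    """
    m = re.search(r"(\d+)\s+tokens.*?limit.*?(\d+)", error_message)
    if not m:
        return None  # not a recognizable overflow error: don't guess
    input_tokens, context_limit = int(m.group(1)), int(m.group(2))
    budget = context_limit - input_tokens
    return budget if budget >= MIN_OUTPUT_TOKENS else None
```

A None result means the 400 stays non-retryable; any integer result is the exact max_tokens to set on the retried request.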
Best Practices
- DO classify errors before retrying. Not all errors are retryable. A 400 bad request is a bug in our request, not a transient failure.
- DO partition operations into foreground (retry on capacity errors) and background (fail fast on capacity errors). The default should be fail fast.
- DO use jitter in exponential backoff. Fixed delays cause synchronized retry storms when many clients fail simultaneously.
- DO yield error tool_result messages instead of raising exceptions from tool execution. The message stream must remain valid for the next turn.
- DO honor the server's retry-after header when present. The server's timing beats our formula.
- DO have a hard time cap on persistent retry mode (6 hours prevents unbounded waiting in unattended sessions).
- DO break long sleeps into 30-second chunks with heartbeats in persistent mode. Host environments terminate idle sessions.
- DON'T retry 400 errors unless we can fix the request. Parse the error, check if it's a context overflow, and only retry if max_tokens adjustment is possible.
- DON'T retry background operations during capacity cascades. Gateway amplification turns one failure into N times max_retries failures.
- DON'T let tool errors bubble up as exceptions. Every tool_use must have a matching tool_result, even on failure.
- DON'T set the adaptive max_tokens floor too low. Below ~3,000 tokens, the response is too short to be useful. Fail rather than produce useless output.
- DON'T conflate model fallback with circuit breaking. They solve different problems (identity vs. state) and can coexist.
Related
- Tool System Design: Fail-closed defaults originated in tool metadata design: when safety flags are missing, the system defaults to the most restrictive interpretation. That same principle applies to recovery configuration. Also covers the tool lifecycle, which connects directly to the tool error pipeline because execution failure handling is part of the dispatch boundary.
- Agent Loop: The agent loop must survive tool errors. The tool error pipeline keeps the loop running by maintaining a valid message stream even when tools fail. Understanding the loop's turn structure clarifies why a dangling tool_use without a tool_result breaks everything.
- Memory and Context: The autocompact circuit breaker uses the same consecutive-failure pattern as the API circuit breaker: after 3 consecutive compaction failures, autocompact is disabled for the session. Context overflow errors and adaptive max_tokens adjustment are the two systems' meeting point.
- Prompt Architecture: Retry behavior and error thresholds can be calibrated in the system prompt. Prompt design affects how the model responds to tool errors in its message history. A well-designed identity section helps the model adapt when tools fail rather than spinning or giving up.
- Safety and Permissions: Permission denials are a common error condition that the recovery system must handle. The denial tracking threshold (3 consecutive / 20 total) triggers mode escalation, which feeds back into the error recovery pipeline.
- Observability and Debugging: The event log records tool errors, hook failures, and permission denials as first-class events. These logged signals are the raw data that error recovery classifies as retryable or permanent.
- Pattern Index: All patterns from this page in one searchable list, with context tags and links back to the originating section.
- Glossary: Definitions for all domain terms used on this page, from agent loop primitives to memory system concepts.