
Agent Loop Architecture

How the core two-state machine drives every agent system, from the generator pattern through graceful cancellation, termination strategies, and error recovery.

The agent loop is what makes an LLM do things instead of just talk. Each turn, the model decides: call a tool, or stop. That decision, and the code that enforces it, is the beating heart of every agent system. Without it, you have a chatbot. With it, you have an agent that can read files, call APIs, run computations, and deliver results the model couldn't produce on its own.

Understanding the agent loop isn't optional. It's the foundation everything else rests on. Every other pattern in this knowledge base (tool dispatch, context compaction, multi-agent coordination) only makes sense once you have a clear mental model of how the loop works and why it's structured the way it is.

The agent loop is a state machine with exactly two states:

  1. Awaiting model response: we send the full message history to the model and wait.
  2. Dispatching tools: the model returned tool calls, so we run them and append the results.

If the model returns tool calls, we stay in the loop. If it returns no tool calls, we exit. That's the whole model.

Start Turn → Send messages to LLM → Response contains tool calls?
  → Yes: Dispatch tool calls → Append results to messages → Start next turn
  → No: Return final response
This structure makes tool-using agents deterministic to reason about. The model's intent (call a tool or stop) is explicit in every response. There's no hidden state, no implicit routing. Just two alternating phases driven by what the model decides to do next.

Think of the message list as a shared ledger between you and the model. Every turn, the model reads the full ledger and writes its response. When we run tools, we add the results to the ledger before the next turn. The model can only act on what it can see, so the ledger is the agent's entire working memory.
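A minimal sketch of the ledger after one tool-using turn. The message shape here is generic and illustrative (field names like "tool_calls" and "tool_call_id" vary by provider):

```python
# One tool-using turn of the ledger. Field names are illustrative;
# real providers differ in the details.
ledger = [
    {"role": "user", "content": "What is in config.yaml?"},
    # Turn 1: the model decides to call a tool.
    {"role": "assistant", "content": None,
     "tool_calls": [{"id": "call_1", "name": "read_file",
                     "arguments": {"path": "config.yaml"}}]},
    # We run the tool and append its result before the next model call.
    {"role": "tool", "tool_call_id": "call_1", "content": "debug: true\n"},
    # Turn 2: no tool calls, so the loop exits with this final answer.
    {"role": "assistant", "content": "config.yaml sets debug: true."},
]

def has_pending_tool_calls(msg: dict) -> bool:
    """The loop's entire branching decision: tool calls present or not."""
    return bool(msg.get("tool_calls"))
```

The model only ever sees this list; anything the loop doesn't append here is invisible to the next turn.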

The Generator Pattern

The right implementation structure for an agent loop is an async generator. Generators yield each completed turn as it happens, letting callers stream progress without coupling to the loop's internals. A caller can process each turn however it wants (stream events to a UI, wire in a supervisor, or just collect results) without the loop changing at all.

Here is a minimal but complete agent loop in pseudocode:

async def agent_loop(messages: list[Message], tools: dict[str, Tool], max_turns: int = 20) -> AsyncIterator[Response]:
  try:
    for turn_number in range(max_turns):
      response = await llm.call(messages, available_tools=tools)
      yield response                    # stream progress: caller observes each turn

      if not response.tool_calls:
        return                          # model is done: no tool calls means final answer

      messages.append(response)         # commit model response before dispatching
      for call in response.tool_calls:
        result = await tools[call.name].run(call.arguments)
        messages.append(tool_result(id=call.id, content=result))

    raise LoopError("exceeded max_turns without completion")
  finally:
    await cleanup()                     # always runs, regardless of how the caller exits

Three decisions embedded in this snippet are worth making explicit:

Why yield? Yielding each response means callers can observe the agent's reasoning as it unfolds, not just the final answer. This is how you build streaming UIs, per-turn logging, and supervisor agents that intervene mid-run. If we returned only the final response, all intermediate state would be invisible. Generators compose naturally: a caller can iterate over turns without the loop caring how they're consumed.

Why max_turns is a correctness requirement, not a preference. Without a hard limit, a tool whose output keeps prompting further tool calls will run forever. The limit is the circuit breaker. Twenty turns is generous for most tasks. When an agent exceeds it, that usually signals a task design problem, not a reason to raise the limit.

Why append the response before dispatching tools. The model's response must be in the message history before we add tool results. If we skip this step, the model's next turn sees tool results with no preceding assistant message, which most providers reject as a malformed conversation. Order of operations is a correctness invariant, not a convention.

Generator vs. callback. Generators compose naturally: one generator can yield from another, building pipelines of agents without coupling layers to each other. Callbacks scatter event handling across multiple sites and make error propagation fragile. The trade-off is that async generator semantics are less familiar than callbacks. Stack traces are less obvious, and the generator lifecycle has a subtle edge case: generators don't guarantee finalization on garbage collection. Always use try/finally inside the generator body to ensure cleanup runs regardless of how the caller exits.

try/finally cleanup ordering matters more in async generators than in regular functions because the generator may be abandoned mid-execution. The caller can break out of the iteration loop at any point. The finally block runs when the generator is garbage-collected or when the caller explicitly closes it. Without it, database connections, file handles, and abort controllers leak. The rule: if your generator opens a resource, the finally closes it, unconditionally.
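A runnable demonstration of the abandonment problem (names are illustrative): the caller breaks after the first item, and cleanup runs only because of the try/finally plus an explicit aclose().

```python
import asyncio

cleanup_ran = False

async def turns():
    try:
        for i in range(10):
            yield i         # pretend each yield is one completed agent turn
    finally:
        # The only reliable place to release resources: runs whether the
        # caller exhausts the generator, breaks early, or closes it.
        global cleanup_ran
        cleanup_ran = True

async def main():
    gen = turns()
    async for turn in gen:
        break               # caller abandons the generator mid-iteration
    await gen.aclose()      # explicit finalization; don't rely on GC timing

asyncio.run(main())
```

On Python 3.10+, contextlib.aclosing(gen) wraps the same explicit-close discipline in a context manager so the aclose() can't be forgotten.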

The State Struct Pattern

Production agent loops don't use simple iteration variables. They carry all mutable state in a typed struct that gets replaced wholesale at every continue site. This is the key insight that separates a loop that's easy to reason about from one that accumulates subtle bugs.

Each iteration destructures the current state at the top, making all names read-only within the turn. When the loop needs to continue, it constructs a fresh state object with the updated values and a typed reason for why the loop continued. No mutation happens inside a turn. Every state change is explicit.

class LoopState:
  messages: list[Message]
  turn_count: int
  recovery_count: int
  transition: dict | None   # { "reason": str } when the loop continues

state = LoopState(messages=initial_messages, turn_count=0, recovery_count=0, transition=None)

while True:
  messages, turn_count, recovery_count = state.messages, state.turn_count, state.recovery_count

  response = await call_model(messages)

  if response.is_max_output_tokens and recovery_count < MAX_RECOVERY:
    state = LoopState(
      messages=[*messages, *response.messages, continuation_prompt()],
      turn_count=turn_count,
      recovery_count=recovery_count + 1,
      transition={ "reason": "max_output_tokens_recovery" }
    )
    continue

  if not response.has_tool_calls:
    return Terminal(reason="completed", messages=[*messages, *response.messages])

  results = await dispatch_tools(response.tool_calls)

  if turn_count + 1 > max_turns:
    # Append this turn's results so every tool_use has a matched result
    return Terminal(reason="max_turns", messages=[*messages, *response.messages, *results])

  state = LoopState(
    messages=[*messages, *response.messages, *results],
    turn_count=turn_count + 1,
    recovery_count=0,
    transition={ "reason": "next_turn" }
  )

The benefits compound:

  • Avoids stack growth. A loop that recurses (calling itself for each turn) accumulates stack frames across many turns. The iterative state struct pattern uses constant stack depth regardless of turn count.
  • Makes continuation reasons auditable. When you log state.transition.reason, you know exactly why the loop continued: next_turn, max_output_tokens_recovery, stop_hook_blocking, reactive_compact_retry. Debugging a misbehaving loop becomes a matter of reading the log, not tracing execution.
  • Prevents accidental state bleed. If you mutate shared variables across iterations, a bug in one turn silently infects the next. Wholesale replacement makes state boundaries explicit.

The recovery_count field is a good example of why this matters: it tracks how many max-output-tokens recoveries have happened in a row. When the loop continues normally (next_turn), it resets to zero. When it continues because of token overflow, it increments. You can't track this correctly with simple mutation across iterations.

Graceful Cancellation

Cancellation is not a cleanup problem. It's a message history correctness problem. When the abort signal fires, any tool_use blocks that were emitted in the assistant message but haven't yet received a tool_result leave the conversation in a malformed state. Most API providers reject subsequent calls with a 400 error if there are unmatched tool_use/tool_result pairs. The loop's abort handler must emit synthetic error tool results for every outstanding tool call before exiting.

There are three distinct points in the loop where the abort signal can fire, each requiring different cleanup:

Path A: mid-streaming, before any tool_use blocks arrive. The model's response stream is in progress. No tool_use blocks have been seen yet. The loop can exit cleanly because there are no outstanding tool calls to match. Yield an interruption signal and return.

Path B: streaming complete, tool_use blocks collected, before tool dispatch. The model's response is fully collected and contains tool_use blocks, but tool execution hasn't started. Every tool_use block must receive a synthetic error tool_result before the loop exits.

Path C: mid-tool-execution. Some tools have started running. Some tool_use blocks have received results, and others haven't. The dispatcher must emit synthetic error results for the in-progress tools that didn't complete.

async def run_loop(messages: list[Message], abort_signal: AbortSignal) -> Terminal:
  pending_tool_uses: set[str] = set()

  while True:
    # Path A check: before starting a new API call
    if abort_signal.aborted:
      emit_missing_tool_results(pending_tool_uses, "Interrupted by user")
      return Terminal(reason="aborted_streaming")

    async for event in call_model_streaming(messages, signal=abort_signal):
      if event.type == "tool_use":
        pending_tool_uses.add(event.id)
      # ... collect message blocks

    # Path B check: after streaming, before tool dispatch
    if abort_signal.aborted:
      emit_missing_tool_results(pending_tool_uses, "Interrupted by user")
      return Terminal(reason="aborted_streaming")

    for call_id in list(pending_tool_uses):
      # Each tool's run() checks abort_signal and returns a synthetic error result if aborted
      result = await run_tool(call_id, signal=abort_signal)
      messages.append(result)
      pending_tool_uses.discard(call_id)

    # Path C check: after tool dispatch
    if abort_signal.aborted:
      emit_missing_tool_results(pending_tool_uses, "Interrupted mid-execution")
      return Terminal(reason="aborted_tools")

The emit_missing_tool_results helper is the critical piece. It takes the set of tool_use IDs that haven't received results and emits a tool_result with is_error: true for each one. Without this, the next API call will fail with a validation error, leaving the user to see a cryptic error instead of a clean abort.
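A sketch of the helper, assuming a generic tool_result message shape (the field names are illustrative, not any specific provider's schema):

```python
def emit_missing_tool_results(messages: list[dict],
                              pending_tool_uses: set[str],
                              note: str) -> None:
    """Append a synthetic error tool_result for every tool_use ID that
    never received a real result, so the history stays well-formed."""
    for tool_use_id in sorted(pending_tool_uses):   # deterministic order
        messages.append({
            "type": "tool_result",
            "tool_use_id": tool_use_id,
            "content": note,                        # e.g. "Interrupted by user"
            "is_error": True,
        })
    pending_tool_uses.clear()
```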

Pass the abort signal everywhere. The signal should flow into every async operation the loop initiates: the model API call, each tool's execution, and any hook invocations. If you only check it at the top of each iteration, you get coarse-grained cancellation where the loop finishes the current turn before responding to an abort. For truly responsive cancellation, every awaited operation needs to observe the signal.
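One way to make every awaited operation abort-responsive is to race it against the signal. This sketch uses an asyncio.Event as a stand-in for whatever abort-signal type your stack provides:

```python
import asyncio

async def with_abort(coro, abort: asyncio.Event):
    """Race an operation against the abort signal; cancel the loser."""
    op = asyncio.ensure_future(coro)
    watcher = asyncio.ensure_future(abort.wait())
    done, pending = await asyncio.wait({op, watcher},
                                       return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()
    if watcher in done and op not in done:
        raise asyncio.CancelledError("aborted")
    return op.result()

async def slow_tool():
    await asyncio.sleep(10)       # stands in for a long-running tool call
    return "done"

async def main():
    abort = asyncio.Event()
    asyncio.get_running_loop().call_later(0.01, abort.set)  # simulate Ctrl-C
    try:
        await with_abort(slow_tool(), abort)
        return "completed"
    except asyncio.CancelledError:
        return "aborted"
```

Wrapping each model call, tool run, and hook invocation this way gives cancellation latency bounded by the slowest uncancellable operation, not by the length of a whole turn.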

Termination Strategies

The loop terminates in more ways than max_turns. Understanding all the paths, and what state they leave behind, is essential for building loops that behave predictably under all conditions.

Normal completion. The model returns a response with no tool calls. Yield the final response and return cleanly. Turn count can be anywhere from 1 to max_turns.

max_turns exceeded. The turn counter reaches the limit. Return a Terminal with reason max_turns. The message history is complete and consistent, with all tool_use/tool_result pairs matched. The model simply didn't finish within the budget.

Aborted streaming. The abort signal fired before or during the model's response stream. Return Terminal(reason='aborted_streaming'). Synthetic tool results may have been emitted, and the history is well-formed.

Aborted after tools. The abort signal fired after tool dispatch completed but before the next model call. Return Terminal(reason='aborted_tools'). The history includes all tool results from this turn.

Stop hook prevented continuation. A stop hook ran and signaled that the loop should not continue. This is a deliberate external halt: the hook has evaluated the assistant's response and decided the agent is done (or has exceeded a policy limit). Return Terminal(reason='stop_hook_prevented').

Stop hook blocking re-entry. A stop hook returned a blocking error: a user message that should be injected into the conversation and trigger another turn. The loop does not terminate. Instead, it re-enters with the injected message appended. The transition reason is stop_hook_blocking.

# After stop hooks complete
if stop_hook_result.blocking_errors:
  state = LoopState(
    messages=[*messages, *stop_hook_result.blocking_errors],
    turn_count=turn_count,  # don't increment: this is a re-injection, not a new turn
    recovery_count=recovery_count,  # preserve: see Production Considerations
    transition={ "reason": "stop_hook_blocking" }
  )
  continue

if stop_hook_result.prevent_continuation:
  return Terminal(reason="stop_hook_prevented", messages=messages)

Token budget exhaustion. When the loop is running with a token budget (a maximum total output token count across all turns), a BudgetTracker monitors progress. The logic is not a simple percentage check. It uses diminishing-returns detection:

class BudgetTracker:
  continuation_count: int
  last_delta_tokens: int
  last_global_turn_tokens: int

def check_token_budget(tracker: BudgetTracker, budget: int, total_turn_tokens: int) -> Decision:
  COMPLETION_THRESHOLD = 0.9
  DIMINISHING_THRESHOLD = 500

  delta = total_turn_tokens - tracker.last_global_turn_tokens
  is_diminishing = (
    tracker.continuation_count >= 3 and
    delta < DIMINISHING_THRESHOLD and
    tracker.last_delta_tokens < DIMINISHING_THRESHOLD
  )

  if not is_diminishing and total_turn_tokens < budget * COMPLETION_THRESHOLD:
    tracker.continuation_count += 1
    tracker.last_delta_tokens = delta
    tracker.last_global_turn_tokens = total_turn_tokens
    return Continue(nudge_message=budget_nudge(total_turn_tokens, budget))

  return Stop(diminishing_returns=is_diminishing)

The loop continues as long as progress is being made (each continuation adds at least 500 tokens). If three consecutive continuations each produce less than 500 new tokens, the loop stops even if the budget percentage hasn't been reached (90% threshold). This prevents the model from indefinitely generating tiny fragments when it has effectively completed the task.
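The rule condenses to a single predicate. This runnable sketch keeps the thresholds from the pseudocode above but reduces the Continue/Stop decision to a boolean:

```python
COMPLETION_THRESHOLD = 0.9   # stop once 90% of the budget is consumed
DIMINISHING_THRESHOLD = 500  # a "productive" continuation adds >= 500 tokens

def should_continue(continuation_count: int, last_delta: int,
                    delta: int, total: int, budget: int) -> bool:
    """Continue only while under 90% of budget AND not in a
    diminishing-returns tail: at least three continuations so far,
    with the last two deltas each under 500 tokens."""
    diminishing = (continuation_count >= 3
                   and delta < DIMINISHING_THRESHOLD
                   and last_delta < DIMINISHING_THRESHOLD)
    return not diminishing and total < budget * COMPLETION_THRESHOLD
```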

Prompt too long. The model returns a context-overflow error instead of a response. The loop can attempt reactive compaction (summarizing old messages) once per session. If compaction has already been attempted, the loop returns Terminal(reason='prompt_too_long').

Reliable loop continuation signal. The API's stop_reason === 'tool_use' field is not a reliable signal for whether to continue the loop. Track a boolean needs_follow_up that is set to true whenever a tool_use content block is detected during streaming. Use this flag, not the API's stop reason, to decide whether to dispatch tools and loop again.
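A sketch of the local flag, assuming a generic streaming-event shape (the event dicts here are illustrative):

```python
def collect_stream(events: list[dict]) -> tuple[list[dict], bool]:
    """Collect content blocks from a response stream and track whether
    any tool_use block appeared. The returned flag, not the API's
    stop_reason, drives the continue-or-stop decision."""
    blocks: list[dict] = []
    needs_follow_up = False
    for event in events:
        if event["type"] == "tool_use":
            needs_follow_up = True      # authoritative continuation signal
        blocks.append(event)
    return blocks, needs_follow_up
```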

Error Handling Within the Loop

The agent loop must handle errors without corrupting the message history. Each error type has a different recovery strategy.

LLM call errors: max_output_tokens. When the model hits its per-response output limit before finishing, the API returns a partial response with a max_output_tokens stop reason. The loop can recover by appending a continuation prompt: "You were cut off. Please continue from where you stopped." This recovery has a limit (typically 3 attempts per turn) tracked by recovery_count in the state struct.

if response.stop_reason == "max_output_tokens" and recovery_count < MAX_RECOVERY:
  continuation = create_system_message("Continue from where you stopped.")
  state = LoopState(
    messages=[*messages, *response.messages, continuation],
    turn_count=turn_count,              # same turn: this is recovery, not progress
    recovery_count=recovery_count + 1,
    transition={ "reason": "max_output_tokens_recovery" }
  )
  continue

Tool execution errors. When a tool call fails (exception, timeout, or validation error), do not omit the tool result. Emit an error tool_result with is_error: true so the model sees the failure and can retry with different inputs or adapt its approach. Silently swallowing a tool error leaves the model with a tool_use that has no paired result, which causes an API error on the next call.

try:
  result = await tool.run(call.arguments)
  tool_results.append(tool_result(id=call.id, content=result))
except ToolError as e:
  tool_results.append(tool_result(id=call.id, content=str(e), is_error=True))
  # Do NOT re-raise: the model needs to see the failure

Malformed responses and tombstones. When a model fallback occurs (switching to a secondary model after rate limiting or a provider error), partial assistant messages from the failed stream may be left in the message history. These are orphaned messages: they contain content bound to the original model's context (cryptographic signatures, thinking block references) that would cause API errors if replayed to a different model. The solution is to emit tombstone messages that mark orphaned messages for removal. Tombstones are stripped from the message history before the next API call, from the UI rendering, and from transcript serialization. The model effectively never sees the failed partial response.
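A sketch of tombstone filtering (the message shape is illustrative): the tombstone names the orphaned message by ID, and the filter drops both before the next API call.

```python
def strip_tombstoned(messages: list[dict]) -> list[dict]:
    """Remove tombstone markers and the orphaned messages they name,
    leaving every other message (user, tool, assistant) intact."""
    dead_ids = {m["target_id"] for m in messages
                if m.get("type") == "tombstone"}
    return [m for m in messages
            if m.get("type") != "tombstone" and m.get("id") not in dead_ids]
```

Note this is a targeted removal, not truncation: messages before and after the orphan survive untouched.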

Error API responses and stop hooks. When the API returns an error response (not a partial output, an actual error), stop hooks must not evaluate it. Stop hooks are designed for valid model responses. An error message is not a valid response, and feeding it to stop hooks causes them to misfire. The correct path is to call a separate stop_failure_hooks path (a non-blocking notification-only path) and then either attempt recovery or return a terminal state.

The reactive compact guard. When the message history grows large enough to cause a prompt-too-long error, the loop can trigger reactive compaction, which summarizes and truncates the history. This should happen at most once per session. The guard flag tracking whether compaction has been attempted must be session-scoped, not turn-scoped. If you reset it at the start of each turn, a loop that compacts, fails again (because the compacted history is still too large), and triggers compaction again will cycle until it runs out of API budget.
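A sketch of the guard's scoping, with a session object holding the flag (names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Session:
    # Session-scoped: survives turns AND stop-hook re-entries.
    has_attempted_reactive_compact: bool = False

def handle_prompt_too_long(session: Session) -> str:
    """Return the loop's next action when the model reports overflow."""
    if session.has_attempted_reactive_compact:
        # Already compacted once this session: give up cleanly instead
        # of cycling compact -> still too long -> compact forever.
        return "terminal:prompt_too_long"
    session.has_attempted_reactive_compact = True
    return "compact_and_retry"
```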

Production Considerations

These insights require production experience to know. Each one has caused real bugs in production agent loops.

Synthetic tool_result emission on abort is a hard requirement. Any unmatched tool_use block in the message history causes an API 400 error on the next call. This is not a "nice to have" cleanup. It's a correctness invariant. When the abort signal fires, every outstanding tool_use ID must receive a corresponding tool_result with is_error: true before the loop returns. The failure mode is subtle: the loop exits cleanly, but the next time the user starts a new turn, they see a cryptic API error instead of their response. The fix is always to trace back to an unmatched tool_use from the previous aborted turn.

Stop hook death spiral from session-scoped guard reset. Stop hooks can inject blocking messages that cause the loop to re-enter. If that re-entered turn produces a prompt-too-long error and triggers reactive compaction, the compaction guard must not be reset. The failure mode is an infinite cycle: compact, still too long, error, stop hook fires, compact again. The guard (has_attempted_reactive_compact) must survive across stop hook re-entries. Treat it as session state, not turn state. Warning signs: API costs growing exponentially, loop never returning to the user.

Post-sampling hooks fire before tool dispatch and cannot affect the current turn. There is a common misunderstanding about when post-sampling hooks execute: they run immediately after the model's response stream completes, before tool results are awaited. They fire as fire-and-forget. They do not block tool dispatch, and they cannot change which tools are dispatched in the current turn. They can inject system messages that affect the next turn, but if you're trying to gate tool dispatch on a hook's decision, you need a different hook type (pre-tool hooks, not post-sampling hooks). This mistake produces loops where hook logic appears to run but has no visible effect on tool dispatch.

Token budget diminishing-returns thresholds are empirically calibrated. The 500-token threshold per continuation and 3-consecutive-check requirement exist because the model can continue producing tokens indefinitely in tiny fragments when it has effectively completed the task. These values are not derivable from the problem statement. They emerged from observing real loops running over-budget. If you build your own budget tracker, start with these values as calibration targets: 90% completion threshold, 500 tokens per continuation, 3 consecutive checks before stopping. Your task distribution may need different values, but these are the right starting point.

The stop_reason API field is unreliable for loop continuation. The API contract says stop_reason === 'tool_use' when the model stopped because it emitted tool calls. In practice, this field is not always set correctly. Production loops that rely on it exit prematurely when tool calls are present in the response. The correct sentinel is a local boolean (needs_follow_up) that is set to true whenever a tool_use content block is detected during streaming. Use the local flag, not the API field, for loop continuation decisions.

Tombstoning is the correct recovery from orphaned streaming messages, not history truncation. When a model fallback leaves partial assistant messages in the history, the temptation is to truncate the history to remove them. Truncation loses valid user and tool messages that came before the orphaned assistant message. Tombstoning is a targeted removal: mark the specific orphaned message for removal, leave everything else intact. Downstream consumers (transcript serializers, UI renderers) that understand tombstones will strip them. Those that don't will ignore them safely.

Best Practices

Do track loop continuation with a local flag, not the API's stop_reason. Set needs_follow_up = True whenever you see a tool_use block during streaming. Use this flag to decide whether to dispatch tools and continue. The API stop_reason is informational. Your local flag is authoritative.

Don't omit tool results for failed or aborted tools. Every tool_use block must have a matching tool_result in the history, even if the tool failed or the loop was aborted. Emit error results with is_error: True and a descriptive error message. The model uses these to adapt, and the API requires them for correctness.

Do use a typed state struct at every continue site. When the loop needs to iterate, replace the state struct wholesale with a new instance that includes a typed transition.reason. This makes continuation reasons auditable and prevents state leakage between turns.

Don't treat the reactive compact guard as per-turn state. The flag tracking whether reactive compaction has been attempted must survive across all loop iterations including stop hook re-entries. Reset it only when the session ends, not between turns.

Do propagate the abort signal to every awaited operation. Pass the abort signal into model API calls, tool executions, and hook invocations. Checking it only at the top of each iteration gives coarse-grained cancellation where the current turn must complete before the abort takes effect.

Don't rely on max_turns as the only termination condition. Build your loop to understand all terminal states: token budget exhaustion, stop hook prevention, prompt overflow, and clean abort. Surfaces that only handle max_turns will appear to hang or produce cryptic errors when other termination conditions fire.

Do use try/finally in the generator body. The generator may be abandoned mid-execution. The finally block is the only reliable place to release resources: close connections, cancel child tasks, flush logs. Without it, resources leak silently.

Don't evaluate stop hooks on API error responses. Stop hooks are designed for valid model responses. When the API returns an error (context overflow, rate limiting, model error), call a failure notification path instead. Evaluating stop hooks on error responses causes them to misfire and can trigger the stop hook death spiral.

Related Pages

  • Tool System Design: Tools are typed functions with metadata that makes them agent-safe. The schema tells the model what arguments to pass, and the concurrency class tells the dispatcher whether parallel execution is safe. The dispatch algorithm (how consecutive safe tools are grouped into batches and unsafe tools run serially) is the other half of what the loop does on each turn.

  • Memory and Context: As the message list grows across turns, the loop faces context pressure. This page covers the hierarchy of forgetting, from in-context history to compressed summaries to long-term storage, and when reactive compaction should fire. The reactive compact guard discussed in Production Considerations above is the loop's interface with the memory system.

  • Error Recovery: The loop's error handling strategies (retry logic, circuit breakers, tiered recovery) connect to the broader error recovery patterns covered on this page. Tombstoning, synthetic tool results, and continuation prompts are specific applications of the more general error recovery framework.

  • Streaming and Events: The generator pattern yields streaming events that callers consume. This page covers the event type system, how partial results flow from the model through the loop to the UI, and how backpressure prevents the event stream from overwhelming consumers.

  • Prompt Architecture: The system prompt is assembled before each loop iteration. Understanding the two-zone model (cached static + volatile dynamic) and the prompt assembly pipeline explains how the loop's context changes across turns.

  • Multi-Agent Coordination: Multi-agent systems coordinate multiple agent loops. Spawning backends, mailbox communication, and tool partitioning are the patterns that connect individual loops into collaborative systems.

  • Observability and Debugging: The span hierarchy in session tracing mirrors the agent loop lifecycle. Each loop iteration produces spans that are the primary debugging surface for understanding agent behavior.

  • Pattern Index: All patterns from this page in one searchable list, with context tags and links back to the originating section.

  • Glossary: Definitions for all domain terms used on this page, from agent loop primitives to memory system concepts.