Debug Your Agent

How to find out what your agent is doing wrong. Covers structured event logging, cost tracking, session tracing, and the signatures of common failure patterns.

Your agent is doing something wrong. Maybe it is calling the wrong tool. Maybe it is looping endlessly. Maybe it produces a plausible-looking answer that is quietly incorrect. Standard debugging techniques (breakpoints, stack traces, print statements) are not enough. Agent failures are non-deterministic (the same input can produce different behavior), context-dependent (the failure depends on what happened five turns ago), and often silent (the agent does not crash, it just gives a wrong answer with confidence).

Debugging an agent requires a different approach: structured observability that lets you reconstruct what happened, what it cost, and where time went. This guide builds that observability in three layers, then shows you the signatures of the most common failure patterns so you know what to look for.

Layer 1: Structured Event Logging

The first question when debugging is "what happened?" The answer should come from a structured event log: a timeline of named events with typed metadata that you can query after the fact.

The key word is structured. A log line like "Tool called successfully" tells you nothing useful. A structured event like tool_called duration_ms=450 success=true retry_count=0 tells you exactly what happened, how long it took, and whether it retried.

The following sets up a structured event logger:

import time

def now_ms() -> int:
  return int(time.time() * 1000)

class EventLogger:
  def __init__(self):
    # Per-instance list; a class-level default would be shared across instances.
    self.events: list[dict] = []

  def log(self, event_name: str, metadata: dict) -> None:
    entry = {
      "timestamp": now_ms(),
      "event": event_name,
      **metadata,
    }
    self.events.append(entry)

  def query(self, event_name: str) -> list[dict]:
    return [e for e in self.events if e["event"] == event_name]

logger = EventLogger()

Wire the logger into every decision point in the agent loop:

async def agent_loop_with_logging(question: str, tools: list) -> str:
  messages = [system_message(prompt), user_message(question)]
  logger.log("session_start", {"turn_count": 0})

  for turn in range(max_turns):
    logger.log("turn_start", {"turn": turn, "message_count": len(messages)})

    response = await llm.call(messages, tools=tools)
    logger.log("llm_response", {
      "turn": turn,
      "has_tool_calls": len(response.tool_calls) > 0,
      "token_count": response.usage.total_tokens,
    })

    if not response.tool_calls:
      logger.log("session_complete", {"turns_used": turn + 1})
      return response.text

    for call in response.tool_calls:
      start = now_ms()
      result = await dispatch_tool(call.name, call.args)
      logger.log("tool_called", {
        "turn": turn,
        "tool": call.name,
        "duration_ms": now_ms() - start,
        "success": not result.is_error,
        "result_size": len(str(result)),
      })
      messages.append(tool_result(call.id, result))

  logger.log("session_timeout", {"turns_used": max_turns})
  raise RuntimeError("agent exceeded max_turns")

Every significant event (session start, turn start, LLM response, tool call, session end) gets a structured log entry with typed metadata. When debugging, you can query the log to answer specific questions: "Which tool took the longest?" "How many tokens did turn 5 use?" "Did any tool call fail?"

Tip: Log the model's tool selection reasoning alongside the call when available. When the agent picks the wrong tool, the reasoning tells you why, and the fix is usually in the tool schema descriptions, not in the tool implementation.
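
As a sketch of that post-hoc querying, here is the logger above (reimplemented inline so the snippet is self-contained) answering two of those questions against sample events:

```python
import time

def now_ms() -> int:
  return int(time.time() * 1000)

class EventLogger:
  def __init__(self):
    self.events: list[dict] = []

  def log(self, event_name: str, metadata: dict) -> None:
    self.events.append({"timestamp": now_ms(), "event": event_name, **metadata})

  def query(self, event_name: str) -> list[dict]:
    return [e for e in self.events if e["event"] == event_name]

logger = EventLogger()
logger.log("tool_called", {"tool": "search_files", "duration_ms": 450, "success": True})
logger.log("tool_called", {"tool": "read_file", "duration_ms": 120, "success": True})
logger.log("tool_called", {"tool": "write_file", "duration_ms": 900, "success": False})

# "Which tool took the longest?"
slowest = max(logger.query("tool_called"), key=lambda e: e["duration_ms"])

# "Did any tool call fail?"
failures = [e for e in logger.query("tool_called") if not e["success"]]
```

Because every entry carries the same typed fields, these queries are one-liners instead of log-grepping exercises.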

Layer 2: Cost Tracking

Agents are expensive. Each LLM call costs tokens, and tokens cost money. Cost tracking serves two purposes: real-time budget awareness during a session, and post-session analysis to understand where money went.

The following tracks costs at three scopes (per call, per session, and per model):

class CostTracker:
  def __init__(self):
    self.total_cost_usd: float = 0.0
    self.total_input_tokens: int = 0
    self.total_output_tokens: int = 0
    self.total_cache_read_tokens: int = 0
    self.per_model: dict = {}
    self.per_turn: list = []

  def record(self, model: str, usage: TokenUsage) -> None:
    cost = calculate_cost(model, usage)  # per-model price table lookup
    self.total_cost_usd += cost
    self.total_input_tokens += usage.input_tokens
    self.total_output_tokens += usage.output_tokens
    self.total_cache_read_tokens += usage.cache_read_tokens

    if model not in self.per_model:
      self.per_model[model] = ModelCost(cost=0.0, calls=0)
    self.per_model[model].cost += cost
    self.per_model[model].calls += 1

    self.per_turn.append(TurnCost(model=model, cost=cost, tokens=usage))

  def is_runaway(self) -> bool:
    if len(self.per_turn) < 5:
      return False
    recent = self.per_turn[-5:]
    # Runaway signal: cost per turn is increasing consistently
    return all(recent[i].cost > recent[i - 1].cost for i in range(1, len(recent)))

The is_runaway method detects a specific failure pattern: the agent is in a loop where each iteration is more expensive than the last. This happens when the agent keeps accumulating context (tool results, conversation turns) without compacting, and each subsequent LLM call processes a larger input. An increasing cost trajectory over five consecutive turns is a strong signal that something is wrong.
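
To make the signal concrete, here is the same check as a standalone function, run against two sample cost trajectories (the numbers are illustrative):

```python
def is_runaway(per_turn_costs: list[float], window: int = 5) -> bool:
  # Runaway signal: cost strictly increases across the last `window` turns.
  if len(per_turn_costs) < window:
    return False
  recent = per_turn_costs[-window:]
  return all(recent[i] > recent[i - 1] for i in range(1, len(recent)))

# Healthy session: cost per turn fluctuates around a plateau.
healthy = is_runaway([0.02, 0.03, 0.02, 0.04, 0.03])

# Context accumulating without compaction: every turn costs more than the last.
runaway = is_runaway([0.02, 0.04, 0.07, 0.11, 0.18])
```

A strict inequality keeps the check conservative: a single flat or cheaper turn resets the signal, so normal cost noise does not trigger it.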

Track four token types separately. Input tokens and output tokens are billed at standard rates. Cache-read tokens (served from the provider's prompt cache) are billed at a fraction of the input rate. If you aggregate all tokens into a single counter, you cannot tell whether prompt caching is working. A high ratio of cache-read tokens to input tokens means caching is effective. A low ratio means you are paying full price for prompts that should be cached.
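
A minimal sketch of that ratio, assuming the provider reports cache reads separately from (not included in) billed input tokens; reporting conventions vary by provider, so check yours:

```python
def cache_hit_ratio(input_tokens: int, cache_read_tokens: int) -> float:
  # Fraction of prompt tokens that were served from the provider's cache.
  total_prompt = input_tokens + cache_read_tokens
  return cache_read_tokens / total_prompt if total_prompt else 0.0

# 18K of 20K prompt tokens came from cache: caching is working.
effective = cache_hit_ratio(2_000, 18_000)

# No cache reads: paying full price for every prompt token.
ineffective = cache_hit_ratio(20_000, 0)
```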

Layer 3: Session Tracing

Event logging tells you what happened. Cost tracking tells you what it cost. Tracing tells you where time went and how operations relate to each other.

A trace is a tree of spans, each representing a timed operation:

session trace
  |-- llm_request (2.3s)
  |-- tool: search_files (0.4s)
  |-- llm_request (1.8s)
  |-- tool: read_file (0.1s)
  |-- tool: write_file (0.2s)
  |     |-- permission_check (0.05s)
  |-- llm_request (1.5s)

The following implements the core span bookkeeping (in production, the active span should live in async-local storage so that concurrent tasks keep separate span chains):

class TraceContext:
  def __init__(self):
    self.spans: list = []
    self.active_span: Span | None = None

  def start_span(self, name: str, parent: Span | None = None) -> Span:
    span = Span(
      name=name,
      start_time=now_ms(),
      parent=parent or self.active_span,
      children=[],
    )
    if span.parent:
      span.parent.children.append(span)
    self.spans.append(span)
    self.active_span = span
    return span

  def end_span(self, span: Span) -> None:
    span.end_time = now_ms()
    span.duration_ms = span.end_time - span.start_time
    self.active_span = span.parent

The trace context propagates automatically through async operations. When the agent loop starts a tool span, any sub-operations (permission checks, file I/O) automatically become children of that span. The resulting trace tree shows exactly where time was spent and how operations nested.

Use traces to answer questions like: "Why did turn 3 take 8 seconds?" (the tool call inside it hit a timeout and retried). "Why is the session so expensive?" (one LLM call is processing 100K tokens because context was not compacted).
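
One way to get that automatic propagation in Python is the standard-library contextvars module: each asyncio task runs in its own context copy, so the active span never leaks between concurrent tool calls. A minimal sketch (the Span class and span() helper are illustrative, not from a real tracing library):

```python
import asyncio
import contextvars
from contextlib import contextmanager

class Span:
  def __init__(self, name: str, parent: "Span | None" = None):
    self.name = name
    self.parent = parent
    self.children: list["Span"] = []

# Async-local storage: each task sees its own active span.
_active_span = contextvars.ContextVar("active_span", default=None)

@contextmanager
def span(name: str):
  # New spans are parented under whatever span this task currently has active.
  parent = _active_span.get()
  s = Span(name, parent=parent)
  if parent:
    parent.children.append(s)
  token = _active_span.set(s)
  try:
    yield s
  finally:
    _active_span.reset(token)

async def write_file_tool() -> Span:
  with span("tool: write_file") as s:
    # The nested span attaches itself as a child without being passed a parent.
    with span("permission_check"):
      await asyncio.sleep(0)
  return s

root = asyncio.run(write_file_tool())
```

The permission_check span ends up under the tool span purely by virtue of running inside it, which is exactly the nesting shown in the trace tree above.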

Common Failure Patterns

Once you have structured logs, cost tracking, and traces, you can identify the most common agent failure patterns by their signatures:

Infinite tool loop. The agent calls the same tool repeatedly with the same or similar arguments. Signature in logs: the same tool name appears in consecutive turns, the cost tracker shows an increasing trajectory, and the turn count approaches max_turns.

# Signature: same tool called repeatedly
turn 12: tool=search_files args={pattern: "config.yml"}
turn 13: tool=search_files args={pattern: "config.yml"}
turn 14: tool=search_files args={pattern: "config.yaml"}
turn 15: tool=search_files args={pattern: "config.yml"}

The fix is usually in the tool's error response. If the tool returns an unhelpful error ("not found"), the model retries. Return a specific, actionable error ("no file matching 'config.yml' exists in /project. Available config files: settings.json, env.toml") and the model moves on.
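
You can also catch this pattern mechanically from the structured log. A crude detector over the tool_called events (exact-argument matching only; the config.yml/config.yaml variation above would need fuzzier comparison):

```python
def detect_tool_loop(tool_events: list[dict], threshold: int = 3) -> bool:
  # Flags `threshold` consecutive calls to the same tool with identical args.
  run = 1
  for prev, cur in zip(tool_events, tool_events[1:]):
    if cur["tool"] == prev["tool"] and cur["args"] == prev["args"]:
      run += 1
      if run >= threshold:
        return True
    else:
      run = 1
  return False

looping = detect_tool_loop([
  {"tool": "search_files", "args": {"pattern": "config.yml"}},
  {"tool": "search_files", "args": {"pattern": "config.yml"}},
  {"tool": "search_files", "args": {"pattern": "config.yml"}},
])
```

Run this as a guard inside the agent loop and you can abort or intervene before the loop burns through max_turns.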

Context overflow. The agent's responses degrade: shorter, less accurate, ignoring earlier instructions. Signature: the message_count in turn logs is high (30+), the token_count per turn is near the context window limit, and the result_size of tool calls is large (10K+ characters).

The fix is compaction. See Manage Conversation Context for the compaction pipeline.

Wrong tool selection. The agent picks a tool that does not match the task. Signature: a tool call fails with a semantic error (not a schema error), or the tool succeeds but the result is irrelevant to the conversation. Look at the tool schema descriptions. If the description is vague, the model cannot distinguish between similar tools.

The fix is better schema descriptions. Add explicit guidance on when to use and when not to use each tool. The model reads descriptions to make selection decisions.
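
For illustration, a vague description next to a sharpened one (the tool names and the grep_files sibling are hypothetical):

```python
# Vague: the model cannot tell this apart from a web search or content search.
vague = {
  "name": "search_files",
  "description": "Search for files.",
}

# Specific: says what it searches, when to use it, and when not to.
specific = {
  "name": "search_files",
  "description": (
    "Search file names in the local project by glob pattern. "
    "Use this to locate a file before reading it. "
    "Do not use this for full-text search of file contents; "
    "use grep_files for that."
  ),
}
```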

Silent wrong answers. The agent produces a confident answer that is factually incorrect. This is the hardest failure to detect because there is no error signal. The signature is absence: no tool errors, no timeout, no loop, just a wrong answer delivered smoothly.

The fix requires verification hooks or human review. Add a post-completion check that validates the answer against the tool results that were used to produce it. If the agent claims "the file contains X" but the read_file result shows something different, flag the discrepancy.
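
A minimal sketch of such a hook, assuming quoted values in the answer should be traceable to some tool result the agent actually saw (real verification needs semantic comparison, not substring matching):

```python
import re

def verify_claims(answer: str, tool_results: list[str]) -> bool:
  # Crude check: every double-quoted value the agent asserts must appear
  # verbatim in at least one tool result from the session.
  claims = re.findall(r'"([^"]+)"', answer)
  return all(any(claim in result for result in tool_results) for claim in claims)

seen = ["timeout = 30\nretries = 2"]

grounded = verify_claims('the file contains "timeout = 30"', seen)
fabricated = verify_claims('the file contains "timeout = 60"', seen)
```

A failed check does not prove the answer is wrong, but it flags exactly the discrepancy described above for human review.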

See Also

  • Observability and Debugging. The full observability architecture: sink queue pattern for event logging, metadata type restrictions for PII safety, per-model cost tracking with cache separation, span hierarchy with async-local propagation, and orphan span cleanup.
  • Error Recovery. Once you have identified the failure, the escalation ladder (retry, fallback, degrade, fail) tells you what to do next.
  • Agent Loop. Understanding the loop lifecycle helps you interpret traces and identify which phase of the loop is causing problems.