
Observability and Debugging

Three independent observability layers for agent systems: structured event logging, per-model cost tracking, and session tracing with span hierarchy, each answering a different debugging question.

An agent failure is harder to diagnose than a typical service failure. The timeline is long, spanning dozens of LLM calls, tool executions, and permission checks. The state is large, encompassing conversation history, tool outputs, and compacted memory summaries. The cost varies: a single debugging session can cost more than a day's worth of routine usage. And the failure mode is often silent: the agent produces a plausible-looking output that is quietly wrong.

You need three distinct lenses to see into a running agent system, and each lens answers a different question:

  • "What happened?": Structured event logging. A discrete timeline of named events (tool calls, LLM requests, permission grants, session starts) that you can query after the fact or stream in real time.
  • "What did it cost?": Cost tracking per model, per session, and across sessions. Broken down by token type so you can see whether prompt caching is working.
  • "How long did it take and where?": Session tracing. A span hierarchy that maps every operation to its duration, parent, and cause, from the user prompt down to individual tool calls and permission checks.

These three systems are independent. You can enable or disable each one without affecting the others. In a full production deployment, all three run simultaneously and complement each other: the event log tells you what happened, cost tracking tells you what it cost, and tracing tells you where time went.

Layer 1: Structured Event Logging

The foundational problem with event logging in agent systems is startup ordering: events are produced before the logging sink is ready. The agent starts processing before initialization is complete. If you drop events during this window, your logs are unreliable from the first line.

The sink queue pattern solves this:

# Events are logged at any time, before the sink exists
log_event("startup_began", {"mode": 1})       # goes to in-memory queue
log_event("config_loaded", {"success": True}) # still queued

# Later: sink attaches during initialization
attach_analytics_sink(sink)   # drains the queue via microtask scheduling

# After attachment: events route directly to the sink
log_event("session_started", {"duration_ms": 250})  # direct

The queue is a FIFO buffer that accumulates events before the sink is ready. When the sink attaches, it drains the queue via a microtask (not a synchronous loop) to avoid blocking the main execution path. After draining, all subsequent events go directly to the sink.
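
The pattern fits in a few lines. This is a minimal sketch: the `EventLog` and `attach_sink` names are illustrative, and the drain here is a synchronous loop for brevity where the real system would schedule a microtask:

```python
from collections import deque

class EventLog:
    """Sketch of the sink-queue pattern with bounded buffering."""
    def __init__(self, max_queued=1000):
        self._queue = deque()
        self._max = max_queued
        self._sink = None

    def log_event(self, name, metadata):
        if self._sink is not None:
            self._sink(name, metadata)          # sink attached: route directly
        elif len(self._queue) < self._max:
            self._queue.append((name, metadata))  # buffered until a sink attaches
        # else: queue full, event dropped silently

    def attach_sink(self, sink):
        if self._sink is not None:
            return                               # idempotent: first attachment wins
        self._sink = sink
        while self._queue:                       # drain FIFO (real system: microtask)
            self._sink(*self._queue.popleft())

received = []
log = EventLog()
log.log_event("startup_began", {"mode": 1})      # queued, no sink yet
log.attach_sink(lambda name, meta: received.append(name))  # drains the queue
log.log_event("session_started", {"duration_ms": 250})     # routed directly
```

After this runs, `received` holds both events in FIFO order, and a second `attach_sink` call would be a silent no-op.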

Sink attachment is idempotent: the first call attaches and wins, subsequent calls are silently no-ops. This prevents double-routing if initialization code is called more than once, but it also means the order of initialization matters. If a test environment resets state between test cases, it must explicitly detach and reattach the sink, not just re-call the attach function.

Metadata type restriction is the other key design decision: event metadata accepts only boolean, number, and undefined values, not strings.

# Correct: safe metadata, numbers and booleans only
log_event("tool_called", {
  "duration_ms": 450,
  "success": True,
  "retry_count": 0,
  "cache_hit": False,
})

# What the type system prevents:
# log_event("tool_called", { "file_path": "/home/user/secret.txt" })
# log_event("request_done", { "user_prompt": "how do I..." })

This is not a limitation. It is a deliberate safety contract. Strings can contain file paths, user prompts, code snippets, and other PII. By restricting metadata to numbers and booleans, accidental PII leakage into the event log becomes a compile-time error rather than a runtime incident. When a string genuinely needs to be logged, it must go through a separate verified-metadata path that carries explicit acknowledgment that the string has been reviewed for PII, a friction point that makes leakage a deliberate, auditable act rather than an accident.
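
The restriction can be sketched as a runtime check (a typed language would enforce it at compile time). `VerifiedString` here is a hypothetical wrapper standing in for the verified-metadata path, not a real API name:

```python
from numbers import Number

class VerifiedString:
    """Hypothetical marker: the wrapped string was reviewed for PII."""
    def __init__(self, value: str):
        self.value = value

def validate_metadata(metadata: dict) -> dict:
    """Allow bool, number, and None; reject raw strings outright."""
    clean = {}
    for key, value in metadata.items():
        if isinstance(value, str):
            raise TypeError(
                f"raw string in metadata key {key!r}: wrap in VerifiedString"
            )
        if value is None or isinstance(value, (bool, Number)):
            clean[key] = value
        elif isinstance(value, VerifiedString):
            clean[key] = value.value   # deliberate, reviewed string
        else:
            raise TypeError(f"unsupported metadata type for key {key!r}")
    return clean
```

A raw string fails loudly at the logging boundary; a reviewed string passes only through the explicit wrapper.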

Two separate logging systems run simultaneously and they must be independent:

  • First-party logging routes to internal analytics infrastructure. Batched to a backend endpoint, enriched with session, environment, and deployment metadata. Subject to internal retention policies.
  • Third-party logging routes to customer-configured telemetry backends using standard export formats. Receives only events the customer has opted into. Uses a separate logger provider instance.

The strict separation matters: internal analytics events must not appear in customer telemetry pipelines, and customer events must not leak into internal analytics. Using separate provider instances (rather than a shared provider with different destinations) is the only reliable way to enforce this boundary at the type level, not the configuration level.

Event sampling allows high-frequency events to be logged at rates below 100%. A sample rate is attached to each event's metadata, so downstream analysis systems can correct for the sampling when computing aggregate statistics. This prevents high-volume tool-call events from flooding the event store while still providing statistically accurate aggregates.
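
A sketch of rate-based sampling with downstream correction. The `SAMPLE_RATES` table and the 0.1 rate are illustrative; the `rng` parameter exists only to make the example deterministic:

```python
import random

SAMPLE_RATES = {"tool_called": 0.1}   # illustrative: log 10% of tool-call events

def maybe_log(name, metadata, emit, rng=random.random):
    """Emit the event at its configured sample rate, stamping the rate on it."""
    rate = SAMPLE_RATES.get(name, 1.0)
    if rng() >= rate:
        return False                  # sampled out
    emit(name, {**metadata, "sample_rate": rate})
    return True

def estimated_count(events):
    """Downstream correction: each logged event represents 1/rate real events."""
    return sum(1.0 / e["sample_rate"] for e in events)
```

One logged event at a 0.1 rate counts as ten real events in aggregate statistics, which is how the store stays small without biasing the totals.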

Layer 2: Cost Tracking

Cost tracking serves two needs: real-time budget awareness during a session, and post-session analysis to understand where money went. Meeting both needs requires tracking at three scopes simultaneously.

Per-API-call scope: every LLM API response carries token usage. Accumulate immediately:

# Per API call: update all three scopes
add_to_session_cost(
  cost=calculate_usd_cost(model, usage),
  usage=usage,    # input_tokens, output_tokens, cache_read, cache_creation
  model=model,    # the model that processed this call
)

Per-session scope: in-memory aggregation across all API calls in the current session:

# Retrieve breakdown at any point during the session
for model, usage in get_model_usage().items():
  total = usage.input_tokens + usage.output_tokens
  print(f"{model}: ${usage.cost_usd:.4f}, "
        f"{total} tokens ({usage.cache_read_tokens} from cache)")

# Total cost across all models
session_total = get_total_cost_usd()

Cross-session persistence: written to project config, keyed by session ID:

# Before session ends: persist cost state
save_current_session_costs()   # writes to project config with session_id key

# On session resume: restore state for the matching session
restore_cost_state_for_session(session_id)
# Only restores if session_id matches, so unrelated sessions don't contaminate each other

The four token types must be tracked separately because their billing differs. Input tokens and output tokens are billed at standard rates. Cache-read tokens (served from a provider's prompt cache) are billed at a fraction of the input rate. Cache-creation tokens (written into the cache on first use) may carry an additional write charge. Aggregating all of these into a single "tokens used" counter makes it impossible to tell whether prompt caching is working: a low cache-read count means you're paying full price for prompts that should have been cached.
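
A sketch of per-call cost calculation over the four token types. The rates are invented for illustration; real provider pricing differs by model and changes over time:

```python
# Illustrative per-million-token rates, not real pricing.
RATES = {
    "fast-model": {
        "input": 1.00,          # standard input rate
        "output": 5.00,         # standard output rate
        "cache_read": 0.10,     # fraction of the input rate
        "cache_creation": 1.25, # input rate plus a cache-write surcharge
    },
}

def calculate_usd_cost(model, usage):
    """Sum the four token types against their distinct rates."""
    r = RATES[model]
    return (usage["input_tokens"] * r["input"]
            + usage["output_tokens"] * r["output"]
            + usage["cache_read_tokens"] * r["cache_read"]
            + usage["cache_creation_tokens"] * r["cache_creation"]) / 1_000_000
```

Collapsing the four counts into one before this calculation would make the cache-read discount invisible, which is exactly the aggregation mistake the text warns against.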

Multiple API model strings that map to the same underlying model are summed by canonical model name. This means a session that uses fast-model-v1 and fast-model-v1-turbo (different API identifiers, same billing family) shows their cost as a single line item under the canonical name, which is what you want for budget tracking.
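
Canonical-name aggregation is a small mapping step. The model identifiers and mapping table here are illustrative:

```python
from collections import defaultdict

# API identifier -> canonical billing-family name (illustrative)
CANONICAL = {
    "fast-model-v1": "fast-model",
    "fast-model-v1-turbo": "fast-model",
}

def aggregate_by_canonical(per_call_costs):
    """Sum per-call costs under the canonical model name."""
    totals = defaultdict(float)
    for model, cost in per_call_costs:
        totals[CANONICAL.get(model, model)] += cost
    return dict(totals)
```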

Layer 3: Session Tracing

Session tracing maps the agent's activity to a span hierarchy, a tree of timed operations with parent-child relationships. The hierarchy mirrors the agent loop lifecycle:

interaction (user prompt to agent response)
  ├── llm_request (one LLM API call)
  ├── llm_request (another LLM API call if the first triggered more)
  ├── tool (one tool execution)
  │   └── blocked_on_user (awaiting permission grant)
  └── tool (another tool execution)
      └── hook (beta tracing only: hook spans are a separate opt-in)

The key implementation insight is context-local propagation: spans don't need to be passed explicitly through every function call. An async-local storage context holds the current span, and any code that runs in that context automatically knows its parent:

# Interaction span: root of the trace
interaction_span = start_interaction_span(user_prompt)
# Context-local storage now holds this span as the active context

# LLM request: auto-parented to interaction via context-local storage
llm_span = start_llm_request_span(model="fast-model")
response = await query_llm(messages)
end_llm_request_span(llm_span, metadata={
  "input_tokens": response.usage.input_tokens,
  "output_tokens": response.usage.output_tokens,
  "success": True,
})
# Note: we pass llm_span explicitly to end it. See Production Considerations.

# Tool span: also auto-parented to interaction
tool_span = start_tool_span("bash", {"command_type": "read"})
# ... tool executes ...
end_tool_span(tool_result, result_tokens=42)

# Interaction ends: all child spans complete before this
end_interaction_span()

Context-local propagation means code deep in the call stack can start and end its own spans without the caller needing to thread a context parameter through every layer. The tracing system works across async boundaries. A span started in an async task is the parent of spans started in any code that awaits in that task's context.

Three tracing backends run independently and can all be active simultaneously:

  • Standard enhanced telemetry: the core span hierarchy (interaction, llm_request, tool, blocked_on_user). Use case: production observability, dashboards.
  • Beta tracing: additional attributes on existing spans, plus hook spans. Use case: development debugging, hook behavior analysis.
  • Performance tracing: a separate trace format for profiling tools. Use case: performance profiling, flamegraph analysis.

Hook spans appear only in beta tracing, not in standard enhanced telemetry. Performance tracing emits to a separate trace file optimized for profiling tools, not to the same OTLP endpoint as standard telemetry.

Orphan span cleanup is a non-obvious requirement. In production, aborted LLM streams and uncaught exceptions leave spans open. The code that would call end_span() never runs. Without cleanup, the active span collection grows indefinitely over the lifetime of a long-running agent process.

The solution is a background cleanup interval that uses weak references:

# Active spans stored as weak references
active_spans: dict[str, weakref.ref] = {}

# Background cleanup: runs every 60 seconds
def cleanup_orphan_spans():
  cutoff = now() - SPAN_TTL_MS   # 30 minutes
  for span_id, ref in list(active_spans.items()):
    span = ref()                    # dereference the weak reference
    if span is None:                # GC collected it, already gone
      del active_spans[span_id]
    elif span.started_at < cutoff:  # alive but open too long
      span.end(status="abandoned")
      del active_spans[span_id]

The weak reference is the key: when context-local storage clears (task ends, context exits) and no other code holds a reference to the SpanContext, the GC collects it. The next cleanup pass finds a stale WeakRef and removes the entry. Spans that are still alive after 30 minutes are explicitly ended with an "abandoned" status. Without the TTL, a single crashed agent turn can leak an open span that stays in memory for the lifetime of the process.

Debugging Agent Failures

This is the practical section: how to use the three layers to diagnose specific failure types.

Cost spikes: Go to Layer 2 first. The per-model breakdown shows which model and which operation type is driving the cost. Check the ratio of cache_read_tokens to input_tokens. A low cache-read ratio means prompt caching is ineffective. Either the system prompt is changing too frequently between calls (invalidating the cache), or the caching configuration is wrong. This is one of the most common silent cost problems: everything works, it just costs 10x more than it should.
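
The ratio check is a one-liner worth keeping on hand. Field names follow the usage shape used earlier on this page:

```python
def cache_read_ratio(usage):
    """Fraction of prompt tokens served from cache; low means caching is off or broken."""
    prompt_tokens = usage["input_tokens"] + usage["cache_read_tokens"]
    return usage["cache_read_tokens"] / prompt_tokens if prompt_tokens else 0.0
```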

Slow tools: Go to Layer 3. Tool spans show exact execution duration from the point the tool was dispatched to the point its result was returned. If a tool span is slow but the tool itself is fast, look for blocked_on_user sub-spans. They indicate permission prompts that required human approval. The bottleneck isn't the tool. It's the permission gate. Long permission wait times are invisible without tracing because the tool reports success with correct output. Only the duration reveals the issue.

Permission failures: Layer 3 blocked_on_user spans correlate with the permission decision. High counts of blocked_on_user spans that resolve quickly (auto-approved or already-cached) are normal. A spike in spans where blocked_on_user duration is long means the system is waiting for human approval on operations that should be pre-approved. Cross-reference with the event log: if the permission system is emitting denial events, the classifier may be over-triggering on safe operations.

Hook failures: Layer 1 event log captures hook outcomes. Non-blocking hook errors are logged to the session transcript, but the transcript is long, and a non-blocking error mid-session is easy to overlook. If you suspect hook misbehavior, filter the event log for hook-outcome events and look for non_blocking_error outcomes. Agent hooks that silently cancel show as outcome: "cancelled" with no associated error message. The only way to detect them is through the event log or the tracing layer.

Missing events: If events appear inconsistently in the log (some sessions have them, some don't), check sink attachment timing. The sink queue buffers events before the sink attaches, but the queue has a maximum capacity. If the agent produces a large burst of events during startup before the sink is ready, early events may be dropped. Fix: ensure the sink attaches as early as possible in the initialization sequence, before any event-producing code runs.

Incorrect cost attribution in parallel systems: In systems that run multiple LLM requests concurrently, always pass the explicit span object returned by start_llm_request_span() to end_llm_request_span(). The tracing system has a fallback that finds "the most recently started LLM request span" when no span is provided, but in parallel execution, this may be the wrong span. Token counts can be attributed to the wrong model, making cost breakdowns appear correct while being silently wrong.

Privacy and Redaction

Prompt content is redacted by default. The agent collects user prompts, system prompts, and tool results, all of which may contain PII. By default, none of this content flows into the tracing backend or the event log. Only metadata (durations, token counts, model names, success/failure indicators) is emitted.

Prompt content logging is opt-in via an environment variable. When enabled, user prompt text is included in the span attributes for the interaction span, and tool inputs/outputs appear in the tool span attributes. This opt-in is session-level: enable it for development debugging sessions, disable it for production.
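
The opt-in gate is a small conditional at attribute-construction time. `AGENT_LOG_PROMPTS` is an invented variable name for illustration, not the real one:

```python
import os

def prompt_logging_enabled() -> bool:
    """Content logging is off unless the opt-in variable is explicitly set."""
    return os.environ.get("AGENT_LOG_PROMPTS", "") == "1"

def interaction_attributes(user_prompt, duration_ms):
    """Build span attributes: metadata always, content only on opt-in."""
    attrs = {"duration_ms": duration_ms}
    if prompt_logging_enabled():
        attrs["user_prompt"] = user_prompt   # opt-in only: may contain PII
    return attrs
```

With the variable unset, the span carries durations and counts but no user content, which is the safe default baseline.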

The verified-metadata contract extends this principle to structured metadata. When a developer needs to log a string value (for example, a sanitized version of a file path or a hash of a prompt) they must go through a separate verified-metadata path. The type system marks these values with an explicit acknowledgment that the developer has reviewed the value for PII. This isn't bureaucratic overhead. It's a mechanism that makes PII exposure a deliberate, reviewable act rather than a typo.

The practical result: in a default configuration, you can share your event logs and traces freely. They contain no user content. This matters in regulated industries (healthcare, finance, legal) where sharing operational logs with external monitoring vendors requires data review processes. When privacy is the default, you start from a safe baseline and opt in deliberately.

Production Considerations

1. Orphan span TTL prevents indefinite memory growth.

Without explicit TTL cleanup, every aborted LLM stream or uncaught exception leaves an open span. In a long-running agent process that handles hundreds of sessions, this accumulates. The 30-minute TTL with background cleanup is the minimum safe configuration. For agents with very long turns (complex multi-step tasks that run for hours), extend the TTL proportionally. Monitor the count of spans ended with abandoned status as a health metric. A rising count means something is failing to close spans normally.

2. Parallel LLM requests require explicit span passing.

When multiple LLM requests run concurrently (common in warmup scenarios, parallel tool evaluation, or multi-agent coordination) the legacy "find the most recent llm_request span" fallback is unreliable. Always pass the specific span returned by start_llm_request_span() to end_llm_request_span(). The span is returned explicitly for this reason. Systems that skip this step will see correct token totals but incorrect per-model attribution in parallel-execution scenarios.
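
The fix can be shown with two interleaved requests. This is a sketch: the span here is a bare dict standing in for the real span object, and the function bodies are simplified:

```python
import asyncio

usage_by_model = {}

def end_llm_request_span(span, metadata):
    # Attribute tokens to the model recorded on the explicit span,
    # never to "the most recently started span".
    model = span["model"]
    usage_by_model[model] = usage_by_model.get(model, 0) + metadata["output_tokens"]

async def call_model(model, output_tokens):
    span = {"model": model}        # stand-in for start_llm_request_span(model)
    await asyncio.sleep(0)         # yield so the two requests interleave
    end_llm_request_span(span, {"output_tokens": output_tokens})

async def main():
    await asyncio.gather(
        call_model("fast-model", 10),
        call_model("reasoning-model", 500),
    )

asyncio.run(main())
```

Because each call closes its own span object, the token counts land on the right model even though the two requests overlapped; a "most recent span" lookup at the same points could attribute both counts to whichever span started last.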

3. Analytics sink attachment is idempotent. First attachment wins.

If your initialization code is refactored and the analytics sink gets attached in two different code paths, the second attachment silently no-ops. From the caller's perspective, the sink attached successfully, but events from any code that expected the second sink are silently routed to the first sink. In test environments that reset global state between test cases, explicit teardown (detach, then reattach) is required. Simply calling attach again will not reroute events.

4. The event queue has a bounded capacity.

The in-memory event queue that buffers events before the sink is ready has a maximum capacity (typically on the order of a few thousand events). During a particularly busy startup (many plugins loading, many configuration events firing) the queue can fill before the sink attaches. Events beyond the capacity are dropped silently. The fix is always the same: attach the sink earlier in the initialization sequence. If startup necessarily produces many events before a sink can attach, increase the queue capacity for your workload.

5. Dual logging systems use separate provider instances.

Running the first-party and third-party loggers on separate provider instances is not redundancy. It is isolation. A single shared provider with two exporters could route an internal analytics event to a customer OTLP endpoint if the filtering configuration has a bug. Separate instances make this impossible at the architectural level: the first-party provider cannot accidentally reach the third-party exporter because it doesn't hold a reference to it.

Best Practices

  • Do track cost per-model with the four token types separated. Aggregating all tokens into one counter hides prompt caching effectiveness, a critical optimization signal in production agent systems.

  • Do pass explicit span objects to span-ending calls. Never rely on the "most recent span" fallback in concurrent systems. The explicit span is returned for a reason.

  • Do use the verified-metadata contract for any string that might contain PII. The friction is the point. Make PII logging deliberate, not accidental.

  • Don't log prompt content by default. Default-off is the only safe default for user data. Opt in explicitly for development sessions, and ensure opt-in is per-session, not global.

  • Don't rely on GC alone for span cleanup. Long-running agents need a background TTL cleanup interval. Weak references plus a periodic sweep is the production pattern.

  • Do attach the analytics sink exactly once, as early as possible in the initialization path, before any event-producing code runs. If you have two initialization code paths, pick one canonical path and route both through it.

  • Do monitor the blocked_on_user span duration distribution. Unexpectedly long values mean users are waiting at permission prompts, a performance problem that looks like tool latency without tracing.

  • Don't aggregate cost by session without also tracking by model. "This session cost $0.50" is less useful than "this session cost $0.45 in the reasoning model and $0.05 in the fast model, and 80% of the fast model cost was cache misses."

Related Patterns

  • Agent Loop: The span hierarchy mirrors the agent loop lifecycle. Each loop iteration corresponds to one or more spans in the session trace.
  • Tool System: Tool spans are the primary debugging surface for slow operations. The tool's concurrency class and dispatch behavior affect span timing.
  • Hooks and Extensions: Hook spans appear in beta tracing. Hook outcomes are logged as first-class events. Observability and hooks compose to give complete visibility into hook behavior and timing.
  • Error Recovery: The event log feeds the error recovery decision tree: logged hook failures, tool errors, and permission denials are the raw signals that error recovery classifies as retryable or permanent.
  • Streaming and Events: The streaming event pipeline is the primary data source for observability. Event logging, cost tracking, and session tracing all consume the same typed event stream that the streaming system produces.