A tool is how an agent acts on the world. Without tools, an agent can only produce text. With tools, it can read a database, call an external API, write a file, run a subprocess, or search the web. Every action the agent takes beyond generating text goes through a tool.
But a plain function isn't enough. The agent loop dispatches tool calls based on what the model requests, and the model has no idea what your code looks like. It only knows what you tell it. That means a tool is really two things: the function that does the work, and the metadata that tells the rest of the system how to handle it safely. The metadata is the design. The function body is the plumbing.
A tool is a typed function with metadata. The metadata has three parts:
- Schema: the typed contract between the model and the code. The model reads this to know what arguments to pass. Without a schema, the model is guessing.
- Concurrency class: tells the dispatcher whether parallel execution is safe. Some tools read files and can run in parallel. Others write to shared state and must run serially.
- Behavioral flags: cross-cutting concerns declared as data, not embedded in the function body. Is this tool destructive? Does it require user permission? Can it be interrupted?
Here is a tool definition with full metadata:
tool search_files(
  pattern: string,
  directory: string,
) -> SearchResult[]:
  metadata:
    schema: { pattern: string, directory: string }
    concurrency_class: READ_ONLY
    max_result_size_chars: 500_000
    behavioral_flags:
      is_destructive: false
      requires_permission: false
      interrupt_behavior: 'block'
  return filesystem.search(pattern, in=directory)

The function body is one line. The metadata is seven. That ratio is intentional. Most of the tool design work is in the metadata, because the metadata is what makes the tool usable in an automated system where you can't watch every call.
Concurrency Classes
The concurrency class answers one question: is it safe to run this tool in parallel with other tools?
The three classes form a spectrum:
- READ_ONLY: The tool only reads. No shared state is modified, so multiple instances can run concurrently without interference. Example: searching files, fetching a URL, reading a config value.
- WRITE_EXCLUSIVE: The tool writes to shared state. It must run serially: other tools must finish before this one starts (or vice versa). Example: writing a file, inserting a database row, sending an email.
- UNSAFE: The tool has side effects that are hard to bound or undo. It runs serially and in isolation, in a subprocess or sandbox where it can't interfere with the agent's own state. Example: executing arbitrary shell commands, running untrusted code.
One subtlety worth naming: concurrency class is determined at dispatch time, not at registration time. The dispatcher calls is_concurrency_safe(parsed_input) once per tool call, passing the actual arguments. A shell execution tool running ls might be concurrency-safe. The same tool running a destructive command is not. This is not a static property stamped onto the tool type. It's a runtime judgment about a specific invocation.
If is_concurrency_safe throws (for example, because the input fails to parse), the conservative fallback is false. Schema parse failure also defaults to false. The system treats all ambiguous cases as unsafe. It never optimistically assumes concurrent dispatch is okay.
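A minimal Python sketch of this runtime judgment, using a hypothetical shell tool and allowlist (the names and the allowlist contents are illustrative, not the real implementation):

```python
READ_ONLY_COMMANDS = {"ls", "cat", "pwd", "grep", "head"}  # hypothetical allowlist

def shell_is_concurrency_safe(parsed_input: dict) -> bool:
    """Per-invocation judgment: only plain read-only commands are safe
    to run alongside other tools."""
    words = parsed_input["command"].strip().split()
    return bool(words) and words[0] in READ_ONLY_COMMANDS

def classify(is_safe_fn, parsed_input) -> bool:
    """Conservative wrapper: any failure during classification is unsafe."""
    try:
        return is_safe_fn(parsed_input)
    except Exception:
        return False  # never optimistically assume concurrency is okay
```

The same tool yields different answers for different inputs: `ls -la` classifies as safe, `rm -rf build` does not, and a malformed input that raises during classification falls through to unsafe.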
The Dispatch Algorithm
When the model returns multiple tool calls in a single response, the dispatcher must decide which tools to run in parallel and which to serialize. The algorithm is partition-then-gather: split the tool call list into consecutive batches, where each batch is either a concurrent group of safe tools or a single unsafe tool. Process batches sequentially. Within each safe batch, run tools in parallel.
The partitioning rule is simple: extend the current safe batch if the current tool is safe and the previous batch is also safe. Otherwise start a new batch.
The following example shows the core partition logic:
function partition_tool_calls(calls: ToolCall[], context) -> Batch[]:
  batches = []
  for call in calls:
    tool = find_tool(call.name)
    try:
      parsed = tool.input_schema.parse(call.input)
      is_safe = tool.is_concurrency_safe(parsed)
    except:
      is_safe = False  # conservative: treat parse failure as unsafe
    if is_safe and batches and batches[-1].is_safe:
      batches[-1].calls.append(call)  # extend existing safe batch
    else:
      batches.append(Batch(calls=[call], is_safe=is_safe))
  return batches

And the dispatch loop that processes batches:
async function dispatch_all(batches: Batch[], context) -> Result[]:
  results = []
  for batch in batches:
    if batch.is_safe:
      batch_results = await gather(*[run_tool(c, context) for c in batch.calls])
    else:
      batch_results = []
      for call in batch.calls:
        batch_results.append(await run_tool(call, context))
    results.extend(batch_results)
  return results

A critical property: safe groups on either side of an unsafe tool are never merged. If the model requests [search_A, search_B, write_C, search_D, search_E], the batches are [search_A, search_B], [write_C], [search_D, search_E] (three separate batches, processed in that order). The second safe group doesn't execute until write_C completes. Order is preserved across the full result list.
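The partitioning rule can be exercised as a runnable Python sketch (the `Batch` shape and the string-based safety predicate are simplifications for illustration):

```python
from dataclasses import dataclass

@dataclass
class Batch:
    calls: list
    is_safe: bool

def partition_tool_calls(calls, is_safe_fn):
    """Consecutive safe calls share a batch; each unsafe call stands alone."""
    batches = []
    for call in calls:
        try:
            safe = is_safe_fn(call)
        except Exception:
            safe = False  # conservative: classification failure is unsafe
        if safe and batches and batches[-1].is_safe:
            batches[-1].calls.append(call)  # extend existing safe batch
        else:
            batches.append(Batch(calls=[call], is_safe=safe))
    return batches

calls = ["search_A", "search_B", "write_C", "search_D", "search_E"]
batches = partition_tool_calls(calls, lambda c: c.startswith("search"))
```

Running this yields exactly the three batches described above: the two safe groups stay separated by the write.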
The maximum number of tools that can run concurrently within a safe batch is capped (a default of 10, configurable via environment variable). This prevents a single model response with 50 safe tool calls from overwhelming downstream services.
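One way to implement such a cap, sketched with asyncio (the environment variable name and default are assumptions, not the real configuration surface):

```python
import asyncio
import os

# Hypothetical cap; the real default and env var name may differ.
MAX_CONCURRENT_TOOLS = int(os.environ.get("MAX_CONCURRENT_TOOLS", "10"))

async def gather_capped(coros, limit=MAX_CONCURRENT_TOOLS):
    """Run awaitables concurrently with at most `limit` in flight at once."""
    semaphore = asyncio.Semaphore(limit)

    async def run(coro):
        async with semaphore:
            return await coro

    return await asyncio.gather(*(run(c) for c in coros))

async def _demo():
    async def fake_tool(i):
        await asyncio.sleep(0)  # stand-in for real tool work
        return i * 2
    return await gather_capped([fake_tool(i) for i in range(25)])

results = asyncio.run(_demo())
```

The semaphore preserves result order (gather returns results in input order) while bounding how many downstream requests are in flight simultaneously.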
Tool Lifecycle
Every tool call goes through a fixed sequence of phases before the result is returned to the model. Understanding this lifecycle matters because each phase can fail, and each failure produces a tool_result with is_error: true that the model sees and can act on.
Phase 1: Schema validation (shape and types)
The raw input from the model is parsed against the tool's declared schema. If the model provided arguments of the wrong type, passed an unknown field, or omitted a required field, this phase fails and returns an error result. The error message includes the schema mismatch details, giving the model enough information to retry with corrected input.
Phase 2: Semantic validation (business logic)
Each tool can implement an optional validate_input method that runs after schema parsing succeeds. This phase validates business logic: does the file path exist? Is the command in the allow list? Are the date ranges valid? Like schema failures, semantic failures return an error result with an explanatory message.
When schema validation succeeds, semantic validation always runs next: schema success never short-circuits the semantic check. Together they form a two-layer validation approach:
async function execute_tool(tool, call, context) -> ToolResult:
  # Phase 1: Schema validation (shape and types)
  parsed = tool.input_schema.safe_parse(call.input)
  if not parsed.success:
    return error_tool_result(
      call.id,
      f"InputValidationError: {parsed.error}"
    )
  # Phase 2: Semantic validation (business logic)
  validation = await tool.validate_input(parsed.data, context)
  if not validation.result:
    return error_tool_result(
      call.id,
      f"ValidationError: {validation.message}"
    )
  # Execute with validated, semantically-checked input
  return await tool.call(parsed.data, context)

Phase 3: Permission check
After both validation phases pass, the permission system evaluates whether this tool call is allowed to proceed. Permissions are checked on validated input (the parsed, semantically-checked arguments) so classifiers and permission rules see the same structured data that the tool will receive.
Phase 4: Pre-tool hooks
Before the tool executes, any registered pre-tool hooks run. Hooks can inspect the tool call, inject additional context, modify the input, or block execution. A hook that blocks returns an error result without ever calling the tool function.
Phase 5: Execute
The tool function runs with the validated and possibly hook-modified input.
Phase 6: Post-tool hooks
After execution completes, post-tool hooks run. They observe the result but typically cannot modify it. They're used for logging, analytics, and side effects.
Each validation failure and each hook rejection produces a properly formatted tool_result message, so the model always receives a complete response for every tool_use it issued. An incomplete response (a tool_use with no matching tool_result) causes an API error on the next request.
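A sketch of that invariant in Python, filling any gap with an error result so every tool_use has a matching tool_result (the message shapes are simplified placeholders):

```python
def complete_tool_results(tool_uses, results_by_id):
    """Return one tool_result per tool_use, in order, filling any gap
    with an is_error result so the next API request stays valid."""
    completed = []
    for use in tool_uses:
        result = results_by_id.get(use["id"])
        if result is None:
            result = {
                "type": "tool_result",
                "tool_use_id": use["id"],
                "is_error": True,
                "content": "Tool call produced no result.",
            }
        completed.append(result)
    return completed
```

Whether a tool failed validation, was blocked by a hook, or crashed outright, the model still receives a complete, ordered set of results.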
Interrupt behavior
Every tool can declare what happens when a user submits a new message while the tool is still running:
- 'cancel': stop the tool immediately and discard its result
- 'block': keep running, and the user's new message waits until the tool finishes
The default is 'block'. Long-running read operations typically use 'block' because stopping mid-read could leave the agent in an inconsistent state. Destructive operations might use 'cancel' because if the user explicitly wants to stop a delete operation, stopping it is the right call.
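The two behaviors can be sketched with asyncio tasks (the function and event names here are illustrative, not the real interface):

```python
import asyncio

async def run_with_interrupt(tool_task, user_interrupt, behavior):
    """'block': the tool always runs to completion; 'cancel': a user
    interrupt cancels the tool and discards its result."""
    if behavior == "block":
        return await tool_task  # new user message waits on this
    interrupt = asyncio.ensure_future(user_interrupt.wait())
    done, _ = await asyncio.wait({tool_task, interrupt},
                                 return_when=asyncio.FIRST_COMPLETED)
    if tool_task in done:
        interrupt.cancel()
        return tool_task.result()
    tool_task.cancel()  # discard the in-flight result
    return None

async def _demo():
    async def slow_tool():
        await asyncio.sleep(60)
        return "done"
    interrupted = asyncio.Event()
    interrupted.set()  # user message arrives immediately
    cancelled = await run_with_interrupt(
        asyncio.create_task(slow_tool()), interrupted, "cancel")
    blocked = await run_with_interrupt(
        asyncio.create_task(asyncio.sleep(0, result="done")), interrupted, "block")
    return cancelled, blocked

cancelled, blocked = asyncio.run(_demo())
```

With 'cancel', the interrupt wins the race and the tool's result is dropped; with 'block', the tool finishes regardless of the pending interrupt.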
Behavioral Flag Composition
The behavioral flags on a tool are declarations, not enforcement mechanisms. The dispatcher, permission system, and UI read these flags to make routing decisions. The flags themselves don't restrict anything. This separation matters because it means enforcement is centralized and auditable.
The key flags and how they compose:
is_concurrency_safe(input): Runtime function, not a boolean field. Called by the dispatcher at dispatch time with the actual parsed input. Returns true if it's safe to run this specific invocation alongside other concurrent tools. When in doubt, return false. The performance cost of unnecessary serialization is much lower than the correctness cost of unintended concurrent writes.
is_read_only(input): Declares that the tool does not modify any persistent state. Used by the permission system to make fast-path allow decisions. A tool marked is_read_only may still require permission for other reasons (policy rules, explicit ask rules). It's one input to the permission decision, not a bypass.
is_destructive(input): Declares that the tool performs an irreversible operation: delete, overwrite, send. Default is false. The permission system and UI use this to surface extra confirmation. Only set to true when the operation genuinely cannot be undone: file deletion, email sending, database record removal.
interrupt_behavior(): Declares the cancel-or-block behavior described in the lifecycle section. This flag is read by the UI layer to determine whether a running tool can be stopped by the user.
requires_user_interaction(): Declares that the tool must interact with the user directly (for example, showing a dialog). Tools with this flag should not be called in non-interactive contexts (batch mode, background agents). The dispatcher checks this flag before execution and returns an error result if the session doesn't support interaction.
should_defer / always_load: Two ends of the tool visibility spectrum. should_defer marks a tool as deferred: its schema is not included in the initial model prompt. The model must explicitly search for and load the tool before calling it. always_load is the opposite: the tool's schema always appears in the prompt even when tool deferral is enabled for everything else. Use always_load for tools the model must discover on turn 1 without a search round-trip.
The composition rule at dispatch time: the dispatcher reads is_concurrency_safe to partition batches. The permission system reads is_read_only, is_destructive, and requires_user_interaction to make per-call permission decisions. The UI reads interrupt_behavior and is_destructive to determine what controls to show the user. No single system reads all the flags. Each system reads only what it needs.
Dynamic Tool Sets
Tools don't have to be static across the lifetime of an agent session. The tool context carries an optional refresh_tools callback. At the end of each loop iteration, after all tool results from that turn are complete, the loop calls refresh_tools() and compares the result to the current tool list. If they differ, the next iteration starts with the updated tool list.
type ToolContext = {
  options: {
    tools: Tool[]
    refresh_tools: () -> Tool[] | None  # optional callback
    # ...
  }
  # ...
}

# At end of each iteration, after tool results are complete:
if context.options.refresh_tools:
  fresh_tools = context.options.refresh_tools()
  if fresh_tools != context.options.tools:
    context = context.with(options=context.options.with(tools=fresh_tools))
    # Next iteration sees the updated tool list

The key invariant: tools are immutable within a single iteration, potentially different on the next one. This is what makes dynamic tool sets safe. The model receives a consistent tool list for its entire response in a given turn. It can't be in the middle of requesting tools that are about to disappear.
This pattern enables several important capabilities:
- MCP server connections mid-session: When an MCP server connects after the session starts, refresh_tools returns the new tool list and the agent immediately has access to the new tools on the next turn.
- Conditional tool availability: The is_enabled() flag on each tool gates whether it's included in the current tool list. Tools can be temporarily unavailable based on session state.
- Permission-based filtering: The dispatcher filters tools at dispatch time based on permission rules. A tool that the user has blocked won't be offered to the model even if it's in the registered list.
- Deferred tools: Tools marked should_defer aren't included in the initial prompt. They only appear after the model explicitly searches for them and loads their schema. This keeps the initial prompt compact when the agent has access to hundreds of tools.
Schema and the LLM
The schema is how the model knows what to call. When the agent loop presents the model with a list of available tools, each tool's schema becomes part of the prompt. The model reads it and decides whether calling this tool would help accomplish the current task, and if so, what arguments to pass.
Schemas use JSON Schema format: type definitions, required fields, descriptions, and examples. The description field on each argument is especially important: it's the model's only guidance about what the argument means. Write argument descriptions as if the model is going to read them cold, with no other context. Because it will.
There's a sharp edge worth knowing: most LLM APIs that accept tool schemas do not support the full JSON Schema 2020-12 specification. In particular, they don't support $ref, $defs, or allOf (the composition mechanisms that JSON Schema uses to share definitions between fields). This means schemas with nested types often need post-processing before being sent to the API: all references must be inlined. If your tool schemas use Pydantic models or other schema-generating libraries, check whether they generate $ref-based schemas that will need flattening before dispatch.
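A minimal sketch of the flattening step, handling only local `#/$defs/...` references and assuming no cycles (a real implementation would also need to handle allOf and recursive definitions):

```python
def inline_refs(schema: dict) -> dict:
    """Inline local $defs references so the schema contains no $ref nodes.
    Minimal sketch: '#/$defs/Name' references only, no cycle detection."""
    defs = schema.get("$defs", {})

    def resolve(node):
        if isinstance(node, dict):
            ref = node.get("$ref")
            if isinstance(ref, str) and ref.startswith("#/$defs/"):
                return resolve(defs[ref.split("/")[-1]])
            # Drop the $defs block itself; its contents get inlined in place.
            return {k: resolve(v) for k, v in node.items() if k != "$defs"}
        if isinstance(node, list):
            return [resolve(v) for v in node]
        return node

    return resolve(schema)
```

Running a Pydantic-style `$ref` schema through this produces a self-contained schema that reference-intolerant APIs will accept.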
Note: The two-phase validation pattern means schema errors and semantic errors are reported separately to the model, giving it better signal for retry. A schema error means "I passed the wrong type." A semantic error means "I passed the right type but the value was invalid." The model can use these distinct signals to construct a more targeted correction.
Fail-Closed Defaults
What happens when a tool forgets to declare a behavioral flag?
In most systems, a missing value means the default is permissive: undefined behavior is allowed. In a tool system, the opposite is safer. A missing flag should default to the most restrictive value.
A tool that doesn't declare is_concurrency_safe is treated as not concurrency-safe. A tool that doesn't declare requires_permission is treated as requiring permission. A tool that doesn't declare a concurrency class is treated as unsafe.
This is a deliberate design choice. The cost of treating a safe tool as unsafe is a small performance hit: it runs serially when it could have run in parallel. The cost of treating an unsafe tool as safe is data corruption, permission bypass, or worse. The asymmetry is obvious once you name it.
Note: Fail-closed defaults don't require special framework support. They're implemented as simple default values in the metadata schema:
is_concurrency_safe = false, requires_permission = true. Any tool that explicitly overrides these defaults is making an affirmative claim that it's safe to relax them.
The build_tool factory function pattern centralizes these defaults:
TOOL_DEFAULTS = {
  is_enabled: () -> True,
  is_concurrency_safe: (_input) -> False,  # fail-closed
  is_read_only: (_input) -> False,         # fail-closed
  is_destructive: (_input) -> False,
  check_permissions: (_input, _ctx) -> allow(),
}

function build_tool(definition: ToolDef) -> Tool:
  return { ...TOOL_DEFAULTS, ...definition }

Every tool definition goes through build_tool. Any field the definition omits gets the conservative default. A tool that explicitly implements is_concurrency_safe to return True for certain inputs is making an affirmative safety claim, one that will be evaluated at runtime.
Production Considerations
Sibling abort: one tool failing can cancel its concurrent siblings
When tools run in a concurrent batch, one tool erroring mid-execution doesn't necessarily mean the others should finish. A child abort controller (distinct from the session-level abort controller) signals all concurrently running tools in the same batch to stop. The parent session is not aborted. Only the current tool batch is cancelled. Without this, a failing tool in a concurrent batch would allow its siblings to run to completion, wasting time and potentially producing results that will never be used.
Result size offload prevents context window monopolization
Every tool declares a max_result_size_chars field. When a tool result exceeds this limit, the content is persisted to a temporary file and the model receives a preview (the file path and a sample of the content) instead of the full result. This prevents a single large tool result (reading a 1MB log file, for example) from consuming most of the context window and crowding out other messages.
Tools that should never be offloaded (typically tools whose output is already bounded by their own limits) set max_result_size_chars = Infinity to opt out explicitly. This avoids a circular problem: a file-reading tool that offloads to a file would create a situation where the model reads the offload file with the same tool and risks offloading again.
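A sketch of the offload decision in Python (the preview length and message format are assumptions; the real system's wording and file layout will differ):

```python
import os
import tempfile

MAX_PREVIEW_CHARS = 500  # hypothetical preview length

def maybe_offload(result: str, max_result_size_chars: float) -> str:
    """If a tool result exceeds its declared size limit, persist it to a
    temp file and return a short preview pointing at the full content."""
    if len(result) <= max_result_size_chars:
        return result
    fd, path = tempfile.mkstemp(suffix=".txt", prefix="tool-result-")
    with os.fdopen(fd, "w") as f:
        f.write(result)
    preview = result[:MAX_PREVIEW_CHARS]
    return (f"Result too large ({len(result)} chars); full content saved "
            f"to {path}. Preview:\n{preview}")
```

Passing `float("inf")` as the limit reproduces the explicit opt-out: the result is always returned inline, never offloaded.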
Tool aliases enable backward-compatible renames
The tool interface supports an optional aliases field, a list of alternative names the tool will respond to in addition to its primary name. When the dispatcher looks up a tool by name, it checks both the primary name and the alias list.
This solves a real versioning problem: old conversation transcripts reference tools by name. If you rename a tool, replaying those transcripts would generate "no such tool" errors for every call to the old name. With aliases, you can rename KillShell to TaskStop and add KillShell to the alias list. Old transcripts continue to work without modification.
The fallback path for alias resolution is intentionally narrow: it only activates when the alias-matching tool is found in the global base tools registry, not the current session's tool list. This prevents unauthorized alias injection, where an external party claims to have a tool that matches an alias in hopes of hijacking the resolution path.
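The lookup order can be sketched as follows (the dict-based tool shape is a simplification; the point is that alias matching consults only the base registry):

```python
def find_tool(name, session_tools, base_registry):
    """Resolve a tool by primary name first; fall back to aliases only
    when the aliased tool exists in the global base registry."""
    for tool in session_tools:
        if tool["name"] == name:
            return tool
    for tool in base_registry:
        if name in tool.get("aliases", []):
            return tool  # e.g. an old transcript calling "KillShell"
    return None
```

A session-provided tool can never capture an alias it doesn't own, because the alias pass never looks at the session list.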
Input-dependent concurrency: treat parse failure as unsafe
Because is_concurrency_safe is called with parsed input, it can fail if the input fails to parse. The conservative handling (treating any exception in is_concurrency_safe as false) prevents a subtle class of bugs. If the function throws, it might be because the input is malformed in a way that makes concurrency unsafe. Defaulting to serial execution in that case is the right call: the performance cost is minimal, and the safety guarantee is preserved.
The broader pattern: never let a failure in the concurrency classification function cause optimistic concurrent dispatch. The asymmetry of correctness (serializing when concurrent would have been fine) vs. incorrectness (running concurrently when serial was required) strongly favors the conservative path.
Best Practices
Do declare is_concurrency_safe as a function, not a constant. A tool that is sometimes safe and sometimes not (depending on what it's called with) must inspect the actual input at dispatch time. A constant is_concurrency_safe = True that ignores input is dangerous. A read_file tool can return True. A bash tool must inspect the command.
Don't ignore semantic validation. Schema validation catches type errors. Semantic validation catches correctness errors. A tool that validates only schema will accept delete_file(path="/") as valid because the path is a string. Semantic validation is where you check that the path exists, is in scope, and doesn't point at something critical.
Do keep tool schemas narrow. The model hallucinates arguments it wasn't told about. A schema with 12 optional fields invites the model to pass fields that shouldn't be set. Design schemas with the minimum fields needed for each use case. If two use cases need genuinely different shapes, consider two tools.
Don't use is_destructive as a permission bypass. The is_destructive flag changes how the UI presents the operation: it triggers confirmation dialogs and highlights in the tool use view. It does not skip the permission check. A destructive tool still goes through the full permission lifecycle.
Do set max_result_size_chars on every tool. Never leave it unset (which would default to a potentially huge or zero limit). Pick a value appropriate to the expected output size. Tools that produce bounded output (a status code, a count) can set this high. Tools that read arbitrary file content should set it to something like 100KB-500KB to avoid context overflow.
Don't name tools vaguely. The model reads tool names and descriptions to decide which tool to call. "process_data" tells the model nothing. "extract_csv_rows" tells the model exactly when to reach for it. Precise naming reduces hallucination and improves first-call accuracy.
Do use always_load sparingly. Every tool with always_load increases the size of the initial prompt. In sessions with dozens of tools, that adds up. Reserve always_load for tools the model genuinely needs on turn 1: startup checks, user interaction tools, tools required for the model to discover other tools.
Don't forget to wire interrupt_behavior. The default is 'block', which is correct for most read operations. For long-running operations that the user might want to cancel (shell commands, network requests, file uploads) decide explicitly whether 'cancel' or 'block' is appropriate. A tool with 'block' will prevent the user from interrupting a runaway operation.
Related
- Agent Loop Architecture: The loop is what dispatches tool calls on each turn. Understanding how the loop calls tools, when it dispatches, how it handles errors, and what triggers the next turn gives the tool system its execution context.
- Safety and Permissions: The permission system gates every tool call between validation and execution. This page explains the full permission cascade, graduated trust, and the classifier patterns that make auto-approval safe.
- Error Recovery: Tool errors are one of the most common recovery scenarios in agent systems. This page covers retry strategies, circuit breakers, and how to design tools that signal recoverable vs. unrecoverable failures.
- Streaming and Events: During streaming, tool use blocks arrive incrementally. Understanding the streaming model explains why tool dispatch must handle partial inputs, why the is_concurrency_safe check happens at the point of full input availability, and how concurrent tool results are interleaved in the event stream.
- Command and Plugin Systems: Commands are the user-facing layer built on top of the tool system. The lazy-loading registry, metadata-first registration, and multi-source merge patterns extend the same principles that govern tool registration.
- MCP Integration: MCP bridges external tools into the agent's tool system. The tool bridge pattern constructs Tool objects structurally identical to built-in tools, so the dispatcher handles them with the same concurrency and permission logic.
- Hooks and Extensions: Hooks wrap tool execution at the PreToolUse and PostToolUse lifecycle events. Understanding the tool lifecycle phases makes hook timing and interception points clearer.
- Multi-Agent Coordination: In multi-agent systems, tool sets are partitioned between coordinator and workers. The coordinator gets orchestration tools while workers get domain tools: the same registry, different filtered views.
- Pattern Index: All patterns from this page in one searchable list, with context tags and links back to the originating section.
- Glossary: Definitions for all domain terms used on this page, from agent loop primitives to memory system concepts.