
Prompt Architecture

How the static/dynamic boundary in agent prompts affects cost, latency, and consistency, with the full assembly pipeline, five-level priority chain, memory injection, and production insights for cache management.

Most developers write their first agent prompt the way they write a chat message: pile everything in, see what works, tune from there. That approach breaks down fast. An agent prompt isn't a message you send once. It's an input that gets reconstructed every turn, sent to an expensive API, and shapes every decision the model makes for the entire session. Prompt structure is not cosmetic.

The insight that changes how you think about prompts: they have two zones. One zone stays identical across every user, every session, every turn, and it can be cached. The other zone changes per-session or per-turn, and it cannot. Where you draw the line between these zones determines your cost, your latency, your behavior consistency, and how easily you can test your agent in isolation. This is an architectural decision, not just a wording choice.

The Two-Zone Model

The system prompt splits into two zones at a structural boundary:

Static zone: content that is identical for every user, every session, every turn:

  • Identity (the opening sentence that primes the model's behavioral profile)
  • Behavioral rules (scope, verbosity, safety constraints)
  • Tool usage instructions (how to use the available tools)
  • Numeric calibration (confidence thresholds, response length limits)

Dynamic zone: content that varies per-session or per-turn:

  • Session context (working directory, operating system, model name)
  • User memory (retrieved long-term facts from previous sessions)
  • Active tool availability (if tools connect and disconnect mid-session)
  • Token budget remaining (changes as the session progresses)

function build_system_prompt(session: Session) -> string:
  # Static zone: identical for every user, every session, every turn
  static_sections = [
    identity_section(),          # "You are an agent that..."
    behavioral_rules_section(),  # scope, verbosity, calibration
    tool_usage_section(),        # how to use the available tools
  ]

  # --- STATIC/DYNAMIC BOUNDARY ---

  # Dynamic zone: varies per session or per turn
  dynamic_sections = [
    session_context(session),    # working directory, OS, model name
    user_memory(session),        # retrieved long-term facts
    active_tools(session),       # changes if tools connect/disconnect
  ]

  return join(static_sections + dynamic_sections)

Think of the static zone as a function signature: clear contracts that don't change between calls. Think of the dynamic zone as function arguments: variable inputs for this particular call. Mixing them creates the same problems as side-effectful code that reads global state. You can't test the static part without instantiating the dynamic part, you can't cache without worrying about invalidation, and small dynamic changes pollute the stable contract.

The cache fragmentation problem. Prefix caching works by recognizing byte-identical prefixes across API calls. If any session-variable content appears inside the static zone (say, the working directory injected into the rules section), the cache key differs between users and sessions. With N session-variable bits interleaved in the static content, you get 2^N possible cache keys. Move all variable content below the boundary, and you get one cache key for the entire static prefix. Every call from every user on every turn hits the same cached prefix.
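A toy sketch of this effect in Python. The hash stands in for the provider's real prefix-cache key (which matches on byte-identical prefixes), and the prompt strings are illustrative:

```python
import hashlib

def cache_key(prefix: str) -> str:
    # Stand-in for the provider's prefix-cache key: a hash of the exact bytes.
    return hashlib.sha256(prefix.encode()).hexdigest()[:12]

STATIC = "You are an agent that helps users accomplish tasks.\nRules: maximum 3 sentences per explanation.\n"

def fragmented_prompt(cwd: str) -> str:
    # Anti-pattern: session-variable content interleaved in the static zone.
    return STATIC + f"Working directory: {cwd}\n" + "More static rules...\n"

def clean_static_prefix() -> str:
    # Fix: variable content moved below the boundary; the static prefix
    # is byte-identical for every user, session, and turn.
    return STATIC + "More static rules...\n"

sessions = ["/home/alice", "/home/bob", "/srv/ci"]
fragmented_keys = {cache_key(fragmented_prompt(cwd)) for cwd in sessions}
clean_keys = {cache_key(clean_static_prefix()) for _ in sessions}
print(len(fragmented_keys), len(clean_keys))  # 3 1
```

Three sessions produce three distinct keys when the working directory sits inside the static content, and a single shared key once it is moved below the boundary.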

Identity Design

The first sentence of the system prompt matters more than any other. The model reads identity before any rules, and identity primes the entire behavioral profile: how assertively the model speaks, how much it delegates vs. acts, how it frames uncertainty.

Different operational modes need different identities. The verbs encode behavior:

# Different identities prime different behaviors
assistant_identity = "You help users accomplish tasks by reading, searching, and editing files."
coordinator_identity = "You orchestrate complex tasks by delegating to specialized worker agents."
worker_identity = "You are an agent for code analysis. You read files and report findings."

"Helps users" means responsive, user-centric, conversational. "Orchestrates" means delegates, coordinates, doesn't do the work itself. "An agent for" means subordinate, task-scoped, narrow.

These aren't soft style choices. A coordinator agent that thinks it "helps users" will try to solve problems directly instead of delegating. A worker agent that thinks it "orchestrates" will spawn sub-agents instead of executing. Identity and role must match.

Identity belongs in the static zone. It never changes between sessions because it defines what kind of agent this is, not anything about the current session.

Calibration Through Numbers

Behavioral calibration expressed as explicit numeric constraints is more reliable than vague adjectives. The model doesn't share your definition of "concise" or "careful"; it interpolates one from training data, and that varies. Numbers are unambiguous.

Compare:

| Vague | Numeric |
| --- | --- |
| "Be concise" | "Maximum 3 sentences per explanation unless the user asks for more" |
| "Be careful with destructive operations" | "Only run destructive operations when confidence exceeds 0.95" |
| "Don't create too many files" | "Maximum 5 new files per task unless the user explicitly requests more" |
| "Ask clarifying questions when unsure" | "Ask at most 2 clarifying questions before attempting the task with your best interpretation" |

The numeric version is also easier to test: you can write a simple check that counts sentences, files created, or questions asked. The vague version requires subjective judgment.
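As a sketch of that testability claim, a naive sentence counter is enough for a regression check. The regex split is a rough heuristic, not a real sentence tokenizer:

```python
import re

MAX_SENTENCES = 3  # the numeric constraint from the prompt

def sentence_count(text: str) -> int:
    # Rough splitter: good enough to flag obvious violations.
    return len([s for s in re.split(r"[.!?]+\s*", text) if s.strip()])

def check_conciseness(reply: str) -> bool:
    return sentence_count(reply) <= MAX_SENTENCES

print(check_conciseness("Done. I edited two files. Tests pass."))  # True
print(check_conciseness("One. Two. Three. Four."))                 # False
```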

Calibration belongs in the static zone. It defines the agent's operating envelope: constraints that apply equally to all users and all sessions. If a particular user wants different limits, that's session-level configuration that goes in the dynamic zone as a preference, not a modification to the static rules.

The Prompt Assembly Pipeline

Building a prompt by concatenating strings is a recipe for fragile cache management. A better approach: a section registry where each piece of prompt content is registered as a named section with an explicit cache intent.

The two-function API encodes cache intent at registration time:

# Static zone: computed once, cached across turns (memoized)
identity_section = register_cached_section(
  name: "identity",
  compute: () => "You are an agent that helps users accomplish tasks..."
)

behavioral_rules_section = register_cached_section(
  name: "rules",
  compute: () => load_static_rules()
)

# Dynamic zone: recomputed every turn (cache-breaking).
# The verbose name forces the caller to justify why this section
# cannot be cached. It's intentional friction, not just naming style.
token_budget_section = register_volatile_section(
  name: "token_budget",
  compute: () => f"Remaining context: {get_remaining_tokens()} tokens",
  reason: "Token count changes every turn and cannot be cached"
)

memory_section = register_volatile_section(
  name: "user_memory",
  compute: () => load_memory_prompt(),
  reason: "Memory file content changes between sessions"
)

The naming convention for the volatile variant is itself a safety mechanism. Making the cache-breaking registration intentionally verbose (requiring a reason argument) means developers cannot reach for it casually. Every cache-breaking section has a documented justification. When you audit prompt performance, you read the reasons, not the code.
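A minimal Python sketch of such a registry, reusing the two function names from the examples above. The `Section` dataclass and module-level `REGISTRY` are illustrative details, not a prescribed implementation:

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class Section:
    name: str
    compute: Callable[[], str]
    volatile: bool
    reason: Optional[str] = None
    _memo: Optional[str] = field(default=None, repr=False)

    def render(self) -> str:
        if self.volatile:
            return self.compute()            # cache-breaking: recomputed every turn
        if self._memo is None:
            self._memo = self.compute()      # static: computed once, then memoized
        return self._memo

REGISTRY: list[Section] = []

def register_cached_section(name: str, compute: Callable[[], str]) -> Section:
    section = Section(name, compute, volatile=False)
    REGISTRY.append(section)
    return section

def register_volatile_section(name: str, compute: Callable[[], str], reason: str) -> Section:
    # `reason` is required: you cannot break the cache without writing down why.
    section = Section(name, compute, volatile=True, reason=reason)
    REGISTRY.append(section)
    return section

identity = register_cached_section("identity", lambda: "You are an agent that...")
budget = register_volatile_section(
    "token_budget",
    lambda: "Remaining context: 50000 tokens",
    reason="Token count changes every turn and cannot be cached",
)

# The audit "one-liner": every cache-breaking section with its justification.
print([(s.name, s.reason) for s in REGISTRY if s.volatile])
```

The final line is the audit entry point: a complete inventory of cache-breaking sections and their declared reasons.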

The five-level prompt priority chain. When building the final effective prompt, different modes and configurations need to override or append to the base. A priority chain resolves this cleanly:

function build_effective_prompt(config: PromptConfig) -> string:
  # Level 0: override prompt, replaces everything (specialized modes only)
  if config.override_prompt:
    return config.override_prompt + append_tail

  # Level 1: coordinator prompt (coordinator/orchestrator mode)
  if config.coordinator_prompt:
    base = config.coordinator_prompt

  # Level 2: agent system prompt (from agent definition)
  # Replaces or appends to default depending on agent mode
  elif config.agent_system_prompt:
    if config.agent_mode == "append":
      base = default_prompt + config.agent_system_prompt
    else:
      base = config.agent_system_prompt

  # Level 3: custom system prompt (user-provided flag)
  elif config.custom_system_prompt:
    base = default_prompt + config.custom_system_prompt

  # Level 4: default system prompt (base case)
  else:
    base = default_prompt

  # Append-tail: always appended, regardless of which level won
  return base + append_tail

The append-tail pattern deserves attention. It's a safety valve that injects content (memory correction hints, team policy additions, per-session overrides) outside the priority chain. Instead of modifying a static zone section (which would fragment cache keys) or overriding the entire prompt (which would lose the default loop behavior), you append to the tail. The tail is always present, always last, and independent of which level won the priority chain.

Memory as Dynamic Zone Content

Memory is not separate from the prompt. It is part of the prompt. Long-term facts, user preferences, and project context are injected into the dynamic zone each turn, updating the model's knowledge of the current session state.

This creates a direct coupling between the memory system and prompt structure: the memory system must respect prompt-layer constraints. A memory manifest that grows unboundedly will eventually consume a disproportionate share of the context budget before the conversation even begins. The manifest (a single index listing all memory files) must be capped at a line count and byte limit. Without these caps, a session with many long memory entries will exhaust its context budget before the agent can respond.

function load_memory_prompt(memory_dir: str) -> str:
  manifest = read_manifest(memory_dir)

  # Cap manifest: long manifests consume context budget before the conversation begins
  manifest = truncate_manifest(
    manifest,
    max_lines: 200,
    max_bytes: 25_000
  )

  facts = load_fact_files(manifest.entries)
  return format_memory_section(manifest, facts)

Why the byte cap in addition to the line cap? Manifests that are under the line limit but contain very long lines (long file paths, long descriptions) can still be large. The byte cap catches this failure mode.
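A sketch of `truncate_manifest` applying both caps, with the limits from the example above. Keeping the earliest entries is an assumption; a real implementation might rank by recency or relevance first:

```python
def truncate_manifest(manifest: str, max_lines: int = 200, max_bytes: int = 25_000) -> str:
    kept, size = [], 0
    # Line cap first: consider at most max_lines entries.
    for line in manifest.splitlines()[:max_lines]:
        line_bytes = len(line.encode("utf-8")) + 1  # +1 for the newline
        if size + line_bytes > max_bytes:
            break  # byte cap: catches under-line-limit manifests with very long lines
        kept.append(line)
        size += line_bytes
    return "\n".join(kept)

manifest = "\n".join(f"memory/entry-{i}.md: note {i}" for i in range(500))
print(len(truncate_manifest(manifest).splitlines()))  # 200

long_lines = "\n".join("x" * 1_000 for _ in range(100))  # under line cap, over byte cap
print(len(truncate_manifest(long_lines).encode()) <= 25_000)  # True
```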

Memory content flows into the dynamic zone through the volatile section registration path. This means memory changes are not cached. Each turn loads fresh memory content and includes it in the prompt. This is correct behavior: memory is updated between turns (by the background extractor), so stale cached memory would defeat the purpose.

This coupling has an important implication for prompt budget planning: the memory section is not a fixed cost. It grows as the user accumulates preferences, project decisions, and feedback across sessions. A user who has interacted with the agent for six months will have significantly more memory content than a new user. Design the context budget assuming the memory section can consume 5-15% of the effective context window in active sessions. The caps are your safety valve when it grows larger than expected.
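A back-of-envelope budget check for that planning range. The ~4-characters-per-token heuristic and the 15% ceiling are rough assumptions, not fixed constants:

```python
def estimated_tokens(text: str) -> int:
    # ~4 characters per token is a rough heuristic; real tokenizers vary.
    return len(text) // 4

def memory_within_budget(memory_text: str, context_window_tokens: int,
                         max_share: float = 0.15) -> bool:
    # Plan for the memory section taking up to max_share of the window
    # (the upper end of the 5-15% range); beyond that, rely on the manifest caps.
    return estimated_tokens(memory_text) <= context_window_tokens * max_share

print(memory_within_budget("user prefers tabs\n" * 100, 200_000))  # True
print(memory_within_budget("x" * 1_000_000, 200_000))              # False
```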

Production Considerations

Cache clear on compaction is a bounded but necessary cost. When the conversation compacts, the message history changes, so any memoized prompt sections that included message-dependent content (turn count, token budget, memory references) must be invalidated and re-evaluated. The cost is paid once per compaction, but it's real: every cached section is recomputed. If you use lazy initialization for heavy sections (expensive DB lookups, large file loads), factor this re-evaluation into the cost of compaction.
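A sketch of that invalidation behavior, with an evaluation counter to make the one-time cost visible. Class and method names are illustrative:

```python
class CachedSection:
    """Memoized prompt section with lazy initialization."""

    def __init__(self, compute):
        self.compute = compute
        self.evaluations = 0
        self._memo = None

    def render(self) -> str:
        if self._memo is None:
            self.evaluations += 1        # pay the (possibly heavy) cost once
            self._memo = self.compute()
        return self._memo

    def invalidate(self):
        self._memo = None                # forces re-evaluation on the next render

summary = CachedSection(lambda: "Conversation so far: ...")
summary.render()
summary.render()
print(summary.evaluations)   # 1: memoized across turns

# On compaction, message-dependent sections must be invalidated.
summary.invalidate()
summary.render()
print(summary.evaluations)   # 2: the bounded, one-time re-evaluation cost
```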

The verbose volatile registration name is a safety mechanism. Making the cache-breaking variant require a reason argument does more than enforce documentation. It creates friction that prevents developers from accidentally using it for content that could be static. In a team setting, the reason field is the first thing you read in a prompt audit. If the reason says "changes per user" but the content is actually the same for all users, that's an easy fix to find. Without the reason requirement, cache fragmentation silently accumulates across the team's contributions.

Proactive mode composition pattern. When an agent mode appends its instructions to the default prompt rather than replacing it, the agent gains domain-specific behavior on top of the default loop control structure. The anti-pattern is replacing the default prompt entirely, which loses the autonomous loop behavior embedded in the default identity and rules sections. Specialized agents should always append domain instructions. Only root-level loop modes should override the entire prompt.

Cache fragmentation math is not theoretical at team scale. With N dynamic bits interleaved in the static zone, you get 2^N cache keys. In a codebase maintained by a team, these bits accumulate over months: somebody adds a per-user flag here, a session variable there. Each addition doubles the cache keyspace. The section registry, with its explicit cache-intent declaration, prevents this because the static/dynamic boundary is enforced structurally, not by convention.

The two-function API also makes auditing tractable. When you need to understand why caching is underperforming, the section registry gives you a complete inventory: every registered section, its cache intent, and its declared reason if volatile. Without the registry, tracking down cache-breaking content means reading the entire prompt assembly code path and identifying variable substitutions by hand. With the registry, it's a one-liner to list all volatile sections and their reasons.

Best Practices

  • DO separate prompt content into individually registered sections with explicit cache intent (cached vs volatile)
  • DO put identity, rules, and calibration in the static zone. They're identical across users, sessions, and turns.
  • DO require a reason argument when registering a volatile section. It creates friction that prevents casual cache-breaking.
  • DO use the append-tail for memory hints, team policy additions, and per-session overrides
  • DO clear the section cache on compaction and session reset. Memoized sections must re-evaluate to pick up session-level changes.
  • DO cap memory manifest at line count and byte limit before injecting it into the dynamic zone
  • DON'T interleave variable content in the static zone. Each bit doubles the number of cache keys.
  • DON'T have agent modes replace the entire default prompt. Append domain instructions to preserve the loop control structure.
  • DON'T inject unbounded memory content into the prompt. The memory system and prompt structure share the same context budget.
Related Topics

  • Memory and Context: Memory determines what fills the dynamic zone. Retrieved long-term facts and session context are assembled each turn and injected into the prompt. The memory system's size constraints directly affect available context budget.

  • Agent Loop Architecture: The loop sends the assembled prompt at the start of every turn. The static zone is the portion the loop can cache across calls. The dynamic zone is rebuilt each turn.

  • Tool System: Tool descriptions are part of the static zone. Understanding how tool definitions are structured and how they consume static budget connects prompt architecture to tool design.

  • Error Recovery: Retry behavior can be calibrated in the static zone via numeric constraints. Understanding what goes in the prompt vs what goes in the retry configuration clarifies the division of responsibility.

  • Hooks and Extensions: Hooks are the primary extension mechanism for prompt assembly. The UserPromptSubmit and PreToolUse hooks can inject, modify, or gate prompt content at defined lifecycle points without touching the static/dynamic zone structure directly.

  • Pattern Index: All patterns from this page in one searchable list, with context tags and links back to the originating section.

  • Glossary: Definitions for all domain terms used on this page, from agent loop primitives to memory system concepts.