A user asks your agent a question. The agent makes an LLM call, waits for the full response, dispatches a tool, waits again, makes another LLM call, waits again, and finally returns the answer. The user stares at a blank screen for 15 seconds. They wonder if the agent is stuck. They consider reloading the page.
Streaming fixes this. Instead of waiting for the complete response, the agent yields events as they happen: text tokens as they generate, tool calls as they dispatch, results as they arrive. The user sees text appearing, tools being called, progress being made. Perceived latency drops from seconds to milliseconds, and the user can read the answer while it is still being generated.
But naive streaming (printing raw tokens to the screen) misses the architectural opportunity. A well-designed event system makes agent output composable, observable, and safe under load. This guide builds that system: a typed event model, a streaming agent loop, a consumer that renders events progressively, and backpressure handling for when the consumer cannot keep up.
The Event Model
The first decision is what to stream. Raw text chunks are not enough. The consumer needs to know whether it is receiving agent text, a tool call notification, a tool result, or a completion signal. Without this distinction, the consumer has to guess what each chunk means, and guessing leads to rendering bugs.
The solution is a typed event model: a set of named event types that the agent loop produces and consumers handle:
type AgentEvent =
| TextDelta { turn_id: str, text: str }
| ToolDispatch { turn_id: str, tool: str, args: dict }
| ToolResult { turn_id: str, tool: str, result: str, success: bool }
| Complete { turn_id: str, final_text: str }
| ErrorEvent { turn_id: str, error: str }

Each event carries a turn_id that groups events from the same agent turn. A consumer can use the turn_id to associate a ToolResult with the ToolDispatch that started it, or to know which TextDelta events belong to the final answer versus an intermediate reasoning step.
The event model makes streaming composable. Any number of consumers can subscribe to the same event stream: a UI renderer, a logging system, a supervisor agent. Each consumer handles the event types it cares about and ignores the rest. Adding a new consumer requires zero changes to the agent loop.
Modify the Agent Loop
The standard agent loop from the quickstart returns a string. A streaming agent loop returns an async generator that yields events:
async function* streaming_agent_loop(question: str, tools: list) -> AsyncIterator[AgentEvent]:
messages = [system_message(prompt), user_message(question)]
for turn in range(max_turns):
turn_id = generate_id()
# Stream the LLM response: yield text deltas as they arrive
full_response = empty_response()
async for chunk in llm.stream(messages, tools=tools):
if chunk.text:
yield TextDelta(turn_id=turn_id, text=chunk.text)
full_response = merge(full_response, chunk)
# Check termination
if full_response.tool_calls is empty:
yield Complete(turn_id=turn_id, final_text=full_response.text)
return
# Dispatch tools and yield events for each
messages.append(full_response)
for call in full_response.tool_calls:
yield ToolDispatch(turn_id=turn_id, tool=call.name, args=call.args)
result = await dispatch_tool(call.name, call.args)
yield ToolResult(
turn_id=turn_id,
tool=call.name,
result=truncate(str(result), max_chars=500),
success=not result.is_error
)
messages.append(tool_result_message(call.id, result))
yield ErrorEvent(turn_id="overflow", error="agent exceeded max_turns")

Two changes from the standard loop:
llm.stream() instead of llm.call(). The streaming API returns an async iterator of chunks instead of a complete response. Each chunk may contain a text fragment, and we yield a TextDelta event for each one. The chunks are also merged into full_response so we can check for tool calls after the stream completes.
yield instead of return. The function is an async generator (async function*). It yields events throughout execution and only returns (implicitly, at the end) when the agent completes. The caller consumes events as they arrive rather than waiting for the final result.
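Consuming such a generator is a plain async for loop. The runnable sketch below uses a stub producer in place of the real streaming loop (the two-delta stream is invented for illustration) to show the handoff:

```python
import asyncio
from dataclasses import dataclass

@dataclass
class TextDelta:
    turn_id: str
    text: str

@dataclass
class Complete:
    turn_id: str
    final_text: str

# Stub producer: stands in for streaming_agent_loop.
async def fake_stream():
    yield TextDelta(turn_id="t1", text="Hello, ")
    yield TextDelta(turn_id="t1", text="world.")
    yield Complete(turn_id="t1", final_text="Hello, world.")

async def main():
    parts = []
    async for event in fake_stream():
        if isinstance(event, TextDelta):
            parts.append(event.text)   # a real UI would render here
        elif isinstance(event, Complete):
            return "".join(parts), event.final_text

streamed, final = asyncio.run(main())
assert streamed == final  # the deltas reassemble into the final text
```

The caller starts doing useful work at the first TextDelta instead of waiting for Complete.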
Build a Consumer
A consumer processes the event stream and does something useful with each event: rendering to a UI, logging to disk, or forwarding over a network connection.
The following renders events to a terminal UI:
async function render_to_terminal(event_stream: AsyncIterator[AgentEvent]):
async for event in event_stream:
match event:
TextDelta:
terminal.write(event.text) # append text as it arrives
ToolDispatch:
terminal.write_line(f"\n Calling {event.tool}...")
ToolResult:
if event.success:
terminal.write_line(f" {event.tool} completed.")
else:
terminal.write_line(f" {event.tool} failed.")
Complete:
terminal.write_line("\n---\nDone.")
ErrorEvent:
terminal.write_line(f"\nError: {event.error}")

The consumer does not know how the events were produced. It does not know about the agent loop, the LLM, or the tools. It only knows the event types and what to do with each one. This decoupling is the value of the typed event model.
You can have multiple consumers processing the same stream. A logging consumer records every event to disk. A supervisor consumer watches for anomalies (too many tool calls, increasing cost). A network consumer serializes events to a WebSocket or SSE connection. Each consumer subscribes independently:
async function fan_out(event_stream: AsyncIterator[AgentEvent], consumers: list):
async for event in event_stream:
for consumer in consumers:
await consumer.handle(event)

Handle Backpressure
The producer (the agent loop) can generate events faster than the consumer can process them. A TextDelta event arrives every few milliseconds during streaming. A slow UI renderer or a network consumer with latency can fall behind. This is backpressure: the consumer pushing back against a producer that runs too fast.
Three strategies exist:
No buffer (blocking producer). The generator suspends at each yield until the consumer calls next(). The producer never runs ahead of the consumer. Maximum safety, but the producer is throttled to the consumer's speed.
Bounded buffer. The producer runs ahead up to N events, then blocks. This absorbs consumer jitter (a consumer that processes events in bursts). The buffer size is the explicit trade-off: larger means smoother throughput, but also more memory use and more events potentially lost if the consumer crashes.
Unbounded buffer. The producer never blocks. All events are queued immediately. Maximum throughput but unbounded memory use. Safe only when the consumer is reliably faster than the producer.
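The first strategy is what a plain async generator gives you for free: each yield suspends the producer until the consumer requests the next event. A minimal demonstration (the counting producer is invented for illustration):

```python
import asyncio

produced = []

async def producer():
    # An async generator suspends at each yield; it cannot run
    # ahead of the consumer that drives it.
    for i in range(3):
        produced.append(i)
        yield i

async def main():
    gen = producer()
    first = await gen.__anext__()
    # Only one event exists so far: the producer is parked at its
    # first yield, not racing ahead of us.
    assert produced == [0]
    rest = [e async for e in gen]
    return [first] + rest

events = asyncio.run(main())
assert events == [0, 1, 2]
```

This is the zero-buffer baseline the other two strategies relax.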
For UI streaming, the standard choice is a bounded buffer:
async function buffered_consumer(event_stream: AsyncIterator[AgentEvent], buffer_size: int = 20):
buffer = AsyncQueue(maxsize=buffer_size)
async function fill():
async for event in event_stream:
await buffer.put(event) # blocks if buffer is full
await buffer.put(SENTINEL)
spawn_background(fill)
while True:
event = await buffer.get()
if event is SENTINEL:
return
yield event

A buffer of 20 events handles the bursty render patterns of a typical UI without risking memory exhaustion. The producer blocks if it generates more than 20 events before the consumer processes any, which means a slow consumer throttles the producer naturally instead of letting the buffer grow without bound.
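The pattern maps directly onto asyncio.Queue, whose maxsize provides the bounded buffer and whose put() blocks when the queue is full. A runnable sketch (the three-event source is a stand-in for a real agent stream):

```python
import asyncio

SENTINEL = object()

async def buffered(source, buffer_size=20):
    """Re-yield events from source through a bounded queue.

    The fill task blocks on put() when the queue is full, so a slow
    consumer throttles the producer instead of growing memory."""
    buffer: asyncio.Queue = asyncio.Queue(maxsize=buffer_size)

    async def fill():
        async for event in source:
            await buffer.put(event)  # blocks if the buffer is full
        await buffer.put(SENTINEL)

    task = asyncio.create_task(fill())
    try:
        while True:
            event = await buffer.get()
            if event is SENTINEL:
                return
            yield event
    finally:
        task.cancel()  # don't leak the fill task if the consumer bails

async def demo_source():
    for text in ["a", "b", "c"]:
        yield text

async def main():
    return [e async for e in buffered(demo_source(), buffer_size=2)]

assert asyncio.run(main()) == ["a", "b", "c"]
```

The try/finally matters in practice: if the consumer abandons the stream early, the background fill task must be cancelled rather than left blocked on a full queue.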
Tip: Buffer ToolDispatch events until the corresponding ToolResult arrives. Showing "calling search_files..." and then immediately "search_files failed" is worse UX than waiting a moment and showing the complete outcome. Batch tool lifecycle events in the consumer, not the producer.
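One way to implement that batching, assuming the dataclass event types from earlier (the pairing-by-tool-name logic here is illustrative, not part of the guide's loop):

```python
import asyncio
from dataclasses import dataclass

@dataclass
class ToolDispatch:
    turn_id: str
    tool: str
    args: dict

@dataclass
class ToolResult:
    turn_id: str
    tool: str
    result: str
    success: bool

async def render_tool_events(event_stream):
    """Hold each ToolDispatch until its ToolResult arrives, then emit
    one combined status line instead of two flickering ones."""
    pending: dict[str, ToolDispatch] = {}
    lines = []
    async for event in event_stream:
        if isinstance(event, ToolDispatch):
            pending[event.tool] = event  # hold; don't render yet
        elif isinstance(event, ToolResult):
            pending.pop(event.tool, None)
            status = "completed" if event.success else "failed"
            lines.append(f"{event.tool} {status}.")
    return lines

async def demo():
    yield ToolDispatch(turn_id="t1", tool="search_files", args={})
    yield ToolResult(turn_id="t1", tool="search_files",
                     result="3 hits", success=True)

assert asyncio.run(render_tool_events(demo())) == ["search_files completed."]
```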
Progressive Disclosure
Streaming is not just about speed. It is about building trust. A user who can see the agent working trusts the agent more than one who sees a loading spinner. Progressive disclosure means showing useful intermediate state:
- Text tokens as they generate. The user starts reading before the response is complete.
- Tool calls as they dispatch. The user sees which tools the agent is using and can assess whether the approach is reasonable.
- Partial results before completion. If the agent is synthesizing from multiple sources, show each source's contribution as it arrives.
The agent loop already yields events at the right granularity. The consumer decides how to render them progressively. A simple consumer might show all events in order. A sophisticated consumer might group tool events, batch rapid TextDelta events for smoother rendering, and show a summary line for each completed tool rather than the full result.
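Coalescing rapid TextDelta events is one such consumer-side refinement. The sketch below flushes accumulated text whenever a non-text event arrives; a real UI would more likely flush on a timer, but this simpler trigger keeps the example self-contained:

```python
import asyncio
from dataclasses import dataclass

@dataclass
class TextDelta:
    turn_id: str
    text: str

@dataclass
class Complete:
    turn_id: str
    final_text: str

async def coalesce_text(event_stream):
    """Merge consecutive TextDelta events into one, passing other
    events through unchanged."""
    pending = []
    async for event in event_stream:
        if isinstance(event, TextDelta):
            pending.append(event.text)
        else:
            if pending:
                yield TextDelta(turn_id=event.turn_id, text="".join(pending))
                pending = []
            yield event

async def demo():
    for t in ["Str", "eam", "ing"]:
        yield TextDelta(turn_id="t1", text=t)
    yield Complete(turn_id="t1", final_text="Streaming")

async def main():
    return [e async for e in coalesce_text(demo())]

events = asyncio.run(main())
assert len(events) == 2            # three deltas collapsed into one
assert events[0].text == "Streaming"
```

Because coalesce_text is itself an event stream, it composes with the buffered consumer from the backpressure section: transformations stack as generator pipelines.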
Error Events
Errors during streaming must not break the event contract. If the LLM call fails or a tool throws an exception, the consumer needs to know, but it should receive an ErrorEvent, not a raw exception that terminates the stream.
The streaming agent loop wraps errors and yields them as events:
async function* safe_streaming_loop(question: str, tools: list) -> AsyncIterator[AgentEvent]:
try:
async for event in streaming_agent_loop(question, tools):
yield event
except LLMError as error:
yield ErrorEvent(turn_id="error", error=f"LLM call failed: {error}")
except ToolError as error:
yield ErrorEvent(turn_id="error", error=f"Tool error: {error}")
except Exception as error:
yield ErrorEvent(turn_id="error", error=f"Unexpected error: {error}")

The consumer handles ErrorEvent like any other event type: displaying an error message, logging the failure, or triggering a retry. The stream contract is preserved regardless of what goes wrong inside the agent loop.
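The same wrapping pattern in runnable Python, with a stand-in inner loop that raises mid-stream (the LLMError class and failing generator are invented for illustration):

```python
import asyncio
from dataclasses import dataclass

@dataclass
class TextDelta:
    turn_id: str
    text: str

@dataclass
class ErrorEvent:
    turn_id: str
    error: str

class LLMError(Exception):
    pass

async def failing_loop():
    yield TextDelta(turn_id="t1", text="partial ")
    raise LLMError("rate limited")

async def safe_stream(inner):
    """Convert exceptions from the inner loop into ErrorEvents so
    the stream contract survives failures."""
    try:
        async for event in inner:
            yield event
    except LLMError as error:
        yield ErrorEvent(turn_id="error", error=f"LLM call failed: {error}")
    except Exception as error:
        yield ErrorEvent(turn_id="error", error=f"Unexpected error: {error}")

async def main():
    return [e async for e in safe_stream(failing_loop())]

events = asyncio.run(main())
# The stream ends with an ErrorEvent instead of an unhandled exception.
assert isinstance(events[-1], ErrorEvent)
assert "rate limited" in events[-1].error
```

Note that events yielded before the failure still reach the consumer; only the remainder of the turn is replaced by the ErrorEvent.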
Related
- Streaming and Events. The full streaming architecture: the complete event type system, priority-based dispatch for terminal UIs, capture/bubble phases, screen-diffing output models, and the generator connection pattern.
- Agent Loop. The base loop pattern that the streaming loop modifies. Understanding how llm.call() becomes llm.stream() and why the loop structure stays the same.
- Tool System. How tool dispatch integrates with the streaming loop, and why tool results are yielded as events.