Orchestration: Single Agents to Multi-Agent Systems

One capable agent with good tools is the right default. This chapter covers when to go multi-agent, the four topologies, durable execution, and the failure modes nobody advertises.

Start with one agent

Multi-agent is fashionable; it is also the most common source of self-inflicted complexity. A single agent with a well-designed toolbox handles a remarkable share of production work, is dramatically easier to debug, and costs less: every extra agent multiplies model calls and re-transmitted context. Go multi-agent only when you can name the concrete force pushing you there:

Context isolation: subtasks individually overflow a window, or shouldn't see each other's data (research vs. billing).
Parallelism: independent subtasks where wall-clock time matters, such as wide research or multi-file edits.
Real specialization: genuinely different toolsets, models, or permissions per role, not just different prompt personas.
Organizational seams: different teams own different agents, or an external partner's agent must be invoked as a black box (where A2A earns its keep).

The four topologies

A - Supervisor (default): a supervisor over Agent A, Agent B, and Agent C; central routing + merge; easiest to debug.
B - Sequential pipeline: Research, then Draft, then Review; fixed hand-offs; deterministic order; context can thin at each hop.
C - Network / hand-off: Sales, Billing, Tech, Escalate; peer hand-offs (OpenAI SDK style); flexible, harder to audit.
D - Hierarchical: Director, Team lead 1, Team lead 2, workers (W); nested supervisors for very large tasks.

Figure 6.1. The four multi-agent topologies. Supervisor is the production default; choose others for a stated reason.

Supervisor (A) keeps routing, budgets, and result-merging in one accountable place; start here. Pipelines (B) suit role-shaped flows with a known order; their classic bug is context thinning, where nuance is lost at each hand-off, so pass structured briefs rather than chat summaries. Network hand-offs (C), the OpenAI SDK's signature move, let the agent best suited to the moment take over; this is flexible, but trace it well or you'll never reconstruct who decided what. Hierarchies (D) are supervisors of supervisors for very large jobs; each layer adds latency and another telephone-game hop, so they must justify themselves.
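
To make the supervisor topology concrete, here is a minimal Python sketch, assuming stub functions in place of model-backed agents. The names Brief, Result, Supervisor, research, and draft are hypothetical and belong to no framework, and the merge step is a plain string join where a real system would likely call a model. It shows the three things the supervisor owns: routing, a hard call budget, and merging, with hand-offs passed as structured briefs rather than chat summaries.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class Brief:
    """Structured hand-off: the goal, its limits, and pointers to sources."""
    goal: str
    constraints: tuple = ()
    artifacts: tuple = ()  # references to source material, not summaries of it


@dataclass
class Result:
    worker: str
    output: str


# A worker is any callable that takes a Brief and returns text.
Worker = Callable[[Brief], str]


@dataclass
class Supervisor:
    workers: dict
    max_calls: int = 10  # hard budget, owned by the supervisor
    calls_made: int = 0
    log: list = field(default_factory=list)

    def dispatch(self, name: str, brief: Brief) -> Result:
        if self.calls_made >= self.max_calls:
            raise RuntimeError("call budget exhausted")
        if name not in self.workers:
            raise KeyError(f"unknown worker: {name}")
        self.calls_made += 1
        result = Result(worker=name, output=self.workers[name](brief))
        self.log.append(result)  # one place records who produced what
        return result

    def run(self, plan: list) -> str:
        results = [self.dispatch(name, brief) for name, brief in plan]
        # Merge step: a plain join here; an assumption, not a recommendation.
        return "\n".join(f"[{r.worker}] {r.output}" for r in results)


# Stub workers standing in for model-backed agents.
def research(brief: Brief) -> str:
    return f"notes on {brief.goal} from {len(brief.artifacts)} source(s)"


def draft(brief: Brief) -> str:
    return f"draft for {brief.goal} ({', '.join(brief.constraints)})"


supervisor = Supervisor(workers={"research": research, "draft": draft}, max_calls=2)
report = supervisor.run([
    ("research", Brief(goal="pricing", artifacts=("doc-1", "doc-2"))),
    ("draft", Brief(goal="pricing", constraints=("under 200 words",))),
])
print(report)
```

Because every call passes through dispatch, the log doubles as a minimal audit trail, and a third call in this run would be refused by the budget check.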

State machines and durable execution

The orchestration insight that took the industry from demos to production is unglamorous: model the run as an explicit state machine with checkpoints. Frameworks like LangGraph build this in; Temporal-style durable-execution engines provide it underneath any framework. The payoff is enormous for real operations:

Diagram: the run proceeds through plan, fetch data, draft, human approve, execute, and log, writing checkpoints 1 through 6 along the way. After a process crash and restart, the run resumes from checkpoint 3, with no lost work and no double-billing. The red node (human approve) is a human-in-the-loop gate: the run pauses, state persists for hours or days, then resumes.

Figure 6.2. Durable execution: checkpoints make crashes boring and human approvals first-class.

Resumability: after a crash, deploy, or rate-limit storm, the run resumes from the last checkpoint instead of re-running (and re-billing) the whole job.
Human-in-the-loop as a state: the run pauses at an approval gate, persists for hours or days, and resumes on sign-off. This is essential for refunds, contracts, and payments.
Time travel for debugging: replay any run from any checkpoint with modified state to reproduce a failure.
Idempotency by design: checkpoint IDs become natural deduplication keys for side-effecting tools.
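
The properties above can be illustrated with a small hand-rolled state machine. This is a sketch under stated assumptions, not LangGraph's or Temporal's API: DurableRun, WaitingForApproval, the JSON file used as a checkpoint store, and the in-memory ledger standing in for a side-effecting external system are all hypothetical. It simulates a crash after a step's side effect but before its checkpoint is written, then shows that a resumed run neither repeats the effect nor loses its place, and that the approval gate is simply a persisted state.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional

STEPS = ["plan", "fetch_data", "draft", "human_approve", "execute", "log"]


class WaitingForApproval(Exception):
    """The run is paused at the human gate; its state is already persisted."""


class DurableRun:
    def __init__(self, run_id: str, store: Path, ledger: list):
        self.run_id = run_id
        self.path = store / f"{run_id}.json"
        self.ledger = ledger  # stands in for an external system with side effects
        if self.path.exists():
            self.state = json.loads(self.path.read_text())  # resume
        else:
            self.state = {"next": 0, "approved": False}  # fresh run

    def _checkpoint(self) -> None:
        self.path.write_text(json.dumps(self.state))

    def approve(self) -> None:
        self.state["approved"] = True
        self._checkpoint()

    def _apply(self, index: int, step: str) -> None:
        key = f"{self.run_id}:{index}:{step}"  # checkpoint-derived idempotency key
        if key not in self.ledger:  # the downstream side deduplicates replays
            self.ledger.append(key)

    def resume(self, crash_after: Optional[str] = None) -> None:
        while self.state["next"] < len(STEPS):
            index = self.state["next"]
            step = STEPS[index]
            if step == "human_approve" and not self.state["approved"]:
                raise WaitingForApproval(self.run_id)
            self._apply(index, step)
            if step == crash_after:
                raise RuntimeError(f"simulated crash after {step}")
            self.state["next"] = index + 1
            self._checkpoint()


store = Path(tempfile.mkdtemp())
ledger: list = []

try:
    DurableRun("run-42", store, ledger).resume(crash_after="draft")
except RuntimeError:
    pass  # the process died after drafting, before that step's checkpoint

run = DurableRun("run-42", store, ledger)  # a "new process" reloads state
try:
    run.resume()
except WaitingForApproval:
    pass  # paused at the gate; nothing needs to stay in memory

run = DurableRun("run-42", store, ledger)
run.approve()  # sign-off arrives, possibly days later
run.resume()
print(run.state["next"], len(ledger))
```

The draft step is attempted twice but lands in the ledger once, which is the behavior a payment or refund tool needs from its deduplication key.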

Multi-agent failure modes

| Failure | What it looks like | Countermeasure |
|---|---|---|
| Telephone-game loss | Each hand-off drops nuance; agent 4 solves a different problem than agent 1 was given | Pass structured briefs (goal, constraints, artifacts); let workers read source artifacts directly |
| Runaway loops | Two agents politely defer to each other forever; costs climb | Hard step and token budgets per run; loop detection; supervisor owns termination |
| Cost multiplication | A 5-agent flow re-sends shared context 5x per round | Shared state store instead of replayed transcripts; prompt caching (Ch. 9); fewer agents |
| Conflicting writes | Two workers update the same record divergently | Single-writer ownership per resource; merge step at the supervisor; optimistic locks on tools |
| Unattributable errors | Something went wrong and no one can say which agent did it | Per-agent tracing with run IDs end-to-end (Ch. 11); deterministic replay |
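
As one illustration of the runaway-loop countermeasure, the sketch below combines hard step and token budgets with a simple repeat counter. RunGuard, BudgetExceeded, and LoopDetected are hypothetical names, the token counts are made-up inputs (in practice they would come from the model provider's usage metadata), and counting identical (agent, action) pairs is only one possible loop heuristic.

```python
from collections import Counter
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    pass


class LoopDetected(Exception):
    pass


@dataclass
class RunGuard:
    max_steps: int = 20
    max_tokens: int = 50_000
    max_repeats: int = 3  # identical (agent, action) pairs tolerated per run
    steps: int = 0
    tokens: int = 0
    seen: Counter = field(default_factory=Counter)

    def record(self, agent: str, action: str, tokens_used: int) -> None:
        """Call once per agent step; raises when the run must be terminated."""
        self.steps += 1
        self.tokens += tokens_used
        self.seen[(agent, action)] += 1
        if self.steps > self.max_steps:
            raise BudgetExceeded(f"step budget of {self.max_steps} exceeded")
        if self.tokens > self.max_tokens:
            raise BudgetExceeded(f"token budget of {self.max_tokens} exceeded")
        if self.seen[(agent, action)] > self.max_repeats:
            raise LoopDetected(f"{agent} repeated {action!r}")


# Two agents politely deferring to each other forever.
guard = RunGuard(max_steps=50, max_repeats=3)
stopped_by = None
turn = 0
try:
    while True:
        agent, other = ("sales", "billing") if turn % 2 == 0 else ("billing", "sales")
        guard.record(agent, f"handoff_to:{other}", tokens_used=400)
        turn += 1
except LoopDetected:
    stopped_by = "loop"
except BudgetExceeded:
    stopped_by = "budget"

print(stopped_by, guard.steps, guard.tokens)
```

Keeping the guard in the supervisor, rather than in each worker, matches the table's advice that one component owns termination.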

Worked example: the research crew

A deep-research feature is the canonical legitimate multi-agent build: a planner decomposes the question, three searchers work strands in parallel with isolated contexts, a writer synthesizes from their structured notes, and a critic checks claims against sources before release. Anthropic's published account of building exactly this reported a multi-agent version strongly outperforming a single-agent baseline on breadth-heavy questions, while consuming several times the tokens. The pattern pays when the task is wide; it is overhead when the task is deep and sequential.
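
The shape of such a crew can be sketched with asyncio, assuming stub functions in place of model and search calls. This is not Anthropic's implementation: Note, CORPUS, SOURCES, and the functions plan, search, write, critic, and research_crew are hypothetical, and the critic's check (is the cited source in a known corpus?) is a deliberately crude stand-in for real claim verification.

```python
import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class Note:
    strand: str
    claim: str
    source: str  # empty string: the searcher found nothing to cite


CORPUS = {"doc-A", "doc-B"}  # sources the critic is able to verify
SOURCES = {"history": "doc-A", "current state": "doc-B"}  # "risks" has none


def plan(question: str) -> list:
    # Stub planner: a real one would ask a model to decompose the question.
    return ["history", "current state", "risks"]


async def search(question: str, strand: str) -> Note:
    # Each searcher receives only its own strand (isolated context).
    await asyncio.sleep(0)  # stands in for model and network latency
    claim = f"finding about {strand} of {question}"
    return Note(strand=strand, claim=claim, source=SOURCES.get(strand, ""))


def write(notes: list) -> list:
    # Stub writer: keeps one cited sentence per structured note.
    return list(notes)


def critic(draft: list) -> tuple:
    # Release only claims whose source can be checked; flag the rest.
    released = [n for n in draft if n.source in CORPUS]
    flagged = [n for n in draft if n.source not in CORPUS]
    return released, flagged


async def research_crew(question: str) -> tuple:
    strands = plan(question)
    # Searchers run concurrently; gather returns results in strand order.
    notes = await asyncio.gather(*(search(question, s) for s in strands))
    released, flagged = critic(write(list(notes)))
    report = "\n".join(f"- {n.claim} [{n.source}]" for n in released)
    return report, flagged


report, flagged = asyncio.run(research_crew("solid-state batteries"))
print(report)
print("flagged:", [n.strand for n in flagged])
```

In a real build each searcher call would carry its own model context and token cost, which is where the several-fold token consumption mentioned above comes from.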

PART III

Engineering for the Real World

Production is where agent projects live or die. This part covers the five disciplines that separate demos from durable systems: platform independence, offline-first deployment, cost engineering, reliability and safety, and evaluation.
