One capable agent with good tools is the right default. This chapter covers when to go multi-agent, the four topologies, durable execution, and the failure modes nobody advertises.
Start with one agent
Multi-agent is fashionable; it is also the most common source of self-inflicted complexity. A single agent with a well-designed toolbox handles a remarkable share of production work, is dramatically easier to debug, and costs less — every extra agent multiplies model calls and re-transmitted context. Go multi-agent only when you can name the concrete force pushing you there:
- Context isolation — subtasks individually overflow a window, or shouldn't see each other's data (research vs. billing).
- Parallelism — independent subtasks where wall-clock time matters — wide research, multi-file edits.
- Real specialization — genuinely different toolsets, models, or permissions per role — not just different prompt personas.
- Organizational seams — different teams own different agents, or an external partner's agent must be invoked as a black box (where A2A earns its keep).
The four topologies
[Figure 6.1 panels. A, Supervisor (default): a supervisor over Agent A, Agent B, and Agent C; central routing + merge; easiest to debug. B, Sequential pipeline: Research, Draft, Review; fixed hand-offs; deterministic order; context can thin at each hop. C, Network / hand-off: Sales, Billing, Tech, Escalate; peer hand-offs (OpenAI SDK style); flexible, harder to audit. D, Hierarchical: Director, Team lead 1, Team lead 2, workers (W); nested supervisors for very large tasks.]
Figure 6.1 — The four multi-agent topologies. Supervisor is the production default; choose others for a stated reason.
Supervisor (A) keeps routing, budgets, and result-merging in one accountable place — start here. Pipelines (B) suit role-shaped flows with a known order; their classic bug is context thinning, where nuance is lost at each hand-off, so pass structured briefs rather than chat summaries. Network hand-offs (C) — the OpenAI SDK's signature move — let the agent best suited to the moment take over; flexible, but trace it well or you'll never reconstruct who decided what. Hierarchies (D) are supervisors of supervisors for very large jobs; each layer adds latency and another telephone-game hop, so they must justify themselves.
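To make the supervisor pattern and the "structured brief" advice concrete, here is a minimal sketch. It assumes workers are plain callables standing in for model-backed agents; the names `Brief`, `Supervisor`, `research_worker`, and `billing_worker` are hypothetical and do not come from any particular framework.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Brief:
    """Structured hand-off: goal, constraints, and pointers to source artifacts."""
    goal: str
    constraints: list = field(default_factory=list)
    artifacts: list = field(default_factory=list)


# A worker is any callable that takes a Brief and returns a result string.
# In a real system each one would wrap a model call with its own tools.
Worker = Callable[[Brief], str]


def research_worker(brief: Brief) -> str:
    # Hypothetical stand-in for a research agent.
    return f"research notes for: {brief.goal}"


def billing_worker(brief: Brief) -> str:
    # Hypothetical stand-in for a billing agent.
    return f"billing answer for: {brief.goal}"


class Supervisor:
    """Keeps routing, the step budget, and result merging in one place."""

    def __init__(self, workers: dict, max_steps: int = 10):
        self.workers = workers
        self.max_steps = max_steps
        self.trace = []  # (role, goal) pairs: who was asked to do what

    def run(self, tasks: list) -> str:
        if len(tasks) > self.max_steps:
            raise RuntimeError("step budget exceeded")
        results = []
        for role, brief in tasks:
            worker = self.workers[role]          # routing decision
            results.append(worker(brief))        # delegate with a structured brief
            self.trace.append((role, brief.goal))
        return "\n".join(results)                # merge step


if __name__ == "__main__":
    supervisor = Supervisor({"research": research_worker, "billing": billing_worker})
    print(supervisor.run([
        ("research", Brief(goal="compare vendors", constraints=["public sources only"])),
        ("billing", Brief(goal="refund order 17")),
    ]))
    print(supervisor.trace)
```

Because every routing decision passes through `Supervisor.run`, the trace is complete by construction, which is the debugging advantage the topology is chosen for.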
State machines and durable execution
The orchestration insight that took the industry from demos to production is unglamorous: model the run as an explicit state machine with checkpoints. Frameworks like LangGraph build this in; Temporal-style durable-execution engines provide it underneath any framework. The payoff is enormous for real operations:
[Figure 6.2 diagram: the run moves through plan, fetch data, draft, process, human approve, execute, and log, with checkpoints ckpt 1 through ckpt 6 along the way. After a crash / restart the run resumes from checkpoint 3: no lost work, no double-billing. The red node is the human-in-the-loop gate: the run pauses, state persists for hours or days, then resumes.]
Figure 6.2 — Durable execution: checkpoints make crashes boring and human approvals first-class.
- Resumability — a crash, deploy, or rate-limit storm resumes from the last checkpoint instead of re-running (and re-billing) the whole job.
- Human-in-the-loop as a state — the run pauses at an approval gate, persists for hours or days, and resumes on sign-off — essential for refunds, contracts, payments.
- Time travel for debugging — replay any run from any checkpoint with modified state to reproduce a failure.
- Idempotency by design — checkpoint IDs become natural deduplication keys for side-effecting tools.
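The four properties above can be illustrated with a toy checkpointed state machine. This is a sketch under stated assumptions, not how LangGraph or Temporal implement it: state is a JSON file per run, the step names are borrowed from Figure 6.2, and `PaymentTool`, `Run`, and `crash_after` are hypothetical names introduced for illustration.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional

STEPS = ["plan", "fetch_data", "draft", "approve", "execute", "log"]


class PaymentTool:
    """Stand-in for a side-effecting tool that dedupes on an idempotency key."""

    def __init__(self):
        self.seen = set()
        self.charges = 0

    def charge(self, key: str) -> bool:
        if key in self.seen:      # same checkpoint-derived key: already done
            return False
        self.seen.add(key)
        self.charges += 1
        return True


class Run:
    def __init__(self, run_id: str, store: Path, tool: PaymentTool):
        self.run_id, self.tool = run_id, tool
        self.path = store / f"{run_id}.json"
        if self.path.exists():    # resume from the last checkpoint
            self.state = json.loads(self.path.read_text())
        else:
            self.state = {"next": 0, "approved": False}

    def _save(self) -> None:
        self.path.write_text(json.dumps(self.state))

    def approve(self) -> None:
        self.state["approved"] = True
        self._save()

    def advance(self, crash_after: Optional[str] = None) -> str:
        while self.state["next"] < len(STEPS):
            step = STEPS[self.state["next"]]
            if step == "approve" and not self.state["approved"]:
                self._save()      # human gate: persist and stop
                return "paused"
            if step == "execute":
                # Run ID plus step index serves as the deduplication key.
                self.tool.charge(f"{self.run_id}:{self.state['next']}")
            if step == crash_after:
                raise RuntimeError(f"simulated crash after {step}")
            self.state["next"] += 1
            self._save()          # checkpoint after every completed step
        return "finished"


if __name__ == "__main__":
    store, tool = Path(tempfile.mkdtemp()), PaymentTool()
    print(Run("r1", store, tool).advance())       # stops at the approval gate
    Run("r1", store, tool).approve()              # sign-off, possibly days later
    try:
        Run("r1", store, tool).advance(crash_after="execute")
    except RuntimeError as err:
        print(err)
    print(Run("r1", store, tool).advance(), "charges:", tool.charges)
```

The simulated crash lands after the payment call but before the checkpoint is written, which is the worst case: the resumed run repeats the `execute` step, and only the idempotency key prevents a second charge.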
Multi-agent failure modes
| Failure | What it looks like | Countermeasure |
|---|---|---|
| Telephone-game loss | Each hand-off drops nuance; agent 4 solves a different problem than agent 1 was given | Pass structured briefs (goal, constraints, artifacts); let workers read source artifacts directly |
| Runaway loops | Two agents politely defer to each other forever; costs climb | Hard step and token budgets per run; loop detection; supervisor owns termination |
| Cost multiplication | A 5-agent flow re-sends shared context 5x per round | Shared state store instead of replayed transcripts; prompt caching (Ch. 9); fewer agents |
| Conflicting writes | Two workers update the same record divergently | Single-writer ownership per resource; merge step at the supervisor; optimistic locks on tools |
| Unattributable errors | Something went wrong and no one can say which agent did it | Per-agent tracing with run IDs end-to-end (Ch. 11); deterministic replay |
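The runaway-loop countermeasure (hard step and token budgets plus loop detection) can be sketched as a small guard that the supervisor charges on every action. The class and field names below are hypothetical, and the repeat threshold is an illustrative assumption rather than a recommended value.

```python
from collections import Counter
from dataclasses import dataclass, field


class BudgetExceeded(RuntimeError):
    """Raised when a run must be terminated by the supervisor."""


@dataclass
class RunBudget:
    max_steps: int = 20
    max_tokens: int = 50_000
    max_repeats: int = 3          # identical (agent, action) pairs allowed per run
    steps: int = 0
    tokens: int = 0
    seen: Counter = field(default_factory=Counter)

    def charge(self, agent: str, action: str, tokens: int) -> None:
        """Record one agent action; raise if any limit is crossed."""
        self.steps += 1
        self.tokens += tokens
        self.seen[(agent, action)] += 1
        if self.steps > self.max_steps:
            raise BudgetExceeded("step budget exhausted")
        if self.tokens > self.max_tokens:
            raise BudgetExceeded("token budget exhausted")
        if self.seen[(agent, action)] > self.max_repeats:
            raise BudgetExceeded(f"loop detected: {agent} repeated {action!r}")


if __name__ == "__main__":
    budget = RunBudget(max_steps=100, max_tokens=1_000_000, max_repeats=3)
    try:
        for i in range(50):       # two agents deferring to each other
            agent = "sales" if i % 2 == 0 else "billing"
            budget.charge(agent, "handoff", tokens=100)
    except BudgetExceeded as err:
        print(f"stopped after {budget.steps} steps: {err}")
```

Counting identical (agent, action) pairs is a deliberately crude loop detector; it catches the polite-deferral case in the table, while the step and token caps act as the backstop for loops it misses.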
Worked example: the research crew
A deep-research feature is the canonical legitimate multi-agent build: a planner decomposes the question, three searchers work strands in parallel with isolated contexts, a writer synthesizes from their structured notes, and a critic checks claims against sources before release. Anthropic's published account of building exactly this reported a multi-agent version strongly outperforming a single-agent baseline on breadth-heavy questions — while consuming several times the tokens. The pattern pays when the task is wide; it is overhead when the task is deep and sequential.
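The shape of that crew (planner, parallel searchers, writer, critic) can be sketched in a few lines. Every function below is a hypothetical placeholder for a model-backed agent, the thread pool merely stands in for parallel agent calls, and nothing here reflects how Anthropic's system is actually built.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Note:
    """Structured note passed from a searcher to the writer."""
    strand: str
    claim: str
    source: str


def plan(question: str) -> list:
    # Placeholder planner: split the question into three independent strands.
    return [f"{question}: background",
            f"{question}: current state",
            f"{question}: open problems"]


def search(strand: str) -> Note:
    # Placeholder searcher: sees only its own strand (isolated context).
    return Note(strand=strand, claim=f"finding about {strand}",
                source=f"source-for-{strand}")


def write(question: str, notes: list) -> str:
    # Placeholder writer: works from structured notes, not chat transcripts.
    lines = [f"- {n.claim} [{n.source}]" for n in notes]
    return f"Report: {question}\n" + "\n".join(lines)


def critic(notes: list) -> list:
    # Placeholder critic: return claims with no source; empty means release.
    return [n.claim for n in notes if not n.source]


def research(question: str) -> str:
    strands = plan(question)
    with ThreadPoolExecutor(max_workers=len(strands)) as pool:
        notes = list(pool.map(search, strands))   # searchers run in parallel
    draft = write(question, notes)
    unsupported = critic(notes)
    if unsupported:
        raise ValueError(f"unsupported claims: {unsupported}")
    return draft


if __name__ == "__main__":
    print(research("solid-state batteries"))
```

The searchers share nothing but the strand they were handed, which is what makes the parallelism safe; the writer and critic remain sequential because each depends on the complete set of notes.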
PART III
Engineering for the Real World
Production is where agent projects live or die. This part covers the five disciplines that separate demos from durable systems: platform independence, offline-first deployment, cost engineering, reliability and safety, and evaluation.