Teams that measure improve; teams that demo stall. This chapter covers tracing, the four kinds of evals, the metrics that predict production success, and the weekly flywheel that ties them together.
Tracing: see every step
Observability for agents means capturing the full trajectory — every model call, tool call, retrieval, token count and latency, linked per run. The 2026 tooling is mature: LangSmith (deepest LangChain/LangGraph integration), Langfuse (open-source, self-hostable — a common choice where data residency matters), and Arize Phoenix (strong eval tooling), all converging on OpenTelemetry's GenAI conventions so traces flow into the monitoring stack you already run. Instrument from day one: the traces are also the raw material for your eval suite.
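To make "linked per run" concrete, here is a minimal sketch of a trace record in plain Python. It does not use any of the tools named above; `Span`, `RunTrace`, `record` and the `kind` labels are hypothetical names chosen for illustration, and a real setup would export these fields through your tracing SDK rather than hold them in memory.

```python
from dataclasses import dataclass, field
import time
import uuid


@dataclass
class Span:
    """One step in a run: a model call, tool call, or retrieval."""
    kind: str            # "model", "tool", or "retrieval" (hypothetical labels)
    name: str
    latency_ms: float
    input_tokens: int = 0
    output_tokens: int = 0


@dataclass
class RunTrace:
    """All spans for one agent run, linked by a shared run_id."""
    run_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    spans: list[Span] = field(default_factory=list)

    def record(self, kind, name, fn, input_tokens=0, output_tokens=0):
        """Call the zero-argument fn, time it, append a span, return its result."""
        start = time.perf_counter()
        result = fn()
        latency_ms = (time.perf_counter() - start) * 1000
        self.spans.append(Span(kind, name, latency_ms, input_tokens, output_tokens))
        return result

    def total_tokens(self) -> int:
        """Sum input and output tokens across every span in the run."""
        return sum(s.input_tokens + s.output_tokens for s in self.spans)
```

Wrapping each step as `trace.record("tool", "search", lambda: search(query))` keeps the whole trajectory in one object, which is the same shape an eval case needs later.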
The four kinds of evals
| Level | Question it answers | Example check |
| --- | --- | --- |
| Unit | did one capability work? | given this email, is the extracted JSON exactly right? |
| Trajectory | did the agent take sensible steps? | searched before answering; no loops; ≤ N steps |
| Outcome | was the task actually completed? | ticket resolved and customer confirmed, regardless of path |
| LLM-as-judge | how good is unstructured output? | rubric-scored draft quality, calibrated against human labels |
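The trajectory-level example check (searched before answering, no loops, at most N steps) can be sketched as a small function. This is an illustration under assumptions: the `(action, argument)` step format, the action names `"search"` and `"answer"`, and the definition of a loop as the same step repeated back to back are all hypothetical, and your own traces will need their own rules.

```python
def check_trajectory(steps, max_steps=8):
    """Return a list of failure reasons; an empty list means the trajectory passes.

    `steps` is a list of (action, argument) tuples,
    e.g. ("search", "refund policy"). The format is an assumption.
    """
    failures = []
    actions = [action for action, _ in steps]

    # Searched before answering: a "search" must precede the first "answer".
    if "answer" in actions:
        first_answer = actions.index("answer")
        if "search" not in actions[:first_answer]:
            failures.append("answered without searching first")

    # No loops: flag the first identical step repeated back to back.
    for prev, cur in zip(steps, steps[1:]):
        if prev == cur:
            failures.append(f"repeated step: {cur[0]}({cur[1]!r})")
            break

    # Step budget: the "≤ N steps" part of the check.
    if len(steps) > max_steps:
        failures.append(f"{len(steps)} steps exceeds budget of {max_steps}")

    return failures
```

Returning reasons rather than a bare boolean makes failed cases easier to cluster during trace review.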
Build the suite from reality, not imagination: harvest failed and excellent production traces into labelled cases. Thirty real cases beat three hundred synthetic ones. Treat LLM-as-judge scores with care — judges drift and flatter; spot-check them against human labels monthly.
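One way to run the monthly spot-check is to score the same sample with both the judge and a human, then look at agreement and at the direction of the disagreement. The sketch below assumes integer rubric scores on a shared scale; the function name, the `tolerance` parameter and the two summary numbers are illustrative choices, not a standard.

```python
def judge_calibration(pairs, tolerance=1):
    """Compare judge scores with human labels on the same rubric scale.

    `pairs` is a list of (judge_score, human_score) tuples.
    Returns the share of pairs within `tolerance` points of each other,
    and the mean signed gap (positive means the judge scores higher).
    """
    if not pairs:
        raise ValueError("need at least one labelled pair")
    n = len(pairs)
    agree = sum(1 for judge, human in pairs if abs(judge - human) <= tolerance)
    bias = sum(judge - human for judge, human in pairs) / n
    return {"agreement": agree / n, "bias": bias}
```

A positive `bias` that grows from month to month is what a flattering, drifting judge looks like in numbers.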
Metrics that matter
- Task success rate — outcome-level, the headline number — defined by your rubric, not vibes.
- Escalation rate — share of runs handed to a human. Falling escalation at stable quality is the cleanest sign of real progress.
- Steps and tokens per task — efficiency and a leading indicator of loops and confusion.
- Cost per resolved task — the number finance asks about — success and spend in one ratio.
- p95 latency — agents are slow by nature; know your tail before your users do.
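The five metrics above can all be computed from one record per run. The following is a minimal sketch: the `Run` fields and the `summarize` function are hypothetical, and p95 is calculated with the nearest-rank method, which is one of several common percentile definitions.

```python
import math
from dataclasses import dataclass


@dataclass
class Run:
    """Per-run outcome record (field names are assumptions)."""
    succeeded: bool
    escalated: bool
    cost_usd: float
    latency_s: float
    steps: int
    tokens: int


def summarize(runs):
    """Aggregate a batch of runs into the five headline metrics."""
    n = len(runs)
    if n == 0:
        raise ValueError("no runs to summarize")
    resolved = sum(r.succeeded for r in runs)
    total_cost = sum(r.cost_usd for r in runs)
    latencies = sorted(r.latency_s for r in runs)
    # Nearest-rank p95: smallest latency with at least 95% of runs at or below it.
    p95 = latencies[math.ceil(0.95 * n) - 1]
    return {
        "task_success_rate": resolved / n,
        "escalation_rate": sum(r.escalated for r in runs) / n,
        "avg_steps": sum(r.steps for r in runs) / n,
        "avg_tokens": sum(r.tokens for r in runs) / n,
        # Spend is divided by resolved tasks only, so failures raise the ratio.
        "cost_per_resolved_task": total_cost / resolved if resolved else math.inf,
        "p95_latency_s": p95,
    }
```

Dividing total spend by resolved tasks, rather than by all runs, is what puts success and spend in one ratio: a cheap agent that rarely succeeds still scores badly.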
The improvement flywheel
[Figure: flywheel diagram. Trace (every production run) → Cluster (group failures) → Encode (failures become evals) → Fix (prompts · tools · routing) → Gate (CI blocks regressions) → back to Trace. Centre label: ship weekly, improve weekly.]
Figure 11.1 — Traces become evals; evals gate releases; releases generate better traces.
The loop runs weekly in healthy teams: review traces, cluster the failures, turn the clusters into eval cases, fix prompts, tools or routing, and let CI block any change that regresses the suite. Canary new versions to a slice of traffic and compare metrics before full rollout — agents are too stochastic for 'it worked on my machine'.
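The "CI blocks regressions" step can be as simple as comparing the candidate's eval results with the last released baseline. The sketch below assumes each suite run produces a mapping of case ID to pass/fail; the `gate` function, the result format and the 0.9 default pass bar are illustrative, not prescribed values.

```python
def gate(baseline, candidate, pass_bar=0.9):
    """Decide whether a candidate version may ship.

    `baseline` and `candidate` map eval case IDs to True (passed) or False.
    Returns (ok, regressed_case_ids, candidate_pass_rate).
    """
    # A regression is a case the baseline passed that the candidate fails or skips.
    regressed = sorted(
        case_id
        for case_id, passed in baseline.items()
        if passed and not candidate.get(case_id, False)
    )
    pass_rate = sum(candidate.values()) / len(candidate) if candidate else 0.0
    ok = not regressed and pass_rate >= pass_bar
    return ok, regressed, pass_rate
```

In a CI job, the script would exit non-zero when `ok` is false and print the regressed case IDs, so the failing cases are the first thing the author sees.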
Pre-launch checklist
Before launch:
- eval suite ≥ 30 real cases with a pass bar
- tracing on 100% of runs
- budgets and kill-switch wired
- escalation path staffed
- rollback rehearsed
- weekly trace-review booked
Six lines that prevent most week-one incidents.
PART IV
From Blueprint to Business
Everything so far becomes real here: a method for designing an agent around a specific use case, the documented results of teams who shipped, and a 90-day playbook for going from idea to measured pilot.