Part III: Production Engineering

Chapter 11

Evaluation & Observability

How to evaluate and observe AI agents with traces, outcome metrics, trajectory tests, LLM judges, dashboards, and an improvement flywheel.

Teams that measure improve; teams that demo stall. This chapter covers tracing, the four kinds of evals, the metrics that predict production success, and the weekly flywheel that ties them together.

Tracing: see every step

Observability for agents means capturing the full trajectory of each run: every model call, tool call, and retrieval, with its token count and latency, linked per run. The 2026 tooling is mature. LangSmith offers the deepest LangChain/LangGraph integration; Langfuse is open-source and self-hostable, a common choice where data residency matters; Arize Phoenix has strong eval tooling. All three are converging on OpenTelemetry's GenAI conventions, so traces flow into the monitoring stack you already run. Instrument from day one: the traces are also the raw material for your eval suite.
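
To make "capturing the full trajectory" concrete, here is a minimal sketch of a per-run trace recorder. It uses only the Python standard library and is not the API of LangSmith, Langfuse, Phoenix, or OpenTelemetry; the names Span, RunTrace, and step are hypothetical and chosen for this example.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    """One step in a run: a model call, tool call, or retrieval."""
    kind: str                 # hypothetical labels: "model", "tool", "retrieval"
    name: str
    latency_s: float = 0.0
    tokens: int = 0
    error: Optional[str] = None


@dataclass
class RunTrace:
    """All spans for a single agent run, linked by run_id."""
    run_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    spans: List[Span] = field(default_factory=list)

    @contextmanager
    def step(self, kind: str, name: str):
        """Time one step and record it, even if the step raises."""
        span = Span(kind=kind, name=name)
        start = time.perf_counter()
        try:
            yield span
        except Exception as exc:
            span.error = repr(exc)
            raise
        finally:
            span.latency_s = time.perf_counter() - start
            self.spans.append(span)

    def total_tokens(self) -> int:
        return sum(s.tokens for s in self.spans)


# Usage: wrap each step of the agent loop (step names are made up).
trace = RunTrace()
with trace.step("retrieval", "search_kb"):
    pass  # the retrieval call would go here
with trace.step("model", "draft_answer") as span:
    span.tokens = 420  # token count as reported by the model provider
```

In a real system the same fields would typically be emitted through an OpenTelemetry-compatible SDK rather than held in memory; the point of the sketch is only which fields to record and that failed steps are recorded too.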

The four kinds of evals

Level         Question it answers                 Example check
Unit          did one capability work?            given this email, is the extracted JSON exactly right?
Trajectory    did the agent take sensible steps?  searched before answering; no loops; ≤ N steps
Outcome       was the task actually completed?    ticket resolved and customer confirmed, regardless of path
LLM-as-judge  how good is unstructured output?    rubric-scored draft quality, calibrated against human labels
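
The unit and trajectory rows can be sketched as plain assertion helpers. This is an illustrative sketch, not a prescribed harness: the step names "search" and "answer", the default limit of 8 steps, and the three-in-a-row loop heuristic are all assumptions made for the example.

```python
import json
from typing import List


def unit_check(extracted_json: str, expected: dict) -> bool:
    """Unit level: the extracted JSON must equal the expected object exactly."""
    try:
        return json.loads(extracted_json) == expected
    except json.JSONDecodeError:
        return False


def trajectory_check(steps: List[str], max_steps: int = 8) -> List[str]:
    """Trajectory level: return a list of violations (empty list means pass).

    `steps` is the ordered list of step names taken from a trace.
    """
    problems = []
    if len(steps) > max_steps:
        problems.append(f"too many steps: {len(steps)} > {max_steps}")
    if "answer" in steps and "search" not in steps[: steps.index("answer")]:
        problems.append("answered before searching")
    # Crude loop detector: the same step three times in a row.
    for a, b, c in zip(steps, steps[1:], steps[2:]):
        if a == b == c:
            problems.append(f"possible loop on '{a}'")
            break
    return problems
```

Returning a list of violations rather than a single boolean keeps failed cases easy to group later, since each violation string doubles as a rough failure label.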

Build the suite from reality, not imagination: harvest failed and excellent production traces into labelled cases. Thirty real cases beat three hundred synthetic ones. Treat LLM-as-judge scores with care: judges drift and flatter, so spot-check them against human labels monthly.
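
One way to run that monthly spot-check is to compare judge verdicts with human verdicts on the same cases. The sketch below assumes pass/fail labels and uses raw agreement, Cohen's kappa (agreement corrected for chance), and a "flattery rate" (the share of human-failed cases the judge passed). The function names and the sample labels are invented for illustration.

```python
from typing import List


def agreement(judge: List[bool], human: List[bool]) -> float:
    """Share of cases where judge and human gave the same verdict."""
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge: List[bool], human: List[bool]) -> float:
    """Agreement corrected for chance; 1.0 is perfect, 0.0 is chance level."""
    n = len(judge)
    p_observed = agreement(judge, human)
    p_judge, p_human = sum(judge) / n, sum(human) / n
    p_chance = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    if p_chance == 1.0:  # both raters gave one identical constant label
        return 1.0
    return (p_observed - p_chance) / (1 - p_chance)


def flattery_rate(judge: List[bool], human: List[bool]) -> float:
    """Share of human-failed cases that the judge passed anyway."""
    failed = [j for j, h in zip(judge, human) if not h]
    return sum(failed) / len(failed) if failed else 0.0


# Made-up labels for eight spot-checked cases (True = pass).
judge_labels = [True, True, True, False, False, True, True, False]
human_labels = [True, True, False, False, False, True, True, True]
print(agreement(judge_labels, human_labels))
print(cohens_kappa(judge_labels, human_labels))
print(flattery_rate(judge_labels, human_labels))
```

Tracking these numbers month over month is what reveals drift: a falling kappa or a rising flattery rate suggests the judge prompt or rubric needs recalibrating.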

Metrics that matter

  • Task success rate: the outcome-level headline number, defined by your rubric, not vibes.
  • Escalation rate: the share of runs handed to a human. Falling escalation at stable quality is the cleanest sign of real progress.
  • Steps and tokens per task: a measure of efficiency and a leading indicator of loops and confusion.
  • Cost per resolved task: the number finance asks about, combining success and spend in one ratio.
  • p95 latency: agents are slow by nature; know your tail before your users do.
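
The five metrics above can be computed directly from per-run records. The sketch below is a minimal version under stated assumptions: the Run fields are hypothetical names for data a trace would hold, cost is in an arbitrary currency unit, and p95 uses the nearest-rank method (other percentile definitions give slightly different values on small samples).

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Run:
    success: bool      # outcome-level verdict from your rubric
    escalated: bool    # handed to a human
    steps: int
    tokens: int
    cost: float
    latency_s: float


def percentile(values: List[float], q: float) -> float:
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(math.ceil(q / 100 * len(ordered)), 1)
    return ordered[rank - 1]


def summarize(runs: List[Run]) -> Dict[str, float]:
    n = len(runs)
    resolved = sum(r.success for r in runs)
    total_cost = sum(r.cost for r in runs)
    return {
        "task_success_rate": resolved / n,
        "escalation_rate": sum(r.escalated for r in runs) / n,
        "steps_per_task": sum(r.steps for r in runs) / n,
        "tokens_per_task": sum(r.tokens for r in runs) / n,
        # All spend, including failed runs, divided by resolved tasks only.
        "cost_per_resolved_task": total_cost / resolved if resolved else math.inf,
        "p95_latency_s": percentile([r.latency_s for r in runs], 95),
    }


# Made-up sample of four runs.
runs = [
    Run(True, False, 4, 1000, 0.10, 2.0),
    Run(True, False, 6, 2000, 0.20, 3.0),
    Run(False, True, 12, 5000, 0.50, 9.0),
    Run(True, False, 5, 1500, 0.10, 2.5),
]
summary = summarize(runs)
```

Note the design choice in cost per resolved task: failed runs still cost money, so their spend stays in the numerator while only successes count in the denominator.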

The improvement flywheel

[Figure: a five-stage cycle. Trace (every production run) → Cluster (group failures) → Encode (failures become evals) → Fix (prompts · tools · routing) → Gate (CI blocks regressions) → back to Trace. Centre label: ship weekly, improve weekly.]

Figure 11.1 — Traces become evals; evals gate releases; releases generate better traces.

The loop runs weekly in healthy teams: review traces, cluster the failures, turn the clusters into eval cases, fix prompts, tools or routing, and let CI block any change that regresses the suite. Canary new versions to a slice of traffic and compare metrics before full rollout — agents are too stochastic for 'it worked on my machine'.
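
The "CI blocks any change that regresses the suite" step can be sketched as a small gate function. This is an assumption-laden illustration rather than a specific CI product's API: results are modelled as a mapping from case ID to pass/fail, the case IDs are invented, and the 0.9 pass bar is a placeholder for whatever bar your team sets.

```python
from typing import Dict, List, Tuple


def gate(
    baseline: Dict[str, bool],
    candidate: Dict[str, bool],
    pass_bar: float = 0.9,  # hypothetical bar; set your own
) -> Tuple[bool, List[str]]:
    """Compare a candidate eval run with the last released baseline.

    Blocks when a case that passed on the baseline now fails (or is
    missing), or when the overall pass rate falls below pass_bar.
    """
    reasons = []
    regressions = sorted(
        case for case, passed in baseline.items()
        if passed and not candidate.get(case, False)
    )
    if regressions:
        reasons.append("regressed: " + ", ".join(regressions))
    rate = sum(candidate.values()) / len(candidate)
    if rate < pass_bar:
        reasons.append(f"pass rate {rate:.2f} below bar {pass_bar:.2f}")
    return (not reasons, reasons)


# Made-up case IDs and results.
baseline = {"refund-001": True, "refund-002": True, "login-003": False}
candidate = {"refund-001": True, "refund-002": False, "login-003": True}
ok, reasons = gate(baseline, candidate)
# In a CI job, a failing gate would typically end with a non-zero exit code.
```

Checking per-case regressions separately from the aggregate rate matters because a change can fix one case and break another while leaving the overall pass rate unchanged. Because agents are stochastic, running each case several times before recording pass/fail makes the gate less flaky.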

Pre-launch checklist

Before launch:

  • eval suite ≥ 30 real cases with a pass bar
  • tracing on 100% of runs
  • budgets and kill-switch wired
  • escalation path staffed
  • rollback rehearsed
  • weekly trace-review booked

Six lines that prevent most week-one incidents.

PART IV

From Blueprint to Business

Everything so far becomes real here: a method for designing an agent around a specific use case, the documented results of teams who shipped, and a 90-day playbook for going from idea to measured pilot.