Designing Custom Agents for Your Use Case | AI Agent Engineering Handbook

There is no best agent, only the right agent for one workflow, one data reality, and one risk appetite. This chapter is the design method: discovery, autonomy, architecture mapping, and a spec you can hand to a builder.

Discovery: ten questions before any code

Most failed agent projects were lost before engineering began: the wrong workflow, fuzzy success criteria, or data that wasn't there. Run discovery as a working session with the people who do the job today, and leave with written answers to:

What exact workflow, end to end? Walk a real example, not the org chart's version of it.
How often does it run, and what does each run cost today in time and money?
What does a failure cost, and is it reversible? (A wrong draft is cheap; a wrong payment is not.)
Where does the knowledge live: in systems, documents, or someone's head?
Which systems must the agent read or write, and do APIs exist?
What does 'done well' mean, measurably? This sentence becomes your eval rubric.
Who reviews, who approves, who gets the escalations?
What data may leave the building, and what must never? (Residency rules decide architecture.)
What volume in 12 months if it works? Build for that, not for the demo.
Who owns the agent after launch, including its prompts, evals, and weekly flywheel?

Pick the autonomy level deliberately

Autonomy is a dial, not a binary, and the right setting comes from failure cost and trust earned, not ambition. Ship one level below where you think you belong, instrument everything, and earn your way up with eval evidence.

rising autonomy → rising blast radius → rising need for evals, budgets and audit

L0 Scripted automation: no model in the loop
L1 Assist: drafts & suggestions; human does the work
L2 Approve: agent acts after explicit human sign-off
L3 Supervise: agent acts; human reviews samples & exceptions
L4 Delegate: agent owns the task; escalates by policy
L5 Autonomous: agent owns the outcome end-to-end

Figure 12.1. The autonomy ladder. Most successful first deployments launch at L2-L3.
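The ladder can be written down as an ordered type so that gating decisions are explicit in code. The following is a minimal sketch, not a prescribed implementation: the names Autonomy, needs_sign_off and launch_level are hypothetical, and the rule that irreversible actions always wait for a human is an assumption layered on top of the ladder.

```python
from enum import IntEnum


class Autonomy(IntEnum):
    """The six rungs of the autonomy ladder, L0 to L5."""
    SCRIPTED = 0    # no model in the loop
    ASSIST = 1      # drafts and suggestions; human does the work
    APPROVE = 2     # agent acts after explicit human sign-off
    SUPERVISE = 3   # agent acts; human reviews samples and exceptions
    DELEGATE = 4    # agent owns the task; escalates by policy
    AUTONOMOUS = 5  # agent owns the outcome end-to-end


def needs_sign_off(level: Autonomy, irreversible: bool) -> bool:
    """Hypothetical gate: must a human sign off before this action runs?"""
    if level == Autonomy.SCRIPTED:
        return False  # deterministic script; there is no agent action to gate
    if level <= Autonomy.APPROVE:
        return True   # at L1-L2 nothing executes without a human
    # Assumption (not from the ladder itself): irreversible actions stay gated.
    return irreversible


def launch_level(target: Autonomy) -> Autonomy:
    """Ship one level below where you think you belong, floored at L0."""
    return Autonomy(max(int(Autonomy.SCRIPTED), int(target) - 1))
```

For example, a team that believes it belongs at L4 would launch at launch_level(Autonomy.DELEGATE), which is Autonomy.SUPERVISE, and move up only once eval evidence supports it.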

From answers to architecture

Discovery answers map almost mechanically onto the choices from Parts II and III:

Discovery finding | Design consequence | Where
Predictable process, steps known | workflow with LLM steps, not a free agent | Ch. 2
Open-ended, branching, judgment-heavy | agent loop; add planning + reflection | Ch. 2
Multiple systems to touch | MCP servers per system; typed tool contracts | Ch. 4
Needs to remember users/cases over time | scoped memory layer + write policy | Ch. 5
Pause for approvals; long-running | durable execution, checkpoints, HITL gates | Ch. 6
Strict data residency | local/hybrid serving; self-hosted gateway & tracing | Ch. 7-8
High volume, cost-sensitive | caching + cascade routing from day one | Ch. 9
Irreversible or high-value actions | L2-L3 autonomy, approval gates, budgets | Ch. 10
Quality disputes likely | eval suite + tracing before launch, not after | Ch. 11
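Because the mapping is close to mechanical, it can be kept as a lookup that a discovery session fills in. This is only a sketch: the dictionary keys and the function name design_consequences are hypothetical labels chosen for illustration, while the consequences and chapter references are transcribed from the table.

```python
# Discovery finding -> (design consequence, chapter), transcribed from the table.
# The short keys are hypothetical labels, not terms used by the handbook.
DESIGN_RULES = {
    "predictable_process": ("workflow with LLM steps, not a free agent", "Ch. 2"),
    "open_ended": ("agent loop; add planning + reflection", "Ch. 2"),
    "multiple_systems": ("MCP servers per system; typed tool contracts", "Ch. 4"),
    "needs_memory": ("scoped memory layer + write policy", "Ch. 5"),
    "approvals_or_long_running": ("durable execution, checkpoints, HITL gates", "Ch. 6"),
    "strict_residency": ("local/hybrid serving; self-hosted gateway & tracing", "Ch. 7-8"),
    "high_volume": ("caching + cascade routing from day one", "Ch. 9"),
    "irreversible_actions": ("L2-L3 autonomy, approval gates, budgets", "Ch. 10"),
    "quality_disputes": ("eval suite + tracing before launch, not after", "Ch. 11"),
}


def design_consequences(findings):
    """Return the (consequence, chapter) pair for each discovery finding, in order."""
    unknown = [f for f in findings if f not in DESIGN_RULES]
    if unknown:
        # Fail loudly: an unmapped finding means discovery found something new.
        raise KeyError(f"no design rule for findings: {unknown}")
    return [DESIGN_RULES[f] for f in findings]


if __name__ == "__main__":
    for consequence, chapter in design_consequences(
        ["high_volume", "multiple_systems", "needs_memory"]
    ):
        print(f"{chapter}: {consequence}")
```

An unknown finding raises rather than being skipped, on the assumption that a gap in the table should send the team back to design rather than pass silently.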

Build, buy, or assemble

Buy a finished product when your workflow is genuinely commodity (generic meeting notes, first-line IT FAQ) and differentiation doesn't matter. Build on frameworks plus your own interfaces when the workflow is your business: your pricing logic, your service playbook, your data. The middle path, assembling vendor agents behind protocol seams (MCP for tools, A2A between agents), is increasingly the pragmatic default: buy the commodity edges, build the differentiating core. Whatever you choose, the evals, budgets and audit trail are always yours to own.

Worked spec: a real-estate lead qualifier

A brokerage receives hundreds of portal and WhatsApp enquiries weekly; its human agents waste hours on unqualified leads and respond slowly to good ones. Discovery says: high volume, modest failure cost (a misrouted lead), bilingual audience, CRM is the system of record, response speed is the KPI. The spec that falls out:

Objective: respond to every enquiry in under 2 minutes, qualify against budget / area / timeline / financing, and book viewings for qualified leads.
Autonomy: L3. Messages send automatically; pricing commitments and complaints escalate to a human within the same thread.
Pattern: router + single agent loop; no multi-agent topology needed at this volume.
Tools (via MCP): CRM read/write, listings search, calendar booking, WhatsApp Business send; each schema-validated and send-rate budgeted.
Memory: per-lead profile (facts + preferences) with 12-month decay; no cross-lead recall by policy.
Models: budget model for classification and extraction; frontier model for negotiation-tone drafting; prompt caching on the listing-policy prefix.
Evals: 40 labelled historical enquiries; pass bar is qualification accuracy ≥ 90% and zero pricing commitments, with Arabic quality spot-checked by a native speaker.
Success metric: median response < 2 min; ≥ 25% more viewings booked per 100 enquiries within 8 weeks, at agreed cost per lead.
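The automatable part of the eval pass bar above (qualification accuracy ≥ 90% and zero pricing commitments) can be checked with a few lines of code. This is a sketch under assumptions: the EvalResult record and the passes_launch_bar function are hypothetical, and how a pricing commitment is detected in a reply is left to whatever grader the team uses. The native-speaker spot check stays a human step and is not modelled here.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    """One labelled historical enquiry scored against the agent's output."""
    expected_qualified: bool       # human label
    predicted_qualified: bool      # agent's decision
    made_pricing_commitment: bool  # flagged by a grader (assumed to exist)


def passes_launch_bar(results, min_accuracy=0.90):
    """True if accuracy meets the bar and no reply made a pricing commitment."""
    if not results:
        return False  # an empty eval set proves nothing
    correct = sum(r.expected_qualified == r.predicted_qualified for r in results)
    accuracy = correct / len(results)
    commitments = sum(r.made_pricing_commitment for r in results)
    return accuracy >= min_accuracy and commitments == 0


if __name__ == "__main__":
    # 40 enquiries, 36 qualified correctly, none with a pricing commitment.
    sample = [EvalResult(True, True, False)] * 36 + [EvalResult(True, False, False)] * 4
    print(passes_launch_bar(sample))
```

With 36 of 40 correct the accuracy is exactly 0.90, so the sample meets the bar; a single flagged pricing commitment would fail it regardless of accuracy.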

The one-page agent spec

One page, eight headings: Objective · Autonomy level · Pattern · Tools & data · Memory policy · Models & cost plan · Eval set & pass bar · Owner & escalation path. If you cannot fill all eight, you are not ready to build; you are ready for more discovery.
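The eight-heading rule is easy to enforce mechanically. Below is a minimal sketch, assuming the spec is kept as free-text fields; the AgentSpec class and its method names are hypothetical, and the example values are abbreviated from the lead-qualifier spec earlier in this chapter.

```python
from dataclasses import dataclass, fields


@dataclass
class AgentSpec:
    """One field per heading of the one-page agent spec."""
    objective: str = ""
    autonomy_level: str = ""
    pattern: str = ""
    tools_and_data: str = ""
    memory_policy: str = ""
    models_and_cost_plan: str = ""
    eval_set_and_pass_bar: str = ""
    owner_and_escalation_path: str = ""

    def missing(self):
        """Names of headings that are still blank."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def ready_to_build(self):
        """Ready only when all eight headings are filled."""
        return not self.missing()


if __name__ == "__main__":
    spec = AgentSpec(
        objective="Respond to every enquiry in under 2 minutes; qualify; book viewings",
        autonomy_level="L3",
        pattern="router + single agent loop",
        tools_and_data="CRM read/write, listings search, calendar, WhatsApp Business send",
        memory_policy="per-lead profile, 12-month decay, no cross-lead recall",
        models_and_cost_plan="budget model for extraction; frontier model for drafting",
        eval_set_and_pass_bar="40 labelled enquiries; accuracy >= 90%; zero pricing commitments",
    )
    print(spec.ready_to_build(), spec.missing())
```

In this example the owner heading is deliberately left blank, so the spec reports itself as not ready and names the gap to take back to discovery.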
