Credit-Smart: Cost Optimization for Online APIs | AI Agent Engineering Handbook

Agent loops multiply tokens, and tokens are the bill. The good news: a handful of disciplined techniques routinely cut production costs by well over half, without touching quality.

Why agents cost more than chatbots

A chatbot answers once. An agent loops, and on every step it re-sends the system prompt, the tool catalogue, and the accumulated history. A ten-step run with a 6,000-token prefix pays for that prefix ten times, so input tokens, not output tokens, dominate agent bills. That is also the good news, because repeated input is exactly what the optimization stack attacks. One caveat before optimizing anything: if a workload costs under a few hundred dollars a month, your engineering time is the expensive part, so ship features instead.
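The arithmetic behind that claim can be sketched in a few lines. This is a minimal illustration, assuming a fixed prefix, no caching, and a flat number of new history tokens per step; the function name and the 500-token figure are hypothetical.

```python
def run_input_tokens(prefix_tokens: int, steps: int, new_tokens_per_step: int = 0) -> int:
    """Total input tokens billed for one uncached agent run.

    Every step re-sends the prefix plus all history accumulated so far.
    """
    total = 0
    history = 0
    for _ in range(steps):
        history += new_tokens_per_step      # history grows on every step
        total += prefix_tokens + history    # the whole context is billed again
    return total


# The ten-step, 6,000-token-prefix run from the text: the prefix alone is billed ten times.
prefix_only = run_input_tokens(6_000, 10)
# Same run with a hypothetical 500 new history tokens per step.
with_history = run_input_tokens(6_000, 10, new_tokens_per_step=500)
print(prefix_only, with_history)
```

Even before history is counted, the prefix accounts for 60,000 billed input tokens in a single run, which is why the stack below starts with the input side.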

The five-step stack

1 · Measure: per-task token + $ tracing; you cannot cut what you cannot see
2 · Cache: prompt caching on stable prefixes, typically the single biggest lever
3 · Route: cascade, small model first, frontier only on low confidence
4 · Compress: context engineering (summaries, JIT retrieval, trimmed tool output)
5 · Batch + semantic cache: 50% off non-urgent jobs; reuse answers to repeated questions
Stacked savings: 70-90%, typical for high-volume production agents

Figure 9.1. Apply in order. Each step compounds with the previous ones.

Measure: set up per-task cost tracing (Chapter 11 tooling) before any tuning. Most teams discover that two or three call sites generate most of the spend.
Cache: prompt caching reuses the processed prefix (system prompt, tool definitions, stable examples) across calls. Cached reads are billed at roughly 10% of the normal input price, though the exact discount varies by provider and model; Anthropic charges about a 25% premium to write the cache, and OpenAI caches automatically on prompts of 1,024 tokens or more. Structure prompts stable-first, volatile-last; typical agent workloads then save 45-80% on input costs.
Route: use a cascade. A small, cheap model handles classification, extraction and easy replies; only low-confidence or high-stakes cases escalate. Budget-tier models in 2026 cost cents per million tokens (DeepSeek's V4-Flash class sits around $0.14 in / $0.28 out), one to two orders of magnitude below frontier pricing.
Compress: the context-engineering toolkit from Chapter 5 (compaction thresholds, scratchpads, just-in-time retrieval, trimmed tool outputs) directly cuts the tokens every step re-sends.
Batch + semantic cache: non-urgent work (nightly enrichment, report generation) goes through batch APIs at 50% off; a semantic cache returns stored answers to near-duplicate questions without any model call.
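The first two steps can be sketched together: a tracer that attributes spend to task labels and prices cached tokens separately. This is a minimal sketch, not any provider's SDK. The class and function names are hypothetical, the 0.10 and 1.25 multipliers are the approximate figures quoted above, and the $15 output price is an assumed placeholder.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_mtok: float                  # $ per million uncached input tokens
    output_per_mtok: float                 # $ per million output tokens
    cache_read_multiplier: float = 0.10    # assumption: reads at ~10% of input price
    cache_write_multiplier: float = 1.25   # assumption: ~25% premium on cache writes


def call_cost(p: Pricing, uncached_in: int, cache_read: int = 0,
              cache_write: int = 0, out: int = 0) -> float:
    """Dollar cost of one model call, split by how each input token was billed."""
    billed_input = (uncached_in
                    + cache_read * p.cache_read_multiplier
                    + cache_write * p.cache_write_multiplier)
    return (billed_input * p.input_per_mtok + out * p.output_per_mtok) / 1_000_000


class CostTracer:
    """Accumulates spend per task label so the expensive call sites stand out."""

    def __init__(self, pricing: Pricing) -> None:
        self.pricing = pricing
        self.spend = defaultdict(float)

    def record(self, task: str, **token_counts: int) -> float:
        cost = call_cost(self.pricing, **token_counts)
        self.spend[task] += cost
        return cost

    def top(self, n: int = 3) -> list[tuple[str, float]]:
        """The n task labels with the highest accumulated spend."""
        return sorted(self.spend.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Hypothetical prices: $3 in (as in the worked example), $15 out (assumed).
tracer = CostTracer(Pricing(input_per_mtok=3.00, output_per_mtok=15.00))
tracer.record("triage", uncached_in=1_200, cache_read=5_500, out=200)
tracer.record("refund_flow", uncached_in=6_700, out=400)   # same size, no cache hit
print(tracer.top())
```

In a real system the token counts would come from the usage fields each provider returns with a response; here they are typed in by hand to keep the sketch self-contained.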

Incoming task → Small / cheap model (classify · extract · easy replies; handles ~70-90% end-to-end) → on low confidence / policy triggers → Frontier model (reasoning-heavy minority)

Figure 9.2. The cascade: the cheap model is the workforce, the frontier model is the specialist.
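The cascade in Figure 9.2 might look like the following in code. This is a sketch under stated assumptions: the small model (or a separate classifier) is assumed to return a confidence score, the 0.8 threshold is arbitrary, and `small_stub`, `frontier_stub` and `refund_policy` are hypothetical stand-ins for real model calls and policy rules.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Draft:
    text: str
    confidence: float  # 0.0 to 1.0, from the small model or a separate classifier


def cascade(task: str,
            small: Callable[[str], Draft],
            frontier: Callable[[str], str],
            threshold: float = 0.8,
            is_high_stakes: Callable[[str], bool] = lambda task: False) -> tuple[str, str]:
    """Return (answer, tier). Escalate on policy triggers or low confidence."""
    if is_high_stakes(task):
        return frontier(task), "frontier"   # policy trigger: skip the small model
    draft = small(task)
    if draft.confidence >= threshold:
        return draft.text, "small"
    return frontier(task), "frontier"       # low confidence: escalate


# Stand-in models for illustration only; real ones would call an API.
def small_stub(task: str) -> Draft:
    known = "reset my password" in task.lower()
    return Draft("Use the reset link on the sign-in page.", 0.95 if known else 0.40)


def frontier_stub(task: str) -> str:
    return f"[frontier answer to: {task}]"


def refund_policy(task: str) -> bool:
    return "refund" in task.lower()


easy = cascade("How do I reset my password?", small_stub, frontier_stub)
hard = cascade("Why did sync fail after the migration?", small_stub, frontier_stub)
risky = cascade("I want a refund", small_stub, frontier_stub, is_high_stakes=refund_policy)
print(easy[1], hard[1], risky[1])
```

Returning the tier alongside the answer makes it easy to log what share of traffic actually stays on the small model, which is the number the savings depend on.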

A worked example

A support agent handles 100,000 requests a month, averaging 8 steps, with a 5,500-token stable prefix and ~1,200 volatile tokens per step on a frontier model at $3 per million input tokens:

| Stage | What changes | Monthly input cost |
| --- | --- | --- |
| Baseline | full prefix re-sent every step | $16,080 |
| + Prompt caching | prefix cached; reads at ~10% | $4,460 |
| + Cascade routing | 70% of requests stay on a budget model | $1,720 |
| + Compression | history compaction trims ~25% of volatile tokens | $1,390 |

The numbers are illustrative, but the shape is what teams report in production: the first two steps do most of the work, and a 70-90% total reduction is a normal outcome for high-volume agents. That is often the difference between a project that scales and one that gets shut down.
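The first two rows of the table can be checked with a short calculation. The sketch below uses only the figures stated in the example and treats the 1,200 volatile tokens as a flat per-step average. The baseline reproduces exactly. For the caching row it computes only the floor where every prefix read is billed at 10%; the table's $4,460 sits slightly above that floor, presumably because it allows for cache writes and misses, which is an assumption on my part rather than something the example states.

```python
REQUESTS = 100_000      # requests per month
STEPS = 8               # average steps per request
PREFIX = 5_500          # stable prefix tokens, re-sent every step
VOLATILE = 1_200        # volatile tokens per step (treated as a flat average)
PRICE = 3.00            # $ per million input tokens


def monthly_input_cost(prefix_multiplier: float = 1.0) -> float:
    """Monthly input cost when the prefix is billed at the given multiplier."""
    tokens_per_step = PREFIX * prefix_multiplier + VOLATILE
    return REQUESTS * STEPS * tokens_per_step * PRICE / 1_000_000


baseline = monthly_input_cost()            # full prefix billed on every step
cached_floor = monthly_input_cost(0.10)    # every prefix read billed at 10%

print(f"baseline:     ${baseline:,.0f}")
print(f"cached floor: ${cached_floor:,.0f}")
print(f"floor saving: {1 - cached_floor / baseline:.0%}")
```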

45-80%: typical input-cost saving from prompt caching alone (provider guidance, 2025-26)
50%: discount on batch-API workloads (OpenAI / Anthropic batch pricing)
70-90%: total reduction from the full stack at volume (production case write-ups)

Anti-patterns

Caching highly personalised content: unique prefixes never get cache hits, so you pay the write premium for nothing.
Routing by length instead of difficulty: short questions can be hard, so route on confidence or a learned classifier.
Compressing away the evidence: over-aggressive summarisation deletes the facts the agent needs, and quality pays the bill.
Optimizing before measuring: without per-task tracing you will tune the wrong call site.
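The first anti-pattern can be made concrete with a break-even check. This is a rough sketch, assuming the approximate multipliers quoted earlier (writes at 1.25x, reads at 0.10x) and assuming every send after the first hits the cache before it expires; the function name is hypothetical.

```python
def prefix_cost(sends: int, read_mult: float = 0.10, write_mult: float = 1.25) -> tuple[float, float]:
    """Relative prefix cost over `sends` calls that share one prefix.

    Returns (cached, uncached), in units of one uncached prefix send.
    """
    uncached = float(sends)
    cached = write_mult + read_mult * (sends - 1)   # one write, then reads
    return cached, uncached


for sends in (1, 2, 8):
    cached, uncached = prefix_cost(sends)
    print(f"{sends} send(s): cached {cached:.2f} vs uncached {uncached:.2f}")
```

A prefix that is sent only once costs more with caching than without it, while a single reuse already pays back the write premium. That is why unique, personalised prefixes are the case to avoid.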
