Part III: Production Engineering

Chapter 09

Credit-Smart: Cost Optimization for Online APIs

Control AI agent API costs with prompt caching, model routing, cascades, batching, semantic caches, and per-task cost tracing.

Agent loops multiply tokens, and tokens are the bill. The good news: a handful of disciplined techniques routinely cut production costs by well over half — without touching quality.

Why agents cost more than chatbots

A chatbot answers once. An agent loops, and on every step it re-sends the system prompt, the tool catalogue, and the accumulated history. A ten-step run with a 6,000-token prefix pays for that prefix ten times: 60,000 input tokens before any history is counted. Input tokens, not output tokens, dominate agent bills. That is an opportunity, because repeated input is exactly what the optimization stack attacks. One caveat before optimizing anything: if a workload costs under a few hundred dollars a month, your engineering time is the expensive part, so ship features instead.
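The re-send arithmetic is easy to check. The sketch below is a minimal model, not a billing formula: the function name and the `history_growth` parameter (tokens appended to the history per step) are illustrative assumptions, and real runs add tool output unevenly.

```python
def run_input_tokens(prefix_tokens: int, steps: int, history_growth: int = 0) -> int:
    """Total input tokens billed across one agent run.

    Every step re-sends the fixed prefix plus all history accumulated so far.
    history_growth is a hypothetical number of tokens added to history per step.
    """
    total = 0
    history = 0
    for _ in range(steps):
        total += prefix_tokens + history  # this step's full input
        history += history_growth         # the next step carries more history
    return total


if __name__ == "__main__":
    print(run_input_tokens(6_000, 10))       # prefix only
    print(run_input_tokens(6_000, 10, 500))  # prefix plus growing history
```

With no history growth the ten-step run bills 60,000 prefix tokens; adding an assumed 500 tokens of history per step raises the total to 82,500, because history is itself re-sent on every later step.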

The five-step stack

1 · Measure: per-task token + $ tracing; you cannot cut what you cannot see
2 · Cache: prompt caching on stable prefixes, typically the single biggest lever
3 · Route: cascade, small model first, frontier only on low confidence
4 · Compress: context engineering (summaries, JIT retrieval, trimmed tool output)
5 · Batch + semantic cache: 50% off non-urgent jobs; reuse answers to repeated questions

Stacked savings: 70-90%, typical for high-volume production agents

Figure 9.1 — Apply in order. Each step compounds with the previous ones.

  • Measure: set up per-task cost tracing (Chapter 11 tooling) before any tuning. Most teams discover that two or three call sites generate most of the spend.
  • Cache: prompt caching reuses the processed prefix (system prompt, tool definitions, stable examples) across calls. On Anthropic, cached reads are billed at roughly 10% of the normal input price and cache writes carry about a 25% premium at the default cache lifetime; OpenAI caches automatically on prompts of 1,024 tokens or more, with a cached-input discount that varies by model. Structure prompts stable-first, volatile-last; typical agent workloads then save 45-80% on input costs.
  • Route: use a cascade. A small, cheap model handles classification, extraction and easy replies; only low-confidence or high-stakes cases escalate to a frontier model. Budget-tier models in 2026 cost cents per million tokens (DeepSeek's V4-Flash class sits around $0.14 in / $0.28 out per million), one to two orders of magnitude below frontier pricing.
  • Compress: the context-engineering toolkit from Chapter 5 (compaction thresholds, scratchpads, just-in-time retrieval, trimmed tool outputs) directly cuts the tokens every step re-sends.
  • Batch + semantic cache: non-urgent work (nightly enrichment, report generation) goes through batch APIs at 50% off; a semantic cache returns stored answers to near-duplicate questions without any model call.
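The semantic cache in the last step can be sketched in a few lines. This is a toy under loud assumptions: `embed` is a word-count stand-in for a real embedding model, the 0.8 threshold is arbitrary, and the linear scan would be replaced by a vector index at any real volume. The class and method names are illustrative, not any library's API.

```python
import math
import re
from collections import Counter
from typing import List, Optional, Tuple


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.8) -> None:
        self.threshold = threshold  # assumed cut-off; tune against real traffic
        self.entries: List[Tuple[Counter, str]] = []

    def put(self, question: str, answer: str) -> None:
        self.entries.append((embed(question), answer))

    def get(self, question: str) -> Optional[str]:
        """Return the stored answer for the closest question, or None on a miss."""
        query = embed(question)
        best_score, best_answer = 0.0, None
        for vector, answer in self.entries:
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None


if __name__ == "__main__":
    cache = SemanticCache()
    cache.put("How do I reset my password?", "Use the reset link on the login page.")
    print(cache.get("How can I reset my password?"))  # near-duplicate: served from cache
    print(cache.get("What is your refund policy?"))   # unrelated: miss, call the model
```

The threshold is the design decision that matters: set it too low and users get answers to questions they did not ask, so a miss should always fall through to a model call.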

Incoming task → Small / cheap model (classify · extract · easy replies; handles ~70-90% end-to-end) → on low confidence / policy triggers → Frontier model (reasoning-heavy minority)

Figure 9.2 — The cascade: the cheap model is the workforce, the frontier model is the specialist.
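The cascade in Figure 9.2 reduces to a small routing function. The sketch below assumes the small model can report a confidence score between 0 and 1, which in practice may come from log-probabilities, a self-rating, or a learned classifier. The signatures, the 0.8 threshold, and the stub models are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical signatures: the small model returns (answer, confidence in [0, 1]);
# the frontier model returns only an answer.
SmallModel = Callable[[str], Tuple[str, float]]
FrontierModel = Callable[[str], str]


@dataclass
class Routed:
    answer: str
    model: str  # "small" or "frontier"


def cascade(
    task: str,
    small: SmallModel,
    frontier: FrontierModel,
    min_confidence: float = 0.8,
    policy_trigger: Callable[[str], bool] = lambda task: False,
) -> Routed:
    """Try the cheap model first; escalate on policy triggers or low confidence."""
    if policy_trigger(task):  # high-stakes cases skip the small model entirely
        return Routed(frontier(task), "frontier")
    answer, confidence = small(task)
    if confidence >= min_confidence:
        return Routed(answer, "small")
    return Routed(frontier(task), "frontier")


# Stub models for illustration only; real ones would be API calls.
def stub_small(task: str) -> Tuple[str, float]:
    confident = task.lower().startswith("classify")
    return ("label: billing", 0.95) if confident else ("not sure", 0.40)


def stub_frontier(task: str) -> str:
    return "detailed answer"


def high_stakes(task: str) -> bool:
    return "refund over" in task.lower()


if __name__ == "__main__":
    print(cascade("Classify: invoice question", stub_small, stub_frontier).model)
    print(cascade("Explain the outage timeline", stub_small, stub_frontier).model)
```

Note that a low-confidence escalation pays for both calls, so the cascade only saves money when the small model keeps a clear majority of the traffic.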

A worked example

A support agent handles 100,000 requests a month, averaging 8 steps, with a 5,500-token stable prefix and ~1,200 volatile tokens per step on a frontier model at $3 per million input tokens:

Stage             | What changes                                   | Monthly input cost
Baseline          | full prefix re-sent every step                 | $16,080
+ Prompt caching  | prefix cached; reads at ~10%                   | $4,460
+ Cascade routing | 70% of requests stay on a budget model         | $1,720
+ Compression     | history compaction trims ~25% of volatile tokens | $1,390

Numbers are illustrative but the shape is what teams report in production: the first two steps do most of the work, and a 70-90% total reduction is a normal outcome for high-volume agents — which is often the difference between a project that scales and one that gets shut down.
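The first two rows can be roughly reproduced from the stated inputs. The sketch below is a deliberately simplified model: it prices every prefix token at a single multiplier and ignores cache-write premiums and cache misses, which is an assumption of this sketch rather than of the table. The baseline comes out at exactly $16,080; the cached estimate lands near, not on, the table's figure.

```python
PRICE_PER_MTOK = 3.00   # frontier input price, $ per million tokens
REQUESTS = 100_000      # requests per month
STEPS = 8               # average agent steps per request
PREFIX = 5_500          # stable prefix tokens re-sent each step
VOLATILE = 1_200        # volatile tokens per step


def monthly_input_cost(prefix_multiplier: float = 1.0) -> float:
    """Monthly input cost in dollars.

    prefix_multiplier is the billed fraction of the normal input price for the
    prefix: 1.0 with no caching, about 0.10 when every read hits the cache.
    Simplification: cache writes and misses are ignored.
    """
    tokens_per_step = PREFIX * prefix_multiplier + VOLATILE
    total_tokens = REQUESTS * STEPS * tokens_per_step
    return total_tokens / 1_000_000 * PRICE_PER_MTOK


if __name__ == "__main__":
    baseline = monthly_input_cost()
    cached = monthly_input_cost(0.10)
    print(f"baseline: ${baseline:,.0f}")
    print(f"cached:   ${cached:,.0f}")
    print(f"saving:   {1 - cached / baseline:.0%}")
```

Swapping in your own request volume, step count, and prefix size is a quick way to judge whether caching is worth wiring up before touching production code.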

  • 45-80%: typical input-cost saving from prompt caching alone (provider guidance, 2025-26)
  • 50%: discount on batch-API workloads (OpenAI / Anthropic batch pricing)
  • 70-90%: total reduction from the full stack at volume (production case write-ups)

Anti-patterns

  • Caching highly personalised content — unique prefixes never get cache hits; you pay the write premium for nothing.
  • Routing by length instead of difficulty — short questions can be hard; use confidence or a learned classifier.
  • Compressing away the evidence — over-aggressive summarisation deletes the facts the agent needs, and quality pays the bill.
  • Optimizing before measuring — without per-task tracing you will tune the wrong call site.
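
The last anti-pattern is the cheapest to avoid. A minimal sketch of per-task tracing follows; it is not the Chapter 11 tooling, and the class name, the call-site labels, and the $3 / $15 per-million prices in the usage lines are illustrative assumptions. The idea is to record tokens at every model call, then rank call sites by spend.

```python
from collections import defaultdict
from typing import DefaultDict, List, Tuple


class CostTracer:
    """Accumulates dollar spend per task and per call site."""

    def __init__(self, input_price_per_mtok: float, output_price_per_mtok: float) -> None:
        self.input_price = input_price_per_mtok
        self.output_price = output_price_per_mtok
        self.spend_by_task: DefaultDict[str, float] = defaultdict(float)
        self.spend_by_site: DefaultDict[str, float] = defaultdict(float)

    def record(self, task_id: str, call_site: str, input_tokens: int, output_tokens: int) -> float:
        """Record one model call and return its cost in dollars."""
        cost = (input_tokens * self.input_price + output_tokens * self.output_price) / 1_000_000
        self.spend_by_task[task_id] += cost
        self.spend_by_site[call_site] += cost
        return cost

    def top_sites(self, n: int = 3) -> List[Tuple[str, float]]:
        """Call sites ranked by total spend, most expensive first."""
        ranked = sorted(self.spend_by_site.items(), key=lambda item: item[1], reverse=True)
        return ranked[:n]


if __name__ == "__main__":
    tracer = CostTracer(input_price_per_mtok=3.0, output_price_per_mtok=15.0)  # assumed prices
    tracer.record("task-1", "planner", input_tokens=48_000, output_tokens=900)
    tracer.record("task-1", "summarizer", input_tokens=6_000, output_tokens=400)
    tracer.record("task-2", "planner", input_tokens=52_000, output_tokens=1_100)
    for site, dollars in tracer.top_sites():
        print(f"{site}: ${dollars:.4f}")
```

In a real system the token counts would come from the usage fields the provider returns with each response, not from estimates.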