Part III: Production Engineering

Chapter 08

Offline-First & Local Agents

Local AI agent deployment with Ollama, vLLM, llama.cpp, quantization, hardware sizing, and hybrid edge-cloud architectures.

Some agents must run where the internet doesn't, or where the data must never leave. This chapter covers the local runtimes, hardware sizing, and the hybrid pattern that gives you both privacy and frontier quality.

When local is the right call

Four situations justify running models on your own hardware: data that cannot legally or contractually leave your environment (health records, government workloads, much of the UAE public sector); operations in places with unreliable connectivity — ships, sites, clinics, warehouses; sustained high-volume workloads where per-token API pricing exceeds amortised hardware; and any product where offline operation is itself the feature. If none of these apply, frontier APIs are usually cheaper and better once engineering time is counted honestly.

The local runtime landscape

  • Ollama — the developer default. One-line install, model library, and an OpenAI-compatible API on localhost:11434 — so anything built for the OpenAI format runs locally by changing the base URL. Ideal for single-machine agents and prototyping.
  • llama.cpp / GGUF — the engine underneath much of the ecosystem. Runs quantized GGUF models on CPU and GPU, down to laptops and edge devices. Maximum control, minimum ceremony.
  • vLLM — the serving layer for scale. PagedAttention and continuous batching deliver far higher throughput per GPU; the standard choice when one box must serve many concurrent agents.
  • LM Studio — a desktop GUI over the same model files — useful for non-engineers evaluating models, and it also exposes a local OpenAI-style server.

Hardware sizing, honestly

Model class Typical RAM/VRAM
(4-bit)
What it can carry Realistic speed
3-4B ~4 GB classification, routing, extraction, simple
fast even on CPU
chat
7-9B ~8 GB solid single-agent work, tool calling,
drafting
15-20 tok/s CPU; 5-10x on
GPU
12-14B ~12-16 GB better reasoning, longer context
discipline
needs a real GPU to feel
fluid
Model class Typical RAM/VRAM
(4-bit)
What it can carry Realistic speed
30-34B ~24-32 GB strong generalist; small-team workhorse single 24 GB+ GPU class
70B+ ~64 GB+ near-frontier quality on many tasks multi-GPU or Apple unified
memory

Quantization makes these numbers possible: 4-bit (Q4) versions keep most of a model's quality at roughly a quarter of the memory, and are the sensible default. Drop to Q8 or FP16 only when evals show a quality gap on your tasks. Two agent-specific traps: Ollama's default context window is 4,096 tokens — far too small for agent loops, so raise num_ctx explicitly and re-check memory headroom; and tool-calling reliability varies sharply between small models, so test that specifically before committing.

~8 GB

runs a quantized 8-9B model —
modern laptop territory
llama.cpp / Ollama guidance

4,096

Ollama's default context — raise
it for agents
Ollama docs

5-10x

typical GPU speed-up over CPU
inference
community benchmarks

The hybrid pattern: local by default, cloud by exception

Pure-local and pure-cloud are both usually wrong. The production pattern that keeps winning is a router: a local model handles the bulk of traffic — and everything touching sensitive data — while clearly-defined hard cases escalate to a frontier API with sensitive fields stripped or masked. Because Ollama speaks the OpenAI format, the router is often just your gateway from Chapter 7 with two routes and a policy.

Request
user / system event
Router
complexity + sensitivity gate
Local model
Ollama on your box — default
handles ~80-90% of traffic
Frontier API
hard cases — logged, budgeted
~10-20%, sensitive fields masked
PII or regulated data never leaves the building: the router strips or blocks it before any cloud call.

Figure 8.1 — The hybrid router: privacy and cost by default, frontier quality on demand.

'Local' ≠ 'private' by default

Local is not automatically private. Desktop runtimes may check for updates, send
telemetry, or load remote model cards; a misconfigured server binds to 0.0.0.0 and serves
your model to the office. For regulated work: pin versions, disable phone-home features,
firewall the box, and put the runtime behind the same audit logging as any other service.

Field example — a clinic network

A healthcare group with clinics across the Emirates wants an agent that drafts referral letters and answers protocol questions. Patient data cannot leave the premises and two sites have unreliable links. The shape: a 9B model on a small GPU server per clinic handles drafting and retrieval over local guidelines; a nightly batch syncs de-identified usage metrics; and only anonymised, non-clinical questions may escalate to a frontier API. Offline-first here is not an optimisation — it is the compliance story that makes the project approvable at all.