Some agents must run where the internet doesn't, or where the data must never leave. This chapter covers the local runtimes, hardware sizing, and the hybrid pattern that gives you both privacy and frontier quality.
When local is the right call
Four situations justify running models on your own hardware: data that cannot legally or contractually leave your environment (health records, government workloads, much of the UAE public sector); operations in places with unreliable connectivity — ships, sites, clinics, warehouses; sustained high-volume workloads where per-token API pricing exceeds amortised hardware; and any product where offline operation is itself the feature. If none of these apply, frontier APIs are usually cheaper and better once engineering time is counted honestly.
The local runtime landscape
- Ollama — the developer default. One-line install, model library, and an OpenAI-compatible API on localhost:11434 — so anything built for the OpenAI format runs locally by changing the base URL. Ideal for single-machine agents and prototyping.
- llama.cpp / GGUF — the engine underneath much of the ecosystem. Runs quantized GGUF models on CPU and GPU, down to laptops and edge devices. Maximum control, minimum ceremony.
- vLLM — the serving layer for scale. PagedAttention and continuous batching deliver far higher throughput per GPU; the standard choice when one box must serve many concurrent agents.
- LM Studio — a desktop GUI over the same model files — useful for non-engineers evaluating models, and it also exposes a local OpenAI-style server.
Hardware sizing, honestly
Model class Typical RAM/VRAM (4-bit) What it can carry Realistic speed 3-4B ~4 GB classification, routing, extraction, simple fast even on CPU chat 7-9B ~8 GB solid single-agent work, tool calling, drafting 15-20 tok/s CPU; 5-10x on GPU 12-14B ~12-16 GB better reasoning, longer context discipline needs a real GPU to feel fluid Model class Typical RAM/VRAM (4-bit) What it can carry Realistic speed 30-34B ~24-32 GB strong generalist; small-team workhorse single 24 GB+ GPU class 70B+ ~64 GB+ near-frontier quality on many tasks multi-GPU or Apple unified memory
Quantization makes these numbers possible: 4-bit (Q4) versions keep most of a model's quality at roughly a quarter of the memory, and are the sensible default. Drop to Q8 or FP16 only when evals show a quality gap on your tasks. Two agent-specific traps: Ollama's default context window is 4,096 tokens — far too small for agent loops, so raise num_ctx explicitly and re-check memory headroom; and tool-calling reliability varies sharply between small models, so test that specifically before committing.
~8 GB
runs a quantized 8-9B model — modern laptop territory llama.cpp / Ollama guidance
4,096
Ollama's default context — raise it for agents Ollama docs
5-10x
typical GPU speed-up over CPU inference community benchmarks
The hybrid pattern: local by default, cloud by exception
Pure-local and pure-cloud are both usually wrong. The production pattern that keeps winning is a router: a local model handles the bulk of traffic — and everything touching sensitive data — while clearly-defined hard cases escalate to a frontier API with sensitive fields stripped or masked. Because Ollama speaks the OpenAI format, the router is often just your gateway from Chapter 7 with two routes and a policy.
Request user / system event Router complexity + sensitivity gate Local model Ollama on your box — default handles ~80-90% of traffic Frontier API hard cases — logged, budgeted ~10-20%, sensitive fields masked PII or regulated data never leaves the building: the router strips or blocks it before any cloud call.
Figure 8.1 — The hybrid router: privacy and cost by default, frontier quality on demand.
'Local' ≠ 'private' by default
Local is not automatically private. Desktop runtimes may check for updates, send telemetry, or load remote model cards; a misconfigured server binds to 0.0.0.0 and serves your model to the office. For regulated work: pin versions, disable phone-home features, firewall the box, and put the runtime behind the same audit logging as any other service.
Field example — a clinic network
A healthcare group with clinics across the Emirates wants an agent that drafts referral letters and answers protocol questions. Patient data cannot leave the premises and two sites have unreliable links. The shape: a 9B model on a small GPU server per clinic handles drafting and retrieval over local guidelines; a nightly batch syncs de-identified usage metrics; and only anonymised, non-clinical questions may escalate to a frontier API. Offline-first here is not an optimisation — it is the compliance story that makes the project approvable at all.