Part II: The Toolkit

Chapter 05

Memory: From Context Windows to Knowledge Graphs

A practical taxonomy of agent memory, including working, episodic, semantic, and procedural memory, plus vector, graph, and tiered architectures.

Context is not memory. The best strategies treat memory as an engineered subsystem: typed stores, explicit write policies, multi-signal retrieval, and ruthless context hygiene.

Why 'just use a big context window' fails

Million-token windows tempt teams to stuff the entire history into every call. Three forces break this at scale. Economics: shipping 100K tokens of history to produce a 50-token reply is unsustainable at volume — and agents amplify it, because the loop re-sends context every step. Latency: time-to-first-token grows with input size, which real-time products feel immediately. And recall: models demonstrably lose facts buried in the middle of huge prompts — the well-documented 'lost in the middle' effect. Memory engineering exists to send the model less, but the right less.
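To make the economics concrete, here is a back-of-the-envelope sketch of what re-sending context on every loop step costs. The price per million input tokens, the step count, and the per-step growth are illustrative assumptions, not figures from any provider's price list.

```python
def loop_input_tokens(history_tokens: int, steps: int, growth_per_step: int = 0) -> int:
    """Total input tokens billed when an agent loop re-sends its context every step.

    history_tokens: tokens sent on the first step.
    growth_per_step: tokens appended to the context after each step
                     (tool results, intermediate replies).
    """
    return sum(history_tokens + i * growth_per_step for i in range(steps))


def input_cost_usd(tokens: int, usd_per_million: float) -> float:
    """Input-side cost only; output tokens are ignored for simplicity."""
    return tokens / 1_000_000 * usd_per_million


# Assumed numbers, for illustration only.
PRICE = 3.00  # hypothetical USD per million input tokens
full = loop_input_tokens(history_tokens=100_000, steps=10, growth_per_step=2_000)
lean = loop_input_tokens(history_tokens=4_000, steps=10, growth_per_step=2_000)

print(f"full history: {full:,} tokens, ${input_cost_usd(full, PRICE):.2f} per task")
print(f"curated:      {lean:,} tokens, ${input_cost_usd(lean, PRICE):.2f} per task")
```

Under these assumptions a ten-step task bills roughly 1.09M input tokens with the full history against 130K with a curated 4K-token memory block. The loop, not the single call, is what makes the difference compound.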

A taxonomy worth memorizing

SHORT-TERM (inside the context window): Working (the live task: goal, recent turns, tool results, scratchpad)
LONG-TERM (external stores, retrieved on demand): Episodic (what happened: past sessions, decisions, outcomes); Semantic (facts: user prefs, entities, domain knowledge); Procedural (how-to: learned rules, playbooks, refined instructions)

Figure 5.1 — Four memory types. Working memory lives in-context; the rest live outside and are retrieved.

  • Working memory — the current task's context window: goal, recent turns, tool results, a scratchpad. Managed by compaction, not storage.
  • Episodic memory — a record of what happened — past sessions, decisions, outcomes. Powers 'as we discussed last week' continuity and post-hoc audits.
  • Semantic memory — distilled facts: the user prefers Arabic invoices, the client's fiscal year ends in March. Small, dense, high-value.
  • Procedural memory — learned how-to: refined instructions, playbooks, the agent's own improved prompts. The least built, highest-leverage layer.
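One way to make the taxonomy operational is to type every record at write time. The sketch below is a minimal in-memory illustration, not a production store; the class and field names (`MemoryKind`, `MemoryRecord`, `TypedMemory`) are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class MemoryKind(Enum):
    WORKING = "working"        # in-context, managed by compaction
    EPISODIC = "episodic"      # what happened
    SEMANTIC = "semantic"      # distilled facts
    PROCEDURAL = "procedural"  # learned how-to


@dataclass
class MemoryRecord:
    kind: MemoryKind
    content: str
    source: str  # where the record came from, e.g. a session id
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class TypedMemory:
    """One list per long-term memory type; working memory is excluded on purpose."""

    def __init__(self) -> None:
        self._stores: dict[MemoryKind, list[MemoryRecord]] = {
            k: [] for k in MemoryKind if k is not MemoryKind.WORKING
        }

    def write(self, record: MemoryRecord) -> None:
        if record.kind is MemoryKind.WORKING:
            raise ValueError("working memory lives in the context window, not in a store")
        self._stores[record.kind].append(record)

    def read(self, kind: MemoryKind) -> list[MemoryRecord]:
        # Return a copy so callers cannot mutate the store by accident.
        return list(self._stores.get(kind, []))


memory = TypedMemory()
memory.write(MemoryRecord(MemoryKind.SEMANTIC, "User prefers Arabic invoices", source="session-12"))
memory.write(MemoryRecord(MemoryKind.EPISODIC, "Agreed to move the review to Friday", source="session-12"))
print([r.content for r in memory.read(MemoryKind.SEMANTIC)])
```

Keeping the types separate lets each one carry its own retention and retrieval rules later, which a single undifferentiated store makes hard.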

The memory pipeline

Every serious memory system — whatever its storage engine — implements the same five-stage pipeline. The quality differences hide in stages two and four: what gets extracted as worth remembering, and how retrieval combines semantic similarity with recency, graph relationships, and keywords.

Capture (turns, tool results, outcomes) → Extract (LLM distills facts and events) → Store (vectors, graph, key-value) → Retrieve (semantic + temporal + graph) → Inject (only what this step needs)
Feedback: usage signals update salience; stale facts decay or are invalidated.

Figure 5.2 — The universal memory pipeline. Extraction and retrieval quality decide everything.
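The five stages can be sketched end to end in a few dozen lines. This is a toy under loud assumptions: extraction is a string rule standing in for the LLM step, token overlap stands in for embedding similarity, the graph signal is omitted entirely, and the 0.7/0.3 weighting and one-week half-life are arbitrary illustrative values. All function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Fact:
    text: str
    written_at: float      # seconds since some epoch
    salience: float = 1.0  # raised by the feedback loop when the fact is used


def tokens(text: str) -> set[str]:
    return set(text.lower().split())


def capture(turn: str, log: list[str]) -> None:
    log.append(turn)


def extract(turn: str) -> list[str]:
    # Stand-in for LLM extraction: keep only sentences flagged with "fact:".
    facts = []
    for sentence in turn.split("."):
        sentence = sentence.strip()
        if sentence.lower().startswith("fact:"):
            facts.append(sentence[len("fact:"):].strip())
    return facts


def store(texts: list[str], now: float, db: list[Fact]) -> None:
    db.extend(Fact(t, written_at=now) for t in texts)


def score(fact: Fact, query: str, now: float, half_life_s: float = 7 * 86400) -> float:
    q, f = tokens(query), tokens(fact.text)
    overlap = len(q & f) / len(q | f) if q | f else 0.0       # similarity proxy
    recency = 0.5 ** ((now - fact.written_at) / half_life_s)  # halves every half-life
    return overlap * (0.7 + 0.3 * recency) * fact.salience


def retrieve(query: str, now: float, db: list[Fact], k: int = 2) -> list[Fact]:
    ranked = sorted(db, key=lambda f: score(f, query, now), reverse=True)
    hits = [f for f in ranked[:k] if score(f, query, now) > 0]
    for f in hits:
        f.salience += 0.1  # feedback: facts that get used become more salient
    return hits


def inject(hits: list[Fact], task: str) -> str:
    memory_block = "\n".join(f"- {f.text}" for f in hits)
    return f"Relevant memory:\n{memory_block}\n\nTask: {task}"


log: list[str] = []
db: list[Fact] = []
turn = "Thanks for the call. fact: client fiscal year ends in March. fact: user prefers Arabic invoices."
capture(turn, log)
store(extract(turn), now=0.0, db=db)
hits = retrieve("when does the fiscal year end", now=86400.0, db=db, k=1)
prompt = inject(hits, "Draft the year-end reminder")
print(prompt)
```

Even in this toy, the two stages the text singles out do the real work: `extract` decides what is ever eligible to be remembered, and `score` decides which of those memories reaches the prompt.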

The 2026 systems, honestly compared

Five architectures dominate, and vendor benchmarks disagree enough that you should treat all of them as directional. The widely used LOCOMO benchmark sparked a public dispute: Mem0's peer-reviewed paper reported strong wins on token efficiency and temporal reasoning, while Zep published a rebuttal claiming misconfiguration and a materially higher corrected score. The honest takeaways that survive the crossfire: selective extraction beats raw history by a wide margin; temporal modeling is where naive vector stores fail hardest (one study measured graph-based memory nearly tripling a major provider's built-in memory on time-sensitive questions, largely because the latter never timestamped facts); and graph construction buys reasoning power at real token and latency cost.

| System | Architecture | Strongest at | Trade-off to plan for |
| --- | --- | --- | --- |
| Mem0 | Hybrid vector + graph + KV; user / session / agent scopes | Drop-in personalization; very token-lean (single-digit-K tokens per conversation reported) | A memory store, not a platform: pipelines and connectors are on you |
| Zep (Graphiti) | Temporal knowledge graph | Time-aware reasoning, evolving entities ('used to live in...'), long-running enterprise sessions | Graph building is heavy; ingestion is async, so just-written facts may lag |
| Letta (MemGPT) | OS-style tiers; the agent edits its own memory | Long-horizon autonomous agents that must self-manage context | The LLM is in the control loop: more flexible, less predictable |
| LangMem | Vector-first, LangGraph-native | Teams already on LangChain/LangGraph wanting least-resistance memory | Simpler retrieval; weaker on temporal and relational queries |
| Cognee | Structured memory graphs from data + chats | Institutional knowledge: building a queryable graph of customers, docs, history | More setup; closer to a knowledge-engineering project |
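The temporal-modeling point above can be illustrated with a minimal sketch: each fact carries a validity interval, and a new value for the same subject and attribute closes the old interval instead of overwriting it. This illustrates the idea only; it is not how Zep, Graphiti, or any other system in the table implements it, and every name here is hypothetical. It also assumes facts are asserted in chronological order.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class TimedFact:
    subject: str
    attribute: str
    value: str
    valid_from: datetime
    valid_to: Optional[datetime] = None  # None means "still true"


class TemporalFacts:
    def __init__(self) -> None:
        self.facts: list[TimedFact] = []

    def assert_fact(self, subject: str, attribute: str, value: str, at: datetime) -> None:
        # Close the currently valid fact for this (subject, attribute), if any.
        for f in self.facts:
            if f.subject == subject and f.attribute == attribute and f.valid_to is None:
                f.valid_to = at
        self.facts.append(TimedFact(subject, attribute, value, valid_from=at))

    def as_of(self, subject: str, attribute: str, at: datetime) -> Optional[str]:
        # Return the value whose validity interval contains `at`, if one exists.
        for f in self.facts:
            if (f.subject == subject and f.attribute == attribute
                    and f.valid_from <= at
                    and (f.valid_to is None or at < f.valid_to)):
                return f.value
        return None


kb = TemporalFacts()
kb.assert_fact("alice", "city", "Cairo", datetime(2024, 1, 10))
kb.assert_fact("alice", "city", "Dubai", datetime(2025, 6, 1))

print(kb.as_of("alice", "city", datetime(2024, 12, 31)))  # Cairo
print(kb.as_of("alice", "city", datetime(2025, 12, 31)))  # Dubai
```

A store that keeps only the latest value can answer "where does Alice live?" but not "where did she live last year?", which is the class of question where untimestamped memory falls down.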

The only benchmark that matters is yours

Run your own bake-off. Pick two systems, load a week of your real transcripts, and test the queries your product actually needs, especially temporal ones ('what changed since the last order?'). Published benchmarks were not run on your data shape, and the 15-point swings between architectures on temporal retrieval are larger than most vendors' headline differences.
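A bake-off harness can be very small. The sketch below assumes each candidate system is wrapped in a plain function that takes a query and returns retrieved text snippets. The wrappers, the sample cases, and the substring-match scoring are all hypothetical placeholders for your own adapters and grading.

```python
from typing import Callable

Retriever = Callable[[str], list[str]]  # query -> retrieved snippets


def hit_rate(retriever: Retriever, cases: list[tuple[str, str]]) -> float:
    """Fraction of cases where the expected answer appears in any retrieved snippet."""
    hits = 0
    for query, expected in cases:
        snippets = retriever(query)
        if any(expected.lower() in s.lower() for s in snippets):
            hits += 1
    return hits / len(cases) if cases else 0.0


def bake_off(systems: dict[str, Retriever], cases: list[tuple[str, str]]) -> dict[str, float]:
    return {name: hit_rate(fn, cases) for name, fn in systems.items()}


# Toy stand-ins; replace with adapters around the two systems under test.
def system_a(query: str) -> list[str]:
    return ["Order 1042 changed the delivery address to Riyadh"]


def system_b(query: str) -> list[str]:
    return ["Customer placed order 1042"]


# (query, text that a correct retrieval must contain), drawn from real transcripts.
cases = [
    ("what changed since the last order?", "delivery address"),
    ("which order is most recent?", "1042"),
]
results = bake_off({"system_a": system_a, "system_b": system_b}, cases)
print(results)
```

Substring matching is the crudest possible grader. It is enough to expose large gaps on temporal queries, and it can be replaced with human or model grading once the shortlist is down to two.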

Context engineering: the other half of memory

Anthropic's applied-AI team popularized the framing that the scarce resource is not storage but attention: every token in context competes for it. Memory retrieval decides what enters the window; context engineering decides what stays. The production toolkit:

  • Compaction — when the window passes a threshold (60-70% is a common trigger), summarize the oldest turns into a brief and continue. Preserve decisions, constraints, and open questions verbatim; compress the chatter.
  • Structured scratchpads — have the agent maintain an explicit plan / progress / blockers note instead of re-deriving state from raw history each loop — cheaper and more reliable.
  • Context isolation by sub-agent — give each worker only its slice (Chapter 6). A researcher doesn't need the billing thread.
  • Just-in-time retrieval — fetch documents when a step needs them and drop them after, rather than pinning everything for the whole run.
  • Write policy + decay — decide what is worth remembering at write time, attach timestamps and sources, and let unused or contradicted facts expire. The classic failure — an assistant confidently using a fact the user corrected months ago — is an invalidation bug, not a retrieval bug. Memory quality is an editorial function: the system that remembers everything understands nothing.
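Threshold-triggered compaction, the first item in the list above, can be sketched as follows. Token counting by whitespace split and the `summarize` callable are stand-ins for the model's tokenizer and an LLM call. The 0.65 threshold is simply a value inside the 60-70% range mentioned above, and the `pinned` flag is a hypothetical way to mark decisions, constraints, and open questions that must survive verbatim.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Turn:
    text: str
    pinned: bool = False  # decisions, constraints, open questions: keep verbatim


def count_tokens(text: str) -> int:
    return len(text.split())  # crude proxy; swap in the model's tokenizer


def compact(turns: list[Turn], window: int, summarize: Callable[[list[str]], str],
            threshold: float = 0.65, keep_recent: int = 2) -> list[Turn]:
    used = sum(count_tokens(t.text) for t in turns)
    if used <= threshold * window or len(turns) <= keep_recent:
        return turns  # under the trigger: leave history untouched

    cut = len(turns) - keep_recent
    old, recent = turns[:cut], turns[cut:]
    pinned = [t for t in old if t.pinned]
    chatter = [t.text for t in old if not t.pinned]
    brief = [Turn("Summary of earlier turns: " + summarize(chatter))] if chatter else []
    # Brief first, then verbatim pinned items, then the untouched recent turns.
    return brief + pinned + recent


def fake_summarize(texts: list[str]) -> str:
    return f"{len(texts)} turns of discussion"  # placeholder for an LLM call


history = [
    Turn("hello there how are you today"),
    Turn("DECISION: ship invoices in Arabic", pinned=True),
    Turn("some long back and forth about formatting options and fonts"),
    Turn("what about the footer"),
    Turn("footer is fine as is"),
]
compacted = compact(history, window=40, summarize=fake_summarize)
for t in compacted:
    print(t.text)
```

The design choice worth copying is the split between what is summarized and what is pinned: the summary can be lossy because the items that must not drift are never passed through it.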