Kai Hirota


The Missing Step in Reliable Agent Memory

A small query-time walk over declared derivation links closes the benchmark's hardest failure family, deterministically, across every storage backend I tested.

The hardest scenarios in the benchmark are not the ones where the memory has the wrong answer. They are the ones where the memory has the right answer, with the right citation, and still cannot tell that the answer is no longer safe to act on. The benchmark calls this family E.5, cascading staleness, and every base system I tested scored zero on it.

This is the third piece in a three-part series. The first piece framed the problem: a memory can return an answer that is still in the store, still plausible, and still resting on a premise that has been quietly superseded. The second piece showed that which API the agent uses against the same store decides whether the dependency structure behind a claim is even visible — structured access lifts you into the high 350s out of 395, but it still won't catch the case where a premise three hops back has changed. That gap is what this article is about.

Here is the standalone version of the example. The store has an answer claim — Maya Patel has signing authority for Apple legal matters. That claim derives from Maya Patel is chief of staff, which derives from Tim Cook is CEO. Later, on Jun 1, the memory learns that Sarah Chen became CEO. The answer claim has not disappeared, but one upstream premise is no longer current.

The fix is a query-time walk over declared derived_from links. It is small enough to write in an afternoon, and the score lift is the same on every substrate that admits it.

Score lift

Before / after the premise walk.

Row-store (SQLite): 359 → 388 (+29). Mem0: 359 → 388 (+29). Graphiti: 354 → 383 (+29). Sonnet 4.6 (thinking on): 359 → 388 (+29). Axis: 0 to 395; series: before walk, after walk.
Headline-corpus scores out of 395, before and after adding a query-time walk over declared derivation links. The lift is consistent across stores because the algorithm is the same; the store only determines how the edges are persisted.

The Walk#

A derived_from link is just an authored field on each claim: when the ingest step records a new claim that was inferred from an earlier one, it writes down which earlier claim it depended on. The procedure is: retrieve the candidate claim. Walk back along those declared derived_from edges. For each premise on the chain, ask the store what the current value is for that (entity, predicate) pair under the query's valid-time and transaction-time bounds. If any premise has been superseded, flip the answer's epistemic status to POTENTIALLY_STALE.

Walking the chain, one premise at a time.

Step 1

Start at the answer claim.

The query returns Maya Patel's signing authority. A safe interface treats this as the start of a walk, not the end of one.

Step 2

Hop back along derived_from.

The signing-authority claim was derived from a chief-of-staff claim, which was derived from a CEO claim. The walk follows the declared dependency edges.

Step 3

Compare each premise to the current value.

For every (entity, predicate) on the chain, ask the store: who is current under the query time bounds? Tim Cook was the premise. The current premise is Sarah Chen.

Step 4

If any premise was superseded, flip the status.

The answer claim is not deleted. It is still returned. But the epistemic status changes from UNVERIFIED to POTENTIALLY_STALE, because its support chain now passes through a superseded premise.

That is all of it. There is no machine learning, no embedding, no reranker, no calibration. The whole thing is a recursive lookup conditional on a deterministic time filter.
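The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's implementation: the `Claim` fields, the claim ids, the single-parent `derived_from`, and the 2027 dates are all assumptions made for the example, and only valid time is modelled here.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Claim:
    claim_id: str
    entity: str
    predicate: str
    value: str
    valid_from: str                     # ISO date; hypothetical field name
    derived_from: Optional[str] = None  # id of the premise claim, if any

def current_value(claims, entity, predicate, as_of):
    """Latest value for (entity, predicate) with valid_from <= as_of."""
    matches = [c for c in claims
               if c.entity == entity and c.predicate == predicate
               and c.valid_from <= as_of]
    return max(matches, key=lambda c: c.valid_from).value if matches else None

def premise_walk(claims, answer_id, as_of):
    """Walk back along derived_from; flip the status if a premise changed."""
    by_id = {c.claim_id: c for c in claims}
    node, seen = by_id[answer_id], set()
    while node.derived_from is not None and node.derived_from not in seen:
        seen.add(node.derived_from)          # guards against a cyclic chain
        premise = by_id[node.derived_from]
        now = current_value(claims, premise.entity, premise.predicate, as_of)
        if now != premise.value:
            return "POTENTIALLY_STALE"
        node = premise
    return "UNVERIFIED"

claims = [
    Claim("c1", "Apple", "ceo", "Tim Cook", "2027-01-01"),
    Claim("c2", "Maya Patel", "role", "chief of staff", "2027-01-01", "c1"),
    Claim("c3", "Maya Patel", "signing_authority", "Apple legal matters",
          "2027-01-01", "c2"),
    Claim("c4", "Apple", "ceo", "Sarah Chen", "2027-06-01"),
]

after_update = premise_walk(claims, "c3", as_of="2027-07-01")
before_update = premise_walk(claims, "c3", as_of="2027-03-01")
```

The answer claim `c3` is never deleted or rewritten in this sketch; only the status returned alongside it changes once the CEO premise two hops back has a newer current value.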

Four Substrates, One Algorithm#

What surprised me was how cleanly the procedure transfers. I implemented it on a SQLite row store as a recursive SQL query, on Mem0 by walking the metadata.derived_from field in Python, on Graphiti by walking the same kind of field on its edge objects, and on the model as a prompt rule with extended thinking turned on (a setting that lets the model spend more tokens on internal reasoning before answering). The same lift appears in all four.

Same algorithm

Four substrates. One walk. The same lift.

The walk: claims A, B, C chained by derived_from links.

  • Row-store (SQLite): recursive SQL query over claim_derivations. Before 359, after 388, lift +29.
  • Mem0: walk metadata.derived_from. Before 359, after 388, lift +29.
  • Graphiti: walk EntityEdge.attributes. Before 354, after 383, lift +29.
  • Sonnet 4.6 (in-context): prompt rule + extended thinking. Before 359, after 388, lift +29.
The walk is implemented as a recursive SQL query on SQLite, an in-Python edge walk on Mem0 and Graphiti, and a prompt rule for the Sonnet 4.6 model with extended thinking already on. All four close cascading staleness on the headline corpus. The constant lift is the point: this is a read-path algorithm, not a storage-engine feature, and it transfers cleanly across stores.

This is the central read of the result: catching cascading staleness takes a read-path algorithm, not a storage feature. As long as the write path has authored the dependency edges and the read path is willing to walk them, the execution substrate beneath does not matter.
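For the row-store case, the walk might look like the recursive query below, run through Python's standard `sqlite3` module. The `claim_derivations` table name comes from the article; its columns, the `claims` schema, and the sample rows (including the 2027 dates) are assumptions for illustration, and only valid time is modelled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE claims (
    claim_id TEXT PRIMARY KEY, entity TEXT, predicate TEXT,
    value TEXT, valid_from TEXT
);
CREATE TABLE claim_derivations (claim_id TEXT, premise_id TEXT);
INSERT INTO claims VALUES
    ('c1', 'Apple', 'ceo', 'Tim Cook', '2027-01-01'),
    ('c2', 'Maya Patel', 'role', 'chief of staff', '2027-01-01'),
    ('c3', 'Maya Patel', 'signing_authority', 'Apple legal', '2027-01-01'),
    ('c4', 'Apple', 'ceo', 'Sarah Chen', '2027-06-01');
INSERT INTO claim_derivations VALUES ('c3', 'c2'), ('c2', 'c1');
""")

# Collect every premise reachable from the answer, then keep the ones that
# have a different, later value inside the query's valid-time bound.
STALE_PREMISES = """
WITH RECURSIVE chain(premise_id) AS (
    SELECT premise_id FROM claim_derivations WHERE claim_id = :answer
    UNION
    SELECT d.premise_id
    FROM claim_derivations d JOIN chain ON d.claim_id = chain.premise_id
)
SELECT p.claim_id
FROM chain JOIN claims p ON p.claim_id = chain.premise_id
WHERE EXISTS (
    SELECT 1 FROM claims newer
    WHERE newer.entity = p.entity AND newer.predicate = p.predicate
      AND newer.valid_from > p.valid_from
      AND newer.valid_from <= :as_of
      AND newer.value <> p.value
)
"""

def status(answer_id, as_of):
    rows = conn.execute(
        STALE_PREMISES, {"answer": answer_id, "as_of": as_of}
    ).fetchall()
    return "POTENTIALLY_STALE" if rows else "UNVERIFIED"

status_july = status("c3", "2027-07-01")
status_march = status("c3", "2027-03-01")
```

Using `UNION` rather than `UNION ALL` in the recursive step deduplicates visited premises, so a cyclic derivation graph still terminates.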

The Specificity Trap#

The naive fix is to flag any claim whose support chain touches a superseded premise. That catches every cascading-staleness case, but it also catches a lot of cases that aren't actually stale.

Concretely: suppose the agent asks "did Maya have signing authority in January?" The support chain still passes through "Tim Cook is CEO" — a premise that was later superseded, when Sarah Chen became CEO on Jun 1. A walk that flags every supersession in the chain would mark the January claim POTENTIALLY_STALE. But for the period being asked about, the chain was sound. The supersession sits in the future relative to the query.

This is the specificity test: 28 scenarios where a real supersession exists, but it lies outside the bitemporal scope the agent is asking about. In the January example, it is outside the valid-time window; in a late-correction case, it may be outside the transaction-time knowledge cutoff. A correct walk leaves these alone.

The denominator differs from the headline E.5 count because E.6 is a matched real-entity control: 28 real-entity stale cases get 28 near-misses. The headline E.5 count is 29 because it also includes the Acme worked-example case.

Valid-time scope

Figure: the query's valid-time bound moved across the supersession, on an axis running from Jan 2027 to Jul 2028.

If query bound < supersession (E.6)

The supersession is outside the query's valid-time scope. Status: UNVERIFIED.

If query bound ≥ supersession (E.5)

A premise was superseded inside scope. Status: POTENTIALLY_STALE.

This view moves the query's valid-time bound; the same walk is also scoped by the transaction-time knowledge cutoff. Marking everything stale near a supersession would fail the E.6 specificity control — 28 out of 28 near-misses where the supersession exists in the world but lies outside the bitemporal scope the agent was asking about.

So the walk has to be scoped to the query's bitemporal bounds. For each premise, compare against the value that is current for the same (entity, predicate) at the query's valid time $T_v$ and known by the query's transaction time $T_t$. A supersession that starts after the period being audited, or that the memory had not learned by the transaction-time cutoff, should not make the January answer stale.
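A sketch of that scoped comparison, under stated assumptions: each stored version carries a `valid_from` date and a `recorded_at` date (both hypothetical field names), `t_v` and `t_t` stand in for $T_v$ and $T_t$, and the 2027 dates are invented for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Version:
    entity: str
    predicate: str
    value: str
    valid_from: str   # when the fact starts to hold in the world
    recorded_at: str  # when the memory learned it (transaction time)

def value_as_of(versions, entity, predicate, t_v, t_t):
    """Value holding at valid time t_v, using only rows recorded by t_t."""
    visible = [v for v in versions
               if (v.entity, v.predicate) == (entity, predicate)
               and v.valid_from <= t_v and v.recorded_at <= t_t]
    if not visible:
        return None
    # Latest valid_from wins; ties are broken by the most recent recording.
    return max(visible, key=lambda v: (v.valid_from, v.recorded_at)).value

def premise_is_superseded(versions, premise, t_v, t_t):
    current = value_as_of(versions, premise.entity, premise.predicate, t_v, t_t)
    return current is not None and current != premise.value

cook = Version("Apple", "ceo", "Tim Cook", "2027-01-01", "2027-01-01")
chen = Version("Apple", "ceo", "Sarah Chen", "2027-06-01", "2027-06-01")
history = [cook, chen]

# January question: the supersession lies after the valid-time bound.
january = premise_is_superseded(history, cook, "2027-01-31", "2027-12-31")
# July question: the supersession is inside both bounds.
july = premise_is_superseded(history, cook, "2027-07-01", "2027-12-31")
# July question, but the memory is replayed as of May, before it learned of Chen.
july_known_in_may = premise_is_superseded(history, cook, "2027-07-01", "2027-05-01")
```

Only the middle query should flip the answer's status; the other two are the near-miss shape, where a real supersession exists but falls outside one of the two time bounds.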

Reasoning Alone Doesn't Close the Gap#

Before turning to whether a model can run the walk by itself, it's worth ruling out the simpler hypothesis: can you just give the model more reasoning budget and have it solve E.5 on its own? I ran three within-subject sweeps — same model, same prompt, varying only the reasoning budget knob the system exposes.

Within-subject reasoning controls

Score moves with reasoning budget. E.5 doesn't.

  • GPT-5-mini · reasoning effort (E.5 0/29): low 347, medium 347, high 347. No change across the sweep.
  • Sonnet 4.6 · extended thinking (E.5 0/29): off 276, on (8K budget) 359. +83 with thinking on.
  • Honcho 3 Dialectic Agent · reasoning level (E.5 0/29): minimal 44, low 45, medium 44, high 44, max 48. 4-scenario spread across all five levels.

All scores out of 395 (full).
Three models, three sweeps that vary only the reasoning budget. Total scores move by up to 83 scenarios on the Sonnet 4.6 thinking-on/off pair. Across every setting in every group, E.5 stays at zero out of 29 — you cannot reason your way past cascading staleness without a procedure to actually walk the chain.

Total scores move. Sonnet 4.6 picks up 83 scenarios when extended thinking is turned on; GPT-5-mini is bit-identical across its three effort levels; Honcho 3 wanders within a four-scenario band. But the E.5 column is locked at 0/29 in every single condition. Reasoning budget alone is not what unlocks cascading staleness: you need a procedure that actually walks the chain.

When the LLM Walks the Chain Itself#

If the storage path will not walk the chain, the model has to. I tested this by giving capable long-context LLMs the source documents plus a prompt rule that described the walk, and asking them to execute it themselves. Two pass/fail tests separate the results: did the system catch the genuinely stale case (E.5, sensitivity), and did it leave the near-miss alone (E.6, specificity)?

Sensitivity vs specificity

Thirteen systems queued for two tests: each system in its default setup, plus six variants that walk the dependency chain at query time.

Legend: no walk (base); no walk (LLM); walk on storage; walk in prompt (capable); walk in prompt (over-marks).
Two tests with simple pass/fail outcomes on the 28 matched real-entity pairs. E.5 (sensitivity) asks whether the system flagged the genuinely stale case; E.6 (specificity) asks whether it left the near-miss alone. Storage walks land in the top-right cell deterministically. Prompt walks depend on how carefully the model can reason — one variant slips into the bottom-right cell because it marks every near-miss stale.

GPT-5-mini and Sonnet 4.6 with extended thinking on match the storage walk on both tests. Sonnet 4.6 with extended thinking off closes E.5 but marks every near-miss stale — it catches the real case, but it also raises false alarms on cases that aren't actually stale. Smaller models miss both axes. Storage walks land in the top-right cell deterministically: every E.5 caught, every near-miss preserved.
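The two tests can be scored with a few lines of Python. The sketch below is a toy, not the benchmark harness: the three predictors, the case fields (`superseded_at`, `query_bound`), and the three matched pairs are invented to show how a system lands in each cell.

```python
STALE = "POTENTIALLY_STALE"

def no_walk(case):
    # Never inspects the chain, so it never flags anything.
    return "UNVERIFIED"

def over_marking_walk(case):
    # Flags any supersession on the chain, regardless of the query scope.
    return STALE if case["superseded_at"] is not None else "UNVERIFIED"

def scoped_walk(case):
    # Flags only a supersession at or before the query's valid-time bound.
    hit = (case["superseded_at"] is not None
           and case["superseded_at"] <= case["query_bound"])
    return STALE if hit else "UNVERIFIED"

def score_pairs(predict, pairs):
    """Return (E.5 caught, E.6 near-misses preserved, number of pairs)."""
    caught = sum(predict(stale) == STALE for stale, _ in pairs)
    preserved = sum(predict(near) != STALE for _, near in pairs)
    return caught, preserved, len(pairs)

# Each pair: (genuinely stale case, matched near-miss). Dates are made up.
pairs = [
    ({"superseded_at": "2027-06-01", "query_bound": "2027-07-01"},
     {"superseded_at": "2027-06-01", "query_bound": "2027-01-31"}),
    ({"superseded_at": "2027-09-15", "query_bound": "2028-01-01"},
     {"superseded_at": "2027-09-15", "query_bound": "2027-09-01"}),
    ({"superseded_at": "2028-02-01", "query_bound": "2028-02-01"},
     {"superseded_at": "2028-02-01", "query_bound": "2028-01-31"}),
]

results = {fn.__name__: score_pairs(fn, pairs)
           for fn in (no_walk, over_marking_walk, scoped_walk)}
```

In this toy, the over-marking predictor catches every stale case and preserves no near-miss, which is the bottom-right cell described above; only the scoped predictor passes both tests on every pair.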

The Full Scoreboard#

Pulling all the systems onto one page: where does each one end up, and what does the walk lift do to the systems that admit it?

Headline scoreboard

All evaluated systems, sorted by full-rubric pass rate.

full = 395

  • Reference (DuckDB): 395
  • Mem0 (OSS), Mode A: 359 → 388
  • Row-store baseline (SQLite, no walk): 359 → 388
  • Deterministic graph (NetworkX, no LLM): 359
  • Mem0 (OSS), Mode C (LLM dispatch): 357
  • Graphiti (OSS), Mode A: 354 → 383
  • GPT-5.5 (in-context): 347
  • Graphiti (OSS), Mode C (LLM dispatch): 345
  • Sonnet 4.6 (in-context, thinking off): 276 → 388
  • TencentDB-Agent-Memory, L0: 162
  • Zep Cloud (hosted, edge-scope): 160
  • Mem0 (OSS), Mode B: 149
  • Zep Cloud (hosted, scope=auto): 121
  • Supermemory, NL: 110
  • Hindsight, recall: 101
  • Cognee, NL: 63
  • Graphiti (OSS), Mode B: 49
  • Honcho 3, NL (Dialectic, medium): 44
  • Mem0 Platform v3 (hosted): 22

Legend: Reference (oracle); Deployed memory system; Hosted product (native API); No-LLM control; In-context LLM; RAG baseline; premise-walk lift where measured (shown as before → after).
Full-rubric pass rate on the 395-scenario headline corpus. The dashed extension on a bar shows where a system lands after the query-time premise walk is added. The reference DuckDB row defines the interface; everything below it is what existing systems return on the same questions.

The DuckDB row at the top is the interface specification — a roughly fifty-line SQL store implementing the formal validity predicate. It defines what 395/395 means; it is not a competitor. Everything below it is a system you might actually deploy. The dashed extensions show the read-path walk lift on the systems where I measured it.

The strongest structured and no-LLM non-walk rows in the chart cap near 359/395 — a 36-scenario gap dominated by the cascading-staleness family (E.5: 29 scenarios). The lower-scoring NL and agentic-memory rows fail additional retrieval or citation axes before the walk even matters. The Before / after the premise walk figure at the top of this article shows what closing the E.5-dominated gap looks like substrate by substrate.

A concrete E.5 failure, using the Apple example above: after the Jun 1 CEO update, ask “does Maya Patel still have signing authority for Apple legal matters?” The signing-authority claim is in the store with its derived_from link pointing at the chief-of-staff claim, which points back at the Jan 1 “Tim Cook is CEO” premise. The system returns yes, with the right citation. The Jun 1 update where Sarah Chen becomes CEO never enters the answer. A reader of the answer cannot tell that the support chain has changed — only the status on the answer should have flipped from UNVERIFIED to POTENTIALLY_STALE.

A few things in the chart are worth pointing out beyond the obvious.

The walk contributes a constant lift of 29 scenarios across substrates. SQLite row-store 359 → 388, Mem0 A 359 → 388, Graphiti A 354 → 383, Sonnet 4.6 (with extended thinking) 359 → 388. Same procedure, same delta, three storage backends plus an in-context LLM. The walk is substrate-agnostic in a strict sense: the work it does is bounded by the corpus, not by what is underneath.

The Sonnet path has two effects, not one. Sonnet 4.6 in-context starts at 276/395. Turning on extended thinking lifts that to 359/395 (Δ83) without touching E.5 at all — the model just gets better at the rest of the corpus. The chart above isolates the second step: adding the premise-walk prompt on top contributes the same Δ29 the storage walks deliver elsewhere, taking it to 388. The walk and the reasoning budget are doing different jobs; you need both for Sonnet to land at the ceiling.

Reasoning budget alone does not substitute for the walk. Three within-subject controls in the paper hold the model fixed and vary one factor: GPT-5-mini reasoning effort (low / medium / high), Sonnet 4.6 extended thinking (off / on), Honcho 3 reasoning level across five values. Total scores move — sometimes by +83 scenarios — but the E.5 family stays locked at 0/29 in every condition. Whatever extra computation the model spends, it isn't finding the cascading-staleness label on its own.

The walk is adding status, not accuracy. For systems whose base answer+citation score was already high (Mem0 A: 389/395 on answer+citation), the +walk answer+citation score does not move (389 → 389), while the full-rubric score jumps (359 → 388). The walk is closing the gap between “I have the right answer” and “I can also say whether the answer is still safe to use”: a pure epistemic-status fix, not a retrieval one.

The score-vs-cost Pareto frontier is the no-LLM walk. The highest-scoring non-reference row on the board (388/395) is also one of the cheapest: row-store SQLite + recursive-CTE walk runs in under a millisecond per query at zero tokens. The GPT-5.5 in-context row reaches 347/395 at roughly 5 s per query with meaningful token spend; agentic NL retrieval (Mem0 Mode B) costs about $0.79 per query for 149/395. High-scoring rows cluster in the 347–395 band but span roughly five orders of magnitude in cost. The walk's leverage is that it doesn't depend on the LLM at all.

Do the Hosted Products Do Better?#

The rows above run the open-source libraries locally, behind a thin wrapper that applies the benchmark's time-and-evidence filter. The fair next question is whether the managed products, the ones you would actually buy, do better through their own native interfaces. So I added three hosted rows: Zep Cloud under edge-scope date filters and under its scope=auto context block, and Mem0 Platform v3 on its temporal-reasoning surface.

They do not. The native surfaces land far below the OSS Mode A wrapper-oracle — Zep Cloud 160/395 and Mem0 Platform v3 22/395, against 354 and 359 — and the reason is the more interesting part.

Hosted current products

Strong retrieval, weak audit — through a native interface.

Legend: full-rubric audit; retrieved, but fails the citation / epistemic axes; OSS wrapper-oracle contrast.
Out of 395 headline scenarios. Zep Cloud's native valid_at / created_at date filters and provenance-to-source retrieve the correct claim on ~309/395, but only 160 survive the full audit — the dashed remainder is the citation and epistemic-status axes the managed surface does not expose. Run locally with a wrapper-side filter (Mem0 OSS Mode A), retrieval and audit nearly coincide.

Zep markets the surface as “context you can trace, filter, and trust”, and the first two of those are real. Every stored fact links back to the source episode it was read from, and you can filter retrieval by valid_at / created_at date ranges and by metadata such as a verified flag. That is exactly why Zep Cloud retrieves the correct claim on ~309/395 — better than any other deployed comparator in the chart. But trust under change is a separate primitive from trace and filter. Strong retrieval does not confer bitemporal-audit completeness: only 160 of those 309 survive the full rubric, and the dashed remainder is the citation and epistemic-status axes. The scope=auto context block retrieves marginally more (318) but audits less (121), because its cross-scope summary underserves structured citation.

And on the hardest family, the hosted products fail exactly the way everything else does. The shared E.5 gap now spans nine product surfaces — Graphiti and Mem0 open-source, the hosted Zep Cloud and Mem0 Platform v3 current products, plus TencentDB-Agent-Memory, Cognee, Honcho 3, Supermemory, and Hindsight. The hosted Zep Cloud surface is the sharpest illustration: its native date filters retrieve the E.5 answer on most instances, and it still returns UNVERIFIED, never POTENTIALLY_STALE. The products are not failing to find the claim. They are returning it without a signal that its support has moved.

What This Result Doesn't Claim#

The walk works conditional on a write path that authors the derivation edges. The benchmark assumes that input. If your ingest step does not preserve derived_from, the read-path procedure has nothing to walk.

Scope

What the benchmark measures, and what it doesn't.

Write path (out of scope): document arrives → extraction (LLM or other) → claim + derived_from emitted → persisted to store.
Store.
Read path (measured): query received → retrieve candidate claim → walk derived_from closure → compare each premise to current → emit answer + status.
The benchmark assumes the write path has already authored the derivation edges. It measures whether the read path uses them. Extraction quality is a separate problem; better extraction does not help if retrieval never walks the chain.

That is a separate, parallel problem — write-path extraction quality — and it has its own substantial literature. What the benchmark establishes is that, given the edges, the read path is a small and store-independent piece of code. The hard part is making it standard practice in agent-memory APIs.

The reason it has not been standard is, I suspect, that the failure is invisible from the answer alone. The answer is right. The citation is right. Only the status on that answer is wrong, and most memory APIs don't even have a place to put a status. That is what the benchmark is for. The framing piece draws the parallel to the metacognition argument for parametric models: in both settings, reliability hinges on a per-answer faithful signal, not on aggregate accuracy.

Benchmark Summary#

| Task | Baseline | Proposed method | Metric | Delta | Caveat |
|---|---|---|---|---|---|
| Row-store premise status | SQLite without walk: 359/395 | Recursive-CTE premise walk: 388/395 | Full-rubric pass count | +29 scenarios | Requires claim_derivations links. |
| Mem0 premise status | Mem0 A: 359/395 | metadata.derived_from walk: 388/395 | Full-rubric pass count | +29 scenarios | Metadata must preserve stable claim IDs. |
| Graphiti premise status | Graphiti A: 354/395 | EntityEdge.attributes walk: 383/395 | Full-rubric pass count | +29 scenarios | Native graph edges alone are not enough; retrieval must walk them. |
| False-stale specificity | E.6 near-misses | Query-scoped temporal comparison | False stale count | 0/28 on storage walks | Marking every superseded chain stale fails this control. |

Limitations#

  • Depends on write-path quality: missing provenance links reduce recall of stale-chain detection.
  • Needs stable claim IDs across ingestion, compaction, and deduplication.
  • Deep or noisy provenance graphs need cycle handling, fan-out caps, and observability around skipped links.
  • Benchmark scores are controlled evaluations; production distributions and noise can differ.
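
The third limitation can be made concrete with a guarded traversal. This is one possible shape, assuming derivation links are available as a mapping from claim id to a list of premise ids; the function name, the cap defaults, and the reason labels are all hypothetical.

```python
from collections import deque

def premise_closure(derived_from, start, max_depth=8, max_fanout=16):
    """Breadth-first closure over derived_from links.

    derived_from maps a claim id to a list of premise ids. Returns the
    premises reached plus a log of links that were not followed, and why.
    """
    seen = {start}
    reached, skipped = [], []
    queue = deque([(start, 0)])
    while queue:
        claim_id, depth = queue.popleft()
        premises = derived_from.get(claim_id, [])
        if depth >= max_depth:
            # Depth cap: record every link we refuse to follow.
            skipped += [(claim_id, p, "depth_cap") for p in premises]
            continue
        for i, premise_id in enumerate(premises):
            if i >= max_fanout:
                skipped.append((claim_id, premise_id, "fanout_cap"))
            elif premise_id in seen:
                # Covers cycles and premises shared by several branches.
                skipped.append((claim_id, premise_id, "already_visited"))
            else:
                seen.add(premise_id)
                reached.append(premise_id)
                queue.append((premise_id, depth + 1))
    return reached, skipped

links = {"a": ["b", "c"], "b": ["a"], "c": ["d"]}  # note the a <-> b cycle
reached, skipped = premise_closure(links, "a")
shallow_reached, shallow_skipped = premise_closure(links, "a", max_depth=1)
```

Returning the skipped links alongside the reached premises is a design choice for observability: a status computed over a truncated chain can then be reported as such rather than silently treated as complete.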

Concurrent work I missed#

When this series went out I had not seen two Mem0 blog posts that bear on it directly: Introducing the Token-Efficient Memory Algorithm (April 16, 2026) and The Token-Efficient Memory Algorithm Now Has Temporal Reasoning (May 14, 2026). The May post adds a state-key + event_end model for ongoing facts — a partial valid-time analogue, with write-time temporal metadata and reranking (rather than filtering) at read time.

The work overlaps with the temporal half of this benchmark's contract but does not subsume it. Mem0's update tracks one clock; the benchmark requires two (valid time and transaction time, so retroactive corrections remain distinguishable from real-time updates). It also has no documented slot for per-claim epistemic status or for the derivation-closure walk that closes the E.5 family. The May post adds a memory decay signal alongside the state-key model — recent memories are boosted up to 1.5× and stale ones dampened to 0.3× at ranking time — but that is a recency knob on retrieval, not a propagation of supersession through a premise chain; a dampened claim is still returned, just lower, and still without a status. The May post's own numbers are consistent with that boundary: +3.8 points on LongMemEval temporal reasoning, but −2.6 points on the knowledge-update category — reranking surfaces a newer dated instance, but does not, by itself, propagate a status change through a premise chain.

Since first publishing, I have measured that release directly rather than only reading about it. Mem0 documents both the temporal-reasoning surface and memory decay as platform-only features on top of the open-source base algorithm, so they are exactly what the Mem0 Platform v3 (hosted) row in the scoreboard above evaluates — 22/395, with E.5 still 0/29. The open-source Mem0 Mode A / B / C rows reflect the base-algorithm version from the paper; the hosted row is the temporal-reasoning platform. Both land on the same side of the boundary.

References#

  1. Rasmussen, P., Paliychuk, P., Beauvais, T., Ryan, J., & Chalef, D. (2025). Zep: A Temporal Knowledge Graph Architecture for Agent Memory. arXiv:2501.13956.
  2. Chhikara, P., Khant, D., Aryan, S., Singh, T., & Yadav, D. (2025). Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory. arXiv:2504.19413.
  3. Marković, V., Obradović, L., Hajdu, L., & Pavlović, J. (2025). Optimizing the Interface Between Knowledge Graphs and LLMs for Complex Reasoning. arXiv:2505.24478.

Systems referenced#

The scoreboard chart in this article evaluates the following memory systems. Each row is run against its public production interface; configuration detail is in the paper.

  • Graphiti (OSS) — temporal knowledge-graph memory library for LLM agents, run locally with a wrapper-side filter. github.com/getzep/graphiti.
  • Zep Cloud (hosted) — the managed product from the same team, queried through its native graph.search() valid_at / created_at date filters and its scope=auto context block. getzep.com.
  • Mem0 (OSS) — write-time-extraction memory layer (local), with a structured API and an NL retrieval call. mem0.ai · github.com/mem0ai/mem0.
  • Mem0 Platform v3 (hosted) — the managed product's temporal-reasoning surface (state-key + event_end, memory decay), documented as platform-only. mem0.ai.
  • TencentDB-Agent-Memory — local SQLite-backed layered memory with L0 conversation search, L1 structured memories, and traceable persona / scenario summaries. github.com/Tencent/TencentDB-Agent-Memory.
  • Supermemory — write-time extraction into a memory graph with documented update / extend / derive relation labels. supermemory.ai.
  • Hindsight — typed four-network agent memory with retain / recall / reflect operations. github.com/vectorize-io/hindsight.
  • Cognee — graph + vector memory engine with GRAPH_COMPLETION, CHUNKS, and TEMPORAL search modes. github.com/topoteretes/cognee.
  • Honcho 3 — managed personal-memory service driven through a Dialectic Agent. honcho.dev.