The Missing Step in Reliable Agent Memory
A small query-time walk over declared derivation links closes the benchmark's hardest failure family, deterministically, across every storage backend I tested.
The hardest scenarios in the benchmark are not the ones where the memory has the wrong answer. They are the ones where the memory has the right answer, with the right citation, and still cannot tell that the answer is no longer safe to act on. The benchmark calls this family E.5, cascading staleness, and every base system I tested scored zero on it.
This is the third piece in a three-part series. The first piece framed the problem: a memory can return an answer that is still in the store, still plausible, and still resting on a premise that has been quietly superseded. The second piece showed that which API the agent uses against the same store decides whether the dependency structure behind a claim is even visible — structured access lifts you into the high 350s out of 395, but it still won't catch the case where a premise three hops back has changed. That gap is what this article is about.
Here is the standalone version of the example. The store has an answer claim — Maya Patel has signing authority for Apple legal matters. That claim derives from Maya Patel is chief of staff, which derives from Tim Cook is CEO. Later, on Jun 1, the memory learns that Sarah Chen became CEO. The answer claim has not disappeared, but one upstream premise is no longer current.
The fix is a query-time walk over declared derived_from links. It is small enough to write in an
afternoon, and the score lift is the same on every substrate that admits it.
Figure: Score lift. Before / after the premise walk.
The Walk#
A derived_from link is just an authored field on each claim: when the ingest step records a new
claim that was inferred from an earlier one, it writes down which earlier claim it depended on.
The procedure is: retrieve the candidate claim. Walk back along those declared derived_from
edges. For each premise on the chain, ask the store what the current value is for that
(entity, predicate) pair under the query's valid-time and transaction-time bounds. If any premise has been superseded,
flip the answer's epistemic status to POTENTIALLY_STALE.
Figure: Premise walk. Walking the chain, one premise at a time.
- Jan 15, answer claim (status: POTENTIALLY_STALE): Maya Patel has signing authority. Derived from the chief-of-staff claim.
- Jan 8, direct premise: Maya Patel is chief of staff. Derived from the CEO claim.
- Jan 1, root premise: Tim Cook is CEO of Apple. The premise the chain ultimately rests on.
- Jun 1, current premise, returned by Φ(Apple, ceo, query time): Sarah Chen is CEO of Apple. Supersedes Tim Cook. The chain's root premise has changed.
Step 1
Start at the answer claim.
The query returns Maya Patel's signing authority. A safe interface treats this as the start of a walk, not the end of one.
Step 2
Hop back along derived_from.
The signing-authority claim was derived from a chief-of-staff claim, which was derived from a CEO claim. The walk follows the declared dependency edges.
Step 3
Compare each premise to the current value.
For every (entity, predicate) on the chain, ask the store: who is current under the query time bounds? Tim Cook was the premise. The current premise is Sarah Chen.
Step 4
If any premise was superseded, flip the status.
The answer claim is not deleted. It is still returned. But the epistemic status changes from UNVERIFIED to POTENTIALLY_STALE, because its support chain now passes through a superseded premise.
That is all of it. There is no machine learning, no embedding, no reranker, no calibration. The whole thing is a recursive lookup conditional on a deterministic time filter.
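The four steps can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's implementation: the `Claim` fields, the claim IDs, the predicate names, and the year 2026 are assumptions made for the example, each claim is assumed to have at most one declared premise, and only valid time is modeled.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Claim:
    id: str
    entity: str
    predicate: str
    value: str
    valid_from: date
    derived_from: Optional[str] = None  # id of the declared premise, if any


# Hypothetical store contents for the Apple example (the year is arbitrary).
CLAIMS = {
    c.id: c
    for c in [
        Claim("c1", "Apple", "ceo", "Tim Cook", date(2026, 1, 1)),
        Claim("c2", "Maya Patel", "role", "chief of staff", date(2026, 1, 8), "c1"),
        Claim("c3", "Maya Patel", "signing_authority", "Apple legal matters",
              date(2026, 1, 15), "c2"),
        Claim("c4", "Apple", "ceo", "Sarah Chen", date(2026, 6, 1)),
    ]
}


def current_claim(entity: str, predicate: str, as_of: date) -> Optional[Claim]:
    """Latest claim for (entity, predicate) whose validity starts by `as_of`."""
    candidates = [
        c for c in CLAIMS.values()
        if c.entity == entity and c.predicate == predicate and c.valid_from <= as_of
    ]
    return max(candidates, key=lambda c: c.valid_from, default=None)


def status_after_walk(claim_id: str, as_of: date) -> str:
    """Walk derived_from edges; flip the status if any premise is superseded."""
    premise_id = CLAIMS[claim_id].derived_from
    while premise_id is not None:
        premise = CLAIMS[premise_id]
        current = current_claim(premise.entity, premise.predicate, as_of)
        if current is not None and current.id != premise.id:
            return "POTENTIALLY_STALE"
        premise_id = premise.derived_from
    return "UNVERIFIED"


print(status_after_walk("c3", date(2026, 7, 1)))
```

The answer claim is still returned by whatever retrieval step found it; the walk only decides which status travels with it.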
Four Substrates, One Algorithm#
What surprised me was how cleanly the procedure transfers. I implemented it on a SQLite row store
as a recursive SQL query, on Mem0 by walking the metadata.derived_from field in Python, on
Graphiti by walking the same kind of field on its edge objects, and on Sonnet 4.6 in-context
as a prompt rule with extended thinking turned on (a setting that lets the model spend more tokens
on internal reasoning before answering). The same lift appears in all four.
Figure: Same algorithm. Four substrates. One walk. The same lift.
| Substrate | The walk |
|---|---|
| Row-store (SQLite) | recursive SQL query over claim_derivations |
| Mem0 | walk metadata.derived_from |
| Graphiti | walk EntityEdge.attributes |
| Sonnet 4.6 (in-context) | prompt rule + extended thinking |
This is the central read of the result: cascading staleness is a read-path algorithm, not a storage feature. As long as the write path has authored the dependency edges and the read path is willing to walk them, the execution substrate beneath does not matter.
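For the row-store case, the walk might look like the recursive query below, run through Python's standard `sqlite3` module. The table name `claim_derivations` comes from the article; its columns, the `claims` schema, the claim IDs, and the dates are assumptions for illustration, and only valid time is filtered here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE claims (
    id TEXT PRIMARY KEY, entity TEXT, predicate TEXT, value TEXT, valid_from TEXT
);
-- Assumed shape: one row per declared derived_from edge.
CREATE TABLE claim_derivations (claim_id TEXT, premise_id TEXT);
INSERT INTO claims VALUES
  ('c1', 'Apple', 'ceo', 'Tim Cook', '2026-01-01'),
  ('c2', 'Maya Patel', 'role', 'chief of staff', '2026-01-08'),
  ('c3', 'Maya Patel', 'signing_authority', 'Apple legal matters', '2026-01-15'),
  ('c4', 'Apple', 'ceo', 'Sarah Chen', '2026-06-01');
INSERT INTO claim_derivations VALUES ('c3', 'c2'), ('c2', 'c1');
""")

# Collect every premise reachable from the answer claim, then keep those
# that have a newer claim for the same (entity, predicate) inside the bound.
# UNION (rather than UNION ALL) drops repeated rows, so a cyclic edge set
# cannot make the recursion run forever.
WALK_SQL = """
WITH RECURSIVE chain(premise_id) AS (
    SELECT premise_id FROM claim_derivations WHERE claim_id = :claim_id
    UNION
    SELECT d.premise_id
    FROM claim_derivations AS d
    JOIN chain ON d.claim_id = chain.premise_id
)
SELECT p.id
FROM chain
JOIN claims AS p ON p.id = chain.premise_id
WHERE EXISTS (
    SELECT 1 FROM claims AS newer
    WHERE newer.entity = p.entity
      AND newer.predicate = p.predicate
      AND newer.valid_from > p.valid_from
      AND newer.valid_from <= :as_of
)
"""


def status(claim_id: str, as_of: str) -> str:
    """Return the epistemic status for a claim under an ISO-date query bound."""
    superseded = conn.execute(
        WALK_SQL, {"claim_id": claim_id, "as_of": as_of}
    ).fetchall()
    return "POTENTIALLY_STALE" if superseded else "UNVERIFIED"


print(status("c3", "2026-07-01"))
```

Dates are stored as ISO-8601 strings so that plain string comparison orders them correctly. The Mem0 and Graphiti variants would follow the same two stages (collect the chain, compare each premise) over their own metadata fields.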
The Specificity Trap#
The naive fix is to flag any claim whose support chain touches a superseded premise. That catches every cascading-staleness case, but it also catches a lot of cases that aren't actually stale.
Concretely: suppose the agent asks "did Maya have signing authority in January?" The support
chain still passes through "Tim Cook is CEO" — a premise that was later superseded, when Sarah
Chen became CEO on Jun 1. A walk that flags every supersession in the chain would mark the January
claim POTENTIALLY_STALE. But for the period being asked about, the chain was sound. The
supersession sits in the future relative to the query.
This is the specificity test: 28 scenarios where a real supersession exists, but it lies outside the bitemporal scope the agent is asking about. In the January example, it is outside the valid-time window; in a late-correction case, it may be outside the transaction-time knowledge cutoff. A correct walk leaves these alone.
The denominator differs from the headline E.5 count because E.6 is a matched real-entity control: 28 real-entity stale cases get 28 near-misses. The headline E.5 count is 29 because it also includes the Acme worked-example case.
Figure: Valid-time scope. The query bound moves across the supersession.
- If query bound < supersession (E.6): the supersession is outside the query's valid-time scope. Status: UNVERIFIED.
- If query bound ≥ supersession (E.5): a premise was superseded inside scope. Status: POTENTIALLY_STALE.
So the walk has to be scoped to the query's bitemporal bounds. For each premise, compare
against the value that is current for the same (entity, predicate) at the query's valid time
and known by the query's transaction time. A supersession that starts after the
period being audited, or that the memory had not learned by the transaction-time cutoff, should not
make the January answer stale.
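A sketch of that scoped comparison, under stated assumptions: the `Record` type, the field names `valid_from` and `recorded_at`, and the concrete dates (including the year) are hypothetical, chosen only to show how the two clocks filter independently.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Record:
    entity: str
    predicate: str
    value: str
    valid_from: date   # valid time: when the fact starts holding in the world
    recorded_at: date  # transaction time: when the memory learned it


# Hypothetical records for the CEO premise.
RECORDS = [
    Record("Apple", "ceo", "Tim Cook", date(2026, 1, 1), date(2026, 1, 1)),
    Record("Apple", "ceo", "Sarah Chen", date(2026, 6, 1), date(2026, 6, 1)),
]


def current_value(entity: str, predicate: str,
                  valid_at: date, known_at: date) -> Optional[str]:
    """Value in force at `valid_at`, using only what was recorded by `known_at`."""
    visible = [
        r for r in RECORDS
        if r.entity == entity and r.predicate == predicate
        and r.valid_from <= valid_at      # inside the valid-time bound
        and r.recorded_at <= known_at     # inside the transaction-time bound
    ]
    if not visible:
        return None
    return max(visible, key=lambda r: (r.valid_from, r.recorded_at)).value


def premise_superseded(premise_value: str, entity: str, predicate: str,
                       valid_at: date, known_at: date) -> bool:
    """True only if a different value is current inside both bounds."""
    current = current_value(entity, predicate, valid_at, known_at)
    return current is not None and current != premise_value


# "Did Maya have signing authority in January?" The supersession is out of scope.
print(premise_superseded("Tim Cook", "Apple", "ceo",
                         valid_at=date(2026, 1, 31), known_at=date(2026, 7, 1)))
```

The same helper covers the knowledge-cutoff case: with `valid_at` in July but `known_at` in May, the Jun 1 record is filtered out by the transaction-time bound and the premise is not flagged.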
Reasoning Alone Doesn't Close the Gap#
Before turning to whether a model can run the walk by itself, it's worth ruling out the simpler hypothesis: can you just give the model more reasoning budget and have it solve E.5 on its own? I ran three within-subject sweeps — same model, same prompt, varying only the reasoning budget knob the system exposes.
Figure: Within-subject reasoning controls. Score moves with reasoning budget. E.5 doesn't.
Total scores move. Sonnet 4.6 picks up 83 scenarios when extended thinking is turned on; GPT-5-mini is bit-identical across its three effort levels; Honcho 3 wanders within a four-scenario band. But the E.5 column is locked at 0/29 in every single condition. Reasoning budget alone is not what unlocks cascading staleness: you need a procedure that actually walks the chain.
When the LLM Walks the Chain Itself#
If the storage path will not walk the chain, the model has to. I tested this by giving capable long-context LLMs the source documents plus a prompt rule that described the walk, and asking them to execute it themselves. Two pass/fail tests separate the results: did the system catch the genuinely stale case (E.5, sensitivity), and did it leave the near-misses alone (E.6, specificity)?
Figure: Sensitivity vs specificity. Thirteen systems queued for two tests: each system in its default setup, plus six variants that walk the dependency chain at query time.
GPT-5-mini and Sonnet 4.6 with extended thinking on match the storage walk on both tests. Sonnet 4.6 with extended thinking off closes E.5 but marks every near-miss stale — it catches the real case, but it also raises false alarms on cases that aren't actually stale. Smaller models miss both axes. Storage walks land in the top-right cell deterministically: every E.5 caught, every near-miss preserved.
The Full Scoreboard#
Pulling all the systems onto one page: where does each one end up, and what does the walk lift do to the systems that admit it?
Figure: Headline scoreboard. All evaluated systems, sorted by full-rubric pass rate.
The DuckDB row at the top is the interface specification — a roughly fifty-line SQL store implementing the formal validity predicate. It defines what 395/395 means; it is not a competitor. Everything below it is a system you might actually deploy. The dashed extensions show the read-path walk lift on the systems where I measured it.
The strongest structured and no-LLM non-walk rows in the chart cap near 359/395 — a 36-scenario gap dominated by the cascading-staleness family (E.5: 29 scenarios). The lower-scoring NL and agentic-memory rows fail additional retrieval or citation axes before the walk even matters. The Before / after the premise walk figure at the top of this article shows what closing the E.5-dominated gap looks like substrate by substrate.
A concrete E.5 failure, using the Apple example above: after the Jun 1 CEO update, ask
“does Maya Patel still have signing authority for Apple legal matters?” The signing-authority claim is in the store with
its derived_from link pointing at the chief-of-staff claim, which points back at the Jan 1
“Tim Cook is CEO” premise. The system returns yes, with the right citation. The
Jun 1 update where Sarah Chen becomes CEO never enters the answer. A reader of the answer cannot
tell that the support chain has changed — only the status on the answer should have flipped
from UNVERIFIED to POTENTIALLY_STALE.
A few things in the chart are worth pointing out beyond the obvious.
The walk contributes a constant lift of 29 scenarios across substrates: SQLite row-store 359 → 388, Mem0 A 359 → 388, Graphiti A 354 → 383, Sonnet 4.6 (with extended thinking) 359 → 388. Same procedure, same delta, three storage backends plus an in-context LLM. The walk is substrate-agnostic in a strict sense: the work it does is bounded by the corpus, not by what is underneath.
The Sonnet path has two effects, not one. Sonnet 4.6 in-context starts at 276/395. Turning on extended thinking lifts that to 359/395 (Δ83) without touching E.5 at all — the model just gets better at the rest of the corpus. The chart above isolates the second step: adding the premise-walk prompt on top contributes the same Δ29 the storage walks deliver elsewhere, taking it to 388. The walk and the reasoning budget are doing different jobs; you need both for Sonnet to land at the ceiling.
Reasoning budget alone does not substitute for the walk. Three within-subject controls in the paper hold the model fixed and vary one factor: GPT-5-mini reasoning effort (low / medium / high), Sonnet 4.6 extended thinking (off / on), Honcho 3 reasoning level across five values. Total scores move — sometimes by +83 scenarios — but the E.5 family stays locked at 0/29 in every condition. Whatever extra computation the model spends, it isn't finding the cascading-staleness label on its own.
The walk is adding status, not accuracy. For systems whose base answer+citation score was already high (Mem0 A: 389/395 on answer+citation), the +walk answer+citation barely moves (389 → 389), while the full-rubric jumps (359 → 388). The walk is closing the gap between “I have the right answer” and “I can also say whether the answer is still safe to use” — a pure epistemic-status fix, not a retrieval one.
The score-vs-cost Pareto frontier is the no-LLM walk. The highest-scoring non-reference row on the board (388/395) is also one of the cheapest: row-store SQLite + recursive-CTE walk runs in under a millisecond per query at zero tokens. The GPT-5.5 in-context row reaches 347/395 at roughly 5 s per query with meaningful token spend; agentic NL retrieval (Mem0 Mode B) costs about $0.79 per query for 149/395. High-scoring rows cluster in the 347–395 band but span roughly five orders of magnitude in cost. The walk's leverage is that it doesn't depend on the LLM at all.
Do the Hosted Products Do Better?#
The rows above run the open-source libraries locally, behind a thin wrapper that
applies the benchmark's time-and-evidence filter. The fair next question is
whether the managed products (the ones you would actually buy) do
better through their own native interfaces. So I added three hosted rows:
Zep Cloud under edge-scope date filters and under its scope=auto context block, and
Mem0 Platform v3 on its temporal-reasoning surface.
They do not. The native surfaces land far below the OSS Mode A wrapper-oracle — Zep Cloud 160/395 and Mem0 Platform v3 22/395, against 354 and 359 — and the reason is the more interesting part.
Figure: Hosted current products. Strong retrieval, weak audit, through a native interface. Zep Cloud's valid_at / created_at date filters and provenance-to-source retrieve the correct claim on ~309/395, but only 160 survive the full audit; the dashed remainder is the citation and epistemic-status axes the managed surface does not expose. Run locally with a wrapper-side filter (Mem0 OSS Mode A), retrieval and audit nearly coincide.
Zep markets the surface as “context you can trace, filter, and trust”,
and the first two of those are real. Every stored fact links back to the source
episode it was read from, and you can filter retrieval by valid_at /
created_at date ranges and by metadata such as a verified flag. That is
exactly why Zep Cloud retrieves the correct claim on ~309/395 — better
than any other deployed comparator in the chart. But trust under change is a
separate primitive from trace and filter. Strong retrieval does not confer
bitemporal-audit completeness: only 160 of those 309 survive the full rubric,
and the dashed remainder is the citation and epistemic-status axes. The
scope=auto context block retrieves marginally more (318) but audits less
(121), because its cross-scope summary underserves structured citation.
And on the hardest family, the hosted products fail exactly the way everything
else does. The shared E.5 gap now spans nine product surfaces —
Graphiti and Mem0 open-source, the hosted Zep Cloud and Mem0 Platform v3 current
products, plus TencentDB-Agent-Memory, Cognee, Honcho 3, Supermemory, and
Hindsight. The hosted Zep Cloud surface is the sharpest illustration: its native
date filters retrieve the E.5 answer on most instances, and it still returns
UNVERIFIED, never POTENTIALLY_STALE. The products are not failing to find
the claim. They are returning it without a signal that its support has moved.
What This Result Doesn't Claim#
The walk works conditional on a write path that authors the derivation edges. The benchmark
assumes that input. If your ingest step does not preserve derived_from, the read-path procedure
has nothing to walk.
Figure: Scope. What the benchmark measures, and what it doesn't.
That is a separate, parallel problem — write-path extraction quality — and it has its own substantial literature. What the benchmark establishes is that, given the edges, the read path is a small and store-independent piece of code. The hard part is making it standard practice in agent-memory APIs.
The reason it has not been standard is, I suspect, that the failure is invisible from the answer alone. The answer is right. The citation is right. Only the status on that answer is wrong, and most memory APIs don't even have a place to put a status. That is what the benchmark is for. The framing piece draws the parallel to the metacognition argument for parametric models: in both settings, reliability hinges on a per-answer faithful signal, not on aggregate accuracy.
Benchmark Summary#
| Task | Baseline | Proposed method | Metric | Delta | Caveat |
|---|---|---|---|---|---|
| Row-store premise status | SQLite without walk: 359/395 | Recursive-CTE premise walk: 388/395 | Full-rubric pass count | +29 scenarios | Requires claim_derivations links. |
| Mem0 premise status | Mem0 A: 359/395 | metadata.derived_from walk: 388/395 | Full-rubric pass count | +29 scenarios | Metadata must preserve stable claim IDs. |
| Graphiti premise status | Graphiti A: 354/395 | EntityEdge.attributes walk: 383/395 | Full-rubric pass count | +29 scenarios | Native graph edges alone are not enough; retrieval must walk them. |
| False-stale specificity | E.6 near-misses | Query-scoped temporal comparison | False stale count | 0/28 on storage walks | Marking every superseded chain stale fails this control. |
Limitations#
- Depends on write-path quality: missing provenance links reduce recall of stale-chain detection.
- Needs stable claim IDs across ingestion, compaction, and deduplication.
- Deep or noisy provenance graphs need cycle handling, fan-out caps, and observability around skipped links.
- Benchmark scores are controlled evaluations; production distributions and noise can differ.
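The third limitation can be made concrete with a guarded traversal. The sketch below is one possible shape, not something the benchmark prescribes: the function name, the `max_fanout` and `max_nodes` parameters, and the skip-reason labels are all hypothetical, and the dependency graph is assumed to be a plain mapping from claim ID to a list of premise IDs.

```python
from collections import deque
from typing import Dict, List, Tuple


def collect_premises(
    claim_id: str,
    derived_from: Dict[str, List[str]],
    max_fanout: int = 8,
    max_nodes: int = 1000,
) -> Tuple[List[str], List[Tuple[str, str, str]]]:
    """Breadth-first walk over derived_from edges.

    Returns the premises reached and a list of (claim, premise, reason)
    entries for every edge that was deliberately not followed.
    """
    seen = {claim_id}          # cycle handling: never revisit a claim
    reached: List[str] = []
    skipped: List[Tuple[str, str, str]] = []
    queue = deque([claim_id])

    while queue:
        node = queue.popleft()
        parents = derived_from.get(node, [])
        for parent in parents[:max_fanout]:
            if parent in seen:
                continue
            if len(reached) >= max_nodes:
                skipped.append((node, parent, "node_cap"))
                continue
            seen.add(parent)
            reached.append(parent)
            queue.append(parent)
        # Record the edges dropped by the fan-out cap so they stay observable.
        for parent in parents[max_fanout:]:
            skipped.append((node, parent, "fanout_cap"))

    return reached, skipped


# A chain with a cycle back to the answer claim still terminates.
graph = {"c3": ["c2"], "c2": ["c1"], "c1": ["c3"]}
print(collect_premises("c3", graph))
```

A non-empty `skipped` list means the walk was incomplete, so a caller would probably want to surface that fact rather than report UNVERIFIED as if every premise had been checked.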
Concurrent work I missed#
When this series went out I had not seen two Mem0 blog posts that bear on it
directly: Introducing the Token-Efficient Memory Algorithm (April 16, 2026) and
The Token-Efficient Memory Algorithm Now Has Temporal Reasoning (May 14, 2026).
The May post adds a state-key + event_end model for
ongoing facts — a partial valid-time analogue, with write-time temporal
metadata and reranking (rather than filtering) at read time.
The work overlaps with the temporal half of this benchmark's contract but does not subsume it. Mem0's update tracks one clock; the benchmark requires two (valid time and transaction time, so retroactive corrections remain distinguishable from real-time updates). It also has no documented slot for per-claim epistemic status or for the derivation-closure walk that closes the E.5 family. The May post adds a memory decay signal alongside the state-key model — recent memories are boosted up to 1.5× and stale ones dampened to 0.3× at ranking time — but that is a recency knob on retrieval, not a propagation of supersession through a premise chain; a dampened claim is still returned, just lower, and still without a status. The May post's own numbers are consistent with that boundary: +3.8 points on LongMemEval temporal reasoning, but −2.6 points on the knowledge-update category — reranking surfaces a newer dated instance, but does not, by itself, propagate a status change through a premise chain.
Since first publishing, I have measured that release directly rather than only reading about it. Mem0 documents both the temporal-reasoning surface and memory decay as platform-only features on top of the open-source base algorithm, so they are exactly what the Mem0 Platform v3 (hosted) row in the scoreboard above evaluates — 22/395, with E.5 still 0/29. The open-source Mem0 Mode A / B / C rows reflect the base-algorithm version from the paper; the hosted row is the temporal-reasoning platform. Both land on the same side of the boundary.
References#
- Rasmussen, P., Paliychuk, P., Beauvais, T., Ryan, J., & Chalef, D. (2025). Zep: A Temporal Knowledge Graph Architecture for Agent Memory. arXiv:2501.13956.
- Chhikara, P., Khant, D., Aryan, S., Singh, T., & Yadav, D. (2025). Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory. arXiv:2504.19413.
- Marković, V., Obradović, L., Hajdu, L., & Pavlović, J. (2025). Optimizing the Interface Between Knowledge Graphs and LLMs for Complex Reasoning. arXiv:2505.24478.
Systems referenced#
The scoreboard chart in this article evaluates the following memory systems. Each row is run against its public production interface; configuration detail is in the paper.
- Graphiti (OSS): temporal knowledge-graph memory library for LLM agents, run locally with a wrapper-side filter. github.com/getzep/graphiti.
- Zep Cloud (hosted): the managed product from the same team, queried through its native graph.search() valid_at / created_at date filters and its scope=auto context block. getzep.com.
- Mem0 (OSS): write-time-extraction memory layer (local), with a structured API and an NL retrieval call. mem0.ai · github.com/mem0ai/mem0.
- Mem0 Platform v3 (hosted): the managed product's temporal-reasoning surface (state-key + event_end, memory decay), documented as platform-only. mem0.ai.
- TencentDB-Agent-Memory: local SQLite-backed layered memory with L0 conversation search, L1 structured memories, and traceable persona / scenario summaries. github.com/Tencent/TencentDB-Agent-Memory.
- Supermemory: write-time extraction into a memory graph with documented update / extend / derive relation labels. supermemory.ai.
- Hindsight: typed four-network agent memory with retain / recall / reflect operations. github.com/vectorize-io/hindsight.
- Cognee: graph + vector memory engine with GRAPH_COMPLETION, CHUNKS, and TEMPORAL search modes. github.com/topoteretes/cognee.
- Honcho 3: managed personal-memory service driven through a Dialectic Agent. honcho.dev.