18 May 2026 · 14 min read · 3,119 words

#agent #memory #knowledge-graph #benchmark #evaluation

The Missing Step in Reliable Agent Memory

A small query-time walk over declared derivation links closes the benchmark's hardest failure family, deterministically, across every storage backend I tested.

The hardest scenarios in the benchmark are not the ones where the memory has the wrong answer. They are the ones where the memory has the right answer, with the right citation, and still cannot tell that the answer is no longer safe to act on. The benchmark calls this family E.5, cascading staleness, and every base system I tested scored zero on it.

This is the third piece in a three-part series. The first piece framed the problem: a memory can return an answer that is still in the store, still plausible, and still resting on a premise that has been quietly superseded. The second piece showed that which API the agent uses against the same store decides whether the dependency structure behind a claim is even visible — structured access lifts you into the high 350s out of 395, but it still won't catch the case where a premise three hops back has changed. That gap is what this article is about.

Here is the standalone version of the example. The store has an answer claim — Maya Patel has full signing authority at Apple. That claim derives from Maya Patel is chief of staff, which derives from Tim Cook is CEO. Later, the memory learns that Priya Raman became CEO on November 1, 2027. The answer claim has not disappeared, but one upstream premise is no longer current.

Benchmark row E.5

One answer, two different memory states.

Question

What is the signing_authority of Maya Patel?

Expected answer: full
Expected citation: ap_d3
Expected status: POTENTIALLY_STALE

Prompt rule used by the LLM walk

For every doc_id p in c.derived_from, transitively, look up the premise document p.
For each premise, find the current claim for the same entity and predicate at the queried (T_v, T_t).
If that current claim's value differs from p.value, the premise is superseded.
If c is valid and any transitive premise is superseded, set epistemic_status = POTENTIALLY_STALE while still returning c.value.

Evidence chain

Returned claim

ap_d3 / valid now

Maya Patel holds full signing authority for routine corporate matters at Apple.

Direct premise

ap_d2 / depends on ap_01

Maya Patel serves as chief of staff to Apple's CEO Tim Cook.

Transitive premise

ap_01 / superseded

Tim Cook continues as CEO of Apple through most of 2027.

Current successor

ap_03 / current

Apple appoints Priya Raman as Chief Executive Officer effective November 1, 2027.

Row store

No premise walk

fails status

answer: full
primary_document_id: ap_d3
citation_level: 3 / 3
epistemic_status: UNVERIFIED

The value and citation are right, but the interface treats the claim as a clean single-source fact.

Sonnet 4.6

Prompted premise walk

passes

answer: full
primary_document_id: ap_d3
citation_level: 3 / 3
epistemic_status: POTENTIALLY_STALE

When the prompt surfaces `derived_from` and asks for the transitive walk, the same answer carries the warning.

Row store

Recursive CTE walk

passes

answer: full
primary_document_id: ap_d3
citation_level: 3 / 3
epistemic_status: POTENTIALLY_STALE

The storage walk produces the same status without spending tokens or asking a model to reason through the chain.

The same pattern appears outside the executive-delegation chain. Here is the NVIDIA acquisition-freeze row from the corpus, shown with the prior documents the system sees, the prompt contract asking for epistemic status, and the structured answer that passes or fails the audit.

Benchmark rows E.5 / E.6

The prompt asks for status. The read path still has to earn it.

Prior ingested documents

Premise

nv_06

The European Commission opens a formal antitrust review of NVIDIA's GPU licensing practices effective August 1, 2027.

predicate: regulatory_status

value: under_antitrust_review

Derived answer

nv_d2

Pursuant to NVIDIA's antitrust review status, an internal acquisition freeze takes effect 2027-08-01 and remains in place pending regulatory resolution.

predicate: acquisition_freeze

value: active

derived_from: nv_06

Successor

nv_07

The European Commission concludes its NVIDIA antitrust review with no enforcement action; status returns to cleared as of February 1, 2028.

predicate: regulatory_status

value: cleared

supersedes: nv_06

Representative prompt contract

You are answering from an auditable memory store.
Use the ingested claim records and their document_id, valid-time interval, transaction time, and derived_from links.
Return answer, primary_document_id, citation_level, and epistemic_status.
epistemic_status must be one of UNVERIFIED, CORROBORATED, CONTESTED, ENRICHED, or POTENTIALLY_STALE.
If the answer claim derives from another claim, transitively inspect each premise at the query valid time and transaction time.
If any transitive premise has a different current value at the query point, keep the answer but set epistemic_status to POTENTIALLY_STALE.

Baseline failure

What is the acquisition_freeze of NVIDIA?

fails status

answer: active
primary_document_id: nv_d2
citation_level: 3
epistemic_status: UNVERIFIED

The answer and citation are right, but the memory state is wrong. The read path returned nv_d2 without walking back to nv_06 and comparing that premise with its current successor nv_07.
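The answer contract in the prompt can be pinned down as a typed record, which makes "fails status" a mechanical comparison rather than a judgment call. The sketch below is illustrative only: the field names and the five status values come from the prompt contract above, while `Answer`, `parse_answer`, and `failed_axes` are hypothetical names, not part of the benchmark harness.

```python
from dataclasses import dataclass
from enum import Enum


class EpistemicStatus(str, Enum):
    # The five values allowed by the prompt contract.
    UNVERIFIED = "UNVERIFIED"
    CORROBORATED = "CORROBORATED"
    CONTESTED = "CONTESTED"
    ENRICHED = "ENRICHED"
    POTENTIALLY_STALE = "POTENTIALLY_STALE"


@dataclass(frozen=True)
class Answer:
    answer: str
    primary_document_id: str
    citation_level: int
    epistemic_status: EpistemicStatus


def parse_answer(raw: dict) -> Answer:
    """Build an Answer; an unknown status string raises ValueError."""
    return Answer(
        answer=raw["answer"],
        primary_document_id=raw["primary_document_id"],
        citation_level=int(raw["citation_level"]),
        epistemic_status=EpistemicStatus(raw["epistemic_status"]),
    )


def failed_axes(got: Answer, expected: Answer) -> list:
    """Names of the fields where the returned answer differs from the expected one."""
    names = ("answer", "primary_document_id", "citation_level", "epistemic_status")
    return [n for n in names if getattr(got, n) != getattr(expected, n)]


# The NVIDIA row: same value and citation, different status.
EXPECTED = parse_answer({"answer": "active", "primary_document_id": "nv_d2",
                         "citation_level": 3, "epistemic_status": "POTENTIALLY_STALE"})
BASELINE = parse_answer({"answer": "active", "primary_document_id": "nv_d2",
                         "citation_level": 3, "epistemic_status": "UNVERIFIED"})

print(failed_axes(BASELINE, EXPECTED))
```

On the baseline row the only differing field is `epistemic_status`, which is the sense in which the answer is right and the memory state is wrong.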

The fix is a query-time walk over declared derived_from links. It is small enough to write in an afternoon, and the score lift is the same on every substrate that admits it.

Score lift

Before / after the premise walk.

Headline-corpus scores out of 395, before and after adding a query-time walk over declared derivation links. The lift is consistent across stores because the algorithm is the same; the store only determines how the edges are persisted.

The Walk#

A derived_from link is just an authored field on each claim: when the ingest step records a new claim that was inferred from an earlier one, it writes down which earlier claim it depended on. The procedure is: retrieve the candidate claim. Walk back along those declared derived_from edges. For each premise on the chain, ask the store what the current value is for that (entity, predicate) pair under the query's valid-time and transaction-time bounds. If any premise has been superseded, flip the answer's epistemic status to POTENTIALLY_STALE.

Walking the chain, one premise at a time.

Premise walk

POTENTIALLY_STALE

Jan 15 — answer claim

Maya Patel has signing authority

Derived from the chief-of-staff claim.

derived_from

Jan 8 — direct premise

Maya Patel is chief of staff

Derived from the CEO claim.

derived_from

Jan 1 — root premise

Tim Cook is CEO of Apple

The premise the chain ultimately rests on.

current claim for (Apple, ceo, query time) →

Jun 1 — current premise

Priya Raman is CEO of Apple

Supersedes Tim Cook. The chain's root premise has changed.

Step 1

Start at the answer claim.

The query returns Maya Patel's signing authority. A safe interface treats this as the start of a walk, not the end of one.

Step 2

Hop back along derived_from.

The signing-authority claim was derived from a chief-of-staff claim, which was derived from a CEO claim. The walk follows the declared dependency edges.

Step 3

Compare each premise to the current value.

For every (entity, predicate) on the chain, ask the store: who is current under the query time bounds? Tim Cook was the premise. The current premise is Priya Raman.

Step 4

If any premise was superseded, flip the status.

The answer claim is not deleted. It is still returned. But the epistemic status changes from UNVERIFIED to POTENTIALLY_STALE, because its support chain now passes through a superseded premise.

That is all of it. There is no machine learning, no embedding, no reranker, no calibration. The whole thing is a recursive lookup conditional on a deterministic time filter.
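To make the procedure concrete, here is a minimal in-memory sketch of the walk in Python. It is not the benchmark's implementation. The `Claim` fields, the `role` predicate name, the 2027 January dates, and the assumption that each document is recorded on the day it takes effect are all illustrative choices; only the document ids, the values, and the November 1, 2027 succession come from the example above. The sketch also only distinguishes UNVERIFIED from POTENTIALLY_STALE and ignores the other statuses.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Claim:
    doc_id: str
    entity: str
    predicate: str
    value: str
    valid_from: date          # when the fact starts holding in the world
    recorded_at: date         # when the memory learned it (transaction time)
    derived_from: tuple = ()  # doc_ids of the premises this claim rests on


def current_claim(claims, entity, predicate, t_valid, t_tx) -> Optional[Claim]:
    """Latest claim for (entity, predicate) in force at t_valid and known by t_tx."""
    visible = [c for c in claims
               if c.entity == entity and c.predicate == predicate
               and c.valid_from <= t_valid and c.recorded_at <= t_tx]
    return max(visible, key=lambda c: c.valid_from, default=None)


def epistemic_status(claims, answer, t_valid, t_tx) -> str:
    """Walk derived_from transitively; flag the answer if any premise was superseded."""
    by_id = {c.doc_id: c for c in claims}
    stack, seen = list(answer.derived_from), set()
    while stack:
        pid = stack.pop()
        if pid in seen or pid not in by_id:
            continue  # already checked, or a dangling link with nothing to compare
        seen.add(pid)
        premise = by_id[pid]
        cur = current_claim(claims, premise.entity, premise.predicate, t_valid, t_tx)
        if cur is not None and cur.value != premise.value:
            return "POTENTIALLY_STALE"
        stack.extend(premise.derived_from)
    return "UNVERIFIED"


CLAIMS = [
    Claim("ap_01", "Apple", "ceo", "Tim Cook", date(2027, 1, 1), date(2027, 1, 1)),
    Claim("ap_d2", "Maya Patel", "role", "chief_of_staff",
          date(2027, 1, 8), date(2027, 1, 8), ("ap_01",)),
    Claim("ap_d3", "Maya Patel", "signing_authority", "full",
          date(2027, 1, 15), date(2027, 1, 15), ("ap_d2",)),
    Claim("ap_03", "Apple", "ceo", "Priya Raman", date(2027, 11, 1), date(2027, 11, 1)),
]
ANSWER = CLAIMS[2]

after = epistemic_status(CLAIMS, ANSWER, date(2027, 12, 1), date(2027, 12, 1))
before = epistemic_status(CLAIMS, ANSWER, date(2027, 6, 1), date(2027, 12, 1))
print(after, before)
```

The answer claim is returned in both cases. Only the status differs, depending on whether the CEO succession falls inside the query's time bounds.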

Four Substrates, One Algorithm#

What surprised me was how cleanly the procedure transfers. I implemented it on a SQLite row store as a recursive SQL query, on Mem0 by walking the metadata.derived_from field in Python, on Graphiti by walking the same kind of field on its edge objects, and on the model as a prompt rule with extended thinking turned on (a setting that lets the model spend more tokens on internal reasoning before answering). The same lift appears in all four.

Same algorithm

Four substrates. One walk. The same lift.

Substrate	The walk	Before	After	Lift
Row-store (SQLite)	recursive SQL query over claim_derivations	359	388	+29
Mem0	walk metadata.derived_from	359	388	+29
Graphiti	walk EntityEdge.attributes	354	383	+29
Sonnet 4.6 (in-context)	prompt rule + extended thinking	359	388	+29

The walk is implemented as a recursive SQL query on SQLite, an in-Python edge walk on Mem0 and Graphiti, and a prompt rule for the Sonnet 4.6 model with extended thinking already on. All four close cascading staleness on the headline corpus. The constant lift is the point: this is a read-path algorithm, not a storage-engine feature, and it transfers cleanly across stores.

This is the central read of the result: cascading staleness is a read-path algorithm, not a storage feature. As long as the write path has authored the dependency edges and the read path is willing to walk them, the execution substrate beneath does not matter.
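For the row-store case, the walk fits in one recursive common table expression. The sketch below uses Python's built-in `sqlite3` module. The table name `claim_derivations` is the one mentioned above, but its columns, the `claims` schema, the `role` predicate, and the January 2027 dates are assumptions made for illustration, not the benchmark's actual schema.

```python
import sqlite3

SCHEMA = """
CREATE TABLE claims (
    doc_id TEXT PRIMARY KEY, entity TEXT, predicate TEXT, value TEXT,
    valid_from TEXT, recorded_at TEXT);            -- ISO dates sort as text
CREATE TABLE claim_derivations (claim_id TEXT, premise_id TEXT);
"""

# chain: every premise reachable from the answer claim. UNION (not UNION ALL)
# deduplicates rows, so a cyclic derivation graph still terminates.
WALK = """
WITH RECURSIVE chain(premise_id) AS (
    SELECT premise_id FROM claim_derivations WHERE claim_id = :answer
    UNION
    SELECT d.premise_id FROM claim_derivations d
    JOIN chain ON d.claim_id = chain.premise_id
)
SELECT COUNT(*) FROM chain
JOIN claims p ON p.doc_id = chain.premise_id
JOIN claims cur ON cur.entity = p.entity AND cur.predicate = p.predicate
WHERE cur.recorded_at <= :t_tx
  AND cur.valid_from = (
      SELECT MAX(c2.valid_from) FROM claims c2
      WHERE c2.entity = p.entity AND c2.predicate = p.predicate
        AND c2.valid_from <= :t_valid AND c2.recorded_at <= :t_tx)
  AND cur.value <> p.value
"""


def status(conn, answer_id, t_valid, t_tx):
    """POTENTIALLY_STALE if any transitive premise differs from its current value."""
    (superseded,) = conn.execute(
        WALK, {"answer": answer_id, "t_valid": t_valid, "t_tx": t_tx}).fetchone()
    return "POTENTIALLY_STALE" if superseded else "UNVERIFIED"


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany("INSERT INTO claims VALUES (?, ?, ?, ?, ?, ?)", [
    ("ap_01", "Apple", "ceo", "Tim Cook", "2027-01-01", "2027-01-01"),
    ("ap_d2", "Maya Patel", "role", "chief_of_staff", "2027-01-08", "2027-01-08"),
    ("ap_d3", "Maya Patel", "signing_authority", "full", "2027-01-15", "2027-01-15"),
    ("ap_03", "Apple", "ceo", "Priya Raman", "2027-11-01", "2027-11-01"),
])
conn.executemany("INSERT INTO claim_derivations VALUES (?, ?)",
                 [("ap_d3", "ap_d2"), ("ap_d2", "ap_01")])

print(status(conn, "ap_d3", "2027-12-01", "2027-12-01"))
print(status(conn, "ap_d3", "2027-06-01", "2027-12-01"))
```

The Python-side walks on other stores would follow the same shape: collect the transitive premises, look up the current value for each under the query bounds, and compare.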

The Specificity Trap#

The naive fix is to flag any claim whose support chain touches a superseded premise. That catches every cascading-staleness case, but it also catches a lot of cases that aren't actually stale.

Concretely: suppose the agent asks "did Maya have signing authority in January?" The support chain still passes through "Tim Cook is CEO" — a premise that was later superseded, when Priya Raman became CEO on November 1, 2027. A walk that flags every supersession in the chain would mark the January claim POTENTIALLY_STALE. But for the period being asked about, the chain was sound. The supersession sits in the future relative to the query.

This is the specificity test: 28 scenarios where a real supersession exists, but it lies outside the bitemporal scope the agent is asking about. In the January example, it is outside the valid-time window; in a late-correction case, it may be outside the transaction-time knowledge cutoff. A correct walk leaves these alone.

The denominator differs from the headline E.5 count because E.6 is a matched real-entity control: 28 real-entity stale cases get 28 near-misses. The headline E.5 count is 29 because it also includes the Acme worked-example case.

Valid-time scope

Figure: a timeline from Jan 2027 to Jul 2028 with the supersession marked at Jun 2028 and a movable query bound, shown at Sep 2027 (status: UNVERIFIED).

If query bound < supersession (E.6)

The supersession is outside the query's valid-time scope. Status: UNVERIFIED.

If query bound ≥ supersession (E.5)

A premise was superseded inside scope. Status: POTENTIALLY_STALE.

This view moves the query's valid-time bound; the same walk is also scoped by the transaction-time knowledge cutoff. Marking everything stale near a supersession would fail the E.6 specificity control — 28 out of 28 near-misses where the supersession exists in the world but lies outside the bitemporal scope the agent was asking about.

So the walk has to be scoped to the query's bitemporal bounds. For each premise, compare against the value that is current for the same (entity, predicate) at the query's valid time $T_v$ and known by the query's transaction time $T_t$. A supersession that starts after the period being audited, or that the memory had not learned by the transaction-time cutoff, should not make the January answer stale.
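The scoping rule is easiest to see on a single premise. The sketch below reuses the NVIDIA regulatory_status history from earlier in the article and checks one premise under three query points. The `recorded_at` dates are an assumption (each document is taken to be ingested on the day it takes effect), and `value_at` and `premise_superseded` are hypothetical helper names.

```python
from datetime import date

# (doc_id, value, valid_from, recorded_at) for (NVIDIA, regulatory_status).
# recorded_at is assumed equal to valid_from for this illustration.
HISTORY = [
    ("nv_06", "under_antitrust_review", date(2027, 8, 1), date(2027, 8, 1)),
    ("nv_07", "cleared", date(2028, 2, 1), date(2028, 2, 1)),
]


def value_at(history, t_valid, t_tx):
    """Value in force at t_valid, using only rows the memory had learned by t_tx."""
    visible = [row for row in history if row[2] <= t_valid and row[3] <= t_tx]
    if not visible:
        return None
    return max(visible, key=lambda row: row[2])[1]


def premise_superseded(premise_value, history, t_valid, t_tx):
    current = value_at(history, t_valid, t_tx)
    return current is not None and current != premise_value


PREMISE = "under_antitrust_review"  # the value nv_d2 was derived from

# Supersession inside both bounds: the freeze answer should be flagged (E.5).
in_scope = premise_superseded(PREMISE, HISTORY, date(2028, 3, 1), date(2028, 3, 1))

# Asking about September 2027: nv_07 starts later, so the chain was sound (E.6).
outside_valid_time = premise_superseded(PREMISE, HISTORY, date(2027, 9, 1), date(2028, 3, 1))

# Knowledge cutoff in January 2028: the memory had not yet learned nv_07.
outside_tx_time = premise_superseded(PREMISE, HISTORY, date(2028, 3, 1), date(2028, 1, 15))

print(in_scope, outside_valid_time, outside_tx_time)
```

Dropping either of the two comparisons in `value_at` turns the second or third case into a false alarm, which is what the E.6 control is designed to catch.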

Reasoning Alone Doesn't Close the Gap#

Before turning to whether a model can run the walk by itself, it's worth ruling out the simpler hypothesis: can you just give the model more reasoning budget and have it solve E.5 on its own? I ran three within-subject sweeps — same model, same prompt, varying only the reasoning budget knob the system exposes.

Within-subject reasoning controls

Score moves with reasoning budget. E.5 doesn't.

Three models, three sweeps that vary only the reasoning budget. Total scores move by up to 83 scenarios on the Sonnet 4.6 thinking-on/off pair. Across every setting in every group, E.5 stays at zero out of 29 — you cannot reason your way past cascading staleness without a procedure to actually walk the chain.

Total scores move. Sonnet 4.6 picks up 83 scenarios when extended thinking is turned on; GPT-5-mini is bit-identical across its three effort levels; Honcho 3 wanders within a four-scenario band. But the E.5 column is locked at 0/29 in every single condition. Reasoning budget alone is not what unlocks cascading staleness: you need a procedure that actually walks the chain.

MRAgent's active-reconstruction result points in the same direction: memory access should be an iterative read procedure whose next step depends on evidence already found, not a passive retrieval call fixed by the original query.^[4] The E.5 result is a narrower audit version of that claim. Active traversal can help find the right support, but the benchmark still requires one extra output: whether any premise on the declared derivation path has been superseded under the query's valid-time and transaction-time bounds.

When the LLM Walks the Chain Itself#

If the storage path will not walk the chain, the model has to. I tested this by giving capable long-context LLMs the source documents plus a prompt rule that described the walk, and asked them to execute it themselves. Two pass/fail tests separate the results: did the system catch the genuinely stale case (E.5, sensitivity), and did it leave the near-miss alone (E.6, specificity)?

Sensitivity vs specificity

Thirteen systems, two tests: each system in its default setup, plus six variants that walk the dependency chain at query time.

Figure: a two-by-two grid with misses E.5 / catches E.5 on one axis and preserves E.6 / over-marks E.6 on the other. ★ marks the Pareto frontier.

Systems plotted: Graphiti A, Mem0 A, Row-store, Det. graph, Sonnet 4.6, GPT-5.5, Mem0 B, Graphiti + walk, Mem0 + walk, Row-store + walk, Sonnet + walk (thinking), gpt-5-mini + walk, Sonnet + walk (no thinking).

Legend: no walk (base), no walk (LLM), walk on storage, walk in prompt (capable), walk in prompt (over-marks).

Two tests with simple pass/fail outcomes on the 28 matched real-entity pairs. E.5 (sensitivity) asks whether the system flagged the genuinely stale case; E.6 (specificity) asks whether it left the near-miss alone. Storage walks land in the top-right cell deterministically. Prompt walks depend on how carefully the model can reason — one variant slips into the bottom-right cell because it marks every near-miss stale.

GPT-5-mini and Sonnet 4.6 with extended thinking on match the storage walk on both tests. Sonnet 4.6 with extended thinking off closes E.5 but marks every near-miss stale — it catches the real case, but it also raises false alarms on cases that aren't actually stale. Smaller models miss both axes. Storage walks land in the top-right cell deterministically: every E.5 caught, every near-miss preserved.
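The two tests reduce to a pair of rates over the 28 matched pairs. As a small worked example, the counts below are my reading of the paragraph above (a storage walk flags all 28 stale cases and none of the near-misses; Sonnet 4.6 without extended thinking flags all 28 of each). The function name `rates` is hypothetical.

```python
def rates(e5_flagged: int, e6_flagged: int, pairs: int = 28) -> tuple:
    """Return (sensitivity, specificity) over the matched real-entity pairs."""
    sensitivity = e5_flagged / pairs             # stale cases correctly flagged
    specificity = (pairs - e6_flagged) / pairs   # near-misses correctly left alone
    return sensitivity, specificity


storage_walk = rates(e5_flagged=28, e6_flagged=0)
sonnet_no_thinking = rates(e5_flagged=28, e6_flagged=28)

print(storage_walk, sonnet_no_thinking)
```

Both variants have perfect sensitivity. Only the specificity term separates a walk that is scoped to the query from one that flags every supersession it can see.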

The Full Scoreboard#

Pulling all the systems onto one page: where does each one end up, and what does the walk lift do to the systems that admit it?

Headline scoreboard

All evaluated systems, sorted by full-rubric pass rate.

Legend: Reference (oracle), Deployed memory system, Hosted product (native API), No-LLM control, In-context LLM, RAG baseline, premise-walk lift (where measured).

Full-rubric pass rate on the 395-scenario headline corpus. The dashed extension on a bar shows where a system lands after the query-time premise walk is added. The reference DuckDB row defines the interface; everything below it is what existing systems return on the same questions.

The DuckDB row at the top is the interface specification — a roughly fifty-line SQL store implementing the formal validity predicate. It defines what 395/395 means; it is not a competitor. Everything below it is a system you might actually deploy. The dashed extensions show the read-path walk lift on the systems where I measured it.

The strongest structured and no-LLM non-walk rows in the chart cap near 359/395 — a 36-scenario gap dominated by the cascading-staleness family (E.5: 29 scenarios). The lower-scoring NL and agentic-memory rows fail additional retrieval or citation axes before the walk even matters. The Before / after the premise walk figure at the top of this article shows what closing the E.5-dominated gap looks like substrate by substrate.

A concrete E.5 failure, using the Apple example above: after the November 1 CEO update, ask “what is the signing_authority of Maya Patel?” The signing-authority claim is in the store with its derived_from link pointing at the chief-of-staff claim, which points back at the Jan 1 “Tim Cook is CEO” premise. The system returns full, with the right citation. The November 1 update where Priya Raman becomes CEO never enters the answer. A reader of the answer cannot tell that the support chain has changed — only the status on the answer should have flipped from UNVERIFIED to POTENTIALLY_STALE.

A few things in the chart are worth pointing out beyond the obvious.

The walk contributes a constant lift of 29 scenarios across substrates. SQLite row-store 359 → 388, Mem0 A 359 → 388, Graphiti A 354 → 383, Sonnet 4.6 (with extended thinking) 359 → 388. Same procedure, same delta, three storage backends plus an in-context LLM. The walk is substrate-agnostic in a strict sense: the work it does is bounded by the corpus, not by what is underneath.

The Sonnet path has two effects, not one. Sonnet 4.6 in-context starts at 276/395. Turning on extended thinking lifts that to 359/395 (Δ83) without touching E.5 at all — the model just gets better at the rest of the corpus. The chart above isolates the second step: adding the premise-walk prompt on top contributes the same Δ29 the storage walks deliver elsewhere, taking it to 388. The walk and the reasoning budget are doing different jobs; you need both for Sonnet to land at the ceiling.

Reasoning budget alone does not substitute for the walk. Three within-subject controls in the paper hold the model fixed and vary one factor: GPT-5-mini reasoning effort (low / medium / high), Sonnet 4.6 extended thinking (off / on), Honcho 3 reasoning level across five values. Total scores move — sometimes by +83 scenarios — but the E.5 family stays locked at 0/29 in every condition. Whatever extra computation the model spends, it isn't finding the cascading-staleness label on its own.

The walk is adding status, not accuracy. For systems whose base answer+citation score was already high (Mem0 A: 389/395 on answer+citation), adding the walk leaves answer+citation unchanged (389 → 389), while the full-rubric score jumps (359 → 388). The walk is closing the gap between “I have the right answer” and “I can also say whether the answer is still safe to use”: a pure epistemic-status fix, not a retrieval one.

The score-vs-cost Pareto frontier is the no-LLM walk. The highest-scoring non-reference row on the board (388/395) is also one of the cheapest: row-store SQLite + recursive-CTE walk runs in under a millisecond per query at zero tokens. The GPT-5.5 in-context row reaches 347/395 at roughly 5 s per query with meaningful token spend; agentic NL retrieval (Mem0 Mode B) costs about $0.79 per query for 149/395. High-scoring rows cluster in the 347–395 band but span roughly five orders of magnitude in cost. The walk's leverage is that it doesn't depend on the LLM at all.

Do the Hosted Products Do Better?#

The rows above run the open-source libraries locally, behind a thin wrapper that applies the benchmark's time-and-evidence filter. The fair next question is whether the managed products, the ones you would actually buy, do better through their own native interfaces. So I added three hosted rows: Zep Cloud under edge-scope date filters and under its scope=auto context block, and Mem0 Platform v3 on its temporal-reasoning surface.

They do not. The native surfaces land far below the OSS Mode A wrapper-oracle — Zep Cloud 160/395 and Mem0 Platform v3 22/395, against 354 and 359 — and the reason is the more interesting part.

Hosted current products

Strong retrieval, weak audit — through a native interface.

Surface	Configuration	Retrieves (of 395)	Full-rubric audits (of 395)	Gap
Zep Cloud (hosted)	edge-scope + valid_at / created_at date filters	309	160	Δ149
Zep Cloud (hosted)	scope=auto cross-scope context block	318	121	Δ197
Mem0 Platform v3 (hosted)	temporal reasoning, semantic rerank	46	22	Δ24
Mem0 (OSS), Mode A	local library + wrapper-side bitemporal filter	389	359	Δ30

Legend: full-rubric audit; retrieved, but fails the citation / epistemic axes; OSS wrapper-oracle contrast.

Out of 395 headline scenarios. Zep Cloud's native valid_at / created_at date filters and provenance-to-source retrieve the correct claim on ~309/395, but only 160 survive the full audit — the dashed remainder is the citation and epistemic-status axes the managed surface does not expose. Run locally with a wrapper-side filter (Mem0 OSS Mode A), retrieval and audit nearly coincide.

Zep markets the surface as “context you can trace, filter, and trust”, and the first two of those are real. Every stored fact links back to the source episode it was read from, and you can filter retrieval by valid_at / created_at date ranges and by metadata such as a verified flag. That is exactly why Zep Cloud retrieves the correct claim on ~309/395 — better than any other deployed comparator in the chart. But trust under change is a separate primitive from trace and filter. Strong retrieval does not confer bitemporal-audit completeness: only 160 of those 309 survive the full rubric, and the dashed remainder is the citation and epistemic-status axes. The scope=auto context block retrieves marginally more (318) but audits less (121), because its cross-scope summary underserves structured citation.

And on the hardest family, the hosted products fail exactly the way everything else does. The shared E.5 gap now spans nine product surfaces — Graphiti and Mem0 open-source, the hosted Zep Cloud and Mem0 Platform v3 current products, plus TencentDB-Agent-Memory, Cognee, Honcho 3, Supermemory, and Hindsight. The hosted Zep Cloud surface is the sharpest illustration: its native date filters retrieve the E.5 answer on most instances, and it still returns UNVERIFIED, never POTENTIALLY_STALE. The products are not failing to find the claim. They are returning it without a signal that its support has moved.

What This Result Doesn't Claim#

The walk works conditional on a write path that authors the derivation edges. The benchmark assumes that input. If your ingest step does not preserve derived_from, the read-path procedure has nothing to walk.

Scope

What the benchmark measures, and what it doesn't.

The benchmark assumes the write path has already authored the derivation edges. It measures whether the read path uses them. Extraction quality is a separate problem; better extraction does not help if retrieval never walks the chain.

That is a separate, parallel problem — write-path extraction quality — and it has its own substantial literature. What the benchmark establishes is that, given the edges, the read path is a small and store-independent piece of code. The hard part is making it standard practice in agent-memory APIs.

The reason it has not been standard is, I suspect, that the failure is invisible from the answer alone. The answer is right. The citation is right. Only the status on that answer is wrong, and most memory APIs don't even have a place to put a status. That is what the benchmark is for. The framing piece draws the parallel to the metacognition argument for parametric models: in both settings, reliability hinges on a per-answer faithful signal, not on aggregate accuracy.

Benchmark Summary#

Task	Baseline	Proposed method	Metric	Delta	Caveat
Row-store premise status	SQLite without walk: 359/395	Recursive-CTE premise walk: 388/395	Full-rubric pass count	+29 scenarios	Requires `claim_derivations` links.
Mem0 premise status	Mem0 A: 359/395	`metadata.derived_from` walk: 388/395	Full-rubric pass count	+29 scenarios	Metadata must preserve stable claim IDs.
Graphiti premise status	Graphiti A: 354/395	`EntityEdge.attributes` walk: 383/395	Full-rubric pass count	+29 scenarios	Native graph edges alone are not enough; retrieval must walk them.
False-stale specificity	E.6 near-misses	Query-scoped temporal comparison	False stale count	0/28 on storage walks	Marking every superseded chain stale fails this control.

Limitations#

Depends on write-path quality: missing provenance links reduce recall of stale-chain detection.
Needs stable claim IDs across ingestion, compaction, and deduplication.
Deep or noisy provenance graphs need cycle handling, fan-out caps, and observability around skipped links.
Benchmark scores are controlled evaluations; production distributions and noise can differ.
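The third limitation can be handled in the traversal itself. The sketch below shows one way to bound the walk: a visited set for cycles, caps on depth and fan-out, and a record of every link that was skipped so the truncation is observable. The cap defaults and the name `bounded_premises` are illustrative assumptions, not values used in the benchmark.

```python
import logging
from collections import deque

log = logging.getLogger("premise_walk")


def bounded_premises(derived_from, start, max_depth=8, max_fanout=16):
    """Breadth-first walk over derived_from links.

    derived_from maps a claim id to the list of premise ids it declares.
    Returns (reached, skipped): premises visited once each, and the links
    that were dropped because of a cap, as (claim, premise, reason) tuples.
    """
    seen = {start}
    queue = deque([(start, 0)])
    reached, skipped = [], []
    while queue:
        node, depth = queue.popleft()
        for i, premise in enumerate(derived_from.get(node, ())):
            if premise in seen:
                continue  # cycle or shared premise: already handled
            if depth >= max_depth or i >= max_fanout:
                reason = "depth" if depth >= max_depth else "fanout"
                skipped.append((node, premise, reason))
                log.warning("skipped link %s -> %s (%s cap)", node, premise, reason)
                continue
            seen.add(premise)
            reached.append(premise)
            queue.append((premise, depth + 1))
    return reached, skipped


# A cycle (b points back at a) does not loop forever.
cyclic = bounded_premises({"a": ["b"], "b": ["c", "a"]}, "a")

# With a fan-out cap of 2, the third premise is reported rather than silently dropped.
capped = bounded_premises({"a": ["b", "c", "d"]}, "a", max_fanout=2)

print(cyclic, capped)
```

How a non-empty `skipped` list should affect the returned status is a design decision this sketch leaves open. The point is only that a truncated walk should be distinguishable from a complete one.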

Concurrent work I missed#

When this series went out I had not seen two Mem0 blog posts that bear on it directly: Introducing the Token-Efficient Memory Algorithm (April 16, 2026) and The Token-Efficient Memory Algorithm Now Has Temporal Reasoning (May 14, 2026). The May post adds a state-key + event_end model for ongoing facts — a partial valid-time analogue, with write-time temporal metadata and reranking (rather than filtering) at read time.

The work overlaps with the temporal half of this benchmark's contract but does not subsume it. Mem0's update tracks one clock; the benchmark requires two (valid time and transaction time, so retroactive corrections remain distinguishable from real-time updates). It also has no documented slot for per-claim epistemic status or for the derivation-closure walk that closes the E.5 family. The May post adds a memory decay signal alongside the state-key model — recent memories are boosted up to 1.5× and stale ones dampened to 0.3× at ranking time — but that is a recency knob on retrieval, not a propagation of supersession through a premise chain; a dampened claim is still returned, just lower, and still without a status. The May post's own numbers are consistent with that boundary: +3.8 points on LongMemEval temporal reasoning, but −2.6 points on the knowledge-update category — reranking surfaces a newer dated instance, but does not, by itself, propagate a status change through a premise chain.

Since first publishing, I have measured that release directly rather than only reading about it. Mem0 documents both the temporal-reasoning surface and memory decay as platform-only features on top of the open-source base algorithm, so they are exactly what the Mem0 Platform v3 (hosted) row in the scoreboard above evaluates — 22/395, with E.5 still 0/29. The open-source Mem0 Mode A / B / C rows reflect the base-algorithm version from the paper; the hosted row is the temporal-reasoning platform. Both land on the same side of the boundary.

References#

[1] Rasmussen, P., Paliychuk, P., Beauvais, T., Ryan, J., & Chalef, D. (2025). Zep: A Temporal Knowledge Graph Architecture for Agent Memory. arXiv:2501.13956.
[2] Chhikara, P., Khant, D., Aryan, S., Singh, T., & Yadav, D. (2025). Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory. arXiv:2504.19413.
[3] Marković, V., Obradović, L., Hajdu, L., & Pavlović, J. (2025). Optimizing the Interface Between Knowledge Graphs and LLMs for Complex Reasoning. arXiv:2505.24478.
[4] Ji, S., Li, Y., & Hooi, B. (2026). Memory is Reconstructed, Not Retrieved: Graph Memory for LLM Agents. arXiv:2606.06036.

Systems referenced#

The scoreboard chart in this article evaluates the following memory systems. Each row is run against its public production interface; configuration detail is in the paper.

Graphiti (OSS) — temporal knowledge-graph memory library for LLM agents, run locally with a wrapper-side filter. github.com/getzep/graphiti.
Zep Cloud (hosted) — the managed product from the same team, queried through its native graph.search() valid_at / created_at date filters and its scope=auto context block. getzep.com.
Mem0 (OSS) — write-time-extraction memory layer (local), with a structured API and an NL retrieval call. mem0.ai · github.com/mem0ai/mem0.
Mem0 Platform v3 (hosted) — the managed product's temporal-reasoning surface (state-key + event_end, memory decay), documented as platform-only. mem0.ai.
TencentDB-Agent-Memory — local SQLite-backed layered memory with L0 conversation search, L1 structured memories, and traceable persona / scenario summaries. github.com/Tencent/TencentDB-Agent-Memory.
Supermemory — write-time extraction into a memory graph with documented update / extend / derive relation labels. supermemory.ai.
Hindsight — typed four-network agent memory with retain / recall / reflect operations. github.com/vectorize-io/hindsight.
Cognee — graph + vector memory engine with GRAPH_COMPLETION, CHUNKS, and TEMPORAL search modes. github.com/topoteretes/cognee.
Honcho 3 — managed personal-memory service driven through a Dialectic Agent. honcho.dev.