The Interface Resolution Problem in Agent Memory
Why the same stored facts produce different answers depending on the retrieval surface.
Two agents query the same memory store with the same underlying claims. One gets the right answer with a usable epistemic status. The other gets a different answer, or a worse one, or no answer at all. Nothing about the data changed. Only the interface did.
This is the second piece in a three-part series on an agent-memory audit benchmark. The first piece introduced the problem: a stored claim can be answer-correct while a premise three hops back has been quietly superseded, and the system has no way to flag that.
The first piece also fixed the four axes the benchmark uses to grade each scenario:
- Answer: the value returned.
- Citation chain: the supporting claims and documents.
- Epistemic status: one of UNVERIFIED, CORROBORATED, CONTESTED, ENRICHED, or POTENTIALLY_STALE.
- Reachability pre-flight: the system has to finish ingest and become queryable in bounded time.
For this article, the important status is POTENTIALLY_STALE: the answer may still be
correct, but one of the premises behind it has been superseded. A system that exposes only
the answer can miss that difference.
A scenario passes only when all four pass. The headline corpus is 395 scenarios; numbers like
354/395 below are full-rubric pass counts on it.
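The all-four-axes rule can be sketched in a few lines. This is a minimal illustration, not the benchmark's grader: the `ScenarioResult` class, its field names, and the three sample results are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScenarioResult:
    """One graded scenario; field names are hypothetical, not the benchmark's."""
    answer_ok: bool      # the returned value matches the expected one
    citations_ok: bool   # the supporting claims and documents match
    status_ok: bool      # the epistemic status label matches
    reachable: bool      # ingest finished and the system was queryable in bounded time

    def passes(self) -> bool:
        # A scenario passes only when all four axes pass.
        return (self.answer_ok and self.citations_ok
                and self.status_ok and self.reachable)


def full_rubric_count(results: list[ScenarioResult]) -> int:
    """Number of scenarios that pass on every axis."""
    return sum(r.passes() for r in results)


# Made-up results: only the first scenario passes every axis.
results = [
    ScenarioResult(True, True, True, True),
    ScenarioResult(True, True, False, True),   # right answer, wrong status
    ScenarioResult(False, True, True, True),   # wrong answer
]
print(f"{full_rubric_count(results)}/{len(results)}")
```

The second sample is the case this article cares about: the answer is right, but the status is wrong, so the scenario still fails.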
The empirical heart of this article is that the same stored claims pass or fail depending on the API the agent uses. The natural assumption is that memory quality is about what is in the store. But on both systems with paired retrieval surfaces, the same store produced very different scores depending on which surface the agent used.
The Interface Levels
This framing has a lineage. Direct Corpus Interaction (DCI) in recent agentic-search work[1] argues that retrieval quality depends not only on the agent's reasoning ability but on the resolution of the interface through which it touches the corpus — agents constrained to a top-k similarity call cannot verify evidence that a lower-resolution surface has hidden. The same systems principle transposes onto persistent memory. The hidden object here is not a passage or document; it is the temporally scoped dependency closure behind an answer-correct stored belief. The ladder below organizes memory interfaces by what they expose of that structure.
Memory interfaces fall across six resolution levels. Each level adds one capability over the previous: from a bare answer string to direct timestamps to direct provenance and finally to transitive dependency closure. The benchmark measures what each level can and cannot resolve.
Six levels of what an interface can expose:

- L0, answer only. Exposes the answer string. Supports factual recall; no stale-premise detection. The agent learns the answer is yes, but cannot say where it came from or whether the source is still current.
- L1, answer + citation. Exposes the source document or text. Supports grounding; no dependency propagation. The agent can verify the answer against a specific source, but it cannot tell whether that source itself rests on an earlier claim.
- L2, timestamped citation. Exposes valid time + transaction time. Supports bitemporal tests; no dependency tests. The agent can audit what the system believed at a past transaction time, but a claim's premises are still invisible.
- L3, direct provenance. Exposes direct derived_from premises. Supports shallow stale-premise checks. The agent sees the one-hop premise behind a derived claim, but a supersession three hops back is still hidden.
- L4, dependency closure. Exposes transitive premises + current comparison. Supports cascading-staleness detection. The agent sees the full chain a claim rests on and compares each link to the current value, so a supersession anywhere upstream becomes visible.
- L5, direct memory interaction. Exposes bounded traversal tools. Tests whether agents can discover closure. The agent gets traversal primitives and has to compose the walk itself; the bound is whether it can.
Example query return at each level:

L0, answer only:

```
{
  answer: "yes"
}
```

L1, answer + citation:

```
{
  answer: "yes",
  citation: "doc_3.txt:L4"
}
```

L2, timestamped citation:

```
{
  answer: "yes",
  citation: {
    doc: "doc_3.txt:L4",
    valid_from: "2027-01-15",
    txn_time: "T03"
  }
}
```

L3, direct provenance:

```
{
  answer: "yes",
  citation: { ... },
  derived_from: ["doc_2.txt"]
}
```

L4, dependency closure:

```
{
  answer: "yes",
  citation: { ... },
  derived_from: ["doc_2"],
  closure: ["doc_2", "doc_1"],
  current: {
    "(Apple, ceo)": "Sarah Chen"
  },
  status: "POTENTIALLY_STALE"
}
```

L5, direct memory interaction:

```
// Tools the agent can call:
walk_premises(claim, depth=3)
current_value(e, p)
range(e, p, t_from, t_to)
```
The point is not that higher levels are always better. Some agents only need L0. The point is that the level the interface exposes is the upper bound on what the agent can reason about.
L5, the top of the ladder, sits closest to DCI's own setting in agentic search: the agent is given traversal primitives and asked to compose the walk itself, instead of going through a constrained retrieval API.
Same Store, Two Surfaces
Many production memory systems give you two surfaces against the same store: a structured API and a natural-language retrieval call. The structured API uses an indexed scan and a bitemporal filter. The natural-language path runs embedding similarity and reranking. Both read from the same persisted claims.
Figure: Same store. Different interface, different view.
I call the structured path Mode A and the NL path Mode B. Mode C is a control I introduce in a moment.
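As a rough illustration of what a bitemporal filter on the structured path does, the sketch below answers "what did the store believe at transaction time T about the world at valid time V?". It is not either library's implementation: the `Claim` fields, the `as_of` helper, the integer transaction counter, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Claim:
    entity: str
    predicate: str
    value: str
    valid_from: date   # when the fact starts holding in the world
    txn_time: int      # when the store recorded it (hypothetical monotonic counter)


def as_of(claims, entity, predicate, valid_at, txn_at):
    """Value the store believed at txn_at about the world at valid_at."""
    visible = [
        c for c in claims
        if c.entity == entity and c.predicate == predicate
        and c.valid_from <= valid_at   # the fact had started holding
        and c.txn_time <= txn_at       # the store had already recorded it
    ]
    if not visible:
        return None
    # Latest valid_from wins; ties go to the most recent recording.
    return max(visible, key=lambda c: (c.valid_from, c.txn_time)).value


# Hypothetical data: the change takes effect on 2027-01-15 but is recorded at txn 3.
claims = [
    Claim("Apple", "ceo", "previous CEO", date(2020, 1, 1), txn_time=1),
    Claim("Apple", "ceo", "Sarah Chen", date(2027, 1, 15), txn_time=3),
]

print(as_of(claims, "Apple", "ceo", date(2027, 2, 1), txn_at=2))  # before the update was recorded
print(as_of(claims, "Apple", "ceo", date(2027, 2, 1), txn_at=3))  # after it was recorded
```

An embedding-similarity call has no equivalent of the two comparisons in the filter, which is one reason the two surfaces can diverge over the same persisted claims.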
The 305-Scenario Drop
When I scored Graphiti and Mem0 under both surfaces, the drop was not subtle.
Figure: Empirical drop. Mode A vs Mode B on the same store.
Graphiti goes from 354 to 49 out of 395 by changing nothing but the API call. Mem0 goes from 359 to 149. The drop is broad — questions that should return multiple values collapse, point-in-time audits collapse, date-range lookups collapse — not localised to one corner of the test set.
The Privileged Dispatch Objection
The reasonable objection at this point is that Mode A wins because its regex template parser knows the predicate names and entity IDs ahead of time, while Mode B has to discover them. So I built Mode C: same indexed-scan retrieval as Mode A, but with the regex parser replaced by a JSON dispatcher. The model decides what to look up; the storage path is identical.
Privileged dispatch control: the regex parser is not the cause.

| Mode | Pipeline | Full-rubric pass count (Mem0) |
|---|---|---|
| Mode A (regex parser) | NL question → regex template parser → indexed scan → Φ filter | 359/395 |
| Mode C (LLM dispatcher) | NL question → gpt-5-mini JSON dispatcher → indexed scan → Φ filter | 357/395 |
| Mode B (NL retrieval) | NL question → hybrid retrieval | 149/395 |
On Mem0, Mode C scores 357 while Mode A scores 359: a difference of two scenarios out of 395. On Graphiti, Mode C scores 345 while Mode A scores 354: a difference of nine scenarios. The privileged-dispatch concern is bounded on both substrates. Mode B's drop is not the parser.
Where the Gap Lives
If the parser is not the cause, the next candidate is the LLM-driven ingest. Mode B uses
Mem0's infer=True ingest, where an LLM rewrites the incoming document into structured
claims. Maybe extraction quality matters more than the retrieval surface. I tested the cross —
Mode A retrieval against the same infer=True ingest — and that run scored 352 out of 395. The
Mode A → Mode B gap of 210 scenarios decomposes:
Of the 210-scenario drop on Mem0, 7 scenarios (359 → 352) come from the infer=True ingest and 203 (352 → 149) come from the retrieval surface.
Ninety-seven percent of the drop comes from the retrieval surface, not the LLM extraction. The same representation passes or fails depending on which API the agent calls.
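The 97% figure follows directly from the three Mem0 pass counts quoted above. The arithmetic, with variable names chosen here for illustration:

```python
# Mem0 full-rubric pass counts out of 395, as reported in this article.
mode_a_plain = 359   # Mode A retrieval, infer=False ingest
mode_a_infer = 352   # Mode A retrieval over infer=True ingest
mode_b = 149         # Mode B retrieval, infer=True ingest

total_drop = mode_a_plain - mode_b          # Mode A -> Mode B gap
ingest_drop = mode_a_plain - mode_a_infer   # attributable to LLM extraction
surface_drop = mode_a_infer - mode_b        # attributable to the retrieval surface

surface_share = surface_drop / total_drop
ingest_share = ingest_drop / total_drop

print(f"total drop:   {total_drop}")
print(f"ingest:       {ingest_drop} ({ingest_share:.1%})")
print(f"surface:      {surface_drop} ({surface_share:.1%})")
```

The two components sum to the full gap because the cross run holds retrieval fixed while switching ingest, and the Mode B run then holds ingest fixed while switching retrieval.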
The Limit of Structured Access
Structured access wins on this comparison, but it has its own ceiling. Mode A scores in the 350s out of 395 (354 on Graphiti, 359 on Mem0), not 395. It catches answer-correctness and direct provenance, but it does not walk the chain of premises behind a derived claim. A claim can sit in the store, indexed, retrievable, and answer-correct, while a premise three hops back has been superseded.
That is the failure mode the next article is
about: cascading staleness. A small query-time walk over declared derived_from links closes the
gap on every storage backend I tested. The fix is mechanical. The hard part is making it
standard practice.
The Same Effect on a Shipping Product
It would be easy to read the Mode A / Mode B split as an artifact of the open-source libraries and
my own wrapper. It isn't. The same interface-resolution effect shows up on a managed product
you can buy today.
That product exposes native valid_at / created_at date filters and links every stored fact back to its
source — it markets itself as “context you can trace, filter, and
trust”. Through that
interface it retrieves the right claim on ~309 of the 395 scenarios, the best retrieval of any
deployed comparator. But only 160 pass the full audit: the native surface resolves answers and
dates, not the citation and epistemic structure behind them. Same lesson, no wrapper involved
— which is why the scoreboard now labels the local runs (OSS) and the managed runs
(hosted). The third article
takes the hosted products apart in detail.
Benchmark Summary
| Task | Baseline | Control or proposed method | Metric | Delta | Caveat |
|---|---|---|---|---|---|
| Graphiti retrieval surface | Mode A structured query: 354/395 | Mode B natural-language retrieval: 49/395 | Full-rubric pass count | -305 scenarios | Same store; only retrieval surface changes. |
| Mem0 retrieval surface | Mode A structured query: 359/395 | Mode B natural-language retrieval: 149/395 | Full-rubric pass count | -210 scenarios | Mode B also uses infer=True, so the next row isolates ingest. |
| Mem0 ingest extraction | Mode A with infer=False: 359/395 | Mode A over infer=True ingest: 352/395 | Full-rubric pass count | -7 scenarios | 97% of the A/B drop remains retrieval-surface. |
| Dispatch privilege | Graphiti A: 354/395; Mem0 A: 359/395 | Graphiti C: 345/395; Mem0 C: 357/395 | Full-rubric pass count | -9 and -2 scenarios | The dispatcher is an LLM; the indexed scan stays fixed. |
References
1. Li, Z., Zhang, H., Wei, C., Lu, P., Nie, P., Lu, Y., Bai, Y., Feng, S., Zhu, H., Zhong, M., Zhang, Y., Xie, J., Choi, Y., Zou, J., Han, J., Chen, W., Lin, J., Jiang, D., & Zhang, Y. (2026). Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction. arXiv:2605.05242.
2. Rasmussen, P., Paliychuk, P., Beauvais, T., Ryan, J., & Chalef, D. (2025). Zep: A Temporal Knowledge Graph Architecture for Agent Memory. arXiv:2501.13956. Repo: github.com/getzep/graphiti.
3. Chhikara, P., Khant, D., Aryan, S., Singh, T., & Yadav, D. (2025). Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory. arXiv:2504.19413. Site: mem0.ai.
Limitations
- Requires a query schema. If entities and predicates are not canonicalized, Mode C can dispatch to the wrong coordinate.
- Benchmark scores are controlled evaluations; production distributions and corpus noise can widen or shrink the gap.
- Structured access still does not solve cascading staleness by itself; it needs the premise walk in the next article.
- Mem0's retrieval algorithm continued to evolve through April–May 2026 (blog 1, blog 2); both posts predate this article, and I had not seen them at write time. The Mode A / Mode B numbers here reflect the version evaluated in the paper, not the later state-key + event_end reranking layer. The next article discusses how that update relates to the benchmark.