Kai Hirota

The Interface Resolution Problem in Agent Memory

Why the same stored facts produce different answers depending on the retrieval surface.

Two agents query the same memory store with the same underlying claims. One gets the right answer with a usable epistemic status. The other gets a different answer, or a worse one, or no answer at all. Nothing about the data changed. Only the interface did.

This is the second piece in a three-part series on an agent-memory audit benchmark. The first piece introduced the problem: a stored claim can be answer-correct while a premise three hops back has been quietly superseded, and the system has no way to flag that.

The first piece also fixed the four axes the benchmark uses to grade each scenario:

  1. Answer — the value returned.
  2. Citation chain — the supporting claims and documents.
  3. Epistemic status — one of UNVERIFIED, CORROBORATED, CONTESTED, ENRICHED, or POTENTIALLY_STALE.
  4. Reachability pre-flight — the system has to finish ingest and become queryable in bounded time.

For this article, the important status is POTENTIALLY_STALE: the answer may still be correct, but one of the premises behind it has been superseded. A system that exposes only the answer can miss that difference.

A scenario passes only when all four pass. The headline corpus is 395 scenarios; numbers like 354/395 below are full-rubric pass counts on it.
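To make the all-four rule concrete, here is a minimal sketch of a full-rubric check. The class and field names are hypothetical, and comparing citation chains by set equality is an assumption of this sketch, not necessarily how the benchmark grades them.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    UNVERIFIED = "UNVERIFIED"
    CORROBORATED = "CORROBORATED"
    CONTESTED = "CONTESTED"
    ENRICHED = "ENRICHED"
    POTENTIALLY_STALE = "POTENTIALLY_STALE"


@dataclass
class Expected:
    answer: str
    citations: frozenset
    status: Status


@dataclass
class Response:
    answer: str
    citations: frozenset
    status: Optional[Status]  # None when the interface exposes no status
    reachable: bool           # ingest finished and store queryable in bounded time


def passes_full_rubric(resp: Response, exp: Expected) -> bool:
    """A scenario passes only if every one of the four axes passes."""
    checks = (
        resp.answer == exp.answer,        # 1. answer
        resp.citations == exp.citations,  # 2. citation chain (assumed: set equality)
        resp.status is exp.status,        # 3. epistemic status
        resp.reachable,                   # 4. reachability pre-flight
    )
    return all(checks)


expected = Expected("yes", frozenset({"doc_3.txt:L4"}), Status.POTENTIALLY_STALE)
full = Response("yes", frozenset({"doc_3.txt:L4"}), Status.POTENTIALLY_STALE, True)
answer_only = Response("yes", frozenset(), None, True)
```

In this sketch `answer_only` returns the right value and still fails, because the citation and status axes have nothing to compare against.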

The empirical heart of this article is that the same stored claims pass or fail depending on the API the agent uses. The natural assumption is that memory quality is about what is in the store. But on both systems with paired retrieval surfaces, the same store produced very different scores depending on which surface the agent used.

The Interface Levels

This framing has a lineage. Direct Corpus Interaction (DCI) in recent agentic-search work[1] argues that retrieval quality depends not only on the agent's reasoning ability but on the resolution of the interface through which it touches the corpus — agents constrained to a top-k similarity call cannot verify evidence that a lower-resolution surface has hidden. The same systems principle transposes onto persistent memory. The hidden object here is not a passage or document; it is the temporally scoped dependency closure behind an answer-correct stored belief. The ladder below organizes memory interfaces by what they expose of that structure.

Memory interfaces fall across six resolution levels. Each level adds one capability over the previous: from a bare answer string to direct timestamps to direct provenance and finally to transitive dependency closure. The benchmark measures what each level can and cannot resolve.

Six levels of what an interface can expose.

L0: Answer only

Factual recall. No stale-premise detection.

{
  answer: "yes"
}

L1: Answer + citation

Grounding. No dependency propagation.

{
  answer: "yes",
  citation: "doc_3.txt:L4"
}

L2: Timestamped citation

Bitemporal tests. No dependency tests.

{
  answer: "yes",
  citation: {
    doc: "doc_3.txt:L4",
    valid_from: "2027-01-15",
    txn_time: "T03"
  }
}

L3: Direct provenance

Shallow stale-premise checks.

{
  answer: "yes",
  citation: { ... },
  derived_from: ["doc_2.txt"]
}

L4: Dependency closure

Cascading-staleness detection.

{
  answer: "yes",
  citation: { ... },
  derived_from: ["doc_2"],
  closure: ["doc_2", "doc_1"],
  current: {
    "(Apple, ceo)":
      "Sarah Chen"
  },
  status: "POTENTIALLY_STALE"
}

L5: Direct memory interaction

Tests whether agents can discover closure.

// Tools the agent can call:
walk_premises(claim, depth=3)
current_value(e, p)
range(e, p, t_from, t_to)

The point is not that higher levels are always better. Some agents only need L0. The point is that the level the interface exposes is the upper bound on what the agent can reason about.

L5, the top of the ladder, sits closest to DCI's own setting in agentic search: the agent is given traversal primitives and asked to compose the walk itself, instead of going through a constrained retrieval API.
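As a rough illustration of what the L4 closure and the L5 walk_premises tool involve, here is a minimal sketch of a depth-bounded walk over derived_from links. The toy store, the SUPERSEDED set, and the status_for helper are hypothetical; only the walk_premises(claim, depth=3) signature comes from the ladder above.

```python
from collections import deque

# Hypothetical toy store: claim id -> its direct premises (derived_from links).
DERIVED_FROM = {
    "doc_3": ["doc_2"],
    "doc_2": ["doc_1"],
    "doc_1": [],
}
# Hypothetical bookkeeping: claims whose value has since been superseded.
SUPERSEDED = {"doc_1"}


def walk_premises(claim, depth=3):
    """Breadth-first walk over derived_from links, at most `depth` hops out."""
    seen, order = {claim}, []
    queue = deque([(claim, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == depth:
            continue  # do not expand past the depth bound
        for premise in DERIVED_FROM.get(node, []):
            if premise not in seen:
                seen.add(premise)
                order.append(premise)
                queue.append((premise, hops + 1))
    return order


def status_for(claim, base="UNVERIFIED", depth=3):
    """Downgrade to POTENTIALLY_STALE if any premise in the closure is superseded."""
    stale = any(p in SUPERSEDED for p in walk_premises(claim, depth))
    return "POTENTIALLY_STALE" if stale else base
```

With this toy data, a one-hop walk from doc_3 reaches only doc_2 and sees nothing wrong; the superseded premise only appears once the walk goes two hops deep. That is the difference between L3 and L4 in miniature.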

Same Store, Two Surfaces

Many production memory systems give you two surfaces against the same store: a structured API and a natural-language retrieval call. The structured API uses an indexed scan and a bitemporal filter. The natural-language path runs embedding similarity and reranking. Both read from the same persisted claims.

Same store

Different interface, different view.

[Diagram: a query reaches the same store by two paths. The Mode B agent goes through NL retrieval; the Mode A agent goes through the structured API.]
The store and its claims are the same. The retrieval interface decides what the agent sees and what kinds of question it can answer.

I call the structured path Mode A and the NL path Mode B. Mode C is a control I introduce in a moment.
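A toy sketch of the two surfaces, under heavy assumptions: the Claim record, the Acme data, and both function names are made up, and plain token overlap stands in for embedding similarity and reranking. It is not how Graphiti or Mem0 implement either path; it only shows why one store can give two views.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Claim:
    entity: str
    predicate: str
    value: str
    valid_from: date
    text: str


# One store, read by both surfaces (hypothetical toy data).
STORE = [
    Claim("Acme", "ceo", "Alice", date(2019, 3, 1), "Alice is CEO of Acme"),
    Claim("Acme", "ceo", "Bob", date(2027, 1, 15), "Bob is CEO of Acme"),
]


def structured_lookup(entity, predicate, as_of):
    """Mode A-style: scan on (entity, predicate), then filter by validity date."""
    hits = [
        c for c in STORE
        if c.entity == entity and c.predicate == predicate and c.valid_from <= as_of
    ]
    # The most recent claim already valid at `as_of`, or None if there is none.
    return max(hits, key=lambda c: c.valid_from, default=None)


def nl_retrieve(question, k=1):
    """Mode B-style stand-in: rank by token overlap and return text only."""
    q = set(question.lower().split())
    ranked = sorted(
        STORE,
        key=lambda c: len(q & set(c.text.lower().split())),
        reverse=True,
    )
    return [c.text for c in ranked[:k]]
```

In this sketch the structured path can answer a point-in-time question because the date is a filter argument. The similarity path sees the year only as another token in the question, so asking about 2020 or 2030 ranks the two claims identically.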

The 305-Scenario Drop

When I scored Graphiti and Mem0 under both surfaces, the drop was not subtle.

Empirical drop

Mode A vs Mode B on the same store.

Corpus: 395 (full)

| System | Mode A | Mode B | Δ |
|---|---|---|---|
| Graphiti (OSS) | 354 | 49 | 305 |
| Mem0 (OSS) | 359 | 149 | 210 |

Scores out of 395 on the headline corpus. Mode A is a structured API; Mode B is the same store accessed via natural-language retrieval. The drop is the cost of the lower-resolution interface.

Graphiti goes from 354 to 49 out of 395 by changing nothing but the API call. Mem0 goes from 359 to 149. The drop is broad — questions that should return multiple values collapse, point-in-time audits collapse, date-range lookups collapse — not localised to one corner of the test set.

The Privileged Dispatch Objection

The reasonable objection at this point is that Mode A wins because its regex template parser knows the predicate names and entity IDs ahead of time, while Mode B has to discover them. So I built Mode C: same indexed-scan retrieval as Mode A, but with the regex parser replaced by a JSON dispatcher. The model decides what to look up; the storage path is identical.

Privileged dispatch control

The regex parser is not the cause.

A vs C: Δ2 / Δ9 (Mem0 / Graphiti)
A vs B: Δ210 / Δ305 (Mem0 / Graphiti)

Mode A: regex parser

  1. NL question
  2. regex template parser
  3. indexed scan
  4. Φ filter

full-rubric: 359/395

Mode C: LLM dispatcher

  1. NL question
  2. gpt-5-mini JSON dispatcher
  3. indexed scan
  4. Φ filter

full-rubric: 357/395

Mode B: NL retrieval

  1. NL question
  2. hybrid retrieval

full-rubric: 149/395

Mode C replaces Mode A's regex template parser with a gpt-5-mini JSON dispatcher and keeps the same indexed-scan retrieval. On Mem0, the score moves by 2 out of 395; the same control on Graphiti moves by 9. The Mode B drop is not a parser artefact.

On Mem0, Mode C scores 357 while Mode A scores 359: a difference of two scenarios out of 395. On Graphiti, Mode C scores 345 while Mode A scores 354: a difference of nine scenarios. The privileged-dispatch concern is bounded on both substrates. Mode B's drop is not the parser.
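For readers who want a picture of the Mode C control, here is a minimal sketch of a JSON dispatcher sitting in front of a fixed indexed scan. The schema, the INDEX contents, and the function names are hypothetical, and the model call itself is left out: the sketch starts from whatever JSON string the model returned.

```python
import json

# Hypothetical canonical schema and index; a real system would derive these from the store.
KNOWN_PREDICATES = {"ceo", "headquarters"}
INDEX = {("Apple", "ceo"): "Sarah Chen"}


def indexed_scan(entity, predicate):
    """The storage path stays fixed, whoever chooses the coordinates."""
    return INDEX.get((entity, predicate))


def dispatch(model_output):
    """Validate the model's JSON, then route it to the same indexed scan."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return None  # the model did not return parseable JSON
    if not isinstance(call, dict):
        return None
    entity, predicate = call.get("entity"), call.get("predicate")
    if not isinstance(entity, str) or not isinstance(predicate, str):
        return None
    if predicate not in KNOWN_PREDICATES:
        return None  # refuse coordinates outside the canonical schema
    return indexed_scan(entity, predicate)
```

The design point is that the model only picks the lookup coordinates. Everything after the validation step is identical to the regex-parser path, which is what lets the A vs C comparison isolate the parser.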

Where the Gap Lives

If the parser is not the cause, the next candidate is the LLM-driven ingest. Mode B uses Mem0's infer=True ingest, where an LLM rewrites the incoming document into structured claims. Maybe extraction quality matters more than the retrieval surface. I tested the cross — Mode A retrieval against the same infer=True ingest — and that run scored 352 out of 395. The Mode A → Mode B gap of 210 scenarios decomposes:

Where the gap lives

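Of the Δ210 drop on Mem0: 203 scenarios come from the retrieval surface (352 → 149) and 7 from the infer=True ingest (359 → 352).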

Holding Mem0's LLM-extraction ingest constant and varying only the retrieval surface, 97% of the Mode A → Mode B drop comes from the retrieval surface. Better extraction is not the lever; better read-path is.

Ninety-seven percent of the drop comes from the retrieval surface, not the LLM extraction. The same representation passes or fails depending on which API the agent calls.
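The 97% figure follows from the three Mem0 scores reported in this article. A short worked version of the arithmetic, with variable names of my own choosing:

```python
# Full-rubric pass counts on Mem0, out of 395 (from the runs described above).
MODE_A_PLAIN = 359   # Mode A retrieval, infer=False ingest
MODE_A_INFER = 352   # Mode A retrieval, infer=True ingest
MODE_B_INFER = 149   # Mode B retrieval, infer=True ingest

total_drop = MODE_A_PLAIN - MODE_B_INFER      # 210: the full A -> B gap
ingest_drop = MODE_A_PLAIN - MODE_A_INFER     # 7: change ingest, hold retrieval
retrieval_drop = MODE_A_INFER - MODE_B_INFER  # 203: change retrieval, hold ingest

retrieval_share = retrieval_drop / total_drop
print(f"{retrieval_share:.0%}")  # prints 97%
```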

The Limit of Structured Access

Structured access wins on this comparison, but it has its own ceiling. Mode A scores in the high 350s out of 395, not 395. It catches answer-correctness and direct provenance, but it does not walk the chain of premises behind a derived claim. A claim can sit in the store, indexed, retrievable, and answer-correct, while a premise three hops back has been superseded.

That is the failure mode the next article is about: cascading staleness. A small query-time walk over declared derived_from links closes the gap on every storage backend I tested. The fix is mechanical. The hard part is making it standard practice.

The Same Effect on a Shipping Product

It would be easy to read the Mode A / Mode B split as an artefact of the open-source libraries and my own wrapper. It isn't. The same interface-resolution effect shows up on a managed product you can buy today. That product exposes native valid_at / created_at date filters and links every stored fact back to its source; it markets itself as “context you can trace, filter, and trust”. Through that interface it retrieves the right claim on ~309 of the 395 scenarios, the best retrieval of any deployed comparator. But only 160 pass the full audit: the native surface resolves answers and dates, not the citation and epistemic structure behind them. Same lesson, no wrapper involved, which is why the scoreboard now labels the local runs (OSS) and the managed runs (hosted). The third article takes the hosted products apart in detail.

Benchmark Summary

| Task | Baseline | Control or proposed method | Metric | Delta | Caveat |
|---|---|---|---|---|---|
| Graphiti retrieval surface | Mode A structured query: 354/395 | Mode B natural-language retrieval: 49/395 | Full-rubric pass count | -305 scenarios | Same store; only retrieval surface changes. |
| Mem0 retrieval surface | Mode A structured query: 359/395 | Mode B natural-language retrieval: 149/395 | Full-rubric pass count | -210 scenarios | Mode B also uses infer=True, so the next row isolates ingest. |
| Mem0 ingest extraction | Mode A with infer=False: 359/395 | Mode A over infer=True ingest: 352/395 | Full-rubric pass count | -7 scenarios | 97% of the A/B drop remains retrieval-surface. |
| Dispatch privilege | Graphiti A: 354/395; Mem0 A: 359/395 | Graphiti C: 345/395; Mem0 C: 357/395 | Full-rubric pass count | -9 and -2 scenarios | The dispatcher is an LLM; the indexed scan stays fixed. |

References

  1. Li, Z., Zhang, H., Wei, C., Lu, P., Nie, P., Lu, Y., Bai, Y., Feng, S., Zhu, H., Zhong, M., Zhang, Y., Xie, J., Choi, Y., Zou, J., Han, J., Chen, W., Lin, J., Jiang, D., & Zhang, Y. (2026). Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction. arXiv:2605.05242.
  2. Rasmussen, P., Paliychuk, P., Beauvais, T., Ryan, J., & Chalef, D. (2025). Zep: A Temporal Knowledge Graph Architecture for Agent Memory. arXiv:2501.13956. Repo: github.com/getzep/graphiti.
  3. Chhikara, P., Khant, D., Aryan, S., Singh, T., & Yadav, D. (2025). Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory. arXiv:2504.19413. Site: mem0.ai.

Limitations

  • Requires a query schema. If entities and predicates are not canonicalized, Mode C can dispatch to the wrong coordinate.
  • Benchmark scores are controlled evaluations; production distributions and corpus noise can widen or shrink the gap.
  • Structured access still does not solve cascading staleness by itself; it needs the premise walk in the next article.
  • Mem0's retrieval algorithm continued to evolve through April–May 2026 (blog 1, blog 2) — both predate this article and I had not seen them at write time. The Mode A / Mode B numbers here reflect the version evaluated in the paper, not the later state-key + event_end reranking layer. The next article discusses how that update relates to the benchmark.