17 May 2026 · 7 min read · 1,595 words

#agent #memory #api #mcp #retrieval #evaluation

The Interface Resolution Problem in Agent Memory

Why the same stored facts produce different answers depending on the retrieval surface.

Two agents query the same memory store with the same underlying claims. One gets the right answer with a usable epistemic status. The other gets a different answer, or a worse one, or no answer at all. Nothing about the data changed. Only the interface did.

This is the second piece in a three-part series on an agent-memory audit benchmark. The first piece introduced the problem: a stored claim can be answer-correct while a premise three hops back has been quietly superseded, and the system has no way to flag that.

The first piece also fixed the four axes the benchmark uses to grade each scenario:

Answer — the value returned.
Citation chain — the supporting claims and documents.
Epistemic status — one of UNVERIFIED, CORROBORATED, CONTESTED, ENRICHED, or POTENTIALLY_STALE.
Reachability pre-flight — the system has to finish ingest and become queryable in bounded time.

For this article, the important status is POTENTIALLY_STALE: the answer may still be correct, but one of the premises behind it has been superseded. A system that exposes only the answer can miss that difference.

A scenario passes only when all four pass. The headline corpus is 395 scenarios; numbers like 354/395 below are full-rubric pass counts on it.
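The all-four rule is easy to state in code. The sketch below is a minimal illustration, not the benchmark's actual harness: the `AxisResult` class and its field names are hypothetical, chosen only to mirror the four axes listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AxisResult:
    """Per-scenario outcome on the four axes (hypothetical field names)."""
    answer_ok: bool
    citation_chain_ok: bool
    epistemic_status_ok: bool
    reachable_in_time: bool

    def passes(self) -> bool:
        # Full-rubric pass: every axis must pass; a correct answer alone is not enough.
        return all((self.answer_ok, self.citation_chain_ok,
                    self.epistemic_status_ok, self.reachable_in_time))


def full_rubric_count(results: list[AxisResult]) -> int:
    """Count the scenarios that pass all four axes."""
    return sum(r.passes() for r in results)


results = [
    AxisResult(True, True, True, True),    # passes
    AxisResult(True, True, False, True),   # right answer, wrong status: fails
    AxisResult(False, True, True, True),   # wrong answer: fails
]
print(f"{full_rubric_count(results)}/{len(results)}")
```

The second scenario is the case this article cares about: the answer is right, but the status is wrong, so the scenario does not count.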

The empirical heart of this article is that the same stored claims pass or fail depending on the API the agent uses. The natural assumption is that memory quality is about what is in the store. But on both systems with paired retrieval surfaces, the same store produced very different scores depending on which surface the agent used.

The Interface Levels

This framing has a lineage. Direct Corpus Interaction (DCI) in recent agentic-search work^[1] argues that retrieval quality depends not only on the agent's reasoning ability but on the resolution of the interface through which it touches the corpus — agents constrained to a top-k similarity call cannot verify evidence that a lower-resolution surface has hidden. The same systems principle transposes onto persistent memory. The hidden object here is not a passage or document; it is the temporally scoped dependency closure behind an answer-correct stored belief. The ladder below organizes memory interfaces by what they expose of that structure.

Memory interfaces fall across six resolution levels. Each level adds one capability over the previous: from a bare answer string to direct timestamps to direct provenance and finally to transitive dependency closure. The benchmark measures what each level can and cannot resolve.

Figure: Six levels of what an interface can expose (L0 answer only, L1 answer + citation, L2 timestamped citation, L3 direct provenance, L4 dependency closure, L5 direct memory interaction).

Level L0: Answer only

Exposes answer string. Factual recall. No stale-premise detection.

The agent learns the answer is yes, but cannot say where it came from or whether the source is still current.

{
  answer: "yes"
}

Level L1: Answer + citation

Exposes source document or text. Grounding. No dependency propagation.

The agent can verify the answer against a specific source, but it cannot tell whether that source itself rests on an earlier claim.

{
  answer: "yes",
  citation: "doc_3.txt:L4"
}

Level L2: Timestamped citation

Exposes valid time + transaction time. Bitemporal tests. No dependency tests.

The agent can audit what the system believed at a past transaction time, but a claim's premises are still invisible.

{
  answer: "yes",
  citation: {
    doc: "doc_3.txt:L4",
    valid_from: "2027-01-15",
    txn_time: "T03"
  }
}

Level L3: Direct provenance

Exposes direct derived_from premises. Shallow stale-premise checks.

The agent sees the one-hop premise behind a derived claim, but a supersession three hops back is still hidden.

{
  answer: "yes",
  citation: { ... },
  derived_from: ["doc_2.txt"]
}

Level L4: Dependency closure

Exposes transitive premises + current comparison. Cascading-staleness detection.

The agent sees the full chain a claim rests on and compares each link to the current value, so a supersession anywhere upstream becomes visible.

{
  answer: "yes",
  citation: { ... },
  derived_from: ["doc_2"],
  closure: ["doc_2", "doc_1"],
  current: {
    "(Apple, ceo)": "Sarah Chen"
  },
  status: "POTENTIALLY_STALE"
}

Level L5: Direct memory interaction

Exposes bounded traversal tools. Tests whether agents can discover closure.

The agent gets traversal primitives and has to compose the walk itself; the bound is whether it can.

// Tools the agent can call:
walk_premises(claim, depth=3)
current_value(e, p)
range(e, p, t_from, t_to)

The point is not that higher levels are always better. Some agents only need L0. The point is that the level the interface exposes is the upper bound on what the agent can reason about.
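One way to see the upper bound is to take a single full record and strip it down to what each level would return. This is a sketch under assumptions: the field names follow the example return shapes for L0 through L4, while `LEVEL_FIELDS` and `project` are hypothetical helpers, not part of any library. L5 is left out because it exposes tools, not a return shape.

```python
# Fields exposed at each level, cumulative from L0 to L4.
LEVEL_FIELDS = {
    0: {"answer"},
    1: {"answer", "citation"},
    2: {"answer", "citation"},  # same key as L1, but the citation keeps its timestamps
    3: {"answer", "citation", "derived_from"},
    4: {"answer", "citation", "derived_from", "closure", "current", "status"},
}


def project(record: dict, level: int) -> dict:
    """Return the view of a full L4 record that an interface at `level` exposes."""
    view = {k: v for k, v in record.items() if k in LEVEL_FIELDS[level]}
    if level == 1:
        # L1 names the source but drops valid time and transaction time.
        view["citation"] = record["citation"]["doc"]
    return view


full = {
    "answer": "yes",
    "citation": {"doc": "doc_3.txt:L4", "valid_from": "2027-01-15", "txn_time": "T03"},
    "derived_from": ["doc_2"],
    "closure": ["doc_2", "doc_1"],
    "current": {"(Apple, ceo)": "Sarah Chen"},
    "status": "POTENTIALLY_STALE",
}

for level in range(5):
    view = project(full, level)
    print(f"L{level}: status visible = {'status' in view}")
```

The stored record is identical in every iteration. Only at L4 does the view contain the status, so an agent reading through L0 to L3 has nothing to reason from when the premise has moved.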

L5, the top of the ladder, sits closest to DCI's own setting in agentic search: the agent is given traversal primitives and asked to compose the walk itself, instead of going through a constrained retrieval API.
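To make the traversal concrete, here is a minimal sketch of how an agent might compose a bounded walk from two of those primitives. Everything here is assumed for illustration: the in-memory `CLAIMS` and `CURRENT` tables, the `Claim` class, the `stale_premises` helper, and every value except "Sarah Chen" are toy data. Only the names `walk_premises` and `current_value` come from the tool list; their bodies are my guess at one reasonable behaviour.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    id: str
    entity: str
    predicate: str
    value: str
    derived_from: list[str] = field(default_factory=list)


# Toy store: doc_3 rests on doc_2, which rests on doc_1.
CLAIMS = {
    "doc_1": Claim("doc_1", "Apple", "ceo", "Alex Roe"),  # toy value, since superseded
    "doc_2": Claim("doc_2", "Apple", "signatory", "Alex Roe", ["doc_1"]),
    "doc_3": Claim("doc_3", "Contract-7", "approved", "yes", ["doc_2"]),
}

# Latest value per (entity, predicate); only (Apple, ceo) has moved on.
CURRENT = {
    ("Apple", "ceo"): "Sarah Chen",
    ("Apple", "signatory"): "Alex Roe",
    ("Contract-7", "approved"): "yes",
}


def current_value(e: str, p: str) -> str:
    return CURRENT[(e, p)]


def walk_premises(claim_id: str, depth: int = 3) -> list[str]:
    """Breadth-first walk over derived_from links, bounded by `depth` hops."""
    seen: list[str] = []
    frontier = [claim_id]
    for _ in range(depth):
        next_frontier = []
        for cid in frontier:
            for parent in CLAIMS[cid].derived_from:
                if parent not in seen:
                    seen.append(parent)
                    next_frontier.append(parent)
        frontier = next_frontier
    return seen


def stale_premises(claim_id: str, depth: int = 3) -> list[str]:
    """Premises within `depth` hops whose stored value differs from the current one."""
    return [
        pid for pid in walk_premises(claim_id, depth)
        if current_value(CLAIMS[pid].entity, CLAIMS[pid].predicate) != CLAIMS[pid].value
    ]


print(walk_premises("doc_3"))            # closure behind the answer
print(stale_premises("doc_3", depth=1))  # one hop only, as at L3
print(stale_premises("doc_3"))           # bounded closure, as at L4/L5
```

With `depth=1` the walk stops at doc_2 and finds nothing wrong, which is the L3 blind spot. The default depth reaches doc_1, where the stored CEO no longer matches the current one, and the answer "yes" would earn a POTENTIALLY_STALE status.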

Same Store, Two Surfaces

Many production memory systems give you two surfaces against the same store: a structured API and a natural-language retrieval call. The structured API uses an indexed scan and a bitemporal filter. The natural-language path runs embedding similarity and reranking. Both read from the same persisted claims.

Figure: Same store. Different interface, different view. The store and its claims are the same. The retrieval interface decides what the agent sees and what kinds of question it can answer.

I call the structured path Mode A and the NL path Mode B. Mode C is a control I introduce in a moment.
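For readers who have not worked with a bitemporal filter, the sketch below shows roughly what the structured path does once it knows which entity and predicate to look up. It is an assumption-laden toy: `StoredClaim`, `STORE`, and `lookup` are hypothetical names, the list comprehension stands in for a real indexed scan, and "Alex Roe" is an invented earlier value.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class StoredClaim:
    entity: str
    predicate: str
    value: str
    valid_from: date  # when the fact became true in the world
    txn_time: int     # when the store recorded it (monotonic counter)


STORE = [
    StoredClaim("Apple", "ceo", "Alex Roe", date(2025, 1, 1), txn_time=1),  # toy value
    StoredClaim("Apple", "ceo", "Sarah Chen", date(2027, 1, 15), txn_time=3),
]


def lookup(entity: str, predicate: str, valid_at: date, as_of_txn: int):
    """Indexed-scan stand-in plus bitemporal filter.

    Returns the value valid at `valid_at`, using only what the store had
    recorded by transaction time `as_of_txn`; None if nothing qualifies.
    """
    candidates = [
        c for c in STORE
        if (c.entity, c.predicate) == (entity, predicate)
        and c.valid_from <= valid_at
        and c.txn_time <= as_of_txn
    ]
    if not candidates:
        return None
    # Among the qualifying claims, the most recently valid one wins.
    return max(candidates, key=lambda c: c.valid_from).value


print(lookup("Apple", "ceo", date(2027, 6, 1), as_of_txn=3))
print(lookup("Apple", "ceo", date(2027, 6, 1), as_of_txn=2))
```

The two calls ask about the same date in the world but different transaction times. The second one answers from what the store knew before the update was recorded, which is the kind of point-in-time audit a similarity-ranked retrieval call has no parameter for.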

The 305-Scenario Drop

When I scored Graphiti and Mem0 under both surfaces, the drop was not subtle.

Figure: Mode A vs Mode B on the same store. Scores out of 395 on the headline corpus. Mode A is a structured API; Mode B is the same store accessed via natural-language retrieval. The drop is the cost of the lower-resolution interface.

Graphiti goes from 354 to 49 out of 395 by changing nothing but the API call. Mem0 goes from 359 to 149. The drop is broad — questions that should return multiple values collapse, point-in-time audits collapse, date-range lookups collapse — not localised to one corner of the test set.

The Privileged Dispatch Objection

The reasonable objection at this point is that Mode A wins because its regex template parser knows the predicate names and entity IDs ahead of time, while Mode B has to discover them. So I built Mode C: same indexed-scan retrieval as Mode A, but with the regex parser replaced by a JSON dispatcher. The model decides what to look up; the storage path is identical.

Figure: Privileged dispatch control. The regex parser is not the cause.

A vs C: Δ2 (Mem0) / Δ9 (Graphiti). A vs B: Δ210 (Mem0) / Δ305 (Graphiti).

Mode A (regex parser): NL question → regex template parser → indexed scan → bitemporal filter. Full-rubric on Mem0: 359/395.
Mode C (LLM dispatcher): NL question → gpt-5-mini JSON dispatcher → indexed scan → bitemporal filter. Full-rubric on Mem0: 357/395.
Mode B (NL retrieval): NL question → hybrid retrieval. Full-rubric on Mem0: 149/395.

Mode C replaces Mode A's regex template parser with a gpt-5-mini JSON dispatcher and keeps the same indexed-scan retrieval. On Mem0, the score moves by 2 out of 395; the same control on Graphiti moves by 9. The Mode B drop is not a parser artefact.

On Mem0, Mode C scores 357 while Mode A scores 359: a difference of two scenarios out of 395. On Graphiti, Mode C scores 345 while Mode A scores 354: a difference of nine scenarios. The privileged-dispatch concern is bounded on both substrates. Mode B's drop is not the parser.
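The shape of the control is worth spelling out, because it is what makes the comparison fair. In the sketch below, which is illustrative and not the benchmark code, both modes end in the same `indexed_scan`; only the step that turns a question into a lookup coordinate differs. The template, the JSON keys, and `fake_dispatcher` are all assumptions, and `fake_dispatcher` is a placeholder where a real model call would sit.

```python
import json
import re

# Stand-in for the store behind the indexed scan.
FACTS = {("Apple", "ceo"): "Sarah Chen"}


def indexed_scan(entity: str, predicate: str):
    """Shared storage path: both modes end up here."""
    return FACTS.get((entity, predicate))


# Mode A: a template that already knows how questions are phrased.
TEMPLATE = re.compile(r"^Who is the (?P<predicate>\w+) of (?P<entity>\w+)\?$")


def mode_a(question: str):
    m = TEMPLATE.match(question)
    if m is None:
        return None
    return indexed_scan(m["entity"], m["predicate"].lower())


# Mode C: the dispatcher picks the coordinate and returns it as JSON.
def mode_c(question: str, dispatcher):
    call = json.loads(dispatcher(question))
    return indexed_scan(call["entity"], call["predicate"])


def fake_dispatcher(question: str) -> str:
    """Placeholder for the LLM call; a real dispatcher would query a model."""
    return json.dumps({"entity": "Apple", "predicate": "ceo"})


question = "Who is the CEO of Apple?"
print(mode_a(question), "|", mode_c(question, fake_dispatcher))
```

If the dispatcher picks the right coordinate, the two modes return the same thing by construction. Any score difference between A and C therefore measures dispatch errors, not retrieval quality.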

Where the Gap Lives

If the parser is not the cause, the next candidate is the LLM-driven ingest. Mode B uses Mem0's infer=True ingest, where an LLM rewrites the incoming document into structured claims. Maybe extraction quality matters more than the retrieval surface. I tested the cross — Mode A retrieval against the same infer=True ingest — and that run scored 352 out of 395. The Mode A → Mode B gap of 210 scenarios decomposes:

Figure: Where the gap lives. Of the Δ210 drop on Mem0, the retrieval surface accounts for 203 scenarios (97%) and ingest extraction for 7 (3%).

Holding Mem0's LLM-extraction ingest constant and varying only the retrieval surface, 97% of the Mode A → Mode B drop comes from the retrieval surface. Better extraction is not the lever; better read-path is.

Ninety-seven percent of the drop comes from the retrieval surface, not the LLM extraction. The same representation passes or fails depending on which API the agent calls.
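The decomposition is plain subtraction over the three Mem0 scores reported in this article. The snippet below only restates that arithmetic; the variable names are mine.

```python
mode_a_plain = 359     # Mode A retrieval over infer=False ingest
mode_a_inferred = 352  # Mode A retrieval over infer=True ingest
mode_b = 149           # Mode B retrieval over infer=True ingest

total_drop = mode_a_plain - mode_b                 # 210
extraction_share = mode_a_plain - mode_a_inferred  # ingest changes, retrieval fixed
retrieval_share = mode_a_inferred - mode_b         # retrieval changes, ingest fixed

# The two steps account for the whole drop.
assert extraction_share + retrieval_share == total_drop

print(f"retrieval:  {retrieval_share} ({retrieval_share / total_drop:.0%})")
print(f"extraction: {extraction_share} ({extraction_share / total_drop:.0%})")
```

This is a one-path decomposition: it changes ingest first and retrieval second. The fourth cell, Mode B retrieval over `infer=False` ingest, is not reported here, so an interaction between the two factors cannot be ruled out from these numbers alone.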

The Limit of Structured Access

Structured access wins on this comparison, but it has its own ceiling. Mode A scores 354 (Graphiti) and 359 (Mem0) out of 395, not 395. It catches answer-correctness and direct provenance, but it does not walk the chain of premises behind a derived claim. A claim can sit in the store, indexed, retrievable, and answer-correct, while a premise three hops back has been superseded.

This boundary is close to the one drawn by MRAgent, which argues that graph memory should be read through active reconstruction rather than one-shot retrieval or fixed neighborhood expansion.^[4] That is the same interface-resolution lesson from the other side: a graph representation helps only if the read interface can adapt as intermediate evidence changes what should be inspected next. For this benchmark, adaptive graph traversal is still not the whole contract; the answer also needs a bitemporal status derived from the support chain.

That is the failure mode the next article is about: cascading staleness. A small query-time walk over declared derived_from links closes the gap on every storage backend I tested. The fix is mechanical. The hard part is making it standard practice.

The Same Effect on a Shipping Product

It would be easy to read the Mode A / Mode B split as an artifact of the open-source libraries and my own wrapper. It isn't. The same interface-resolution effect shows up on a managed product you can buy today. That product exposes native valid_at / created_at date filters and links every stored fact back to its source; it markets itself as “context you can trace, filter, and trust”. Through that interface it retrieves the right claim on ~309 of the 395 scenarios, the best retrieval of any deployed comparator. But only 160 pass the full audit: the native surface resolves answers and dates, not the citation and epistemic structure behind them. Same lesson, no wrapper involved, which is why the scoreboard now labels the local runs (OSS) and the managed runs (hosted). The third article takes the hosted products apart in detail.

Benchmark Summary

| Task | Baseline | Control or proposed method | Metric | Delta | Caveat |
| --- | --- | --- | --- | --- | --- |
| Graphiti retrieval surface | Mode A structured query: 354/395 | Mode B natural-language retrieval: 49/395 | Full-rubric pass count | -305 scenarios | Same store; only retrieval surface changes. |
| Mem0 retrieval surface | Mode A structured query: 359/395 | Mode B natural-language retrieval: 149/395 | Full-rubric pass count | -210 scenarios | Mode B also uses `infer=True`, so the next row isolates ingest. |
| Mem0 ingest extraction | Mode A with `infer=False`: 359/395 | Mode A over `infer=True` ingest: 352/395 | Full-rubric pass count | -7 scenarios | 97% of the A/B drop remains retrieval-surface. |
| Dispatch privilege | Graphiti A: 354/395; Mem0 A: 359/395 | Graphiti C: 345/395; Mem0 C: 357/395 | Full-rubric pass count | -9 and -2 scenarios | The dispatcher is an LLM; the indexed scan stays fixed. |

References

[1] Li, Z., Zhang, H., Wei, C., Lu, P., Nie, P., Lu, Y., Bai, Y., Feng, S., Zhu, H., Zhong, M., Zhang, Y., Xie, J., Choi, Y., Zou, J., Han, J., Chen, W., Lin, J., Jiang, D., & Zhang, Y. (2026). Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction. arXiv:2605.05242.
[2] Rasmussen, P., Paliychuk, P., Beauvais, T., Ryan, J., & Chalef, D. (2025). Zep: A Temporal Knowledge Graph Architecture for Agent Memory. arXiv:2501.13956. Repo: github.com/getzep/graphiti.
[3] Chhikara, P., Khant, D., Aryan, S., Singh, T., & Yadav, D. (2025). Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory. arXiv:2504.19413. Site: mem0.ai.
[4] Ji, S., Li, Y., & Hooi, B. (2026). Memory is Reconstructed, Not Retrieved: Graph Memory for LLM Agents. arXiv:2606.06036.

Limitations

Requires a query schema. If entities and predicates are not canonicalized, Mode C can dispatch to the wrong coordinate.
Benchmark scores are controlled evaluations; production distributions and corpus noise can widen or shrink the gap.
Structured access still does not solve cascading staleness by itself; it needs the premise walk in the next article.
Mem0's retrieval algorithm continued to evolve through April–May 2026 (blog 1, blog 2) — both predate this article and I had not seen them at write time. The Mode A / Mode B numbers here reflect the version evaluated in the paper, not the later state-key + event_end reranking layer. The next article discusses how that update relates to the benchmark.