16 May 2026 · 7 min read · 1,633 words

#agent #memory #evaluation

When Remembered Facts Outlive Their Evidence

Why reliable agent memory needs to reconstruct not just answers, but the evidence and status those answers carry under change.

A memory system can return a fact that is still in the store and still plausible, while the evidence that made the fact safe to use has changed underneath it.

That is the problem this benchmark studies. Not whether an agent can recall a string. Whether it can reconstruct what it believed, when it believed it, what evidence supported it, and whether that evidence still deserves trust.

The Compliance Version

Imagine a project assistant that reads building-code updates.

Date	What the memory reads or derives
Jan 1	Towers over 40 floors require Class F-High review.
Jan 8	Aurelia Tower has 42 floors.
Jan 15	Aurelia Tower requires Class F-High review, derived from the Jan 1 rule and Jan 8 tower record.
Jun 1	The building code changes: towers over 50 floors require Class F-High review.

Now ask the memory: Does Aurelia Tower require Class F-High review?

The old decision may still be sitting in memory. It may even be the answer the system would have given at the earlier transaction time. But if the memory answers today without noticing that the rule behind the decision changed, the answer is not audit-safe.

The point is not that the old answer is always false. The point is that a system using the answer should know that the answer's support has changed.
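The timeline above can be replayed in a few lines of Go. This is a minimal sketch, not the benchmark's code: the `Rule` type, `ruleInForce`, `requiresReview`, and the `day` helper are hypothetical names, and the year 2026 is an assumption because the table gives only month and day.

```go
package main

import (
	"fmt"
	"time"
)

// Rule is a hypothetical building-code rule: towers with more than
// MinFloors floors require Class F-High review from EffectiveFrom onward.
type Rule struct {
	MinFloors     int
	EffectiveFrom time.Time
}

// ruleInForce returns the latest rule whose effective date is not after at.
// It assumes rules are sorted by EffectiveFrom in ascending order.
func ruleInForce(rules []Rule, at time.Time) (Rule, bool) {
	var found Rule
	ok := false
	for _, r := range rules {
		if !r.EffectiveFrom.After(at) {
			found, ok = r, true
		}
	}
	return found, ok
}

// requiresReview evaluates a tower against the rule in force at the given time.
func requiresReview(rules []Rule, floors int, at time.Time) bool {
	r, ok := ruleInForce(rules, at)
	return ok && floors > r.MinFloors
}

// day builds a UTC date in the assumed year 2026.
func day(m time.Month, d int) time.Time {
	return time.Date(2026, m, d, 0, 0, 0, 0, time.UTC)
}

func main() {
	rules := []Rule{
		{MinFloors: 40, EffectiveFrom: day(time.January, 1)},
		{MinFloors: 50, EffectiveFrom: day(time.June, 1)},
	}
	const aureliaFloors = 42

	fmt.Println(requiresReview(rules, aureliaFloors, day(time.January, 15))) // true: the Jan 15 decision
	fmt.Println(requiresReview(rules, aureliaFloors, day(time.June, 2)))     // false: same question after the amendment
}
```

Re-evaluating against the new rule is only one possible policy. Whether an amendment applies to decisions already made is a domain question; the sketch only shows that the stored Jan 15 decision and a fresh evaluation can diverge, which is the signal a memory should surface.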

The Dependency-Chain Version

The paper uses the same pattern in a corporate setting because real-world entity names make it easier to spot when a system answers from its training data instead of from the corpus you actually loaded. Here is the simplified Apple version.

Evidence Chain

The answer survives while its support changes.

Potentially stale

Jan 1

Tim Cook is CEO of Apple

A premise the memory used for later claims.

derived_from ↓

Jan 8

Maya Patel is chief of staff to the Apple CEO

Derived from the Jan 1 CEO claim.

derived_from ↓

Jan 15

Maya Patel has signing authority for Apple legal

Derived from the chief-of-staff claim.

supersedes ↑ Tim Cook

Jun 1

Sarah Chen is CEO of Apple from Jun 1

A later update supersedes the old CEO premise.

The signing-authority claim is not automatically false. The audit problem is that its support chain now passes through a superseded premise.

The memory has a signing-authority claim. That claim depends on a chief-of-staff claim. The chief-of-staff claim depends on an earlier CEO premise. Later, the CEO premise is superseded.

If the memory only retrieves the signing-authority claim, it can answer:

Yes.

But the audit question is larger:

Yes, according to which evidence, and is that evidence still safe?
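Answering the larger question means walking the support chain rather than stopping at the answer node. The Go sketch below is one way to do that, under assumptions: the trimmed `Claim` type, the `SupersededBy` field, the claim IDs, and the `supersededPremises` function are all hypothetical and are not taken from the benchmark.

```go
package main

import "fmt"

// Claim is a trimmed, hypothetical claim record for this sketch.
type Claim struct {
	ID           string
	Body         string
	DerivedFrom  []string // IDs of the premises this claim depends on
	SupersededBy string   // ID of the claim that replaced this one; "" if none
}

// supersededPremises walks the derived_from links transitively from id and
// returns the IDs of every superseded claim it meets (the starting claim
// included), in depth-first order. The visited set guards against cycles.
func supersededPremises(store map[string]Claim, id string) []string {
	var stale []string
	visited := map[string]bool{}
	var walk func(string)
	walk = func(cur string) {
		if visited[cur] {
			return
		}
		visited[cur] = true
		c, ok := store[cur]
		if !ok {
			return // dangling link: nothing to inspect
		}
		if c.SupersededBy != "" {
			stale = append(stale, c.ID)
		}
		for _, p := range c.DerivedFrom {
			walk(p)
		}
	}
	walk(id)
	return stale
}

// exampleStore builds the four claims from the Apple example.
func exampleStore() map[string]Claim {
	return map[string]Claim{
		"ceo-jan1":   {ID: "ceo-jan1", Body: "Tim Cook is CEO of Apple", SupersededBy: "ceo-jun1"},
		"cos-jan8":   {ID: "cos-jan8", Body: "Maya Patel is chief of staff to the Apple CEO", DerivedFrom: []string{"ceo-jan1"}},
		"sign-jan15": {ID: "sign-jan15", Body: "Maya Patel has signing authority for Apple legal", DerivedFrom: []string{"cos-jan8"}},
		"ceo-jun1":   {ID: "ceo-jun1", Body: "Sarah Chen is CEO of Apple from Jun 1"},
	}
}

func main() {
	stale := supersededPremises(exampleStore(), "sign-jan15")
	fmt.Println(stale) // [ceo-jan1]
}
```

A non-empty result does not make the signing-authority claim false. It only says that the chain behind the "yes" passes through a premise that has since been replaced.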

Watch the same answer change status.

Query: Does Maya Patel have signing authority for Apple legal matters?

Step 1 (Jan 1): The memory stores a premise.

On Jan 1, the memory reads that Tim Cook is CEO of Apple. At this point, there is only one premise in the evidence box.

Step 2 (Jan 8 -> Jan 15): The memory derives downstream facts.

On Jan 8 and Jan 15, the memory stores two claims that depend on the CEO premise: a chief-of-staff relation, then signing authority. The downstream answer now depends on the Jan 1 premise.

Step 3 (Mar 1): A user asks an ordinary question.

On Mar 1, a user asks whether Maya Patel has signing authority. If retrieval stops at the answer node, the memory looks fine: it answers yes.

Step 4 (Jun 1): A later update changes the support chain.

On Jun 1, the memory reads that Sarah Chen became CEO. The answer is not automatically false, but the support chain now needs review. The answer is still yes, but its status should now be POTENTIALLY_STALE.

Two Kinds of Time

Reliable memory under change needs two clocks.

import "time"

type ClaimID string
type SourceID string

type Claim struct {
    ID   ClaimID
    Body string

    // Support: other claim IDs this one depends on. A derived claim points
    // to its premises; a leaf premise has none and rests on Sources instead.
    // Example: the "Maya Patel has signing authority" claim derives from the
    // "Maya Patel is chief of staff" claim.
    DerivedFrom []ClaimID

    // Source documents this claim was read from. Leaf premises always have
    // at least one. Used to audit the foot of the support chain.
    // Example: a press release, SEC 8-K filing, or board-meeting minutes.
    Sources []SourceID

    // Valid time: when did the claim become true about the world?
    // Example: Sarah Chen became CEO from Jun 1.
    ValidStart time.Time

    // Valid time: when did the claim stop being true? nil if still current.
    // Example: nil while Sarah Chen is still CEO.
    ValidEnd *time.Time

    // Transaction time: when did the memory learn the claim?
    // Example: the memory read that update on Jun 1.
    CreatedAt time.Time
}

Together those fields define a formal bitemporal validity predicate. Using lowercase $t$ for the claim's own timestamps and uppercase $T$ for the query's time points:

$t_{vs}$ — the claim's ValidStart (when it became true in the world).
$t_{ve}$ — the claim's ValidEnd (when it stopped being true; NULL if still current).
$t_t$ — the claim's CreatedAt (when the memory learned it).
$T_v$ — the query's valid-time point (the moment in the world being asked about).
$T_t$ — the query's transaction-time point (the moment in the memory's history being asked about).

The claim is valid at the query point $(T_v, T_t)$ when:

$$t_{vs} \le T_v \;\land\; (t_{ve}\;\texttt{IS NULL} \;\lor\; t_{ve} > T_v) \;\land\; t_t \le T_t$$

The first two conjuncts pin the claim in the world's history; the third pins it in the memory's history. Both must hold for the answer at $(T_v, T_t)$ to be safe to use.

Those clocks are different. A correction can arrive late. A policy can be amended after a downstream decision was made. A memory audit has to ask not only what is true now, but what the system should have believed at a given point in its own history.
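The predicate translates almost directly into Go. The sketch below keeps only the three temporal fields of the struct above; the `validAt` and `day` helpers are hypothetical names, the year 2026 is assumed, and closing the Tim Cook claim with a `ValidEnd` of Jun 1 is my reading of the example rather than something the benchmark prescribes.

```go
package main

import (
	"fmt"
	"time"
)

// Claim carries only the temporal fields needed by the predicate.
type Claim struct {
	ValidStart time.Time  // t_vs
	ValidEnd   *time.Time // t_ve; nil means still current
	CreatedAt  time.Time  // t_t
}

// validAt reports whether the claim holds at the query point (tv, tt):
// t_vs <= T_v  AND  (t_ve IS NULL OR t_ve > T_v)  AND  t_t <= T_t.
func validAt(c Claim, tv, tt time.Time) bool {
	started := !c.ValidStart.After(tv)
	notEnded := c.ValidEnd == nil || c.ValidEnd.After(tv)
	known := !c.CreatedAt.After(tt)
	return started && notEnded && known
}

// day builds a UTC date in the assumed year 2026.
func day(m time.Month, d int) time.Time {
	return time.Date(2026, m, d, 0, 0, 0, 0, time.UTC)
}

func main() {
	jan1, mar1, jun1 := day(time.January, 1), day(time.March, 1), day(time.June, 1)

	// Tim Cook's CEO claim: valid from Jan 1, assumed closed at Jun 1.
	cook := Claim{ValidStart: jan1, ValidEnd: &jun1, CreatedAt: jan1}
	// Sarah Chen's CEO claim: valid from Jun 1, learned on Jun 1.
	chen := Claim{ValidStart: jun1, CreatedAt: jun1}

	fmt.Println(validAt(cook, mar1, mar1)) // true
	fmt.Println(validAt(cook, jun1, jun1)) // false: the end bound is exclusive
	fmt.Println(validAt(chen, jun1, jun1)) // true
	fmt.Println(validAt(chen, jun1, mar1)) // false: not yet learned on Mar 1
}
```

The last line is the two-clock case: Sarah Chen is CEO in the world on Jun 1, but a memory queried as of Mar 1 in its own history had not read that yet.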

Temporal Belief Reconstruction

I call the benchmark task temporal belief reconstruction.

Given a question with valid-time and transaction-time constraints, the memory system should reconstruct three things:

Output	What it means
Answer	The value the memory should return for the queried time.
Evidence	The documents or claims that support that answer.
Epistemic status	Whether the answer is single-source, corroborated, contested, enriched, or potentially stale.

The third output is the easy one to overlook. Most retrieval systems are built to return a value and maybe a citation. But a maintained memory should also say what kind of state the answer is in.

Answer-only memory

Answer: Yes
Evidence: Jan 15 note
Status: UNVERIFIED

Auditable memory

Answer: Yes
Evidence: Jan 15 -> Jan 8 -> Jan 1
Status: POTENTIALLY_STALE
Reason: Jan 1 was superseded on Jun 1

The Status Is Part of the Answer

The key status in this benchmark is POTENTIALLY_STALE.

It does not mean the answer is false. It means the answer rests on a support chain that now contains a superseded premise. That is a different state from a single-source answer with no known problems.

UNVERIFIED

One source supports the claim, with no known conflict or stale premise.

CORROBORATED

Multiple sources support the same single-valued claim.

CONTESTED

Competing sources support incompatible values.

ENRICHED

A set-valued answer gained additional compatible values.

POTENTIALLY_STALE

The answer may still hold, but a supporting premise has been superseded.
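The five statuses can be sketched as a small Go classifier. This is illustrative only: the `Support` summary type and its fields are hypothetical, and the precedence order used when several conditions hold at once (stale first, then contested, enriched, corroborated) is an assumption of the sketch, since the definitions above do not rank them.

```go
package main

import "fmt"

// Status is the epistemic status attached to an answer.
type Status string

const (
	Unverified       Status = "UNVERIFIED"
	Corroborated     Status = "CORROBORATED"
	Contested        Status = "CONTESTED"
	Enriched         Status = "ENRICHED"
	PotentiallyStale Status = "POTENTIALLY_STALE"
)

// Support is a hypothetical summary of what the memory knows about an answer.
type Support struct {
	Sources           int  // independent sources backing the returned value
	ConflictingValues bool // competing sources support incompatible values
	AddedSetValues    bool // a set-valued answer gained compatible values
	SupersededPremise bool // a premise in the support chain was superseded
}

// classify maps a Support summary to a single status.
// ASSUMPTION: the precedence order of the cases is this sketch's choice.
func classify(s Support) Status {
	switch {
	case s.SupersededPremise:
		return PotentiallyStale
	case s.ConflictingValues:
		return Contested
	case s.AddedSetValues:
		return Enriched
	case s.Sources > 1:
		return Corroborated
	default:
		return Unverified
	}
}

func main() {
	fmt.Println(classify(Support{Sources: 1}))                          // UNVERIFIED
	fmt.Println(classify(Support{Sources: 1, SupersededPremise: true})) // POTENTIALLY_STALE
}
```

The two printed lines correspond to the answer-only and auditable views of the signing-authority question: same source count, different status once the superseded premise is taken into account.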

This is why answer-only evaluation misses the issue. If the benchmark only asked whether the system returned "yes" for the signing-authority question, many systems would look better than they are. The failure is visible only when the system must also reconstruct the evidence chain and the status that evidence should carry.

Yona, Geva, and Matias make a structurally parallel argument about parametric models^[8]: most factuality work has expanded what a model knows, but reliability turns on the model's awareness of what it knows — a per-answer faithful-uncertainty signal, not aggregate calibration, and a third option beyond answer or abstain. The same shape applies to retrieved memory. Where they ask whether a parametric model can recognize the limits of its own knowledge, this benchmark asks whether a memory interface can recognize when a stored claim's support has changed. POTENTIALLY_STALE is the structural analogue of that third option: not silence, not a confident answer, but an answer that arrives with a faithful signal that its evidence has shifted.

Self-revising discovery systems make the same contract explicit for long-running scientific agents: durable state should be typed artifacts, provenance, gates, rejected alternatives, and supersessions, not a hidden vector or transcript alone.^[9] This benchmark is a narrower fixed-schema version of that idea. It asks whether a memory answer can carry the provenance and verifier state needed to tell a usable fact from one whose support has moved.

What the Benchmark Tests

The benchmark grades each scenario on separate axes:

Did the system return the right answer?
Did it return the right citation or evidence chain?
Did it return the right epistemic status?
Did the system finish ingesting the corpus and become queryable in bounded time?

Failing any one of those axes is a failure for that scenario. The separation is the point: a system can be answer-correct and still fail because it cannot explain whether the answer is safe to use.
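The all-axes rule is easy to state in code. The following Go sketch is not the benchmark's scorer: `Result`, `Expected`, `Axes`, `score`, and `sameIDs` are hypothetical names, and comparing answers by exact string match, comparing evidence as an unordered list of IDs, and using a one-minute ingest limit are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Result is what a memory system returns for one scenario (hypothetical shape).
type Result struct {
	Answer   string
	Evidence []string      // IDs of the cited claims or documents
	Status   string        // epistemic status label
	Ingest   time.Duration // time taken to ingest the corpus and become queryable
}

// Expected is the hypothetical gold record for one scenario.
type Expected struct {
	Answer      string
	Evidence    []string
	Status      string
	IngestLimit time.Duration
}

// Axes records a pass/fail verdict for each of the four axes.
type Axes struct {
	Answer, Evidence, Status, Ingest bool
}

// Pass is true only when every axis passes.
func (a Axes) Pass() bool { return a.Answer && a.Evidence && a.Status && a.Ingest }

// sameIDs compares two ID lists while ignoring order.
func sameIDs(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	x := append([]string(nil), a...)
	y := append([]string(nil), b...)
	sort.Strings(x)
	sort.Strings(y)
	for i := range x {
		if x[i] != y[i] {
			return false
		}
	}
	return true
}

// score grades one result against the expected record, axis by axis.
func score(got Result, want Expected) Axes {
	return Axes{
		Answer:   got.Answer == want.Answer,
		Evidence: sameIDs(got.Evidence, want.Evidence),
		Status:   got.Status == want.Status,
		Ingest:   got.Ingest <= want.IngestLimit,
	}
}

func main() {
	want := Expected{
		Answer:      "Yes",
		Evidence:    []string{"jan15", "jan8", "jan1"},
		Status:      "POTENTIALLY_STALE",
		IngestLimit: time.Minute,
	}
	answerOnly := Result{Answer: "Yes", Evidence: []string{"jan15"}, Status: "UNVERIFIED", Ingest: time.Second}

	axes := score(answerOnly, want)
	fmt.Println(axes.Answer, axes.Pass()) // true false
}
```

The answer-only result gets the answer axis right and still fails the scenario, because its evidence stops at the Jan 15 note and its status misses the superseded premise.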

Recent long-context memory benchmarks like BEAM^[7] exercise overlapping abilities — knowledge update, contradiction resolution, temporal reasoning — but score them on dialogue recall rather than requiring deterministic bitemporal reconstruction with audit-safe epistemic status and transitive dependency-closure checks.

What I Built

This project is not only a paper argument. I built:

the benchmark corpus
the deterministic scorer
a reference implementation (a roughly fifty-line SQL store on )
the wrapper interface for comparison systems
the result runners
the reproducibility bundle published with the artifact

The important constraint was that scoring should be structural rather than judged: expected answers are derived from the formal time-and-evidence rules and the declared evidence links, never by asking another model whether an answer sounds reasonable.

The next article is about why this happens. The short version is that many memory interfaces expose the answer but not enough of the dependency structure behind the answer. The stored evidence exists at one resolution, while the agent-facing retrieval interface exposes a lower-resolution view. The third article shows the read-path fix and where it stops working.

Benchmark Summary

Task	Baseline	Proposed contract	Metric	Delta	Caveat
Temporal belief reconstruction	Answer-only memory	Answer + evidence + epistemic status	Four-axis scenario pass	Answer-only passes can still fail full rubric	Requires structural scoring, not LLM judging.
Time handling	Single “last updated” timestamp	Valid-time interval + transaction timestamp	Bitemporal validity predicate	Catches late corrections and retroactive facts	Boundary convention must be `[start, end)`.
Evidence safety	Citation retrieval	Citation plus status (`UNVERIFIED`, `CORROBORATED`, `CONTESTED`, `ENRICHED`, `POTENTIALLY_STALE`)	Epistemic axis	Separates plausible answers from audit-safe answers	Status needs provenance data from the write path.

Limitations

The contract only works if the ingest path preserves sources, timestamps, and derivation links instead of flattening them away.
Bitemporal storage does not by itself solve retrieval; lower-resolution interfaces can still hide the data.
This series focuses on reliability/status under change, not semantic breadth or user-experience quality.

References

The full system list and per-row configuration detail live in the next article and the third article. The canonical references for the comparator systems and for the works cited above are:

[1] Rasmussen, P., Paliychuk, P., Beauvais, T., Ryan, J., & Chalef, D. (2025). Zep: A Temporal Knowledge Graph Architecture for Agent Memory. arXiv:2501.13956. Repo: github.com/getzep/graphiti.
[2] Chhikara, P., Khant, D., Aryan, S., Singh, T., & Yadav, D. (2025). Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory. arXiv:2504.19413. Site: mem0.ai.
[3] Marković, V., Obradović, L., Hajdu, L., & Pavlović, J. (2025). Optimizing the Interface Between Knowledge Graphs and LLMs for Complex Reasoning. arXiv:2505.24478. Repo: github.com/topoteretes/cognee.
[4] TencentDB Agent Memory Team (2026). TencentDB-Agent-Memory: Layered Local Memory for AI Agents. Repo: github.com/Tencent/TencentDB-Agent-Memory.
[5] Honcho (2026). Managed personal-memory service with a Dialectic Agent retrieval surface. Site: honcho.dev.
[6] Supermemory (2026). Memory graph with documented update / extend / derive relation labels. Site: supermemory.ai.
[7] Tavakoli, M., Salemi, A., Ye, C., Abdalla, M., Zamani, H., & Mitchell, J. R. (2026). Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs. ICLR 2026. arXiv:2510.27246. Repo: github.com/mohammadtavakoli78/BEAM.
[8] Yona, G., Geva, M., & Matias, Y. (2026). Hallucinations Undermine Trust; Metacognition is a Way Forward. arXiv:2605.01428.
[9] Wang, F. Y., & Buehler, M. J. (2026). Self-Revising Discovery Systems for Science: A Categorical Framework for Agentic Artificial Intelligence. arXiv:2606.01444.