When Remembered Facts Outlive Their Evidence
Why reliable agent memory needs to reconstruct not just answers, but the evidence and status those answers carry under change.
A memory system can return a fact that is still in the store and still plausible, while the evidence that made the fact safe to use has changed underneath it.
That is the problem this benchmark studies. Not whether an agent can recall a string. Whether it can reconstruct what it believed, when it believed it, what evidence supported it, and whether that evidence still deserves trust.
The Compliance Version
Imagine a project assistant that reads building-code updates.
| Date | What the memory reads or derives |
|---|---|
| Jan 1 | Towers over 40 floors require Class F-High review. |
| Jan 8 | Aurelia Tower has 42 floors. |
| Jan 15 | Aurelia Tower requires Class F-High review, derived from the Jan 1 rule and Jan 8 tower record. |
| Jun 1 | The building code changes: towers over 50 floors require Class F-High review. |
Now ask the memory: Does Aurelia Tower require Class F-High review?
The old decision may still be sitting in memory. It may even be the answer the system would have given at the earlier transaction time. But if the memory answers today without noticing that the rule behind the decision changed, the answer is not audit-safe.
The point is not that the old answer is always false. The point is that a system using the answer should know that the answer's support has changed.
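As a minimal sketch of that idea, the Go program below records which rule version a decision was derived from and checks whether a different version is in force at query time. The `RuleVersion` and `Decision` types, the helper names, and the year are all assumptions made for illustration; none of them come from the benchmark itself.

```go
package main

import (
	"fmt"
	"time"
)

// RuleVersion is a hypothetical record of one version of the review rule:
// towers with more than FloorThreshold floors require Class F-High review.
type RuleVersion struct {
	EffectiveFrom  time.Time
	FloorThreshold int
}

// Decision is a hypothetical stored conclusion together with the effective
// date of the rule version it was derived from.
type Decision struct {
	Tower          string
	Floors         int
	RequiresReview bool
	DerivedFrom    time.Time
}

// day builds a date in an arbitrary year; the article gives no year.
func day(month, d int) time.Time {
	return time.Date(2025, time.Month(month), d, 0, 0, 0, 0, time.UTC)
}

// ruleAt returns the rule version in force at time t.
// versions must be sorted by EffectiveFrom, oldest first.
func ruleAt(versions []RuleVersion, t time.Time) (RuleVersion, bool) {
	var current RuleVersion
	found := false
	for _, v := range versions {
		if !v.EffectiveFrom.After(t) {
			current, found = v, true
		}
	}
	return current, found
}

// supportChanged reports whether the rule in force at time t is a different
// version from the one the decision was derived from.
func supportChanged(d Decision, versions []RuleVersion, t time.Time) bool {
	r, ok := ruleAt(versions, t)
	return !ok || !r.EffectiveFrom.Equal(d.DerivedFrom)
}

// exampleData returns the Aurelia Tower decision and both rule versions.
func exampleData() (Decision, []RuleVersion) {
	versions := []RuleVersion{
		{EffectiveFrom: day(1, 1), FloorThreshold: 40},
		{EffectiveFrom: day(6, 1), FloorThreshold: 50},
	}
	d := Decision{Tower: "Aurelia Tower", Floors: 42, RequiresReview: true, DerivedFrom: day(1, 1)}
	return d, versions
}

func main() {
	d, versions := exampleData()
	fmt.Println(supportChanged(d, versions, day(3, 1))) // false: the Jan 1 rule is still in force
	fmt.Println(supportChanged(d, versions, day(6, 2))) // true: the Jun 1 rule replaced it
}
```

The check deliberately does not recompute the decision. It only reports that the support changed, which is the signal a downstream consumer needs before trusting the stored answer.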
The Dependency-Chain Version
The paper uses the same pattern in a corporate setting because real-world entity names make it easier to spot when a system answers from its training data instead of from the corpus you actually loaded. Here is the simplified Apple version.
Evidence Chain
The answer survives while its support changes.
| Date | Claim | Role in the chain |
|---|---|---|
| Jan 1 | Tim Cook is CEO of Apple | A premise the memory used for later claims. |
| Jan 8 | Maya Patel is chief of staff to the Apple CEO | Derived from the Jan 1 CEO claim. |
| Jan 15 | Maya Patel has signing authority for Apple legal | Derived from the chief-of-staff claim. |
| Jun 1 | Sarah Chen is CEO of Apple from Jun 1 | A later update supersedes the old CEO premise. |
The memory has a signing-authority claim. That claim depends on a chief-of-staff claim. The chief-of-staff claim depends on an earlier CEO premise. Later, the CEO premise is superseded.
If the memory only retrieves the signing-authority claim, it can answer:
Yes.
But the audit question is larger:
Yes, according to which evidence, and is that evidence still safe?
Watch the same answer change status.
Step 1 (Jan 1): The memory stores a premise.
On Jan 1, the memory reads that Tim Cook is CEO of Apple. The evidence box, empty until now, holds exactly one premise.
Step 2 (Jan 8 to Jan 15): The memory derives downstream facts.
On Jan 8 and Jan 15, the memory stores two claims that depend on the CEO premise: a chief-of-staff relation, then signing authority. The downstream answer now depends on the Jan 1 premise.
Step 3 (Mar 1): A user asks an ordinary question.
On Mar 1, a user asks: Does Maya Patel have signing authority for Apple legal matters? The memory answers yes. If retrieval stops at the answer node, the memory looks fine.
Step 4 (Jun 1): A later update changes the support chain.
On Jun 1, the memory reads that Sarah Chen became CEO. The update reaches backward through the support chain. The answer is still yes and is not automatically false, but its status should change to POTENTIALLY_STALE and the support chain now needs review.
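The four steps can be sketched as a transitive walk over derivation links. This is an illustrative sketch only: the `SupersededBy` field, the claim IDs, and the helper functions are hypothetical, and a real store would also need timestamps and sources.

```go
package main

import "fmt"

type ClaimID string

// Claim keeps only the fields needed to walk a support chain.
// SupersededBy is a hypothetical field: empty means no later claim replaced this one.
type Claim struct {
	ID           ClaimID
	Body         string
	DerivedFrom  []ClaimID
	SupersededBy ClaimID
}

// exampleStore holds the three Jan claims from the walkthrough.
func exampleStore() map[ClaimID]Claim {
	return map[ClaimID]Claim{
		"ceo":     {ID: "ceo", Body: "Tim Cook is CEO of Apple"},
		"cos":     {ID: "cos", Body: "Maya Patel is chief of staff to the Apple CEO", DerivedFrom: []ClaimID{"ceo"}},
		"signing": {ID: "signing", Body: "Maya Patel has signing authority for Apple legal", DerivedFrom: []ClaimID{"cos"}},
	}
}

// supersede marks claim old as replaced by a later claim.
func supersede(store map[ClaimID]Claim, old, replacement ClaimID) {
	c := store[old]
	c.SupersededBy = replacement
	store[old] = c
}

// supersededPremises walks the support chain of id transitively and returns
// every premise on it that has been superseded. The seen set guards against
// cycles and repeated visits.
func supersededPremises(store map[ClaimID]Claim, id ClaimID) []ClaimID {
	var stale []ClaimID
	seen := map[ClaimID]bool{}
	stack := append([]ClaimID{}, store[id].DerivedFrom...)
	for len(stack) > 0 {
		cur := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		if seen[cur] {
			continue
		}
		seen[cur] = true
		c, ok := store[cur]
		if !ok {
			continue
		}
		if c.SupersededBy != "" {
			stale = append(stale, cur)
		}
		stack = append(stack, c.DerivedFrom...)
	}
	return stale
}

func main() {
	store := exampleStore()
	fmt.Println(len(supersededPremises(store, "signing"))) // 0: nothing superseded as of Mar 1
	supersede(store, "ceo", "ceo-jun1")                    // the Jun 1 update arrives
	fmt.Println(supersededPremises(store, "signing"))      // [ceo]
}
```

A non-empty result is what would justify reporting POTENTIALLY_STALE. Retrieval that stops at the `signing` node never sees it, because the superseded claim sits two hops away.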
Two Kinds of Time
Reliable memory under change needs two clocks.
```go
package memory // hypothetical package name

import "time"

type ClaimID string
type SourceID string

type Claim struct {
	ID   ClaimID
	Body string

	// Support: other claim IDs this one depends on. A derived claim points
	// to its premises; a leaf premise has none and rests on Sources instead.
	// Example: the "Maya Patel has signing authority" claim derives from the
	// "Maya Patel is chief of staff" claim.
	DerivedFrom []ClaimID

	// Source documents this claim was read from. Leaf premises always have
	// at least one. Used to audit the foot of the support chain.
	// Example: a press release, SEC 8-K filing, or board-meeting minutes.
	Sources []SourceID

	// Valid time: when did the claim become true about the world?
	// Example: Sarah Chen became CEO from Jun 1.
	ValidStart time.Time

	// Valid time: when did the claim stop being true? nil if still current.
	// Example: nil while Sarah Chen is still CEO.
	ValidEnd *time.Time

	// Transaction time: when did the memory learn the claim?
	// Example: the memory read that update on Jun 1.
	CreatedAt time.Time
}
```
Together those fields define a formal bitemporal validity predicate. Using lowercase for the claim's own timestamps and uppercase for the query's time points:
- $v_s$: the claim's `ValidStart` (when it became true in the world).
- $v_e$: the claim's `ValidEnd` (when it stopped being true; `NULL` if still current).
- $t_c$: the claim's `CreatedAt` (when the memory learned it).
- $V$: the query's valid-time point (the moment in the world being asked about).
- $T$: the query's transaction-time point (the moment in the memory's history being asked about).
The claim is valid at the query point $(V, T)$ when:
$$v_s \le V \;\land\; \left(v_e = \text{NULL} \;\lor\; V < v_e\right) \;\land\; t_c \le T$$
The first two conjuncts pin the claim in the world's history; the third pins it in the memory's history. Both must hold for the answer at $(V, T)$ to be safe to use.
Those clocks are different. A correction can arrive late. A policy can be amended after a downstream decision was made. A memory audit has to ask not only what is true now, but what the system should have believed at a given point in its own history.
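The two-clock check can be sketched as a small Go function. This is a sketch under stated assumptions: the half-open interval follows the `[start, end)` convention noted in the benchmark summary, the `day` helper and its year are arbitrary, and ending the Tim Cook claim on Jun 1 is assumed for the example rather than stated in the scenario.

```go
package main

import (
	"fmt"
	"time"
)

// Claim carries only the three timestamps the predicate needs.
type Claim struct {
	Body       string
	ValidStart time.Time
	ValidEnd   *time.Time // nil if still current
	CreatedAt  time.Time
}

// day builds a date in an arbitrary year; the article gives no year.
func day(month, d int) time.Time {
	return time.Date(2025, time.Month(month), d, 0, 0, 0, 0, time.UTC)
}

// ValidAt reports whether c holds at valid time v, as known to the memory at
// transaction time t. The valid interval is half-open: [ValidStart, ValidEnd).
func ValidAt(c Claim, v, t time.Time) bool {
	started := !c.ValidStart.After(v)
	notEnded := c.ValidEnd == nil || v.Before(*c.ValidEnd)
	known := !c.CreatedAt.After(t)
	return started && notEnded && known
}

func main() {
	chen := Claim{Body: "Sarah Chen is CEO of Apple", ValidStart: day(6, 1), CreatedAt: day(6, 1)}
	fmt.Println(ValidAt(chen, day(6, 15), day(6, 15))) // true
	fmt.Println(ValidAt(chen, day(6, 15), day(5, 1)))  // false: the memory had not learned it yet
	fmt.Println(ValidAt(chen, day(3, 1), day(6, 15)))  // false: before ValidStart

	end := day(6, 1) // assumed end of the earlier CEO claim
	cook := Claim{Body: "Tim Cook is CEO of Apple", ValidStart: day(1, 1), ValidEnd: &end, CreatedAt: day(1, 1)}
	fmt.Println(ValidAt(cook, day(5, 31), day(6, 15))) // true
	fmt.Println(ValidAt(cook, day(6, 1), day(6, 15)))  // false: the end bound is exclusive
}
```

The second query is the audit case: it asks what the memory could have known on May 1, regardless of what later turned out to be true.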
Temporal Belief Reconstruction
I call the benchmark task temporal belief reconstruction.
Given a question with valid-time and transaction-time constraints, the memory system should reconstruct three things:
| Output | What it means |
|---|---|
| Answer | The value the memory should return for the queried time. |
| Evidence | The documents or claims that support that answer. |
| Epistemic status | Whether the answer is single-source, corroborated, contested, enriched, or potentially stale. |
The third output is the easy one to overlook. Most retrieval systems are built to return a value and maybe a citation. But a maintained memory should also say what kind of state the answer is in.
Answer-only memory:
- Answer: Yes
- Evidence: Jan 15 note
- Status: UNVERIFIED

Auditable memory:
- Answer: Yes
- Evidence: Jan 15 -> Jan 8 -> Jan 1
- Status: POTENTIALLY_STALE
- Reason: Jan 1 was superseded on Jun 1
The Status Is Part of the Answer
The key status in this benchmark is POTENTIALLY_STALE.
It does not mean the answer is false. It means the answer rests on a support chain that now contains a superseded premise. That is a different state from a single-source answer with no known problems.
UNVERIFIED
One source supports the claim, with no known conflict or stale premise.
CORROBORATED
Multiple sources support the same single-valued claim.
CONTESTED
Competing sources support incompatible values.
ENRICHED
A set-valued answer gained additional compatible values.
POTENTIALLY_STALE
The answer may still hold, but a supporting premise has been superseded.
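One way to make the five statuses concrete is a small classifier over facts the write path would need to record. This is a hypothetical sketch: the `Support` fields are invented for illustration, and the precedence among statuses (stale first, then contested, enriched, corroborated) is an assumption, not something the benchmark definition above specifies.

```go
package main

import "fmt"

// Status is the epistemic status attached to an answer.
type Status string

const (
	Unverified       Status = "UNVERIFIED"
	Corroborated     Status = "CORROBORATED"
	Contested        Status = "CONTESTED"
	Enriched         Status = "ENRICHED"
	PotentiallyStale Status = "POTENTIALLY_STALE"
)

// Support is a hypothetical summary of what the memory knows about the
// evidence behind one answer.
type Support struct {
	Sources           int  // sources that support the returned value
	ConflictingValues bool // competing sources support incompatible values
	AddedValues       bool // a set-valued answer gained compatible values
	SupersededPremise bool // some premise on the support chain was superseded
}

// Classify maps a support summary to a status.
// ASSUMPTION: the order of the cases is an illustrative precedence only.
func Classify(s Support) Status {
	switch {
	case s.SupersededPremise:
		return PotentiallyStale
	case s.ConflictingValues:
		return Contested
	case s.AddedValues:
		return Enriched
	case s.Sources > 1:
		return Corroborated
	default:
		return Unverified
	}
}

func main() {
	// Mar 1: one note supports the signing-authority claim, nothing superseded.
	fmt.Println(Classify(Support{Sources: 1})) // UNVERIFIED
	// After Jun 1: same note, but the CEO premise has been superseded.
	fmt.Println(Classify(Support{Sources: 1, SupersededPremise: true})) // POTENTIALLY_STALE
}
```

The two calls use the same answer and the same direct source. Only the state of the support chain differs, which is why the status cannot be recovered from the answer alone.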
This is why answer-only evaluation misses the issue. If the benchmark only asked whether the system returned "yes" for the signing-authority question, many systems would look better than they are. The failure is visible only when the system must also reconstruct the evidence chain and the status that evidence should carry.
Yona, Geva, and Matias make a structurally parallel argument about parametric
models[8]: most factuality work has
expanded what a model knows, but reliability turns on the model's
awareness of what it knows — a per-answer faithful-uncertainty signal,
not aggregate calibration, and a third option beyond answer or abstain. The
same shape applies to retrieved memory. Where they ask whether a parametric
model can recognize the limits of its own knowledge, this benchmark asks
whether a memory interface can recognize when a stored claim's support
has changed. POTENTIALLY_STALE is the structural analogue of that third
option: not silence, not a confident answer, but an answer that arrives with a
faithful signal that its evidence has shifted.
What the Benchmark Tests
The benchmark grades each scenario on separate axes:
- Did the system return the right answer?
- Did it return the right citation or evidence chain?
- Did it return the right epistemic status?
- Did the system finish ingesting the corpus and become queryable in bounded time?
Failing any one of those axes is a failure for that scenario. The separation is the point: a system can be answer-correct and still fail because it cannot explain whether the answer is safe to use.
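The all-axes rule can be sketched in a few lines of Go. The `AxisResult` type and its method names are hypothetical and are not taken from the benchmark's scorer; the sketch only shows the conjunction the paragraph above describes.

```go
package main

import "fmt"

// AxisResult is a hypothetical per-scenario record: one boolean per axis.
type AxisResult struct {
	Answer   bool // returned the right answer
	Evidence bool // returned the right citation or evidence chain
	Status   bool // returned the right epistemic status
	Ingest   bool // finished ingesting and became queryable in bounded time
}

// FailedAxes lists the axes that did not pass, in a fixed order.
func (r AxisResult) FailedAxes() []string {
	var failed []string
	if !r.Answer {
		failed = append(failed, "answer")
	}
	if !r.Evidence {
		failed = append(failed, "evidence")
	}
	if !r.Status {
		failed = append(failed, "status")
	}
	if !r.Ingest {
		failed = append(failed, "ingest")
	}
	return failed
}

// Pass is true only when every axis passed.
func (r AxisResult) Pass() bool {
	return len(r.FailedAxes()) == 0
}

func main() {
	// Answer-correct, but the system reported the wrong status.
	r := AxisResult{Answer: true, Evidence: true, Status: false, Ingest: true}
	fmt.Println(r.Pass(), r.FailedAxes()) // false [status]
}
```

Keeping the failed axes as a list, rather than a single pass bit, preserves the separation the rubric is built around: the report shows which part of the contract broke.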
Recent long-context memory benchmarks like BEAM[7] exercise overlapping abilities — knowledge update, contradiction resolution, temporal reasoning — but score them on dialogue recall rather than requiring deterministic bitemporal reconstruction with audit-safe epistemic status and transitive dependency-closure checks.
What I Built
This project is not only a paper argument. I built:
- the benchmark corpus
- the deterministic scorer
- a reference implementation (a roughly fifty-line SQL store on )
- the wrapper interface for comparison systems
- the result runners
- the reproducibility bundle published with the artifact
The important constraint was that scoring should be structural rather than judged: expected answers are derived from the formal time-and-evidence rules and the declared evidence links, never by asking another model whether an answer sounds reasonable.
The next article is about why this happens. The short version is that many memory interfaces expose the answer but not enough of the dependency structure behind the answer. The stored evidence exists at one resolution, while the agent-facing retrieval interface exposes a lower-resolution view. The third article shows the read-path fix and where it stops working.
Benchmark Summary
| Task | Baseline | Proposed contract | Metric | Delta | Caveat |
|---|---|---|---|---|---|
| Temporal belief reconstruction | Answer-only memory | Answer + evidence + epistemic status | Four-axis scenario pass | Answer-only passes can still fail full rubric | Requires structural scoring, not LLM judging. |
| Time handling | Single “last updated” timestamp | Valid-time interval + transaction timestamp | Bitemporal validity predicate | Catches late corrections and retroactive facts | Boundary convention must be [start, end). |
| Evidence safety | Citation retrieval | Citation plus status (UNVERIFIED, CORROBORATED, CONTESTED, ENRICHED, POTENTIALLY_STALE) | Epistemic axis | Separates plausible answers from audit-safe answers | Status needs provenance data from the write path. |
Limitations
- The contract only works if the ingest path preserves sources, timestamps, and derivation links instead of flattening them away.
- Bitemporal storage does not by itself solve retrieval; lower-resolution interfaces can still hide the data.
- This series focuses on reliability/status under change, not semantic breadth or user-experience quality.
References
The full system list and per-row configuration detail live in the next article and the third article. The headline canonical references for the comparator systems:
- Rasmussen, P., Paliychuk, P., Beauvais, T., Ryan, J., & Chalef, D. (2025). Zep: A Temporal Knowledge Graph Architecture for Agent Memory. arXiv:2501.13956. Repo: github.com/getzep/graphiti.
- Chhikara, P., Khant, D., Aryan, S., Singh, T., & Yadav, D. (2025). Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory. arXiv:2504.19413. Site: mem0.ai.
- Marković, V., Obradović, L., Hajdu, L., & Pavlović, J. (2025). Optimizing the Interface Between Knowledge Graphs and LLMs for Complex Reasoning. arXiv:2505.24478. Repo: github.com/topoteretes/cognee.
- TencentDB Agent Memory Team (2026). TencentDB-Agent-Memory: Layered Local Memory for AI Agents. Repo: github.com/Tencent/TencentDB-Agent-Memory.
- Honcho (2026). Managed personal-memory service with a Dialectic Agent retrieval surface. Site: honcho.dev.
- Supermemory (2026). Memory graph with documented `update`/`extend`/`derive` relation labels. Site: supermemory.ai.
- Tavakoli, M., Salemi, A., Ye, C., Abdalla, M., Zamani, H., & Mitchell, J. R. (2026). Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs. ICLR 2026. arXiv:2510.27246. Repo: github.com/mohammadtavakoli78/BEAM.
- Yona, G., Geva, M., & Matias, Y. (2026). Hallucinations Undermine Trust; Metacognition is a Way Forward. arXiv:2605.01428.