Kai Hirota

Metacognition for Agent Memory: From Context to Understanding

Engram explores whether an agent memory can turn short-term information into a coherent, evolving understanding of a domain, while keeping answers grounded in the source material that shaped that understanding.

Most useful AI work is not a one-off lookup. The agent needs to absorb changing information, reconcile it with what it already knows, and answer from a coherent understanding of the domain.

That problem is easy to see inside a company: policies change, customer histories accumulate, product behavior shifts, and internal exceptions pile up. But it is not only a business problem. Any long-running agent has to turn many short-lived pieces of information into durable understanding.

This article is about that conversion. Raw context gives the model temporary access to documents. Retrieval returns relevant fragments. Engram asks whether a memory system can maintain a unified state as new information arrives, old claims are superseded, sources are retracted, and contradictions appear. The goal is not just to remember facts. The goal is to answer from a maintained understanding, while keeping the answer traced to the source claims that shaped it.

From information to understanding

A previous series asked whether a memory's read interface can see that a stored claim has gone stale: the answer is correct and the citation is right, yet it rests on a premise that was superseded several hops back. This article moves one level higher.

The question here is not only whether the memory can recall a fact. It is whether the memory can decide how that fact should be used. Is the claim current? Is it contradicted by another source? Did it come from a source that is now retracted? Should the agent use it, use it with a warning, or stop and ask for review?

That is where metacognition matters. Yona, Geva, and Matias describe metacognition as the control layer of a tool-using agent.[1] In Engram, that control layer lives in the memory. The memory does not just return facts. It asks what it currently believes, what source claims support that belief, and whether that support is stable enough to answer from.

What Engram maintains

Engram has three pieces that matter for this article.

A long-term state, not just retrieved snippets. Engram keeps active beliefs, superseded claims, conflicts, quarantined or retracted evidence, and the source claims each answer depends on. The agent should be able to answer from inside a maintained understanding, not from a fresh pile of fragments.

A reflective read path. Engram answers with one of four labels derived from the memory state:

  • ACT: the support is current and uncontested.
  • : the answer remains usable, but a premise was superseded.
  • ABSTAIN: a supporting source was retracted or quarantined.
  • NEED_REVIEW: sources conflict or coverage is incomplete.

The labels say whether the memory's current understanding is stable, stale-but-usable, unsupported, or unresolved.
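
To make the mapping from memory state to label concrete, here is a minimal sketch. It is not Engram's implementation: the `SupportState` fields, the precedence order among conditions, and the name `USE_WITH_WARNING` (a placeholder for the stale-but-usable label, whose own name is not given above) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Label(Enum):
    ACT = "ACT"
    # Placeholder name (assumption): stands in for the stale-but-usable label.
    USE_WITH_WARNING = "USE_WITH_WARNING"
    ABSTAIN = "ABSTAIN"
    NEED_REVIEW = "NEED_REVIEW"


@dataclass(frozen=True)
class SupportState:
    """Hypothetical summary of the memory state behind one answer."""
    source_retracted: bool = False     # a supporting source was retracted or quarantined
    sources_conflict: bool = False     # two source-backed claims disagree
    coverage_incomplete: bool = False  # some needed support is missing
    premise_superseded: bool = False   # a premise was replaced by a newer claim


def derive_label(state: SupportState) -> Label:
    """Derive a decision label from the state.

    The precedence (retraction, then conflict or gaps, then supersession)
    is an assumption of this sketch.
    """
    if state.source_retracted:
        return Label.ABSTAIN
    if state.sources_conflict or state.coverage_incomplete:
        return Label.NEED_REVIEW
    if state.premise_superseded:
        return Label.USE_WITH_WARNING
    return Label.ACT


print(derive_label(SupportState()).value)                         # ACT
print(derive_label(SupportState(premise_superseded=True)).value)  # USE_WITH_WARNING
print(derive_label(SupportState(source_retracted=True)).value)    # ABSTAIN
```

The point of the sketch is that the label is a function of the stored state, not of how the answer happens to be worded.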

Answers traced to source. The verifier is a discipline mechanism. It replays the decision from the cited source claims and checks that the label follows from the memory state. That matters because an agent should not only say what it believes. It should be able to show which ingested information caused that belief, and whether the supporting state still holds.
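
The replay idea can be sketched in a few lines. This is an illustration under stated assumptions, not Engram's verifier: the `Claim` and `Trace` types, the claim statuses, the rule that an unresolved citation fails, and the placeholder label `USE_WITH_WARNING` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    claim_id: str
    status: str  # assumed statuses: "active", "superseded", "retracted", "conflicted"


@dataclass(frozen=True)
class Trace:
    label: str                        # label the read path emitted
    cited_claim_ids: tuple[str, ...]  # source claims the answer says it rests on


def replay_label(claims: list[Claim]) -> str:
    """Re-derive the label from the cited claims alone (assumed precedence)."""
    statuses = {c.status for c in claims}
    if "retracted" in statuses:
        return "ABSTAIN"
    if "conflicted" in statuses:
        return "NEED_REVIEW"
    if "superseded" in statuses:
        return "USE_WITH_WARNING"  # placeholder name for the stale-but-usable label
    return "ACT"


def verify(trace: Trace, store: dict[str, Claim]) -> bool:
    """Pass only if every citation resolves and the replayed label matches."""
    if not trace.cited_claim_ids:
        return False
    if any(cid not in store for cid in trace.cited_claim_ids):
        return False
    cited = [store[cid] for cid in trace.cited_claim_ids]
    return replay_label(cited) == trace.label


store = {"c1": Claim("c1", "retracted"), "c2": Claim("c2", "active")}
honest = Trace(label="ABSTAIN", cited_claim_ids=("c1", "c2"))
relaxed = Trace(label="ACT", cited_claim_ids=("c1", "c2"))  # unsafe label relaxed to ACT

print(verify(honest, store))   # True
print(verify(relaxed, store))  # False
```

Because the check runs from the cited claims rather than from the answer text, a label that was loosened after the fact no longer matches what the state implies.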

Reflection under pressure

The first test is deliberately narrow. The memory state already contains a problem: a source was retracted, a premise was superseded, two claims conflict, a derivation rests on a broken premise, or a citation does not resolve. The question is whether the read interface uses that state when it answers.

Each threatened scenario has a clean twin. If a system misses the threat, it fails. If it over-flags the clean twin, it also fails. That pairing matters because a useful memory cannot simply refuse whenever history looks complicated.
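
The pairing rule can be written down directly. The sketch below assumes that a clean twin is expected to receive `ACT` and that labels are plain strings; the function name is hypothetical.

```python
def pair_passes(threat_label: str, clean_label: str, expected_threat_label: str) -> bool:
    """A pair passes only if the threat is caught AND the clean twin is not flagged."""
    caught_threat = threat_label == expected_threat_label
    clean_not_flagged = clean_label == "ACT"  # assumption: a clean twin should get ACT
    return caught_threat and clean_not_flagged


# A system that always refuses catches the threat but over-flags the clean twin.
print(pair_passes("ABSTAIN", "ABSTAIN", expected_threat_label="ABSTAIN"))  # False
# A system that always acts passes the clean twin but misses the threat.
print(pair_passes("ACT", "ACT", expected_threat_label="ABSTAIN"))          # False
# Only a system that discriminates between the twins passes.
print(pair_passes("ABSTAIN", "ACT", expected_threat_label="ABSTAIN"))      # True
```

Scoring the pair jointly is what rules out both degenerate strategies: always refusing and always answering.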

This is not a test of whether a system can discover stale information from the world. The threatened state is already labeled. The test isolates the metacognitive step: after short-term information has been ingested, can the memory decide whether that information is stable understanding, a warning, a contradiction, or a reason to stop?

The difference shows up at the interface. Take "is supplier X cleared?", where the clearance was retracted. A raw-context model can read the history, notice the retraction, and still answer "X is cleared, though the memo appears to have been retracted." The warning stays in prose. mem0 often fails the other way: on a clean control where X really is cleared, it retrieves too little and blocks a valid answer.
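
One way to picture the interface difference is a gate that reads a structured label and ignores the prose. This is a sketch only: the `Reading` type, the gate function, and the placeholder label `USE_WITH_WARNING` are assumptions, and the set of labels allowed to proceed is inferred from the label descriptions above rather than taken from Engram's code.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Assumption: ACT and the stale-but-usable label may proceed; the others are gated.
# USE_WITH_WARNING is a placeholder name for the stale-but-usable label.
PROCEED_LABELS = frozenset({"ACT", "USE_WITH_WARNING"})


@dataclass(frozen=True)
class Reading:
    answer: str  # free-text answer, which may carry a caveat in prose
    label: str   # structured decision label derived from the memory state


def label_gate(reading: Reading, action: Callable[[], str]) -> Optional[str]:
    """Run the action only when the label allows it; the prose is never parsed."""
    if reading.label not in PROCEED_LABELS:
        return None
    return action()


retracted = Reading(
    answer="X is cleared, though the memo appears to have been retracted.",
    label="ABSTAIN",
)
clean = Reading(answer="X is cleared.", label="ACT")

print(label_gate(retracted, lambda: "order placed"))  # None
print(label_gate(clean, lambda: "order placed"))      # order placed
```

The retracted case is blocked even though its answer text still begins with "X is cleared", which is exactly the case where a caveat left in prose would be easy to act past.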

Action outcomes

Did the unsafe action fire? Was a still-valid action blocked?

System                    Gate             Silent-unsafe action    Over-block (utility tax)
raw-context               prose-gate       47/120 (0.39)           38/180 (0.21)
mem0                      prose-gate       6/120 (0.05)            113/180 (0.63)
Engram (label-gate)       label-gate       0/120                   0/180
Engram (verifier-gate)    verifier-gate    0/120                   0/180

The 300 scenarios repartition by their ground-truth decision: 120 must be gated (silent-unsafe denominator) and 180 should proceed, 30 of them with a warning (over-block denominator). The prose-gated comparators fail in opposite directions: raw-context fires unsafe actions (47/120), mem0 over-blocks still-valid ones (113/180). Engram executes zero unsafe actions at zero utility cost. Verifier-gate equals label-gate here because decision correctness is 300/300, so no permissive-but-unsound case arises; its value is the soundness guarantee of the falsification test.
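
The two rates are plain ratios over the two denominators, which the following sketch recomputes from the counts reported above. The variable and function names are illustrative.

```python
MUST_GATE = 120       # scenarios whose ground truth says the action must be gated
SHOULD_PROCEED = 180  # scenarios whose ground truth says the action should proceed

# (silent-unsafe count, over-block count) per system, as reported above.
COUNTS = {
    "raw-context": (47, 38),
    "mem0": (6, 113),
    "Engram (label-gate)": (0, 0),
    "Engram (verifier-gate)": (0, 0),
}


def rates(silent_unsafe: int, over_block: int) -> tuple[float, float]:
    """Return (silent-unsafe rate, over-block rate), rounded to two decimals."""
    return (
        round(silent_unsafe / MUST_GATE, 2),
        round(over_block / SHOULD_PROCEED, 2),
    )


assert MUST_GATE + SHOULD_PROCEED == 300

for system, (unsafe, blocked) in COUNTS.items():
    unsafe_rate, block_rate = rates(unsafe, blocked)
    print(f"{system}: silent-unsafe {unsafe_rate:.2f}, over-block {block_rate:.2f}")
```

Keeping the two denominators separate matters: a system can drive one rate toward zero simply by inflating the other, which is the pattern the two prose-gated rows show.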

Reflective memory examples

How the answers differ by memory threat.

This superseded-premise scenario has a usable successor value, but it should not read as clean memory. The key difference is whether the system carries the stale predecessor into the answer.

Which office is Raj Patel based in?

Archived value

"Raj Patel's office was Seattle."

Successor value

"Raj Patel's office is Austin, but the memory state carries a warning path."

Expected: Use Austin only with a warning that the answer rests on a superseded predecessor.
Engram
caught
"The previously recorded office is archived and must not be relied upon. The replacement claim is Austin, but it requires a warning before use."

Keeps the successor and the stale predecessor in the same decision.

raw-context
caught
"The most recent record indicates Austin, though I cannot confirm with certainty that these records pertain to Raj Patel."

Surfaces a caveat in prose.

mem0
missed
"No memories were retrieved that contain information about Raj Patel or their office location."

Misses both the stale value and the successor.

Engram reads the flagged state through its reflective interface, emits a label tied to source claims, and lets a separate replay verifier re-derive that label offline. It passes 270 of 300 traces and rejects 100% of traces broken on purpose, including an unsafe label quietly relaxed to ACT.

Read those counts as an interface contrast, not as a deployment score. The corpus is synthetic and built around Engram's own representation. The useful question is narrower: does the interface turn stored information into disciplined understanding, or does it leave the warning in prose, miss the evidence, or block too broadly?

Does understanding survive scale? BEAM-100K

The second test asks whether this maintained state still works for ordinary long-horizon QA. A memory system that reflects well but loses usefulness would not be very interesting.

BEAM-100K is useful because it is built from long, 100K-token conversations and scores different memory abilities. It is not a complete benchmark for domain understanding, but the breakdown is relevant: contradiction resolution, temporal reasoning, knowledge update, abstention, information extraction, and preference following are all pieces of whether understanding survives a growing history.

With the same answering model for the controlled Engram, raw-context, and mem0 OSS runs, Engram reaches a rubric mean of 0.696, compared with raw-context at 0.665 and mem0 OSS at 0.662. The small overall lead is not the main point. At 100K tokens the whole history still fits in a prompt, so a model reading all of it is close to the ceiling. Matching that baseline while working from a compressed, source-traced memory state is what matters.

Engram also leads on the abilities closest to evolving understanding: contradiction resolution and temporal reasoning. Those scores depend on tracking conflict, supersession, and time, not just producing fluent answers. On knowledge update, mem0's answer-style reconciliation edges it in this BEAM run, a gap that closes on the cleaner LongMemEval test below.

The figures include mem0 OSS as the local controlled memory baseline. They also include published BEAM-100K results from Hindsight and Honcho as external context, not as a controlled leaderboard. Honcho reports 0.630 on BEAM-100K using its own benchmark harness and model stack, and publishes results across larger BEAM scales where context-stuffing is no longer possible.[5] The fair reading is not "Engram beats Honcho" or the reverse. mem0 OSS and Honcho are strong memory-retrieval comparators in different comparison modes. Engram is testing a narrower mechanism: reflective belief revision over a state that can be traced back to source claims.

BEAM-100K

The shape of utility across the abilities that matter.

All ten BEAM abilities; the four evolving-understanding abilities Engram targets are bold.

[Radar chart: rubric mean per system on a 0 to 1 scale. Axes: Preference Following, Information Extraction, Abstention, Contradiction Resolution, Temporal Reasoning, Knowledge Update, Instruction Following, Multi-Session Reasoning, Event Ordering, Summarization.]
Rubric mean (0–1) for all ten abilities; the avg figures are over all ten. The four bold abilities (abstention, supersession, conflict, valid-time) are where evolving understanding is easiest to inspect: they ask whether a memory can know when to stop, update, reconcile, and time-scope what it knows. Across those four slices Engram has the highest average and leads contradiction resolution and temporal reasoning outright, while Hindsight's frontier answer model wins the recall and generation abilities and the overall. Engram and mem0 OSS are local controlled runs on the same harness. Engram: engram-l5-evolve, Claude Sonnet 4.6 agent, gpt-4.1-mini judge (299/400 passed across all ten). mem0 OSS: local SDK all-ability run, Claude Sonnet 4.6 agent, gpt-4.1-mini judge. Hindsight and Honcho are each system's own published BEAM-100K numbers: Hindsight's come from its benchmark site (single-query, gemini-3.1-pro-preview answer model), Honcho's from Plastic Labs' benchmark post and data snapshot. Answer models, judges, and harnesses differ per series, so this is indicative context, not a controlled leaderboard. The view scopes the chart to abilities Engram is built for, where tracking supersession and conflict, not raw answer quality, is what scores.

BEAM-100K · by ability

Each ability, system by system.

The four abilities this work targets: knowledge-update, contradiction-resolution, temporal-reasoning, and abstention.

Ability                     Engram    mem0 OSS    Hindsight    Honcho
abstention                  0.812     0.813       0.975        0.363
contradiction resolution    0.750     0.709       0.616        0.706
temporal reasoning          0.781     0.550       0.575        0.644
knowledge update            0.688     0.763       0.588        0.463
avg                         0.758     0.709       0.689        0.544

The same BEAM-100K rubric means as the radar, but scoped to the four abilities closest to evolving understanding: belief revision (supersession, conflict, valid-time) plus abstention. Each chart shows one ability so you can read down its systems (avg figures are over these four). Engram and mem0 OSS are local controlled runs on the same harness. Hindsight and Honcho are each system's own published BEAM-100K numbers on different model stacks and harnesses, so those rows are indicative context, not a controlled leaderboard.
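
The four-ability averages can be recomputed from the per-ability scores. The sketch below uses `Decimal` with half-up rounding so that a value such as 0.6885 rounds the way the reported figures do; the dictionary layout and function name are illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP

# Per-ability rubric means in the order: abstention, contradiction resolution,
# temporal reasoning, knowledge update (values as reported above).
SCORES = {
    "Engram": ["0.812", "0.750", "0.781", "0.688"],
    "mem0 OSS": ["0.813", "0.709", "0.550", "0.763"],
    "Hindsight": ["0.975", "0.616", "0.575", "0.588"],
    "Honcho": ["0.363", "0.706", "0.644", "0.463"],
}


def four_ability_mean(values: list[str]) -> Decimal:
    """Exact mean of the scores, rounded half-up to three decimals."""
    total = sum(Decimal(v) for v in values)
    return (total / len(values)).quantize(Decimal("0.001"), rounding=ROUND_HALF_UP)


for system, values in SCORES.items():
    print(f"{system}: {four_ability_mean(values)}")
# Engram: 0.758
# mem0 OSS: 0.709
# Hindsight: 0.689
# Honcho: 0.544
```

Note that the average hides the per-ability shape: Engram's lead comes from contradiction resolution and temporal reasoning, while other systems score higher on abstention and knowledge update.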

BEAM-100K examples

How the answers differ by ability.

The question asks for a yes or no, but the useful behavior is preserving the conflict. Engram and raw-context show both claims; mem0 collapses to a confident yes.

Have I submitted my cover letter to Greg for review?

Earlier source

"I've submitted my cover letter to Greg for review on April 10 and I'm anxious to get his detailed feedback by April 14."

Later source

"I've never submitted my cover letter to Greg for review. Should I ask him to take a look before I finalize it?"

Gold: I notice you've mentioned contradictory information about this. You said you have submitted your cover letter to Greg for review, but you also mentioned that you have never submitted it. Which statement is correct?
Engram
pass
"There are two conflicting signals: one says the cover letter was submitted to Greg on April 10, while another says it had not yet been submitted."

Shows the conflict, then resolves toward the dated submission record.

raw-context
pass
"There are contradictory statements: one turn says it was submitted on April 10, while a later turn says it was never submitted."

Mostly catches the conflict from the transcript.

mem0
miss
"Yes, according to the memories, Darryl submitted his cover letter to Greg for review on April 10, 2024."

Answers confidently without preserving the contradiction.

LongMemEval gives a cleaner look at knowledge update. In that category, the updates are clear rather than buried in answer phrasing. Engram scores 0.936, ahead of the local raw-context row at 0.885 and local mem0 row at 0.782. The maintained state does not cost general QA, and where belief revision is the explicit task it is the strongest of the three local rows.

The published systems are useful context. Honcho reports 0.949 on the same LongMemEval knowledge-update category. mem0 reports 0.936 for its newer published system.[6] Zep also reports 0.936 on that category.[7] The right read is that Engram is in the same range as strong published memory systems while keeping the reflective state and source-trace interface in view.

LongMemEval / knowledge update

Clean updates favor the maintained state.

Accuracy on the LongMemEval knowledge-update slice, N=78, judged by gpt-4o. Engram M1 tests the evolving local state. Honcho, mem0 OSS, and Zep are published results from their own benchmark reports, included as external context.

What this implies

Engram is not mainly a citation system, and it is not mainly a safety layer. It is a memory layer trying to move an agent from short-term information access toward durable understanding.

The metacognition idea moves from inside the model out to the memory layer. The memory can reflect on what it believes, whether that belief is current, whether it conflicts with other source-backed claims, and whether the answer should be used, warned about, or held for review.

The source traces and verifier matter because they discipline that understanding. They keep the agent answering from the maintained state rather than from a fresh fragment or a fluent reconstruction. On the controlled probe, Engram regulates stale, conflicted, and unsupported memory states while holding its own on ordinary long-horizon QA.

That is the core claim: memory should not just retrieve information. It should help the agent convert information into a coherent understanding that can keep changing without losing its shape.

References

  1. Yona, G., Geva, M., & Matias, Y. (2026). Hallucinations Undermine Trust; Metacognition is a Way Forward. arXiv:2605.01428.
  2. Wu, D., et al. (2024). LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory. arXiv:2410.10813.
  3. Tavakoli, M., et al. (2026). Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs. ICLR 2026. arXiv:2510.27246.
  4. Latimer, C., et al. (2025). Hindsight Is 20/20: Building Agent Memory That Retains, Recalls, and Reflects. arXiv:2512.12818.
  5. McCormick, B., & Leer, C. (2025). Benchmarking Honcho. Plastic Labs. plasticlabs.ai/blog/research/Benchmarking-Honcho. Data: github.com/plastic-labs/honcho-benchmarks/tree/main/a1d689b.
  6. Mem0. Benchmarking Mem0's token-efficient memory algorithm. mem0.ai/research.
  7. Zep. Benchmarks for agent memory. getzep.com/research.