The memory is right.
The world has changed.
First surfaced in Tandemly Briefing — 2026-05-12.
Researchers built a 1,200-query benchmark to test something most agent evaluations skip entirely: whether an LLM agent can tell when its stored beliefs have been silently invalidated by later context. The best frontier model scored 55.2% across the full test. The dominant failure mode is not forgetting. It is continuing to act on information that was once correct but is no longer.
Agents remember but don't check.
Memory in production agents is usually evaluated for accuracy at storage time. Whether stored information is still accurate at retrieval time is a different question, and most evaluation frameworks skip it entirely.
An agent that helps you manage your calendar knows, at some point, that you prefer morning meetings. It stored that fact. It will use that fact the next time it schedules something for you. What it probably won't do is ask whether that preference still holds.
This is a structural gap, not a model capability gap. The research community has built excellent benchmarks for testing whether agents can store and retrieve information correctly. It has not built equally good tools for testing whether agents can detect when stored information has expired. The STALE benchmark is an attempt to fill that gap.
The researchers distinguish between two ways a memory can become invalid. The first is explicit conflict: a later message directly contradicts the old belief. "I moved my office to Building B" explicitly updates the stored belief that your office is in Building A. A model that can do basic text comparison should catch this.
The second is implicit conflict: the old belief becomes wrong not because anything says so, but because later context implies it. If a new company policy is introduced (for example, one that changes which days people are in the office), your stored preference for in-person Friday meetings may be outdated, but nothing in the new policy mentions that preference. The model has to infer it. This inference is where production agents consistently fail.
Can frontier LLMs, operating as agents with access to a memory store, reliably detect when a stored belief has been silently invalidated by later information? And can they act appropriately on that detection: updating their behavior, resisting questions that presuppose the stale state, and refusing to propagate outdated facts?
Three ways a memory can fail you.
STALE is a 1,200-query benchmark that probes three distinct failure modes of agent memory, each corresponding to a different way staleness surfaces in practice.
The benchmark was designed around a clean three-axis structure. Each axis targets a different cognitive demand on the agent. Can it notice that a stored belief has gone stale? Can it refuse a question that presupposes the stale state (premise resistance)? Can it adapt its behavior without being asked (implicit policy adaptation)? Together, the three axes cover what an agent needs to do correctly to handle stale memory well.
The key methodological choice was separating implicit from explicit conflict within each axis. Explicit conflicts are invalidations that require only surface-level matching: the new text directly contradicts the stored fact. Implicit conflicts require multi-step reasoning: the agent has to infer that the stored fact is now wrong from context that never says so plainly. This distinction is what gives the benchmark its diagnostic sharpness.
In an explicit conflict, the invalidating information lands in a way a keyword search could catch. In an implicit conflict, the invalidation is distributed across context. The model has to hold the stored belief, recognize that the new context changes its truth conditions, and draw the conclusion that the belief no longer applies. Each of those steps is individually straightforward. Together, and across arbitrary content domains, they form a reliable failure surface for today's frontier models.
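To make the contrast concrete, here is a minimal sketch of the kind of keyword check that can flag an explicit conflict but has nothing to match on in an implicit one. It is not the benchmark's method: the function names, the stopword list, and the policy sentence are all hypothetical, and the check only detects topic overlap, not actual contradiction.

```python
import re

# Tiny, hypothetical stopword list; a real system would use a proper one.
STOPWORDS = {"i", "my", "to", "is", "in", "the", "a", "of", "for", "on"}


def tokens(text: str) -> set[str]:
    """Lowercased word tokens with stopwords removed."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def keyword_conflict(belief: str, message: str, min_overlap: int = 1) -> bool:
    """Flag a *candidate* conflict when the message shares topic words with the belief.

    This only says "the message talks about the same thing"; it cannot tell
    whether the message actually contradicts the belief.
    """
    return len(tokens(belief) & tokens(message)) >= min_overlap


# Explicit conflict: the update reuses the belief's own vocabulary.
explicit_hit = keyword_conflict(
    "Office is in Building A",
    "I moved my office to Building B",
)

# Implicit conflict (hypothetical policy text): no shared vocabulary,
# so surface matching has nothing to grab onto.
implicit_hit = keyword_conflict(
    "Prefers in-person meetings on Fridays",
    "New policy: the building closes at the end of each week",
)
```

In this toy case `explicit_hit` is true and `implicit_hit` is false, even though the second message undermines the stored preference just as much as the first undermines the office location. Closing that gap takes inference over what the new context implies, which is exactly the step the benchmark isolates.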
55.2% is the ceiling.
The best frontier model evaluated reaches 55.2% accuracy across the full STALE benchmark. Most of the failures cluster at the same place: implicit conflict, where the invalidation is never stated directly.
Frontier models do not struggle to retrieve stored information. They struggle to notice when that information has been undermined by what came after. The gap between explicit and implicit conflict performance is where almost all the benchmark's explanatory power lives. A model that handles explicit conflicts but misses implicit ones is a model that will maintain silently outdated state in any context that doesn't spell out the update.
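For builders adapting the benchmark as an eval template, a per-cell breakdown is one way to see this gap in their own agents. The sketch below assumes each graded query carries an axis label and a conflict type; the `Result` record, the axis labels, and the toy rows are illustrative assumptions, not the paper's data format or its results.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    axis: str      # hypothetical labels, e.g. "detect", "resist", "adapt"
    conflict: str  # "explicit" or "implicit"
    correct: bool


def accuracy_by_cell(results):
    """Accuracy for every (axis, conflict type) pair present in the results."""
    totals, hits = defaultdict(int), defaultdict(int)
    for r in results:
        key = (r.axis, r.conflict)
        totals[key] += 1
        hits[key] += int(r.correct)
    return {key: hits[key] / totals[key] for key in totals}


def explicit_implicit_gap(results):
    """Explicit accuracy minus implicit accuracy, pooled over all axes."""
    by_type = defaultdict(list)
    for r in results:
        by_type[r.conflict].append(r.correct)
    acc = {k: sum(v) / len(v) for k, v in by_type.items()}
    return acc["explicit"] - acc["implicit"]


# Made-up rows purely to exercise the functions; not benchmark results.
toy = [
    Result("detect", "explicit", True),
    Result("detect", "explicit", True),
    Result("detect", "implicit", True),
    Result("detect", "implicit", False),
    Result("adapt", "explicit", True),
    Result("adapt", "implicit", False),
]
cells = accuracy_by_cell(toy)
gap = explicit_implicit_gap(toy)
```

Reporting the gap alongside the headline number keeps a strong explicit-conflict score from hiding weak implicit-conflict performance.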
It is not enough for an agent to detect staleness internally. When a user asks a question that presupposes the old state is still true, the agent also needs to refuse to answer on that premise. The benchmark's second axis tests this separately because detection and refusal are not the same skill. An agent can correctly register that a belief is outdated and still answer as though it were current, because the surface form of the question pulls the model toward compliance. Production agents that don't explicitly gate on staleness detection before generating responses will fail premise resistance even when they detect the conflict.
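One way to read "explicitly gate on staleness detection" is a wrapper that runs the detector first and only lets generation proceed on a stored fact if the check passes. The sketch below is an assumption about how such a gate could look, not something the paper specifies: `Memory`, `answer_with_gate`, and both toy callables are hypothetical, and a real detector would be a model call rather than a keyword test.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Memory:
    fact: str
    flagged_stale: bool = False


def answer_with_gate(
    question: str,
    memory: Memory,
    recent_context: Sequence[str],
    is_stale: Callable[[str, Sequence[str]], bool],
    generate: Callable[[str, str], str],
) -> str:
    """Check staleness before answering; challenge the premise if the check fires."""
    if is_stale(memory.fact, recent_context):
        memory.flagged_stale = True
        return (
            f"My stored note says {memory.fact!r}, but recent context suggests "
            "that may have changed. Can you confirm before I proceed?"
        )
    return generate(question, memory.fact)


def toy_detector(fact: str, context: Sequence[str]) -> bool:
    # Placeholder: stands in for a real (likely model-based) staleness check.
    return any("moved" in msg.lower() for msg in context)


def toy_generate(question: str, fact: str) -> str:
    # Placeholder for the agent's normal response path.
    return f"Based on my notes ({fact}): answering {question!r}"


mem = Memory("Office is in Building A")
reply = answer_with_gate(
    "Which floor is my office on in Building A?",
    mem,
    ["I moved my office to Building B"],
    toy_detector,
    toy_generate,
)
```

The point of the design is ordering: the detection result decides which code path runs, so the surface form of the question cannot pull the response path into answering on the stale premise.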
The third axis, implicit policy adaptation, requires the agent not only to detect that a change has occurred but also to work out what to do differently without being asked. This is the hardest of the three and where frontier models score lowest. It is also the most practically important: most real-world staleness isn't surfaced by a user question. It surfaces when an agent acts on a belief it should have updated, and no one catches the error until the output is already downstream.
The benchmark is designed for controlled evaluation. It cannot fully replicate the diversity of implicit conflicts that arise in production systems, where context windows are longer, memory stores are larger, and invalidation signals can be buried in conversational history. The authors position STALE as an eval template for builders to adapt, not a definitive characterization of the full problem space.
What builders need to do now.
The action gap here is large. The benchmark exists precisely because production agents are currently running on memories they never check for validity. The paper is designed to be immediately adoptable as an eval template.
Where to go from here.
Concrete steps for acting on this research.