Memory for the model
no retraining required.
Researchers at declare-lab built a compact online memory mechanism that plugs into any frozen LLM and compresses past context into a tiny state matrix, updated token by token using a learning rule borrowed from classical neural memory theory. On long-context agent benchmarks, it outperforms both the base model and stronger retrieval baselines. The backbone stays frozen throughout.
First surfaced in Tandemly Briefing — 2026-05-23.
Context windows end.
Conversations don't.
Every LLM has a hard limit on how much it can see at once. The common workarounds each carry costs that compound at scale.
A language model's context window is one of the most important numbers in its specification. It determines how much of a conversation, document, or task history the model can hold in mind while generating a response. When that window fills up, early information falls out. In an agent context, that can mean the model forgets which tool it called three steps ago, what the user originally asked for, or what constraints it was supposed to respect.
The standard fix is retrieval-augmented generation, or RAG: store text in an index, and retrieve relevant chunks when they're needed. RAG works well when the right chunks are easy to identify and the query is specific enough to find them. It works less well when memory needs are diffuse, when context clues are subtle, or when the latency and infrastructure cost of maintaining an external index is prohibitive.
The other fix is to just extend the context window by training the model on longer sequences. That's expensive, it requires access to the base model's weights, and even extended-context models often fail to attend reliably to very early input once the window grows long. A 128k-token context window doesn't mean the model uses all 128k tokens equally well.
There's a third path, less explored in recent large-model work: learned associative memory. Instead of storing text verbatim or training a new model, compress what has been seen into a compact, updateable state. The challenge has always been integrating this idea into a transformer without retraining the entire backbone from scratch.
Can a small trainable module added to a frozen transformer provide useful online memory, updating in real time at inference, without touching the backbone weights? And how much does it actually help on tasks where memory matters?
A tiny state matrix
updates every token.
The mechanism has two moving parts: a write operation that compresses incoming context into a fixed-size state, and a read operation that corrects the model's attention using what's been stored. The backbone never changes.
The memory state is a small matrix, just 8 by 8 in the default configuration. As each new token arrives, the module computes how much the current input differs from what the state already predicts for it, and adjusts the state to close that gap. This is the delta rule: an update proportional to the prediction error. It's the same error-correcting principle used in classical associative memory models in the Hopfield tradition and in more recent linear attention models, applied here at the granularity of individual transformer decode steps.
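To make the write step concrete, here is a minimal NumPy sketch of a generic delta-rule update on an 8 by 8 state. The names `delta_write`, `k`, `v`, and `beta` are illustrative assumptions, not identifiers from the paper, and the real module derives its keys and values from transformer activations rather than random vectors.

```python
import numpy as np

def delta_write(S, k, v, beta=0.5):
    """One generic delta-rule write (illustrative, not the paper's code).

    S    : (d, d) memory state
    k, v : (d,) key and value derived from the current token (assumed)
    beta : write strength, an assumed hyperparameter
    """
    pred = S @ k                 # what the state currently recalls for this key
    err = v - pred               # prediction error
    return S + beta * np.outer(err, k)   # update proportional to the error

rng = np.random.default_rng(0)
d = 8
S = np.zeros((d, d))             # fixed-size state: 8 x 8
k = rng.normal(size=d)
k /= np.linalg.norm(k)           # unit-norm key keeps the update well scaled
v = rng.normal(size=d)

err_before = np.linalg.norm(v - S @ k)
S = delta_write(S, k, v)
err_after = np.linalg.norm(v - S @ k)
```

With a unit-norm key, a single write shrinks the recall error for that key by a factor of `1 - beta`, while the state keeps its shape no matter how many writes are applied.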
Reading from the state is the other half. At each decode step, the module generates a low-rank correction to the model's self-attention computation, the part of the network that determines what information the model retrieves from its own activations. The correction is applied via LoRA-style adapters of rank 8 on the Q and O projections. The frozen backbone sees a slightly different attention landscape, biased toward whatever the memory state has accumulated.
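The read path can be sketched as a frozen projection plus a rank-8 correction whose strength depends on the memory state. Everything below is an assumption for illustration: `d_model`, the gating through `W_read`, and the function name `corrected_q` are hypothetical, and the paper's actual readout may be wired differently. Only the rank of 8, the 8 by 8 state, and the frozen backbone come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, rank, d_state = 64, 8, 8    # d_model is illustrative only

W_q = rng.normal(size=(d_model, d_model))        # frozen backbone Q projection
A = rng.normal(size=(rank, d_model)) * 0.1       # trained adapter, down-projection
B = rng.normal(size=(d_model, rank)) * 0.1       # trained adapter, up-projection
# Hypothetical readout: maps the flattened state to one gate per adapter rank.
W_read = rng.normal(size=(rank, d_state * d_state)) * 0.1

def corrected_q(x, S):
    """Frozen projection plus a state-dependent low-rank correction."""
    gate = np.tanh(W_read @ S.ravel())   # (rank,), driven by the memory state
    base = W_q @ x                       # backbone path, weights never modified
    delta = B @ (gate * (A @ x))         # correction of rank at most 8
    return base + delta

x = rng.normal(size=d_model)
S_empty = np.zeros((d_state, d_state))
S_full = rng.normal(size=(d_state, d_state))

q_empty = corrected_q(x, S_empty)   # empty memory: no correction applied
q_full = corrected_q(x, S_full)     # populated memory: query is shifted
```

In this sketch an empty state leaves the backbone output untouched, which is one simple way to guarantee the adapter cannot change behavior before anything has been written.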
Three write strategies were tested. TSW updates the state one token at a time. SSW summarizes at the segment level. MSW uses multiple write steps per segment. All three keep the state fixed-size regardless of how much input has been processed. Only the adapter weights are trained; a supervised fine-tuning pass teaches them to write and read in ways that help the downstream tasks. Once trained, the adapter runs at inference time with no further updates to itself: it's the state matrix that updates online, not the adapter parameters.
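The difference between the three strategies is mostly a matter of write schedule. The sketch below is one hedged reading of the descriptions above: per-token writes for TSW, one write per segment on a mean-pooled summary for SSW, and several repeated writes per segment for MSW. The mean pooling, the repeat count, and all function names are assumptions, not details confirmed by the paper.

```python
import numpy as np

def delta_write(S, k, v, beta=0.5):
    # Generic delta-rule update; returns a new state of the same shape.
    return S + beta * np.outer(v - S @ k, k)

def segments(x, seg_len):
    # Split a (T, d) array into consecutive chunks of seg_len rows.
    return [x[i:i + seg_len] for i in range(0, len(x), seg_len)]

def write_per_token(S, K, V):
    """TSW-like schedule: one write per token."""
    n_writes = 0
    for k, v in zip(K, V):
        S = delta_write(S, k, v)
        n_writes += 1
    return S, n_writes

def write_per_segment(S, K, V, seg_len, steps=1):
    """SSW-like (steps=1) or MSW-like (steps>1) schedule.

    Each segment is summarized by mean pooling (an assumption), then
    written `steps` times.
    """
    n_writes = 0
    for Ks, Vs in zip(segments(K, seg_len), segments(V, seg_len)):
        k, v = Ks.mean(axis=0), Vs.mean(axis=0)
        for _ in range(steps):
            S = delta_write(S, k, v)
            n_writes += 1
    return S, n_writes

rng = np.random.default_rng(0)
T, d = 32, 8
K = rng.normal(size=(T, d))
K /= np.linalg.norm(K, axis=1, keepdims=True)   # unit-norm keys
V = rng.normal(size=(T, d))
S0 = np.zeros((d, d))

S_tsw, n_tsw = write_per_token(S0, K, V)
S_ssw, n_ssw = write_per_segment(S0, K, V, seg_len=8)
S_msw, n_msw = write_per_segment(S0, K, V, seg_len=8, steps=3)
```

For 32 tokens and segments of 8, the schedules perform 32, 4, and 12 writes respectively, and every one of them ends with an 8 by 8 state.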
Online means the state updates at inference time as tokens arrive, not in a preprocessing step before the model sees the input. There's no separate retrieval call and no index to build. The memory accumulates continuously as the model reads, and corrects attention continuously as the model generates. The closest analogy is a running summary that rewrites itself in place rather than appending to a log.
Memory-heavy benchmarks
show the largest gains.
Results are consistent across three model families. The gains are largest on tasks that require remembering facts across long spans of context, and smallest on general-purpose tasks where the backbone was already adequate.
The two benchmarks where δ-mem, the memory module described above, improved scores most are the two most demanding of sustained context recall. MemoryAgentBench requires an agent to use information from earlier in the task to complete later steps correctly. LoCoMo tests long-term conversational memory across many turns. On both, the 8×8 state provided enough compressed signal to move performance meaningfully. On general-purpose tasks like HotpotQA, IFEval, and GPQA Diamond, scores stayed close to the backbone baseline: the adapter didn't hurt, but there wasn't much for it to do.
An 8 by 8 matrix has 64 entries. The fact that this is enough to provide a 31% lift on MemoryAgentBench is the result worth sitting with. The state doesn't store text; it stores a compressed representation learned by the adapter to be useful for the attention correction. Whether that representation generalizes well to tasks outside the training distribution is an open question, but the results on held-out benchmarks suggest at least some transferability.
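For a sense of scale, the arithmetic can be written out. The sketch assumes the state is stored in bf16 (2 bytes per entry) and compares it with a key/value cache for a hypothetical backbone shape. The layer count, KV head count, and head dimension are illustrative placeholders, not figures from the paper, and the text does not say whether the module keeps one state or several, so multiply the state size accordingly if it keeps one per layer or head.

```python
STATE_ROWS, STATE_COLS = 8, 8
BYTES_PER_BF16 = 2                      # assumes bf16 storage for the state

state_entries = STATE_ROWS * STATE_COLS         # 64 entries
state_bytes = state_entries * BYTES_PER_BF16    # constant, independent of input length

# Hypothetical backbone shape, for comparison only (not from the paper).
n_layers, n_kv_heads, head_dim = 36, 8, 128

# Keys and values (factor of 2) cached for every layer and KV head, per token.
kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * BYTES_PER_BF16

def kv_cache_bytes(n_tokens):
    """KV cache size grows linearly with the number of tokens kept in context."""
    return n_tokens * kv_bytes_per_token

for n_tokens in (1_000, 128_000):
    print(n_tokens, "tokens:", kv_cache_bytes(n_tokens), "bytes of KV cache vs",
          state_bytes, "bytes of memory state")
```

The point is the growth rate rather than the absolute numbers: the cache scales with every token kept in context, while a single state stays the same size throughout.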
Experiments ran on Qwen3-4B, Qwen3-8B, and SmolLM3-3B. The direction of improvement held across all three, though the magnitude varied. This matters: it's evidence that the mechanism isn't doing something model-specific that happens to work on one architecture. The same training procedure yields a usable write and read strategy on each backbone tested, across different sizes and model families.
The benchmarks tested are specialized for memory tasks. Gains on production agentic workloads would need separate validation. The approach requires GPU hardware with bf16 support and FlashAttention: CPU-only inference is not supported. The adapter itself requires a supervised fine-tuning pass on memory-relevant data, so the "no retraining" claim applies to the backbone specifically. The state size was fixed at 8×8 for evaluation; different configurations may perform differently. The paper does not test adversarial inputs or settings where the delta-rule updates might be exploited or destabilized.
What this means
for builders.
The mechanism is at an early stage, but the framing is immediately useful. Online associative memory is a distinct architectural option from RAG and context extension, with its own tradeoff profile.
Where to go
from here.
If you want to go deeper on the mechanism or run the benchmarks yourself.