Agent Memory · Long-Context LLMs

Memory for the model
no retraining required.

Researchers at declare-lab built a compact online memory mechanism that plugs into any frozen LLM and compresses past context into a tiny state matrix, updated token by token using a learning rule borrowed from classical neural memory theory. On long-context agent benchmarks, it outperforms both the base model and stronger retrieval baselines. The backbone stays frozen throughout.

Core concept

Online memory: instead of storing raw text or building an index, compress past context into a fixed-size state that continuously updates at inference time and corrects attention at each decode step.

First surfaced in Tandemly Briefing — 2026-05-23.

01 The problem

Context windows end.
Conversations don't.

Every LLM has a hard limit on how much it can see at once. The common workarounds each carry costs that compound at scale.

A language model's context window is one of the most important numbers in its specification. It determines how much of a conversation, document, or task history the model can hold in mind while generating a response. When that window fills up, early information falls out. In an agent setting, that can mean the model forgets which tool it called three steps ago, what the user originally asked for, or which constraints it was supposed to respect.

The standard fix is retrieval-augmented generation, or RAG: store text in an index, and retrieve relevant chunks when they're needed. RAG works well when the right chunks are easy to identify and the query is specific enough to find them. It works less well when memory needs are diffuse, when context clues are subtle, or when the latency and infrastructure cost of maintaining an external index is prohibitive.

The other fix is to just extend the context window by training the model on longer sequences. That's expensive, it requires access to the base model's weights, and even extended-context models often fail to attend reliably to very early input once the window grows long. A 128k-token context window doesn't mean the model uses all 128k tokens equally well.

There's a third path, less explored in recent large-model work: learned associative memory. Instead of storing text verbatim or training a new model, compress what has been seen into a compact, updateable state. The challenge has always been integrating this idea into a transformer without retraining the entire backbone from scratch.

The question this paper asks

Can a small trainable module added to a frozen transformer provide useful online memory, updating in real time at inference, without touching the backbone weights? And how much does it actually help on tasks where memory matters?

02 The experiment

A tiny state matrix
updates every token.

The mechanism has two moving parts: a write operation that compresses incoming context into a fixed-size state, and a read operation that corrects the model's attention using what's been stored. The backbone never changes.

The memory state is a small matrix, just 8 by 8 in the default configuration. As each new token arrives, the module computes how much the current input differs from what the state already knows, and adjusts the state to close that gap. This is the delta rule: update proportional to the prediction error. It's the same principle that underlies classical Hopfield networks and more recent linear attention models, applied here at the granularity of individual transformer decode steps.
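The update described above can be sketched in a few lines. This is a minimal illustration of a generic delta-rule write on an 8 by 8 state, not the paper's implementation: the function name `delta_write`, the learning rate `lr`, and the unit-norm key are all assumptions made for the example.

```python
import numpy as np

def delta_write(state, key, value, lr=0.5):
    """One delta-rule write: nudge the state's recall for `key` toward `value`."""
    prediction = state @ key        # what the state currently recalls for this key
    error = value - prediction      # the prediction error (the "delta")
    return state + lr * np.outer(error, key)

rng = np.random.default_rng(0)
d = 8                               # matches the 8 by 8 default state size
state = np.zeros((d, d))
key = rng.normal(size=d)
key /= np.linalg.norm(key)          # assumption: unit-norm key, so each write halves the error
value = rng.normal(size=d)

err_before = np.linalg.norm(value - state @ key)
for _ in range(5):
    state = delta_write(state, key, value)
err_after = np.linalg.norm(value - state @ key)
```

With a unit-norm key and `lr=0.5`, each write closes half of the remaining gap, so the state converges on the stored value instead of growing without bound the way a plain additive update would.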

Reading from the state is the other half. At each decode step, the module generates a low-rank correction to the model's self-attention, the part of the network that determines what information the model retrieves from its own activations. The correction is applied via LoRA-style adapters of rank 8 on the query (Q) and output (O) projections. The frozen backbone sees a slightly different attention landscape, biased toward whatever the memory state has accumulated.
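To make the read path concrete, here is one hypothetical way a small state could steer a rank-8 correction on a single frozen projection. The routing `B @ state @ A`, the dimension `d_model = 64`, and every variable name are assumptions for illustration; the paper's actual parameterization may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, rank = 64, 8               # d_model is an arbitrary toy size

W_q = rng.normal(size=(d_model, d_model))     # frozen query projection, never updated
A = rng.normal(size=(rank, d_model)) * 0.01   # trained down-projection (hypothetical)
B = rng.normal(size=(d_model, rank)) * 0.01   # trained up-projection (hypothetical)

def corrected_query(x, state):
    """Frozen projection plus a low-rank correction routed through the memory state."""
    base = W_q @ x
    correction = B @ (state @ (A @ x))
    return base + correction

x = rng.normal(size=d_model)
empty_state = np.zeros((rank, rank))
full_state = rng.normal(size=(rank, rank))

q_empty = corrected_query(x, empty_state)     # no memory: identical to the frozen model
q_full = corrected_query(x, full_state)       # memory present: shifted query
delta_W = B @ full_state @ A                  # effective weight change, rank at most 8
```

In this sketch an empty state leaves the backbone's output untouched, and any correction is confined to a rank-8 subspace, which is the property that keeps the added parameter count small.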

Three write strategies were tested. TSW updates the state one token at a time. SSW summarizes at the segment level. MSW uses multiple write steps per segment. All three keep the state fixed-size regardless of how much input has been processed. Only the adapter weights are trained; a supervised fine-tuning pass teaches them to write and read in ways that help the downstream tasks. Once trained, the adapter runs at inference time with no further updates to itself: it's the state matrix that updates online, not the adapter parameters.
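The fixed-size property is easy to check in a toy setting. The three functions below loosely mirror the per-token, per-segment, and multi-step descriptions above; the mean pooling, the chunking into `steps` pieces, and all function names are assumptions for illustration, not the paper's definitions of TSW, SSW, or MSW.

```python
import numpy as np

D = 8  # state is D by D regardless of input length

def delta_write(state, key, value, lr=0.5):
    key = key / (np.linalg.norm(key) + 1e-8)  # normalize for a stable update
    return state + lr * np.outer(value - state @ key, key)

def write_per_token(state, keys, values):
    """One write for every token in the segment."""
    for k, v in zip(keys, values):
        state = delta_write(state, k, v)
    return state

def write_per_segment(state, keys, values):
    """One write per segment, using a mean-pooled summary (assumed pooling)."""
    return delta_write(state, keys.mean(axis=0), values.mean(axis=0))

def write_multi_step(state, keys, values, steps=4):
    """Several writes per segment, one per pooled chunk (assumed chunking)."""
    for k_chunk, v_chunk in zip(np.array_split(keys, steps), np.array_split(values, steps)):
        state = delta_write(state, k_chunk.mean(axis=0), v_chunk.mean(axis=0))
    return state

rng = np.random.default_rng(0)
strategies = {"token": write_per_token, "segment": write_per_segment, "multi": write_multi_step}
shapes = {}
for n_tokens in (16, 4096):
    keys = rng.normal(size=(n_tokens, D))
    values = rng.normal(size=(n_tokens, D))
    for name, fn in strategies.items():
        shapes[(name, n_tokens)] = fn(np.zeros((D, D)), keys, values).shape
```

Whether the segment holds 16 tokens or 4,096, every strategy returns an 8 by 8 state; what differs is how many writes are spent per segment.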

What "online" means here

Online means the state updates at inference time as tokens arrive, not in a preprocessing step before the model sees the input. There's no separate retrieval call and no index to build. The memory accumulates continuously as the model reads, and corrects attention continuously as the model generates. The closest analogy is a running summary that rewrites itself in place rather than appending to a log.

Existing approaches

RAG stores raw text, requires an index, retrieves at query time. Context extension retrains the backbone to see more tokens. Both require significant infrastructure or compute investment before they start working.

What δ-mem does

Compresses context into a state matrix at inference time, updates continuously via delta-rule learning, injects corrections via LoRA adapters. The frozen backbone is never touched. No index, no retrieval call, no retraining.

03 Findings

Memory-heavy benchmarks
show the largest gains.

Results are consistent across the three backbones tested, which span two model families. The gains are largest on tasks that require remembering facts across long spans of context, and smallest on general-purpose tasks where the backbone was already adequate.

MemoryAgentBench: 1.31× vs frozen backbone

LoCoMo: 1.20× vs frozen backbone

Average across benchmarks: 1.15× vs best baseline

Finding 1: Gains scale with memory demand

The two benchmarks where δ-mem improved most are the two most demanding of sustained context recall. MemoryAgentBench requires an agent to use information from earlier in the task to complete later steps correctly. LoCoMo tests long-term conversational memory across many turns. On both, the 8×8 state provided enough compressed signal to move performance meaningfully. On general-purpose tasks like HotpotQA, IFEval, and GPQA Diamond, scores stayed close to the backbone baseline: the adapter didn't hurt, but there wasn't much for it to do.

Finding 2: The state size is surprisingly small

An 8 by 8 matrix has 64 entries. The fact that this is enough to provide a 31% lift on MemoryAgentBench is the result worth sitting with. The state doesn't store text; it stores a compressed representation learned by the adapter to be useful for the attention correction. Whether that representation generalizes well to tasks outside the training distribution is an open question, but the results on held-out benchmarks suggest at least some transferability.
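The arithmetic behind those two numbers is short enough to write out. The sketch below assumes a single state matrix stored in bf16 (2 bytes per entry); the source does not say how many such matrices the module keeps or which dtype they use, so treat the byte count as illustrative.

```python
STATE_ROWS, STATE_COLS = 8, 8
BYTES_PER_BF16 = 2  # assumption: the state is stored in bf16

def state_entries(rows=STATE_ROWS, cols=STATE_COLS):
    """Number of scalars in one state matrix."""
    return rows * cols

def state_bytes(rows=STATE_ROWS, cols=STATE_COLS, bytes_per_entry=BYTES_PER_BF16):
    """Storage for one state matrix; constant no matter how many tokens were read."""
    return state_entries(rows, cols) * bytes_per_entry

def lift_percent(ratio):
    """Convert a score ratio such as 1.31x into a percentage lift."""
    return round((ratio - 1.0) * 100, 1)

entries = state_entries()     # 64
footprint = state_bytes()     # 128 bytes under the bf16 assumption
lift = lift_percent(1.31)     # 31.0
```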

Finding 3: Three backbones, consistent direction

Experiments ran on Qwen3-4B, Qwen3-8B, and SmolLM3-3B: three backbones from two model families. The direction of improvement held across all three, though the magnitude varied. This matters: it's evidence that the mechanism isn't doing something model-specific that happens to work on one architecture. The same training procedure produces a useful writing and reading strategy across different backbone sizes.

Scope and limitations

The benchmarks tested are specialized for memory tasks. Gains on production agentic workloads would need separate validation. The approach requires GPU hardware with bf16 support and FlashAttention: CPU-only inference is not supported. The adapter itself requires a supervised fine-tuning pass on memory-relevant data, so the "no retraining" claim applies to the backbone specifically. The state size was fixed at 8×8 for evaluation; different configurations may perform differently. The paper does not test adversarial inputs or settings where the delta-rule updates might be exploited or destabilized.

04 Practical takeaways

What this means
for builders.

The mechanism is at an early stage, but the framing is immediately useful. Online associative memory is a distinct architectural option from RAG and context extension, with its own tradeoff profile.

For teams hitting context-window limits in agent pipelines

If your agent is dropping important context from earlier steps, and RAG retrieval is too slow, too brittle, or too expensive to maintain, δ-mem is worth understanding as a complementary option. It doesn't replace RAG for fact lookup from a large corpus, but it addresses a different problem: keeping the model oriented to what happened a few thousand tokens ago within a single session.

For ML engineers evaluating memory architectures

The 1.31× gain on MemoryAgentBench makes δ-mem a credible entry in the memory-augmentation landscape. Its primary advantage over retrieval-based baselines is that it doesn't require building an index or running a retrieval step: the memory is always in state, not in storage. Its limitation is that the compressed state may not retain verbatim facts as reliably as a corpus of stored chunks. Which of the two matters more depends on your task.

For anyone sizing up the hardware requirements

The current implementation requires a GPU with bf16 support and FlashAttention. This rules out lightweight deployment on CPU-only servers or on inference hardware that lacks either one. If your inference stack can't meet these requirements, monitor for future work: the mechanism itself has no fundamental dependency on these, and more efficient implementations are plausible.
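A quick preflight check can tell you whether a machine is in scope before you clone anything. This sketch assumes a PyTorch stack and assumes FlashAttention is provided by the `flash_attn` Python package; the repository may check its requirements differently, and `check_requirements` is a hypothetical helper name.

```python
import importlib.util

def check_requirements():
    """Report which of the stated hardware and software requirements are met."""
    report = {"torch": False, "cuda": False, "bf16": False, "flash_attn": False}
    # Assumption: FlashAttention is installed as the `flash_attn` package.
    report["flash_attn"] = importlib.util.find_spec("flash_attn") is not None
    if importlib.util.find_spec("torch") is None:
        return report  # no PyTorch, so no GPU checks are possible
    import torch
    report["torch"] = True
    report["cuda"] = bool(torch.cuda.is_available())
    if report["cuda"]:
        # Only query bf16 support once a CUDA device is confirmed.
        report["bf16"] = bool(torch.cuda.is_bf16_supported())
    return report

report = check_requirements()
ready = all(report.values())
```

If `ready` is false, the `report` dictionary shows which requirement is missing.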

On the broader memory architecture question

This paper sits in a cluster of recent work on long-context LLM memory. Context-window extension, retrieval augmentation, and online associative memory are three different answers to the same underlying problem. They compose. An agent could use δ-mem for within-session continuity, RAG for cross-session fact retrieval, and a longer context window for immediate working memory. The right mix depends on your latency, cost, and accuracy requirements.

05 Further exploration

Where to go
from here.

Starting points if you want to go deeper on the mechanism or run the benchmarks yourself.

Read the paper

Lei, J., Zhang, D., Li, J., Wang, W., Fan, K., Liu, X., Liu, Q., Ma, X., Chen, B., & Poria, S. (2026). δ-mem: Efficient Online Memory for Large Language Models. declare-lab, Nanyang Technological University. arXiv:2605.12357.

Try the code

The official implementation is at github.com/declare-lab/delta-Mem. It supports Qwen3-4B, Qwen3-8B, and SmolLM3-3B. A GPU with bf16 support and FlashAttention is required.

Run MemoryAgentBench on your own system first

Before adopting, establish a baseline on MemoryAgentBench with your current agent stack. If your agent is already scoring well, the gain from δ-mem may be smaller than 1.31×. The benchmark directly measures the kind of multi-step context retention the mechanism is designed to help with.
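A little arithmetic shows why a strong baseline caps the achievable gain. The sketch assumes scores on a 0 to 1 scale; the baseline values are made up for illustration and are not results from the paper.

```python
REPORTED_RATIO = 1.31  # MemoryAgentBench gain vs frozen backbone, from the paper

def max_possible_ratio(baseline, ceiling=1.0):
    """Largest gain ratio the score ceiling allows for a given baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return ceiling / baseline

def observed_ratio(new_score, baseline):
    """Gain ratio you actually measured on your own stack."""
    return new_score / baseline

# Hypothetical baselines: a weak agent has room for 1.31x, a strong one does not.
weak_headroom = max_possible_ratio(0.50)     # 2.0, so 1.31x is reachable
strong_headroom = max_possible_ratio(0.80)   # 1.25, so 1.31x is out of reach
reachable = strong_headroom >= REPORTED_RATIO
```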

Read the delta rule and linear attention background

The Hopfield network delta rule and linear attention are the theoretical building blocks here. For linear attention context: Katharopoulos et al. (2020), "Transformers are RNNs," ICML 2020. For modern associative memory applied to transformers: Ramsauer et al. (2021), "Hopfield Networks is All You Need," ICLR 2021.

Compare against the memory architecture cluster

For context on the memory design space, compare with the STALE benchmark (which measures whether agents detect stale memory) and the GEM framework paper ("Is Agent Memory a Database?"), which formalizes the four memory operators: ingest, revise, forget, retrieve. Both are in this research library.
