A 0.12% parameter add-on gives AI agents the working memory RAG can't

AI agents forget. Every time a coding assistant loses track of a debugging thread, or a data analysis agent re-ingests the same context it already processed, the team pays in latency, token costs, and brittle workflows. The fix most teams reach for — expanding the context window or adding more RAG — is increasingly expensive and still doesn't reliably work.

To address this, researchers from Mind Lab and several universities proposed delta-mem, an efficient technique that compresses the model’s historical information into a dynamically updated matrix without changing the model itself. The resulting module adds just 0.12% of the backbone model's parameters — compared to 76.40% for one leading alternative — while outperforming it on memory-heavy benchmarks. Delta-mem allows models to continuously accumulate and reuse historical data, reducing the reliance on massive context windows or complex external retrieval modules for behavioral continuity.

The long memory challenge

The conventional solution to agent forgetfulness is to place the entire interaction history in the model’s context window.

But as Jingdi Lei, co-author of the paper, told VentureBeat, current systems treat memory merely as a context-management problem. “Either we keep expanding the context window, or we retrieve more documents through RAG,” Lei explained. “These approaches are useful and will remain important, but they become increasingly expensive and brittle when agents need to operate over long-running, multi-step interactions, and they don't really [work] like human memory since they are more like looking up documents.”

In enterprise settings, the bottleneck is not just whether the model can access history, but whether it can reuse that history efficiently, continuously, and with low latency. Standard attention mechanisms incur a quadratic computational cost as the sequence length increases. Furthermore, expanding the context window does not guarantee the model will actually recall the information effectively. Models often suffer from context degradation or context rot as they become overwhelmed with more (and often conflicting) information, even if they support one million tokens in theory.

The researchers argue for advanced memory mechanisms that can represent historical information compactly and maintain it dynamically across interactions. Existing solutions come with heavy trade-offs and generally fall into three paradigms:

Textual memory: stores history as text injected into context — constrained by window limits and prone to information loss under compression.

Outside-channel (RAG): encodes and retrieves from external modules — adds latency, integration complexity, and potential misalignment with the backbone.

Parametric: encodes memory into model weights via adapters — static after training, can't adapt to new information during live interactions.

Inside delta-mem

To achieve a compact and dynamically updated memory, delta-mem compresses an agent’s past interactions into an “online state of associative memory” (OSAM). This state is maintained as a fixed-size matrix that preserves historical information while the underlying language model remains frozen.

For enterprise workflows, this translates directly to resolving operational bottlenecks. Lei noted that a persistent coding assistant, for example, “may need to remember project conventions, recent debugging steps, user preferences, or intermediate decisions across a workflow.” Similarly, a data analysis agent might “need to maintain task state, assumptions, and prior observations while iterating over multiple tool calls.”

Rather than repeatedly retrieving and re-inserting all relevant history for these tasks, the delta-mem matrix provides a low-overhead way to carry forward useful interaction states inside the model’s forward computation.

During generation, the system does not retrieve raw text segments to add to the prompt. Instead, the backbone LLM’s current hidden state is projected into the matrix to retrieve old memory. This operation extracts context-relevant associative memory signals from delta-mem. These signals are then transformed into numerical corrections that are applied to the computations of the model. This steers the model's reasoning at inference time without altering its internal parameters.
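The read path described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' implementation. The projection matrices `w_query` and `w_out`, the toy hidden size, and the choice to add the correction directly to the hidden state are all hypothetical. The article says only that the signals become numerical corrections applied to the model's computations.

```python
import random

D_MODEL = 16  # hypothetical hidden size; real backbones are far wider
D_MEM = 8     # memory width; the paper's experiments use an 8x8 state


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng, scale=0.1):
    """Stand-in for learned adapter weights (assumption: small random values)."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def read_memory(hidden, state, w_query, w_out):
    """Query the memory state and return a corrected hidden vector."""
    query = matvec(w_query, hidden)     # project hidden state into memory space
    signal = matvec(state, query)       # associative read from the fixed-size state
    correction = matvec(w_out, signal)  # map the signal back to model width
    # Assumption: the correction is simply added to the hidden state.
    return [h + c for h, c in zip(hidden, correction)]


rng = random.Random(0)
w_query = random_matrix(D_MEM, D_MODEL, rng)
w_out = random_matrix(D_MODEL, D_MEM, rng)
hidden = [rng.uniform(-1.0, 1.0) for _ in range(D_MODEL)]

# An empty memory contributes nothing, so the backbone behaves as usual.
empty_state = [[0.0] * D_MEM for _ in range(D_MEM)]
unchanged = read_memory(hidden, empty_state, w_query, w_out)

# A non-empty memory shifts the hidden state without touching any backbone weight.
filled_state = random_matrix(D_MEM, D_MEM, rng, scale=1.0)
steered = read_memory(hidden, filled_state, w_query, w_out)
```

In this sketch the only thing that changes between calls is `state`. The backbone's own weights never appear in the function, which mirrors the frozen-model design the researchers describe.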

Following each interaction, delta-mem updates the online state using “delta-rule learning.” When new information arrives, the previous state makes a prediction about the resulting attention values. It then compares this prediction to the actual value and corrects the memory matrix based on the discrepancy.

This update mechanism relies on a “gated delta-rule.” In effect, the memory module has gates that control how much of the previous memory is retained and how strongly each new correction is written. This error correction with controlled forgetting allows the matrix to evolve over time, holding onto stable historical associations without being derailed by short-term noise.
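The predict, compare, and correct loop can be made concrete with a small example. The sketch below uses the standard gated delta-rule form from the linear-attention literature, which may differ in detail from the paper's exact formulation. The names `alpha` (retention gate) and `beta` (write strength) and the toy key and value vectors are assumptions chosen for illustration.

```python
D = 8  # matches the 8x8 state size reported in the paper's experiments


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def gated_delta_update(state, key, value, alpha, beta):
    """One gated delta-rule step.

    alpha in [0, 1]: how much of the old memory is kept (controlled forgetting).
    beta in [0, 1]: how strongly the prediction error is written back.
    """
    decayed = [[alpha * s for s in row] for row in state]  # forget a little
    predicted = matvec(decayed, key)                       # what memory expects
    error = [v - p for v, p in zip(value, predicted)]      # compare with actual value
    # Correct the state with the outer product of the error and the key.
    return [[s + beta * e * k for s, k in zip(row, key)]
            for row, e in zip(decayed, error)]


state = [[0.0] * D for _ in range(D)]
key = [1.0] + [0.0] * (D - 1)          # toy unit-length key
value = [1.0, 2.0] + [0.0] * (D - 2)   # toy value to associate with the key

# Repeated writes shrink the error, so recall converges toward the stored value.
for _ in range(10):
    state = gated_delta_update(state, key, value, alpha=1.0, beta=0.5)
recalled = matvec(state, key)

# With alpha < 1 and no new write (beta = 0), the association simply fades.
faded = gated_delta_update(state, key, value, alpha=0.5, beta=0.0)
faded_recall = matvec(faded, key)
```

Because each write closes only a fraction `beta` of the remaining error, a single noisy observation cannot overwrite an association that many earlier writes have reinforced.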

The researchers explored three strategies for determining when and how the matrix updates:

Token-state write captures fine-grained changes but is vulnerable to short-term noise.

Sequence-state write averages tokens within a message segment, smoothing updates at the cost of some localized detail.

Multi-state write decomposes memory into sub-states for different information types like facts or task progress.
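The difference between the three write strategies comes down to what gets passed to the update step and how often. The sketch below is a simplified reading of the descriptions above, not the released code. It reuses a standard gated delta-rule update, and the explicit `routes` labels that assign tokens to sub-states are a hypothetical stand-in for however the paper actually separates information types.

```python
def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def gated_delta_update(state, key, value, alpha, beta):
    """Standard gated delta-rule step (assumed form, see lead-in)."""
    decayed = [[alpha * s for s in row] for row in state]
    error = [v - p for v, p in zip(value, matvec(decayed, key))]
    return [[s + beta * e * k for s, k in zip(row, key)]
            for row, e in zip(decayed, error)]


def mean_vector(vectors):
    """Element-wise average of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def token_state_write(state, keys, values, alpha, beta):
    """One update per token: fine-grained, but every noisy token is written."""
    for key, value in zip(keys, values):
        state = gated_delta_update(state, key, value, alpha, beta)
    return state


def sequence_state_write(state, keys, values, alpha, beta):
    """One update per segment, using the averaged key and value."""
    return gated_delta_update(state, mean_vector(keys), mean_vector(values), alpha, beta)


def multi_state_write(states, keys, values, routes, alpha, beta):
    """Write each token to the sub-state named in routes (hypothetical routing)."""
    updated = dict(states)
    for key, value, name in zip(keys, values, routes):
        updated[name] = gated_delta_update(updated[name], key, value, alpha, beta)
    return updated


def zeros(n):
    return [[0.0] * n for _ in range(n)]


# Toy segment of two identical tokens in a 2x2 memory.
keys = [[1.0, 0.0], [1.0, 0.0]]
values = [[2.0, 0.0], [2.0, 0.0]]

per_token = token_state_write(zeros(2), keys, values, alpha=1.0, beta=1.0)
per_segment = sequence_state_write(zeros(2), keys, values, alpha=1.0, beta=1.0)

sub_states = {"facts": zeros(2), "task_progress": zeros(2)}
routed = multi_state_write(sub_states, keys, values, ["facts", "facts"],
                           alpha=1.0, beta=1.0)
```

For a segment of identical tokens the first two strategies agree. They diverge once tokens within a segment differ, because averaging then smooths away the token-level variation. In the multi-state case, writes routed to one sub-state leave the others untouched, which is the interference-limiting property the researchers highlight.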

Delta-mem in action

The researchers evaluated delta-mem across three LLM backbones: Qwen3-8B, Qwen3-4B-Instruct, and SmolLM3-3B. They configured the framework with a compact 8×8 matrix. The system was tested on general capability benchmarks, including HotpotQA, GPQA-Diamond, and IFEval. It was also evaluated on memory-heavy tasks such as LoCoMo, which tests long-term conversational memory, and Memory Agent Bench, which assesses retention, retrieval, selective forgetting, and test-time learning over extended interactions.

The framework was compared against representative models from the three existing memory paradigms: textual memory baselines (e.g., BM25 RAG, LLMLingua-2, and MemoryBank), parametric systems (Context2LoRA and MemGen), and the outside-channel approach MLP Memory.

Across the board, delta-mem outperformed the baselines, according to the researchers. On the Qwen3-4B-Instruct backbone, the token-state write variant achieved an average score of 51.66%, easily surpassing the frozen vanilla backbone at 46.79% and the strongest baseline, Context2LoRA, at 44.90%. On the memory-heavy Memory Agent Bench, the average score jumped from 29.54% to 38.85%. Performance on the specific test-time learning subtask nearly doubled, from 26.14% to 50.50%.

However, the most compelling takeaway is the system's operational efficiency. The researchers tested the framework in a no-context setting where the historical text was entirely removed from the context. Even without explicit text replay, delta-mem successfully recovered context-relevant evidence in multi-hop tasks. The researchers argue that the model remembers past interactions without needing to ingest massive amounts of prompt tokens.

The framework also adds only 4.87 million trainable parameters, representing just 0.12% of the Qwen3-4B-Instruct backbone. By comparison, the MLP Memory baseline required 3 billion parameters, equivalent to 76.40% of the backbone's size, while delivering inferior results. When prompt lengths scaled up to 32,000 tokens during inference tests, the framework maintained almost the exact same GPU memory footprint as a standard, unmodified model. It sidesteps the heavy memory bloat that affects other advanced memory systems like MemGen and MLP Memory.
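The parameter figures are easy to sanity-check with a few lines of arithmetic. The snippet below uses only the numbers quoted above. The backbone size itself is not stated in the article, so it is derived here from each percentage, and the variable names are illustrative.

```python
# Figures as quoted in the article.
delta_mem_params = 4.87e6     # trainable parameters added by delta-mem
delta_mem_share = 0.12 / 100  # stated share of the backbone
mlp_memory_params = 3e9       # MLP Memory baseline ("3 billion", likely rounded)
mlp_memory_share = 76.40 / 100

# Backbone size implied by each pair of figures (derived, not stated in the source).
implied_backbone_from_delta = delta_mem_params / delta_mem_share
implied_backbone_from_mlp = mlp_memory_params / mlp_memory_share

# How many times larger the MLP Memory add-on is than the delta-mem add-on.
size_ratio = mlp_memory_params / delta_mem_params

print(f"backbone implied by delta-mem figures:  {implied_backbone_from_delta / 1e9:.2f}B")
print(f"backbone implied by MLP Memory figures: {implied_backbone_from_mlp / 1e9:.2f}B")
print(f"MLP Memory add-on is about {size_ratio:.0f}x larger")
```

Both derivations land close to 4 billion parameters, consistent with a 4B-class backbone. The small gap between them is what one would expect from rounded inputs. On these figures the MLP Memory add-on is several hundred times the size of the delta-mem module.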

Different update strategies proved beneficial depending on the underlying model capacity. The sequence-state write strategy was the most effective for stronger backbones like Qwen3-8B. These more capable models use the segment-level writing to smooth out updates and mitigate token-level noise. Conversely, the multi-state write strategy drove massive performance leaps for smaller backbones like SmolLM3-3B. For these lower-capacity models, separating memory into multiple states proved critical to minimizing information interference.

Implementing delta-mem in the enterprise stack

The researchers have released the code for delta-mem on GitHub and the weights for their trained adapters on Hugging Face. For AI engineering teams looking to integrate this framework into their existing inference stack, the process requires minimal computing resources.

“In practice, an engineering team would start from an existing instruction-tuned backbone, attach the Delta-Mem adapter modules to selected attention layers, train only the adapter parameters on domain-relevant multi-turn or long-context data… and then run inference with the memory state updated online during interaction,” Lei said. Crucially, teams do not need a massive pretraining corpus. The training data only needs to reflect the target memory behavior, such as multi-turn dialogues, agent traces, or domain workflows where earlier information must influence later decisions.

While compressing interaction history into a fixed-size matrix is highly efficient, it comes with trade-offs. Delta-mem is not a lossless replacement for explicit text logs or document retrieval. Because different pieces of information compete inside the same limited state, there is a risk of memory blending.

“Delta-Mem is useful when the system needs fast, online, continuously updated behavioral state,” Lei said. “RAG is better when the system needs exact factual recall, citation, compliance, auditability, or access to a large external knowledge base.” Remembering a user’s working style or a multi-step reasoning trajectory is a perfect fit for delta-mem, while retrieving a legal contract or a medical guideline should remain in a vector database.

This means the most realistic enterprise architecture moving forward is a hybrid approach. Delta-mem acts as a lightweight internal working memory, reducing the need to retrieve or replay everything all the time, while RAG serves as the explicit, high-capacity memory layer.

“Looking ahead, I do not think vector databases will become obsolete,” Lei said. “Instead, I expect enterprise AI stacks to become more layered. We will likely see short-term working memory inside the model, longer-term explicit memory in retrieval systems, and policy or audit layers that decide what should be stored, retrieved, forgotten, or exposed to the user.”
