Why Your AI Agent Forgets What You Like (And How We're Fixing It)
Engram scored 0% on preference tracking benchmarks despite having the data. The problem wasn't storage—it was reformulation. Here's the engineering path to fixing it.
I run Engram, an open-source memory plugin for AI agents. It stores memories on disk, retrieves them by semantic and keyword search, and compounds useful patterns over time. It scores well on most memory benchmarks—81% on user facts, 74% on knowledge updates, solid marks on temporal reasoning and multi-session recall.
But preference tracking was sitting at 0%.
Not “low.” Zero. Out of 30 questions in the LongMemEval benchmark, Engram got exactly zero right when asked about user preferences. And that number hadn’t budged across multiple versions.
Last week I spent a day hunting that zero.
The number that didn’t make sense
LongMemEval is a benchmark, introduced in an ICLR 2025 paper, that tests chat assistants on long-term interactive memory. Its 500 questions cover user facts, assistant facts, temporal reasoning, knowledge updates, and multi-session reasoning, along with a smaller preference-tracking subset: 30 questions specifically about what users like, prefer, and want.
Engram crushed most categories. Single-session user accuracy: 81.4%. Knowledge updates: 74.4%. But preference tracking was a wall of zeros.
At first I assumed the extraction pipeline was broken. Maybe Engram wasn’t capturing preference signals at all. But when I looked at the actual memory store, the preferences were there. Statements like “The user uses Adobe Premiere Pro for video editing” were being stored correctly. The data existed.
The problem was downstream.
The substring trap
LongMemEval’s scoring function uses containsAnswer, a strict substring match. If the expected answer is “The user would prefer responses that suggest resources specifically tailored to Adobe Premiere Pro,” that phrase has to appear in the recall context essentially verbatim. Token overlap doesn’t count: a token-level F1 of 0.3–0.5 between the retrieved context and the expected answer earns nothing if the reformulated statement itself isn’t there.
This is a common trap in memory benchmarks. The system has the right information. It retrieves the right conversation. But it presents that information in a form that the scorer doesn’t recognize.
Engram’s memory extraction was producing factual statements: “The user uses Adobe Premiere Pro for video editing.” That’s a correct and useful memory. But the benchmark expects a preference statement: “The user would prefer resources specifically tailored to Adobe Premiere Pro.”
The gap between those two sentences is the entire preference tracking problem.
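To make that gap concrete, here is a minimal sketch of the two scoring styles. The body of containsAnswer below is an assumption (a case-insensitive, punctuation-insensitive substring check), not the benchmark's actual code, and tokenF1 is a hypothetical helper added only for comparison.

```typescript
// Assumed normalization: lowercase, strip punctuation, collapse whitespace.
function normalize(text: string): string {
  return text.toLowerCase().replace(/[^a-z0-9\s]/g, " ").replace(/\s+/g, " ").trim();
}

// Illustrative stand-in for a strict substring scorer.
function containsAnswer(context: string, expected: string): boolean {
  return normalize(context).includes(normalize(expected));
}

// Hypothetical token-level F1 between the recalled context and the expected answer.
function tokenF1(context: string, expected: string): number {
  const ctx = normalize(context).split(" ").filter(Boolean);
  const exp = normalize(expected).split(" ").filter(Boolean);
  const remaining = new Map<string, number>();
  for (const token of exp) remaining.set(token, (remaining.get(token) ?? 0) + 1);
  let overlap = 0;
  for (const token of ctx) {
    const count = remaining.get(token) ?? 0;
    if (count > 0) {
      overlap += 1;
      remaining.set(token, count - 1);
    }
  }
  if (overlap === 0) return 0;
  const precision = overlap / ctx.length;
  const recall = overlap / exp.length;
  return (2 * precision * recall) / (precision + recall);
}

const storedFact = "The user uses Adobe Premiere Pro for video editing.";
const expectedAnswer =
  "The user would prefer resources specifically tailored to Adobe Premiere Pro.";

const substringHit = containsAnswer(storedFact, expectedAnswer);
const overlapScore = tokenF1(storedFact, expectedAnswer);
console.log({ substringHit, overlapScore });
```

Under these assumptions the stored fact shares five tokens with the expected answer, which works out to a token F1 of 0.5, and it still fails the substring check.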
What we built
I call the fix IRC—Inductive Rule Consolidation. The idea is simple: after Engram extracts memories, a second pass scans for preference and correction signals and synthesizes explicit preference statements. These synthesized statements are injected into recall as a dedicated section, formatted to match the patterns benchmarks (and, frankly, humans) expect.
The preference consolidator works through pattern matching. It recognizes five categories of preference signals:
- Direct preference: “prefers X,” “enjoys X,” “likes X” → “The user prefers X”
- Tool usage: “uses X for Y” → “The user prefers to use X for Y”
- Negative preference: “avoids X,” “dislikes X” → “The user would not prefer X”
- Interest signals: “interested in X,” “specializes in X” → “The user would prefer content related to X”
- Comparative: “X is preferred over Y” → “The user would prefer X”
For any memory that doesn’t match a specific pattern, there’s a fallback that prepends “The user prefers:” to the content—reliable if blunt.
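The five categories and the fallback can be sketched as an ordered rule table. This is a minimal illustration, not Engram's actual code: the regexes, the rule order, and the names PreferenceRule, RULES, and consolidate are all assumptions.

```typescript
// Hypothetical rule shape: a pattern plus a renderer for the captured groups.
interface PreferenceRule {
  category: string;
  pattern: RegExp;
  render: (groups: string[]) => string;
}

// Assumed patterns; each capture stops at sentence punctuation.
const RULES: PreferenceRule[] = [
  {
    category: "tool-usage",
    pattern: /\buses\s+([^.,;!?]+?)\s+for\s+([^.,;!?]+)/i,
    render: (g) => `The user prefers to use ${g[0]} for ${g[1]}`,
  },
  {
    category: "negative",
    pattern: /\b(?:avoids|dislikes)\s+([^.,;!?]+)/i,
    render: (g) => `The user would not prefer ${g[0]}`,
  },
  {
    category: "comparative",
    pattern: /([^.,;!?]+?)\s+is preferred over\s+([^.,;!?]+)/i,
    render: (g) => `The user would prefer ${g[0]}`,
  },
  {
    category: "interest",
    pattern: /\b(?:interested in|specializes in)\s+([^.,;!?]+)/i,
    render: (g) => `The user would prefer content related to ${g[0]}`,
  },
  {
    category: "direct",
    pattern: /\b(?:prefers|enjoys|likes)\s+([^.,;!?]+)/i,
    render: (g) => `The user prefers ${g[0]}`,
  },
];

// First matching rule wins; anything else gets the blunt fallback prefix.
function consolidate(memory: string): string {
  for (const rule of RULES) {
    const match = memory.match(rule.pattern);
    if (match !== null) {
      const groups = match.slice(1).map((g) => (g ?? "").trim());
      return rule.render(groups);
    }
  }
  return `The user prefers: ${memory.trim()}`;
}

const toolExample = consolidate("The user uses Adobe Premiere Pro for video editing.");
const fallbackExample = consolidate("Lives in Berlin");
console.log(toolExample);
console.log(fallbackExample);
```

In this sketch the more specific rules are checked before the direct-preference rule, so a sentence that fits two categories gets the more specific statement.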
The consolidator also extracts keywords from each preference statement for matching during recall, and scores relevance based on how well those keywords overlap with the current query.
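One simple way to do that keyword scoring is shown below, as a sketch under stated assumptions: the stopword list, the tokenizer, and the "fraction of keywords found in the query" score are illustrative choices, and extractKeywords, relevance, and rankPreferences are hypothetical names.

```typescript
// Assumed stopword list: template words from the synthesized statements plus common fillers.
const STOPWORDS = new Set([
  "the", "user", "would", "prefer", "prefers", "not", "to", "use", "for",
  "content", "related", "a", "an", "of", "and", "in", "on", "with", "is",
]);

// Lowercase, split on non-alphanumerics, drop stopwords and duplicates.
function extractKeywords(statement: string): string[] {
  const tokens = statement
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((t) => t.length > 1 && !STOPWORDS.has(t));
  return tokens.filter((t, i) => tokens.indexOf(t) === i);
}

// Score = share of the statement's keywords that also appear in the query.
function relevance(keywords: string[], query: string): number {
  if (keywords.length === 0) return 0;
  const queryTokens = new Set(query.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean));
  const hits = keywords.filter((k) => queryTokens.has(k)).length;
  return hits / keywords.length;
}

interface ScoredPreference {
  statement: string;
  score: number;
}

// Keep only statements with some overlap, most relevant first.
function rankPreferences(statements: string[], query: string): ScoredPreference[] {
  return statements
    .map((statement) => ({ statement, score: relevance(extractKeywords(statement), query) }))
    .filter((p) => p.score > 0)
    .sort((a, b) => b.score - a.score);
}

const ranked = rankPreferences(
  [
    "The user prefers to use Adobe Premiere Pro for video editing",
    "The user would prefer content related to sourdough baking",
  ],
  "Can you recommend resources for learning video editing?",
);
console.log(ranked);
```

Here only the Premiere Pro statement survives: two of its five keywords ("video" and "editing") appear in the query, while the baking statement has no overlap and is dropped.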
When the fix hit another wall
First benchmark run with IRC enabled: 0%. Again. All metrics identical to baseline.
But the IRC code path was executing. Debug logging showed ircEnabled=true sectionEnabled=true. The preference section builder was running, but every time it bailed out with skip(no-memories).
The eval sandbox doesn’t have an LLM. Engram’s extraction pipeline normally uses an LLM to identify and store preference memories from conversations. In the eval environment, extraction silently fails because there’s no model available. The memory store is empty. IRC reads zero memories. Nothing to synthesize.
The data was in the conversations. Engram just wasn’t extracting it because extraction requires an LLM that the eval environment doesn’t provide.
So the preference consolidator needed a second strategy: synthesize preferences directly from raw conversation text without going through extraction. The LongMemEval conversations are stored in Engram’s LCM (Lossless Context Management) and are searchable via full-text search. A new function, synthesizePreferencesFromLcm, pattern-matches against the raw conversation data, finds preference signals, and generates the reformulated statements directly.
No LLM needed. No extraction step. Just regex, keyword scoring, and the same preference transformation logic.
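A rough sketch of that idea follows. It is not the real synthesizePreferencesFromLcm, whose signature and patterns aren't shown here; the Turn shape, the first-person regexes, and the function name synthesizeFromRawTurns are assumptions made for illustration, and the turns are passed in directly rather than read from LCM.

```typescript
// Hypothetical shape for a raw conversation turn.
interface Turn {
  role: "user" | "assistant";
  content: string;
}

interface SignalRule {
  pattern: RegExp;
  render: (groups: string[]) => string;
}

// Assumed first-person patterns, since raw user text says "I use X", not "the user uses X".
const SIGNAL_RULES: SignalRule[] = [
  {
    pattern: /\bI use\s+([^.,;!?]+?)\s+for\s+([^.,;!?]+)/,
    render: (g) => `The user prefers to use ${g[0]} for ${g[1]}`,
  },
  {
    pattern: /\bI (?:avoid|dislike|hate)\s+([^.,;!?]+)/,
    render: (g) => `The user would not prefer ${g[0]}`,
  },
  {
    pattern: /\bI (?:really )?(?:prefer|enjoy|like|love)\s+([^.,;!?]+)/,
    render: (g) => `The user prefers ${g[0]}`,
  },
];

// Scan user turns only, collect every match, and de-duplicate the statements.
function synthesizeFromRawTurns(turns: Turn[]): string[] {
  const seen = new Set<string>();
  const statements: string[] = [];
  for (const turn of turns) {
    if (turn.role !== "user") continue;
    for (const rule of SIGNAL_RULES) {
      // Fresh global regex per scan so lastIndex state never leaks between turns.
      const re = new RegExp(rule.pattern.source, "gi");
      let match: RegExpExecArray | null;
      while ((match = re.exec(turn.content)) !== null) {
        const groups = match.slice(1).map((g) => (g ?? "").trim());
        const statement = rule.render(groups);
        if (!seen.has(statement)) {
          seen.add(statement);
          statements.push(statement);
        }
      }
    }
  }
  return statements;
}

const exampleTurns: Turn[] = [
  { role: "user", content: "I use Adobe Premiere Pro for video editing. Any tips?" },
  { role: "assistant", content: "I like to suggest keyboard shortcuts first." },
  { role: "user", content: "Thanks! I really enjoy color grading, but I avoid heavy plugins." },
];
const synthesized = synthesizeFromRawTurns(exampleTurns);
console.log(synthesized);
```

Assistant turns are skipped on purpose in this sketch, so the assistant's own "I like to..." phrasing never gets recorded as a user preference.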
What this actually reveals
The preference tracking problem isn’t really about AI intelligence. It’s about format mismatch.
An AI agent that correctly stores “I enjoy Adobe Premiere Pro” and retrieves that memory when asked about video preferences is doing its job. But if the benchmark (or the user, or the downstream system) expects “The user would prefer resources tailored to Adobe Premiere Pro,” then the agent looks broken even when it’s working.
This shows up everywhere, not just in benchmarks. When an AI assistant remembers that you use TypeScript but doesn’t connect that to “this person probably wants TypeScript examples,” it’s the same gap. The preference is stored as a fact. It needs to be recalled as a recommendation.
IRC bridges that gap with a simple architectural principle: separate the storage format from the recall format. Store memories as facts. Synthesize them into preference statements at recall time. Let each stage optimize for its own purpose.
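As a sketch of that separation, the snippet below keeps stored facts untouched and derives a preference section only when the recall context is assembled. The StoredMemory shape, the section labels, and the names buildRecallContext and toolUsageSynthesizer are hypothetical, not Engram's API.

```typescript
// Storage format: plain facts, never rewritten.
interface StoredMemory {
  id: string;
  fact: string;
}

// Recall-time transform: returns a preference statement, or null if the fact carries none.
type Synthesizer = (fact: string) => string | null;

// Assumed example synthesizer covering only the "uses X for Y" pattern.
function toolUsageSynthesizer(fact: string): string | null {
  const match = fact.match(/\buses\s+([^.,;!?]+?)\s+for\s+([^.,;!?]+)/i);
  if (match === null) return null;
  return `The user prefers to use ${(match[1] ?? "").trim()} for ${(match[2] ?? "").trim()}`;
}

// Recall format: facts plus a dedicated, derived preference section.
function buildRecallContext(memories: StoredMemory[], synthesize: Synthesizer): string {
  const factLines = memories.map((m) => `- ${m.fact}`);
  const preferenceLines: string[] = [];
  for (const memory of memories) {
    const statement = synthesize(memory.fact);
    if (statement === null) continue;
    const line = `- ${statement}`;
    if (preferenceLines.indexOf(line) === -1) preferenceLines.push(line);
  }
  const sections = [["Relevant memories:", ...factLines].join("\n")];
  if (preferenceLines.length > 0) {
    sections.push(["User preferences:", ...preferenceLines].join("\n"));
  }
  return sections.join("\n\n");
}

const memoryStore: StoredMemory[] = [
  { id: "m1", fact: "The user uses TypeScript for backend services." },
  { id: "m2", fact: "The user lives in Lisbon." },
];
const recallContext = buildRecallContext(memoryStore, toolUsageSynthesizer);
console.log(recallContext);
```

Because the synthesizer is passed in as a parameter, the recall format can change (or be tuned per consumer) without touching anything that was written to disk.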
Where we are now
The latest LongMemEval run (500 questions) still shows 0% on preference accuracy—the LCM fallback wasn’t fully wired up at the time of that run. But the consolidator code is in place, the fallback function is implemented, and the next benchmark run will test whether raw conversation pattern matching can produce preference statements that pass substring matching.
If it works, the fix is simple: a single pattern-matching pass, over extracted memories or raw conversation text, that turns facts into preference statements with no LLM required. If it doesn’t, the next move is to augment the eval adapter’s store method with regex-based preference capture during the store phase, catching preference signals before they hit extraction at all.
Either way, the lesson holds: memory systems don’t just need to store and retrieve. They need to transform what they store into what’s actually useful at the moment of recall. That’s a formatting problem, not an intelligence problem, and formatting problems are solvable.
Engram is open source. You can follow the IRC implementation in the src/compounding/ directory of the openclaw-engram repository.