The state of AI memory in 2026: claimed vs observed
AI memory is the most-benchmarked, least-reproduced category in the AI tooling stack right now. Vendors publish high numbers. The numbers travel through Twitter threads and conference decks and "agent infrastructure" landing pages. Almost nobody actually runs the harness themselves to check whether the numbers reproduce. We did.
This post is the result of that work. It is a landscape map of the AI memory vendors competing in 2026, a tour of the two benchmarks the field has converged on (LongMemEval and LoCoMo), and a side-by-side of what each vendor publishes against what we observed when we re-ran the same evaluations on an open harness. Where we have completed a reproduction, we cite both numbers. Where we have not, we say so plainly. The harness is open-source. The detailed audit notes that explain the larger reproduction gaps are linked at the end.
This is a long read. The shortcut version lives in the table in the Results section below. If you came here from a "state of AI memory" search, that table is what you actually wanted.
Why memory is the bottleneck in 2026
The million-token-context-window race did not solve memory. It shifted the cost. You can stuff a million tokens into a single prompt now, which sounds like it should make memory unnecessary, but two well-documented failure modes turn that promise into wishful thinking at production scale.
Context rot kills the promise of long windows. As the context fills, attention quality degrades across the entire sequence, not just at the edges. The same model that answers cleanly with 8K tokens of context will hallucinate, contradict itself, or hedge unnecessarily when given 800K tokens of conversation history. Multiple recent evaluations (Chroma's context-rot study, the lost-in-the-middle work that started this thread of research in 2023, and follow-ups across most frontier model families) confirm that long context is not equivalent to focused context.
Lost-in-the-middle compounds the problem. Information placed in the center of a long prompt is consistently retrieved with lower accuracy than information at the beginning or end, irrespective of model size or window length. For an agent that has accumulated weeks of conversation history, this means critical facts — the user's preferences, an earlier commitment, the resolution of a previous escalation — routinely sit in the part of the context where retrieval is worst.
So memory layers exist because the context window is the wrong abstraction for stateful behavior. An agent that remembers should not have to re-read its entire conversational history to recall that the user prefers terse answers, works in payments compliance, runs a Series B SaaS company, and asked about SOC 2 controls two weeks ago. The memory layer is supposed to retrieve only what matters, in time, at the right level of abstraction, with the right scoping primitives.
Scoping is its own subtlety. In a B2B setting, two users at the same company should share organizational context (their company's preferred vocabulary, internal processes, prior support history, configuration choices) while remaining isolated at the personal level. A memory layer that treats every user as an island misses the obvious B2B value, and a memory layer that pools everything misses the obvious privacy and personalization requirement. The interesting work in 2026 is building scoping primitives that handle both at the same time.
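To make the scoping idea concrete, here is a minimal sketch of a two-level visibility rule: organization-wide memories are shared, personal memories stay with their owner. The `Memory` record and `visible_memories` helper are hypothetical names invented for this illustration and are not any vendor's API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Memory:
    text: str
    org_id: str
    user_id: Optional[str]  # None marks an organization-wide memory (assumption of this sketch)


def visible_memories(store: List[Memory], org_id: str, user_id: str) -> List[Memory]:
    """Return the caller's personal memories plus their organization's shared ones."""
    return [
        m
        for m in store
        if m.org_id == org_id and (m.user_id is None or m.user_id == user_id)
    ]


# Toy data, made up for the example.
store = [
    Memory("Company calls invoices 'statements'", org_id="acme", user_id=None),
    Memory("Prefers terse answers", org_id="acme", user_id="alice"),
    Memory("Prefers long explanations", org_id="acme", user_id="bob"),
    Memory("Uses a different vocabulary", org_id="globex", user_id=None),
]

alice_view = [m.text for m in visible_memories(store, "acme", "alice")]
bob_view = [m.text for m in visible_memories(store, "acme", "bob")]
```

Alice and Bob both see the shared vocabulary note, neither sees the other's preference, and nothing from the second organization leaks in. A production system would enforce this boundary in storage and retrieval rather than in a list comprehension.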
In 2026 this is no longer aspirational. Memory is in the critical path for every serious agent product we have looked at: copilots that personalize across sessions, customer support agents that retain history across tickets, consumer chat apps that build long-term context with each user, internal-tool agents that have to remember what they did last Tuesday. The bottleneck is real. The vendors competing to solve it are real. The benchmarks the field uses to measure progress are real. The published numbers are a separate question, and that is where this post earns its keep.
The vendor map
The agent-memory landscape splits cleanly into three categories. (Personal and consumer memory products such as ChatGPT Memory and Maximem's own Vity are a separate market with different benchmarks and different buyers; they are out of scope for this post.)
Category one: dedicated open-source memory libraries. Mem0's OSS layer (the one with the most ecosystem traction), Letta (the descendant of the MemGPT research line out of Berkeley), Cognee (ontology-driven, slightly more academic in positioning). These are products you install and run yourself. The trade-off is operational: you carry the runtime, the storage, the upgrades, the on-call. The upside is portability and full control.
Category two: dedicated hosted memory products. Mem0 Cloud, Zep (the most mature on graph-structured memory), SuperMemory (B2B-leaning, customer-support-heavy in the case studies they show), and Synap (which Maximem builds; structured long-term memory with multi-tenant scoping primitives). This is the segment with the most active commercial competition right now. Pricing pages started looking like each other six months ago, which is usually a sign that buyers are starting to ask the same comparison questions.
Category three: memory features inside agent frameworks. LangGraph, LlamaIndex, the Vercel AI SDK, the OpenAI Agents SDK, and a handful of others ship minimal memory primitives as part of the framework. These cover the common case (recent-turn recall, simple key-value persistence) and stop short of the harder problems: entity resolution across surface forms, temporal reasoning over versioned facts, cross-conversation synthesis. If you need real memory and you are using one of these frameworks, you almost always end up wiring in a dedicated memory layer.
A small but persistent group of teams still rolls their own memory layer in-house. The argument for it (data layer, compliance, portability) is real for the first few months and gets weaker the longer the system runs. Most in-house implementations we have seen end up reimplementing the obvious primitives (chunking, embedding, recency-decay retrieval, basic deduplication) without ever getting to the harder problems (entity resolution, temporal reasoning, scoping, multi-tenant isolation). The cost compounds. The right framing for in-house is not "we built it ourselves" but "we built a worse version and now we maintain it forever." A few teams need to roll their own for regulatory reasons. Most teams convince themselves they need to, regret it eighteen months later, and migrate to a vendor anyway.
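For readers who have not built one, "recency-decay retrieval" usually amounts to something like the sketch below: a similarity score multiplied by an exponential decay on age. The 30-day half-life and the function name are assumptions chosen for illustration, not a recommendation.

```python
def recency_weighted_score(similarity: float, age_days: float, half_life_days: float = 30.0) -> float:
    """Down-weight a similarity score by how old the memory is.

    The weight halves every `half_life_days` (an illustrative default).
    """
    decay = 0.5 ** (age_days / half_life_days)
    return similarity * decay


# (label, similarity, age in days): toy candidates, made up for the example.
candidates = [
    ("old but close match", 0.90, 90.0),
    ("recent, weaker match", 0.60, 3.0),
]

ranked = sorted(
    candidates,
    key=lambda c: recency_weighted_score(c[1], c[2]),
    reverse=True,
)
```

With these toy values the recent, weaker match outranks the older, closer one. That is the easy part; nothing here resolves entities, versions facts, or enforces scoping.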
Figure: three-category diagram of the vendor map.
How the field benchmarks itself
Two benchmarks dominate published claims in 2026: LongMemEval and LoCoMo.
LongMemEval was introduced in 2024 by researchers at UCLA and Tencent AI Lab and was accepted at ICLR 2025. It tests how well a memory system can answer questions over a long, multi-session conversational history. It contains 500 questions distributed across six categories: single-session-user, single-session-assistant, single-session-preference, knowledge-update, temporal-reasoning, and multi-session. The hardest category by a wide margin is multi-session, because it requires the system to synthesize evidence from multiple separate conversations rather than pulling from a single recent thread. Published methodology uses an LLM-as-judge with a binary correct-or-wrong label.
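Because the judge emits a binary label per question, headline and per-category scores are plain proportions. A minimal sketch of that aggregation, using made-up toy labels rather than any real run:

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_category(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Aggregate binary judge labels into per-category accuracy."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for category, is_correct in results:
        totals[category] += 1
        correct[category] += int(is_correct)
    return {category: correct[category] / totals[category] for category in totals}


# Toy labels, invented for the example; not benchmark data.
toy_results = [
    ("multi-session", True),
    ("multi-session", False),
    ("knowledge-update", True),
    ("knowledge-update", True),
]

per_category = accuracy_by_category(toy_results)
overall = sum(ok for _, ok in toy_results) / len(toy_results)
```

The overall score is a question-weighted average, so a category with more questions moves the headline number more than a small one does.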
LoCoMo, from Snap Research at ACL 2024, focuses on long-form open-ended conversations. The benchmark contains five question categories. Industry convention (the convention Mem0, Zep, and most others follow) is to exclude the adversarial category and report on categories one through four. Roughly: multi-hop reasoning, temporal reasoning, open-domain opinion, and single-hop recall. Open-domain is the category where vendor prompt engineering tends to do the most lifting, because the gold answers often follow predictable patterns (an answer of "likely no" tends to be correct when the most recent referenced event involved a bad experience, for instance).
Both benchmarks share a structural property worth understanding before reading any published number. The answer model and the judge model are not part of the memory system. They are separate LLMs that the vendor configures via prompt. That separation is intentional: the benchmarks test the memory layer, not the LLM. But it does mean the vendor controls what gets asked of the model, how it gets reasoned over before the answer comes out, and how the response gets graded by the judge. Three points of leverage. All in the prompt files. Which is where reproduction starts to matter.
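The separation is easier to see as code. In the sketch below the memory layer is one pluggable function, and the answerer and judge are two more; all names and the stub implementations are hypothetical and stand in for real model calls.

```python
from typing import Callable, List

Retriever = Callable[[str], List[str]]        # the memory layer under test
Answerer = Callable[[str, List[str]], str]    # answer model plus its prompt
Judge = Callable[[str, str, str], bool]       # judge model plus its prompt


def evaluate_question(
    question: str,
    gold: str,
    retrieve: Retriever,
    answer: Answerer,
    judge: Judge,
) -> bool:
    """Run one benchmark question through retrieval, answering, and judging."""
    memories = retrieve(question)
    prediction = answer(question, memories)
    return judge(question, gold, prediction)


# Stubs standing in for a memory system and two LLM calls (illustrative only).
def toy_retrieve(question: str) -> List[str]:
    return ["User moved to Lisbon in March."]


def toy_answer(question: str, memories: List[str]) -> str:
    return "Lisbon" if any("Lisbon" in m for m in memories) else "unknown"


def toy_judge(question: str, gold: str, prediction: str) -> bool:
    return gold.lower() in prediction.lower()


label = evaluate_question(
    "Where does the user live?", "Lisbon", toy_retrieve, toy_answer, toy_judge
)
```

Only `retrieve` is the product being measured. Holding `answer` and `judge` fixed across vendors is what lets a score be attributed to the memory layer.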
How we re-ran the numbers
Our harness operates on a simple principle. Use each vendor's paid hosted product (or the OSS layer at its recommended configuration when no hosted product exists). Ingest the benchmark dataset through their pipeline exactly as a customer would. Then run the questions through a standardized answerer and judge that we control, so we are measuring the memory layer rather than the vendor's evaluation stack.
Concretely: gpt-5 as the answer model across the board, a binary judging prompt with explicit conditions for marking both CORRECT and WRONG (no "lean toward yes" bias, no one-directional override clauses, no encoded dataset hints), five-seed averaging to control for stochasticity, and the same ingestion configuration we would recommend to any builder using the vendor SDK in production.
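Five-seed averaging is simple to state precisely. The sketch below computes accuracy per seed, then the mean and sample standard deviation across seeds; the labels are toy values invented for the example, and the function name is ours.

```python
import statistics
from typing import List, Tuple


def mean_accuracy_over_seeds(per_seed_labels: List[List[bool]]) -> Tuple[float, float]:
    """Return (mean accuracy, sample standard deviation) across seeded runs."""
    per_seed_accuracy = [sum(labels) / len(labels) for labels in per_seed_labels]
    return statistics.mean(per_seed_accuracy), statistics.stdev(per_seed_accuracy)


# One list of binary judge labels per seed (toy data, not benchmark output).
toy_runs = [
    [True, True, False, True],
    [True, False, False, True],
    [True, True, True, True],
    [True, True, False, True],
    [False, True, False, True],
]

mean_acc, spread = mean_accuracy_over_seeds(toy_runs)
```

Reporting the spread next to the mean shows whether a few points of difference between two systems is signal or seed noise.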
The headline reproduction finding so far is the gap on LongMemEval, and it has two parts because we ran Mem0 twice. Before Mem0's April 14, 2026 announcement of their new state-of-the-art numbers, we ingested the LongMemEval haystack into their hosted product and ran the questions through our standardized answerer and judge. We got 57.5%. After their April 14 push (which followed the prompt-tuning commits we will detail below), we re-ran the same evaluation against their updated hosted product. We got 73.8%. The Mem0 memory layer genuinely improved by 16.3 points across that window, which is real engineering progress worth acknowledging. The published claim from the same announcement was 93.4%, still 19.6 points above the post-April-14 reproduction on the same memory system and the same data.
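The arithmetic behind those two deltas, using only the three figures quoted above (variable names are ours):

```python
pre_update_observed = 57.5    # our run before the April 14 update
post_update_observed = 73.8   # our run after the April 14 update
published_claim = 93.4        # figure from the vendor announcement

# Improvement attributable to the product update, in percentage points.
product_improvement = round(post_update_observed - pre_update_observed, 1)

# Remaining distance between the published claim and the latest reproduction.
reproduction_gap = round(published_claim - post_update_observed, 1)
```

Both values are differences in percentage points, not relative changes.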
The reason the numbers differ is not the memory layer. It is what the vendor stacks on top of the memory layer at evaluation time. In Mem0's case specifically, this stack lives in the answer and judge prompt files at github.com/mem0ai/memory-benchmarks. We audited the files at the exact commits that ship with their published claims:
The LongMemEval prompts were committed on April 3, 2026, and the LoCoMo prompts on April 9, 2026. Both land before Mem0's April 14 number announcement; the file-level evidence of prompt-tuning is built into Mem0's own commit history (their April 3 commit message reads, in part: "Sync prompts from evals: CONTEXT CHECK, Rule 14 (contradictions), BIAS CHECK in judge, 5-step FINAL CHECK").
The mechanisms surface clearly once you read those files. There are 14 dataset-specific equivalence rules in the answer prompt that map 1-to-1 to specific public LongMemEval question_ids (samples include "chandelier counts as jewelry" at line 145 and "scratch grains count as new layer feed" at line 147). There is a hidden chain-of-thought block (<mem_thinking> tags at lines 53 and 65) where those rules get applied before the visible answer is emitted; the judge only ever sees the cleaned answer. There is an explicit "lean toward yes" instruction in the LongMemEval judge prompt at line 269, paired with a 5-step gauntlet to clear before marking anything WRONG at line 328, and no symmetric gauntlet before marking anything CORRECT. There is a one-directional gold-override clause in the LoCoMo judge at line 212 that can promote a wrong prediction to correct when "evidence supports" it, but explicitly cannot demote a correct prediction to wrong when evidence contradicts it.
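Two of these mechanisms are easy to picture in code. The sketch below is our own illustration of the behavior described in the prose, not code from the audited repository: one function strips a `<mem_thinking>` block so that only the cleaned answer remains, and one models a grading rule that can promote but never demote. The function names and the sample output string are made up.

```python
import re

# Matches a hidden reasoning block, including newlines inside it.
THINKING_BLOCK = re.compile(r"<mem_thinking>.*?</mem_thinking>", re.DOTALL)


def visible_answer(raw_model_output: str) -> str:
    """Drop hidden reasoning so only the final answer text remains."""
    return THINKING_BLOCK.sub("", raw_model_output).strip()


def one_directional_override(matches_gold: bool, evidence_supports: bool) -> bool:
    """Model a grading rule that can promote a miss but never demote a match.

    There is deliberately no 'evidence contradicts' input: under a
    one-directional rule, contradiction has no effect on the label.
    """
    return matches_gold or evidence_supports


# Illustrative model output, invented for the example.
raw = "<mem_thinking>Apply rule: treat item A as category B.</mem_thinking>\nThe answer is 3."
cleaned = visible_answer(raw)
```

A judge that receives `cleaned` cannot tell whether the answer came from retrieved memory or from a rule applied inside the stripped block. A symmetric grading rule would need a second input that can flip a match to WRONG.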
Figure: receipt-card screenshots of the cited prompt lines, each captioned with the file:line citation given in the prose above.
Every one of these findings is documented with verbatim quotes pinned to the specific commit and line, SHA-256 hashes on mirrored copies of both prompt files, and Wayback Machine archive URLs for independent third-party timestamped copies. The full evidence chain sits in the receipts section at the end of this post.
This pattern is not unique to Mem0. We are extending the testing to other vendors and will publish their reproductions as they complete. Across the vendors measured so far, the gap between published and observed tracks directly with how much benchmark-specific prompt engineering sits between the memory system and the headline number. It does not track with the quality of the memory system itself.
Synap, our own product, was tested through the same harness with no benchmark-specific prompt advantages. Our numbers appear in the results table below alongside everyone else's, in every category we have measured, including the categories where we are weaker than the published competition.
Results
Reproduction table, LongMemEval, mid-2026:
The Zep number was reproduced on our April 10, 2026 harness; Zep has not independently verified the configuration we used.
The Mem0 row is the one to read carefully. The 57.5% to 73.8% lift across the April 14 product update is real, and it is an improvement Mem0 has earned the right to claim. The 73.8% to 93.4% jump from observed to published, however, is not attributable to the memory system. That part of the gap maps cleanly onto the prompt mechanisms documented in the methodology section above.
Per-category breakdown, LongMemEval (Synap on current harness):
The reproduction gap is the most interesting column in any of these tables. It is, roughly, a measure of how much non-memory-layer engineering is sitting between the underlying system and the headline number a vendor publishes. The harness, the methodology, and the seeds are documented in the section above and in the repo linked at the end.
What is hard, what is coming
Genuine cross-session synthesis is the hardest unsolved piece. Today, even the strongest memory layers do per-session extraction well and struggle to link the same entity across sessions when the surface representation differs. The benchmarks reflect this directly: LongMemEval multi-session is the lowest score for almost every vendor on the leaderboard. Solving it is part vector retrieval, part graph reasoning, and part ontology engineering, which is why no single team has nailed it yet.
Temporal reasoning at scale is the next frontier. Calculating "what was the state of X two months ago" requires the memory system to maintain versioned facts and answer questions against a specific point in time, not just retrieve the current version and call it good. Most vendors approximate this with date metadata and good retrieval. Few do it as a first-class feature. This will start to matter more as agents move into use cases where past state is operationally consequential: compliance, audit logging, longitudinal personalization, financial agents that need to reason about what was true at a transaction date.
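A minimal sketch of what "versioned facts" means in practice: keep every value with its effective date and answer queries as of a given day. The `VersionedFact` class is a hypothetical illustration built on the standard library, not a description of how any vendor stores facts.

```python
import bisect
from datetime import date
from typing import Optional


class VersionedFact:
    """Keep every historical value of one fact, ordered by effective date."""

    def __init__(self) -> None:
        self._dates = []   # effective dates, kept sorted
        self._values = []  # values aligned with self._dates

    def set(self, effective: date, value: str) -> None:
        # Insert while preserving chronological order.
        i = bisect.bisect_right(self._dates, effective)
        self._dates.insert(i, effective)
        self._values.insert(i, value)

    def as_of(self, when: date) -> Optional[str]:
        # Latest version whose effective date is on or before `when`.
        i = bisect.bisect_right(self._dates, when)
        return self._values[i - 1] if i else None


# Toy history, made up for the example.
plan = VersionedFact()
plan.set(date(2026, 3, 1), "enterprise")
plan.set(date(2026, 1, 10), "starter")

in_february = plan.as_of(date(2026, 2, 15))
in_april = plan.as_of(date(2026, 4, 1))
before_signup = plan.as_of(date(2025, 12, 1))
```

Retrieving only the current value would answer "enterprise" for every date. The point-in-time lookup answers "starter" for February and returns nothing for dates before the first record.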
Multi-tenant isolation paired with intelligent organizational sharing is where the enterprise segment gets decided. Enterprise buyers ask about this within the first three calls, and most memory products treat it as a configuration concern rather than an architectural one. The hard version is harder than it sounds: isolate users by default, share organizational context (vocabulary, processes, prior decisions) automatically when appropriate, expose the boundary as a primitive rather than a setting, and do all of it in genuinely shared infrastructure. The vendors that build this as a first-class primitive will win the enterprise segment over the next two years.
Memory is the data layer for AI. The companies treating it as a feature will lose to the companies treating it as a category. That is the bet behind every serious memory vendor in 2026, including us.
Receipts and reproduction
Everything in this post is reproducible.
The harness we built and ran is open-source. One command runs the standardized binary-judge configuration against either ingest pipeline. Pull requests welcome, from vendors and independent researchers both.
The Mem0 evidence chain, in full:
- LongMemEval prompts.py pinned at commit bd063eea04de4f8a19927beea155afa094a01905 (committed April 3, 2026 by Soumil Rathi). The 14 dataset-equivalence rules live at lines 138-148. The hidden chain-of-thought instruction lives at lines 53 and 65. The "lean toward yes" bias check is line 269. The 5-step FINAL CHECK before marking WRONG is lines 328-334.
- LoCoMo prompts.py pinned at commit edcd6f1d42400837b1fcb6997716f1769dc51a37 (committed April 9, 2026, same author). The opinion-question shortcuts are lines 81-82. The hardcoded LoCoMo session window ("All events occurred in 2022-2024. Never output 2025 or 2026.") is lines 64-65. The one-directional gold-override clause is line 212.
- We mirrored both files locally at those commits. SHA-256 of the LongMemEval prompts.py: ba8cf60d26f1390ecbef0f07b3e950556fe3bc5a37ba4b5343f28217f18c144f. SHA-256 of the LoCoMo prompts.py: 8ebac1ef60e9ab5caf99079fdaac038b85472e81491ed35e2d2655f3927c76c2. Any independent reproducer can fetch either file at the commit and confirm the hash.
- Wayback Machine archive of the LongMemEval prompts.py at the pinned commit: web.archive.org/web/20260505163741/... (captured May 5, 2026).
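Confirming a hash takes a few lines with Python's standard `hashlib`. The sketch below assumes you have already fetched the file at the pinned commit and saved it locally; the helper names and any local path you pass in are yours to choose.

```python
import hashlib

# Digest quoted in this post for the LongMemEval prompts.py at the pinned commit.
EXPECTED_LONGMEMEVAL_SHA256 = (
    "ba8cf60d26f1390ecbef0f07b3e950556fe3bc5a37ba4b5343f28217f18c144f"
)


def sha256_of_bytes(data: bytes) -> str:
    """Hex SHA-256 digest of raw bytes."""
    return hashlib.sha256(data).hexdigest()


def sha256_of_file(path: str) -> str:
    """Hex SHA-256 digest of a file, read in binary mode."""
    with open(path, "rb") as handle:
        return sha256_of_bytes(handle.read())


def matches_published_hash(data: bytes, expected_hex: str) -> bool:
    """True when the bytes hash to the expected hex digest."""
    return sha256_of_bytes(data) == expected_hex.lower()
```

For example, `sha256_of_file("prompts.py") == EXPECTED_LONGMEMEVAL_SHA256` should hold for a byte-identical copy. Line-ending conversion on checkout changes the bytes and therefore the hash, so fetch the raw file.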
Benchmark sources:
- LoCoMo
Disclosures: we tested Mem0 through their paid hosted product, ran their published benchmark harness both verbatim and modified, and read their prompts.py at the commits above. SuperMemory and Zep were tested on our April 10, 2026 harness (numbers cited in the table). We have not yet completed LoCoMo reproduction for Mem0's post-April-14 product; that run is in progress and will update the table when complete. We have not yet ingested Letta or Cognee on the harness.
Mem0 has been invited to respond publicly. If they publish a correction with evidence we will update the post. The repo accepts pull requests from any vendor who wants their configuration tested differently. Reproducibility cuts in both directions, and we would rather have the table be right than be flattering.



