We launched Maximem Synap today. We also ran it against LongMemEval and scored 90.2%.
Then we did something we wish were more common in this space. We ran as many competitors as we could on the exact same harness, with identical hardware and identical data. No system received special treatment or tuned configurations. Each ran in its standard production setup.
| System | LongMemEval Score (observed) |
|---|---|
| Synap | 90.2% |
| Mem0 | 57.5% |
| Zep | 63.8%* |
| Supermemory | 71.3% |
The accuracy gap speaks for itself.
What LongMemEval actually measures
LongMemEval tests a specific, operational question: can your memory system retrieve the correct fact from a conversation history, and does that accuracy hold as the conversation grows longer?
The benchmark ingests conversation histories ranging from short exchanges to massive multi-turn dialogues spanning hundreds of thousands of tokens. It then asks factual questions about those conversations. What did the user say about their dietary preferences? When did the team decide to switch vendors? What was the customer's original complaint?
For each question, the system retrieves context and provides an answer. LongMemEval measures whether the answer is correct and whether accuracy degrades with conversation length. A system that scores 92% at 50K tokens and collapses to 65% at 500K tokens might look impressive in a demo environment but will not survive a production workload where conversations routinely grow past that threshold.
This matters operationally because agent builders consistently report spending 40% or more of their development time managing context: stuffing conversations into prompts, truncating history, segmenting context manually, implementing time-window logic that breaks in edge cases. A memory system that holds accuracy at scale becomes infrastructure you stop thinking about. A system that degrades becomes one more thing you work around.
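To make the idea of length-dependent accuracy concrete, here is a minimal Python sketch of how per-question results could be grouped by conversation length. It is an illustration only, not the LongMemEval harness or its scoring code: the `EvalResult` record, the `accuracy_by_length` helper, and the bucket edges are all hypothetical names and values chosen for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class EvalResult:
    """One graded question (hypothetical record shape, assumed for this sketch)."""
    history_tokens: int  # size of the conversation history that was ingested
    correct: bool        # whether the system's answer was judged correct


def bucket_label(tokens: int, edges: list[int]) -> str:
    """Map a token count to the first bucket whose upper edge contains it."""
    for edge in edges:
        if tokens <= edge:
            return f"<={edge}"
    return f">{edges[-1]}"


def accuracy_by_length(results: list[EvalResult], edges: list[int]) -> dict[str, float]:
    """Return accuracy per conversation-length bucket."""
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for r in results:
        label = bucket_label(r.history_tokens, edges)
        totals[label] += 1
        hits[label] += int(r.correct)
    return {label: hits[label] / totals[label] for label in totals}


# Made-up results purely to show the shape of the output.
sample = [
    EvalResult(40_000, True),
    EvalResult(50_000, True),
    EvalResult(400_000, True),
    EvalResult(500_000, False),
]
per_bucket = accuracy_by_length(sample, edges=[50_000, 500_000])
print(per_bucket)
```

Comparing the value for the shortest bucket with the value for the longest one gives a rough picture of how much a system degrades as history grows, which is the property the paragraphs above describe.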
Why the gap is architectural, not incremental
We will not claim this comes from clever prompt engineering or model selection alone, though both play a role. The gap largely exists because most memory systems treat every agent, every domain, and every conversation identically.
A customer support agent cares about ticket history, resolution patterns, plan details, and previous escalations. A voice concierge cares about guest preferences, room availability, and booking constraints in real-time. A research assistant cares about source provenance, citation accuracy, and how the user's question has evolved over the course of a session. A universal memory architecture serves all three with the same pipeline, the same embedding strategy, and the same retrieval logic.
Synap takes a different approach. The context architecture is customized per agent. The retrieval pipeline learns what matters for a given agent and domain, and prunes what does not. This customization happens during setup, not during every inference call. You configure the shape of your problem once; retrieval adapts from there.
The practical consequence is that longer conversations make the system better, not worse. As more context accumulates, the entity graph grows richer, the resolution patterns become more precise, and the system gets better at distinguishing signal from noise. This is the opposite of what happens with universal vector-based systems, where the embedding space becomes noisy at scale and semantically similar but factually distinct statements become indistinguishable.
The dimensions LongMemEval leaves out
Accuracy is one dimension. It is not the only one that matters.
LongMemEval does not test consistency: whether you get the same answer if you rephrase the question slightly. It does not test false recall: whether the system ever confidently returns a memory that never happened. It does not measure context rot resistance: how well the memory system protects downstream agent output from degrading as context accumulates over long sessions.
These gaps are not a criticism of LongMemEval. It tests what it tests, and it does that well. But they explain why we are building ACM-Bench.
ACM-Bench will measure accuracy, memory consistency, false recall rate, latency profiles, token efficiency, and context rot resistance across ten realistic agent scenarios. It will be open-source, vendor-neutral, and designed so that any vendor can submit their system and have results published transparently. We are inviting all peers in the space to participate. (More on that soon.)
The goal is straightforward: end the dynamic where every vendor runs different benchmarks with different configurations on different hardware and claims victory. One benchmark. One protocol. Reproducible results anyone can verify. We are targeting a May release.
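As a rough illustration of two of the dimensions described in this section, consistency under rephrasing and false recall, the Python sketch below shows one way such rates could be computed. These are assumed definitions written for this example; they are not ACM-Bench's metrics, which have not been published. The function names, input shapes, and the simple lowercase-and-strip normalization are all hypothetical.

```python
from typing import Optional


def consistency_rate(answers_by_question: dict[str, list[str]]) -> float:
    """Fraction of questions whose paraphrases all produced the same answer.

    Assumption: answers are compared after trimming whitespace and lowercasing.
    A real benchmark would likely need a more robust equivalence check.
    """
    if not answers_by_question:
        return 0.0
    consistent = sum(
        1
        for answers in answers_by_question.values()
        if len({a.strip().lower() for a in answers}) == 1
    )
    return consistent / len(answers_by_question)


def false_recall_rate(responses: list[Optional[str]]) -> float:
    """Fraction of unanswerable questions that still received an answer.

    Assumption: every question here asks about something that never appeared
    in the conversation history, and None means the system abstained.
    """
    if not responses:
        return 0.0
    return sum(r is not None for r in responses) / len(responses)


# Made-up inputs purely to show how the functions are called.
paraphrase_answers = {
    "dietary_preference": ["Vegan", "vegan "],
    "vendor_switch_date": ["March", "April"],
}
unanswerable_responses = [None, "Paris", None, None]

print(consistency_rate(paraphrase_answers))
print(false_recall_rate(unanswerable_responses))
```

Keeping the two rates separate matters because a system can be perfectly consistent while still confidently recalling things that never happened.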
How to verify every number in this post
We open-sourced the entire LongMemEval evaluation harness. It is on our GitHub. You can download it, run it on your own hardware, and reproduce every number in this post.
The configuration is documented: hardware specifications, model versions, prompt templates, random seeds. If you want to run your own head-to-head comparison on your own infrastructure, the methodology is published and the code is published.
We would rather have someone surface a flaw in our methodology than publish results nobody can check. Transparency and reproducibility are the only way to build trust in benchmarks, and we think this space needs more of both.
Try Synap
You can start free at synap.maximem.ai. No credit card. Setup takes under a minute. Connect your framework and run an agent. You will see the latency yourself.
The first one hundred customers get three months of our Pro tier ($500 per month) for free. The developer SDK is open source. The eval harness is open source.
If you are evaluating memory infrastructure for a production agent, the benchmarks tell part of the story. But nothing replaces testing on your own conversations, your own domains, and your own latency constraints.
Start at synap.maximem.ai.



