THE BENCHMARK QUESTION

The benchmarks are rigged. Everyone grades themselves.

This one cuts against us, because we lead with being first on LongMemEval, so let me meet it head-on rather than talk around it.

The criticism is fair. Much of this category grades its own homework, vendor disputes vendor on methodology, and there is little independent validation, which means every number, including ours, should be read with suspicion until you can see how it was produced.

The fair version

So here is how we handle it. We show the methodology and link the evaluation. We report the cases where we lose. And we run against ourselves the tests the skeptics correctly say nobody runs: memory pollution with contradictory facts, strategic forgetting, and concurrent-user stress. We also think the metric itself is usually the wrong one. The question is not "did it retrieve the right note?" The question is "did memory improve the next action?" We would rather be measured on that, even when it is harder.
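To make the difference between those two questions concrete, here is a minimal sketch. It assumes each evaluation episode records three outcomes: whether retrieval surfaced the right note, and whether the next action was correct with memory on and with memory off. The Episode fields and function names are hypothetical, chosen for illustration; they are not taken from the published harness.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    # All three fields are hypothetical, for illustration only.
    retrieved_right_note: bool      # did retrieval surface the relevant note?
    action_ok_with_memory: bool     # was the next action correct with memory on?
    action_ok_without_memory: bool  # was it correct with memory off (baseline)?


def retrieval_hit_rate(episodes):
    """Fraction of episodes where the right note was retrieved."""
    if not episodes:
        raise ValueError("need at least one episode")
    return sum(e.retrieved_right_note for e in episodes) / len(episodes)


def action_uplift(episodes):
    """Action success rate with memory minus the rate without it."""
    if not episodes:
        raise ValueError("need at least one episode")
    with_memory = sum(e.action_ok_with_memory for e in episodes) / len(episodes)
    without_memory = sum(e.action_ok_without_memory for e in episodes) / len(episodes)
    return with_memory - without_memory


# Toy data: retrieval looks strong, but memory changes the outcome only once.
episodes = [
    Episode(True, True, False),    # right note, and it fixed the action
    Episode(True, False, False),   # right note, action still wrong
    Episode(True, True, True),     # right note, action was fine anyway
    Episode(False, False, False),  # wrong note, action wrong
]

print(retrieval_hit_rate(episodes))  # 0.75
print(action_uplift(episodes))       # 0.25
```

In this toy data the retrieval score is 75%, yet memory improves the next action in only one episode out of four. A system can look good on the first number while doing little for the second.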

The three figures we cite are the same everywhere on this site, and each one comes from a published run you can reproduce. On LongMemEval we score 92%. On LoCoMo we score 93.2%. P50 retrieval latency is under 15 ms. The harness is open source. Re-run it.
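For readers checking a latency figure themselves: P50 is the median of the per-query latencies. The sketch below shows one way to compute it with the Python standard library. The retrieve argument is a placeholder for whatever retrieval call you are timing, and the sample values are made up for the example; neither comes from the published harness.

```python
import statistics
import time


def p50(samples_ms):
    """P50 is the median of the latency samples."""
    return statistics.median(samples_ms)


def measure_p50_ms(retrieve, queries):
    """Time each call to `retrieve` (a placeholder callable) and return P50 in ms."""
    samples_ms = []
    for query in queries:
        start = time.perf_counter()
        retrieve(query)
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    return p50(samples_ms)


# Made-up samples: one slow outlier does not move the median.
samples_ms = [12.0, 9.0, 40.0, 11.0, 14.0]
print(p50(samples_ms))  # 12.0
```

Because the median ignores outliers, a P50 figure says nothing about tail latency. When you re-run a harness, it is worth looking at the full distribution as well.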

A number you cannot inspect is a claim, not proof. Inspect ours.

See the methodology behind the numbers.