
Most Agent Eval Frameworks Are Wrong. Here's What Actually Works

Maximem Team
May 16, 2026

Your Agent Is Silently Degrading. Here's How to Catch It

Most AI agent teams don't have evals. They ship code, watch logs in Slack, and fix things when users complain. It's a reactive cycle. This works until a subtle quality drift creeps in over weeks, adoption flatlines, and nobody can actually explain why their agent stopped working as advertised.

The failure mode isn't a crash. It's quiet degradation.

Evals catch that drift. They're how you notice something went wrong before your customer support team does. But here's what throws people off: agent evals aren't the same beast as model evals. Agents aren't just input-output machines. They plan. They call tools. They observe what those tools return. They adjust and call more tools. They fail gracefully (or fail messily). Measuring whether the final answer is correct? Necessary. Not sufficient.

I've built eval frameworks at a few companies, and this post distills where three sources converge: what Anthropic has published, what AWS recommends, and what the open-source community has figured out. It's not theoretical. It's what actually works.


Agent Evals vs. Model Evals: Why the Difference Matters

The distinction is worth being direct about because teams confuse these constantly.

Model evals are straightforward: feed the model an input, check if the output is correct. Input to output. One step. Done.

Agent evals are fundamentally different. The agent plans, calls tools, observes results, adjusts, calls more tools, and eventually produces an output. The final answer matters, sure. But the path matters equally. Did it choose the right tools? Did it call them in a sensible order? When a tool failed, did it recover or just give up? How much did this cost in tokens? Did it stay within its scope or do something it shouldn't have?

Here's what happens when you ignore the path: an agent can arrive at the correct answer through bad reasoning. It got lucky. It made three unnecessary tool calls first. It misinterpreted the third tool's output and accidentally called the right thing anyway. These fragility patterns never show up if you only check the final answer. You'll think everything is fine until something shifts slightly and the whole thing falls apart.

Multi-step failure propagation is real. An error in step 2 corrupts steps 3 through 10. A single metric at the end misses the root cause entirely.

And then there's the nondeterminism problem: run the same agent on the same task three times and you may get three different results. LLMs aren't deterministic. Running your eval suite once is essentially rolling dice. Proper agent evals require running multiple trials per task and looking at consistency across them. This isn't paranoia; it's the baseline expectation.

The vocabulary Anthropic and the community settled on makes this much clearer, so let's use it. A "task" is one test case with defined inputs and success criteria. A "trial" is one attempt at that task (you run multiple trials per task). A "transcript" is the complete record of what happened during a trial: every output, every tool call, every reasoning step. The "outcome" is the final state when the trial ends. That vocabulary matters because it changes how you think about problems.
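That vocabulary maps naturally onto simple records. Here's a minimal sketch of the task/trial/transcript/outcome hierarchy; the field names are illustrative, not any particular framework's schema:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Task:
    """One test case: defined inputs and success criteria."""
    task_id: str
    inputs: dict
    success_criteria: str

@dataclass
class Trial:
    """One attempt at a task; each task gets multiple trials."""
    task_id: str
    # The transcript records everything: outputs, tool calls, reasoning steps.
    transcript: list = field(default_factory=list)
    # The outcome is the final state when the trial ends.
    outcome: str = "pending"

task = Task(
    task_id="orders-csv",
    inputs={"query": "orders from the past 30 days"},
    success_criteria="valid CSV with order ID, date, total",
)
trial = Trial(task.task_id)
trial.transcript.append({"type": "tool_call", "name": "get_orders", "args": {"days": 30}})
trial.outcome = "success"
```

Once trials and transcripts are first-class objects rather than log lines, everything downstream (scoring, consistency checks, human review queues) gets easier.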


What You're Actually Measuring: Five Dimensions

Most teams pick a single metric, usually accuracy, and call it done. Don't.

There are five distinct dimensions you need to evaluate, each measuring something different, each exposing different failure modes. Some of them don't directly correlate with each other either, which makes this harder but necessary.

Correctness is the obvious one. Did the agent produce the right answer? Did it complete the task correctly? Most teams measure this. Most teams only measure this. It's necessary. It's not sufficient.

Tool Use Quality is the second. Did the agent choose the right tools? Did it call them in a sensible order? Did it interpret the outputs correctly? What you're really measuring is tool selection accuracy and tool call efficiency. How many calls did it take to solve the problem? Industry data is messy here, but agents are averaging 50 tool calls per complex task. Some of that is fundamental to the problem. Some of it is wasted motion. Efficiency matters when you're at scale. Evaluating agent skills properly helps you understand where the waste is coming from.
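Tool selection accuracy and call efficiency fall out of the transcript directly. A rough sketch, assuming you've annotated each task with the tools an ideal run would use (the transcript shape and metric definitions here are illustrative):

```python
def tool_metrics(transcript, expected_tools):
    """Score a trial's tool use against the expected tool set for its task."""
    calls = [step["name"] for step in transcript if step.get("type") == "tool_call"]
    # Selection accuracy: fraction of expected tools the agent actually used.
    selection_accuracy = len(set(calls) & set(expected_tools)) / len(expected_tools)
    # Efficiency: ideal call count over actual call count (1.0 = no wasted motion).
    efficiency = min(1.0, len(expected_tools) / max(len(calls), 1))
    return selection_accuracy, efficiency

transcript = [
    {"type": "tool_call", "name": "search_orders"},
    {"type": "tool_call", "name": "search_orders"},  # redundant repeat call
    {"type": "tool_call", "name": "format_csv"},
]
acc, eff = tool_metrics(transcript, expected_tools=["search_orders", "format_csv"])
# acc == 1.0 (both expected tools used); eff ≈ 0.67 (3 calls for a 2-call task)
```

An agent can score 1.0 on selection and still bleed efficiency, which is exactly the wasted-motion pattern described above.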

Cost Efficiency matters because tokens don't cost nothing. How many tokens did this task consume? Is that cost justified by the outcome? Multi-agent systems use 15x more tokens than single model conversations, by the way. That directly affects whether your product is sustainable at scale. Track tokens per successful completion. Understanding the agent cost stack helps you make smarter trade-offs here.
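"Tokens per successful completion" is worth spelling out, because failed trials still burn tokens and should count against the successes. A minimal sketch:

```python
def tokens_per_success(trials):
    """Total tokens across all trials divided by successful completions.
    Failed trials still consume tokens, so they inflate the cost of each success."""
    total_tokens = sum(t["tokens"] for t in trials)
    successes = sum(1 for t in trials if t["outcome"] == "success")
    return total_tokens / successes if successes else float("inf")

trials = [
    {"tokens": 12_000, "outcome": "success"},
    {"tokens": 18_000, "outcome": "failure"},
    {"tokens": 10_000, "outcome": "success"},
]
# (12000 + 18000 + 10000) / 2 successes = 20000 tokens per successful completion
cost = tokens_per_success(trials)
```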

Latency is where teams deprioritize things until it's too late. Time from user input to complete response. For interactive agents, the time to first token often matters more than total time. For voice agents, if you're not at sub-800 milliseconds, users notice the pause and lose confidence in the system. You can have the perfect answer arrive too slowly to be useful.

Safety and Alignment keeps me up at night. Did the agent avoid harmful outputs? Did it stay within its authorized scope? Did it resist adversarial inputs? In practice that means measuring jailbreak resistance and out-of-scope detection rates. The research here is genuinely alarming: agent failure rates in adversarial conditions range from 40 to 80 percent. This dimension is not optional. It's critical infrastructure.


Building an Eval Pipeline That Actually Works

Most teams fall apart during implementation, which is where the real work happens.

Start with defining 20 to 50 test cases. This is the foundation and it's where teams typically rush. You need happy paths—tasks your agent should absolutely nail. You need edge cases—tasks that break things. You need production failures you've already seen. For each case, write down the input, the expected outcome, and the success criteria. Make them specific. "The agent should work" is too vague. "The agent should successfully retrieve customer orders from the past 30 days and format them as a CSV with columns for order ID, date, and total" is something you can actually measure. Start small. Twenty cases that matter are better than two hundred cases that nobody can articulate.
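Concretely, a test case is just structured data: an input, an expected outcome, and a machine-checkable success criterion. A sketch of what the CSV example above might look like (the field names and checks are illustrative, not a framework schema):

```python
# Each case pairs a specific input with a check a script can run, not a vibe.
TEST_CASES = [
    {
        "id": "orders-csv-30d",
        "kind": "happy_path",
        "input": "Export my orders from the past 30 days as a CSV.",
        "expected": "CSV with columns: order ID, date, total",
        "check": lambda output: output.splitlines()[0] == "order_id,date,total",
    },
    {
        "id": "orders-empty-range",
        "kind": "edge_case",
        "input": "Export my orders from the past 0 days.",
        "expected": "Header-only CSV, no invented rows",
        "check": lambda output: len(output.splitlines()) == 1,
    },
]

sample_output = "order_id,date,total\n1042,2026-05-01,99.50"
passed = TEST_CASES[0]["check"](sample_output)
```

If you can't write the `check` function, the success criteria aren't specific enough yet.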

Picking the right framework comes next. The ecosystem keeps shifting but there are solid options depending on your setup. DeepEval is good for general agent testing and uses a pytest-like interface with 20+ metrics built in. Ragas is specialized for RAG-heavy agents where retrieval quality is the bottleneck. Promptfoo handles prompt regression testing and runs locally, so nothing leaves your infrastructure. Braintrust bridges evals with production monitoring. Opik, from Comet, does multi-framework observability and supports LangChain, CrewAI, and AutoGen. Pick based on what your agent architecture looks like, not on prestige.

Implementation is where intent becomes real. Start with 2-3 automated metrics. Correctness. Cost. Latency. That's it. Run multiple trials per task (3 to 5 minimum) to account for nondeterminism. Log the full transcript for every trial. You'll need those transcripts when debugging why something failed. Before you make any changes to your agent, baseline what you have right now. This gives you a reference point to measure against.
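The multiple-trials loop is simple to write down. A sketch of running one task several times and reporting pass rate plus consistency; `agent_fn` is a stand-in for your actual agent invocation:

```python
import statistics

def run_task(task, agent_fn, check_fn, n_trials=5):
    """Run one task n_trials times; report pass rate and outcome consistency."""
    results = [1.0 if check_fn(agent_fn(task)) else 0.0 for _ in range(n_trials)]
    pass_rate = sum(results) / n_trials
    # 1.0 means every trial agreed; lower means flaky behavior worth investigating.
    consistency = 1.0 - statistics.pstdev(results)
    return {"pass_rate": pass_rate, "consistency": consistency, "trials": results}

# Deterministic stand-in agent, purely for demonstration.
report = run_task(
    task="export orders",
    agent_fn=lambda t: "order_id,date,total",
    check_fn=lambda out: out.startswith("order_id"),
)
# report["pass_rate"] == 1.0 for this deterministic stand-in
```

A task that passes 3 of 5 trials is a different (and scarier) problem than one that fails all 5, and a single run can't tell you which you have.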

Human review is the step most guides skip and most teams skip too. Automated metrics catch maybe 80 percent of issues. The remaining 20 percent needs human eyes. Review 10 to 20 percent of the failed transcripts manually. Look for reasoning errors that your metrics don't capture. Look for subtle quality issues. Look for tone or style drift. AWS calls this the "crawl-walk-run" pattern: start with internal review, then expand the review pool as your confidence grows.

Finally—and this is what separates teams that ship reliably from teams with an eval script nobody runs—integrate evals into CI/CD. Most teams implement evals as a manual process that runs sometimes and gets ignored. The real move is building the eval suite into your continuous integration pipeline. Run evals on every code change and every prompt change. Fail the build if key metrics regress beyond your threshold. Store historical results to track trends. The pattern is straightforward: Code change → Run eval suite (5 to 10 minutes) → Compare against baseline → Pass or block. This step is what differentiates you.
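The compare-against-baseline gate is a few lines of code. A sketch, where the threshold values are illustrative defaults you'd tune to your product:

```python
# Regression thresholds (illustrative): correctness may drop at most 2 points;
# cost and latency may grow at most 10% and 15% respectively.
THRESHOLDS = {"correctness": -0.02, "cost_tokens": 0.10, "latency_ms": 0.15}

def gate(baseline, current):
    """Return the list of metrics that regressed beyond their threshold.
    An empty list means the build passes; anything else blocks it."""
    failures = []
    if current["correctness"] - baseline["correctness"] < THRESHOLDS["correctness"]:
        failures.append("correctness")
    for metric in ("cost_tokens", "latency_ms"):
        drift = (current[metric] - baseline[metric]) / baseline[metric]
        if drift > THRESHOLDS[metric]:
            failures.append(metric)
    return failures

baseline = {"correctness": 0.92, "cost_tokens": 40_000, "latency_ms": 2_400}
current = {"correctness": 0.91, "cost_tokens": 47_000, "latency_ms": 2_500}
failed = gate(baseline, current)
# failed == ["cost_tokens"]: cost grew 17.5%, past the 10% threshold
```

Wire this into your CI step, exit nonzero when `failed` is non-empty, and the "eval script nobody runs" problem disappears.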


Three Patterns That Kill Reliability

I've watched teams make these mistakes repeatedly.

Evaluating only the final answer is the first. You ignore the transcript, which means you'll never find root causes. An agent can produce the right answer through the wrong process. It made five unnecessary tool calls first. It misinterpreted something and accidentally recovered. It got lucky. This creates fragile systems that break unexpectedly. Always log and review the full transcript.

Running evals once at launch is the second. Agents drift. Model updates happen. Your tool's API changes its behavior. Evals have to run continuously. Run them on every change. Make it automatic. The moment you treat evals as a one-time activity, you've already lost visibility into what's happening.

Measuring everything and acting on nothing is the third. I've seen dashboards with 50 metrics on them. Nobody reads them. Pick 3 to 5 metrics that actually matter to your product. Set thresholds. When a threshold is breached, automate a response. Send a Slack alert. Block the deployment. Don't create surveillance theater.
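"Set thresholds and automate a response" can be this small. A sketch where `notify` is a stand-in for whatever alerting you already have (a Slack webhook, PagerDuty, a deploy blocker):

```python
def on_breach(metric, value, threshold, notify):
    """Fire an automated response when a tracked metric crosses its threshold,
    instead of waiting for someone to read a dashboard."""
    if value < threshold:
        notify(f"EVAL ALERT: {metric} dropped to {value:.2f} (threshold {threshold:.2f})")
        return True
    return False

alerts = []
breached = on_breach("correctness", 0.84, 0.90, notify=alerts.append)
# In production, notify would post to Slack or block the deploy;
# here it just collects the message for inspection.
```

Three metrics with wired-up responses beat fifty metrics on a dashboard every time.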


Build This Tomorrow

Evals aren't optional infrastructure. They're the difference between "our agent works" and "our agent works reliably." The best teams treat their eval suite the way backend engineers treat their test suite: it runs on every change, it catches regressions before users ever see them, and it gives everyone confidence to ship fast.

If you're starting from scratch tomorrow, here's what I'd do: write 20 test cases for the core tasks your agent should handle. Pick whichever framework feels least tedious for your setup. Run the eval suite locally first, verify it works, then integrate it into CI/CD. Everything else is iteration.

Good evals require good context, by the way. If your memory layer is serving up stale or irrelevant context, your eval metrics will catch that hard. That's a feature, not a bug. If you're new to eval terminology, the AI glossary covers the foundational concepts.

Ship.


Get started: DeepEval Documentation | Braintrust Setup Guide

Read the docs: Synap Docs for context management | AI Agent Evals Course

Related posts

Why AI Forgets: Why ChatGPT, Claude, and Gemini Don't Remember You Well May 10, 2026

Voice Agent Stack: The Right Tools for Production Voice AI in 2026 May 15, 2026

AI Agent Costs: A Framework for Thinking About It April 27, 2026
