Hallucination Rate

TL;DR

Quantitative measurement of how often an AI system produces factually incorrect or unfounded outputs

You need to know how often your system hallucinates. Hallucination rate is the percentage of outputs that contain at least one hallucination. Straightforward metric, hard to measure: manual evaluation doesn't scale, and automated evaluation is imperfect. The standard approach is to sample outputs and have humans judge whether each contains a hallucination. If you sample 50 outputs and 5 contain hallucinations, the hallucination rate is 10%. Simple math, hard execution.

The definition of 'hallucination' matters. Do you count only obviously false claims? Subtle distortions? Any claim without a citation? Different definitions give different rates, so you need to be explicit. For a legal system, any uncited claim might count as a hallucination; for a casual chatbot, some creativity is acceptable. Context matters.

The benchmark data also matters. Testing on your training distribution will likely underestimate the hallucination rate; testing on out-of-distribution data will likely overestimate it. You need balanced benchmark data.

Temporal tracking is important. Monitor hallucination rate over time. Is it increasing (model degradation, distribution shift)? Decreasing (mitigation working)? Stable? Trends matter more than absolute numbers: an 8% hallucination rate that is two points down from last month is progress.

The correlation with other metrics is interesting. A higher hallucination rate usually correlates with lower groundedness scores, lower user satisfaction, and higher error rates. But not perfectly: some hallucinations are harmless, and some perfectly grounded outputs are useless.

Synap's hallucination evaluation framework measures hallucination rates systematically across your test sets, helping you understand baseline rates and track improvement from mitigations.
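As a concrete sketch of the sampling math: the helper below turns per-output human judgments into a point estimate plus a Wilson score interval. The interval is our addition, not something the metric requires, but it makes the sample-size problem visible; with only 50 samples, an observed 10% rate is consistent with anything from roughly 4% to 21%. Function name and input format are illustrative, not a standard API.

```python
import math

def hallucination_rate(judgments: list[bool], z: float = 1.96) -> tuple[float, float, float]:
    """Point estimate plus a 95% Wilson score interval for the hallucination rate.

    `judgments` holds one human verdict per sampled output:
    True means the output contained at least one hallucination.
    """
    n = len(judgments)
    if n == 0:
        raise ValueError("need at least one judged output")
    p = sum(judgments) / n                     # observed rate, e.g. 5/50 = 0.10
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return p, center - margin, center + margin

# The 5-in-50 example from the text: 10% observed, but the interval is wide.
rate, low, high = hallucination_rate([True] * 5 + [False] * 45)
print(f"rate={rate:.1%}, 95% CI [{low:.1%}, {high:.1%}]")  # ~[4.3%, 21.4%]
```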
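For temporal tracking, the simplest version is to bucket judgments by week and watch the series. A minimal sketch (Python 3.9+), assuming each judged output carries the date it was sampled; the bucketing scheme is an assumption, pick whatever cadence matches your sampling volume:

```python
from collections import defaultdict
from datetime import date

def weekly_hallucination_rates(judged: list[tuple[date, bool]]) -> dict[str, float]:
    """Bucket per-output judgments by ISO week and compute each week's rate."""
    buckets: dict[str, list[bool]] = defaultdict(list)
    for day, hallucinated in judged:
        iso = day.isocalendar()
        buckets[f"{iso.year}-W{iso.week:02d}"].append(hallucinated)
    # Sorted keys give a chronological series you can chart or alert on.
    return {week: sum(flags) / len(flags) for week, flags in sorted(buckets.items())}

rates = weekly_hallucination_rates([
    (date(2024, 5, 6), True), (date(2024, 5, 7), False),
    (date(2024, 5, 13), False), (date(2024, 5, 14), False),
])
print(rates)  # {'2024-W19': 0.5, '2024-W20': 0.0}
```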

Why It Matters

If you don't measure hallucination rate, you don't know whether it's getting better or worse. You're flying blind. Measuring it forces you to confront the problem and track whether your mitigations work. It's a key quality metric for any generative AI system.

Example

Your customer service AI has a 12% hallucination rate (measured over 500 sampled outputs). You implement RAG and remeasure: 7%. Clear improvement. You add groundedness filtering: 4%. You're tracking progress objectively. A quick significance check for the first drop appears below.
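To sanity-check that a drop like 12% to 7% is more than sampling noise, a standard two-proportion z-test works. A sketch, assuming both measurements used 500 samples (the counts below are the example's percentages converted to counts; the function is illustrative):

```python
import math

def two_proportion_z(h1: int, n1: int, h2: int, n2: int) -> float:
    """z-statistic for the difference between two measured hallucination rates."""
    p1, p2 = h1 / n1, h2 / n2
    pooled = (h1 + h2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# 12% of 500 before RAG, 7% of 500 after.
z = two_proportion_z(60, 500, 35, 500)
print(f"z = {z:.2f}")  # ~2.70: beyond 1.96, so unlikely to be sampling noise
```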
