Groundedness is whether an AI output actually connects to reality. A grounded response about climate change cites actual studies; a hallucinated one confidently describes studies that don't exist. Groundedness evals attempt to measure this.

The mechanics: given an AI output and its source material, does the output stay within the bounds of what the sources actually say? This is harder than it sounds.

First problem: defining the source material. If your AI used retrieval to fetch documents, those documents are the sources. If your AI relied only on its training data, everything is a source, and nothing is verifiable. Groundedness evals typically focus on the retrieval case because it's auditable.

Second problem: measuring groundedness is expensive. You need human judgments (does this claim follow from the sources?), which don't scale, or automated metrics that approximate groundedness, which are imperfect.

Methods vary. Simple approaches check whether key entities from the output appear in the source material. More sophisticated ones use NLI (natural language inference) models to check whether the output logically follows from the source. Some systems use LLM-as-judge: feed the output and source to a strong LLM and ask whether the output is grounded. This works but has bias issues.

The tradeoff with hallucination is interesting. You can achieve high groundedness by being vague and safe ('I'm not sure' is always grounded), but users don't want vague. So groundedness evals need context: for some tasks, high groundedness matters more than comprehensive answers; for others, you accept some hallucination risk for better coverage. The most pragmatic approach treats groundedness evals as one signal among many, not the only metric.

Synap's groundedness eval tools help developers measure how well their AI systems stick to source material, crucial for building trustworthy systems that cite evidence.
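The simplest of the methods above, entity overlap, can be sketched in a few lines. This is an illustrative toy, not any particular tool's API: the function name and the regex heuristic (capitalized tokens and numbers as "entities") are assumptions made for the sketch, and a production system would use real entity extraction.

```python
import re

def entity_overlap_groundedness(output: str, sources: list[str]) -> float:
    """Crude groundedness proxy: the fraction of checkable 'entities'
    (capitalized tokens and numbers) in the output that also appear
    somewhere in the source material."""
    entities = set(re.findall(r"\b(?:[A-Z][a-zA-Z]+|\d[\d.,%]*)\b", output))
    if not entities:
        return 1.0  # nothing checkable: trivially grounded
    source_text = " ".join(sources).lower()
    hits = sum(1 for e in entities if e.lower() in source_text)
    return hits / len(entities)

sources = ["A 2021 Stanford study measured a 12% reduction in error rates."]
print(entity_overlap_groundedness("Stanford reported a 12% drop.", sources))  # → 1.0
```

Note what this misses: an output can mention only source entities yet still assert a relation the source never made, which is why NLI models or LLM judges are used when fidelity matters.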
Why It Matters
Hallucination is the plague of current AI systems. An output that sounds confident but is completely made up damages trust. Groundedness evals force you to measure and improve this critical failure mode. They're essential for any AI application where accuracy matters. User trust depends on knowing outputs are grounded in evidence, not imagined.
Example
You build an AI for customer support that answers questions about your product. Groundedness evals check: does each answer cite actual documentation? Does it make claims unsupported by product specs? Without groundedness evals, your AI might confidently claim the product has features it doesn't, confusing customers. With evals, you catch these hallucinations before they reach users.
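A minimal version of this check can be sketched as a sentence-level screen: flag any answer sentence with low word overlap against the retrieved documentation, then route flagged sentences to human review or an LLM judge. The function name, the overlap threshold, and the word-overlap heuristic are all assumptions made for this sketch.

```python
import re

def ungrounded_sentences(answer: str, docs: list[str],
                         threshold: float = 0.5) -> list[str]:
    """Flag answer sentences whose word overlap with the retrieved
    docs falls below `threshold` -- candidates for hallucination review."""
    doc_words = {w.lower().strip(".,!?") for d in docs for w in d.split()}
    flagged = []
    # Split the answer into sentences at terminal punctuation.
    for sent in re.split(r"(?<=[.!?])\s+", answer):
        words = [w.lower().strip(".,!?") for w in sent.split()]
        if not words:
            continue
        overlap = sum(1 for w in words if w in doc_words) / len(words)
        if overlap < threshold:
            flagged.append(sent)
    return flagged

docs = ["The device supports Bluetooth 5.0 and Wi-Fi."]
answer = "The device supports Bluetooth 5.0. It can also brew coffee."
print(ungrounded_sentences(answer, docs))  # → ['It can also brew coffee.']
```

Run over a batch of logged support conversations, the flagged sentences become exactly the "features it doesn't have" claims you want to catch before they reach customers.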