Hallucination Testing for Production Agents: Why Evals Aren't Enough
What RAG hallucination testing does well
Standard RAG hallucination testing is not a dead end. It solves a specific part of the problem well.
- Groundedness scoring. Does the response cite or reflect the retrieved context? Tools like Galileo and resources like Maxim's 2026 hallucination guide cover this bucket in depth.
- Evidence linking. For each claim in the response, what retrieved passage supports it? If no passage supports it, flag the claim.
- Faithfulness evaluation. Does the response contradict the retrieved context?
For a QA system that takes a question, retrieves documents, and generates an answer, those three checks are enough to catch the dominant failure mode. Ship them, run them in CI, move on.
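To make that concrete, here is a minimal sketch of what a CI groundedness check can look like. The prompt wording, the `EvalCase` shape, and the `call_llm_judge` placeholder are assumptions about your stack, not any particular vendor's API.

```python
# Minimal sketch of a CI groundedness check for single-turn RAG.
# `call_llm_judge` is a placeholder for whatever model client you use.
from dataclasses import dataclass

@dataclass
class EvalCase:
    question: str
    retrieved_context: str
    response: str

JUDGE_PROMPT = """You are a groundedness judge.
Context:
{context}

Response:
{response}

List every claim in the response that is NOT supported by the context.
Answer "SUPPORTED" if all claims are supported."""

def call_llm_judge(prompt: str) -> str:
    # Placeholder: swap in your model client (hosted API or local model).
    raise NotImplementedError

def is_grounded(case: EvalCase) -> bool:
    verdict = call_llm_judge(
        JUDGE_PROMPT.format(context=case.retrieved_context, response=case.response)
    )
    return verdict.strip().upper().startswith("SUPPORTED")

def run_groundedness_suite(cases: list[EvalCase]) -> list[EvalCase]:
    """Return the cases whose final response is not grounded in retrieval."""
    return [c for c in cases if not is_grounded(c)]
```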
Agents break the assumption that a response is the last thing the system does.
Where agent hallucination is different
1. Fabricated IDs that survive downstream steps
Step 3 of a workflow asks the model for an order ID. The model invents one. The shape is correct: six digits, fits the schema. Step 4 calls a tool with that ID as an argument. The tool returns something reasonable (empty result, cached data, or worse, data for a different order). Step 7 produces a final email referencing the fabricated ID as if it were real.
A RAG hallucination check on step 7 does not catch this. The final response is internally consistent. The fabricated ID never appeared in retrieved context because there was no retrieval at step 3.
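A cheap trace-level guard for this specific failure is to require that any ID-shaped value passed to a tool appeared somewhere upstream in the trace. A sketch, assuming a six-digit ID format and a flat list of prior evidence strings; both are illustrative, not a fixed schema.

```python
# Sketch: flag ID-shaped tokens in a tool call that never appeared in any
# earlier evidence (retrieval, tool output, user input) in the trace.
import re

ID_PATTERN = re.compile(r"\b\d{6}\b")  # e.g. six-digit order IDs

def unsupported_ids(tool_args: str, prior_evidence: list[str]) -> set[str]:
    """IDs in the tool call that never appeared upstream in the trace."""
    seen = set()
    for chunk in prior_evidence:
        seen.update(ID_PATTERN.findall(chunk))
    used = set(ID_PATTERN.findall(tool_args))
    return used - seen

# Example: step 3 invents order ID 482913; nothing upstream mentions it.
evidence = ["User asked about their recent order.", "CRM lookup: no orders found."]
print(unsupported_ids('{"order_id": "482913"}', evidence))  # {'482913'}
```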
2. Chained-inference hallucinations
The model infers A from B, then infers C from A, then infers D from C. Each inference step looks locally plausible. The starting inference A was wrong. By the time you reach D, the chain is coherent text that is entirely disconnected from ground truth.
Single-turn evals on any step in isolation can pass, because each step is consistent with its local context. The hallucination lives in the connection between steps, not in any single step.
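One way to see why: if every claim records what it was derived from, an unsupported root claim taints everything downstream, even though each individual derivation reads fine. A minimal sketch with an illustrative dependency map:

```python
# Sketch: propagate "unsupported" through a claim dependency map.
def tainted_claims(deps: dict[str, list[str]], unsupported: set[str]) -> set[str]:
    """All claims that transitively depend on an unsupported claim."""
    tainted = set(unsupported)
    changed = True
    while changed:
        changed = False
        for claim, parents in deps.items():
            if claim not in tainted and any(p in tainted for p in parents):
                tainted.add(claim)
                changed = True
    return tainted

# A was a bad inference; C came from A and D from C, so both are suspect.
deps = {"A": ["B"], "C": ["A"], "D": ["C"]}
print(tainted_claims(deps, {"A"}))  # {'A', 'C', 'D'}
```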
3. Tool output treated as ground truth
Agents frequently ingest tool output and write about it in natural language. When a tool returns a degraded result (missing fields, stale data, error string interpreted as content), the agent often wraps it in confident prose anyway. The model is not grounded against retrieval in this case; it is grounded against tool output, and tool output is not always truth.
Groundedness scoring against retrieval misses this because the check is scoped to retrieved context, not tool output.
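A cheap sanity check on tool output before it is allowed to ground a claim goes a long way. A sketch, where the error markers and required fields are illustrative assumptions rather than any tool's actual contract:

```python
# Sketch: don't treat tool output as ground truth blindly. Reject degraded
# results (error strings, missing or empty fields) before grounding on them.
ERROR_MARKERS = ("error", "timeout", "rate limit", "not found")

def usable_as_evidence(tool_output: dict, required_fields: set[str]) -> bool:
    """Return False for tool results that should not ground any claim."""
    text = str(tool_output).lower()
    if any(marker in text for marker in ERROR_MARKERS):
        return False
    return required_fields.issubset(tool_output.keys()) and all(
        tool_output[f] not in (None, "", []) for f in required_fields
    )

# A degraded result should not ground a confident claim about shipping.
print(usable_as_evidence({"status": "error: upstream timeout"}, {"status", "eta"}))  # False
```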
Groundedness at the trace level
The generalization is: groundedness has to be checked at every step in the trace where the model produces a claim that downstream steps depend on.
That means:
- At each generation step, extract the claims.
- For each claim, identify the evidence source (retrieval, tool output, prior step output).
- Check each claim against its evidence source.
- Propagate a groundedness score across the trace, not just at the final step.
Maxim's 2026 guide covers evidence linking well for single-turn cases. The extension for agents is that the evidence source is the trace itself, not just retrieval, and the check happens at every model turn.
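A sketch of what that can look like, assuming a simple trace schema where each step records its evidence sources and the claims it produced. The `judge_claim` call is a placeholder for an LLM-as-judge or NLI check, and the schema is illustrative, not any vendor's actual format.

```python
# Sketch of trace-level groundedness: every claim names its evidence source
# (retrieval, a tool output, or a prior step) and is checked against it.
from dataclasses import dataclass, field

@dataclass
class Claim:
    text: str
    evidence_source: str   # e.g. "retrieval", "tool:crm_lookup", "step:3"
    supported: bool | None = None

@dataclass
class Step:
    index: int
    evidence: dict[str, str]        # source name -> evidence text
    claims: list[Claim] = field(default_factory=list)

def judge_claim(claim_text: str, evidence_text: str) -> bool:
    # Placeholder for an LLM-as-judge or NLI call.
    raise NotImplementedError

def score_trace(steps: list[Step]) -> float:
    """Fraction of claims, across all steps, supported by their evidence."""
    total, supported = 0, 0
    for step in steps:
        for claim in step.claims:
            evidence = step.evidence.get(claim.evidence_source, "")
            claim.supported = bool(evidence) and judge_claim(claim.text, evidence)
            total += 1
            supported += claim.supported
    return supported / total if total else 1.0
```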
Sandbox approach vs eval approach
Eval approach (RAG-centric):
- Run a test set of questions through the agent.
- Score final response groundedness against retrieval.
- Flag failures.
This catches RAG hallucinations. It does not catch fabricated IDs, chained inference, or tool-output hallucinations.
Sandbox approach (trace-level):
- Replay real production sessions through the agent inside a sandbox.
- At every generation step, record the model's claims and their dependencies.
- Check each claim against the evidence source available at that step.
- Score the trace, not just the final response.
This catches the agent-specific failure modes because it sees where in the trace a claim was introduced and whether that claim was supported by anything.
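Building on the previous sketch (reusing `Step` and `score_trace`), the sandbox loop itself can stay small. `replay_session` is a placeholder for however your sandbox re-runs a recorded production session against the current agent build; it is an assumption, not a specific platform API.

```python
# Sketch of the sandbox loop: replay recorded sessions, score each trace,
# and surface the sessions that fall below a groundedness threshold.
def replay_session(session_id: str) -> list[Step]:
    # Placeholder: re-run the recorded inputs against the agent in isolation
    # and return the resulting trace as Step objects.
    raise NotImplementedError

def validate_sessions(session_ids: list[str], threshold: float = 0.95) -> list[str]:
    """Return sessions whose trace-level groundedness falls below threshold."""
    failing = []
    for sid in session_ids:
        trace = replay_session(sid)
        if score_trace(trace) < threshold:
            failing.append(sid)
    return failing
```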
Paragon runs trace-level groundedness as part of workflow validation. Most 2026 eval platforms handle RAG groundedness well and are extending into trace-level checks; the two approaches converge over time.
FAQ
Do I need both eval and sandbox hallucination checks?
Usually yes. Use evals in CI for prompt and model changes, and the sandbox for pre-deploy trace-level groundedness. The overlap between them is small; it is the combined coverage that matters.
What if my agent doesn't use retrieval?
Trace-level groundedness still applies. Evidence sources shift from retrieved docs to tool outputs and prior steps.
How do I extract claims automatically?
LLM-as-judge is the 2026 standard. A dedicated extractor returns structured claims per step. Paragon, Galileo, and Maxim all ship this.
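A sketch of such an extractor, assuming a prompt that returns JSON; the prompt text and output shape are illustrative, not any platform's actual extractor.

```python
# Sketch: structured claim extraction with an LLM judge. Returns one record
# per claim so each can be linked to its evidence source downstream.
import json

EXTRACTOR_PROMPT = """Extract every factual claim from the text below.
Return a JSON array of objects: {{"claim": "...", "depends_on": "..."}}
where depends_on names the evidence the claim relies on, if stated.

Text:
{text}"""

def call_llm(prompt: str) -> str:
    # Placeholder for your model client.
    raise NotImplementedError

def extract_claims(step_output: str) -> list[dict]:
    raw = call_llm(EXTRACTOR_PROMPT.format(text=step_output))
    return json.loads(raw)
```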
Can a sandbox catch every hallucination?
No. Long-tail cases still need observability post-deploy.
If you want to start using Paragon, check out the docs.