What Agent Evals Miss: Regressions, Drift, and Out-of-Bounds Behavior

by Alex Ungureanu · 7 min read

What evals are good at

LLM eval platforms do real work. They are not the problem. They solve a specific job well, and that job is not the full problem of validating agents in production.

Evals are genuinely useful for:

  • Output scoring against reference answers on a curated test set.
  • Model-to-model comparison during vendor evaluation (Claude vs GPT vs Gemini).
  • Pre-release regression checks on benchmark tasks.
  • Cost and latency tracking at the single-call level.
  • CI integration with deterministic, reproducible runs.

Braintrust's 2026 overview of evaluation tooling covers where this category is strong. Its own writing acknowledges that trajectory-level and production-behavior validation are the advancing frontier, which is exactly the ground evals do not cover.

If your only problem is "which model should we use for this feature," evals are usually enough. If your problem is "is this deployed agent actually doing the right thing in production," evals are table stakes but not the ceiling.

The four things evals miss

1. Wrong-path tool-call sequences

Here is a real pattern. An agent used to call search_orders(id) and then send_email(order) for a customer service task. A prompt update is shipped. The new version calls search_customers(name) first, then search_orders(customer_id), then send_email(order). The email the customer receives is identical.

The eval passes. It scored the final email against the reference. Correct output, correct tone, correct content.

What changed: an extra tool call on every interaction, and more retrieved context flowing through the prompt. Token usage up by a factor of three or four. Latency up by 400 ms. Failure surface up because every call in the longer chain can fail. Cost per interaction up by a similar multiplier.

This is a regression. It is invisible to output-scoring evals because the output is still correct. Token usage here is a correctness signal, not a cost-savings metric: the agent is calling tools it did not need to call, and the only way to catch that is to validate the trajectory, not the output.

What catches it: a sandbox that records every tool call the agent makes and compares the trajectory to prior successful runs. When the new version takes a different path, the sandbox flags it. Composio's 2026 tool-calling guide frames trajectory validation as the next layer above output scoring, and it is.
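To make that concrete, here is a minimal sketch of the comparison, assuming the sandbox already records each run as an ordered list of tool names. The function name, the similarity cutoff, and the dict shape are illustrative, not any particular vendor's API:

```python
from difflib import SequenceMatcher

def trajectory_delta(golden: list[str], candidate: list[str]) -> dict:
    """Compare a candidate run's tool-call sequence against a known-good run.

    Inputs are ordered lists of tool names, e.g. ["search_orders", "send_email"].
    The point is to flag path changes even when the final output is identical.
    """
    matcher = SequenceMatcher(a=golden, b=candidate)
    return {
        "extra_calls": max(0, len(candidate) - len(golden)),
        "path_similarity": matcher.ratio(),  # 1.0 means identical path
        "diff": [op for op in matcher.get_opcodes() if op[0] != "equal"],
    }

# The regression from the example above: same email, longer path.
golden = ["search_orders", "send_email"]
candidate = ["search_customers", "search_orders", "send_email"]

delta = trajectory_delta(golden, candidate)
if delta["extra_calls"] or delta["path_similarity"] < 1.0:
    print("trajectory changed:", delta["diff"])
```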

2. Gradual drift (no single failure, aggregate gets worse)

Retrieval quality degrades by 8% over a month as the document corpus grows and the retriever's index falls behind. No single response fails. Every individual answer looks plausible in isolation. Quality trends down slowly.

Evals on a frozen test set will not catch this. The test set does not move with production traffic. Uptime Robot's 2026 agent monitoring guide breaks drift into semantic drift, response distribution drift, and retrieval drift. All three are aggregate-level problems. None of them show up as "this test failed."

What catches it: continuous replay of live traffic slices through a validation sandbox, compared to a rolling behavior baseline. When distributions shift, the sandbox surfaces the delta before quality drops to user-visible levels.
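One way to implement that, sketched under the assumption that each replayed trace yields a scalar quality signal for its retrieval step (a relevance score, say), using SciPy's two-sample Kolmogorov-Smirnov test as a stand-in for whatever shift detector you prefer. The class and parameter names are hypothetical; a real sandbox would track several distributions, not one:

```python
from collections import deque
from scipy.stats import ks_2samp  # two-sample test for distribution shift

class RetrievalDriftMonitor:
    """Compare a rolling window of recent scores against a frozen baseline window."""

    def __init__(self, baseline_scores: list[float], window: int = 500, p_threshold: float = 0.01):
        self.baseline = list(baseline_scores)  # scores from a known-good period
        self.recent = deque(maxlen=window)     # scores from replayed live traffic
        self.p_threshold = p_threshold

    def observe(self, score: float) -> bool:
        """Record one replayed trace; return True once the distribution has shifted."""
        self.recent.append(score)
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough recent traffic to compare yet
        result = ks_2samp(self.baseline, list(self.recent))
        return result.pvalue < self.p_threshold
```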

3. Hallucination inside multi-step workflows

Hallucination testing usually means running a RAG pipeline against a question-and-answer set and scoring whether the model invents facts. That works for single-turn RAG. It does not work for agent workflows where hallucination propagates across steps.

Here is the pattern. Step 3 of a 7-step workflow fabricates an order ID. Step 4 accepts it because the schema check only verifies that the ID is a string, not that the order exists. Step 7 produces a final output based on the fabricated ID, and it reads as coherent because the LLM is good at generating coherent text.

Run an eval on step 3 in isolation and it might pass. Run an eval on step 7 in isolation and it passes because the text is consistent with itself. The hallucination lives in the connection, not in any single step.

What catches it: groundedness scoring across the full workflow trace. Evidence-linking at every tool-call output, not just at the final response. The Paragon sandbox runs trajectory-wide groundedness checks as part of multi-step workflow validation.
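A stripped-down version of that evidence-linking idea, assuming the sandbox records each step as a dict with the tool name, the arguments the agent supplied, and the tool's output. The trace shape and helper names are illustrative, and real groundedness scoring is fuzzier than exact string matching:

```python
def check_grounded(trace: list[dict], user_inputs: set[str]) -> list[str]:
    """Flag step arguments that trace back to neither the user request nor an earlier tool output.

    A fabricated order ID at step 3 shows up here because nothing upstream
    produced it, even though every later step treats it as real.
    """
    violations = []
    evidence = set(user_inputs)  # values the user actually supplied
    for i, step in enumerate(trace):
        for name, value in step["args"].items():
            if isinstance(value, str) and value not in evidence:
                violations.append(f"step {i} ({step['tool']}): {name}={value!r} has no upstream evidence")
        evidence |= _string_leaves(step.get("output"))  # outputs become evidence for later steps
    return violations

def _string_leaves(obj) -> set[str]:
    """Collect every string leaf from a nested dict/list tool output."""
    if isinstance(obj, str):
        return {obj}
    if isinstance(obj, dict):
        obj = list(obj.values())
    if isinstance(obj, list):
        found: set[str] = set()
        for item in obj:
            found |= _string_leaves(item)
        return found
    return set()
```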

4. Boundary escape

An agent is allowed to read three CRM fields. Under a certain combination of user prompt and retrieved context, it reads 14, including fields it was never intended to touch. No tool blocked it, because the auth scope was broader than the intent, as it usually is. No eval tested this boundary, because the eval suite covered happy paths.

80% of teams running production agents have observed at least one boundary violation, a figure drawn from Polarity's own pilot data and corroborated by 2025-2026 industry surveys. This is not a rare class of failure.

The reason evals miss boundary escape is structural. Eval suites test what the author remembered to test. Boundary conditions come from combinations the author did not foresee. You cannot write an eval set that covers the full space of out-of-bounds behavior. You can write a policy and enforce it at tool-call time.

What catches it: policy validation inside the sandbox. Microsoft's 2026 Foundry guidance on trustworthy agents covers structured tool-invocation schemas and just-in-time authorization. The pattern is: every tool call passes through a runtime policy check that evaluates identity, scope, and requested action before execution.
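A minimal sketch of that gate, covering only tool scope and field scope (identity and session checks would hang off the same hook). The types, field names, and policy shape are illustrative, not Foundry's or Paragon's API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Policy:
    """Declared boundary for one agent: which tools and which fields it may touch."""
    allowed_tools: frozenset[str]
    allowed_fields: dict[str, frozenset[str]]  # tool name -> fields it may read

class BoundaryViolation(Exception):
    pass

def guard_tool_call(policy: Policy, tool: str, requested_fields: list[str]) -> None:
    """Runtime check evaluated before every tool call; raise instead of executing."""
    if tool not in policy.allowed_tools:
        raise BoundaryViolation(f"tool {tool!r} is outside the agent's declared scope")
    illegal = set(requested_fields) - policy.allowed_fields.get(tool, frozenset())
    if illegal:
        raise BoundaryViolation(f"{tool} requested out-of-bounds fields: {sorted(illegal)}")

# The CRM example: the agent is declared to read exactly three fields.
crm_policy = Policy(
    allowed_tools=frozenset({"read_crm"}),
    allowed_fields={"read_crm": frozenset({"name", "email", "plan"})},
)

guard_tool_call(crm_policy, "read_crm", ["name", "email", "plan"])  # inside the boundary
try:
    guard_tool_call(crm_policy, "read_crm", ["name", "email", "plan", "ssn", "notes"])
except BoundaryViolation as exc:
    print("blocked:", exc)  # caught at invocation time, before the read happens
```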

The pattern across all four

Common thread: evals score the output string. Agent failures live in the path.

You need a validation layer that sees the trajectory. What tools got called. In what order. With what arguments. What actions got taken. Whether boundaries were respected.
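In code terms, a trajectory does not need to be anything exotic. Here is a sketch of a per-step record that carries everything those questions need; the field names are illustrative rather than any product's schema:

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class ToolCallRecord:
    """One step of an agent trajectory: what was called, with what, and what came back."""
    tool: str                   # e.g. "search_orders"
    arguments: dict[str, Any]   # arguments the agent supplied
    output: Any                 # what the tool returned
    within_policy: bool = True  # did the call stay inside declared boundaries?

# A trajectory is just the ordered list of these records for one task.
Trajectory = list[ToolCallRecord]
```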

Atlan's six-layer agent testing framework is a useful reference. Evals cover Layers 0 and 1 (data certification, unit tests on individual tool calls). Trajectory validation and production-trace replay cover Layers 2 through 4 (integration, end-to-end, adversarial). Most teams only run Layers 0 and 1 and then wonder why Layer 3 problems reach production.

Atlan puts agent project failure in production at 80-90%. The dominant cause is missing validation infrastructure, not bad models. That is consistent with what Polarity sees across agent pilots.

What a validation sandbox actually checks

A validation sandbox worth running checks the following:

  • Tool-call schema compliance on every call.
  • Authorization scope against declared policy.
  • Tool-call trajectory: order, count, and branching versus a golden or prior run.
  • Semantic intent match between the goal and the actions taken.
  • Boundary policy enforcement at invocation time.
  • Regression against prior successful trajectories.
  • Replay of real production traces against proposed new agent versions.

That set is what the Paragon agent sandbox runs today.
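For a sense of how checks like those compose, here is a rough sketch in which each check is a plain function from a recorded trajectory to a list of findings; the earlier sketches would be adapted to that signature. The names and the deploy gate are illustrative, not Paragon's interface:

```python
from typing import Callable

# A check maps one recorded trajectory to a list of findings; an empty list means pass.
Check = Callable[[list[dict]], list[str]]

def run_sandbox(trace: list[dict], checks: list[Check]) -> dict[str, list[str]]:
    """Run every configured check over one trajectory and report only the failures."""
    return {check.__name__: findings for check in checks if (findings := check(trace))}

def safe_to_deploy(replayed_traces: list[list[dict]], checks: list[Check]) -> bool:
    """Replay recorded production traces against a candidate version; block on any finding."""
    return all(not run_sandbox(trace, checks) for trace in replayed_traces)
```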

FAQ

If evals pass, am I safe to deploy?

No. Evals check output quality on a fixed set. They don't catch tool-call misuse, drift, workflow hallucination, or boundary escapes.

Do I need to replace evals with a sandbox?

No. Keep evals for model selection and output-quality CI. Add a sandbox for trajectory, drift, and boundary validation.

What's a "trajectory"?

The sequence of tool calls, arguments, outputs, and branches the agent took. Evals score the final answer; trajectory validation scores the path.

How is boundary escape different from a security bug?

The agent has the scope, so auth says yes, but the context (user, session, data class) doesn't match the intent. Catching that takes policy-level checks at tool-call time, not just auth.

If you want to start using Polarity, check out the docs.

Try Polarity today.