Authors
research
April 16, 2026
The 80% Boundary Problem: Why Agents Escape Their Guardrails
The category of failures most evals miss — and how Polarity's invariants catch them before production.
The failure class
Most teams ship agents that pass their evals and then watch them escape their guardrails in production. We call this the 80% Boundary Problem: roughly four-fifths of the regressions we see in production are not the kinds of bugs that show up in offline test suites.
An agent doesn’t refuse a safe request — it complies with a subtly unsafe one. A retrieval-augmented agent doesn’t hallucinate a fact — it cites a real source that doesn’t actually support the claim. These are behavior failures, not output failures, and they live just outside the boundary the author drew.
Why most evals miss it
Standard evals score the final output of a trajectory. They can’t see whether the agent got there through the decisions the author intended. A trajectory can pass an output grader while violating an invariant the author would have flagged in code review.
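To make the gap concrete, here is a minimal sketch of one trajectory that passes an output grader while breaking a behavior rule. Everything in it is a hypothetical illustration rather than Polarity's API: the `Step` and `Trajectory` types, the tool names, and the rule that a refund must be preceded by an identity check are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Step:
    tool: str   # hypothetical tool name
    args: dict


@dataclass
class Trajectory:
    steps: list[Step]
    final_output: str


def output_grader(traj: Trajectory) -> bool:
    """Scores only the final answer, the way an output-level eval would."""
    return "refund issued" in traj.final_output.lower()


def verified_before_refund(traj: Trajectory) -> bool:
    """Assumed invariant: every refund call is preceded by an identity check."""
    seen_verify = False
    for step in traj.steps:
        if step.tool == "verify_identity":
            seen_verify = True
        elif step.tool == "issue_refund" and not seen_verify:
            return False
    return True


# The agent reaches the right-looking answer but skips verification.
traj = Trajectory(
    steps=[
        Step("lookup_order", {"order_id": "A1"}),
        Step("issue_refund", {"order_id": "A1"}),
    ],
    final_output="Refund issued for order A1.",
)

grader_passed = output_grader(traj)            # looks only at final_output
invariant_held = verified_before_refund(traj)  # looks at the decisions taken
```

The grader sees only `final_output`, so it has no way to notice the missing step; the invariant inspects the sequence of decisions and fails the same trajectory.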
We map four common escape paths: prompt-driven escape (the user nudges the agent past the declared line), scope-mismatch escape (auth allows the action but the user’s intent does not), tool-output escape (the agent treats a bad tool result as authoritative), and emergent escape (a combination of conditions nobody tested).
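As one way to picture the taxonomy, the sketch below names the four paths and checks for the scope-mismatch case. It rests on a strong assumption: that user intent can be represented as an explicit set of requested actions. The `EscapePath` enum, the `is_scope_mismatch` helper, and the action names are hypothetical.

```python
from enum import Enum


class EscapePath(Enum):
    PROMPT_DRIVEN = "prompt-driven"
    SCOPE_MISMATCH = "scope-mismatch"
    TOOL_OUTPUT = "tool-output"
    EMERGENT = "emergent"


def is_scope_mismatch(action: str, granted: set[str], requested: set[str]) -> bool:
    """True when auth permits the action but the user never asked for it.

    Assumes intent is available as a set of requested actions; in practice
    intent would have to be inferred, which this sketch does not attempt.
    """
    return action in granted and action not in requested


granted = {"read_calendar", "delete_event"}  # what the token allows
requested = {"read_calendar"}                # what the user asked for

mismatch = is_scope_mismatch("delete_event", granted, requested)
in_scope = is_scope_mismatch("read_calendar", granted, requested)
```

An authorization check alone would allow `delete_event` here, which is why this path needs a separate rule.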
Behavior invariants
Polarity treats every recorded decision as a candidate witness for a behavior invariant: a rule that encodes a failure which, once caught, should never ship again. Each invariant runs against every new trajectory, both in production and in CI.
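A rough sketch of that loop, assuming invariants are plain predicates over a trajectory and live in a shared registry. The `invariant` decorator, the `check` function, and the example rule are hypothetical and are not taken from Polarity's actual interface.

```python
from typing import Callable

Trajectory = list[dict]  # assumed shape: each step is {"tool": str, ...}
Invariant = Callable[[Trajectory], bool]

INVARIANTS: dict[str, Invariant] = {}


def invariant(name: str):
    """Register a predicate so it runs against every checked trajectory."""
    def register(fn: Invariant) -> Invariant:
        INVARIANTS[name] = fn
        return fn
    return register


@invariant("no_write_after_error")
def no_write_after_error(traj: Trajectory) -> bool:
    """Assumed rule: once a step has errored, no later step may write."""
    errored = False
    for step in traj:
        if step.get("error"):
            errored = True
        elif errored and step["tool"].startswith("write_"):
            return False
    return True


def check(traj: Trajectory) -> list[str]:
    """Return the names of all registered invariants the trajectory violates."""
    return [name for name, fn in INVARIANTS.items() if not fn(traj)]


bad = [{"tool": "read_record", "error": "timeout"}, {"tool": "write_record"}]
good = [{"tool": "read_record"}, {"tool": "write_record"}]

violations_bad = check(bad)
violations_good = check(good)
```

The same `check` call could sit in a CI job over recorded trajectories and in a production hook over live ones, so a rule written once is enforced in both places.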
Invariants are cheap to author and they compound: once you catch tool-loop drift on the support agent, the same invariant guards every other agent that calls the same tool.
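The compounding effect can be sketched by keying an invariant to a tool instead of to an agent. The definition of tool-loop drift used here (more than `limit` consecutive identical calls to the same tool) is an assumption made for illustration, as are the agent and tool names.

```python
import json
from itertools import groupby


def no_tool_loop(calls: list[dict], tool: str, limit: int = 3) -> bool:
    """False if `tool` is called with identical args more than `limit` times in a row."""
    # Serialize args so that equal dicts compare equal regardless of key order.
    keys = [(c["tool"], json.dumps(c["args"], sort_keys=True)) for c in calls]
    for (name, _), run in groupby(keys):
        if name == tool and sum(1 for _ in run) > limit:
            return False
    return True


# Two different agents that happen to call the same tool.
trajectories = {
    "support_agent": [{"tool": "search_kb", "args": {"q": "reset password"}}] * 5,
    "billing_agent": [
        {"tool": "search_kb", "args": {"q": "invoice"}},
        {"tool": "send_email", "args": {"to": "user"}},
    ],
}

results = {
    agent: no_tool_loop(calls, tool="search_kb")
    for agent, calls in trajectories.items()
}
```

Because the rule depends only on the tool name, any new agent that calls `search_kb` is covered without writing anything further.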
What we saw in production
Across the design partners we’ve worked with, behavior invariants surfaced regressions an average of 11 days earlier than the team’s existing eval suite. The biggest gains were on agents with long horizons and many tool calls.
Summary
Evals are necessary; they are not sufficient. The failures that hit users in production are mostly behavior failures, and they need behavior-level guardrails. That’s what Polarity is.