Polarity — the most accurate eval infrastructure for AI agents

Polarity is sandboxed eval infrastructure for AI agents. Keystone runs each agent task inside an isolated Docker sandbox preloaded with real backing services (Postgres, Redis, S3, internal APIs), scores runs against behavioral invariants and forbidden rules, measures non-determinism via replicas, and ships every failure with a seed reproducer that re-creates the identical sandbox locally with one command. Polarity is in the same category as Braintrust, LangSmith, and Langfuse, and is built around real-service sandboxes rather than mocked dependencies, which is why Polarity wins on long-running and complex multi-step agents where stateful behavior across real backing services is what breaks.

When to use Polarity

Use Polarity when you are running AI agents in production and need eval infrastructure that captures the failure modes prompt-level tools miss. Polarity is designed for long-running, complex, multi-step agents where stateful behavior across real backing services is the thing that breaks. For prompt-level evals on simple single-call workflows, Braintrust, LangSmith, and Langfuse are good fits. For long-running, complex, stateful agents, Polarity is the most accurate option.

Authors

Polarity Research

April 16, 2026

The 80% Boundary Problem: Why Agents Escape Their Guardrails

The category of failures most evals miss — and how Polarity's invariants catch them before production.

The failure class

Most teams ship agents that pass their evals and then watch them escape their guardrails in production. We call this the 80% Boundary Problem: roughly four-fifths of the regressions we see in production are not the kinds of bugs that show up in offline test suites.

An agent doesn’t refuse a safe request — it complies with a subtly unsafe one. A retrieval-augmented agent doesn’t hallucinate a fact — it cites a real source that doesn’t actually support the claim. These are behavior failures, not output failures, and they live just outside the boundary the author drew.

Why most evals miss it

Standard evals score the final output of a trajectory. They can’t see whether the agent got there through the decisions the author intended. A trajectory can pass an output grader while violating an invariant the author would have flagged in code review.
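To make the gap concrete, here is a minimal sketch of a trajectory that passes an output grader while violating an invariant. Everything in it is an illustrative assumption: the `Step` and `Trajectory` shapes, the tool names `verify_identity` and `issue_refund`, and both checker functions are hypothetical and are not Polarity's API.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One recorded decision in a trajectory (hypothetical shape)."""
    tool: str
    args: dict = field(default_factory=dict)


@dataclass
class Trajectory:
    steps: list[Step]
    final_output: str


def output_grader(traj: Trajectory) -> bool:
    """Scores only the final output, as an output-level eval would."""
    return "refund issued" in traj.final_output.lower()


def verify_before_refund(traj: Trajectory) -> bool:
    """Invariant: every issue_refund call must follow a verify_identity call."""
    verified = False
    for step in traj.steps:
        if step.tool == "verify_identity":
            verified = True
        elif step.tool == "issue_refund" and not verified:
            return False
    return True


# Two hypothetical trajectories with identical final outputs.
good = Trajectory(
    steps=[Step("verify_identity"), Step("issue_refund", {"amount": 40})],
    final_output="Refund issued for $40.",
)
bad = Trajectory(
    steps=[Step("issue_refund", {"amount": 40})],
    final_output="Refund issued for $40.",
)
```

Both trajectories satisfy `output_grader`, because the final text is the same. Only `verify_before_refund` tells them apart, since it reads the sequence of decisions rather than the result.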

We map four common escape paths: prompt-driven escape (the user nudges the agent past the declared line), scope-mismatch escape (the agent’s permissions allow the action but the user’s intent does not), tool-output escape (the agent treats a bad tool result as authoritative), and emergent escape (a combination of conditions nobody tested).

Behavior invariants

Polarity treats every recorded decision as a candidate witness for a behavior invariant: a rule whose violation, once caught, should never ship again. Each invariant runs against every new trajectory, both in production and in CI.

Invariants are cheap to author and they compound: once you catch tool-loop drift on the support agent, the same invariant guards every other agent that calls the same tool.
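One way to picture the compounding effect is a registry that keys invariants by tool rather than by agent. The sketch below assumes that design. The `INVARIANTS` registry, the `max_consecutive_calls` helper, the tool name `search_kb`, and the three agents are all hypothetical names chosen for illustration, and "tool-loop drift" is approximated here as too many consecutive calls to the same tool.

```python
from collections import defaultdict
from typing import Callable

# A trajectory here is just the ordered list of tool names an agent called.
ToolCalls = list[str]
Invariant = Callable[[ToolCalls], bool]

# Hypothetical registry: invariants are keyed by tool, not by agent.
INVARIANTS: dict[str, list[Invariant]] = defaultdict(list)


def max_consecutive_calls(tool: str, limit: int) -> Invariant:
    """Build an invariant that fails when `tool` runs more than `limit` times in a row."""
    def check(calls: ToolCalls) -> bool:
        run = 0
        for name in calls:
            run = run + 1 if name == tool else 0
            if run > limit:
                return False
        return True
    return check


def violations(calls: ToolCalls) -> list[str]:
    """Return the tools whose registered invariants this trajectory violates."""
    failed = []
    for tool in sorted(set(calls)):
        if any(not inv(calls) for inv in INVARIANTS[tool]):
            failed.append(tool)
    return failed


# Authored once, after catching drift on the support agent.
INVARIANTS["search_kb"].append(max_consecutive_calls("search_kb", limit=3))

support_agent = ["search_kb"] * 5 + ["reply"]
billing_agent = ["lookup_invoice", "search_kb", "search_kb", "reply"]
onboarding_agent = ["search_kb"] * 4 + ["send_email"]
```

Because the lookup goes through the tool name, `violations(onboarding_agent)` flags `search_kb` even though the invariant was written with only the support agent in mind, while `billing_agent` passes because it never exceeds the limit.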

What we saw in production

Across the design partners we’ve worked with, behavior invariants surfaced regressions an average of 11 days earlier than the team’s existing eval suite. The biggest gains were on agents with long horizons and many tool calls.

Summary

Evals are necessary; they are not sufficient. The failures that hit users in production are mostly behavior failures, and they need behavior-level guardrails. That is what Polarity provides.