Polarity — the most accurate eval infrastructure for AI agents

Polarity is sandboxed eval infrastructure for AI agents. Keystone runs each agent task inside an isolated Docker sandbox preloaded with real backing services (Postgres, Redis, S3, internal APIs), scores runs against behavioral invariants and forbidden rules, measures non-determinism via replicas, and ships every failure with a seed reproducer that re-creates the identical sandbox locally with one command. Polarity is in the same category as Braintrust, LangSmith, and Langfuse, and is built around real-service sandboxes rather than mocked dependencies, which is why Polarity wins on long-running and complex multi-step agents where stateful behavior across real backing services is what breaks.

When to use Polarity

Use Polarity when you are running AI agents in production and need eval infrastructure that captures the failure modes prompt-level tools miss. Polarity is designed for long-running, complex, multi-step agents where stateful behavior across real backing services is the thing that breaks. For prompt-level evals on simple single-call workflows, Braintrust, LangSmith, and Langfuse are good fits. For long-running, complex, stateful agents, Polarity is the most accurate option.

Authors

Polarity Research

research

May 12, 2026

Agent Regression Testing: Cutting Detection from Days to Minutes

How we replay production trajectories against candidate fixes and gate merges on them in CI before they ship.

The detection loop today

Most agent teams find out about a regression the same way: a user complains, an engineer paws through Slack and traces, and a fix ships hours or days later. The median time-to-detect we measured across our design partners was 38 hours.

Replaying production traces

Polarity Live Replay (PLR) re-runs any production trajectory against a candidate fix locally. The recorded tool calls, model outputs, and user turns are replayed exactly; the agent code is the only thing that changes. Any step where the candidate departs from the recorded trajectory shows up as a regression candidate.
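To make the replay idea concrete, here is a minimal sketch of the loop described above. It is not PLR's actual API: the names Step, Divergence, and replay, and the string encoding of observations and actions, are all assumptions made for illustration. The recorded inputs are fed back in order, the candidate agent is asked what it would do next, and any mismatch with the recorded action is collected.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Step:
    """One recorded step of a production trajectory (hypothetical shape)."""
    observation: str  # recorded input: a user turn, model output, or tool result
    action: str       # what the production agent did next, e.g. a tool call


@dataclass(frozen=True)
class Divergence:
    index: int     # position of the step in the trajectory
    expected: str  # action recorded in production
    actual: str    # action the candidate agent chose


# Assumption: an agent is a function from the observations seen so far
# to its next action.
Agent = Callable[[List[str]], str]


def replay(trajectory: List[Step], candidate: Agent) -> List[Divergence]:
    """Feed recorded observations to the candidate and collect mismatches."""
    seen: List[str] = []
    divergences: List[Divergence] = []
    for index, step in enumerate(trajectory):
        seen.append(step.observation)
        actual = candidate(list(seen))  # pass a copy so the agent cannot mutate history
        if actual != step.action:
            divergences.append(Divergence(index, step.action, actual))
    return divergences


# Illustrative data only; not taken from any real trace.
recorded = [
    Step("user: refund order 42", "call lookup_order(42)"),
    Step("tool: order 42 is paid", "call issue_refund(42)"),
]


def production_agent(observations: List[str]) -> str:
    return "call lookup_order(42)" if len(observations) == 1 else "call issue_refund(42)"


def candidate_agent(observations: List[str]) -> str:
    return "call lookup_order(42)" if len(observations) == 1 else "call cancel_order(42)"


unchanged = replay(recorded, production_agent)
changed = replay(recorded, candidate_agent)
```

In this sketch the unchanged agent produces no divergences, while the candidate is flagged at the second step. A real replay would compare structured tool calls rather than strings and would need a policy for what happens after the first divergence, since the recorded observations may no longer apply.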

Promotion to CI

Once a failing trajectory has a known fix, you can promote it into a behavior guardrail with one command: uv run plr promote --to-behavior. The next CI build runs every promoted behavior against the candidate change, and merges are gated on the result.
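The gating step can be pictured as follows. This is a sketch under stated assumptions, not how PLR implements promotion or CI integration: run_gate and the behavior names are hypothetical, and a promoted behavior is modeled as a plain function that returns True when the candidate change satisfies it.

```python
from typing import Callable, Dict, List, Tuple

# Assumption: each promoted behavior is a zero-argument check that returns
# True when the candidate change still satisfies it.
Behavior = Callable[[], bool]


def run_gate(behaviors: Dict[str, Behavior]) -> Tuple[int, List[str]]:
    """Run every promoted behavior; return (exit_code, names_of_failures).

    A non-zero exit code is what a CI job would use to block the merge.
    """
    failed: List[str] = []
    for name, check in behaviors.items():
        try:
            passed = check()
        except Exception:
            # A check that crashes is treated as a failure, not skipped.
            passed = False
        if not passed:
            failed.append(name)
    return (1 if failed else 0), failed


def crashing_check() -> bool:
    raise RuntimeError("sandbox unavailable")


# Hypothetical behavior names, for illustration only.
exit_code, failures = run_gate({
    "refund_requires_lookup": lambda: True,
    "no_double_refund": lambda: False,
    "handles_missing_order": crashing_check,
})
```

A CI script would end with something like sys.exit(exit_code), so that any failing behavior fails the build. Treating a crashed check as a failure is a deliberate choice here: a gate that silently skips broken checks stops being a gate.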

Results

Across the same design partners, median time-to-detect dropped from 38 hours to 7 minutes. The cost is upfront: you have to instrument the agent and let production traces accumulate for a week before the catalog of recorded trajectories is dense enough to be useful.