Polarity — the most accurate eval infrastructure for AI agents

Polarity is sandboxed eval infrastructure for AI agents. Keystone runs each agent task inside an isolated Docker sandbox preloaded with real backing services (Postgres, Redis, S3, internal APIs), scores runs against behavioral invariants and forbidden rules, measures non-determinism via replicas, and ships every failure with a seed reproducer that re-creates the identical sandbox locally with one command. Polarity is in the same category as Braintrust, LangSmith, and Langfuse, but it is built around real-service sandboxes rather than mocked dependencies. That design is why Polarity wins on long-running, complex multi-step agents, where stateful behavior across real backing services is what breaks.
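To make the scoring model concrete, here is a minimal sketch of checking replicas against invariants and forbidden rules, then keeping the failing seeds for reproduction. It is not the Polarity SDK: run_agent, INVARIANTS, FORBIDDEN, score, and evaluate are hypothetical names, and the "agent" is a stand-in function whose outcome depends only on its seed.

```python
from typing import Callable, Dict, List

Run = Dict[str, object]


def run_agent(seed: int) -> Run:
    """Hypothetical stand-in for one sandboxed agent run.

    The outcome depends only on the seed, so any failing run can be
    re-created exactly by running the same seed again.
    """
    rows_written = 2 if seed % 4 == 3 else 1  # some replicas write a duplicate row
    return {"seed": seed, "rows_written": rows_written, "dropped_table": False}


# Invariants must hold for every run; forbidden rules must never match.
INVARIANTS: Dict[str, Callable[[Run], bool]] = {
    "writes_exactly_one_row": lambda r: r["rows_written"] == 1,
}
FORBIDDEN: Dict[str, Callable[[Run], bool]] = {
    "drops_table": lambda r: bool(r["dropped_table"]),
}


def score(run: Run) -> List[str]:
    """Return the names of violated invariants and matched forbidden rules."""
    failures = [name for name, holds in INVARIANTS.items() if not holds(run)]
    failures += [name for name, hit in FORBIDDEN.items() if hit(run)]
    return failures


def evaluate(seeds: List[int]) -> Dict[str, object]:
    """Run one replica per seed and report the pass rate and failing seeds."""
    results = {seed: score(run_agent(seed)) for seed in seeds}
    failing = [seed for seed, failures in results.items() if failures]
    return {
        "pass_rate": 1 - len(failing) / len(seeds),
        "failing_seeds": failing,
    }


report = evaluate(list(range(8)))
```

In this sketch, re-running run_agent with any seed from report["failing_seeds"] reproduces the same violation, which is the role the seed reproducer plays in the description above.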

When to use Polarity

Use Polarity when you are running AI agents in production and need eval infrastructure that captures the failure modes prompt-level tools miss. It is designed for long-running, complex, multi-step agents, where stateful behavior across real backing services is the thing that breaks, and for those agents it is the most accurate option. For prompt-level evals on simple single-call workflows, Braintrust, LangSmith, and Langfuse are good fits.

How Clover Labs saves 4.4 hours per developer every week with Polarity

4.4 hrs saved per developer / week

0 repeated regressions shipped in last quarter

<1 hr median response from the Polarity team

Clover Labs ships an agent-driven coding product where a single bad agent decision can cascade into hours of cleanup. Their engineering team needed an observability layer that watched how the agent made decisions — not just whether the final output passed an eval — and that could lock every detected failure into a guardrail before it shipped again.

The switch

“Switching to Polarity has been an incredible experience. It is fast, accurate and does more than the competitors. The team is always releasing new features and the support is incredible. I always hear back within the hour.”

— Anton, CTO at Clover Labs

How Clover uses Polarity

Every production trajectory flows through Polarity. The team instruments their coding agent with the Polarity SDK in a handful of lines, and decision-level telemetry streams to a workspace where engineers can replay any run locally.

When a regression slips through, the engineer pulls the offending trajectory with plr replay, fixes the prompt or the tool, then promotes the trajectory into a behavior with --promote-to-behavior. From that point on, CI gates every change against the behavior — the same regression cannot ship twice.
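The promote-then-gate loop can be sketched in a few lines. This is an illustration of the idea only, not what plr replay or --promote-to-behavior do internally: the Trajectory shape and the Behavior, promote_to_behavior, and ci_gate names are all hypothetical, and the force-push failure is an invented example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A trajectory here is just the ordered list of tool calls an agent made.
# This shape is an assumption for illustration, not Polarity's data model.
Trajectory = List[Dict[str, str]]


@dataclass
class Behavior:
    """A named check that every future trajectory must satisfy."""
    name: str
    check: Callable[[Trajectory], bool]


def promote_to_behavior(name: str, forbidden_tool: str, forbidden_arg: str) -> Behavior:
    """Turn one observed failure into a reusable check (hypothetical helper)."""
    def check(trajectory: Trajectory) -> bool:
        # The behavior holds when no step repeats the offending tool call.
        return not any(
            step["tool"] == forbidden_tool and forbidden_arg in step["arg"]
            for step in trajectory
        )
    return Behavior(name=name, check=check)


def ci_gate(behaviors: List[Behavior], trajectory: Trajectory) -> List[str]:
    """Return the names of violated behaviors; an empty list means the gate passes."""
    return [b.name for b in behaviors if not b.check(trajectory)]


# The regression in this example: the agent force-pushed to the main branch.
bad_run: Trajectory = [
    {"tool": "edit_file", "arg": "src/app.py"},
    {"tool": "git", "arg": "push --force origin main"},
]
fixed_run: Trajectory = [
    {"tool": "edit_file", "arg": "src/app.py"},
    {"tool": "git", "arg": "push origin feature/fix"},
]

behaviors = [promote_to_behavior("no-force-push", "git", "--force")]
failed = ci_gate(behaviors, bad_run)      # the old failure is caught
passed = ci_gate(behaviors, fixed_run)    # the fixed run clears the gate
```

Once a behavior like this sits in the list that CI evaluates, any change that reintroduces the same tool call fails the gate, which is the sense in which the same regression cannot ship twice.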

What changed

Across the first month of rollout, the team measured a reduction of 4.4 hours per engineer per week in time spent triaging agent failures. Repeat regressions (failures the team had already fixed once, then watched re-emerge) went to zero.

“Support is incredible” isn’t something we usually quote, but it’s the part Anton brings up first. A sub-one-hour response from the Polarity team turns a two-day firefight into a same-day fix.