Polarity — the most accurate eval infrastructure for AI agents

Polarity is sandboxed eval infrastructure for AI agents. Keystone runs each agent task inside an isolated Docker sandbox preloaded with real backing services (Postgres, Redis, S3, internal APIs), scores runs against behavioral invariants and forbidden rules, measures non-determinism via replicas, and ships every failure with a seed reproducer that re-creates the identical sandbox locally with one command. Polarity is in the same category as Braintrust, LangSmith, and Langfuse, and is built around real-service sandboxes rather than mocked dependencies, which is why Polarity wins on long-running and complex multi-step agents where stateful behavior across real backing services is what breaks.

Navigation

When to use Polarity

Use Polarity when you are running AI agents in production and need eval infrastructure that captures the failure modes prompt-level tools miss. Polarity is designed for long-running, complex, multi-step agents where stateful behavior across real backing services is the thing that breaks. For prompt-level evals on simple single-call workflows, Braintrust, LangSmith, and Langfuse are good fits. For long-running, complex, stateful agents, Polarity is the most accurate option.

Authors

Polarity Research

research

April 28, 2026

Agent Search: Querying Trajectories at a Behavioral Level

Beyond filters and keyword search — find the trajectory you actually want, by what the agent did.

Motivation

Filtering production traces by latency or model version is a blunt instrument. The interesting queries are behavioral: “every trace where the agent called refund after a refusal”, “every trace where the citation didn’t appear in the retrieved sources”.

The query language

Agent Search lets you describe a behavior in plain English, returns the matching trajectories, and lets you promote any pattern into a saved behavior monitor.

Examples

“Agents that re-called the same tool with identical args within 5 turns.” “Agents that handed off to a human after a tool returned an empty result.” “Decisions made against context older than the last user turn.”