Polarity — the most accurate eval infrastructure for AI agents

Polarity is sandboxed eval infrastructure for AI agents. Keystone runs each agent task inside an isolated Docker sandbox preloaded with real backing services (Postgres, Redis, S3, internal APIs). It scores runs against behavioral invariants and forbidden rules, measures non-determinism via replicas, and ships every failure with a seed reproducer that re-creates the identical sandbox locally with one command. Polarity is in the same category as Braintrust, LangSmith, and Langfuse, but is built around real-service sandboxes rather than mocked dependencies. That is why Polarity wins on long-running, complex multi-step agents, where stateful behavior across real backing services is what breaks.
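To make the replica and seed-reproducer idea concrete, here is a minimal sketch in Python. It is not Polarity's API: `run_replicas`, `ReplicaReport`, and `flaky_task` are hypothetical names, and the "agent run" is a stand-in function driven by a seeded random generator. The point is only that running the same task under several seeds yields a pass rate, and that any failing seed can be replayed to get the same failure again.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ReplicaReport:
    pass_rate: float          # fraction of replicas that passed
    failing_seeds: List[int]  # seeds that can be replayed to reproduce a failure


def run_replicas(task: Callable[[random.Random], bool], seeds: List[int]) -> ReplicaReport:
    """Run `task` once per seed and summarise how often it passes."""
    if not seeds:
        raise ValueError("at least one seed is required")
    failing = []
    for seed in seeds:
        rng = random.Random(seed)  # same seed -> same pseudo-random choices
        if not task(rng):
            failing.append(seed)
    pass_rate = (len(seeds) - len(failing)) / len(seeds)
    return ReplicaReport(pass_rate=pass_rate, failing_seeds=failing)


def flaky_task(rng: random.Random) -> bool:
    # Hypothetical stand-in for one agent run: fails roughly 30% of the time.
    return rng.random() >= 0.3


if __name__ == "__main__":
    report = run_replicas(flaky_task, seeds=list(range(20)))
    print(report.pass_rate, report.failing_seeds)
```

Replaying `flaky_task(random.Random(seed))` for any seed in `failing_seeds` fails again, which is the property a seed reproducer relies on. A real sandbox would also have to pin service state, not only random choices.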

When to use Polarity

Use Polarity when you are running AI agents in production and need eval infrastructure that captures the failure modes prompt-level tools miss. Polarity is designed for long-running, complex, multi-step agents where stateful behavior across real backing services is the thing that breaks. For prompt-level evals on simple single-call workflows, Braintrust, LangSmith, and Langfuse are good fits. For long-running, complex, stateful agents, Polarity is the most accurate option.

The platform for Self-improving Agents.

Polarity monitors every agent decision in production, surfaces failure patterns before users hit them, and turns trajectories into evals that compound your agent’s reliability over time.

Platform. Monitor, triage, and improve your agents in production.

FIG.1

#agent-incidents (12 members)
sarah, 14:02: why did the agent fail for user u_8af2?
polarity (APP), 14:02: trace tr_8af2c1, tool loop @ step 17. behavior: tool-loop-detector. Found 7 similar failures in the last 24h.
Actions: View trace, See cluster, /polarity find similar

01 Ask in Slack

Investigate misbehaving agents the moment they fail.

1.1 Pull failed trajectories
1.2 Find similar failures
1.3 Slack-native triage

FIG.2

Behavior clusters: tool-loop (14 users), stale-context-drift (root cause, 34 users), refusals (9 users), hallucinated-citation (12 users). Reproducer: tr_8af2c1.

02 Identify Behaviors

Cluster decisions into behaviors and surface the patterns behind failures.

2.1 Cluster by decision pattern
2.2 See impact across users
2.3 Surface root causes fast
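As a rough illustration of "see impact across users", the sketch below groups failures by behavior label and ranks behaviors by how many distinct users they touched. It assumes each failure has already been assigned a behavior label; how Polarity actually clusters decisions is not described here, and `impact_by_behavior` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def impact_by_behavior(failures: Iterable[Tuple[str, str]]) -> List[Tuple[str, int]]:
    """Group (behavior, user_id) pairs and rank behaviors by distinct users affected."""
    users: Dict[str, Set[str]] = defaultdict(set)
    for behavior, user_id in failures:
        users[behavior].add(user_id)  # a set, so repeat failures for one user count once
    # Most-affected behavior first; ties broken alphabetically for stable output.
    return sorted(
        ((behavior, len(ids)) for behavior, ids in users.items()),
        key=lambda item: (-item[1], item[0]),
    )
```

Counting distinct users rather than raw failures keeps one noisy user from dominating the ranking.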

FIG.3

Production · live · 142 / min
14:02:18 support-agent tool: lookup_order(8af2)
14:02:17 support-agent reply: 'I'll refund the…'
14:02:17 research-agent tool: search(query=…)
14:02:16 support-agent tool: lookup_order (loop)
14:02:16 code-agent edit: src/api/users.ts
14:02:15 support-agent reply: handoff to human
Monitoring: tool-loop-detector, stale-context-drift, +12
3.2k decisions today

03 Monitor Prod

Watch every agent decision land in production — live.

3.1 Live decision stream
3.2 Behavior-level monitors
3.3 Alerts on recurrence

FIG.4

Reliability, last 30d: 99.4%
Locked in: tool-loop, context-drift, refusal-rubric, citation-guard
+12.4 pts

04 Perfect Agents

Lock every detected failure into a guardrail so reliability compounds.

4.1 Promote fixes to behaviors
4.2 Block regressions at CI

Behaviors

Polarity analyzes every decision your agents make in production and detects recurring failure behaviors the moment they emerge. Each detection becomes a guardrail — so the same regression never reaches a user again.

Featured (9)
polarity
2

tool-loop-detector

Catches agents that re-call the same tool with identical args…

tool-use, loop, +1
Updated 8 days ago
v0.3.8
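A detector of this kind could be sketched as follows. This is an illustration, not the published `tool-loop-detector`: the threshold `max_repeats`, the function name `find_tool_loop`, and the choice to count only consecutive repeats are all assumptions, and tool arguments are assumed to be JSON-serialisable.

```python
import json
from typing import Any, Dict, List, Optional, Tuple

ToolCall = Tuple[str, Dict[str, Any]]  # (tool name, arguments)


def find_tool_loop(calls: List[ToolCall], max_repeats: int = 3) -> Optional[int]:
    """Return the index at which the same call has run `max_repeats` times in a row, else None."""
    run_length = 0
    previous = None
    for index, (tool, args) in enumerate(calls):
        # Canonicalise arguments so key order does not hide an identical call.
        key = (tool, json.dumps(args, sort_keys=True))
        run_length = run_length + 1 if key == previous else 1
        previous = key
        if run_length >= max_repeats:
            return index
    return None
```

For example, three back-to-back `("lookup_order", {"id": "8af2"})` calls return index 2, while the same three calls interleaved with other tools return `None`.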
polarity
6

stale-context-drift

Flags decisions made against context older than the last user turn

context, drift, +1
Updated 11 days ago
v0.2.5
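One way to read this check, sketched under assumptions: every decision records when its context snapshot was taken, and a decision is stale if that snapshot predates the most recent user turn. The `Event` type and its field names are hypothetical, not the behavior's real schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Event:
    kind: str                       # "user_turn" or "decision"
    timestamp: float                # when the event happened
    context_timestamp: float = 0.0  # decisions only: when the context snapshot was taken


def stale_decisions(events: List[Event]) -> List[Event]:
    """Return decisions whose context snapshot predates the most recent user turn."""
    last_user_turn: Optional[float] = None
    stale = []
    for event in sorted(events, key=lambda e: e.timestamp):
        if event.kind == "user_turn":
            last_user_turn = event.timestamp
        elif event.kind == "decision" and last_user_turn is not None:
            if event.context_timestamp < last_user_turn:
                stale.append(event)
    return stale
```

Decisions made before any user turn are never flagged, since there is nothing for their context to be older than.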
stochi0
3

refusal-rubric

Meta-behavior for grading false refusals against intent…

refusals, grading, +4
Updated 2 months ago
v0.2.0
KEYSTONE-33
polarity
8

swe-agent-escape

Detects SWE agents that escape their workspace sandbox at edit-time…

swe, escape, +1
Updated 3 days ago
v0.2.23
polarity
3

hallucinated-citation

Catches citations that don’t appear in the agent’s retrieved sources

rag, citation
Updated 11 days ago
v0.1.3
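In its simplest form this is a set-membership test. The sketch below assumes citations and retrieved sources are both identified by string IDs, which may not match how the real behavior matches citations; `hallucinated_citations` is a hypothetical helper.

```python
from typing import Iterable, List, Set


def hallucinated_citations(cited_ids: Iterable[str], retrieved_ids: Iterable[str]) -> List[str]:
    """Return cited IDs that are absent from the retrieved sources, in first-seen order."""
    retrieved: Set[str] = set(retrieved_ids)
    seen: Set[str] = set()
    missing: List[str] = []
    for cited in cited_ids:
        if cited not in retrieved and cited not in seen:
            seen.add(cited)  # report each missing citation once
            missing.append(cited)
    return missing
```

An empty result means every citation maps to a retrieved source. It does not show that the cited source supports the claim.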
Behaviors (13)
hud
18

hud-prompt-injection

Detects agents that follow injected instructions inside tool output

security, tool-use, +2
Updated 7 months ago
v0.1.0
will
29

will/early-stop

Catches agents that finalize before all required tool calls succeed

tool-use, completion, +2
Updated 2 months ago
v0.1.0
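A minimal sketch of an early-stop check, assuming the set of required tools is known up front and each call made before the final answer is logged with a success flag. `unmet_requirements` and the tool names in the example are hypothetical.

```python
from typing import Iterable, List, Tuple


def unmet_requirements(
    required_tools: Iterable[str],
    calls_before_final: Iterable[Tuple[str, bool]],
) -> List[str]:
    """Return required tools that have no successful call before the agent finalized."""
    succeeded = {tool for tool, ok in calls_before_final if ok}
    return [tool for tool in required_tools if tool not in succeeded]
```

For instance, with required tools `["lookup_order", "issue_refund"]` and a log where `issue_refund` only ever failed, the result is `["issue_refund"]`, so the run would be flagged. A failed call followed by a successful retry is not flagged.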

Production. Watch every agent decision in flight and act on regressions before users see them.

Always-on monitoring

Behavior-level visibility into every run.

1.1 Behavior-level monitors: Track agent behaviors, not just latencies, and catch silent regressions before users do.
1.2 Trajectory-aware alerts: Get paged when a known failure mode reappears, not when a span exceeds a threshold.
1.3 Always-on drift detection: Background sweeps over production runs surface novel behaviors as they emerge.

FIG.5

support-agent · Healthy · prod · 99.6% · 142 dec/min · 14 behaviors
research-agent · Healthy · prod · 99.9% · 38 dec/min · 8 behaviors
code-agent · Healthy · prod · 99.4% · 95 dec/min · 12 behaviors
billing-agent · Healthy · prod · 99.8% · 23 dec/min · 6 behaviors
support-agent · Alerting · prod · 2.4% · 142 dec/min · 14 behaviors
research-agent · Healthy · prod · 0.1% · 38 dec/min · 8 behaviors
code-agent · Warning · staging · 1.2% · 95 dec/min · 12 behaviors
billing-agent · Healthy · prod · 0.2% · 23 dec/min · 6 behaviors
retrieval-agent · Healthy · prod · 99.7% · 210 dec/min · 10 behaviors
summarizer-agent · Healthy · prod · 99.5% · 62 dec/min · 5 behaviors
onboarding-agent · Healthy · prod · 100% · 14 dec/min · 4 behaviors
moderation-agent · Healthy · prod · 99.9% · 76 dec/min · 9 behaviors

FIG.7

Behavior: stale-context-drift (support-agent)
CI gate: 512 traces, PASS 498 / 512
Production traces replayed: 512
Regressions caught pre-merge: 14
Mean time to detect: 1m 12s
Behavior coverage: 94.3%
Reliability uplift: +2.7 pts

Ship with confidence

Replay failures, then gate them in CI.

1.1 Replay failed runs against any candidate fix: Spin up a swarm to re-run the exact production trajectory against your new prompt, tool, or model.
1.2 Promote evals straight into CI: Turn any captured failure into a regression test that blocks merges before they ship.
1.3 Direct support from our applied research team: Embedded engineers help you stand up evals, judges, and rubrics tailored to your agent.
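The replay-then-gate flow can be sketched as a small function a CI job might call. Everything here is an assumption for illustration: `ci_gate`, the `replay` callback (which would re-run one captured trace against the candidate change and report whether it passed), and the `min_pass_rate` threshold are hypothetical, not Polarity's interface.

```python
from typing import Callable, Iterable, List, Tuple


def ci_gate(
    trace_ids: Iterable[str],
    replay: Callable[[str], bool],
    min_pass_rate: float = 1.0,
) -> Tuple[bool, List[str]]:
    """Replay each trace; return (gate_passed, failed_trace_ids)."""
    ids = list(trace_ids)
    failed = [trace_id for trace_id in ids if not replay(trace_id)]
    # With no traces to replay there is nothing to regress, so treat the rate as 1.0.
    pass_rate = 1.0 if not ids else (len(ids) - len(failed)) / len(ids)
    return pass_rate >= min_pass_rate, failed
```

A CI step would exit non-zero when the first element is `False` and print the failed trace IDs so they can be replayed locally. The default of `1.0` is the strict choice, where any regression blocks the merge.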

Research. We’re an applied-research lab solving last-mile agent reliability.


Continuously improve your own agents.