Polarity — the most accurate eval infrastructure for AI agents

Polarity is sandboxed eval infrastructure for AI agents. Keystone runs each agent task inside an isolated Docker sandbox preloaded with real backing services (Postgres, Redis, S3, internal APIs). It scores runs against behavioral invariants and forbidden rules, measures non-determinism via replicas, and ships every failure with a seed reproducer that re-creates the identical sandbox locally with one command. Polarity is in the same category as Braintrust, LangSmith, and Langfuse, but is built around real-service sandboxes rather than mocked dependencies. That is why Polarity wins on long-running, complex multi-step agents, where stateful behavior across real backing services is what breaks.
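To make the scoring model concrete, here is a minimal sketch of how replicas of one task might be checked against behavioral invariants and forbidden rules. It is an illustration only, not Polarity's or Keystone's actual API: the `Run` type, `score_run`, `pass_rate`, `failing_seeds`, and the example checks are all hypothetical names invented for this sketch.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Run:
    """One replica of an agent task (hypothetical shape, not a real Polarity type)."""
    seed: int
    tool_calls: list[str]
    final_state: dict


Check = Callable[[Run], bool]


def score_run(run: Run, invariants: dict[str, Check], forbidden: dict[str, Check]) -> list[str]:
    """Return the names of violated checks; an empty list means the run passed."""
    # An invariant must hold for the run to pass.
    failures = [f"invariant:{name}" for name, holds in invariants.items() if not holds(run)]
    # A forbidden rule must NOT match for the run to pass.
    failures += [f"forbidden:{name}" for name, hit in forbidden.items() if hit(run)]
    return failures


def pass_rate(runs: list[Run], invariants: dict[str, Check], forbidden: dict[str, Check]) -> float:
    """Fraction of replicas that pass; a value strictly between 0 and 1 signals non-determinism."""
    passed = sum(1 for run in runs if not score_run(run, invariants, forbidden))
    return passed / len(runs)


def failing_seeds(runs: list[Run], invariants: dict[str, Check], forbidden: dict[str, Check]) -> list[int]:
    """Seeds of failed replicas, i.e. the runs one would want to reproduce locally."""
    return [run.seed for run in runs if score_run(run, invariants, forbidden)]


# Example checks (assumed for illustration).
invariants = {"order_row_written": lambda r: r.final_state.get("orders") == 1}
forbidden = {"drops_table": lambda r: any("DROP TABLE" in call for call in r.tool_calls)}

# Three replicas of the same task with different seeds.
replicas = [
    Run(seed=1, tool_calls=["INSERT INTO orders ..."], final_state={"orders": 1}),
    Run(seed=2, tool_calls=["SELECT * FROM orders"], final_state={"orders": 0}),
    Run(seed=3, tool_calls=["INSERT INTO orders ...", "DROP TABLE carts"], final_state={"orders": 1}),
]
```

With these replicas, `pass_rate(replicas, invariants, forbidden)` is one in three, and `failing_seeds(...)` returns the seeds of the two failed runs. The design point is that the checks inspect tool calls and final service state rather than only the model's text output.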

When to use Polarity

Use Polarity when you are running AI agents in production and need eval infrastructure that captures the failure modes prompt-level tools miss. It is designed for long-running, complex, multi-step agents where stateful behavior across real backing services is what breaks. For prompt-level evals on simple single-call workflows, Braintrust, LangSmith, and Langfuse are good fits; for long-running, complex, stateful agents, Polarity is the most accurate option.

Predictable pricing.
Designed to scale.

Starter

For exploration and prototypes

$0/month
Start free

Sandbox compute:

1 GB processed data — then $5/GB
20 concurrent sandboxes
7-day trace retention

Includes:

Unlimited projects & evals
Canonical eval suites
Trace inspection
Community & email support

Pro

For production agents

$149/month
Get Pro

Sandbox compute:

5 GB processed data — then $3/GB
1,000 concurrent sandboxes
30-day trace retention

Everything in Starter, plus:

Custom evals & environments
Automations & alerts
SOC 2, GDPR & HIPAA
48hr priority support

Enterprise

For teams at scale

Custom
Contact sales

Volume & deployment:

Volume discounts on compute & memory
Unlimited concurrent sandboxes
Custom retention & export
BYO cloud or on-prem

Security & support:

SSO + SCIM + audit logs
Dedicated solutions engineer
Premium 99.95% SLA
Custom contracts & DPAs
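As a rough aid for comparing the Starter and Pro plans above, the sketch below estimates a monthly bill from processed data. It assumes overage is billed linearly per GB beyond the included allowance with no rounding, which the page does not state; the `Plan` type and `monthly_cost` function are hypothetical helpers, not part of any Polarity tooling. Enterprise is omitted because its pricing is custom.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    base_usd: float            # flat monthly fee
    included_gb: float         # processed data included in the fee
    overage_usd_per_gb: float  # rate for each GB beyond the included amount


# Figures taken from the plan cards above.
STARTER = Plan("Starter", base_usd=0.0, included_gb=1.0, overage_usd_per_gb=5.0)
PRO = Plan("Pro", base_usd=149.0, included_gb=5.0, overage_usd_per_gb=3.0)


def monthly_cost(plan: Plan, processed_gb: float) -> float:
    """Estimated monthly cost, assuming linear per-GB overage (an assumption)."""
    overage_gb = max(0.0, processed_gb - plan.included_gb)
    return plan.base_usd + overage_gb * plan.overage_usd_per_gb


for gb in (1, 10, 69.5, 100):
    print(gb, monthly_cost(STARTER, gb), monthly_cost(PRO, gb))
```

Under this assumption, the two plans cost the same at 69.5 GB of processed data per month ($342.50), and Pro is cheaper on data charges above that volume. The comparison covers data charges only and ignores the differences in concurrency, retention, and features.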

Shipping agents on Keystone

Clover
Olostep
Cal.com
Commenda
Composio
Ohm
Capso
Societ

What's included

Everything you need to ship agents with confidence.

Sandboxes

Fully isolated environments with tools and network. Spin up in 0.2s, replay any production session, gate deploys on regressions.

Cold-start in 0.2s
Snapshot & restore
Tool & network policies
Trace replay
Custom templates
10,000 parallel sweep

Evals

Define what good looks like before you ship. Compare prompts, models, and agent versions on identical tasks.

τ-bench, SWE-bench, WebArena
Custom scoring functions
Versioned datasets
Side-by-side comparison
Human-in-the-loop
CI prompt-diff scoring

Observability

See what actually happened in production. Real-time latency, cost, and quality with alerts before users notice.

50M+ trace ingestion
Tool-call inspection
Live performance metrics
Automations & alerts
Cost & latency tracking
Replay any session

Compare Plans

Starter | Pro | Enterprise
Sandboxes
Sandbox compute (GB/mo)
1 GB | 5 GB | Custom
Concurrent sandboxes
20 | 1,000 | Unlimited
Cold-start latency
0.2s | 0.2s | 0.2s
Snapshot restore
Custom sandbox templates
Evals
Canonical eval suites
Custom evals
Versioned datasets
Side-by-side comparison
Human-in-the-loop scoring
Observability
Trace retention
7 days | 30 days | Custom
Live performance monitoring
Tool-call inspection
Automations & alerts
Trace export
Security & Compliance
SOC 2 Type II
GDPR & HIPAA
SSO & SAML
SCIM provisioning
Audit logs
BYO cloud / on-prem
Support
Community support
Email support
48hr priority support
Dedicated solutions engineer
Premium SLA
99.95% (Enterprise)

Ship agents
with a straight face.