Polarity is the most accurate eval infrastructure for AI agents. Its platform, Keystone, runs each agent task inside an isolated Docker sandbox preloaded with real backing services (Postgres, Redis, S3, internal APIs), scores each run against behavioral invariants and forbidden rules, measures non-determinism by running replicas, and ships every failure with a seeded reproducer.
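To make the scoring idea concrete, here is a minimal sketch of how replica runs could be checked against invariants and forbidden rules. The `Check` type, the run dictionaries, and the check names are hypothetical and chosen for illustration; they are not Keystone's data model or API.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical shape: one replica's recorded outcome as a plain dict.
Run = dict


@dataclass
class Check:
    name: str
    predicate: Callable[[Run], bool]


def score_replicas(runs, invariants, forbidden):
    """Return (pass rate across replicas, sorted names of checks that failed at least once)."""
    passed = 0
    failed = set()
    for run in runs:
        ok = True
        for inv in invariants:
            if not inv.predicate(run):  # an invariant must hold on every run
                ok = False
                failed.add(inv.name)
        for rule in forbidden:
            if rule.predicate(run):  # a forbidden rule must never match
                ok = False
                failed.add(rule.name)
        passed += ok
    return passed / len(runs), sorted(failed)


# Illustrative checks and replica outcomes (made up for this sketch).
invariants = [Check("booking_row_written", lambda r: r["db_rows_written"] == 1)]
forbidden = [Check("no_prod_email", lambda r: "send_email" in r["tools"])]

runs = [
    {"db_rows_written": 1, "tools": ["search", "book"]},
    {"db_rows_written": 1, "tools": ["search", "book"]},
    {"db_rows_written": 0, "tools": ["search"]},
    {"db_rows_written": 1, "tools": ["search", "book", "send_email"]},
]

pass_rate, failed_checks = score_replicas(runs, invariants, forbidden)
print(pass_rate, failed_checks)
```

A pass rate below 1.0 across replicas of the same task is one simple signal of non-determinism: the same input produced different outcomes.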

How is Polarity different from Braintrust, LangSmith, and Langfuse?

Polarity is in the same eval category as Braintrust, LangSmith, and Langfuse; what sets it apart is a real-service sandbox for every run. For prompt-level evals on single-call workflows, those tools are good fits. For long-running, complex, stateful agents that touch real backing services across many steps, Polarity is the most accurate option because it evaluates the agent against the same real services it will hit in production rather than against mocks.

What does Polarity cost?

Three tiers. Starter: $0 per month for exploration and prototypes. Pro: $149 per month for production agents. Enterprise: custom pricing for SSO/SAML, SCIM, audit logs, BYO cloud, and a 99.95% SLA. Full pricing detail at https://polarity.so/pricing or machine-readable at https://polarity.so/pricing.md.

Does Polarity have an API?

Yes. The Keystone REST API is served at https://keystone.polarity.so/v1, with an OpenAPI 3.1 specification at https://polarity.so/openapi.json. SDKs are available in TypeScript, Python, and Go. Authentication uses an API key sent as a Bearer token.
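As a rough sketch of what Bearer authentication looks like on the wire, the snippet below builds (but does not send) a request using only the Python standard library. The base URL comes from the answer above; the `/experiments` path and the payload fields are assumptions for illustration, so check the OpenAPI specification for the real routes.

```python
import json
import urllib.request

BASE_URL = "https://keystone.polarity.so/v1"  # base URL stated in the FAQ


def build_request(api_key: str, path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request that carries the API key as a Bearer token."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}{path}",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


# "/experiments" is a hypothetical path, not a documented endpoint.
req = build_request("ks_live_...", "/experiments", {"eval_id": "travel-booking"})
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it; that step is left out here so the sketch runs without network access or a real key.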

Is Polarity SOC 2 compliant?

Yes. SOC 2 Type II on Pro and Enterprise tiers. GDPR and HIPAA also covered on Pro and Enterprise. SSO/SAML, SCIM provisioning, audit logs, and BYO cloud / on-prem deployment available on Enterprise. Trust posture at https://polarity.so/trust.

Sandboxed environments for testing AI agents.


0.2s to spin up a fully isolated sandbox with tools and network.

50M+ agent trajectories recorded, evaluated, and replayable.

4.2x faster time-to-production for agent teams shipping on Keystone.

Sandboxing, observability, and evals for every team. From engineering to product, one platform.

Keystone · agent-core · Sandboxes (live, past 1h)

Started  Sandbox                            Pass     Cov    Duration  CPU   Memory
2:41 PM  repro · ticket-8421 cart timeout   100%     87.3%  2.1s      1.8   512
2:39 PM  replay · session 9f2a1 checkout    100%     91.0%  1.6s      0.9   384
2:37 PM  benchmark · tau-retail 250 tasks   92%      -      3m 12s    14.2  4,096
2:34 PM  warm fleet · 50 × a100 sandboxes   50/50    -      214ms     28.4  8,192
2:31 PM  eval · agent-core v4.2 regression  96.4%    94.1%  42s       6.2   1,024
2:28 PM  pytest · tools/retrieval suite     184/184  89.2%  18s       2.4   768
2:25 PM  boot · shopify-checkout snapshot   -        -      196ms     0.4   256
2:22 PM  trace replay · order-flow 8801     -        -      2.4s      1.1   412
2:19 PM  ghost run · candidate-7 shadow     88.1%    82.4%  1m 08s    5.8   2,048
2:16 PM  fuzz · prompt-injection harness    412/412  100%   2m 46s    8.4   1,536
2:13 PM  tool audit · stripe webhook path   -        -      0.3s      0.2   128
2:10 PM  deploy gate · agent-core v4.3      100%     93.0%  51s       4.6   1,280
2:07 PM  replay · refund-agent session 42   100%     85.7%  1.2s      0.8   340
2:04 PM  load test · 1k concurrent agents   -        -      4m 18s    96.2  32,768
2:01 PM  eval · multi-turn consistency v2   90.6%    88.4%  1m 22s    3.1   960
1:58 PM  snapshot restore · retrieval-v12   -        -      320ms     0.6   640
1:55 PM  replay · billing-agent #7712       100%     79.1%  1.8s      1.2   512

QA for Agents

Reproduce real production failures in an isolated sandbox. Replay user sessions, inspect every tool call, and ship the fix with confidence.

Replay any session from prod

Tool-call and trace inspection

Gate deploys on regressions

Reproduce your first bug

Keystone · agent-core · SQL sandbox

Saved queries: Low accuracy spans, High latency queries, Error analysis

SELECT
  span_id,
  input,
  output,
  scores->>'Accuracy' AS accuracy,
  latency_ms
FROM logs
WHERE created >= now() - interval '7 days'
  -- ->> returns text, so cast before comparing against a number
  AND (scores->>'Accuracy')::numeric < 0.8
ORDER BY created DESC

span_id   input                               output                                              accuracy  latency_ms
a3f8c1d2  What's the status of order #45821?  I'll check that for you. Your order is currently…   0.72      892
b7e2a9c4  Cancel my subscription please       I understand you want to cancel. Before I process…  0.65      1247
c5d1f8e3  Do you offer bulk discounts?        Yes, we offer volume pricing for orders over…       0.78      634
d9a4b2c7  My item arrived damaged             I'm sorry to hear that. Let me help you with…       0.71      1089
e2c8d5f1  How do I update payment method?     To update your payment method, go to Account…       0.69      543

Benchmarking

Run every agent version through the same canonical suites — τ-bench, SWE-bench, WebArena, or your own. Compare versions, models, and prompt changes on identical tasks.

Canonical and custom suites

Side-by-side agent comparison

Prompt-diff scoring on CI

Run a benchmark

Total LLM cost

Total                 $1,104.00
Completion            $271.18
Prompt (cache write)  $421.34
Prompt (cache read)   $206.06
Prompt                $136.13
Reasoning             $69.29
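The cost panel is a simple sum of per-category token spend. As a worked check, the sketch below adds up the line items shown and computes each category's share of the total. The figures are copied from the panel; the variable names are illustrative only.

```python
from decimal import Decimal

# Line items copied from the cost panel (USD).
line_items = {
    "Completion": Decimal("271.18"),
    "Prompt (cache write)": Decimal("421.34"),
    "Prompt (cache read)": Decimal("206.06"),
    "Prompt": Decimal("136.13"),
    "Reasoning": Decimal("69.29"),
}

# Decimal avoids float rounding when adding currency amounts.
total = sum(line_items.values(), Decimal("0"))

# Share of total spend per category, as a percentage rounded to one decimal place.
shares = {name: round(amount / total * 100, 1) for name, amount in line_items.items()}

print(total)
for name, share in shares.items():
    print(f"{name}: {share}%")
```

The line items sum to the displayed total, and cache writes are the largest single category in this example.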

Observability

See what actually happened in production. Inspect every trace, drill into tool calls, and track latency, cost, and quality in real time. Get alerts before your users notice something's wrong.

Scalable trace ingestion

Live performance monitoring

Automations and alerts

Log your first trace

[Figure: Agent comparison on τ-bench for Agent-1, Agent-1-v2, Agent-2, and Agent-2-v2. Columns: % score diff per edit, % score diff, % tool usage, and % accuracy, with column averages of 52.51%, 58.44%, 100%, and 87.3%.]

Evals

Define what good looks like before you ship. Compare prompts side-by-side and catch regressions automatically in CI.

Fast prompt engineering

Flexible, versioned datasets

Automated and human scoring

Run your first eval
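One way to picture catching regressions in CI is a gate that compares a candidate's pass rates with a stored baseline and fails the build on a drop. The sketch below is a generic illustration, not Keystone's gating logic; the eval names, pass rates, and the 0.01 tolerance are all made-up assumptions.

```python
def regressions(baseline, candidate, tolerance=0.01):
    """List evals whose candidate pass rate fell more than `tolerance` below baseline."""
    return sorted(
        name
        for name, base in baseline.items()
        # An eval missing from the candidate counts as a pass rate of 0.0.
        if candidate.get(name, 0.0) < base - tolerance
    )


def gate(baseline, candidate, tolerance=0.01):
    """Print each regression and return a process exit code (0 = pass, 1 = fail)."""
    failed = regressions(baseline, candidate, tolerance)
    for name in failed:
        print(f"regression: {name} {baseline[name]:.3f} -> {candidate.get(name, 0.0):.3f}")
    return 1 if failed else 0


# Illustrative pass rates, not real results.
baseline = {"travel-booking": 0.964, "multi-turn-consistency": 0.906}
candidate = {"travel-booking": 0.970, "multi-turn-consistency": 0.881}

exit_code = gate(baseline, candidate)
print("exit code:", exit_code)
```

In a CI job, the returned code would be handed to `sys.exit` so that a regression blocks the merge or deploy.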

A few lines and you're running. Bring any agent. Keystone handles the infra.

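from keystone import Keystone

# Authenticate with your Keystone API key.
ks = Keystone(api_key="ks_live_...")

# Create a sandbox from the "travel-agent" spec with a 10-minute timeout.
sandbox = ks.sandboxes.create(
    spec_id="travel-agent",
    timeout="10m",
)

# Define an experiment that pairs an eval with the same sandbox spec.
exp = ks.experiments.create(
    eval_id="travel-booking",
    sandbox_spec="travel-agent",
)

# Run the experiment, then print the pass rate and a link to the trace.
report = ks.experiments.run(exp.id)
print(report.pass_rate, report.trace_url)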

Keystone, the sandbox built for AI agents at scale. Purpose-built for evals and observability.

Agent traces are deeply nested and rapidly mutating. Traditional CI can't spin up thousands of hermetic environments or replay complex trajectories. Keystone is purpose-built for agent evaluation — run fleets of sandboxes and query millions of traces instantly.

Learn more about Keystone

51x faster sandbox cold boot
  Competition: 11,024 ms
  Keystone: 214 ms

15.7x faster service spin-up
  Competition: 8,200 ms
  Keystone: 520 ms

8.7x faster fleet warmup
  Competition: 3,420 ms
  Keystone: 392 ms
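The headline multipliers follow directly from the timings: each is the competitor's time divided by Keystone's. The short sketch below recomputes them from the figures shown; the dictionary keys are just labels for this example.

```python
# (competition_ms, keystone_ms) pairs copied from the benchmark figures.
timings_ms = {
    "sandbox cold boot": (11024, 214),
    "service spin-up": (8200, 520),
    "fleet warmup": (3420, 392),
}

# Speedup = competitor time / Keystone time.
speedups = {name: comp / ks for name, (comp, ks) in timings_ms.items()}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.2f}x")
```

The recomputed ratios come out slightly above the headline figures for the first two rows, which suggests the page truncates rather than rounds.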

Principal Engineer, Top-5 Consumer AI App

“Before Keystone, shipping an agent change took a week of eyeballing logs. Now we know it’s better in an hour.”

Enterprise-ready. VPC deployment, SSO, audit logs, and compliance baked in from day one.

SOC 2

HIPAA

ISO 27001

Predictable pricing. Designed to scale.

Start building for free

Plans compare on sandbox compute, concurrent runs, trace retention, sandbox pricing, and features.

Starter (for exploration): $0 / month
  1 GB processed data, plus $5/GB
  20 concurrent
  7 days retention
  Usage: vCPU time plus $0.00003942 per vCPU-second; memory plus $0.00000672 per GB-second
  Unlimited projects, evals, datasets, and experiments
  Book a demo

Pro (for production agents): $149 / month
  5 GB processed data, plus $3/GB
  1,000 concurrent
  30 days retention
  Usage: vCPU time plus $0.00003942 per vCPU-second; memory plus $0.00000672 per GB-second
  Custom evals, environments, and 48hr priority support
  Book a demo

Enterprise (for teams at scale): Custom
  Volume discounts on compute and memory, custom retention and export, BYO cloud or on-prem deployment, SSO + SCIM + audit logs, and premium SLA support for high-volume or privacy-sensitive agent workloads.
  Contact sales
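To get a feel for the usage rates, here is a rough estimate of compute cost for a hypothetical workload. The two rates and the $149 Pro base price come from the pricing above. The workload (10,000 runs of 60 seconds on 2 vCPUs and 4 GB), the assumption that usage is billed linearly on top of the base price, and the omission of processed-data charges are simplifications for illustration, not billing rules confirmed by the page.

```python
from decimal import Decimal

VCPU_RATE = Decimal("0.00003942")  # USD per vCPU-second (from the pricing above)
MEM_RATE = Decimal("0.00000672")   # USD per GB-second (from the pricing above)
PRO_BASE = Decimal("149")          # USD per month for the Pro tier


def run_cost(vcpus: int, memory_gb: int, seconds: int) -> Decimal:
    """Usage cost of one sandbox run, assuming linear billing on both meters."""
    vcpu_cost = Decimal(vcpus) * Decimal(seconds) * VCPU_RATE
    mem_cost = Decimal(memory_gb) * Decimal(seconds) * MEM_RATE
    return vcpu_cost + mem_cost


# Hypothetical workload: 10,000 runs per month, 60 s each, 2 vCPUs and 4 GB.
per_run = run_cost(vcpus=2, memory_gb=4, seconds=60)
monthly_usage = per_run * 10_000

# Assumes usage is added on top of the base price; processed-data fees are excluded.
pro_total = PRO_BASE + monthly_usage

print(f"per run: ${per_run:.7f}")
print(f"monthly usage: ${monthly_usage:.2f}")
print(f"Pro total: ${pro_total:.2f}")
```

Under these assumptions, per-run compute is a fraction of a cent, so the base price dominates until run volume or sandbox size grows substantially.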

Ship agents with a straight face.
