Infrastructure that fails your agent before your users do.

Book a Demo
Sandboxed in 200ms. Scales to fleets. Evals built in.
Composio · Cal.com · Clover · Vercel

Keystone runs your agents in Prod Replica Sandboxes. Replay every trajectory, score every run, and gate every deploy. No CI rewrites required.

Spec, run, evaluate. Repeat.

Polarity helps you ship AI agents from prototype to production and beyond. Once in production, we power your continuous improvement loop, using real eval data to make your agents and LLM applications ever more reliable.

Spec · Sandbox · Run · Score · Replay

Tool calls, bytes, escapes. Captured.

Every tool call, every byte read, every CPU cycle the agent spends. And every attempt to leave the sandbox. Because the agent runs inside Keystone, it's all replayable, queryable, and gated by your invariants.
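As a sketch of what querying that capture can look like (the event type strings below are assumed names for illustration, not confirmed SDK values), you can stream a run's trace and refuse to ship if the agent ever tried to break out:

import { Keystone } from "polarity-keystone"

const ks = new Keystone()

// Stream the captured trace for a run and tally what the agent did.
// "tool_call" and "sandbox_escape_attempt" are assumed event names.
let toolCalls = 0
let escapeAttempts = 0

for await (const ev of ks.traces("3a8f9e4")) {
  if (ev.type === "tool_call") toolCalls++
  if (ev.type === "sandbox_escape_attempt") escapeAttempts++
}

console.log(`${toolCalls} tool calls, ${escapeAttempts} escape attempts`)
if (escapeAttempts > 0) process.exit(1) // gate: any escape attempt fails the run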


Author, fan-out, stream. Every byte.

A few lines of TypeScript and you're scoring agent runs across thousands of hermetic sandboxes, with full traces, scores, and replay. No CI rewrites. No infrastructure to babysit.

import { Keystone } from "polarity-keystone"

const ks = new Keystone()

// Describe what success looks like
const spec = await ks.specs.create({
  id: "rest-api-eval",
  task: "Build an Express server",
  invariants: [
    { type: "file_exists", path: "server.js" },
    { type: "command_exit", command: "npm test" },
    { type: "llm_as_judge", rubric: "Both routes return 200" },
  ],
})

// Run 1000 replicas in parallel, fan-out is one flag
const run = await ks.experiments.runAndWait({
  specId: spec.id,
  replicas: 1000,
})

// Stream every trace event: tool calls, scores, system metrics
for await (const ev of ks.traces(run.id)) {
  console.log(ev.type, ev.name, ev.cpuMs)
}

console.log(run.passRate, run.composite, run.traceUrl)
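Dropping that run into CI as a deploy gate is a one-step sketch. Everything below uses only fields shown above (passRate, traceUrl); the 0.95 threshold is an illustrative choice, not a recommended default:

import { Keystone } from "polarity-keystone"

const ks = new Keystone()

// Run the eval fleet in CI and block the deploy if pass rate dips.
// The 0.95 threshold is illustrative; pick one that fits your spec.
const run = await ks.experiments.runAndWait({
  specId: "rest-api-eval",
  replicas: 1000,
})

console.log(run.passRate, run.traceUrl)
if (run.passRate < 0.95) {
  console.error("Eval gate failed, blocking deploy")
  process.exit(1)
}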

Keystone, the sandbox for AI agents at scale. Purpose-built for evals and observability.

Agent traces are deeply nested and rapidly mutating. Traditional CI can't spin up thousands of hermetic environments or replay complex trajectories. Keystone is purpose-built for agent evaluation. Run fleets of sandboxes and query millions of traces instantly.

51x faster sandbox cold boot
Competition: 11,024 ms · Keystone: 214 ms

15.7x faster service spin-up
Competition: 8,200 ms · Keystone: 520 ms

8.7x faster fleet warmup
Competition: 3,420 ms · Keystone: 392 ms

Spec, replay, dataset. Every feature you need, in the tools you already use.

spec · rest-api-eval
Author a Keystone spec with invariants for this Express API
Unified observability
YAML or TypeScript. Services, fixtures, secrets, network policy, audit hooks, invariants — one file the agent runs against.
Read the spec
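As a sketch of the one-file idea: only id, task, and invariants appear in the quick-start snippet above, so the services, fixtures, and network fields below are assumed shapes for illustration, not confirmed spec schema.

import { Keystone } from "polarity-keystone"

const ks = new Keystone()

// Only id, task, and invariants are confirmed by the quick-start;
// services, fixtures, and network are assumed shapes for illustration.
const spec = await ks.specs.create({
  id: "rest-api-eval",
  task: "Build an Express server",
  services: [{ name: "postgres", image: "postgres:16" }], // assumed field
  fixtures: ["./seed.sql"],                               // assumed field
  network: { allow: ["registry.npmjs.org"] },             // assumed field
  invariants: [
    { type: "file_exists", path: "server.js" },
    { type: "command_exit", command: "npm test" },
  ],
})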
run · 3a8f9e4
Replay this run and bisect to the failing tool call
Trace replay · step 7
Why did the agent skip file_exists here?
Programmable evals
Bisect failures by replaying any tool call. Per-action snapshots, deterministic re-runs, and a diff view against your last green run.
Open replay
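A hedged sketch of that bisect workflow; ks.replay and every field on its result are hypothetical names, shown only to make the shape concrete:

import { Keystone } from "polarity-keystone"

const ks = new Keystone()

// Hypothetical replay API: deterministically re-run a recorded
// trajectory and stop at the first step that breaks an invariant.
const replay = await ks.replay("3a8f9e4", {
  until: "first_invariant_break", // assumed option
})

console.log(replay.failingStep) // assumed field: e.g. the skipped file_exists at step 7
console.log(replay.diffUrl)     // assumed field: diff against your last green run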
Find a dataset
Failed runs (last 24h)
Invariant breaks
High-latency traces
Golden trajectories
Create new dataset
Complex datasets
Promote any failing trajectory into a regression test with one click. Build eval datasets from production failures, not synthetic examples.
Build datasets
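A sketch of that promotion flow; the ks.datasets API and the datasetId field are assumed names, while experiments.runAndWait and passRate come from the quick-start above:

import { Keystone } from "polarity-keystone"

const ks = new Keystone()

// Hypothetical datasets API: promote a failing production run into
// a regression case, then evaluate new agent versions against it.
const dataset = await ks.datasets.create({ id: "invariant-breaks" })
await dataset.addRun("3a8f9e4") // assumed method: the failing trajectory becomes a test case

const run = await ks.experiments.runAndWait({
  specId: "rest-api-eval",
  datasetId: dataset.id, // assumed field
  replicas: 100,
})
console.log(run.passRate)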
Improve my agent prompt based on Keystone evals
Keystone MCP · SQL query
SELECT * FROM runs WHERE specId = 'rest-api-eval' AND score < 0.5 ORDER BY created DESC LIMIT 20
Keystone MCP · Fetching experiment scores
Found 20 low-scoring runs. Common failure: the agent skips the file_exists invariant before running the command_exit check.
I found a pattern in your low-scoring runs. 18 of 20 failures occur when the agent skips the precondition step: it goes straight to the command_exit check without verifying that server.js exists. Adding an explicit file_exists step before command_exit, and grounding the next action in that result, should fix most of these. Here's the updated spec:
MCP server
Query traces, run experiments, and gate deploys directly from Claude, Cursor, or your editor. Keystone exposes its full API as MCP tools.
Set up MCP
Framework agnostic
Works with any stack you're already using. No framework lock-in, no rewrites, no vendor SDKs to manage.
View integrations
eval.ts · TypeScript
import { Keystone } from "polarity-keystone"

const ks = new Keystone()

// 1 spec → 1000 sandboxes
const run = await ks.experiments.runAndWait({
  specId: "rest-api-eval",
  replicas: 1000,
})
Native SDKs
TypeScript, Python, Go, Ruby, C#, Java. Spin up a sandbox, attach any agent, score the run, and stream every event back into your code in a few lines.
Read SDK docs
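A sketch of that loop in the TypeScript SDK; ks.sandboxes, sandbox.exec, sandbox.stop, and sandbox.runId are assumed names, while ks.traces and the event fields match the snippet at the top of the page:

import { Keystone } from "polarity-keystone"

const ks = new Keystone()

// Hypothetical sandbox lifecycle: boot a sandbox, attach any agent,
// stream its trace back into your code, then tear it down.
const sandbox = await ks.sandboxes.create({ specId: "rest-api-eval" }) // assumed API
await sandbox.exec("node agent.js")                                   // assumed API

for await (const ev of ks.traces(sandbox.runId)) { // runId is assumed
  console.log(ev.type, ev.name, ev.cpuMs)
}

await sandbox.stop() // assumed API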

Try Polarity today.

Book a Demo