Infrastructure that fails your agent before your users do.

Book a Demo
Sandboxed in 200ms. Scales to fleets. Evals built in.
Composio · Cal.com · Clover · Vercel

Keystone runs your agents in Prod Replica Sandboxes. Replay every trajectory, score every run, and gate every deploy. No CI rewrites required.

Spec, run, evaluate. Repeat.

Polarity helps you ship AI agents from prototype to production and beyond. Once in production, we power your continuous improvement loop, using real eval data to make your agents and LLM applications ever more reliable.

Spec · Sandbox · Run · Score · Replay

Tool calls, bytes, escapes. Captured.

Every tool call, every byte read, every CPU cycle the agent spends. And every attempt to leave the sandbox. Because the agent runs inside Keystone, it's all replayable, queryable, and gated by your invariants.
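As a sketch of what querying that capture can look like (the event type strings below are assumed names for illustration, not confirmed SDK values), you can stream a run's trace and refuse to ship if the agent ever tried to break out:

import { Keystone } from "polarity-keystone"

const ks = new Keystone()

// Stream the captured trace for a run and tally what the agent did.
// "tool_call" and "sandbox_escape_attempt" are assumed event names.
let toolCalls = 0
let escapeAttempts = 0

for await (const ev of ks.traces("3a8f9e4")) {
  if (ev.type === "tool_call") toolCalls++
  if (ev.type === "sandbox_escape_attempt") escapeAttempts++
}

console.log(`${toolCalls} tool calls, ${escapeAttempts} escape attempts`)
if (escapeAttempts > 0) process.exit(1) // gate: any escape attempt fails the run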


Author, fan-out, stream. Every byte.

A few lines of TypeScript and you're scoring agent runs across thousands of hermetic sandboxes, with full traces, scores, and replay. No CI rewrites. No infrastructure to babysit.

import { Keystone } from "polarity-keystone"

const ks = new Keystone()

// Describe what success looks like
const spec = await ks.specs.create({
  id: "rest-api-eval",
  task: "Build an Express server",
  invariants: [
    { type: "file_exists", path: "server.js" },
    { type: "command_exit", command: "npm test" },
    { type: "llm_as_judge", rubric: "Both routes return 200" },
  ],
})

// Run 1000 replicas in parallel, fan-out is one flag
const run = await ks.experiments.runAndWait({
  specId: spec.id,
  replicas: 1000,
})

// Stream every trace event: tool calls, scores, system metrics
for await (const ev of ks.traces(run.id)) {
  console.log(ev.type, ev.name, ev.cpuMs)
}

console.log(run.passRate, run.composite, run.traceUrl)
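Dropping that run into CI as a deploy gate is a one-step sketch. Everything below uses only fields shown above (passRate, traceUrl); the 0.95 threshold is an illustrative choice, not a recommended default:

import { Keystone } from "polarity-keystone"

const ks = new Keystone()

// Run the eval fleet in CI and block the deploy if pass rate dips.
// The 0.95 threshold is illustrative; pick one that fits your spec.
const run = await ks.experiments.runAndWait({
  specId: "rest-api-eval",
  replicas: 1000,
})

console.log(run.passRate, run.traceUrl)
if (run.passRate < 0.95) {
  console.error("Eval gate failed, blocking deploy")
  process.exit(1)
}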

Keystone, the sandbox for AI agents at scale. Purpose-built for evals and observability.

Agent traces are deeply nested and rapidly mutating. Traditional CI can't spin up thousands of hermetic environments or replay complex trajectories. Keystone is purpose-built for agent evaluation. Run fleets of sandboxes and query millions of traces instantly.

51x faster sandbox cold boot
Competition: 11,024 ms · Keystone: 214 ms

15.7x faster service spin-up
Competition: 8,200 ms · Keystone: 520 ms

8.7x faster fleet warmup
Competition: 3,420 ms · Keystone: 392 ms

Spec, replay, dataset. Every feature you need, in the tools you already use.

spec · rest-api-eval
Author a Keystone spec with invariants for this Express API
Unified observability
YAML or TypeScript. Services, fixtures, secrets, network policy, audit hooks, invariants — one file the agent runs against.
Read the spec
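As a sketch of the one-file idea: only id, task, and invariants appear in the quick-start snippet above, so the services, fixtures, and network fields below are assumed shapes for illustration, not confirmed spec schema.

import { Keystone } from "polarity-keystone"

const ks = new Keystone()

// Only id, task, and invariants are confirmed by the quick-start;
// services, fixtures, and network are assumed shapes for illustration.
const spec = await ks.specs.create({
  id: "rest-api-eval",
  task: "Build an Express server",
  services: [{ name: "postgres", image: "postgres:16" }], // assumed field
  fixtures: ["./seed.sql"],                               // assumed field
  network: { allow: ["registry.npmjs.org"] },             // assumed field
  invariants: [
    { type: "file_exists", path: "server.js" },
    { type: "command_exit", command: "npm test" },
  ],
})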
run · 3a8f9e4
Replay this run and bisect to the failing tool call
Trace replay · step 7
Why did the agent skip file_exists here?
Programmable evals
Bisect failures by replaying any tool call. Per-action snapshots, deterministic re-runs, and a diff view against your last green run.
Open replay
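A hedged sketch of that bisect workflow; ks.replay and every field on its result are hypothetical names, shown only to make the shape concrete:

import { Keystone } from "polarity-keystone"

const ks = new Keystone()

// Hypothetical replay API: deterministically re-run a recorded
// trajectory and stop at the first step that breaks an invariant.
const replay = await ks.replay("3a8f9e4", {
  until: "first_invariant_break", // assumed option
})

console.log(replay.failingStep) // assumed field: e.g. the skipped file_exists at step 7
console.log(replay.diffUrl)     // assumed field: diff against your last green run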
Find a dataset
Failed runs (last 24h)
Invariant breaks
High-latency traces
Golden trajectories
Create new dataset
Complex datasets
Promote any failing trajectory into a regression test with one click. Build eval datasets from production failures, not synthetic examples.
Build datasets
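A sketch of that promotion flow; the ks.datasets API and the datasetId field are assumed names, while experiments.runAndWait and passRate come from the quick-start above:

import { Keystone } from "polarity-keystone"

const ks = new Keystone()

// Hypothetical datasets API: promote a failing production run into
// a regression case, then evaluate new agent versions against it.
const dataset = await ks.datasets.create({ id: "invariant-breaks" })
await dataset.addRun("3a8f9e4") // assumed method: the failing trajectory becomes a test case

const run = await ks.experiments.runAndWait({
  specId: "rest-api-eval",
  datasetId: dataset.id, // assumed field
  replicas: 100,
})
console.log(run.passRate)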
Improve my agent prompt based on Keystone evals
Keystone MCP · SQL query
SELECT * FROM runs WHERE specId = 'rest-api-eval' AND score < 0.5 ORDER BY created DESC LIMIT 20
Keystone MCP · Fetching experiment scores
Found 20 low-scoring runs. Common failure: the agent skips the file_exists invariant before running the command_exit check.
I found a pattern in your low-scoring runs. 18 of 20 failures occur when the agent skips the precondition step: it goes straight to the command_exit check without verifying that server.js exists. Adding an explicit file_exists step before command_exit, and grounding the next action in that result, should fix most of these. Here's the updated spec:
MCP server
Query traces, run experiments, and gate deploys directly from Claude, Cursor, or your editor. Keystone exposes its full API as MCP tools.
Set up MCP
Framework agnostic
Works with any stack you're already using. No framework lock-in, no rewrites, no vendor SDKs to manage.
View integrations
eval.ts · TypeScript
import { Keystone } from "polarity-keystone"

const ks = new Keystone()

// 1 spec → 1000 sandboxes
const run = await ks.experiments.runAndWait({
  specId: "rest-api-eval",
  replicas: 1000,
})
Native SDKs
TypeScript, Python, Go, Ruby, C#, Java. Spin up a sandbox, attach any agent, score the run, and stream every event back into your code in a few lines.
Read SDK docs
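A sketch of that loop in the TypeScript SDK; ks.sandboxes, sandbox.exec, sandbox.stop, and sandbox.runId are assumed names, while ks.traces and the event fields match the snippet at the top of the page:

import { Keystone } from "polarity-keystone"

const ks = new Keystone()

// Hypothetical sandbox lifecycle: boot a sandbox, attach any agent,
// stream its trace back into your code, then tear it down.
const sandbox = await ks.sandboxes.create({ specId: "rest-api-eval" }) // assumed API
await sandbox.exec("node agent.js")                                   // assumed API

for await (const ev of ks.traces(sandbox.runId)) { // runId is assumed
  console.log(ev.type, ev.name, ev.cpuMs)
}

await sandbox.stop() // assumed API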

Try Polarity today.

Book a Demo