Infrastructure that fails your agent before your users do.


Keystone runs your agents in Prod Replica Sandboxes. Replay every trajectory, score every run, and gate every deploy. No CI rewrites required.
Spec, run, evaluate. Repeat.
Polarity helps you ship AI agents from prototype to production and beyond. Once you're in production, we power your continuous improvement loop with real eval data, so your agents and LLM applications keep getting more reliable.
Tool calls, bytes, escapes. Captured.
Every tool call, every byte read, every CPU cycle the agent spends. And every attempt to leave the sandbox. Because the agent runs inside Keystone, it's all replayable, queryable, and gated by your invariants.
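As a rough sketch of what "queryable" could look like in practice, the snippet below aggregates a list of captured trace events in plain TypeScript. The field names (`type`, `name`, `cpuMs`) follow the fields logged in the SDK example on this page. The `TraceEvent` interface, the `"tool_call"` and `"sandbox_escape"` type values, and the `summarize` helper are assumptions made for illustration, not part of the Keystone API.

```typescript
// Hypothetical event shape: field names follow the fields logged in the
// SDK example (ev.type, ev.name, ev.cpuMs); the type values are assumed.
interface TraceEvent {
  type: string
  name: string
  cpuMs: number
}

interface TraceSummary {
  cpuMsByTool: Record<string, number>
  escapeAttempts: number
}

// Total CPU time per tool call and count sandbox-escape attempts.
function summarize(events: TraceEvent[]): TraceSummary {
  const summary: TraceSummary = { cpuMsByTool: {}, escapeAttempts: 0 }
  for (const ev of events) {
    if (ev.type === "tool_call") {
      summary.cpuMsByTool[ev.name] = (summary.cpuMsByTool[ev.name] ?? 0) + ev.cpuMs
    } else if (ev.type === "sandbox_escape") {
      summary.escapeAttempts += 1
    }
  }
  return summary
}

// Made-up sample data, for illustration only.
const sample: TraceEvent[] = [
  { type: "tool_call", name: "read_file", cpuMs: 12 },
  { type: "tool_call", name: "read_file", cpuMs: 8 },
  { type: "tool_call", name: "run_tests", cpuMs: 340 },
  { type: "sandbox_escape", name: "outbound_network", cpuMs: 0 },
]

console.log(summarize(sample))
```

The same loop could consume events as they stream in rather than from an array; the array keeps the sketch self-contained.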
Author, fan-out, stream. Every byte.
A few lines of TypeScript and you're evaluating agent runs across thousands of hermetic sandboxes, with full traces, scoring, and replay. No CI rewrites. No infrastructure to babysit.
    import { Keystone } from "polarity-keystone"

    const ks = new Keystone()

    // Describe what success looks like
    const spec = await ks.specs.create({
      id: "rest-api-eval",
      task: "Build an Express server",
      invariants: [
        { type: "file_exists", path: "server.js" },
        { type: "command_exit", command: "npm test" },
        { type: "llm_as_judge", rubric: "Both routes return 200" },
      ],
    })

    // Run 1000 replicas in parallel, fan-out is one flag
    const run = await ks.experiments.runAndWait({
      specId: spec.id,
      replicas: 1000,
    })

    // Stream every trace event: tool calls, scores, system metrics
    for await (const ev of ks.traces(run.id)) {
      console.log(ev.type, ev.name, ev.cpuMs)
    }

    console.log(run.passRate, run.composite, run.traceUrl)
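One way to act on the result is to turn the pass rate into a deploy gate. The sketch below assumes `run.passRate` is a fraction between 0 and 1, which this page does not state. `RunResult`, `GateDecision`, `gateDeploy`, the 0.95 threshold, and the sample URL are hypothetical names and values chosen for illustration.

```typescript
// Hypothetical subset of the run result; passRate is assumed to be in [0, 1].
interface RunResult {
  passRate: number
  traceUrl: string
}

interface GateDecision {
  allowed: boolean
  reason: string
}

// Allow a deploy only when the pass rate meets the threshold.
function gateDeploy(run: RunResult, minPassRate: number): GateDecision {
  if (run.passRate >= minPassRate) {
    return { allowed: true, reason: `pass rate ${run.passRate} >= ${minPassRate}` }
  }
  return {
    allowed: false,
    reason: `pass rate ${run.passRate} < ${minPassRate}, see ${run.traceUrl}`,
  }
}

// Made-up result, for illustration only.
const decision = gateDeploy(
  { passRate: 0.91, traceUrl: "https://example.invalid/trace" },
  0.95,
)
console.log(decision.allowed, decision.reason)
```

In a pipeline, a blocked decision would typically map to a non-zero exit code so the deploy step stops.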
Keystone, the sandbox built for AI agents at scale. Purpose-built for evals and observability.
Agent traces are deeply nested and rapidly mutating. Traditional CI can't spin up thousands of hermetic environments or replay complex trajectories. Keystone is purpose-built for agent evaluation. Run fleets of sandboxes and query millions of traces instantly.
Spec, replay, dataset. Every feature you need, in the tools you already use.
Don't just take it from us. Listen to our customers.

“Our engineering team is very lean, and the Paragon product is instrumental to us shipping fast and maintaining a high bar of code quality and security.”

“Switching to Paragon has been an incredible experience. It is fast, accurate and does more than the competitors. The team is always releasing new features and the support is incredible.”

“Paragon's proactive bug detection cut our production issues by 40% and doubled our developer onboarding speed. The seamless CI/CD integration made it a frictionless addition to our workflow.”
