To spin up a fully isolated sandbox with tools and network.
Agent trajectories recorded, evaluated, and replayable.
Faster time-to-production for agent teams shipping on Keystone.
Sandboxing, observability, and evals for every team. From engineering to product, one platform.
QA for Agents
Reproduce real production failures in an isolated sandbox. Replay user sessions, inspect every tool call, and ship the fix with confidence.
Benchmarking
Run every agent version through the same canonical suites — τ-bench, SWE-bench, WebArena, or your own. Compare versions, models, and prompt changes on identical tasks.
Observability
See what actually happened in production. Inspect every trace, drill into tool calls, and track latency, cost, and quality in real time. Get alerts before your users notice something's wrong.
Evals
Define what good looks like before you ship. Compare prompts side-by-side and catch regressions automatically in CI.
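As a rough illustration of what catching a regression in CI can mean, here is a minimal sketch in plain Python. It does not use the Keystone SDK: `EvalResult`, `find_regressions`, `gate`, and the `max_drop` threshold are all hypothetical names chosen for this example, and the real product may grade and gate runs differently.

```python
# Hypothetical CI gate. Every name here is illustrative, not part of Keystone's API.
from dataclasses import dataclass


@dataclass
class EvalResult:
    task_id: str
    passed: bool


def pass_rate(results):
    """Fraction of tasks that passed."""
    if not results:
        raise ValueError("no eval results to score")
    return sum(r.passed for r in results) / len(results)


def find_regressions(baseline, candidate):
    """Task IDs that passed on the baseline but fail on the candidate."""
    candidate_by_id = {r.task_id: r.passed for r in candidate}
    return sorted(
        r.task_id
        for r in baseline
        if r.passed and candidate_by_id.get(r.task_id) is False
    )


def gate(baseline, candidate, max_drop=0.02):
    """True when the candidate may merge: no per-task regressions and
    an overall pass-rate drop no larger than max_drop (assumed threshold)."""
    drop = pass_rate(baseline) - pass_rate(candidate)
    return drop <= max_drop and not find_regressions(baseline, candidate)
```

In a pipeline, a script along these lines would exit non-zero when `gate` returns False, so the merge is blocked even if the aggregate pass rate is unchanged but an individual task has flipped from pass to fail.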
A few lines and you're running. Bring any agent. Keystone handles the infra.
from keystone import Keystone

ks = Keystone(api_key="ks_live_...")

sandbox = ks.sandboxes.create(
    spec_id="travel-agent",
    timeout="10m",
)

exp = ks.experiments.create(
    eval_id="travel-booking",
    sandbox_spec="travel-agent",
)
report = ks.experiments.run(exp.id)
print(report.pass_rate, report.trace_url)

Keystone, the sandbox built for AI agents at scale. Purpose-built for evals and observability.
Agent traces are deeply nested and rapidly mutating. Traditional CI can't spin up thousands of hermetic environments or replay complex trajectories. Keystone is purpose-built for agent evaluation — run fleets of sandboxes and query millions of traces instantly.
Principal Engineer, Top-5 Consumer AI App
“Before Keystone, shipping an agent change took a week of eyeballing logs. Now we know it’s better in an hour.”
Parallel sandboxes per benchmark sweep — linear cost, no warmup.
Of regressions surfaced pre-merge. Shipped failures dropped 14×.
From agent change to graded eval. Replaces multi-week QA cycles.
Enterprise-ready. VPC deployment, SSO, audit logs, and compliance baked in from day one.
Predictable pricing. Designed to scale.
Starter
For exploration
Pro
For production agents
Enterprise
For teams at scale



