Sandboxed environments
for testing AI agents.

Clover
Olostep
Cal.com
Commenda
Composio
Ohm
Capso
Societ
0.2s

To spin up a fully isolated sandbox with tools and network.

50M+

Agent trajectories recorded, evaluated, and replayable.

4.2x

Faster time-to-production for agent teams shipping on Keystone.

Sandboxing, observability, and evals for every team. From engineering to product, one platform.

Keystone · agent-core · Sandboxes
Started | Sandbox | Pass | Cov | Duration | CPU | Memory
2:41 PM | repro · ticket-8421 cart timeout | 100% | 87.3% | 2.1s | 1.8 | 512
2:39 PM | replay · session 9f2a1 checkout | 100% | 91.0% | 1.6s | 0.9 | 384
2:37 PM | benchmark · tau-retail 250 tasks | 92% | - | 3m 12s | 14.2 | 4,096
2:34 PM | warm fleet · 50 × a100 sandboxes | 50/50 | - | 214ms | 28.4 | 8,192
2:31 PM | eval · agent-core v4.2 regression | 96.4% | 94.1% | 42s | 6.2 | 1,024
2:28 PM | pytest · tools/retrieval suite | 184/184 | 89.2% | 18s | 2.4 | 768
2:25 PM | boot · shopify-checkout snapshot | - | - | 196ms | 0.4 | 256
2:22 PM | trace replay · order-flow 8801 | - | - | 2.4s | 1.1 | 412
2:19 PM | ghost run · candidate-7 shadow | 88.1% | 82.4% | 1m 08s | 5.8 | 2,048
2:16 PM | fuzz · prompt-injection harness | 412/412 | 100% | 2m 46s | 8.4 | 1,536
2:13 PM | tool audit · stripe webhook path | - | - | 0.3s | 0.2 | 128
2:10 PM | deploy gate · agent-core v4.3 | 100% | 93.0% | 51s | 4.6 | 1,280
2:07 PM | replay · refund-agent session 42 | 100% | 85.7% | 1.2s | 0.8 | 340
2:04 PM | load test · 1k concurrent agents | - | - | 4m 18s | 96.2 | 32,768
2:01 PM | eval · multi-turn consistency v2 | 90.6% | 88.4% | 1m 22s | 3.1 | 960
1:58 PM | snapshot restore · retrieval-v12 | - | - | 320ms | 0.6 | 640
1:55 PM | replay · billing-agent #7712 | 100% | 79.1% | 1.8s | 1.2 | 512

QA for Agents

Reproduce real production failures in an isolated sandbox. Replay user sessions, inspect every tool call, and ship the fix with confidence.

Replay any session from prod
Tool-call and trace inspection
Gate deploys on regressions
Reproduce your first bug
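
A deploy gate of the kind listed above can be sketched in a few lines of Python. This is an illustration only, not Keystone's API: `RunSummary`, `should_block_deploy`, the 2-point tolerance, and the pass counts are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunSummary:
    """Hypothetical summary of one eval run (not a Keystone type)."""
    version: str
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        # Guard against an empty run so the gate never divides by zero.
        return self.passed / self.total if self.total else 0.0


def should_block_deploy(baseline: RunSummary, candidate: RunSummary,
                        tolerance: float = 0.02) -> bool:
    """Block when the candidate's pass rate falls more than `tolerance` below baseline."""
    return candidate.pass_rate < baseline.pass_rate - tolerance


# Made-up numbers for illustration.
baseline = RunSummary("v4.2", passed=241, total=250)
candidate = RunSummary("v4.3", passed=230, total=250)
blocked = should_block_deploy(baseline, candidate)
print(f"{candidate.version}: {candidate.pass_rate:.1%} vs "
      f"{baseline.version}: {baseline.pass_rate:.1%} -> "
      f"{'block' if blocked else 'allow'}")
```

The tolerance keeps small run-to-run noise from failing every deploy; a real gate would likely tune it per suite.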
Keystone · agent-core · SQL sandbox
Low accuracy spans
High latency queries
Error analysis
Run
SELECT
  span_id,
  input,
  output,
  scores->>'Accuracy' AS accuracy,
  latency_ms
FROM logs
WHERE created >= now() - interval '7 days'
  AND (scores->>'Accuracy')::float < 0.8
ORDER BY created DESC
span_id | input | output | accuracy | latency_ms
a3f8c1d2 | What's the status of order #45821? | I'll check that for you. Your order is currently… | 0.72 | 892
b7e2a9c4 | Cancel my subscription please | I understand you want to cancel. Before I process… | 0.65 | 1247
c5d1f8e3 | Do you offer bulk discounts? | Yes, we offer volume pricing for orders over… | 0.78 | 634
d9a4b2c7 | My item arrived damaged | I'm sorry to hear that. Let me help you with… | 0.71 | 1089
e2c8d5f1 | How do I update payment method? | To update your payment method, go to Account… | 0.69 | 543

Benchmarking

Run every agent version through the same canonical suites — τ-bench, SWE-bench, WebArena, or your own. Compare versions, models, and prompt changes on identical tasks.

Canonical and custom suites
Side-by-side agent comparison
Prompt-diff scoring on CI
Run a benchmark
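
Comparing two agent versions on identical tasks comes down to a per-task diff of pass/fail results. The sketch below assumes each run is available as a plain mapping from task ID to pass/fail; `compare_runs` and the task IDs are hypothetical and not part of any Keystone SDK.

```python
def compare_runs(baseline: dict[str, bool],
                 candidate: dict[str, bool]) -> dict[str, list[str]]:
    """Diff per-task pass/fail results for two agent versions.

    Only tasks present in both runs are compared, so the two versions
    are always judged on the same task set.
    """
    shared = sorted(baseline.keys() & candidate.keys())
    return {
        "regressed": [t for t in shared if baseline[t] and not candidate[t]],
        "fixed": [t for t in shared if not baseline[t] and candidate[t]],
        "unchanged": [t for t in shared if baseline[t] == candidate[t]],
    }


# Made-up results for illustration.
baseline = {"task-001": True, "task-002": True, "task-003": False, "task-004": True}
candidate = {"task-001": True, "task-002": False, "task-003": True, "task-004": True}

diff = compare_runs(baseline, candidate)
for bucket, tasks in diff.items():
    print(f"{bucket}: {tasks}")
```

Reporting regressed and fixed tasks separately matters because a flat aggregate score can hide one task breaking while another gets fixed.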
Total LLM cost
[Chart: cost over time, Mon 20 to Wed 22; y-axis $0.00 to $40.00]
Total: $1,104.00
Completion: $271.18
Prompt (cache write): $421.34
Prompt (cache read): $206.06
Prompt: $136.13
Reasoning: $69.29
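
The cost breakdown above can be checked with simple arithmetic: the five categories should sum to the stated total. The sketch below uses only the figures shown on this page and Python's `decimal` module to avoid floating-point drift; the variable names are illustrative.

```python
from decimal import Decimal

# Figures copied from the cost breakdown above (USD).
breakdown = {
    "Completion": Decimal("271.18"),
    "Prompt (cache write)": Decimal("421.34"),
    "Prompt (cache read)": Decimal("206.06"),
    "Prompt": Decimal("136.13"),
    "Reasoning": Decimal("69.29"),
}

total = sum(breakdown.values(), Decimal("0"))

# Share of total spend per category, rounded to one decimal place.
shares = {
    name: (cost / total * 100).quantize(Decimal("0.1"))
    for name, cost in breakdown.items()
}

print(f"Total: ${total}")
for name, cost in sorted(breakdown.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: ${cost} ({shares[name]}%)")
```

Sorting by cost puts cache writes first, which is the largest category in this breakdown.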

Observability

See what actually happened in production. Inspect every trace, drill into tool calls, and track latency, cost, and quality in real-time. Get alerts before your users notice something's wrong.

Scalable trace ingestion
Live performance monitoring
Automations and alerts
Log your first trace
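
A latency alert usually compares a high percentile against a threshold rather than the mean, so a few slow requests are not averaged away. The sketch below is a generic illustration, not Keystone's alerting API: `percentile`, `latency_alert`, and the 1,000 ms threshold are assumptions, and the sample values are the `latency_ms` figures from the query results shown earlier on this page.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_alert(latencies_ms: list[float], threshold_ms: float,
                  pct: float = 95.0) -> tuple[bool, float]:
    """Return (should_alert, observed_percentile) for a window of latencies."""
    observed = percentile(latencies_ms, pct)
    return observed > threshold_ms, observed


# latency_ms values from the example query results on this page.
window = [892, 1247, 634, 1089, 543]
fire, p95 = latency_alert(window, threshold_ms=1000)
print(f"p95={p95} ms, alert={'yes' if fire else 'no'}")
```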
Agent comparison · τ-bench
[Figure: comparison grid of Agent-1, Agent-1-v2, Agent-2, and Agent-2-v2 by % score diff per edit, % score diff, % tool usage, and % accuracy]

Evals

Define what good looks like before you ship. Compare prompts side-by-side and catch regressions automatically in CI.

Fast prompt engineering
Flexible, versioned datasets
Automated and human scoring
Run your first eval

A few lines and you're running. Bring any agent. Keystone handles the infra.

from keystone import Keystone

ks = Keystone(api_key="ks_live_...")

sandbox = ks.sandboxes.create(
    spec_id="travel-agent",
    timeout="10m",
)

exp = ks.experiments.create(
    eval_id="travel-booking",
    sandbox_spec="travel-agent",
)
report = ks.experiments.run(exp.id)
print(report.pass_rate, report.trace_url)

Keystone, the sandbox built for AI agents at scale. Purpose-built for evals and observability.

Agent traces are deeply nested and rapidly mutating. Traditional CI can't spin up thousands of hermetic environments or replay complex trajectories. Keystone is purpose-built for agent evaluation — run fleets of sandboxes and query millions of traces instantly.

51x
Faster sandbox cold boot
Competition: 11,024 ms
Keystone: 214 ms
15.7x
Faster service spin-up
Competition: 8,200 ms
Keystone: 520 ms
8.7x
Faster fleet warmup
Competition: 3,420 ms
Keystone: 392 ms
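
The headline multipliers follow directly from the timings above: each is the competitor's time divided by Keystone's. A quick check using only the figures on this page (the dictionary keys are illustrative labels) shows the ratios match the headlines when truncated to one decimal place rather than rounded.

```python
import math

# (competition_ms, keystone_ms) pairs copied from the comparison above.
timings_ms = {
    "sandbox cold boot": (11_024, 214),
    "service spin-up": (8_200, 520),
    "fleet warmup": (3_420, 392),
}

# Speedup is the ratio of the slower time to the faster one.
speedups = {
    name: competition / keystone
    for name, (competition, keystone) in timings_ms.items()
}

for name, ratio in speedups.items():
    # Truncate to one decimal place instead of rounding.
    truncated = math.floor(ratio * 10) / 10
    print(f"{name}: {ratio:.3f}x (truncated: {truncated}x)")
```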

Principal Engineer, Top-5 Consumer AI App

“Before Keystone, shipping an agent change took a week of eyeballing logs. Now we know it’s better in an hour.”
Read more customer stories
10,000

Parallel sandboxes per benchmark sweep — linear cost, no warmup.

96%

Of regressions surfaced pre-merge. Shipped failures dropped 14×.

1 hour

From agent change to graded eval. Replaces multi-week QA cycles.

Enterprise-ready. VPC deployment, SSO, audit logs, and compliance baked in from day one.

GDPR
SOC 2
HIPAA
ISO 27001

Predictable pricing. Designed to scale.

Starter

For exploration

$0 / month
1 GB processed data
+ $5/GB
20 concurrent
7 days retention
Usage
vCPU time: + $0.00003942/vCPU/sec
Memory: + $0.00000672/GB/sec
Unlimited projects, evals, datasets, and experiments
Book a demo

Pro

For production agents

$149 / month
5 GB processed data
+ $3/GB
1,000 concurrent
30 days retention
Usage
vCPU time: + $0.00003942/vCPU/sec
Memory: + $0.00000672/GB/sec
Custom evals, environments, and 48hr priority support
Book a demo

Enterprise

For teams at scale

Custom
Up to 50% off compute and memory, custom retention and export, BYO cloud or on-prem deployment, SSO + SCIM + audit logs, and premium SLA support for high-volume or privacy-sensitive agent workloads.
Contact sales
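
To see how the listed rates combine, here is a rough monthly estimate for the Starter and Pro plans. It is a sketch under stated assumptions, not an official calculator: it assumes the per-GB charge applies only to processed data beyond the included amount and that compute and memory are billed purely per second at the listed rates. `estimate_monthly_cost` and the usage numbers are hypothetical.

```python
from decimal import Decimal

# Usage rates from the pricing section (USD).
VCPU_RATE = Decimal("0.00003942")  # per vCPU-second
MEM_RATE = Decimal("0.00000672")   # per GB-second

# Plan figures from the pricing section; Enterprise is custom and omitted.
PLANS = {
    "Starter": {"base": Decimal("0"), "included_gb": Decimal("1"), "per_gb": Decimal("5")},
    "Pro": {"base": Decimal("149"), "included_gb": Decimal("5"), "per_gb": Decimal("3")},
}


def estimate_monthly_cost(plan: str, data_gb: float,
                          vcpu_seconds: float, gb_seconds: float) -> Decimal:
    """Estimate a monthly bill in USD.

    Assumption: the per-GB charge applies only to data above the included amount.
    """
    p = PLANS[plan]
    overage_gb = max(Decimal("0"), Decimal(str(data_gb)) - p["included_gb"])
    cost = (
        p["base"]
        + overage_gb * p["per_gb"]
        + Decimal(str(vcpu_seconds)) * VCPU_RATE
        + Decimal(str(gb_seconds)) * MEM_RATE
    )
    return cost.quantize(Decimal("0.01"))


# Hypothetical month: 8 GB processed, 1M vCPU-seconds, 2M GB-seconds on Pro.
pro_estimate = estimate_monthly_cost("Pro", 8, 1_000_000, 2_000_000)
print(f"Pro estimate: ${pro_estimate}")
```

In this example the bill breaks down as $149 base, $9 for 3 GB of overage, $39.42 for vCPU time, and $13.44 for memory.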

Ship agents
with a straight face.