How to Test AI Agents in a Sandbox Before Production

by Shane Barakat · 6 min read

Why pre-deploy testing matters

Atlan's 2026 testing guide puts the production failure rate of agent projects at 80-90%, with missing validation infrastructure as the dominant cause. Evals alone miss the failure modes that actually ship: wrong-path tool calls, drift on live traffic, multi-step hallucinations, and boundary escapes that emerge from combinations nobody wrote a test for.

The fix is pre-deploy testing against real or replayed traffic, inside a sandbox that records what the agent actually does. Not output scoring on a fixed dataset. Behavior verification on traffic that looks like production.

The five-step workflow

Step 1: Plug the agent into a sandbox

Point the agent at the sandbox runtime instead of your production model and tool endpoints. Any standard model API and tool interface works. Paragon, for example, accepts agents that use OpenAI-compatible APIs, function-calling interfaces, or MCP. Integration is closer to a day than a week for teams on standard agent frameworks.
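As a sketch, wiring an OpenAI-compatible agent into a sandbox usually comes down to swapping the base URL. The endpoint and environment variable names below are placeholders, not Paragon's actual configuration:

```python
import os
from openai import OpenAI

# Route the agent's model calls through the sandbox instead of the production
# endpoint; the sandbox proxies the request and records the full trajectory.
# SANDBOX_BASE_URL and SANDBOX_API_KEY are hypothetical placeholders.
client = OpenAI(
    base_url=os.environ["SANDBOX_BASE_URL"],  # e.g. "https://sandbox.example.com/v1"
    api_key=os.environ["SANDBOX_API_KEY"],
)

# The agent code itself does not change; only the endpoint it talks to does.
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Cancel order 1042 for the signed-in user"}],
)
print(response.choices[0].message.content)
```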

What you get: an isolated environment that records every tool call, browser action, and multi-step trajectory the agent takes.
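What one entry in that recording might look like, reduced to a minimal illustrative schema (the field names are assumptions, not any vendor's format):

```python
from dataclasses import dataclass, field

@dataclass
class RecordedStep:
    """One entry in a sandbox trajectory log (illustrative, not a vendor schema)."""
    session_id: str
    step: int                  # position in the multi-step trajectory
    kind: str                  # "model_call", "tool_call", or "browser_action"
    name: str                  # tool or action name, e.g. "refund_order"
    arguments: dict = field(default_factory=dict)
    result: str = ""           # truncated output of the call
    policy_violation: bool = False
```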

Step 2: Declare the tools and the policy

List the tools the agent can call. Write a plain policy: which tools, with what arguments, in what contexts, and what the agent is explicitly not allowed to do. This does not need to be exhaustive on day one. Start with the obvious constraints (no destructive actions on shared resources, no cross-tenant data access) and expand as you see what the agent actually does.
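A policy declaration at this stage can be as simple as a structured allow/deny list. The sketch below uses made-up tool names and limits; the exact format depends on your sandbox:

```python
# Hypothetical policy declaration: which tools, with what argument bounds,
# plus explicit deny rules. Tool names and limits are made up for illustration.
AGENT_POLICY = {
    "allowed_tools": {
        "search_orders": {"max_results": 50},   # argument name -> upper bound
        "get_order": {},                        # read-only, no constraints
        "refund_order": {"amount_usd": 200},    # bounded destructive action
    },
    "denied_tools": ["delete_account", "modify_pricing"],
    "rules": [
        "no destructive actions on shared resources",
        "no cross-tenant data access",
        "refunds above the limit require human review",
    ],
}
```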

The policy is what the sandbox checks at tool-call time. Weak policy means weak enforcement. Microsoft's 2026 Foundry guidance describes structured tool-invocation schemas and just-in-time authorization as the pattern most teams are converging on.
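In practice that enforcement is a just-in-time check the sandbox runs before each invocation goes through. A minimal sketch against the hypothetical policy above (not Foundry's or any vendor's actual API):

```python
def authorize_tool_call(policy: dict, tool: str, arguments: dict) -> tuple[bool, str]:
    """Just-in-time check the sandbox runs before letting an invocation through."""
    if tool in policy["denied_tools"]:
        return False, f"{tool} is explicitly denied"
    constraints = policy["allowed_tools"].get(tool)
    if constraints is None:
        return False, f"{tool} is not in the declared tool list"
    for arg, limit in constraints.items():
        if arg in arguments and arguments[arg] > limit:
            return False, f"{tool}.{arg}={arguments[arg]} exceeds limit {limit}"
    return True, "ok"

# Flagged: the refund exceeds the bound declared in AGENT_POLICY above.
ok, reason = authorize_tool_call(AGENT_POLICY, "refund_order", {"amount_usd": 500})
```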

Step 3: Replay production traffic through the sandbox

Pull a representative slice of recent real traffic. Tens of sessions for a new feature, hundreds or thousands for a mature agent. The sandbox runs each session through the agent version you are about to ship and records everything.
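Conceptually the replay is a loop over recorded sessions. The sandbox client and replay_session call below are hypothetical stand-ins for whatever batch API or CLI your product exposes:

```python
def replay_traffic(sandbox, sessions, candidate_version: str) -> list:
    """Replay recorded production sessions against the candidate agent version.

    `sandbox` is a hypothetical client exposing replay_session(); `sessions`
    is the slice of real traffic pulled for this step (dicts with an "id").
    """
    trajectories = []
    for session in sessions:
        # Re-run the session's inputs against the candidate version, recording
        # every tool call, browser action, and intermediate step it takes.
        trajectories.append(
            sandbox.replay_session(session_id=session["id"], agent_version=candidate_version)
        )
    return trajectories
```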

Avoid synthetic traffic for this step. Synthetic traffic misses the long-tail combinations that cause boundary escapes. You want real traffic with real users, messy data, and edge cases you did not think of.

Step 4: Compare behavior to a baseline

The sandbox computes the delta between the new version and the baseline (usually the current production version, or a prior known-good version).

What the comparison surfaces:

  • Tool-call trajectory changes (new tools called, different order, extra calls)
  • Boundary violations (calls outside declared policy)
  • Hallucinations that propagate across workflow steps
  • Drift in response distribution or retrieval quality

A clean diff means the new version behaves like the old one, presumably improved only where you intended to change it. A dirty diff gives you a specific list of where behavior changed and why.
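A stripped-down sketch of what that diff computes, modeling each trajectory as an ordered list of tool names (real sandboxes also compare arguments, outputs, and retrieval quality; the data shapes here are assumptions):

```python
def diff_trajectories(baseline: list[list[str]], candidate: list[list[str]]) -> dict:
    """Compare per-session tool-call sequences between baseline and candidate.

    Each trajectory is modeled as an ordered list of tool names; real sandboxes
    also diff arguments, outputs, and policy checks. Shapes are illustrative.
    """
    report = {"changed_sessions": [], "new_tools": set(), "extra_calls": 0}
    for i, (old, new) in enumerate(zip(baseline, candidate)):
        if old != new:
            report["changed_sessions"].append(i)
            report["new_tools"] |= set(new) - set(old)
            report["extra_calls"] += max(0, len(new) - len(old))
    return report

# Example: the candidate sneaks in a delete call on the second session.
report = diff_trajectories(
    baseline=[["search_orders", "get_order"], ["get_order", "refund_order"]],
    candidate=[["search_orders", "get_order"], ["get_order", "delete_account", "refund_order"]],
)
# report["new_tools"] == {"delete_account"}  -> a dirty diff worth blocking on
```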

Step 5: Gate the deploy on the result

Three possible outcomes.

Pass. No regressions, no boundary violations, trajectory matches baseline within tolerance. Ship it.

Fail. Specific issues found. Fix them, rerun. Do not override the gate because the pipeline is waiting; the gate exists because your CI cannot see agent behavior.

Review. Borderline results that need human judgment. Route the sandbox report to the right reviewer with the specific changed behavior highlighted.

Paragon, like most agent sandbox products, ships CI integrations so step 5 happens automatically on every merge to the deploy branch.
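If you wire the gate yourself, it reduces to a small script the CI job runs and blocks on. A sketch, with a hypothetical result payload in place of the real sandbox API:

```python
#!/usr/bin/env python3
"""CI gate sketch: block the merge on failures, pause it on borderline results.

The sandbox call is stubbed out; substitute whatever your product exposes
(SDK call, CLI invocation, or HTTP API) to populate `result`.
"""
import sys

def gate(result: dict) -> int:
    violations = result.get("boundary_violations", 0)
    regressions = result.get("regressions", 0)
    if violations or regressions:
        print(f"FAIL: {violations} boundary violations, {regressions} regressions")
        return 1  # CI blocks the merge
    if result.get("needs_review", False):
        print("REVIEW: borderline diff, route the report to a reviewer")
        return 1  # stays blocked until a human approves
    print("PASS: trajectory matches baseline within tolerance")
    return 0

if __name__ == "__main__":
    # result = run_sandbox_replay(agent_version="candidate")  # hypothetical call
    result = {"boundary_violations": 0, "regressions": 0, "needs_review": False}
    sys.exit(gate(result))
```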

What to check at each step

Step | Primary check | Common miss
--- | --- | ---
1. Plug in | Agent successfully routes through sandbox; tool calls intercepted | Misconfigured endpoint leaves some calls hitting prod
2. Declare | Policy covers real constraints, not just the happy path | Too permissive, so boundary checks never trip
3. Replay | Traffic is representative of production, not cherry-picked | Synthetic-only replay misses edge cases
4. Compare | Baseline is current prod, not an old snapshot | Comparing against a stale baseline hides drift
5. Gate | Pass/fail/review is explicit; CI blocks on fail | "No issues found" accepted without reading the report

Common mistakes

  • Using synthetic traffic only. Synthetic coverage catches author-imagined failures. Real traffic catches everything else.
  • Declaring overly broad policy. If the agent has permission to do something it should not, the sandbox will not flag it. Policy declarations want to be specific.
  • Ignoring drift on the baseline. Your baseline is a moving target. Refresh it periodically against recent production behavior.
  • Treating pass as deployment approval. A pass on the sandbox means no behavioral regressions. It does not mean the change is a good idea. Human review still applies.

FAQ

How much traffic do I need to replay?

Cover each user journey ~10x. Narrow agents: ~50 sessions. General-purpose: a few thousand.

What if there's no baseline yet?

The first run sets it. Start with policy violations and obvious failures; regression comparisons begin on v2.

Can I automate it in CI?

Yes. Trigger a sandbox run on every agent-version PR. GitHub Actions integration is built-in.

How long does a run take?

Seconds per short workflow, minutes for long ones. Running 500 sessions in parallel takes 5–15 minutes.

If you want to start using Polarity, check out the docs.

Try Polarity today.