QA for Teams Shipping Agents Weekly
The constraint is cycle time, not quality
High-velocity teams do not compromise on quality. They compromise on sequential slowness. The question is not "how do we ship faster at the cost of quality" but "how do we keep the quality bar while staying inside a 5-minute feedback window."
Agent validation historically involved either a shallow eval (fast but misses the behavior) or a slow regression process (thorough but days long). Neither fits a team shipping weekly.
The viable shape is:
- Pre-commit: lightweight local sandbox runs on the engineer's machine.
- Pre-merge: full sandbox replay in CI, gates the merge.
- Pre-deploy: compliance gate if needed; otherwise deploy on green sandbox.
- Post-deploy: observability watches drift.
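The four stages above can be sketched as data with a budget attached to each. This is an illustrative sketch, not a Paragon API; the stage names, checks, and budget numbers are assumptions drawn from the surrounding text.

```python
# Illustrative: the four validation stages, each with a wall-clock budget
# in minutes. Budgets are assumed values, not platform defaults.
STAGES = [
    {"stage": "pre-commit",  "check": "local sandbox run",    "budget_min": 5},
    {"stage": "pre-merge",   "check": "full sandbox replay",  "budget_min": 15},
    {"stage": "pre-deploy",  "check": "compliance gate",      "budget_min": 10},
    {"stage": "post-deploy", "check": "observability watch",  "budget_min": None},  # continuous
]

def over_budget(stage_name: str, elapsed_min: float) -> bool:
    """True if a stage run blew its budget (post-deploy has none)."""
    stage = next(s for s in STAGES if s["stage"] == stage_name)
    return stage["budget_min"] is not None and elapsed_min > stage["budget_min"]
```

A budget that exists in a config file, not just in heads, is a budget the team can alert on.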
Every step has a budget. The pre-merge step is the tightest: if it takes longer than a coffee, the team stops respecting it.
Why slow validation breaks weekly shipping
Two failure modes when validation runs slow.
Engineers route around it. If the sandbox run takes 90 minutes, the gate gets "overridden" to hit a Friday deadline. The gate stops gating.
The loop bifurcates. The team stops running full validation in CI and shifts to "we'll validate after deploy." That is just regression testing with users as the test harness. Detection moves from minutes to days.
Weekly-shipping teams that run agents in production and stay healthy share one practice: sandbox runs finish in the same wall-clock time as their other CI checks. Build, lint, test, sandbox. Same budget.
Where the sandbox sits in the pipeline
A typical agent pipeline in a weekly-shipping team.
1. Commit to a PR branch.
2. Build the agent package (model, prompt, tools, runtime).
3. Lint and unit tests on the surrounding code.
4. Eval suite on fixed reference data (model and prompt regression).
5. Sandbox replay on a slice of recent production traffic.
6. Deploy gate based on sandbox result.
7. Canary deploy for percentage rollout.
8. Full deploy after canary passes.
9. Observability watch in production.
Steps 4 and 5 are different. Evals catch prompt regressions on reference data. Sandbox catches behavior regressions on real traffic. Weekly-shipping teams run both because they catch different failure classes.
The sandbox gate runs on every PR that touches agent code. For non-agent code (infrastructure, unrelated services), the gate can skip.
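The skip rule can be as simple as a path check on the PR's changed files. A minimal sketch, assuming a hypothetical repo layout where agent behavior lives under a few known directories:

```python
# Sketch: decide whether a PR needs the sandbox gate, based on the files
# it touches. The path prefixes are hypothetical; adapt to your repo layout.
AGENT_PATHS = ("agents/", "prompts/", "tools/")

def needs_sandbox_gate(changed_files: list[str]) -> bool:
    """Gate only PRs that change agent behavior (model, prompt, tools, agent code)."""
    return any(f.startswith(AGENT_PATHS) for f in changed_files)
```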
Keeping sandbox replay under fifteen minutes
Three levers.
Parallelism. 500 sessions replayed serially is 500 times the per-session runtime. 500 sessions in parallel is the slowest-single-session runtime plus overhead. Paragon, like most sandbox platforms, parallelizes by default. Check the concurrency cap in your plan.
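The parallelism lever looks like this in outline: fan sessions out to a worker pool so wall-clock time tracks the slowest session, not the sum. `replay_session` is a stand-in for a real sandbox call, and the worker count is illustrative.

```python
# Sketch of the parallelism lever: replay sessions concurrently so wall-clock
# time approaches slowest-single-session runtime plus overhead.
import time
from concurrent.futures import ThreadPoolExecutor

def replay_session(session_id: str) -> bool:
    time.sleep(0.1)   # stand-in for a real sandbox replay (~100 ms here)
    return True       # pass/fail result of the replayed session

def replay_all(session_ids: list[str], max_workers: int = 50) -> list[bool]:
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(replay_session, session_ids))
```

With 50 workers, 500 sessions run as roughly 10 waves of per-session latency instead of 500 sequential runs.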
Traffic sampling. Run the full production slice on merges to the deploy branch. Run a sampled slice (50-100 sessions) on PR branches. The PR sample catches obvious regressions quickly; the full slice catches the long tail before deploy.
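The sampling lever, sketched under assumed branch names: full slice on the deploy branch, a fixed-size deterministic sample on PR branches. Seeding the sampler with the branch name keeps a PR's sample stable across reruns.

```python
# Sketch of the sampling lever. Branch names and sample size are assumptions.
import random

def pick_slice(sessions: list[str], branch: str, pr_sample: int = 100) -> list[str]:
    if branch in ("main", "deploy"):       # assumed deploy-branch names
        return sessions                    # full production slice
    rng = random.Random(branch)            # deterministic per branch
    return rng.sample(sessions, min(pr_sample, len(sessions)))
```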
Incremental replay. Cache the replay baseline (current production version's behavior on the sampled slice). Only run the new version. Compare against cached baseline. Cuts the work roughly in half.
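The incremental lever in outline: the baseline side of every comparison becomes a cache hit, so only the candidate version runs. `run_in_sandbox` and the cache shape are assumptions for illustration, not a Paragon interface.

```python
# Sketch of incremental replay: cache the production version's behavior per
# session and diff the new version against the cache instead of re-running both.
baseline_cache: dict[str, str] = {}   # session_id -> baseline behavior digest

def matches_baseline(session_id: str, run_in_sandbox) -> bool:
    """Replay the candidate version only; the baseline side is a cache lookup."""
    new_behavior = run_in_sandbox(session_id)   # one run per session, not two
    return new_behavior == baseline_cache.get(session_id)
```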
Most teams running Paragon on agent PRs land at 5-10 minutes for a 200-session sample on PR, 15-20 minutes for a 1,000-session slice on deploy branch. Inside the budget.
The feedback loop
The sandbox fails a PR. What happens next.
The report. A good sandbox failure report is specific: here are the 12 sessions where behavior diverged, here are the 3 that are regressions, and here are the 2 boundary violations. Not just "agent failed QA."
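The shape of such a report can be made concrete. The field names below are illustrative, not a Paragon schema:

```python
# Sketch of a specific failure report, following the shape described above.
from dataclasses import dataclass, field

@dataclass
class SandboxReport:
    diverged: list[str] = field(default_factory=list)      # sessions whose behavior changed
    regressions: list[str] = field(default_factory=list)   # divergences judged worse
    boundary_violations: list[str] = field(default_factory=list)

    def summary(self) -> str:
        return (f"{len(self.diverged)} diverged, "
                f"{len(self.regressions)} regressions, "
                f"{len(self.boundary_violations)} boundary violations")
```

The point of the structure: an engineer can jump straight to the three regression sessions instead of re-deriving them from a pass/fail bit.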
The fix. Engineer reads the report, reproduces the failing session locally in the dev sandbox, fixes the prompt or the code, reruns. Total loop: 15-30 minutes for a typical fix.
The baseline update. When the fix lands and the new version becomes the production baseline, the sandbox refreshes its baseline automatically. The cycle continues.
Teams that run this loop well have a per-agent dashboard showing rolling metrics: replay pass rate, time-to-fix on failures, detection precision. Those numbers tell you whether the loop is healthy.
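Those rolling metrics are cheap to compute from a log of gate runs. A minimal sketch, assuming a hypothetical run record with `passed` and `fix_minutes` fields:

```python
# Sketch of per-agent loop health: rolling replay pass rate and median
# time-to-fix over the last `window` gate runs. Record fields are assumptions.
from statistics import median

def loop_health(runs: list[dict], window: int = 50) -> dict:
    recent = runs[-window:]
    passed = sum(1 for r in recent if r["passed"])
    fix_times = [r["fix_minutes"] for r in recent if not r["passed"]]
    return {
        "pass_rate": passed / len(recent),
        "median_time_to_fix_min": median(fix_times) if fix_times else None,
    }
```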
FAQ
What if my agent has no production traffic yet?
Use synthetic flows pre-launch, switch to real traffic after launch. Synthetic covers imagined failures; real traffic covers everything else.
Do I run the sandbox on every PR?
Only on PRs that change agent behavior — model, prompt, tools, agent code. Non-agent PRs skip.
What's a reasonable pass rate?
Mature agents: >95% first-run pass. Rapidly evolving: 70–85%. A constant 100% means coverage is too shallow; below 60% means the agent or baseline needs work.
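Those rules of thumb can be turned into a dashboard check. The bands are the article's heuristics, not a platform-defined metric:

```python
# Sketch: classify a first-run pass rate using the rough bands above.
def classify_pass_rate(rate: float, mature: bool) -> str:
    if rate >= 0.999:
        return "coverage too shallow"   # a constant 100% suggests weak checks
    if rate < 0.60:
        return "agent or baseline needs work"
    if mature:
        return "healthy" if rate > 0.95 else "investigate"
    return "healthy" if 0.70 <= rate <= 0.85 else "investigate"
```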
If you want to start using Paragon, check out the docs.