Agent Regression Testing: Cutting Detection from Days to Minutes
The two detection paths
When an agent regresses, you find out one of two ways.
Slow path: ship the new version. Users interact with it. Some percentage of those interactions fail or behave worse. Users complain, file support tickets, or quietly churn. A support engineer escalates. You investigate. You confirm the regression. You roll back or patch. Median time from deploy to detection: two to four days. Worst case: weeks.
Fast path: replay a slice of recent production traffic through the proposed new version inside a sandbox. The sandbox compares behavior trajectory-by-trajectory to the current production version and surfaces the delta. Median time from proposed deploy to detection: five to fifteen minutes. The bad version never reaches a user.
Same regression. Different detection path. The choice of path is a design decision, not a quality ceiling.
The slow path: days to detect
The slow path is what most agent teams still do in 2026 because evals and observability cover the endpoints (the input and the final answer) but miss the middle of the trajectory.
Uptime Robot's 2026 monitoring guide calls this out directly for drift and quality regressions: "no single request failing; the aggregate gets worse." By the time the aggregate is bad enough to notice, you have already served it to users for days.
The slow path has failure modes built into it:
- Delayed signal. Users do not always report regressions; they leave.
- Noisy signal. Complaints blame user error, data issues, or phase-of-the-moon before they blame a recent deploy.
- High cost to detect. A support engineer takes hours to reproduce a regression; a product engineer takes an afternoon to confirm it. Rolling back has its own tail of consequences.
For product-critical agents, days to detect means days of bad output reaching real users. That is the thing the fast path avoids.
The fast path: minutes to detect
The fast path gates the deploy on a sandbox run.
- Proposed agent version is built in CI.
- Sandbox pulls a recent slice of production traffic (tens to thousands of sessions depending on scale).
- Each session replays through the new version inside the sandbox. Every tool call, browser action, and workflow step is recorded.
- Sandbox compares behavior to the current production version. Regressions are flagged with the exact session and step that changed.
- Deploy gates on the result. Pass → ship. Fail → fix and rerun.
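The gate above can be sketched as a small CI step. Everything here is hypothetical (the `replay_session` and `gate_deploy` names, the session format, agents modeled as callables that return a trajectory); it is a minimal sketch of the comparison logic, not a real sandbox API.

```python
def replay_session(session, agent_version):
    """Hypothetical: replay one logged session through an agent version,
    returning the trajectory (ordered list of steps) it produced."""
    return agent_version(session)

def compare_trajectories(current, proposed):
    """Return the step indices where the two trajectories diverge."""
    deltas = [i for i, (a, b) in enumerate(zip(current, proposed)) if a != b]
    # Extra or missing trailing steps count as deltas too.
    deltas.extend(range(min(len(current), len(proposed)),
                        max(len(current), len(proposed))))
    return deltas

def gate_deploy(sessions, current_version, proposed_version):
    """Replay every session through both versions; flag each session
    whose behavior changed, with the exact steps that differ.
    An empty result means pass -> ship; non-empty means fix and rerun."""
    flagged = {}
    for session_id, session in sessions.items():
        deltas = compare_trajectories(
            replay_session(session, current_version),
            replay_session(session, proposed_version),
        )
        if deltas:
            flagged[session_id] = deltas
    return flagged
```

In a real pipeline the replays would run in parallel and the comparison would be semantic rather than step-equality, but the shape is the same: replay both versions against the same traffic slice, diff, gate.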
Typical sandbox run time for 500 sessions in parallel: 5 to 15 minutes. That is the detection window. No user interaction required. No support tickets. No rollback.
The detection quality is higher, too. Because the sandbox sees the full trajectory, it catches wrong-path regressions that the slow path would miss even with observability tools watching. A regression that triples tool-call count but produces the same output is invisible to observability on answer quality; the sandbox sees the trajectory change directly.
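A concrete (and entirely invented) illustration of that last point: two trajectories that end in the identical answer but differ threefold in tool calls. An output-scoring check sees no change; a trajectory-level check does.

```python
# Two hypothetical trajectories for the same session; step names are invented.
current_run = [
    ("tool_call", "search_orders"),
    ("tool_call", "get_refund_policy"),
    ("answer", "Refund approved."),
]
proposed_run = [
    ("tool_call", "search_orders"),
    ("tool_call", "search_orders"),      # redundant retry
    ("tool_call", "get_refund_policy"),
    ("tool_call", "get_refund_policy"),  # redundant retry
    ("tool_call", "search_orders"),
    ("tool_call", "get_refund_policy"),
    ("answer", "Refund approved."),
]

def final_answer(trajectory):
    return [content for kind, content in trajectory if kind == "answer"][-1]

def tool_call_count(trajectory):
    return sum(1 for kind, _ in trajectory if kind == "tool_call")

# Output scoring: answers are identical, so no regression is visible.
outputs_match = final_answer(current_run) == final_answer(proposed_run)

# Behavioral validation: the tool-call count tripled -- a cost and
# latency regression the output comparison cannot see.
call_ratio = tool_call_count(proposed_run) / tool_call_count(current_run)
```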
What makes the fast path work
Three ingredients.
Real traffic, not synthetic. Synthetic test cases cover what the author imagined. Production traffic covers what users actually do. Regressions live in the combinations nobody wrote a test for. You cannot shrink detection to minutes with synthetic traffic alone because your coverage has holes.
Behavioral validation, not output scoring. Output-scoring evals on a fixed dataset tell you the answer changed. They do not tell you the agent now makes five tool calls instead of two, or drifted outside policy on 3% of sessions. Behavioral validation sees the trajectory.
Automated comparison. The sandbox does the diff. "Here are the 12 sessions where the new version behaved differently, these three are regressions, these five are boundary violations." A human does not have to manually inspect each replay; they inspect the specific flagged deltas.
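The triage step that turns raw deltas into that kind of summary might look like the following sketch. The delta fields and categories are assumptions for illustration; a real system would derive them from policy checks and regression heuristics rather than pre-labeled flags.

```python
from collections import Counter

def classify_delta(delta):
    """Hypothetical triage of one behavioral delta into the buckets a
    reviewer actually cares about. Real classification would run policy
    and outcome checks; here the flags are given directly."""
    if delta["violates_policy"]:
        return "boundary violation"
    if delta["worse_outcome"]:
        return "regression"
    return "benign difference"

def summarize(deltas):
    """Collapse all flagged sessions into counts per category, so a human
    inspects a handful of labeled deltas instead of every replay."""
    return dict(Counter(classify_delta(d) for d in deltas))
```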
When all three are present, regression detection moves from a lagging signal (days) to a leading signal (minutes).
What you cannot shrink
Some regressions only appear at scale or over time.
- Long-tail distribution shifts. A regression that hits one in 50,000 sessions may not show up in a 500-session replay. Scale or time are the detection mechanisms for these.
- Drift that emerges only with fresh data. If the retrieval corpus changes weekly, a sandbox replay against yesterday's traffic will not catch a regression that manifests next week.
- Model-provider-side changes. If an upstream model provider quietly updates weights, a sandbox catching behavior changes is still the fastest mechanism, but it only fires when your CI runs.
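The long-tail point can be made precise. Assuming the regression hits sessions independently at its base rate, the chance a finite replay even contains an affected session follows directly:

```python
import math

# A regression that hits 1 in 50,000 sessions, replayed over 500 sessions.
p_hit = 1 / 50_000
replay_size = 500

# Probability the replay contains at least one affected session.
p_detect = 1 - (1 - p_hit) ** replay_size   # ~0.01, i.e. about a 1% chance

# Sessions needed for a 95% chance of containing one affected session.
sessions_for_95 = math.log(0.05) / math.log(1 - p_hit)  # ~150,000 sessions
```

At roughly 150,000 sessions per replay to reach 95% coverage, a pre-deploy replay stops being the right tool; production scale and time are the detection mechanisms for this class.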
The fast path shrinks the common case. Observability tools still watch for the long-tail cases in production. Both, working together, close the window.
FAQ
What's the baseline?
The current production version. The sandbox runs new and current against the same traffic slice and compares. First-time deploys get policy and failure checks only — regression comparison starts from v2 onward.
What traffic does it replay?
A representative slice from production logs. Narrow agents need roughly 50 sessions; general-purpose agents need 500 to a few thousand.
Does this replace observability?
No. Sandboxes catch common regressions pre-deploy; observability catches long-tail drift post-deploy.
Can I run it on every PR?
Yes. Full suite on deploy-branch merges, lighter version per PR. GitHub Actions integration ships out of the box.
If you want to start using Polarity, check out the docs.