Authors
research
May 12, 2026
Agent Regression Testing: Cutting Detection from Days to Minutes
How we replay production trajectories against candidate fixes — and gate them at CI before they ship.
The detection loop today
Most agent teams find out about a regression the same way: a user complains, an engineer paws through Slack and traces, and a fix ships hours or days later. The median time-to-detect we measured across our design partners was 38 hours.
Replaying production traces
Polarity Live Replay (PLR) re-runs any production trajectory against a candidate fix locally. The exact tool calls, model outputs, and user turns are replayed — the agent code is what changes. Anything that diverges shows up as a regression candidate.
Promotion to CI
Once a failing trajectory has a known fix, you can promote it into a behavior guardrail with one command: uv run plr promote --to-behavior. The next CI build runs every promoted behavior against the candidate change, and merges are gated on the result.
Results
Across the same design partners, time-to-detect dropped from a median of 38 hours to 7 minutes. The cost is upfront: you have to instrument the agent and let production traces accumulate for a week before the catalog is dense enough to be useful.