Polarity Research
Notes from the applied-research team working on last-mile agent reliability — behavior detection, trajectory replay, and the evaluation stack that gets agents to ship.
The 80% Boundary Problem: Why Agents Escape Their Guardrails
The category of failures most evals miss — and how Polarity's invariants catch them before production.
Agent Regression Testing: Cutting Detection from Days to Minutes
How we replay production trajectories against candidate fixes and gate those fixes in CI before they ship.
Introducing the Polarity Agent Sandbox
A scoped, reproducible environment for replaying agents against real production data — without touching real users.
Agent Search: Querying Trajectories at a Behavioral Level
Beyond filters and keyword search — find the trajectory you actually want, by what the agent did.
Agent Judge: Cheaper, More Accurate Trajectory-Level Evaluators
Harness-based judges that grade an entire trajectory at a fraction of the cost of a standard LLM-as-judge setup.
Behavior Discovery: Surfacing Failure Modes from Unlabeled Traces
How we cluster unlabeled production trajectories into recurring behaviors — without an eval suite to compare against.