Authors
research
April 9, 2026
Behavior Discovery: Surfacing Failure Modes from Unlabeled Traces
How we cluster unlabeled production trajectories into recurring behaviors — without an eval suite to compare against.
The labeling bottleneck
Production trace volumes are enormous, and most teams lack the bandwidth to triage every failed trace, much less label each one. The behaviors that bite users sit in the long tail.
Method
We cluster trajectories by decision pattern — the shape of the tool calls, the relationship between turns — rather than by token similarity. Clusters become candidate behaviors a human can name.
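To make "decision pattern" concrete, here is a minimal sketch of the idea, not the pipeline described above. It assumes each trajectory is a list of turns with hypothetical `tool` and `status` fields. The transition-set signature, the Jaccard similarity, the greedy assignment, and the 0.5 threshold are all illustrative choices.

```python
def decision_signature(trajectory):
    """Reduce a trajectory to the set of tool-call transitions it contains.

    Token content is ignored entirely; only which step follows which matters.
    The `tool` and `status` field names are assumptions for this sketch.
    """
    steps = ["START"] + [f"{t['tool']}:{t['status']}" for t in trajectory] + ["END"]
    return frozenset(zip(steps, steps[1:]))


def jaccard(a, b):
    """Jaccard similarity between two sets of transitions."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_trajectories(trajectories, threshold=0.5):
    """Greedy one-pass clustering: join the first cluster whose
    representative signature is similar enough, else start a new one."""
    clusters = []  # each cluster: {"rep": signature, "members": [indices]}
    for i, traj in enumerate(trajectories):
        sig = decision_signature(traj)
        for c in clusters:
            if jaccard(sig, c["rep"]) >= threshold:
                c["members"].append(i)
                break
        else:
            clusters.append({"rep": sig, "members": [i]})
    return clusters


def turn(tool, status="ok"):
    return {"tool": tool, "status": status}


# Toy traces: two retry loops of different lengths, two normal completions.
traces = [
    [turn("search"), turn("search", "error"), turn("search", "error"), turn("search", "error")],
    [turn("search"), turn("search", "error"), turn("search", "error")],
    [turn("search"), turn("fetch"), turn("answer")],
    [turn("search"), turn("fetch"), turn("fetch"), turn("answer")],
]

clusters = cluster_trajectories(traces)
for c in clusters:
    print(c["members"])
```

In this toy data the two retry loops land in one cluster even though they differ in length, because they share the same set of transitions. A human reviewer would then look at a few members of each cluster and give it a name, such as "retries a failing search until the trace ends".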
What we found
Across one quarter of partner data we found 47 recurring behaviors; 14 of them (about 30%) became shipped behavior monitors.