Behavioral Hot Paths: Discrete, Cost-Aware Discovery of Recurring Agent Behavior
Language agents repeat themselves. A coding agent clones, builds, and tests; an analytics agent connects, queries, and formats. Each repetition is billed at full frontier cost even after the agent has performed the behavior thousands of times. Compiling a recurring behavior into a cheap specialist is the obvious response, but it requires first answering a question prior work leaves implicit: how does an agent discover, with no human labeling, which of its behaviors recur often enough that compiling them pays off?
Discovery has to be discrete
The tempting move is to embed each behavior as a vector and cluster by cosine distance, declaring two behaviors “the same” when they fall within a threshold. We argue that is the wrong primitive for a system whose headline quantity is the degree of repetition in a workload. Under vectorization, that number is set by an arbitrary encoder and an arbitrary cutoff: change either and “80% of work falls into ten behaviors” becomes a statement about the encoder rather than the agent.
So we keep every decision discrete and inspectable. Agent operations are typed: each is a tool call with structured arguments. We reduce each operation to a canonical signature that keeps its structure and abstracts away the volatile parts, so operations that should count as identical become literally equal rather than merely close. A behavior is then a frequent subsequence over the signature alphabet, recovered by classical pattern mining whose only knob is a support count with a clear meaning.
The method, in four discrete stages
- Canonical signatures. A map κ sends each typed operation to a signature, keeping structure and discarding volatile arguments, so equivalent operations become equal.
- Frequent-pattern mining. Each trace becomes a string over the signature alphabet; the closed subsequences whose support is at least θ are the recurring behaviors.
- Economic trigger. A behavior is compiled only when projected savings on its model-generated step clear the one-time cost of training the specialist.
- Verified harvest. Training data is collected along the behavior and screened by an automatic verifier, with an LLM judge confined to screening outputs, never to judging similarity.
No stage uses a similarity threshold.
Canonical signatures and the ladder
Canonicalization is not a single choice but a ladder of increasingly abstract maps, from exact byte equality through literal- and structure-normalized forms to semantic equivalence. We do not hide that choice inside a scalar; we expose it as a short list of readable rules an independent party can audit and contest. To each signature we attach a flag for whether its variable slot is model-generated (the SQL the agent writes, the field it extracts) because only model-generated steps are candidates for a learned specialist; a deterministic clone or fetch is ordinary caching, a different problem.
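The first three rungs of the ladder can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the operation shape (`{"tool": ..., "args": ...}`), the function name `kappa`, the masking rules, and the `MODEL_GENERATED_TOOLS` set are all assumptions made for the example, and the semantic-equivalence rung is left out.

```python
import json
import re
from dataclasses import dataclass
from typing import Any

# Hypothetical operation shape (not from the source): {"tool": str, "args": dict}.
Op = dict[str, Any]

# Assumed rule list: tools whose variable slot is written by the model.
MODEL_GENERATED_TOOLS = {"query", "extract"}


@dataclass(frozen=True)
class Signature:
    key: str               # canonical form; equal keys mean "same operation"
    model_generated: bool  # True if the variable slot is model-written


def _mask(value: Any) -> Any:
    """Replace literal values with placeholders while keeping structure."""
    if isinstance(value, dict):
        return {k: _mask(v) for k, v in value.items()}
    if isinstance(value, list):
        return [_mask(v) for v in value]
    if isinstance(value, str):
        value = re.sub(r"'[^']*'", "'?'", value)          # quoted string literals
        return re.sub(r"\b\d+(?:\.\d+)?\b", "?", value)   # numeric literals
    return "?"  # numbers, booleans, None


def kappa(op: Op, level: int) -> Signature:
    """Map an operation to its signature at one rung of the ladder."""
    if level == 0:    # exact equality of the serialized operation
        key = json.dumps(op, sort_keys=True)
    elif level == 1:  # literal-normalized: same shape, literals masked
        key = json.dumps({"tool": op["tool"], "args": _mask(op["args"])},
                         sort_keys=True)
    elif level == 2:  # structure-normalized: tool plus argument names only
        key = f'{op["tool"]}({",".join(sorted(op["args"]))})'
    else:
        raise ValueError("semantic equivalence is out of scope for this sketch")
    return Signature(key, op["tool"] in MODEL_GENERATED_TOOLS)
```

Two queries that differ only in a literal (`id = 42` versus `id = 7`) get different signatures at level 0 and the same signature at level 1. Every rule is a readable line of code that a reviewer can contest, which is the property the ladder is meant to preserve.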
On a real workload, the collapse is dramatic. Across 19,562 tool operations, the signature space contracts 46× at the very first level, before any normalization, and settles to 49 distinct skeletons. The top nine of those cover 94.6% of all calls.
The canonical work loop
Mining consecutive signatures over the 1,785-trace coding workload recovers the canonical edit → write → run → inspect loop directly from the data, with no notion of similarity anywhere in the pipeline. The same short behaviors recur across ~12% of traces even though the underlying tasks differ: the “different surface request, identical behavior” structure the method is built to find.
| Behavior | Occurrences | Support |
|---|---|---|
| edit → write_file | 246 | 12.9% |
| execute → run_command | 241 | 12.8% |
| run_command → edit | 237 | 12.7% |
| read_file → think | 232 | 12.3% |
| execute → ls | 231 | 12.2% |
| ripgrep → execute | 228 | 12.1% |
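A table of this kind can be produced by counting consecutive signature n-grams. The sketch below is an assumption about how such counting might look rather than the pipeline itself: the function name `mine_ngrams` and its parameters are hypothetical, it handles only contiguous patterns of one fixed length, and it does not apply the closedness filter a full closed-pattern miner would.

```python
from collections import Counter
from typing import Sequence


def mine_ngrams(
    traces: Sequence[Sequence[str]], n: int = 2, theta: int = 2
) -> dict[tuple[str, ...], tuple[int, int]]:
    """Return {pattern: (occurrences, supporting_traces)} for frequent n-grams.

    `occurrences` counts every position where the pattern appears;
    `supporting_traces` counts each trace at most once. A pattern is kept
    when its supporting-trace count is at least `theta`.
    """
    occurrences: Counter = Counter()
    support: Counter = Counter()
    for trace in traces:
        grams = [tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)]
        occurrences.update(grams)    # every occurrence counts
        support.update(set(grams))   # each trace counts once per pattern
    return {
        gram: (occurrences[gram], support[gram])
        for gram in support
        if support[gram] >= theta
    }
```

Because patterns are compared by tuple equality over signatures, the only tunable quantity is the support count `theta`; no distance or similarity score is involved.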
The economic trigger
Frequency makes a behavior a candidate; economics decides whether to act. Replacing a model-generated step with a specialist does not eliminate its cost: the agent runs the specialist, screens its output with a cheap online check, and falls back to the frontier when that check rejects. Writing c_call for the frontier step, c_spec for the specialist, c_ver for the check, and α for the rate it accepts, the expected saving per invocation is Δ = α·c_call − c_spec − c_ver, and a behavior is worth compiling once N̂·Δ > c_train, i.e. past a break-even count N* = c_train / Δ. The threshold is denominated in compute, not chosen by hand.
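The trigger translates directly into code. The sketch below follows the symbols in the paragraph above; the function names are illustrative, and treating a non-positive Δ as "never breaks even" is an assumption of this example.

```python
import math


def expected_saving(c_call: float, c_spec: float, c_ver: float, alpha: float) -> float:
    """Delta = alpha * c_call - c_spec - c_ver, the expected saving per invocation.

    The specialist and the check are always paid for; the frontier cost is
    avoided only on the fraction alpha of invocations the check accepts.
    """
    return alpha * c_call - c_spec - c_ver


def break_even_count(c_train: float, c_call: float, c_spec: float,
                     c_ver: float, alpha: float) -> float:
    """N* = c_train / Delta; infinite when the specialist never saves money."""
    delta = expected_saving(c_call, c_spec, c_ver, alpha)
    if delta <= 0:
        return math.inf
    return c_train / delta


def should_compile(n_hat: float, c_train: float, c_call: float,
                   c_spec: float, c_ver: float, alpha: float) -> bool:
    """Compile only when projected savings N_hat * Delta exceed the training cost."""
    return n_hat * expected_saving(c_call, c_spec, c_ver, alpha) > c_train
```

With placeholder values (not measurements) of c_call = 1.00, c_spec = 0.10, c_ver = 0.05, α = 0.9, and c_train = 750, the saving per invocation is about 0.75 and the break-even count is about 1,000 invocations.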
The cost gap that makes this worth doing is real. The model-generated step (n = 17,854) costs a median of $0.75 per call, against a cheap-model tier present in the same workload at $0.012–$0.019 per call: a 40–60× spread, which is exactly the headroom the trigger trades against training cost.
At the measured median cost, with a 95% accept rate, the break-even lands near N* ≈ 3,600 invocations against a compute-only training cost, comfortably above the feasibility floor, and well inside the volume the busiest behaviors see.
The decision rule discriminates
The same inequality produces five distinct verdicts, not just go/no-go. A frequent, model-generated behavior with varying inputs is compiled; a model-generated step whose inputs collapse onto a few signatures is memoized rather than specialized; a frequent but deterministic step is cached as plain I/O; a model-generated behavior with no available verifier is deferred; and a genuinely rare behavior is left on the frontier because it never repays training.
| Behavior | N̂ / yr | Decision |
|---|---|---|
| connect → query[agg] → format | 18,000 | specialize |
| query[fixed] → format | 12,000 | memoize |
| fetch → parse | 95,000 | cache |
| search → summarize | 11,000 | defer |
| query[rare skeleton] | 500 | frontier |
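The five verdicts described above can be written as a small decision function. This is a sketch under stated assumptions: the `Behavior` fields, the `Verdict` names, and the order in which the checks are applied are choices made for illustration, not something the source specifies.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    SPECIALIZE = "specialize"
    MEMOIZE = "memoize"
    CACHE = "cache"
    DEFER = "defer"
    FRONTIER = "frontier"


@dataclass(frozen=True)
class Behavior:
    n_hat: float           # projected invocations over the planning horizon
    model_generated: bool  # is the variable slot written by the model?
    inputs_vary: bool      # False when inputs collapse onto a few signatures
    has_verifier: bool     # is an automatic verifier available?


def decide(b: Behavior, n_star: float) -> Verdict:
    """Apply the checks in an assumed order; n_star is the break-even count."""
    if not b.model_generated:
        return Verdict.CACHE       # deterministic step: plain I/O caching
    if not b.inputs_vary:
        return Verdict.MEMOIZE     # few distinct inputs: store the outputs
    if b.n_hat <= n_star:
        return Verdict.FRONTIER    # too rare to repay training
    if not b.has_verifier:
        return Verdict.DEFER       # frequent, but outputs cannot be screened
    return Verdict.SPECIALIZE
```

Every branch is a boolean test or a comparison against the break-even count, so the verdict for any behavior can be traced to a specific rule.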
Verified harvesting
For a behavior the criterion selects, the corpus already contains every instance the agent ever ran. We collect the input/output pairs at its model-generated step and keep only those that pass an automatic verifier: an execution oracle where one exists (the query runs and matches a reference result; the code passes its tests), or an LLM judge confined to screening those concrete outputs. The judge never decides whether two behaviors are the same; that was settled discretely in stage one. Keeping that boundary sharp is what stops non-reproducible similarity judgments from leaking back into the method. The surviving pairs train a small base model adapted by LoRA, and the same verifier later gates whether the specialist is allowed to serve.
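A minimal sketch of the harvest and the serving gate follows. The `Pair` type, the function names, and the per-request fallback are assumptions made for illustration; in particular, reading "gates whether the specialist is allowed to serve" as a per-request check with frontier fallback is one interpretation, and a deployment-time pass-rate gate would be another.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Pair:
    prompt: str  # input to the model-generated step
    output: str  # what the model produced for that input


# A verifier inspects one concrete output: an execution oracle where one
# exists, otherwise an LLM judge. It never compares two behaviors.
Verifier = Callable[[Pair], bool]


def harvest(pairs: Iterable[Pair], verifier: Verifier) -> list[Pair]:
    """Keep only the logged pairs whose output passes the verifier."""
    return [pair for pair in pairs if verifier(pair)]


def serve(
    prompt: str,
    specialist: Callable[[str], str],
    frontier: Callable[[str], str],
    verifier: Verifier,
) -> str:
    """Try the specialist first; fall back to the frontier if the check rejects."""
    candidate = Pair(prompt, specialist(prompt))
    if verifier(candidate):
        return candidate.output
    return frontier(prompt)
```

Passing the verifier in as a plain function keeps the boundary explicit: the same callable screens training data and gates serving, and nothing in either path asks whether two behaviors are similar.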
What this establishes, and what it does not
On 57,939 real agent-execution trace events we validate the method’s premise and machinery: typed operations collapse 46× under discrete canonicalization, frequent-subsequence mining recovers the canonical work loop, the model-generated step shows a real frontier-to-cheap cost spread, and the verified-harvest design corresponds to an operational judge layer running in production.
We are deliberately explicit about what the traces do not yet show. They do not isolate a per-behavior break-even hit rate, a measured harvest yield, or a trained specialist matching the frontier at reduced cost. Those are the next measurements, and we present this as the discovery foundation for self-specializing agents, scoped to behaviors that admit an automatic correctness oracle, rather than as the finished end-to-end result.