Behavioral Hot Paths: Discrete, Cost-Aware Discovery of Recurring Agent Behavior

Shane Rohan Barakat · Research Division, Polarity Labs · 06.11.26

Language agents repeat themselves. A coding agent clones, builds, and tests; an analytics agent connects, queries, and formats. Each repetition is billed at full frontier cost even after the agent has performed the behavior thousands of times. Compiling a recurring behavior into a cheap specialist is the obvious response, but it requires first answering a question prior work leaves implicit: how does an agent discover, with no human labeling, which of its behaviors recur often enough that compiling them pays off?

Discovery has to be discrete

The tempting move is to embed each behavior as a vector and cluster by cosine distance, declaring two behaviors “the same” when they fall within a threshold. We argue that is the wrong primitive for a system whose headline quantity is the degree of repetition in a workload. Under vectorization, that number is set by an arbitrary encoder and an arbitrary cutoff: change either and “80% of work falls into ten behaviors” becomes a statement about the encoder rather than the agent.

So we keep every decision discrete and inspectable. Agent operations are typed: a tool call with structured arguments. We reduce each to a canonical signature that keeps its structure and abstracts away the volatile parts, so operations that should count as identical become literally equal rather than merely close. A behavior is then a frequent subsequence over the signature alphabet, recovered by classical pattern mining whose only knob is a support count with a clear meaning.

The method, in four discrete stages

1. Canonical signatures. A map κ sends each typed operation to a signature, keeping structure and discarding volatile arguments, so equivalent operations become equal.
2. Frequent-pattern mining. Each trace becomes a string over the signature alphabet; closed frequent subsequences over support θ are the recurring behaviors.
3. Economic trigger. A behavior is compiled only when projected savings on its model-generated step clear the one-time cost of training the specialist.
4. Verified harvest. Training data is collected along the behavior and screened by an automatic verifier, with an LLM judge confined to screening outputs, never to judging similarity.

No stage uses a similarity threshold.

Canonical signatures and the ladder

Canonicalization is not a single choice but a ladder of increasingly abstract maps, from exact byte equality through literal- and structure-normalized forms to semantic equivalence. We do not hide that choice inside a scalar; we expose it as a short list of readable rules that an independent party can audit and contest. To each signature we attach a flag marking whether its variable slot is model-generated (the SQL the agent writes, the field it extracts), because only model-generated steps are candidates for a learned specialist; a deterministic clone or fetch is ordinary caching, which is a different problem.
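The first three rungs of such a ladder can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the rule set used in the study: the operation shape (`{"tool": ..., "args": ...}`), the masking rules, the function names, and the `MODEL_GENERATED_SLOTS` table are all hypothetical. The semantic-equivalence rung is omitted.

```python
import json
import re
from typing import Any

# Hypothetical operation shape (an assumption): {"tool": str, "args": dict}.
Op = dict[str, Any]

# Hypothetical table of (tool, argument) slots whose value the model writes.
MODEL_GENERATED_SLOTS = {("query", "sql")}


def k_exact(op: Op) -> str:
    """Rung 0: exact equality on a stable serialization."""
    return json.dumps(op, sort_keys=True)


def _mask_literals(value: Any) -> Any:
    """Replace volatile literals with typed placeholders, keeping structure."""
    if isinstance(value, dict):
        return {k: _mask_literals(v) for k, v in value.items()}
    if isinstance(value, list):
        return [_mask_literals(v) for v in value]
    if isinstance(value, bool):  # checked before int: bool is a subclass of int
        return "<bool>"
    if isinstance(value, (int, float)):
        return "<num>"
    if isinstance(value, str):
        # Illustrative rules only: quoted strings, then digit runs.
        masked = re.sub(r"'[^']*'", "<str>", value)
        return re.sub(r"\d+", "<num>", masked)
    return value


def k_literal(op: Op) -> str:
    """Rung 1: literal-normalized form."""
    masked = {"tool": op["tool"], "args": _mask_literals(op["args"])}
    return json.dumps(masked, sort_keys=True)


def k_structure(op: Op) -> str:
    """Rung 2: tool name plus the sorted argument names."""
    return f'{op["tool"]}({",".join(sorted(op["args"]))})'


def has_model_generated_slot(op: Op) -> bool:
    """Flag attached to a signature: does any argument slot hold model output?"""
    return any((op["tool"], key) in MODEL_GENERATED_SLOTS for key in op["args"])


def distinct_signatures(ops: list[Op], kappa) -> int:
    """|Sigma| under a given canonicalization map."""
    return len({kappa(op) for op in ops})
```

Two queries that differ only in a literal are unequal under `k_exact` and literally equal under `k_literal`; no distance is computed at any rung, and each rule is a line a reviewer can read and dispute.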

On a real workload, the collapse is dramatic. Across 19,562 tool operations, the signature space contracts 46× at the very first level, before any normalization, and settles to 49 distinct skeletons. The top nine of those cover 94.6% of all calls.

Figure 1. Distinct signatures |Σ| as the canonicalization κ coarsens, over 19,562 tool operations. The space collapses 46× with no encoder and no distance threshold; concentration is already extreme at the exact-bytes level.

The canonical work loop

Mining consecutive signatures over the 1,785-trace coding workload recovers the canonical edit → write → run → inspect loop directly from the data, with no notion of similarity anywhere in the pipeline. The same short behaviors recur across ~12% of traces even though the underlying tasks differ: the “different surface request, identical behavior” structure the method is built to find.

Behavior	Occurrences	Support
edit → write_file	246	12.9%
execute → run_command	241	12.8%
run_command → edit	237	12.7%
read_file → think	232	12.3%
execute → ls	231	12.2%
ripgrep → execute	228	12.1%

Figure 2. Top consecutive-bigram behaviors by trace support (of 1,785 traces). Pattern-growth mining recovers the canonical coding-agent loop without any embedding.
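The counting behind a table like this can be sketched as follows. Note the simplification: this handles only consecutive bigrams, the special case reported above, and is not a closed frequent-subsequence or pattern-growth miner. The function name and the trace representation (a list of signature strings per trace) are assumptions for illustration.

```python
from collections import Counter


def bigram_support(
    traces: list[list[str]], theta: int
) -> dict[tuple[str, str], tuple[int, int]]:
    """Return {bigram: (occurrences, trace_support)} for bigrams whose
    trace support is at least theta.

    occurrences   counts every appearance, including repeats within a trace
    trace_support counts each trace at most once
    """
    occurrences: Counter = Counter()
    support: Counter = Counter()
    for trace in traces:
        bigrams = list(zip(trace, trace[1:]))  # consecutive pairs only
        occurrences.update(bigrams)
        support.update(set(bigrams))           # de-duplicate within the trace
    return {
        bg: (occurrences[bg], support[bg])
        for bg in support
        if support[bg] >= theta
    }


# Toy traces, not taken from the workload.
traces = [
    ["edit", "write_file", "run_command", "edit", "write_file"],
    ["read_file", "think"],
    ["edit", "write_file"],
]
frequent = bigram_support(traces, theta=2)
```

Keeping occurrences and trace support separate matters because a behavior that loops inside a single trace inflates the first without raising the second; the only knob is the integer `theta`.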

The economic trigger

Frequency makes a behavior a candidate; economics decides whether to act. Replacing a model-generated step with a specialist does not eliminate its cost: the agent runs the specialist, screens its output with a cheap online check, and falls back to the frontier when that check rejects. Writing c_call for the frontier step, c_spec for the specialist, c_ver for the check, and α for the rate it accepts, the expected saving per invocation is Δ = α·c_call − c_spec − c_ver, and a behavior is worth compiling once N̂·Δ > c_train, i.e. past a break-even count N* = c_train / Δ. The threshold is denominated in compute, not chosen by hand.
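The arithmetic of the trigger is small enough to write out. The sketch below follows the definitions above directly; the specific values for `c_spec`, `c_ver`, and `c_train` are hypothetical placeholders chosen only to make the example run, and are not measurements from the workload.

```python
import math


def saving_per_call(c_call: float, c_spec: float, c_ver: float, alpha: float) -> float:
    """Delta = alpha * c_call - c_spec - c_ver.

    Without a specialist every call costs c_call. With one, every call pays
    c_spec + c_ver, and the (1 - alpha) rejected share still pays c_call.
    """
    return alpha * c_call - c_spec - c_ver


def break_even(c_train: float, delta: float) -> float:
    """N* = c_train / Delta; infinite when the specialist never saves money."""
    if delta <= 0:
        return math.inf
    return c_train / delta


def worth_compiling(n_hat: float, c_train: float, delta: float) -> bool:
    """Trigger condition: N_hat * Delta > c_train."""
    return n_hat * delta > c_train


# Hypothetical inputs (assumptions, except where noted).
delta = saving_per_call(
    c_call=0.025,   # frontier step cost per call
    c_spec=0.001,   # assumed specialist cost per call
    c_ver=0.001,    # assumed online check cost per call
    alpha=0.95,     # accept rate of the check
)
n_star = break_even(c_train=50.0, delta=delta)  # assumed training cost
```

A non-positive Δ is the case worth guarding: if the check rejects too often or the specialist is not cheap enough, no invocation volume ever repays training, which is why `break_even` returns infinity rather than a negative count.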

The cost gap that makes this worth doing is real. The model-generated step (n = 17,854) costs a median of $0.025 against a cheap-model tier present in the same workload at $0.012–$0.019 per call, a 40–60× spread, which is exactly the headroom the trigger trades against training cost.

Figure 3. Per-call cost of the model-generated step by model, same workload. A frontier tier sits alongside a cheap tier 40–60× lower: the price a specialist must beat.

At the measured median cost, with a 95% accept rate, the break-even lands near N* ≈ 3,600 invocations against a compute-only training cost, comfortably above the feasibility floor, and well inside the volume the busiest behaviors see.

The decision rule discriminates

The same inequality produces five distinct verdicts, not just go/no-go. A frequent, model-generated behavior with varying inputs is compiled; a model-generated step whose inputs collapse onto a few signatures is memoized rather than specialized; a frequent but deterministic step is cached as plain I/O; a model-generated behavior with no available verifier is deferred; and a genuinely rare behavior is left on the frontier because it never repays training.

Behavior	N̂ / yr	Decision
connect → query[agg] → format	18,000	specialize
query[fixed] → format	12,000	cache
fetch → parse	95,000	cache
search → summarize	11,000	defer
query[rare skeleton]	500	frontier

Figure 4. The specialization rule applied to five candidate behaviors at the measured median cost. One rule separates specialization, caching (two kinds), deferral, and frontier fallback.
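One way to read the rule as code is shown below. This is a sketch under assumptions: the `Behavior` fields, the order in which the checks are applied, and the per-row attribute values are inferred from the prose description rather than stated in the source, and `delta` and `c_train` are hypothetical. Decision labels follow the table, where both memoization and plain I/O caching appear as "cache".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Behavior:
    name: str
    n_hat: float            # projected invocations per year
    model_generated: bool   # is the variable slot written by the model?
    inputs_collapse: bool   # do inputs fall onto a few signatures?
    has_verifier: bool      # is an automatic verifier available?


def decide(b: Behavior, delta: float, c_train: float) -> str:
    """Map a candidate behavior to one of the table's decisions.

    The ordering of checks is an assumption made for this sketch.
    """
    if not b.model_generated:
        return "cache"        # deterministic step: plain I/O caching
    if b.inputs_collapse:
        return "cache"        # few distinct inputs: memoize the outputs
    if b.n_hat * delta <= c_train:
        return "frontier"     # too rare to repay training
    if not b.has_verifier:
        return "defer"        # would pay off, but nothing can screen outputs
    return "specialize"


# Attribute values below are inferred for illustration, not measured.
candidates = [
    Behavior("connect → query[agg] → format", 18_000, True, False, True),
    Behavior("query[fixed] → format", 12_000, True, True, True),
    Behavior("fetch → parse", 95_000, False, False, True),
    Behavior("search → summarize", 11_000, True, False, False),
    Behavior("query[rare skeleton]", 500, True, False, True),
]
decisions = [decide(b, delta=0.02175, c_train=50.0) for b in candidates]
```

Every branch is a boolean or the single cost inequality, so a disagreement about a verdict reduces to a disagreement about one named attribute of the behavior.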

Verified harvesting

For a behavior the criterion selects, the corpus already contains every instance the agent ever ran. We collect the input/output pairs at its model-generated step and keep only those that pass an automatic verifier: an execution oracle where one exists (the query runs and matches a reference result; the code passes its tests), or an LLM judge confined to screening those concrete outputs. The judge never decides whether two behaviors are the same; that was settled discretely in stage one. Keeping that boundary sharp is what stops non-reproducible similarity judgments from leaking back into the method. The surviving pairs train a small base model adapted by LoRA, and the same verifier later gates whether the specialist is allowed to serve.
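The screening step can be sketched as a filter over harvested pairs. The function name, the string-typed pairs, and the callable verifier interface are assumptions; the toy oracle compares against a reference table purely for illustration, standing in for running a query or a test suite. Training the specialist is out of scope here.

```python
from typing import Callable, Iterable, Optional

Pair = tuple[str, str]                 # (input, output) at the model-generated step
Verifier = Callable[[str, str], bool]  # True when the output is acceptable


def harvest(
    pairs: Iterable[Pair],
    oracle: Optional[Verifier] = None,
    judge: Optional[Verifier] = None,
) -> list[Pair]:
    """Keep only pairs that pass an automatic verifier.

    The execution oracle is preferred when one exists; the judge is a
    fallback that screens concrete outputs. Neither is ever asked whether
    two behaviors are the same.
    """
    verifier = oracle if oracle is not None else judge
    if verifier is None:
        return []  # no verifier available: nothing is harvested
    return [(x, y) for x, y in pairs if verifier(x, y)]


# Toy oracle (hypothetical): the output must equal a known reference result.
reference = {"count users": "42", "count orders": "7"}


def toy_oracle(x: str, y: str) -> bool:
    return reference.get(x) == y


logged = [("count users", "42"), ("count orders", "9"), ("count items", "3")]
kept = harvest(logged, oracle=toy_oracle)
```

The same `Verifier` callable can be reused at serving time to gate the specialist's outputs, which keeps the harvest criterion and the acceptance criterion identical.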

What this establishes, and what it does not

On 57,939 real agent-execution trace events we validate the method’s premise and machinery: typed operations collapse 46× under discrete canonicalization, frequent-subsequence mining recovers the canonical work loop, the model-generated step shows a real frontier-to-cheap cost spread, and the verified-harvest design corresponds to an operational judge layer running in production.

We are deliberately explicit about what the traces do not yet show. They do not isolate a per-behavior break-even hit rate, a measured harvest yield, or a trained specialist matching the frontier at reduced cost. Those are the next measurements, and we present this as the discovery foundation for self-specializing agents, scoped to behaviors that admit an automatic correctness oracle, rather than as the finished end-to-end result.