Token Optimization for Agents: When Token Usage Is a Correctness Signal

by Jay Chopra · 5 min read

Why the cost framing misses the point

"Token optimization" usually gets pitched as a cost-cutting exercise. Compress the prompt. Swap to a cheaper model. Cache more aggressively.

Those tactics help at the margin. They do not address the dominant cause of agent token waste in 2026, which is behavioral: the agent is doing extra work that did not need to be done.

An agent that used to call two tools and now calls five has a behavior problem, not a prompt-compression problem. An agent that retries a failed tool call three times before giving up has a policy problem. An agent that generates a 400-word response when 80 words would suffice has a prompting problem, but specifically a drift problem, not a compression problem.

The cost framing tells you to minimize tokens. The correctness framing tells you to minimize wasted tokens, which is a different optimization target. The wasted tokens are the ones where the agent is doing the wrong thing.

Four patterns where high tokens mean wrong behavior

1. Wrong-path tool-call chains

An agent used to call search_orders(id) then send_email(order). After a prompt update, it calls search_customers(name) then search_orders(customer_id) then send_email(order). Same final email. Three tool calls instead of two. Tokens consumed by the extra call plus the extra model turn to decide it.

The final output looks correct. An eval passes. The token bill quietly goes up across thousands of sessions. The behavior is the regression; tokens are just how you notice.
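One way to catch this is to diff tool-call paths for matched inputs across agent versions. A minimal sketch, assuming each session is logged as an ordered list of tool-call names keyed by input; the function name and log shape are illustrative, not a specific vendor API:

```python
def flag_longer_paths(baseline, candidate):
    """Flag inputs whose tool-call path grew between agent versions.

    baseline / candidate: dicts mapping an input key to the ordered
    list of tool calls the agent made for that session.
    """
    regressions = {}
    for key, base_path in baseline.items():
        cand_path = candidate.get(key)
        if cand_path is None:
            continue  # no matched replay for this input
        if len(cand_path) > len(base_path):
            regressions[key] = (base_path, cand_path)
    return regressions


baseline = {"refund-123": ["search_orders", "send_email"]}
candidate = {"refund-123": ["search_customers", "search_orders", "send_email"]}

# Flags refund-123: two calls became three for the same input,
# even though the final email is identical.
print(flag_longer_paths(baseline, candidate))
```

The point of keying by input is that the comparison is apples-to-apples: same request, different path.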

2. Retry loops on failed tool calls

A tool returns a transient error. The agent retries. The retry also fails. The agent retries again. By the time the agent gives up, it has burned four extra model turns plus the token cost of four extra tool calls, all for a session that should have either succeeded on the first call or gracefully handed off.

This usually means the agent's retry policy is too loose, or it lacks a concept of "hand off after one failure." Both are behavioral fixes. Composio's 2026 tool-calling guide treats retry discipline as a first-class reliability concern, and it is.
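A "hand off after one failure" policy can be enforced at the tool-call layer rather than left to the model. A minimal sketch, where `call` and the error type are hypothetical stand-ins for the real tool layer:

```python
class TransientToolError(Exception):
    """Stand-in for a transient tool failure (timeout, 5xx, etc.)."""


def call_with_policy(call, max_retries=1):
    """Invoke a tool; retry at most `max_retries` times, then hand off.

    Returning an explicit handoff result stops the agent from burning
    more model turns deliberating over a failing tool.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return {"status": "ok", "result": call(), "attempts": attempts}
        except TransientToolError:
            if attempts > max_retries:
                return {"status": "handoff", "attempts": attempts}


def always_fails():
    raise TransientToolError()

# Two attempts total (one call, one retry), then a clean handoff.
print(call_with_policy(always_fails))
```

Because the policy lives in code, the sandbox can verify it at tool-call time instead of hoping the prompt holds.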

3. Unnecessary re-planning mid-task

Multi-step agents sometimes re-plan when they hit an unexpected tool output. Occasionally that is the right call. Often it is the agent second-guessing itself because the prompt leaves the planning boundary unclear. Every re-plan is a full model turn with the planning prompt and the full context window.

A sandbox that records trajectories shows you the re-plans. You can see which inputs trigger them and tune the planning prompt to stop the wasteful cases.
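Finding the wasteful cases amounts to scanning recorded trajectories for planning turns that immediately follow a tool result, then inspecting what output triggered them. A minimal sketch, assuming the sandbox logs each turn with a `kind` field; the event shape is hypothetical:

```python
def replan_triggers(trajectory):
    """Return the tool outputs that immediately preceded a re-plan turn."""
    triggers = []
    for prev, turn in zip(trajectory, trajectory[1:]):
        if turn["kind"] == "plan" and prev["kind"] == "tool_result":
            triggers.append(prev["output"])
    return triggers


trajectory = [
    {"kind": "plan", "output": None},
    {"kind": "tool_result", "output": "empty result set"},
    {"kind": "plan", "output": None},  # re-plan after a surprising output
    {"kind": "tool_result", "output": "ok"},
]

print(replan_triggers(trajectory))  # ['empty result set']
```

Aggregating these triggers across many sessions shows which input classes provoke re-planning, which is exactly where the planning prompt needs tightening.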

4. Verbose-mode drift

The agent slowly starts producing longer responses over time. No single response is wrong. The distribution shifts toward wordier. At scale this is a meaningful token bill and a UX problem, but there is no individual failure to fix.

This is drift. Uptime Robot's 2026 guide calls out response-distribution drift as a category. Sandboxes catch it by comparing recent traffic replays to older baselines.

What to do about it

The tactical fixes are behavior-level, not compression-level.

  • Compare trajectories. Replay recent production traffic through each new agent version inside a sandbox. If the new version takes longer paths for equivalent inputs, that is the regression.
  • Enforce retry policy. Write explicit rules for when the agent may retry and when it must hand off. Verify the rules at tool-call time inside the sandbox.
  • Constrain planning boundaries. Make it clear in the system prompt when the agent should execute and when it should re-plan. Test both boundaries with injected edge cases in the sandbox.
  • Track response-length distribution over time. Observability tools surface this post-deploy. Sandboxes surface it pre-deploy by comparing replay to baseline.

Compression helps at the margin. Behavior fixes hit the actual source of waste.

When high token usage is actually fine

Not every expensive run is a regression.

  • Legitimately hard sessions. Long, ambiguous user requests genuinely need more planning, more retrieval, and more tool calls. The signal is not raw token count; it is tokens-per-outcome relative to comparable sessions.
  • Intentional depth. Research agents and code-generation agents are built to explore. High tokens are the feature, not the bug.
  • Cold-cache first turns. The first turn of a session often loads context that subsequent turns reuse. Comparing first-turn tokens to mid-session tokens is apples to oranges.

The check is always behavioral: is the agent doing more work than it needs to for this class of task? Sandboxes make that comparison by grouping sessions by intent and comparing within-group.
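The within-group comparison can be sketched as tokens per successful outcome, grouped by intent label, so a research agent's intentional depth is never compared against a refund agent's two-call path. Session fields here are hypothetical:

```python
from collections import defaultdict


def tokens_per_outcome_by_intent(sessions):
    """Mean tokens per successful session, grouped by intent label."""
    totals = defaultdict(lambda: [0, 0])  # intent -> [tokens, successes]
    for s in sessions:
        if s["success"]:
            totals[s["intent"]][0] += s["tokens"]
            totals[s["intent"]][1] += 1
    return {intent: tok / n for intent, (tok, n) in totals.items() if n}


sessions = [
    {"intent": "refund", "tokens": 1200, "success": True},
    {"intent": "refund", "tokens": 1400, "success": True},
    {"intent": "research", "tokens": 9000, "success": True},  # depth is the feature
]

print(tokens_per_outcome_by_intent(sessions))
# {'refund': 1300.0, 'research': 9000.0}
```

A regression is a refund session that starts costing research-session tokens, not a research session that is expensive by design.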

FAQ

Is prompt compression a bad idea?

No — but it doesn't address the dominant source of token waste. Run both: compress prompts, fix behavior.

Can an eval tell me if my agent is using too many tokens?

An eval shows the token count, not whether the count is wrong. A sandbox compares trajectories and tells you the tool-call path changed.

Is a token increase a regression or just harder sessions?

Compare trajectories on matched sessions, not totals. If per-session count is flat, traffic is just harder.

If you want to start using Polarity, check out the docs.

Try Polarity today.