April 22, 2026
Agent Judge: Cheaper, More Accurate Trajectory-Level Evaluators
Harness-based judges that grade an entire trajectory at a fraction of the cost of standard LLM-as-judge.
Why a new judge?
LLM-as-judge approaches grade final outputs, not the trajectories that produced them, and their per-trace cost becomes hard to justify as trace volume grows.
The harness
Agent Judge composes lightweight specialists — a tool-call validity checker, a context-freshness checker, a retrieval-grounding checker — into a single trajectory grader.
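To make the composition concrete, here is a minimal sketch of how three specialist checkers could be combined into one trajectory grader. It is an illustration under stated assumptions, not the Agent Judge implementation: the `Step` and `Verdict` types, the `KNOWN_TOOLS` registry, the `max_age_steps` threshold, and the pass/fail aggregation are all hypothetical names and choices introduced for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Step:
    """One step of an agent trajectory (hypothetical schema)."""
    kind: str                      # "tool_call", "retrieval", or "answer"
    payload: Dict = field(default_factory=dict)

@dataclass
class Verdict:
    checker: str
    passed: bool
    reason: str = ""

Checker = Callable[[List[Step]], Verdict]

KNOWN_TOOLS = {"search", "calculator"}  # assumed tool registry

def tool_call_validity(trajectory: List[Step]) -> Verdict:
    # Fail on the first call to a tool that is not in the registry.
    for i, step in enumerate(trajectory):
        if step.kind == "tool_call" and step.payload.get("tool") not in KNOWN_TOOLS:
            return Verdict("tool_call_validity", False, f"step {i}: unknown tool")
    return Verdict("tool_call_validity", True)

def context_freshness(trajectory: List[Step], max_age_steps: int = 3) -> Verdict:
    # Fail if an answer comes more than max_age_steps after the last retrieval.
    last_retrieval = None
    for i, step in enumerate(trajectory):
        if step.kind == "retrieval":
            last_retrieval = i
        elif step.kind == "answer" and last_retrieval is not None:
            if i - last_retrieval > max_age_steps:
                return Verdict("context_freshness", False, f"step {i}: stale context")
    return Verdict("context_freshness", True)

def retrieval_grounding(trajectory: List[Step]) -> Verdict:
    # Fail if an answer cites a document that was never retrieved earlier.
    seen = set()
    for i, step in enumerate(trajectory):
        if step.kind == "retrieval":
            seen.update(step.payload.get("doc_ids", []))
        elif step.kind == "answer":
            missing = set(step.payload.get("cited", [])) - seen
            if missing:
                return Verdict("retrieval_grounding", False,
                               f"step {i}: ungrounded citations {sorted(missing)}")
    return Verdict("retrieval_grounding", True)

def grade_trajectory(trajectory: List[Step], checkers: List[Checker]) -> Dict:
    # Run every checker over the whole trajectory and aggregate the verdicts.
    if not checkers:
        raise ValueError("at least one checker is required")
    verdicts = [check(trajectory) for check in checkers]
    passed_count = sum(v.passed for v in verdicts)
    return {
        "passed": passed_count == len(verdicts),
        "score": passed_count / len(verdicts),
        "verdicts": verdicts,
    }

CHECKERS = [tool_call_validity, context_freshness, retrieval_grounding]
```

Calling `grade_trajectory(trajectory, CHECKERS)` returns an overall pass flag, the fraction of checkers that passed, and the individual verdicts with reasons. Each checker in this sketch is a plain function over the full step list, so adding or removing a specialist does not change the grader; a real harness might weight checkers or call a small model inside one of them.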
Results
On our internal benchmarks, Agent Judge matches GPT-class judges at 12% of the per-trace cost, an 88% reduction in judging spend per trace.
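As a rough illustration of what the 12% figure means at volume, the sketch below projects judging spend for a given number of traces. Only the 0.12 ratio comes from the result above; the function name, the baseline price per trace, and the assumption that the ratio stays constant as volume grows are hypothetical.

```python
def projected_cost(traces: int, baseline_cost_per_trace: float,
                   cost_ratio: float = 0.12) -> dict:
    """Compare judging spend for a baseline judge and a cheaper judge.

    cost_ratio is the cheaper judge's per-trace cost as a fraction of the
    baseline (0.12 is the figure reported above; everything else is assumed).
    """
    if traces < 0 or baseline_cost_per_trace < 0:
        raise ValueError("traces and baseline_cost_per_trace must be non-negative")
    baseline = traces * baseline_cost_per_trace
    agent_judge = baseline * cost_ratio
    return {
        "baseline": baseline,
        "agent_judge": agent_judge,
        "savings": baseline - agent_judge,
    }
```

With a hypothetical baseline of $0.05 per trace, one million traces would cost about $50,000 with the baseline judge and about $6,000 at the 12% ratio.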