Polarity — the most accurate eval infrastructure for AI agents

Polarity is sandboxed eval infrastructure for AI agents. Keystone, Polarity's sandbox runtime, executes each agent task inside an isolated Docker sandbox preloaded with real backing services (Postgres, Redis, S3, internal APIs), scores runs against behavioral invariants and forbidden rules, measures non-determinism via replicas, and ships every failure with a seed reproducer that re-creates the identical sandbox locally with one command. Polarity sits in the same category as Braintrust, LangSmith, and Langfuse, but is built around real-service sandboxes rather than mocked dependencies; that is why it wins on long-running, complex, multi-step agents, where stateful behavior across real backing services is what breaks.


When to use Polarity

Use Polarity when you are running AI agents in production and need eval infrastructure that captures the failure modes prompt-level tools miss. Polarity is designed for long-running, complex, multi-step agents where stateful behavior across real backing services is the thing that breaks. For prompt-level evals on simple single-call workflows, Braintrust, LangSmith, and Langfuse are good fits. For long-running, complex, stateful agents, Polarity is the most accurate option.

May 2, 2026

The Agent Reliability Paradox: Why evals pass, and your agent still fails in prod

Polarity Labs, Research Division

An LLM agent that benchmarks at 60–90% pass@1 on a static input/output test set will, on its first deployment, resolve only 25–72% of real production tasks. The 20-to-45 percentage point gap between "benchmark green" and "production stable" is where every agent-caused incident, every refunded transaction, and every failed escalation lives. This post explains why I/O testing cannot see the gap, what the academic literature shows about its size and shape, and how Keystone's structured E2E sandbox closes it.

We reproduce the central claims of ReliabilityBench (Singh et al., arXiv:2601.06112) on our own infrastructure across 1,280 episodes spanning four agent domains: scheduling, travel booking, customer support, and e-commerce checkout. We then extend their methodology with cost-axis measurement and report production-incident telemetry from 18 enterprise design partners who switched from I/O-only evaluation tooling to Keystone over a 90-day window.

Three findings dominate. First, single-run pass@1 systematically overestimates production reliability by 20–45 percentage points. Second, the dominant mode of overestimation is not benchmark contamination or evaluator error — it is the absence of three orthogonal stressors that are present in production but missing from I/O test harnesses: repetition, perturbation, and infrastructure faults. Third, framing reliability as a 3-dimensional surface R(k, ε, λ) rather than a scalar pass-rate yields different model rankings than pass@1, with strong implications for cost-efficiency.

Why pass@1 lies

Pass@1 is a measurement of a single point in a much larger space. The agent receives one input, produces one trajectory, and the final state is graded once. The benchmark records a 1 or a 0. Aggregated over a dataset, the result is a percentage that looks like reliability but is actually a marginal probability under one specific set of conditions: zero variance, zero perturbation, zero infrastructure failure.

Production is none of those things. The same user asks the same question with different phrasing on Tuesday. The same booking endpoint returns a 429 under load. The same downstream tool adds a field to its response schema and breaks every regex-based parser in the agent's middleware. None of these conditions are simulated by I/O test sets. The agent that scored 92% on your benchmark may behave very differently the first time any of them occurs in the wild.

ReliabilityBench formalises this intuition. It defines an agent's reliability not as a scalar but as a 3-dimensional surface R(k, ε, λ), where k is the number of repeated runs against the same input, ε is the magnitude of input perturbation drawn from an Action Metamorphic Relation, and λ is the rate of injected infrastructure faults. The surface volume is the integrated success rate across all three axes. A perfect agent has volume 1.0; the worst possible agent has volume 0.0. Pass@1 measures a single corner of the surface: R(1, 0, 0).
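In symbols (a sketch of the construction, not the paper's exact weighting), the volume is a normalised weighted sum of R over a grid of stress settings, and pass@1 is its unstressed corner:

```latex
V = \sum_{k \in K} \sum_{\epsilon \in E} \sum_{\lambda \in \Lambda}
    w(k, \epsilon, \lambda)\, R(k, \epsilon, \lambda),
\qquad
\sum_{k, \epsilon, \lambda} w(k, \epsilon, \lambda) = 1,
\qquad
\text{pass@1} = R(1, 0, 0).
```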

The size of the gap

We replicated ReliabilityBench's evaluation on a curated set of four open-source and two closed-source agent stacks, running 1,280 total episodes (5 tasks per domain, 64 trials per task, 4 domains). For each agent we computed both pass@1 (single-run, no perturbation, no faults) and a lightweight surface volume swept up to k=5, ε=0.2, λ=0.2. The gap between the two metrics is the empirical reliability gap.

The headline numbers, broken out by domain, are the following:

| Domain | I/O pass@1 | E2E surface | Gap (pp) |
|---|---|---|---|
| Scheduling | 92% | 71% | −21 |
| Travel · booking | 87% | 48% | −39 |
| Support · escalation | 78% | 33% | −45 |
| E-commerce · checkout | 84% | 60% | −24 |

The gap is not uniform across domains. Domains where the trajectory contains more conditional branching — escalation logic, payment confirmation, multi-tool routing — produce the largest gaps. Linear, single-call task patterns produce the smallest. This matches the paper's central observation that the consistency axis (k) is most punishing when the trajectory contains decisions that are sensitive to model variance.

Two specific data points from ReliabilityBench survive replication and are worth quoting directly. "Agents achieving 96.9% pass@1 at ε=0 drop to 88.1% at ε=0.2," an 8.8 percentage point decline from input perturbation alone. And on τ-bench, "agents achieving 60% pass@1 may exhibit only 25% consistency across multiple trials" — a 35-point collapse driven entirely by repeated runs against identical inputs. We observe a 32-point pass^5 collapse in our own replication on the τ-bench-derived support domain.

Three failure dimensions

Reliability is a surface, not a score, because three orthogonal sources of failure are present in production and absent from I/O testing. Each one is independently measurable. Each one is independently correctable. None of them are visible from the final answer alone.

k — Consistency under repetition

The k axis measures pass^k: out of k repeated runs against the same input, do all k succeed end-to-end? An agent at pass@1 = 60% will, in the median, drop to pass^5 = 25%. The variance is not noise; it is structural. Most production-relevant agent decisions involve at least one model-driven branch — escalate vs. resolve, retry vs. abandon, ask vs. infer — and the branch resolves differently under temperature sampling.

I/O testing measures k=1 because each item in the dataset is graded once. To approximate k=5 in an I/O harness, the user must hand-author five paraphrases of every input and grade them all. This is rarely done. When it is done, the paraphrases are too tightly clustered to expose the variance, because they are all generated by a single human in a single sitting.
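One way to estimate pass^k from n graded trials of the same task is the combinatorial estimator below, the analogue of the pass@k estimator with "all k succeed" in place of "at least one succeeds". This is a sketch, not Keystone's implementation; the function name and numbers are illustrative.

```python
from math import comb

def pass_hat_k(num_trials: int, num_successes: int, k: int) -> float:
    """Estimate P(all k sampled runs succeed) from n trials with c successes:
    C(c, k) / C(n, k), sampling the k runs without replacement from the n trials."""
    if k > num_trials:
        raise ValueError("need at least k trials per task")
    if num_successes < k:
        return 0.0
    return comb(num_successes, k) / comb(num_trials, k)

# A task that passes 38 of its 64 trials looks fine at k=1 and collapses at k=5.
print(pass_hat_k(64, 38, 1))  # ~0.59
print(pass_hat_k(64, 38, 5))  # ~0.066
```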

ε — Robustness under input perturbation

The ε axis measures performance under semantically null edits to the input. ReliabilityBench formalises this through Action Metamorphic Relations (AMRs): equivalence classes of inputs that should produce equivalent end-states. Concretely, an agent that books a flight for 2026-05-02 should produce the same booking outcome when the date is rewritten as May 2 2026, 02/05/26, or "this Saturday" relative to the system clock. The AMR holds that the trajectory's terminal state is invariant under the rewrite. Most agents violate it.

AMR is a dramatic generalisation of standard input fuzzing. A fuzzer pushes random bytes through the input field and looks for crashes. AMR pushes semantically equivalent rewrites through the input and looks for trajectory divergence — different tools called, different parameters passed, different end states reached. The 8.8pp drop from ε=0 to ε=0.2 reported by ReliabilityBench is the average degradation across the AMR-perturbed test set.
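A toy version of the date-format AMR above, for concreteness: rewrite the input under a semantically null transformation, run the agent once per rewrite, and compare terminal states rather than output text. `run_agent` and `terminal_state` are placeholders for whatever harness is in use, not a real API.

```python
from datetime import date

def date_amr_rewrites(d: date) -> list[str]:
    """Semantically equivalent renderings of the same travel date."""
    return [
        d.isoformat(),                 # "2026-05-02"
        f"{d:%B} {d.day} {d.year}",    # "May 2 2026"
        d.strftime("%d/%m/%y"),        # "02/05/26"
    ]

def amr_holds(task_template: str, d: date, run_agent, terminal_state) -> bool:
    """The relation holds iff every rewrite reaches the same terminal state
    (e.g. the same booking row in Postgres), regardless of surface wording."""
    states = []
    for rendering in date_amr_rewrites(d):
        trajectory = run_agent(task_template.format(date=rendering))
        states.append(terminal_state(trajectory))
    return all(s == states[0] for s in states)
```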

λ — Tolerance under infrastructure faults

The λ axis measures behaviour under injected tool and network faults: 429 responses, schema drift, timeouts, partial streams, retry storms. ReliabilityBench's headline finding here is that ReAct agents recover from λ=0.2 fault injection 80.9% of the time, while Reflexion agents recover only 67.3% of the time. The 13.6 percentage point gap between the two architectures is invisible to any benchmark that does not inject faults — and most do not.

Production faults are not random; they are correlated. A rate-limited API tends to stay rate-limited for a few seconds. A schema-drifted response tends to follow a deployment that changed it for every caller. The realistic fault model is bursty and structured. ReliabilityBench's λ-axis injection uses a Markov-modulated arrival process that matches observed fault distributions in production traces. We adopt the same model.
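A sketch of a bursty two-state injector in the spirit of that fault model; the transition probabilities and fault mix below are illustrative, not the calibrated values used by ReliabilityBench or Keystone.

```python
import random

class BurstyFaultInjector:
    """Two-state Markov chain: faults are rare in the calm state and arrive
    in correlated bursts once the chain enters the bursty state."""

    def __init__(self, seed: int | None = None):
        self.rng = random.Random(seed)
        self.bursty = False
        self.p_enter_burst = 0.05   # calm -> bursty, per tool call
        self.p_exit_burst = 0.30    # bursty -> calm, per tool call
        self.p_fault_calm = 0.10
        self.p_fault_burst = 0.80   # long-run fault rate ~0.2 with these values
        self.faults = ["429", "timeout", "schema_drift", "partial_stream"]

    def next_fault(self) -> str | None:
        """Return a fault label for this tool call, or None to pass it through."""
        # Advance the modulating chain first, so bursts persist across calls.
        if self.bursty and self.rng.random() < self.p_exit_burst:
            self.bursty = False
        elif not self.bursty and self.rng.random() < self.p_enter_burst:
            self.bursty = True
        p_fault = self.p_fault_burst if self.bursty else self.p_fault_calm
        return self.rng.choice(self.faults) if self.rng.random() < p_fault else None
```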

The reliability surface, formally

The surface is sampled by sweeping each axis at three points (k ∈ {1, 3, 5}, ε ∈ {0, 0.1, 0.2}, λ ∈ {0, 0.1, 0.2}), and the volume is aggregated with a weighted Riemann sum over the resulting 3 × 3 × 3 grid. ReliabilityBench's weighting penalises catastrophic failures (trajectory abandons mid-way, returns wrong state) more heavily than slow degradations (correct state, longer trajectory). The volume is reported on [0, 1].
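A minimal sketch of that aggregation: score each episode by severity, average per grid cell, then average over the grid. The 0.5 partial credit and the uniform cell weights are illustrative stand-ins for the paper's weighting, which is not reproduced here.

```python
import itertools
from statistics import mean

KS, EPS, LAMS = (1, 3, 5), (0.0, 0.1, 0.2), (0.0, 0.1, 0.2)

# Per-episode severity scores (illustrative, not the paper's values):
# success -> 1.0, slow degradation (correct state, longer trajectory) -> 0.5,
# catastrophic (abandoned mid-way or wrong end state) -> 0.0.
SCORES = {"success": 1.0, "slow_degradation": 0.5, "catastrophic": 0.0}

def surface_volume(run_cell) -> float:
    """run_cell(k, eps, lam) -> list of outcome labels for the episodes run
    at that grid point. Returns the mean cell score over the 27-cell grid."""
    cell_scores = []
    for k, eps, lam in itertools.product(KS, EPS, LAMS):
        outcomes = run_cell(k, eps, lam)
        cell_scores.append(mean(SCORES[o] for o in outcomes))
    return mean(cell_scores)
```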

We measured surface volumes for two of the most-cited agent architectures, ReAct and Reflexion, on a matched task suite. ReAct produced a surface volume of 0.900; Reflexion produced 0.875. ReAct comes out 2.5 points higher on the integrated surface, despite the two architectures scoring within 1 point of each other on pass@1. The difference is concentrated in the high-λ region of the surface — Reflexion's reflection step adds a model call, and that model call has its own failure modes under perturbation and fault.

Two agents at pass@1 = 0.85 can have surface volumes of 0.90 and 0.61. Only one of them is shippable. This is the central reason that the surface is the right object of measurement and pass@1 is the wrong one.

What I/O testing actually misses

Our replication logs every tool call, state transition, and exception across all 1,280 episodes. Three failure modes account for the bulk of the I/O→E2E gap. None of them are detectable from the final answer alone. All of them are detectable from the trajectory, given a sandboxed environment and per-action snapshots.

Travel domain — missing context handling

In the travel-booking task suite, agents reliably ask the user for travel dates and destination, retrieve a flight selection, and attempt to confirm the booking — without ever requesting payment information. The output looks plausible: a structured booking confirmation with flight number, seat assignment, and a totalised price. I/O testing accepts the response as a successful booking. In a sandboxed E2E run, the booking endpoint returns 402 Payment Required, the trajectory fails the file_exists invariant on the receipt artifact, and the failure is captured. The gap on this domain is 39pp.
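For illustration only, an invariant of this kind is a post-run check over real sandbox state rather than over the model's answer text; the field names and helpers below are hypothetical, not Keystone's spec syntax.

```python
from pathlib import Path

def booking_invariants(sandbox) -> dict[str, bool]:
    """Post-run checks for the travel-booking task. `sandbox` is assumed to
    expose the run's artifact directory and a log of outbound HTTP calls."""
    receipt = Path(sandbox.artifacts_dir) / "receipt.pdf"
    return {
        # file_exists: the flow must have produced a receipt artifact
        "file_exists(receipt.pdf)": receipt.exists(),
        # forbidden: the booking endpoint must never have answered 402
        "no_402_from_booking_api": all(
            call.status != 402 for call in sandbox.http_calls("/bookings")
        ),
    }
```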

Support domain — inconsistent escalation logic

Across repeated runs against the same support ticket, the agent escalates to a human in 73% of trajectories and silently closes the ticket as "resolved" in the remaining 27%. Both responses are well-formed JSON. Both responses pass an I/O grader that checks for a top-level resolution_state field. In production, the 27% of closed tickets are users who are now waiting on an answer that is never coming. This is a pure k-axis failure: invisible at pass@1, devastating at pass^5. The gap on this domain is 45pp.

Fault cascades — rate-limit abandonment and schema drift propagation

When a downstream tool returns 429, the dominant agent behaviour we observe is task abandonment rather than retry-with-backoff. The agent receives the 429, drops it into context, and emits a final response of the form "I was unable to complete your request." The user reads this as a system failure; the operator reads it as an unrecoverable error rate. In reality, 95% of these tasks would have completed on a 2-second retry. Without fault injection, this behaviour is never exercised by the test harness.
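For reference, the behaviour the fault axis is probing for is ordinary bounded retry with backoff on retryable status codes. A generic sketch, assuming a requests-style response object with a `status_code` attribute:

```python
import random
import time

RETRYABLE = {429, 502, 503, 504}

def call_with_backoff(call, max_attempts: int = 4, base_delay: float = 2.0):
    """Retry a tool call on transient failures with exponential backoff and
    jitter, instead of surfacing the first 429 as an unrecoverable error."""
    for attempt in range(max_attempts):
        resp = call()
        if resp.status_code not in RETRYABLE:
            return resp
        if attempt < max_attempts - 1:
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
    return resp  # still failing after max_attempts; caller reports it honestly
```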

Schema drift is more insidious. When a tool's response schema adds or renames a field, agents that pattern-matched on the old shape misroute every subsequent call. The first failure is benign — a parsing error in one tool response. The cascade five tool calls later, when the agent has confidently called the wrong endpoint with garbled parameters, is what breaks production. The trajectory contains the cascade in full. The final answer contains a confident-sounding hallucination. Only the trajectory tells the truth.
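A small defensive check that keeps the first failure benign: validate the shape of a tool response before the agent consumes it, so drift surfaces at the drifted call rather than five calls later. The expected schema below is hypothetical.

```python
def schema_problems(payload: dict, required: dict[str, type]) -> list[str]:
    """Return a list of mismatches between a tool response and the shape the
    agent was built against; an empty list means no drift detected."""
    problems = []
    for field, expected_type in required.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected_type):
            problems.append(f"wrong type for {field}: {type(payload[field]).__name__}")
    return problems

# Hypothetical expected shape for a flight-search tool response
FLIGHT_SEARCH_SCHEMA = {"flights": list, "currency": str, "total_results": int}

drift = schema_problems({"flights": [], "currency_code": "USD"}, FLIGHT_SEARCH_SCHEMA)
# -> ["missing field: currency", "missing field: total_results"]
```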

Cost vs. reliability

When reliability is measured as a single pass-rate on a static benchmark, the natural agent-selection heuristic is "pick the flagship SDK on the flagship model" — flagship agents tend to score higher, so production teams default to Claude Code on Opus 4.7 or Codex on GPT-5.5 and pay the bill. When reliability is measured as a surface, this heuristic falls apart.

We computed surface volumes and per-1k-runs cost for seven publicly-available agent SDKs on the same task suite — three vendor flagships and four open-source harnesses. Each agent ran with the model its vendor or published examples recommend for autonomous workloads (Claude Opus 4.7 released April 16, 2026; GPT-5.5 released April 23, 2026; Gemini 3.1 Flash-Lite for cost-efficient workloads; Mistral 3 Large for the dense open-weights baseline). Each agent kept its own native tool surface and looping logic. The result:

| Agent SDK | Model | Surface volume | $ / 1k runs |
|---|---|---|---|
| Claude Code (Anthropic) | Claude Opus 4.7 | 0.891 | $34.40 |
| Codex (OpenAI) | GPT-5.5 | 0.884 | $11.20 |
| Cursor Agent SDK | Claude Opus 4.7 | 0.876 | $12.10 |
| LangGraph (LangChain) | Claude Sonnet 4.6 | 0.872 | $5.20 |
| OpenAI Agents SDK | GPT-5.5 | 0.864 | $7.60 |
| smolagents (Hugging Face) | Gemini 3.1 Flash-Lite | 0.821 | $0.42 |
| CrewAI | Mistral 3 Large | 0.794 | $1.40 |

Two clean comparisons fall out of the panel. First, same-model harness comparison: Claude Code and the Cursor Agent SDK both run Claude Opus 4.7, but Claude Code beats Cursor by 1.5pp on the surface while costing 2.8× as much per 1k runs. The premium is paying for tighter tool surface, more aggressive retry logic, and the Anthropic-curated system prompt — and the gap is real but small. Second, cross-model harness comparison: LangGraph running Claude Sonnet 4.6 reaches 0.872 surface volume — 1.9pp behind Claude Code on Opus 4.7 — at 6.6× lower per-run cost. The Sonnet-tier model gives up almost nothing on the surface, and the LangGraph harness is good enough to expose it.

smolagents — Hugging Face's open-source minimal-loop harness, running Gemini 3.1 Flash-Lite — reaches 92% of Claude Code's surface volume at 1.2% of the cost. The 7-point gap is concentrated almost entirely in the λ axis: smolagents' simpler retry logic abandons under bursty 429s that Claude Code recovers from. If your production failure mix is heavy on rate-limit cascades, that gap matters; if your dominant failure mode is missing context handling (the travel domain pattern, an ε-axis problem), the gap is closer to 1pp.

ReliabilityBench reports the same shape on their own panel of agent harnesses; we reproduce it independently with seven publicly-available SDKs across four model families (Anthropic, OpenAI, Google, Mistral). The implication is that for most production workloads, the right agent is not the most expensive one — and that the gap between flagship and open-source SDKs is dominated by retry-loop quality (λ recovery), not by the model itself. Pass@1 cannot see this. The surface can.

How Keystone closes the gap

Keystone is a hermetic sandbox runtime designed to evaluate agents against the full R(k, ε, λ) surface, not a single point. Every Keystone run sweeps the three axes in parallel inside isolated environments that match production state — services, fixtures, secrets, network policy, audit hooks, and the agent's full tool surface. The same spec produces a single pass-rate suitable for a CI gate and a complete reliability surface suitable for a research dashboard.

The mechanics:

*k — native replicas.* A spec sets `replicas: 1000` and the runtime spawns 1,000 hermetic sandboxes against the same input. pass^k is computed across the fleet for every invariant in the spec. No hand-authored paraphrases, no manual loops, no missing variance metric. (A schematic spec covering all three axes follows this list.)

*ε — Action Metamorphic Relations.* The runtime ships with a library of AMR templates covering the most common semantically-null rewrites: date format, currency code, locale, ordering, paraphrase, equivalent unit. The user opts into ε=0.2 and the runtime produces 200 perturbed inputs per task, runs them all, and reports the surface slice.

*λ — toxic proxy.* Every tool call is routed through a fault-injecting proxy. The user opts into λ=0.2 and the proxy produces a calibrated mix of 429s, schema drift, timeouts, partial streams, and retry storms — using the same Markov-modulated arrival process used by ReliabilityBench. The agent's recovery behaviour is captured per fault type and aggregated into the λ slice.

*Real services, not mocks.* Sandboxes contain real Postgres, Redis, S3, and MCP servers. Fixtures are snapshotted per replica, so 1,000 parallel runs do not stomp on each other's state. Mocks are a known source of false confidence — agents that pass against a canned response often fail against the real service that returned that response — and we do not use them.

*Trajectory replay and bisection.* Every action emits a snapshot. When a trajectory fails, the user opens it in the replay UI, scrubs to the failing step, and either edits the spec to add an invariant that would have caught the failure or promotes the trajectory into the regression dataset. The diff against the last green run is computed automatically. Failure attribution becomes a 30-second operation rather than a multi-hour log dive.
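Put together, a spec that sweeps all three axes might look like the schematic below. This is plain Python standing in for Keystone's actual spec format; every field name is illustrative.

```python
# Schematic only: structure and field names are illustrative, not Keystone's real spec.
travel_booking_spec = {
    "task": "book a round-trip flight from the user's request",
    "sandbox": {
        "services": ["postgres", "redis", "s3", "flights-mcp"],
        "fixtures": "fixtures/travel_seed.sql",   # snapshotted per replica
    },
    "replicas": 1000,                             # k axis: pass^k across the fleet
    "perturbation": {                             # eps axis: AMR templates to apply
        "amr_templates": ["date_format", "currency_code", "paraphrase"],
        "epsilon": 0.2,
    },
    "faults": {"lambda": 0.2},                    # lam axis: fault-injecting proxy rate
    "invariants": [
        "file_exists('artifacts/receipt.pdf')",
        "db_row_exists('bookings', status='confirmed')",
    ],
    "forbidden": [
        "http_status(402)",
        "unrequested_refund",
    ],
}
```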

Production impact

We collected incident, mean-time-to-resolution, and PR-cadence telemetry from 18 enterprise design partners (5 fintech, 4 dev-tools, 4 customer support, 3 e-commerce, 2 healthcare) over a 90-day window. All 18 teams previously ran I/O-only evals — Braintrust, LangSmith, or in-house equivalents — for at least 6 months prior. Keystone instrumentation was deployed alongside the existing pipelines for a 30-day calibration window, then the existing pipelines were taken out of the gating path and replaced with Keystone surface-based gates.

Incidents attributable to agent misbehaviour declined by a median of 60% in the post-cutover window, with an inter-quartile range of ±18 percentage points. The decline was concentrated in incidents that involved repeated triggering of the same failure mode — exactly the population most heavily under-tested by I/O harnesses, because each individual failure passed pass@1 in isolation.

Mean-time-to-resolution improved by a median of 70%, IQR ±22 percentage points. The dominant driver was per-action replay: engineers reproduce the failing trajectory locally in seconds rather than bisecting through production logs. Trajectory replay also shrinks the scope of the change required to fix a failure, because the engineer can see the exact tool call where the trajectory diverged from the green path.

PR cadence on agent-relevant code paths improved by a median of 60–70%, IQR ±25 percentage points. The mechanism is more interesting than the headline. Mock-based tests give false confidence — they pass while the real service is broken — so teams stop iterating after the first green CI and ship code that breaks in production. Surface-based gates make "ship it" the same shape as "tests pass": engineers iterate confidently because the gate actually predicts production behaviour.

Limitations and open questions

The 1,280-episode replication covers four domains and seven agent SDKs (Claude Code, Cursor Agent SDK, Codex, LangGraph, OpenAI Agents SDK, smolagents, CrewAI). ReliabilityBench's full evaluation suite is larger; we are continuing to extend coverage and will publish updated numbers quarterly. The cost-axis measurement we added to ReliabilityBench's framework is novel to this work and has not been independently replicated; the per-1k-runs cost figures should be treated as preliminary and reflect list pricing as of May 2026.

The production-impact numbers (60% / 70% / 60–70%) are medians across 18 teams, not a controlled experiment. We have natural variation across teams in agent maturity, baseline incident rate, and prior tooling — variation we have not yet conditioned on. The directional claim is robust and the magnitudes are reproducible, but a per-team breakdown is available under MSA on request and we encourage scrutiny.

The deeper open question is the geometry of the surface itself. ReliabilityBench's choice to weight catastrophic failures more heavily than slow degradations is principled but contestable, and the resulting surface volume metric is not the only way to summarise the cube. We are actively exploring alternative summaries — minimum surface value, axis-conditional volumes, sensitivity-weighted aggregates — and expect the consensus metric to evolve.

References

Singh, A. et al. *ReliabilityBench: A Three-Dimensional Evaluation Framework for LLM Agents Under Consistency, Robustness, and Fault-Tolerance Stressors.* arXiv preprint arXiv:2601.06112, January 2026. The full paper is available at https://arxiv.org/html/2601.06112v1.

All quantitative claims attributed to ReliabilityBench in this post are taken directly from §3 (framework), §4 (results), and §5 (model panel) of the v1 manuscript. Polarity Labs is not affiliated with the authors. The 60% / 70% / 60–70% production-impact numbers are independent Polarity-collected telemetry from 18 enterprise design partners over a 90-day window; methodology described in the Production Impact section above. Per-team breakdowns shared under MSA on request.