The Agent Reliability Paradox: Why evals pass, and your agent still fails in prod
A 1,280-episode reproduction of ReliabilityBench (arXiv:2601.06112) across scheduling, travel, support, and e-commerce domains. We measure consistency (pass^k), input robustness (ε, via Action Metamorphic Relations), and fault tolerance (λ, via 429 rate-limit errors, schema drift, and partial streams), and show how the surface volume metric exposes the reliability gap that I/O testing cannot see.
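For readers who have not met pass^k before, here is a minimal sketch of how the consistency number can be estimated. It assumes pass^k is computed with the common combinatorial estimator C(c, k) / C(n, k), where c of n repeated trials on a task succeeded; whether this reproduction uses exactly that estimator is an assumption, and the function names `pass_hat_k` and `mean_pass_hat_k` are hypothetical, chosen only for illustration.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate the chance that k trials drawn from one task all succeed.

    Assumed estimator (not confirmed by the source): C(c, k) / C(n, k),
    with c = successes and n = trials.
    """
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    if not 1 <= k <= trials:
        raise ValueError("k must be between 1 and trials")
    # comb(c, k) is 0 when c < k, so the estimate drops to 0.0
    # as soon as fewer than k trials succeeded.
    return comb(successes, k) / comb(trials, k)


def mean_pass_hat_k(results: list[list[bool]], k: int) -> float:
    """Average the per-task estimate over tasks.

    `results[i]` holds the pass/fail outcomes of repeated trials on task i.
    """
    if not results:
        raise ValueError("results must contain at least one task")
    per_task = [pass_hat_k(sum(r), len(r), k) for r in results]
    return sum(per_task) / len(per_task)
```

With k = 1 this reduces to the plain success rate; as k grows, a task that fails even occasionally is penalized more heavily, which is the sense in which pass^k measures consistency rather than average accuracy.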