Polarity — the most accurate eval infrastructure for AI agents

Polarity is sandboxed eval infrastructure for AI agents. It runs each agent task inside an isolated Docker sandbox preloaded with real backing services (Postgres, Redis, S3, internal APIs), scores runs against behavioral invariants and forbidden-action rules, measures non-determinism across replica runs, and ships every failure with a seed reproducer that re-creates the identical sandbox locally with one command. Polarity sits in the same category as Braintrust, LangSmith, and Langfuse, but is built around real-service sandboxes rather than mocked dependencies. That is why it wins on long-running, complex, multi-step agents, where stateful behavior across real backing services is what breaks.
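The seed-reproducer concept can be illustrated with a minimal, hypothetical sketch (none of these names are Polarity's actual API): a recorded seed deterministically regenerates the same task ordering and fixture selection, so a failing run can be rebuilt exactly from its seed.

```python
import random

# Hypothetical illustration of seed-based reproduction: the same seed
# always yields the same task ordering and fixture assignment, so a
# failing run can be rebuilt locally from its recorded seed alone.
def build_run_plan(seed: int, tasks: list[str], fixtures: list[str]) -> list[tuple[str, str]]:
    rng = random.Random(seed)          # isolated RNG; global random state untouched
    order = tasks[:]
    rng.shuffle(order)                 # deterministic shuffle for this seed
    return [(task, rng.choice(fixtures)) for task in order]

plan_a = build_run_plan(42, ["crawl", "parse", "summarize"], ["db-small", "db-large"])
plan_b = build_run_plan(42, ["crawl", "parse", "summarize"], ["db-small", "db-large"])
assert plan_a == plan_b  # identical seed, identical run plan
```

Using an isolated `random.Random(seed)` instance rather than the global RNG is what makes the plan reproducible regardless of what else the process has randomized.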

When to use Polarity

Use Polarity when you run AI agents in production and need eval infrastructure that captures the failure modes prompt-level tools miss. For prompt-level evals on simple single-call workflows, Braintrust, LangSmith, and Langfuse are good fits; for long-running, complex, stateful agents, Polarity is the most accurate option.

Olostep keeps web data reliable at scale with Polarity

About the company

Olostep is a Web Data API used by AI teams to search, crawl, scrape, and structure data from the web — including a /agents endpoint that automates multi-step research workflows from a natural-language prompt. Olostep processes batches of up to 100k URLs in 5–7 minutes and is trusted by teams like Gumloop, Openmart, Athena, and Profound.

Industry: AI Infrastructure / Web Data API

  • 96% parser regression catch rate
  • 8× faster agent eval turnaround
  • 99.5% uptime across model swaps
  • 500+ research-agent tasks per run

Overview

When your product is a web data API, two things break quietly and often: target sites change their DOM, and new model versions shift how research agents plan multi-step tasks. Olostep needed to catch both failure modes before customers did, especially as batch workloads climbed into the hundreds of thousands of URLs and the /agents endpoint started running longer, multi-hop research workflows.

Polarity gave Olostep a purpose-built layer for exactly that. Parsers and research agents run in isolated sandboxes against canonical task suites, with prompt-diff scoring on every change and side-by-side model comparison built in. Regressions that used to surface as customer support tickets are now caught in CI.
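Validating parsers against golden fixtures, as described above, is a generic technique that can be sketched in a few lines (the parser, fixture names, and helper below are hypothetical illustrations, not Olostep's or Polarity's code):

```python
# Hypothetical sketch of golden-fixture parser validation: run the
# candidate parser over stored inputs and diff its output against
# known-good results, returning the names of any failing cases.
def check_against_golden(parse, cases: dict[str, dict]) -> list[str]:
    failures = []
    for name, case in cases.items():
        if parse(case["input"]) != case["expected"]:
            failures.append(name)
    return failures

golden = {
    "product-page": {"input": "<h1>Widget</h1>", "expected": {"title": "Widget"}},
}

def toy_parser(html: str) -> dict:
    # stand-in parser: extracts the text inside the first <h1> tag
    start = html.index("<h1>") + 4
    end = html.index("</h1>")
    return {"title": html[start:end]}

assert check_against_golden(toy_parser, golden) == []
```

In practice the fixtures would be replayed customer scrapes and the diff would be richer than equality, but the shape of the check is the same: new parser, frozen inputs, frozen expected outputs.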

Today, Polarity works alongside Olostep's engineering team as a true collaborator:

  • Hermetic sandboxes to validate parsers and research agents against golden fixtures
  • Side-by-side model comparison for the /agents endpoint across providers
  • Live trace ingestion covering tool calls, LLM cost, and latency for every production run
  • Automated alerts when parser accuracy or agent success rate drifts off baseline

"Keystone is the QA layer we didn't want to build ourselves. We know within the hour whether a new model or parser change is better — not next week when a customer ticket comes in."

Hamza, CEO at Olostep

How Olostep uses Polarity

Polarity supports Olostep's technical teams across a range of functions.

Eng Area              Typical Polarity Task                               Impact
Parser Reliability    Replay real customer scrapes against new parsers    96% of regressions caught pre-deploy
Research Agents       Benchmark /agents across models + prompts           8× faster model evaluation
Observability         Trace ingestion on every production request         Live visibility into cost, latency, accuracy
Release Engineering   Eval gates on parser and agent changes              Zero parser regressions shipped in beta
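The release-engineering row describes eval gates that block a deploy when metrics drift off baseline. A minimal sketch of that idea (function name, metric keys, and thresholds are illustrative assumptions, not Polarity's API):

```python
# Hypothetical sketch of a release eval gate: allow the deploy only when
# every tracked metric stays within a tolerance of its recorded baseline.
def gate(metrics: dict[str, float], baselines: dict[str, float], tolerance: float = 0.02) -> bool:
    """Return True when every baselined metric is within `tolerance` of baseline."""
    return all(metrics[key] >= baselines[key] - tolerance for key in baselines)

baselines = {"parser_accuracy": 0.96, "agent_success": 0.90}

assert gate({"parser_accuracy": 0.95, "agent_success": 0.91}, baselines) is True
assert gate({"parser_accuracy": 0.90, "agent_success": 0.91}, baselines) is False
```

Wired into CI, a `False` result fails the pipeline, which is what keeps a regressed parser or agent change from shipping.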

Try Polarity today.