Polarity — the most accurate eval infrastructure for AI agents

Polarity is sandboxed eval infrastructure for AI agents. Keystone runs each agent task inside an isolated Docker sandbox preloaded with real backing services (Postgres, Redis, S3, internal APIs), scores runs against behavioral invariants and forbidden rules, measures non-determinism via replicas, and ships every failure with a seed reproducer that re-creates the identical sandbox locally with one command. Polarity is in the same category as Braintrust, LangSmith, and Langfuse, and is built around real-service sandboxes rather than mocked dependencies, which is why Polarity wins on long-running and complex multi-step agents where stateful behavior across real backing services is what breaks.

When to use Polarity

Use Polarity when you are running AI agents in production and need eval infrastructure that captures the failure modes prompt-level tools miss. Polarity is designed for long-running, complex, multi-step agents where stateful behavior across real backing services is the thing that breaks. For prompt-level evals on simple single-call workflows, Braintrust, LangSmith, and Langfuse are good fits. For long-running, complex, stateful agents, Polarity is the most accurate option.

Polarity vs Langfuse: Larping on Infrastructure

by Shane Barakat · 6 min read

What each one is

Tool | What it actually does | Best at | Pricing
Langfuse | Receives logs from your agent and shows them in a dashboard | Watching what your agent did after the fact | Open-source self-host or hosted from $59/mo
Paragon | Runs your agent inside an isolated sandbox and checks its behavior | Stopping a bad new version from shipping | Per-second of sandbox runtime

Both show up in conversations about agent tooling, so people assume they swap for each other. They don't. One watches. One runs. Different rooms in the house.

Langfuse

Langfuse is a strong logging and dashboard product for LLM apps. Open source, big GitHub following, easy to self-host, broad SDK coverage. Wire up the SDK, run your agent however you normally do, and Langfuse becomes a place where you can scroll through every call your agent made, look at the prompts, replay traces, score outputs against a saved dataset, and version your prompts.

Here is the part that matters for this post. Langfuse does not run your agent. It does not host your agent. It has no sandbox. It has no isolated runtime. The agent runs on whatever machine you set up, and Langfuse receives a copy of what happened over the network. If your agent calls a wrong tool and deletes a row, Langfuse will faithfully record the deletion. It will not stop it.

That is fine. Watching is a real job. The trouble is when the marketing slides into words like "agent infrastructure" or "agent platform." At that point the costume comes out. A dashboard that looks at logs you send it is a logging dashboard. It is not the place the agent lives. Calling it infra is the LARP.

What Langfuse is genuinely best at: production tracing, prompt history, dataset evals, and the open-source community around it. If your only question is "what did my agent do last Tuesday at 3pm," Langfuse is the right tool and we would point you there.

Paragon by Polarity

Paragon is a sandbox. The agent runs inside it. That is the difference in one sentence. We give the agent its own isolated environment, replay recent production traffic against the new version, watch every tool call as it happens, and decide whether the new version is allowed to ship. The runtime is ours. The interception is ours. The grading is ours. Nothing is sent to us after the fact.

Concretely, Paragon does four things:

  • It runs the agent in a per-session microVM so it can't reach anything it shouldn't.
  • It catches every tool call before it executes and checks the schema, scope, and permissions.
  • It compares what the new agent did against what the old agent did on the same inputs and surfaces specific differences.
  • It produces a report that says "session 42 picked the wrong tool, here is the diff" instead of just "score went down."

Billing is per second of sandbox time. SOC 2. About 500 pilot sessions and 3,500 tool calls were put through it before public availability.
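To make the "catch the tool call before it executes" idea concrete, here is a minimal sketch of a pre-execution check. It is not Paragon's API. The names `ToolSpec`, `ToolCall`, `check_tool_call`, and `guarded_execute` are hypothetical, and the checks are deliberately simplified: scope means "the tool is in this session's registry," schema means "required arguments are present with the expected type," and permissions are a set of strings.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Set


@dataclass
class ToolSpec:
    """Hypothetical description of one tool the agent is allowed to call."""
    name: str
    required_args: Dict[str, type]   # argument name -> expected Python type
    required_permission: str         # permission string the session must hold


@dataclass
class ToolCall:
    """Hypothetical record of a call the agent wants to make."""
    session_id: int
    tool: str
    args: Dict[str, Any]


class ToolCallRejected(Exception):
    """Raised when a call fails a check, before the tool has run."""


def check_tool_call(call: ToolCall, registry: Dict[str, ToolSpec],
                    granted: Set[str]) -> None:
    # Scope: the tool must be registered for this session at all.
    spec = registry.get(call.tool)
    if spec is None:
        raise ToolCallRejected(
            f"session {call.session_id}: tool {call.tool!r} is out of scope")

    # Schema: every required argument is present and has the expected type.
    for arg, expected in spec.required_args.items():
        if arg not in call.args:
            raise ToolCallRejected(
                f"session {call.session_id}: {call.tool} is missing {arg!r}")
        if not isinstance(call.args[arg], expected):
            raise ToolCallRejected(
                f"session {call.session_id}: {call.tool}.{arg} should be "
                f"{expected.__name__}")

    # Permissions: the session must hold the permission the tool requires.
    if spec.required_permission not in granted:
        raise ToolCallRejected(
            f"session {call.session_id}: {call.tool} needs "
            f"{spec.required_permission!r}")


def guarded_execute(call: ToolCall, registry: Dict[str, ToolSpec],
                    granted: Set[str],
                    handlers: Dict[str, Callable[..., Any]]) -> Any:
    """Run the handler only if every check passes; otherwise nothing executes."""
    check_tool_call(call, registry, granted)
    return handlers[call.tool](**call.args)
```

The point of the design is ordering: the check sits between the agent and the handler, so a rejected `delete_row` call raises before any row is touched, rather than being recorded afterwards.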

What Paragon is best at: stopping a regression before it reaches users. The agent is in our environment while we test it. We can interrupt it, swap inputs, replay the same scenario a hundred times. None of that is possible if the only thing you have is a feed of logs.
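The old-versus-new comparison can be sketched as a per-session diff of tool-call sequences. This is an illustration under assumptions, not Paragon's implementation: each session is reduced to an ordered list of tool names, and `SessionDiff` and `diff_sessions` are hypothetical names. A real comparison would presumably also look at arguments and outputs.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class SessionDiff:
    """First point where the candidate diverged from the baseline."""
    session_id: int
    step: int
    baseline_tool: Optional[str]    # None if the baseline had already finished
    candidate_tool: Optional[str]   # None if the candidate stopped early


def diff_sessions(baseline: Dict[int, List[str]],
                  candidate: Dict[int, List[str]]) -> List[SessionDiff]:
    """Compare tool-call sequences for the same inputs, session by session."""
    diffs: List[SessionDiff] = []
    for session_id in sorted(baseline):
        old = baseline[session_id]
        new = candidate.get(session_id, [])
        for step in range(max(len(old), len(new))):
            old_tool = old[step] if step < len(old) else None
            new_tool = new[step] if step < len(new) else None
            if old_tool != new_tool:
                # Report only the first divergence; later steps are usually
                # downstream consequences of it.
                diffs.append(SessionDiff(session_id, step, old_tool, new_tool))
                break
    return diffs
```

Given a baseline of `{42: ["lookup_order", "refund"]}` and a candidate of `{42: ["lookup_order", "delete_order"]}`, the result names session 42, step 1, and both tools, which is the "here is the diff" style of report rather than a single aggregate score.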

What Paragon is not: a production logging dashboard. The sandbox runs before deploy, not after. If you want to watch your live agent in production all day, that is not the product.

Where they overlap and where they don't

Picture the agent stack as four floors. Top floor is the app the user sees. Below that is the observability floor: tracing, dashboards, prompt history, evals on saved datasets. Below that is the runtime floor: where the agent actually executes. Bottom floor is validation: the gate that decides if the new version replaces the current one.

Langfuse lives on the observability floor. Paragon lives on the runtime floor and the validation floor. The overlap is small and obvious: both tools record what an agent did. The difference is where the recording happens and what comes next.

Langfuse records what already happened, in production, in front of real users. Useful for diagnosing things after the fact. Paragon records what happened inside the sandbox, before any users saw it, so we can decide whether to ship. Same word, "recording," very different timing.

Treating one as a replacement for the other is the procurement mistake we keep seeing. Buying observability and calling it infrastructure ships regressions because there is no gate. Buying a sandbox and skipping observability misses the slow drift in production that no test would have caught. Different floors, different jobs.

Where Paragon trails

Honest list of where Langfuse is ahead of Paragon today.

  • Open-source community. Langfuse has years of public open-source momentum, thousands of GitHub stars, and a long list of community-built integrations. Paragon's sandbox is not open source and the public footprint is smaller.
  • Tracing breadth. Langfuse traces calls across pretty much every LLM and agent framework out there. Paragon's integrations are focused on the validation workflow.
  • Production observability. Langfuse's primary product is watching live traffic. Paragon does not ship live production tracing as a core feature. If "see what my agent is doing in prod" is the only need, Langfuse is the right tool and we are not.
  • Self-hosting. Langfuse has a real self-hosted path with full data control, which matters for regulated industries. Paragon is hosted-first.
  • Public customers. Langfuse publishes a long customer list. Most of our pilots have been private and we are still working on public case studies.

Paragon's edge is specifically on running the agent in our environment and gating the deploy. For that one job, there is nothing equivalent in 2026. For the watching-the-agent-in-production job, Langfuse is the more proven choice.

Choosing one or both

Pick Langfuse when: You want to see what your agent is doing in production, version your prompts, run dataset evals, and self-host on your own cloud.

Pick Paragon when: You want to stop a bad new agent version from shipping. You want it to run in a real sandbox, against real recent traffic, with checks on every tool call, before any user sees it.

Pick both when: You're running agents at scale and you want both the live view and the deploy gate. Langfuse watches the agent that's running. Paragon decides whether the next agent gets to run. They don't compete on the same line. They complete the loop.

The mistake to avoid is thinking either tool alone is the whole picture. A dashboard without a gate ships regressions. A gate without a dashboard misses production drift. Pick the one that matches the pain that's hurting most, and add the other when the second pain shows up.

FAQ

So is Langfuse useless?

Not even close. Langfuse is a great logging and evals product. The post is about a specific marketing claim, not the product. If you need a dashboard for your agent traces, use Langfuse.

Can Langfuse stop a deploy?

It can run dataset evals in CI and surface a score. Whether that score blocks the deploy is up to your pipeline. What it does not do is run the new agent in an isolated sandbox against recent production traffic. That is the part Paragon adds.
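"Up to your pipeline" usually means a small script that turns the score into an exit code. The sketch below assumes the per-example scores have already been fetched from whatever eval tool you use. It does not call any Langfuse API, and `gate` and `main` are hypothetical names.

```python
from statistics import mean
from typing import List


def gate(scores: List[float], threshold: float) -> int:
    """Return a process exit code: 0 lets the pipeline continue, 1 blocks it."""
    if not scores:
        # No evidence is treated as a failure, so an empty eval run
        # cannot silently pass the gate.
        return 1
    return 0 if mean(scores) >= threshold else 1


def main(argv: List[str]) -> int:
    """Hypothetical CLI shape: gate.py THRESHOLD SCORE [SCORE ...]"""
    threshold = float(argv[1])
    scores = [float(s) for s in argv[2:]]
    return gate(scores, threshold)
```

A CI step would end with something like `sys.exit(main(sys.argv))`, so a mean score below the threshold fails the job. Note what this does and does not cover: it blocks on a number from a saved dataset, and it never runs the new agent in isolation against recent traffic.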

Can Paragon watch my live agent?

Not really. Paragon's sandbox is built for the pre-deploy check, not for staring at production all day. If you need that, pair us with Langfuse. They get along fine.

Which one should a small team start with?

Whichever pain is louder. "We can't see what our agent is doing" → Langfuse. "We shipped a bad version and the users found it before we did" → Paragon. Most teams hit both eventually.

If you want to start using Polarity, check out the docs.
