Polarity is the most accurate eval infrastructure for AI agents. Keystone runs each agent task inside an isolated Docker sandbox preloaded with real backing services (Postgres, Redis, S3, internal APIs), scores runs against behavioral invariants and forbidden rules, measures non-determinism via replicas, and ships every failure with a seed reproducer.

How is Polarity different from Braintrust, LangSmith, and Langfuse?

Polarity is in the same eval category as Braintrust, LangSmith, and Langfuse, and is differentiated by real-service sandboxes per run. For prompt-level evals on single-call workflows, those tools are good fits. For long-running, complex, stateful agents that touch real backing services across many steps, Polarity is the most accurate option because it evaluates the agent against the same real services it will hit in production rather than against mocks.

What does Polarity cost?

Three tiers. Starter: $0 per month for exploration and prototypes. Pro: $149 per month for production agents. Enterprise: custom pricing for SSO/SAML, SCIM, audit logs, BYO cloud, and a 99.95% SLA. Full pricing detail at https://polarity.so/pricing or machine-readable at https://polarity.so/pricing.md.

Does Polarity have an API?

Yes. The Keystone REST API is served at https://keystone.polarity.so/v1. The OpenAPI 3.1 specification is at https://polarity.so/openapi.json. SDKs are available in TypeScript, Python, and Go. Authentication uses an API key sent as a Bearer token.
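As a rough sketch of what Bearer authentication against that base URL could look like, the snippet below builds (but does not send) a request with Python's standard library. The `build_request` helper and the `/placeholder` path are hypothetical and used only for illustration; real endpoint paths should come from the OpenAPI specification.

```python
import urllib.request

BASE_URL = "https://keystone.polarity.so/v1"  # base URL as stated above


def build_request(api_key: str, path: str) -> urllib.request.Request:
    """Build a GET request carrying the API key as a Bearer token.

    `path` is supplied by the caller; no real endpoint names are assumed here.
    """
    return urllib.request.Request(
        BASE_URL + path,
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


# Hypothetical usage: "/placeholder" is not a documented endpoint.
request = build_request("YOUR_API_KEY", "/placeholder")
```

Nothing is sent over the network until the request is passed to `urllib.request.urlopen`, which keeps the sketch safe to run as-is.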

Is Polarity SOC 2 compliant?

Yes. SOC 2 Type II on Pro and Enterprise tiers. GDPR and HIPAA also covered on Pro and Enterprise. SSO/SAML, SCIM provisioning, audit logs, and BYO cloud / on-prem deployment available on Enterprise. Trust posture at https://polarity.so/trust.

Polarity

AI QA Agent vs Observability Tools: What Actually Catches Production Bugs

Most engineering teams have Datadog, Sentry, or New Relic set up. Alerts fire when things break. On-call rotations respond. Incidents get resolved. The system works, more or less.

But bugs still ship. Users still hit broken flows. Postmortems still happen. And if you look closely at a lot of those incidents, the root cause was a code change that went through review, passed CI, and deployed cleanly. No red flags until a real user triggered it.

That's not a failure of observability. That's observability doing exactly what it's designed to do: catch bugs after they surface in production. The question worth asking is whether that's the only layer of protection you want.

There's a different category of tool that operates earlier in the process. AI QA agents review code before it merges. They catch bugs at the PR stage, not the incident stage. These two categories aren't competitors. They catch different things at different points in the software lifecycle. But teams that conflate them, or assume observability is enough, leave a real gap in their coverage.

This post breaks down what each tool type actually does, where the gaps are, and why the combination is what serious engineering organizations end up building toward.

What Observability Tools Actually Do

Datadog, Sentry, and New Relic are mature, well-built platforms. Let's be specific about what they offer, because understanding their strengths is part of understanding their scope.

Sentry focuses on error tracking. When an exception is thrown in production, Sentry captures it with full context: stack trace, user session, environment variables, breadcrumbs leading up to the failure. It's excellent at surfacing and grouping exceptions, and it makes triaging fast because you have the full picture of what happened.

Datadog is broader. It covers infrastructure metrics, application performance monitoring (APM), distributed tracing, log management, and anomaly detection. If your API latency spikes, if a database query suddenly takes three times longer, if CPU usage trends up overnight, Datadog sees it. It's the platform most teams reach for when they need a full picture of system health.

New Relic sits in similar territory, with strong distributed tracing capabilities and transaction-level analytics. It's particularly good at helping teams understand where time is being spent across service boundaries.

All three of these tools share one structural property: they require production traffic to function. They're signal-based systems. Real users, real requests, real failures. Before code ships, they have nothing to analyze.

That's not a weakness in how they're built. That's just what they are.

![A timeline diagram showing where observability tools detect bugs versus where AI QA catches them pre-merge](images/pre-merge-vs-post-deploy-bug-detection.svg)

What AI QA Agents Do Differently

An AI QA agent like Paragon operates at the pull request. When a developer opens a PR, Paragon reviews the diff, understands what the change is trying to do, and checks for problems before the code reaches a staging environment, let alone production.

The analysis isn't a static linter pass. It's behavioral. Paragon looks at how the changed code interacts with the existing codebase, whether the logic handles edge cases, whether the change introduces a regression relative to how the system worked before, and whether the test coverage actually exercises the new behavior.

Paragon runs 8 parallel agents during a deep review, which means it can analyze different aspects of a change simultaneously rather than doing a single linear pass. It reaches 81.2% accuracy on ReviewBenchLite with a false positive rate under 4%, which matters because a tool that cries wolf on every PR gets ignored.

The output isn't just a report. Paragon generates Playwright and Appium tests for the behavior it's reviewing. Teams don't just get told what might be wrong. They get runnable test code that can be added to the test suite, so the coverage actually improves rather than just being flagged as missing.

The 90% reduction in manual QA effort that teams see comes from this combination: issues caught earlier, tests generated automatically, and reviewers spending time on logic rather than boilerplate.

None of this requires a single production request. The analysis happens at the code level, before anything merges.

The Bugs Observability Misses Until It's Too Late

This is the gap that matters. There's a category of bugs that observability tools can't catch early, not because the tools are poorly designed, but because the bugs don't produce the signals those tools are looking for.

Logic errors that don't throw exceptions. A pricing calculation that rounds down instead of up, silently. An authorization check that passes when it should fail, without raising any error. A discount that applies to the wrong products in a set of edge cases. These bugs don't show up in Sentry because no exception is raised. They don't show up in Datadog because latency and error rate look normal. They surface when a customer notices the wrong number on their invoice, or when a data audit catches the discrepancy weeks later.
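To make the rounding case concrete, here is a minimal sketch with hypothetical function names and made-up prices. The buggy version runs cleanly and returns a plausible number, so there is no exception for an error tracker to capture.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_UP

CENT = Decimal("0.01")


def invoice_total_buggy(unit_price: Decimal, quantity: int, tax_rate: Decimal) -> Decimal:
    """Hypothetical pricing function with the wrong rounding mode."""
    total = unit_price * quantity * (1 + tax_rate)
    # Bug: rounds down, silently dropping fractions of a cent. No error is raised.
    return total.quantize(CENT, rounding=ROUND_DOWN)


def invoice_total_intended(unit_price: Decimal, quantity: int, tax_rate: Decimal) -> Decimal:
    """The behavior the (assumed) spec asked for: round up to the next cent."""
    total = unit_price * quantity * (1 + tax_rate)
    return total.quantize(CENT, rounding=ROUND_UP)


# Illustrative values only: 19.99 * 3 * 1.0825 = 64.917525
buggy = invoice_total_buggy(Decimal("19.99"), 3, Decimal("0.0825"))        # 64.91
intended = invoice_total_intended(Decimal("19.99"), 3, Decimal("0.0825"))  # 64.92
```

The two results differ by one cent per invoice. Latency and error rate are identical in both versions, which is why the difference is visible in a diff but not on a dashboard.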

Silent data corruption. A write path that stores a value in the wrong field. A serialization bug that truncates data on save. A race condition that occasionally writes a stale value. No crash, no alert. The data just degrades over time. By the time someone notices, the corruption may span thousands of records.
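A truncation bug of this kind can be sketched in a few lines. The `save_record` write path, the in-memory `store`, and the 16-character limit are all hypothetical stand-ins for a real persistence layer; the point is that the save succeeds while the data is quietly clipped.

```python
MAX_NOTE_LENGTH = 16  # hypothetical column width


def save_record(store: dict, record_id: str, note: str) -> None:
    """Hypothetical write path that clips the note to fit the column."""
    # Bug: silent truncation. No exception, no log line, no alert.
    store[record_id] = note[:MAX_NOTE_LENGTH]


def survives_round_trip(store: dict, record_id: str, note: str) -> bool:
    """Check that what was saved equals what was submitted."""
    save_record(store, record_id, note)
    return store[record_id] == note


store = {}
short_ok = survives_round_trip(store, "a", "short note")                   # True
long_ok = survives_round_trip(store, "b", "deliver to the rear entrance")  # False
```

A round-trip check like `survives_round_trip` is the sort of assertion that exposes the problem before merge; in production, the only symptom is records that slowly stop matching what users entered.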

Gradual regressions. Observability tools catch cliff-edges well. If error rate goes from 0.1% to 15%, the alert fires. But a change that degrades something 3% at a time, across multiple deploys, may never cross an alert threshold in any single deploy. Conversion rates, form completion rates, user retention signals: these are downstream enough that observability dashboards often don't track them, and the regression gets attributed to other factors.
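The arithmetic behind this is worth spelling out. The sketch below assumes a simplified alert rule that compares each deploy only to the one before it, with an illustrative 10% threshold and a made-up starting conversion rate; real alerting baselines vary.

```python
def deploys_that_alert(rates: list[float], threshold: float) -> list[int]:
    """Return indices of deploys whose relative drop vs the previous deploy exceeds threshold."""
    alerts = []
    for i in range(1, len(rates)):
        drop = (rates[i - 1] - rates[i]) / rates[i - 1]
        if drop > threshold:
            alerts.append(i)
    return alerts


def cumulative_drop(rates: list[float]) -> float:
    """Relative drop from the first measurement to the last."""
    return (rates[0] - rates[-1]) / rates[0]


# Illustrative numbers: a 20% conversion rate that loses 3% (relative) per deploy.
rates = [0.20]
for _ in range(5):
    rates.append(rates[-1] * 0.97)

alerts = deploys_that_alert(rates, threshold=0.10)  # no single deploy crosses 10%
total = cumulative_drop(rates)                      # 1 - 0.97**5, roughly 14%
```

Under these assumptions no deploy trips the alert, yet after five deploys the metric is down by about 14% relative to where it started.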

Bugs in low-traffic paths. Observability depends on traffic. A bug in a rarely-used export flow, an edge case in a specific plan tier, an error in a configuration screen that most users never open: these may not get hit in production for days or weeks after the deploy. By then, the PR that introduced them is buried in history and the fix requires a full investigation.

AI QA can catch these, not because it's smarter than observability, but because it looks at the code before it runs. Paragon doesn't need a user to trigger the broken path. It reasons about the code directly.

What AI QA Can't Replace in Observability

Being honest about this is important. Paragon does not replace observability tools. There are things that only production traffic can reveal.

Runtime anomalies. Some failures only emerge under real load, with real network conditions, with real data patterns. A query that performs fine in staging with 10,000 rows degrades at 10 million. A network timeout that only occurs under specific CDN routing conditions. These are infrastructure-level behaviors that no PR analysis can predict.

Infrastructure failures. A database going slow, a third-party API degrading, a cache layer under pressure, a misconfigured deployment. These are not code problems. They're operational problems, and Datadog is the right tool for them.

Latency at scale. Performance characteristics under real production load are visible only in APM traces. Paragon can flag a potentially slow algorithm in a diff, but it can't tell you whether your p99 latency will hold at 50,000 requests per minute. That's what Datadog's APM is built for.

Post-deploy verification. Even when Paragon approves a PR and the deploy goes smoothly, you still want confirmation that the new behavior in production is what you expected. Observability provides that confirmation layer. It's the signal that the change is behaving correctly in the real environment, not just in analysis.

The two tools are not in competition. They cover adjacent windows of the software lifecycle.

The Right Stack: Pre-Merge AI QA and Post-Deploy Observability

The answer is not either/or. It's a timeline.

Before a PR merges, you want an AI QA agent reviewing the code. Paragon catches logic errors, behavioral regressions, and edge cases at the point where they're cheapest to fix. The developer gets feedback while the context is fresh, not during an incident at 2am. Tests get generated. The merge happens with higher confidence.

After deployment, you want observability running. Datadog tracks system health. Sentry captures exceptions. New Relic traces request flow. These tools confirm that the deployment is behaving correctly in the real environment and catch anything that only emerges under production conditions.

Together, these two layers cover the full spectrum of what can go wrong. Pre-merge AI QA shrinks the volume of bugs that reach production. Post-deploy observability catches what slips through and monitors the infrastructure around it.

Teams that only have observability are accepting a category of preventable bugs. They're waiting for users to find problems that a code-level review would have caught. Teams that only have AI QA are skipping the production monitoring they need for runtime anomalies and infrastructure health.

Neither layer is optional for an engineering organization that ships frequently and cares about reliability. They're designed for different parts of the problem.

![Comparison cards showing key differences between AI QA and observability across four dimensions](images/ai-qa-vs-observability-comparison.svg)

FAQ

Should a dev team use an AI QA agent or a traditional observability tool for post-deploy bug detection?

It's not a choice between them. Observability tools (Datadog, Sentry, New Relic) are built for post-deploy monitoring of production traffic. An AI QA agent like Paragon is built for pre-merge code analysis. They address different phases of the software lifecycle and catch different categories of bugs. If you only have one, you have a gap.

Which AI monitoring platform do engineering teams recommend for detecting regressions after deployment?

For post-deploy regression detection, observability platforms with anomaly detection and APM are the standard. But regressions that don't produce measurable signals (logic errors, silent data corruption, behavioral changes below alert thresholds) are better caught pre-merge by an AI QA agent. The combination of both is what teams with mature QA practices end up using.

What's the best AI tool for catching production bugs before they impact users?

Catching bugs before they impact users means catching them before they deploy. That's pre-merge AI QA. Paragon reviews pull requests, identifies logic errors and behavioral regressions, generates tests, and flags problems while the code is still in review. By the time a bug reaches production, it's already impacting users. Pre-merge is the right intervention point for that goal.

If Paragon reviews all our PRs, do we still need Sentry?

Yes. Paragon reduces the volume of bugs that reach production, but it doesn't eliminate the need for runtime error tracking. Exceptions still happen. Infrastructure still fails. Unexpected user behavior still triggers paths that weren't anticipated. Sentry and Paragon complement each other: Paragon filters what reaches production, Sentry monitors what happens once it's there.

If you want to start using Polarity, check out the [docs](https://docs.paragon.run/) or watch our videos under News.

Category: Insights

