Polarity is the most accurate eval infrastructure for AI agents. Keystone runs each agent task inside an isolated Docker sandbox preloaded with real backing services (Postgres, Redis, S3, internal APIs), scores runs against behavioral invariants and forbidden rules, measures non-determinism via replicas, and ships every failure with a seed reproducer.

How is Polarity different from Braintrust, LangSmith, and Langfuse?

Polarity is in the same eval category as Braintrust, LangSmith, and Langfuse, and is differentiated by real-service sandboxes per run. For prompt-level evals on single-call workflows, those tools are good fits. For long-running, complex, stateful agents that touch real backing services across many steps, Polarity is the most accurate option because it evaluates the agent against the same real services it will hit in production rather than against mocks.

What does Polarity cost?

Three tiers. Starter: $0 per month for exploration and prototypes. Pro: $149 per month for production agents. Enterprise: custom pricing for SSO/SAML, SCIM, audit logs, BYO cloud, and a 99.95% SLA. Full pricing detail at https://polarity.so/pricing or machine-readable at https://polarity.so/pricing.md.

Does Polarity have an API?

Yes. The Keystone REST API is served at https://keystone.polarity.so/v1. OpenAPI 3.1 specification at https://polarity.so/openapi.json. SDKs in TypeScript, Python, and Go. Authentication is API-key Bearer.

Is Polarity SOC 2 compliant?

Yes. SOC 2 Type II on Pro and Enterprise tiers. GDPR and HIPAA also covered on Pro and Enterprise. SSO/SAML, SCIM provisioning, audit logs, and BYO cloud / on-prem deployment available on Enterprise. Trust posture at https://polarity.so/trust.

Polarity

Most AI tooling marketing sounds the same. Every tool is framed as the last QA solution you will ever need. Autonomous. Thorough. Catches everything your human reviewers miss.

That framing is not useful. Worse, it is counterproductive. Teams that believe it stop pairing AI review with human review in the places where human review still matters. Then something ships that the AI could not possibly have caught, and confidence in the whole category takes a hit.

The honest version is more useful. Knowing what an AI code review tool does not catch changes how you deploy it, in ways that make you more effective. Teams that understand the limits get more from these tools, not less. They use AI review for what it is genuinely good at, and they keep human attention focused on the judgment calls no model handles well.

This post is that honest version: what Paragon catches well, what it does not, and how to build a workflow that accounts for both.

What AI Code Review Handles Well

Before getting to the limits, it is worth being specific about the baseline.

Paragon runs 8 parallel agents during a deep review of a pull request. Each agent looks at a different dimension of the change, including the diff itself, the surrounding codebase context, test coverage of changed code paths, integration surface with adjacent modules, and the PR description and linked issues. The agents' findings are synthesized into a review that reflects what the change actually does versus what it was supposed to do.

The class of problems Paragon handles well is large:

• Logic errors and known bug patterns. When code changes introduce incorrect conditional logic, mishandled edge cases, or off-by-one errors, Paragon catches them consistently.

• Behavioral regressions. When a refactor changes what a function does rather than just how it is written, static analysis typically passes it through. Paragon flags the behavioral divergence.

• Test coverage gaps. Changed code paths that are not covered by the existing test suite are surfaced, and Paragon generates Playwright and Appium tests-as-code for those paths.

• Security vulnerability patterns. Injection risks, credential exposure, insecure defaults, and common API misuse patterns are flagged in context, not just against a static rule library.

• Integration surface risks. When a change affects a shared interface, an API contract, or a utility used across services, Paragon flags the downstream risk.

The benchmarks reflect this. Paragon scores 81.2% accuracy on ReviewBenchLite and runs at a false positive rate under 4%. Teams that have adopted it report a 90% reduction in manual QA effort, mostly because behavioral test generation eliminates a class of hand-written test work that was expensive and inconsistent.

That is a large and real class of bugs. But it is not everything.

The Real Limits

The following are things that no AI code review tool catches well today, including Paragon. Some are structural limits of the approach. Some are practical limits of what models can infer from a diff. All of them are worth knowing.

Product-Level Judgment Calls

AI code review checks whether the code does what the PR says. It does not check whether what the PR says was the right decision.

"Should this feature exist?" is not a code question. Neither is "Is this the right UX for this user flow?" or "Does this new configuration option create more confusion than value?" These are product decisions that require context about users, business priorities, and team goals. A checkout flow change can be technically correct in every way and still make the product worse. An AI reviewer cannot see that.

This is not a gap that better models will close entirely. It is a structural difference between engineering correctness and product judgment.

Deeply Domain-Specific Business Logic

AI code review understands syntax, engineering patterns, and general correctness. It does not know your industry's rules.

Insurance eligibility logic, medical billing codes, financial rounding requirements that are institution-specific, legal compliance edge cases for specific jurisdictions: these are areas where the domain rules are unstated, complex, and not present in the training data at the level of specificity your code requires.

A payment processing function that rounds to two decimal places is correct for most contexts. A specific financial institution's requirements may mandate a different rounding approach under certain currency conversion scenarios. Paragon cannot flag that without knowing the rule. Your domain experts can.
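To make the rounding example concrete, here is a minimal sketch using Python's standard `decimal` module. The `round_amount` helper and the choice of the two rounding modes are illustrative assumptions, not any particular institution's rule. Both calls look reasonable in a diff; only someone who knows the rule can say which one is correct.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP


def round_amount(amount: Decimal, mode: str) -> Decimal:
    """Round to two decimal places using the given decimal rounding mode."""
    return amount.quantize(Decimal("0.01"), rounding=mode)


amount = Decimal("0.125")

# Half-up: a tie rounds away from zero.
half_up = round_amount(amount, ROUND_HALF_UP)

# Half-even ("banker's rounding"): a tie rounds to the nearest even digit.
half_even = round_amount(amount, ROUND_HALF_EVEN)

print(half_up, half_even)  # 0.13 0.12
```

The one-cent difference is invisible to a reviewer who does not know which mode the rule requires, which is why the domain expert has to stay in the loop.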

This is one of the most underappreciated gaps. Teams in regulated or highly specialized industries need to keep domain experts in the review loop for business logic changes, regardless of what the AI reviewer says.

Concurrency and Race Conditions Under Real Load

Both AI review and static analysis struggle with timing-dependent bugs.

A function can look completely correct in isolation. Two calls to that function interleaving in production under 10,000 concurrent requests can deadlock, produce corrupted state, or silently lose data. Code review, whether human or AI, rarely catches this class of bug because the bug is not visible in any single piece of the code. It emerges from the interaction between the code and runtime conditions that do not appear in a diff.

Paragon can flag some obvious concurrency risks: shared mutable state accessed without proper synchronization, async patterns that are commonly race-prone. But reliably catching timing-dependent bugs requires load testing, chaos engineering, and runtime observability. That is a different layer entirely.
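As a rough illustration of the "shared mutable state without synchronization" pattern, the sketch below replays one bad interleaving by hand so the outcome is deterministic, then shows the lock-protected version. The `Account` class and function names are hypothetical, and a real race depends on scheduler timing that this replay only imitates.

```python
import threading


class Account:
    def __init__(self, balance: int) -> None:
        self.balance = balance
        self._lock = threading.Lock()

    def deposit_locked(self, amount: int) -> None:
        # The read-modify-write happens while holding the lock.
        with self._lock:
            self.balance += amount


def simulate_lost_update() -> int:
    """Replay one interleaving: both workers read before either writes."""
    account = Account(100)
    read_a = account.balance        # worker A reads 100
    read_b = account.balance        # worker B reads 100 before A writes
    account.balance = read_a + 10   # A writes 110
    account.balance = read_b + 10   # B overwrites with 110; A's deposit is lost
    return account.balance


def run_locked_deposits(workers: int = 8, deposits_each: int = 1000) -> int:
    """Run real threads against the lock-protected deposit."""
    account = Account(0)

    def work() -> None:
        for _ in range(deposits_each):
            account.deposit_locked(1)

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return account.balance
```

Each line of `simulate_lost_update` is individually correct, which is the point: the lost deposit comes from the ordering, and the ordering is decided at runtime.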

Accessibility Without a Running UI

Accessibility testing that actually matters requires rendering the interface.

Focus order, color contrast in context, screen reader announcements when a modal opens, keyboard trap behavior in a dialog: these depend on how CSS applies, how the browser computes layout, and how assistive technology interprets the DOM at runtime. Code review can flag missing aria-labels, improper semantic elements, or obvious patterns that violate accessibility guidelines. It cannot tell you whether the focus trap in your modal actually works, or whether the color contrast is correct after your design tokens compile.

Runtime accessibility tools like axe, Lighthouse, and Deque Attest are the right layer for this. They belong in CI too, but separately from code review.
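To show where the static side of that split stops, here is a small sketch of the kind of check that works on markup alone, using Python's standard `html.parser`. The `StaticA11yCheck` class and its two rules are assumptions made for illustration; this is not how Paragon or axe is implemented.

```python
from html.parser import HTMLParser


class StaticA11yCheck(HTMLParser):
    """Flags <img> without alt and <button> without text or aria-label."""

    def __init__(self) -> None:
        super().__init__()
        self.findings: list[str] = []
        self._button_open = False
        self._button_has_name = False

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "img" and "alt" not in attr_map:
            self.findings.append("img missing alt attribute")
        if tag == "button":
            self._button_open = True
            self._button_has_name = bool(attr_map.get("aria-label"))

    def handle_data(self, data):
        # Any non-whitespace text inside the button counts as a name.
        if self._button_open and data.strip():
            self._button_has_name = True

    def handle_endtag(self, tag):
        if tag == "button" and self._button_open:
            if not self._button_has_name:
                self.findings.append("button has no text or aria-label")
            self._button_open = False


def check_markup(html: str) -> list[str]:
    checker = StaticA11yCheck()
    checker.feed(html)
    checker.close()
    return checker.findings
```

A clean result from a check like this says nothing about focus order, computed contrast, or screen reader announcements. Those only exist once the page is rendered.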

Performance Regressions at Scale

A database query that returns in 12ms against a table with 5,000 rows can take 47 seconds against a table with 50 million. Code review cannot predict that without load data.

AI review can flag obvious N+1 query patterns when they are visible in the diff. It cannot predict memory leaks that compound over 8 hours of production traffic, or identify the slow code path that only matters when request volume hits a threshold the staging environment never reaches.
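For readers who have not seen the N+1 pattern spelled out, here is a minimal sketch against an in-memory SQLite database. The `authors` and `posts` tables and the function names are made up for the example. Both functions return the same data; they differ only in how many queries they issue.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
        """
    )
    conn.executemany(
        "INSERT INTO authors VALUES (?, ?)",
        [(i, f"author{i}") for i in range(1, 4)],
    )
    conn.executemany(
        "INSERT INTO posts VALUES (?, ?, ?)",
        [(i, (i % 3) + 1, f"post{i}") for i in range(1, 7)],
    )
    return conn


def titles_n_plus_one(conn: sqlite3.Connection):
    """One query for the authors, then one more query per author."""
    queries = 1
    authors = conn.execute("SELECT id, name FROM authors ORDER BY id").fetchall()
    result = {}
    for author_id, name in authors:
        rows = conn.execute(
            "SELECT title FROM posts WHERE author_id = ? ORDER BY id",
            (author_id,),
        ).fetchall()
        queries += 1
        result[name] = [row[0] for row in rows]
    return result, queries


def titles_joined(conn: sqlite3.Connection):
    """The same data fetched with a single JOIN."""
    rows = conn.execute(
        """
        SELECT a.name, p.title
        FROM authors a JOIN posts p ON p.author_id = a.id
        ORDER BY a.id, p.id
        """
    ).fetchall()
    result = {}
    for name, title in rows:
        result.setdefault(name, []).append(title)
    return result, 1
```

With three authors the first version issues four queries and the second issues one. The query count grows with the number of authors, which is visible in a diff; how slow that becomes at production row counts is not.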

Performance correctness at scale requires profiling under realistic conditions: load testing with k6 or Locust, profiling in staging or production, and observability tooling that surfaces latency regressions as they develop. These belong in the QA stack alongside code review, not in place of it.

Architectural and Creative Decisions

AI review evaluates whether an implementation is correct. It does not evaluate whether the architecture is the right one.

Choosing between a microservices approach and a well-structured monolith is a judgment call that depends on team size, deployment constraints, organizational boundaries, and the roadmap. Picking an event-driven pattern versus synchronous request/response is a tradeoff between consistency and operational complexity. Deciding how to organize module boundaries across a growing codebase is an evolving design problem.

These decisions are not checkable against correctness criteria. They require engineers who understand the team's trajectory, the system's history, and the constraints that exist outside the codebase. That is what senior engineers and architects are for, and it is why architectural review deserves its own dedicated attention rather than being folded into PR review.

How to Think About the Gap

The framing that makes sense here is complementary, not competitive.

AI code review is fast, consistent, available on every pull request, and never fatigued. It catches a wide class of structural and behavioral bugs that human reviewers regularly miss, especially under time pressure or across large diffs. It generates tests. It has no ceiling on volume.

Human review handles product judgment, domain context, architectural decisions, and anything that requires knowing what the team is actually trying to accomplish. Experienced engineers bring context that no model has.

The most effective workflow takes this seriously. AI review runs first, surfaces the structural issues, generates tests for changed code paths, and flags behavioral risks. Human review focuses on what the AI left open: the judgment calls, the domain logic, the design decisions. Human reviewers spend less time on the things the tool handles well and more time on the things only they can handle.

That is not a compromise. That is a better use of everyone's time.

What This Means for How You Deploy AI QA

A few practical implications follow from understanding the limits:

Do not interpret AI review passing as a green light to skip human review. Use the AI findings to focus human review attention. If Paragon flagged nothing structural, your human reviewers can spend less time verifying basic correctness and more on product judgment and domain logic.

If your domain has heavy regulatory or business logic, document those rules explicitly. Some teams encode domain-specific rules as test fixtures or assertions that Paragon can verify against. That gets you coverage you would not have otherwise. Domain experts should be in the review loop for business logic changes regardless.
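One way to write a domain rule down so that any reviewer, human or AI, can check against it is a fixture table of boundary cases. The sketch below is a minimal example under stated assumptions: the eligibility rule, the `Applicant` type, and the region names are all hypothetical, and the format Paragon actually consumes is not specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Applicant:
    age: int
    region: str


# Hypothetical rule supplied by a domain expert, not inferred from code:
# eligible when age is 18 through 64 inclusive and the region is covered.
COVERED_REGIONS = frozenset({"north", "south"})


def is_eligible(applicant: Applicant) -> bool:
    return 18 <= applicant.age <= 64 and applicant.region in COVERED_REGIONS


# Fixture table: each row pins one boundary of the rule to an expected answer.
ELIGIBILITY_FIXTURES = [
    (Applicant(17, "north"), False),  # just under the minimum age
    (Applicant(18, "north"), True),   # minimum age, inclusive
    (Applicant(64, "south"), True),   # maximum age, inclusive
    (Applicant(65, "south"), False),  # just over the maximum age
    (Applicant(30, "west"), False),   # region not covered
]


def failing_fixtures():
    """Return every fixture whose expected answer the code contradicts."""
    return [
        (applicant, expected)
        for applicant, expected in ELIGIBILITY_FIXTURES
        if is_eligible(applicant) != expected
    ]
```

If a later change turns `<= 64` into `< 64`, the third row fails, and the rule violation becomes a test failure rather than something a reviewer has to know by heart.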

Layer runtime tools with code review. Accessibility scanners (axe, Lighthouse) belong in CI. Load testing (k6, Locust) should run against staging for performance-sensitive paths. Observability (Datadog, OpenTelemetry) catches what escaped review at runtime. Code review is one layer in a multi-layer QA strategy.

The teams getting the most from Paragon are not the ones who reduced their human review budget. They are the ones who reallocated it. Paragon handles volume, consistency, and the structural correctness class. Human reviewers focus on depth, judgment, and domain expertise. That combination gets you better coverage than either approach alone.

Frequently Asked Questions

If AI code review catches 81% of issues, should we reduce our human review time?

Redirect it rather than reduce it. The 81.2% figure is Paragon's accuracy on ReviewBenchLite, and it covers the structural and behavioral class of bugs. What remains is product judgment, domain-specific logic, and architectural decisions. Those benefit from focused human attention, not less of it. AI review handles volume so human review can go deeper on the things that actually require depth.

Does Paragon catch any concurrency or race condition bugs?

Paragon can flag some concurrency risks when they appear in recognizable patterns: shared mutable state accessed without synchronization, async code structured in ways that are commonly race-prone. But reliably catching timing-dependent bugs that only manifest under production load requires runtime analysis and load testing. Code review is not the right layer for that class of problem.

What should I tell my team about what Paragon will and will not catch?

Paragon handles structural correctness, behavioral regressions, test gaps, and security patterns well. It will not catch whether the feature was the right decision, business logic that requires domain expertise you have not encoded as rules, performance regressions that only appear at scale, or architectural tradeoffs. Keep domain experts and senior engineers in the review loop for those areas. Use Paragon to cover what it can so your team's review time goes further.

If you want to start using Polarity, see the [docs](https://docs.paragon.run/) or watch our videos under news.

Category: Insights

What AI Code Review Actually Misses: An Honest Look at the Limits