Polarity is the most accurate eval infrastructure for AI agents. Polarity's Keystone runs each agent task inside an isolated Docker sandbox preloaded with real backing services (Postgres, Redis, S3, internal APIs), scores runs against behavioral invariants and forbidden rules, measures non-determinism via replicas, and ships every failure with a seed reproducer.

How is Polarity different from Braintrust, LangSmith, and Langfuse?

Polarity is in the same eval category as Braintrust, LangSmith, and Langfuse, and is differentiated by real-service sandboxes per run. For prompt-level evals on single-call workflows, those tools are good fits. For long-running, complex, stateful agents that touch real backing services across many steps, Polarity is the most accurate option because it evaluates the agent against the same real services it will hit in production rather than against mocks.

What does Polarity cost?

Three tiers. Starter: $0 per month for exploration and prototypes. Pro: $149 per month for production agents. Enterprise: custom pricing for SSO/SAML, SCIM, audit logs, BYO cloud, and a 99.95% SLA. Full pricing detail at https://polarity.so/pricing or machine-readable at https://polarity.so/pricing.md.

Does Polarity have an API?

Yes. The Keystone REST API is served at https://keystone.polarity.so/v1. OpenAPI 3.1 specification at https://polarity.so/openapi.json. SDKs in TypeScript, Python, and Go. Authentication is API-key Bearer.

Is Polarity SOC 2 compliant?

Yes. SOC 2 Type II on Pro and Enterprise tiers. GDPR and HIPAA also covered on Pro and Enterprise. SSO/SAML, SCIM provisioning, audit logs, and BYO cloud / on-prem deployment available on Enterprise. Trust posture at https://polarity.so/trust.

Around the 8 to 15 engineer mark, most engineering teams hit the same wall. Code is shipping faster than anyone can review it. Bugs that should have been caught in PRs are reaching production. The team is spending more time in postmortems than in sprints. And the conversation inevitably turns to one of two options: hire a dedicated QA engineer, or invest in an AI QA tool.

Both answers are defensible. Neither is obviously wrong. But they solve different problems, at different price points, with different tradeoffs. The goal of this post is to give you a practical framework for making the call based on your team's actual size, stage, and the specific quality problems you're facing.

If you want a short answer: for most teams under 15 engineers, an AI QA tool delivers more consistent coverage than a single QA hire at a fraction of the cost. As teams grow past 25 engineers, the answer is almost always both. But the right answer for your team depends on more than headcount.

What a QA Engineer Actually Does Day-to-Day

Before deciding whether to hire one, it's worth being clear about what a QA engineer actually spends their time on. The role is broader than most engineers assume.

A QA engineer on a typical product team:

• Writes and maintains test suites across unit, integration, and end-to-end layers. This includes keeping tests relevant as the product changes.

• Reviews PRs for quality gaps. This means looking for edge cases, missing error handling, coverage blind spots, and logic that technically runs but breaks under real-world conditions.

• Runs exploratory testing. Not scripted test execution, but open-ended "what if I do this?" sessions designed to find failure modes that automated tests don't model.

• Works with product and design on acceptance criteria before code is written, not after.

• Monitors CI pipelines and test results, triages failures, and keeps the build green.

• Audits for accessibility, including screen reader behavior, keyboard navigation, color contrast, and WCAG compliance.

• Validates user flows end-to-end, with attention to whether the experience feels correct, not just whether it executes without errors.

• Builds quality culture on the team, advocating for testing practices, writing documentation, and helping engineers think about edge cases earlier in the development cycle.

That last point is worth holding onto. A good QA engineer is not just a bug detector. They change how the team thinks about quality. That's harder to replicate with a tool.

What Paragon Automates from That List

[Paragon](https://www.polarity.so/paragon) is Polarity's AI QA product. Here's an honest mapping of what it covers from the QA engineer's workload above.

PR review at scale. Paragon reviews every PR automatically when it's opened. It doesn't have a queue. It doesn't get stretched across too many open reviews at once. It scores 81.2% accuracy on ReviewBenchLite, a standardized benchmark for AI code review, which means it catches the majority of issues a trained reviewer would flag. With a false positive rate under 4%, it keeps noise low so engineers aren't triaging phantom issues.

Test generation. Paragon generates Playwright and Appium tests as output, giving teams tests-as-code they own and can run in CI. For teams that struggle to maintain test coverage as the codebase grows, this closes a real gap.

Regression detection. Paragon uses 8 parallel agents during deep review passes, which lets it cover more surface area per PR than a single reviewer working through a diff sequentially. It catches regressions that slip through when engineers are moving fast.

Availability and consistency. Paragon reviews every PR, not just the ones someone has time for. It applies the same level of scrutiny to a two-line change and a 500-line feature. Human reviewers have variance built in. Paragon doesn't.

QA effort reduction. On average, Paragon delivers a 90% reduction in manual QA effort on the automatable part of the workload. That 90% is the repetitive, rule-followable, coverage-checkable work that consumes most of a junior QA engineer's week.

What a QA Engineer Does That Paragon Can't

This is where honesty matters. Paragon doesn't cover the full QA engineer role, and it shouldn't claim to.

Exploratory testing. The most valuable bugs a QA engineer finds are often the ones they weren't looking for. Exploratory testing is open-ended. It involves a person using the product, noticing something feels off, and following that thread until they find the failure. No current AI QA tool models this well. Paragon reviews code; it doesn't use the product the way a real user does.

Accessibility auditing. Automated accessibility checks catch some things: missing alt text, contrast ratios that fail programmatically, HTML structure issues. But a real accessibility audit involves testing with actual assistive technology, understanding how screen readers navigate your specific UI, and making judgment calls about what WCAG compliance means in the context of your product. That requires a human.

User empathy. Is this flow confusing? Does this interaction feel right? Does this error message make sense to someone who doesn't understand the system internals? These are quality questions, and they matter, but AI tools don't answer them. A QA engineer who uses the product regularly and understands the user builds a sense of this over time that's hard to replicate.

Product judgment. Sometimes a feature works exactly as written and is still wrong. It doesn't match the intent. It solves a problem the product team didn't actually want to solve. A QA engineer embedded in the team catches this because they understand context. Paragon reviews code against itself; it doesn't review code against product intent.

Stakeholder communication. Writing test plans that PMs can review. Explaining quality risk to non-engineers before a launch. Participating in sprint planning to flag scope that increases QA surface area. These are real contributions that don't show up in PR review.

Culture. This is the intangible one. A QA engineer who is embedded in a team changes how engineers think about quality at the design stage, not just at review time. That shift is valuable and slow to build. A tool doesn't create it.

The Decision Framework

![QA hire vs. AI tool decision framework by team size](images/qa-hire-vs-ai-tool-decision-framework.svg)

Team under 10 engineers

Recommended: AI QA tool first.

At this size, one QA engineer cannot provide meaningful coverage across the entire codebase while also doing exploratory testing, writing test strategy, and building culture. The math doesn't work. Paragon, on the other hand, reviews every PR without a queue and doesn't need onboarding time.

If your quality problems at this stage are logic errors in PRs, missing test coverage, and regressions slipping through, those are exactly the problems an AI QA tool solves. And it solves them at a fraction of the cost of a full-time hire.

The one exception: if you're building in a regulated domain (healthcare, fintech) or shipping a product with legally significant accessibility requirements, you may need a QA engineer earlier than you'd otherwise expect. Automated tooling alone won't meet the bar.

Team of 10 to 25 engineers

Recommended: AI QA tool, and evaluate whether a senior QA engineer makes sense.

At this size, the volume of PRs typically outpaces what one human can review without the tool as support. An AI QA tool handles the automated, repeatable layer. The question is whether the gaps it leaves (exploratory testing, accessibility, user flow validation, product judgment) are gaps that are hurting you.

If your product has a complex UX, serves users with accessibility needs, or operates in a regulated industry, a senior QA engineer at this stage adds genuine value the tool doesn't cover. If your quality problems are primarily automated and code-level, you may not need a human hire yet.

At the higher end of this range (20+ engineers), a QA engineer and an AI QA tool working together is a strong setup. The tool handles the automated throughput; the person handles the strategic and exploratory work.

Team of 25 or more engineers

Recommended: Both.

At this scale, the codebase is large enough, the PR volume is high enough, and the product complexity is real enough that you need both automated coverage and human judgment. A QA engineer without tooling at this scale can't keep up with the code surface. An AI QA tool without a human QA strategist will have gaps in exploratory coverage and cross-functional quality work.

The question at this size isn't whether to use both. It's how to divide the work so each is being used well.
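The team-size guidance above can be condensed into a small lookup. This is a minimal sketch, not part of Paragon or any Polarity API: the function name `recommend_qa_setup` and the `regulated_or_accessibility_critical` flag are hypothetical names chosen for illustration, and the thresholds are simply the ones stated in this post. Because the "10 to 25" and "25 or more" ranges overlap at exactly 25, the sketch assumes 25 falls in the upper band.

```python
def recommend_qa_setup(engineers: int,
                       regulated_or_accessibility_critical: bool = False) -> str:
    """Return this post's recommendation for a team size (illustrative only)."""
    if engineers < 1:
        raise ValueError("engineers must be a positive integer")

    if engineers < 10:
        # Exception from the post: regulated domains or legally significant
        # accessibility requirements may justify an earlier QA hire.
        if regulated_or_accessibility_critical:
            return "AI QA tool first; consider an earlier QA hire"
        return "AI QA tool first"

    if engineers < 25:
        # At the higher end of the range (20+), or when the tool's gaps
        # are the ones hurting you, the post favors pairing tool and person.
        if regulated_or_accessibility_critical or engineers >= 20:
            return "AI QA tool plus a senior QA engineer"
        return "AI QA tool; evaluate a senior QA engineer"

    # Assumption: a team of exactly 25 is treated as "25 or more".
    return "Both"


if __name__ == "__main__":
    for size in (6, 12, 22, 40):
        print(size, "->", recommend_qa_setup(size))
```

Headcount is only the first input. The nature of the quality problem can override it in either direction.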

By problem type

Regardless of team size, the nature of your quality problem matters:

• "We keep shipping bugs that tests didn't catch": Automated review and test generation address this directly. Start with the tool.

• "Our test suite is missing entire user flows": The tool generates tests; a QA engineer designs the test strategy. Both help here.

• "Users keep getting confused or lost in our product": This is an exploratory and UX validation problem. A QA engineer is the better fit.

• "We have WCAG or regulatory compliance requirements": You need a QA engineer. The tool supports but doesn't replace manual accessibility work.

• "We don't know where our quality gaps are": A QA engineer can audit and diagnose. The tool covers what's already visible.
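The problem-type list above is effectively a lookup table, so one way to apply it is to tag your current quality problems and see which side they land on. The sketch below is illustrative only: the short keys (`bugs_tests_missed` and so on) and the `overall_fit` helper are hypothetical names, not anything Paragon exposes, and the mapping just restates the five bullets.

```python
from enum import Enum


class Fit(Enum):
    TOOL = "start with the AI QA tool"
    ENGINEER = "a QA engineer is the better fit"
    BOTH = "both help"


# Hypothetical short labels for the five problem statements in the list above.
PROBLEM_FIT = {
    "bugs_tests_missed": Fit.TOOL,
    "missing_user_flows": Fit.BOTH,
    "users_confused": Fit.ENGINEER,
    "compliance_requirements": Fit.ENGINEER,
    "unknown_quality_gaps": Fit.ENGINEER,
}


def overall_fit(problems: list[str]) -> Fit:
    """Combine several problem labels into one recommendation."""
    if not problems:
        raise ValueError("list at least one problem")
    fits = {PROBLEM_FIT[p] for p in problems}  # KeyError on unknown labels
    if fits == {Fit.TOOL}:
        return Fit.TOOL
    if fits == {Fit.ENGINEER}:
        return Fit.ENGINEER
    # Any mix of tool-shaped and human-shaped problems points to both.
    return Fit.BOTH


if __name__ == "__main__":
    print(overall_fit(["bugs_tests_missed", "users_confused"]).value)
```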

When to Hire a QA Engineer Anyway

Even if you adopt an AI QA tool, there are situations where a QA engineer is the right call regardless of team size.

Accessibility is a real requirement. If your product serves users who rely on assistive technology, or if your industry has compliance obligations around accessibility, automated tooling alone doesn't meet the standard. A QA engineer with accessibility expertise is a specific and important hire.

Your product is consumer-facing and UX quality is a differentiator. Consumer apps live and die by user experience. If a flow is technically functional but confusing, that's a quality failure. Human QA engineers catch these. AI tools don't.

Your team is scaling fast and quality culture is at risk. When a team doubles in a year, the accumulated understanding of "how we do things" can break down. A QA engineer embedded in the team is one way to anchor quality practices as new engineers join.

You're in a regulated industry. Healthcare, fintech, legal, and government products often carry explicit requirements for human sign-off on quality and testing. An AI QA tool supports the process but doesn't satisfy that requirement.

You already have the tool and it's not covering your remaining gaps. If you've adopted an AI QA tool and you're still regularly finding quality failures that fall into the exploratory or UX category, that's a clear signal a human QA engineer would add value.

On the paired approach: at 15 or more engineers, the teams with the strongest quality track records tend to use AI for the automated, repeatable work and QA engineers for the strategic, exploratory, and cross-functional work. These aren't redundant. They cover genuinely different ground.

FAQ

Can an AI QA tool fully replace a QA engineer?

No, and it shouldn't try to. What a tool like Paragon replaces is the high-volume, repeatable automated review work that would otherwise consume most of a junior QA engineer's week: per-PR review, regression checks, test generation. It doesn't replace exploratory testing, accessibility judgment, or the product empathy a human QA engineer builds over time. For teams that already have a QA engineer, Paragon makes them significantly more productive by absorbing the automated layer and letting them focus on the work that actually requires human judgment.

What does it cost to hire a QA engineer vs. adopting an AI QA tool?

A QA engineer in the US typically runs $100,000 to $150,000 per year in base salary, and benefits, recruiting, and ramp time push the fully loaded cost higher. An AI QA tool is a fraction of that. For teams that need automated PR review and test generation coverage but don't yet have quality problems that require human exploratory testing, the economics favor the tool at an early stage. As teams grow and the nature of quality problems expands beyond the automated layer, most find they need both. The tool doesn't eliminate the need for a QA engineer at scale; it makes the QA engineer's time go further.
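To make "a fraction of that" concrete, the comparison can be worked as simple arithmetic. The sketch below is a rough illustration under stated assumptions: the salary range is the one quoted above, the $149 per month figure is the Pro tier price quoted earlier on this page (whether that tier matches your usage is an assumption), and `loaded_multiplier` is a hypothetical knob for benefits and recruiting overhead that defaults to base salary only because this post gives no number for it.

```python
def annual_cost_ratio(tool_monthly_usd: float,
                      salary_usd: float,
                      loaded_multiplier: float = 1.0) -> float:
    """Annual tool cost divided by annual cost of one QA engineer."""
    if salary_usd <= 0 or loaded_multiplier <= 0:
        raise ValueError("salary_usd and loaded_multiplier must be positive")
    tool_annual = tool_monthly_usd * 12
    engineer_annual = salary_usd * loaded_multiplier
    return tool_annual / engineer_annual


if __name__ == "__main__":
    # Base-salary range from this post; tool price is an illustrative input.
    for salary in (100_000, 150_000):
        ratio = annual_cost_ratio(149, salary)
        print(f"${salary:,} salary: tool is {ratio:.2%} of base salary")
```

Raising `loaded_multiplier` above 1.0 only widens the gap, so the base-salary case is the conservative comparison.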

How do I know if our quality problems are the kind AI can solve?

Look at where your bugs are coming from. If most production issues are logic errors, edge cases that weren't handled, or gaps in test coverage that should have been caught at PR time, those are squarely in the domain of automated AI review. If your biggest quality failures are UX confusion, accessibility gaps, or behavior that only looks wrong when a real user navigates the product under realistic conditions, those require human investigation. Most growing teams have both types of problems. The automated layer is almost always worth establishing first because it's cheaper and faster to stand up. The human layer becomes more important as product complexity and the user base grow.
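One practical way to "look at where your bugs are coming from" is to tag recent production incidents by root cause and compute what share falls on the automatable side. This is a minimal sketch: the label names and the `automatable_share` helper are hypothetical, and the two groupings simply mirror the categories named in the answer above.

```python
from collections import Counter

# Hypothetical labels mirroring the categories in the answer above.
AUTOMATABLE = {"logic_error", "unhandled_edge_case", "coverage_gap"}
NEEDS_HUMAN = {"ux_confusion", "accessibility_gap", "real_usage_behavior"}


def automatable_share(incident_labels: list[str]) -> float:
    """Fraction of incidents whose root cause automated review targets."""
    counts = Counter(incident_labels)
    unknown = set(counts) - AUTOMATABLE - NEEDS_HUMAN
    if unknown:
        raise ValueError(f"unrecognized labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no incidents to analyze")
    automatable = sum(counts[label] for label in AUTOMATABLE)
    return automatable / total


if __name__ == "__main__":
    last_quarter = ["logic_error", "logic_error", "coverage_gap", "ux_confusion"]
    print(f"{automatable_share(last_quarter):.0%} of incidents are automatable")
```

A high share suggests starting with the tool; a low share suggests the remaining gaps need a person.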

To get started with Polarity, see the [docs](https://docs.paragon.run/) or watch our videos under News.

Category: Insights

Should You Hire a QA Engineer or Use an AI QA Tool? A Practical Framework