Polarity is the most accurate eval infrastructure for AI agents. Keystone runs each agent task inside an isolated Docker sandbox preloaded with real backing services (Postgres, Redis, S3, internal APIs), scores runs against behavioral invariants and forbidden rules, measures non-determinism via replicas, and ships every failure with a seed reproducer.

How is Polarity different from Braintrust, LangSmith, and Langfuse?

Polarity is in the same eval category as Braintrust, LangSmith, and Langfuse, and is differentiated by real-service sandboxes per run. For prompt-level evals on single-call workflows, those tools are good fits. For long-running, complex, stateful agents that touch real backing services across many steps, Polarity is the most accurate option because it evaluates the agent against the same real services it will hit in production rather than against mocks.

What does Polarity cost?

Three tiers. Starter: $0 per month for exploration and prototypes. Pro: $149 per month for production agents. Enterprise: custom pricing for SSO/SAML, SCIM, audit logs, BYO cloud, and a 99.95% SLA. Full pricing detail at https://polarity.so/pricing or machine-readable at https://polarity.so/pricing.md.

Does Polarity have an API?

Yes. The Keystone REST API is served at https://keystone.polarity.so/v1. OpenAPI 3.1 specification at https://polarity.so/openapi.json. SDKs in TypeScript, Python, and Go. Authentication is API-key Bearer.

Is Polarity SOC 2 compliant?

Yes. SOC 2 Type II on Pro and Enterprise tiers. GDPR and HIPAA also covered on Pro and Enterprise. SSO/SAML, SCIM provisioning, audit logs, and BYO cloud / on-prem deployment available on Enterprise. Trust posture at https://polarity.so/trust.

Polarity

Most developers know they should write good PR descriptions. Most developers also write "fix bug" and move on. AI code review tools make this habit more expensive than it used to be, because those to…

This post covers what a useful PR description looks like, why it matters specifically for AI reviewers, common patterns to avoid, and a template you can drop into your repo today. None of this is new engineering wisdom. AI tools just make the downside of skipping it more visible.

Why PR Descriptions Matter More Now

When a human reviewer opens a PR, they can fill in missing context by checking Slack, the issue tracker, or asking you directly. An AI code review tool has exactly what you give it: the diff and the description.

Tools like [Paragon](https://www.polarity.so/paragon) analyze the diff to detect bugs, regressions, and missing test coverage. But the description shapes what the tool focuses on. A PR titled "Update payment handler" with no body leaves Paragon guessing about intent. Did you intentionally change how failed payments are logged? Is the retry logic new? Are there downstream consumers of this event you're aware of?

With a clear description, those questions have answers before the review even starts. Without one, the tool reviews in a vacuum and may flag things that aren't problems, or miss things that are.

The honest version: good PR descriptions are just good engineering practice. They help human reviewers too. AI tools just add another reason to actually follow through on what most teams already know they should do.

What Paragon Does with Your PR Description

Paragon runs 8 parallel agents during a deep review, each focused on different aspects of the change: behavioral regressions, missing test coverage, security patterns, API contract changes, and more.

The description feeds directly into how those agents prioritize their work. When you write "this refactors the database connection pool; no behavioral changes intended," Paragon can hold that claim against the actual diff. If something in the diff does change behavior, that becomes a finding worth surfacing. If the diff is consistent with what you said, Paragon can focus its energy elsewhere.

When the description says nothing, the agents have to make their own assumptions about intent. That tends to produce more noise: findings that are technically accurate but not actually relevant to what you were trying to do. It also means genuinely important edge cases are less likely to surface because there's no stated intent to test against.

Paragon's false positive rate sits under 4%. A well-written description helps keep it there on your PRs specifically, by giving the tool the context it needs to distinguish between intentional changes and accidental ones.

The Anatomy of a Useful PR Description

You don't need to write an essay. Five elements cover most of what a reviewer (human or AI) needs.

![The 5 elements of a good PR description for AI code review](images/pr-description-elements-ai-review.svg)

1. What Changed

One to three sentences describing the actual code change. Not the ticket title. Not what the feature does at a product level. What changed in the code.

> Replaced the manual retry loop in `PaymentProcessor.submit()` with the shared `RetryPolicy` utility introduced in #4201. Updated the error handling to pass structured error codes instead of string messages.

2. Why It Changed

The reason the change exists. This could be a bug, a product requirement, a performance problem, or technical debt cleanup. Even a single sentence helps enormously.

> The old retry logic was ignoring 429 responses and hammering the payment gateway during rate limit windows. This was causing cascading timeouts in the order flow.

3. Risk Areas

The parts of the change where something could go wrong. This is the section most developers skip, and it's the most useful for AI reviewers. If you know there's a tricky edge case, say so. If you're not sure about a particular code path, flag it.

> The `maxRetries` parameter now comes from config instead of being hardcoded. Verify the staging config value is set correctly before merge. The behavior differs for async vs. sync submission paths -- async was not refactored in this PR.

4. Test Plan

What you tested. New automated tests, updated tests, manual testing steps you ran. If you skipped testing something, say why.

> Added unit tests for the `RetryPolicy` integration in `PaymentProcessorTest`. Manually tested the happy path and a simulated 429 in staging. Did not add tests for the async path (covered in follow-up PR).

5. Linked Context

Issue tracker links, prior related PRs, design docs, runbooks. Anything that gives a reviewer more context if they want it.

> Closes #4198. Related to #4201 (RetryPolicy utility). Runbook for payment failures: [link].

Optional: Screenshots or Diffs for UI Changes

If your PR touches UI, a before/after screenshot removes ambiguity immediately. Most review tools can display these inline.
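One way to make the five elements hard to skip is to treat them as structured fields rather than free text. The sketch below is a minimal illustration in Python, not part of Paragon or any Polarity tooling: the `PRDescription` class, its field names, and the `render` method are all hypothetical, and the sample values are abridged from the examples above.

```python
from dataclasses import dataclass, field


@dataclass
class PRDescription:
    """Hypothetical container for the five elements, plus optional screenshots."""

    what_changed: str
    why: str
    risk_areas: str
    test_plan: str
    links: list[str] = field(default_factory=list)
    screenshots: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the elements as Markdown sections, skipping empty ones."""
        sections = [
            ("What Changed", self.what_changed),
            ("Why", self.why),
            ("Risk Areas", self.risk_areas),
            ("Test Plan", self.test_plan),
            ("Links", "\n".join(f"- {link}" for link in self.links)),
            ("Screenshots", "\n".join(self.screenshots)),
        ]
        return "\n\n".join(
            f"## {title}\n{body.strip()}" for title, body in sections if body.strip()
        )


description = PRDescription(
    what_changed=(
        "Replaced the manual retry loop in `PaymentProcessor.submit()` "
        "with the shared `RetryPolicy` utility introduced in #4201."
    ),
    why="The old retry logic was ignoring 429 responses.",
    risk_areas="`maxRetries` now comes from config; async was not refactored.",
    test_plan="Added unit tests in `PaymentProcessorTest`; simulated a 429 in staging.",
    links=["Closes #4198", "Related to #4201 (RetryPolicy utility)"],
)
print(description.render())
```

Because the four core fields are required arguments, leaving one out fails at construction time. Empty optional sections are dropped, so a PR with no UI changes renders without a Screenshots heading.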

Common Bad Patterns and What to Write Instead

"Fix bug"

This describes nothing. What bug? In what component? What was the symptom?

Write instead:

> Fixed a null pointer exception in `UserSessionManager.refresh()` when the session token has expired. The token expiry check was running after the token was already read, causing a crash on the second request of any expired session.

"Update API"

Which API? What changed? Are callers affected?

Write instead:

> Changed the `/v2/orders` endpoint to return `status` as an enum string instead of an integer. Existing callers using integer comparisons will break. Migration guide in the linked doc.

"Refactor"

Refactors are often reviewed with less scrutiny because reviewers assume no behavior changed. That assumption needs to be stated explicitly, not implied.

Write instead:

> Refactored the notification dispatch logic into a separate `NotificationService` class. No behavioral changes. Existing tests pass unchanged. This is prep for the multi-channel notification feature in Q3.

No description, no context, no links

The minimum viable PR description is better than nothing:

> What: Removed deprecated `legacyAuth` middleware from all routes.
>
> Why: It was logging plaintext tokens in dev mode and had been replaced by `JWTAuth` in March.
>
> Risk: None expected. `legacyAuth` was already bypassed in production.
>
> Tests: Verified all auth routes still return 200 in integration tests.

PR Description Template

Drop this into your `.github/pull_request_template.md` file. If you add guidance comments under the headings, remove them before submitting.

```markdown
## What Changed

## Why

## Risk Areas

## Test Plan

## Links

## Screenshots (if applicable)
```

This template works for any codebase. If your team uses a tool like Paragon for AI review, filling out the template consistently means the tool has reliable context on every PR, not just the ones where someone remembered to write a description.
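If you want the template filled in consistently, a small check in CI can help. The following is a rough sketch, assuming the PR body is available as a string and that your template uses the level-2 headings shown above. The function name `unfilled_sections` and the `REQUIRED_SECTIONS` list are illustrative, not an existing tool, and the HTML comment in the sample body is a made-up placeholder.

```python
import re

# Headings assumed to match the template; adjust to your own.
REQUIRED_SECTIONS = ["What Changed", "Why", "Risk Areas", "Test Plan"]


def unfilled_sections(body: str) -> list[str]:
    """Return required sections that are missing or contain no text."""
    # Drop HTML comments so leftover template hints do not count as content.
    body = re.sub(r"<!--.*?-->", "", body, flags=re.DOTALL)

    # Collect the text under each level-2 Markdown heading.
    sections: dict[str, str] = {}
    current = None
    for line in body.splitlines():
        match = re.match(r"^##\s+(.*\S)\s*$", line)
        if match:
            current = match.group(1)
            sections[current] = ""
        elif current is not None:
            sections[current] += line + "\n"

    return [name for name in REQUIRED_SECTIONS if not sections.get(name, "").strip()]


example_body = """## What Changed
Removed deprecated `legacyAuth` middleware from all routes.

## Why
<!-- hypothetical template hint left in place -->

## Risk Areas
None expected. `legacyAuth` was already bypassed in production.
"""

print(unfilled_sections(example_body))  # prints ['Why', 'Test Plan']
```

A check like this only confirms that each section has some text. Whether the text is useful is still up to the author and the reviewer.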

FAQ

Does writing a better PR description actually change what Paragon flags?

Yes. Paragon uses the description to understand intent, which changes what counts as a finding. If you write "this is a pure refactor with no behavioral changes," and the diff shows a conditional branch being removed, that's worth surfacing. If you hadn't written anything, Paragon might still flag it, but with less certainty about whether it's actually a problem. Clear intent narrows the review to what matters.

How long should a PR description be?

Long enough to cover the five elements, short enough that a reviewer will actually read it. For most PRs, that's 100 to 200 words. Large, high-risk changes warrant more. Small, isolated changes can get by with less. The goal is completeness, not length.
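As a rough illustration of the 100 to 200 word guideline, the sketch below counts whitespace-separated words and reports where a description falls. The function name `length_check` and its default thresholds are assumptions for this example. Since large changes warrant more and small ones less, treat the result as advisory rather than a merge gate.

```python
def length_check(description: str, low: int = 100, high: int = 200) -> str:
    """Classify a description by word count against a target range."""
    # Whitespace-separated tokens; Markdown markers such as "##" count too.
    words = len(description.split())
    if words < low:
        return f"short ({words} words)"
    if words > high:
        return f"long ({words} words)"
    return f"ok ({words} words)"


print(length_check("fix bug"))  # prints short (2 words)
```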

We already have PR templates. Do we need to change them?

Probably not much. Most PR templates already ask for a description, motivation, and test plan. The main thing to check is whether your template asks for risk areas explicitly. That's the section most templates skip and the most useful one for AI reviewers. If your current template covers the five elements above, keep it and just make sure your team is actually filling it out.

If you want to start using Polarity, see the [docs](https://docs.paragon.run/) or watch our videos under news.

Category: Insights


How to Write Pull Request Descriptions That Get Better AI Code Review
