Authors

Jay Chopra

insights

Apr 29, 2026

How to Write Pull Request Descriptions That Get Better AI Code Review

Most developers know they should write good PR descriptions. Most developers also write "fix bug" and move on. AI code review tools make this habit more expensive than it used to be, because those tools are reading your description and using it to decide how to review your code.

This post covers what a useful PR description looks like, why it matters specifically for AI reviewers, common patterns to avoid, and a template you can drop into your repo today. None of this is new engineering wisdom. AI tools just make the downside of skipping it more visible.

Why PR Descriptions Matter More Now

When a human reviewer opens a PR, they can fill in missing context by checking Slack, the issue tracker, or asking you directly. An AI code review tool has exactly what you give it: the diff and the description.

Tools like [Paragon](https://www.polarity.so/paragon) analyze the diff to detect bugs, regressions, and missing test coverage. But the description shapes what the tool focuses on. A PR titled "Update payment handler" with no body leaves Paragon guessing about intent. Did you intentionally change how failed payments are logged? Is the retry logic new? Are there downstream consumers of this event you're aware of?

With a clear description, those questions have answers before the review even starts. Without one, the tool reviews in a vacuum and may flag things that aren't problems, or miss things that are.

The honest version: good PR descriptions are just good engineering practice. They help human reviewers too. AI tools just add another reason to actually follow through on what most teams already know they should do.

What Paragon Does with Your PR Description

Paragon runs 8 parallel agents during a deep review, each focused on different aspects of the change: behavioral regressions, missing test coverage, security patterns, API contract changes, and more.

The description feeds directly into how those agents prioritize their work. When you write "this refactors the database connection pool; no behavioral changes intended," Paragon can hold that claim against the actual diff. If something in the diff does change behavior, that becomes a finding worth surfacing. If the diff is consistent with what you said, Paragon can focus its energy elsewhere.

When the description says nothing, the agents have to make their own assumptions about intent. That tends to produce more noise: findings that are technically accurate but not actually relevant to what you were trying to do. It also means genuinely important edge cases are less likely to surface because there's no stated intent to test against.

Paragon's false positive rate sits under 4%. A well-written description helps keep it there on your PRs specifically, by giving the tool the context it needs to distinguish between intentional changes and accidental ones.

The Anatomy of a Useful PR Description

You don't need to write an essay. Five elements cover most of what a reviewer (human or AI) needs.

![The 5 elements of a good PR description for AI code review](images/pr-description-elements-ai-review.svg)

1. What Changed

One to three sentences describing the actual code change. Not the ticket title. Not what the feature does at a product level. What changed in the code.

> Replaced the manual retry loop in `PaymentProcessor.submit()` with the shared `RetryPolicy` utility introduced in #4201. Updated the error handling to pass structured error codes instead of string messages.

2. Why It Changed

The reason the change exists. This could be a bug, a product requirement, a performance problem, or a debt cleanup. Even a single sentence helps enormously.

> The old retry logic was ignoring 429 responses and hammering the payment gateway during rate limit windows. This was causing cascading timeouts in the order flow.

3. Risk Areas

The parts of the change where something could go wrong. This is the section most developers skip, and it's the most useful for AI reviewers. If you know there's a tricky edge case, say so. If you're not sure about a particular code path, flag it.

> The `maxRetries` parameter now comes from config instead of being hardcoded. Verify the staging config value is set correctly before merge. The behavior differs for async vs. sync submission paths: async was not refactored in this PR.

4. Test Plan

What you tested. New automated tests, updated tests, manual testing steps you ran. If you skipped testing something, say why.

> Added unit tests for the `RetryPolicy` integration in `PaymentProcessorTest`. Manually tested the happy path and a simulated 429 in staging. Did not add tests for the async path (covered in follow-up PR).

5. Linked Context

Issue tracker links, prior related PRs, design docs, runbooks. Anything that gives a reviewer more context if they want it.

> Closes #4198. Related to #4201 (RetryPolicy utility). Runbook for payment failures: [link].

Optional: Screenshots or Diffs for UI Changes

If your PR touches UI, a before/after screenshot removes ambiguity immediately. Most review tools can display these inline.

Common Bad Patterns and What to Write Instead

"Fix bug"

This describes nothing. What bug? In what component? What was the symptom?

Write instead:

> Fixed a null pointer exception in `UserSessionManager.refresh()` when the session token has expired. The token expiry check was running after the token was already read, causing a crash on the second request of any expired session.

"Update API"

Which API? What changed? Are callers affected?

Write instead:

> Changed the `/v2/orders` endpoint to return `status` as an enum string instead of an integer. Existing callers using integer comparisons will break. Migration guide in the linked doc.

"Refactor"

Refactors are often reviewed with less scrutiny because reviewers assume no behavior changed. That assumption needs to be stated explicitly, not implied.

Write instead:

> Refactored the notification dispatch logic into a separate `NotificationService` class. No behavioral changes. Existing tests pass unchanged. This is prep for the multi-channel notification feature in Q3.

The minimum viable PR description is better than nothing:

> What: Removed deprecated `legacyAuth` middleware from all routes.
>
> Why: It was logging plaintext tokens in dev mode and had been replaced by `JWTAuth` in March.
>
> Risk: None expected. `legacyAuth` was already bypassed in production.
>
> Tests: Verified all auth routes still return 200 in integration tests.
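
The three bad patterns in this section are short enough to catch mechanically, for example in a pre-merge check. The following is a minimal sketch, not how Paragon or any other review tool classifies descriptions: the helper name `is_low_information` and the phrase list are assumptions drawn from the examples above.

```python
import re

# Illustrative phrase list taken from the bad patterns in this section.
# A real team would extend it with its own habits.
LOW_INFO_PHRASES = {"fix bug", "update api", "refactor"}


def is_low_information(description: str) -> bool:
    """Return True if the description is empty or just a stock phrase."""
    # Lowercase and keep only alphanumeric words, so quotes, trailing
    # periods, and extra whitespace don't hide a stock phrase.
    normalized = " ".join(re.findall(r"[a-z0-9]+", description.lower()))
    return normalized == "" or normalized in LOW_INFO_PHRASES
```

A check like this only catches the worst cases. A description can pass it and still omit the why or the risk areas.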

PR Description Template

Drop this into your `.github/pull_request_template.md` file. Remove the comment blocks before submitting.

```markdown
## What Changed

## Why

## Risk Areas
<!-- Parts of the change that could have unexpected effects. Flag edge cases, paths you didn't test, or downstream dependencies to watch. -->

## Test Plan
<!-- What you tested: new unit tests, updated tests, manual steps run, anything intentionally not tested and why -->

## Links
<!-- Issue tracker: Closes #...
Related PRs: #...
Docs / runbook / design doc: -->

## Screenshots (if applicable)
```

This template works for any codebase. If your team uses a tool like Paragon for AI review, filling out the template consistently means the tool has reliable context on every PR, not just the ones where someone remembered to write a description.
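
One way to make "filling out the template consistently" checkable is a small script run in CI against the PR body. The following is a minimal sketch under stated assumptions: the function name `find_empty_sections` and the list of required headings are hypothetical, chosen to match the template above, and nothing here is part of Paragon or GitHub. It strips HTML comments, splits the description on `## ` headings, and reports which required sections are missing or empty.

```python
import re

# Assumed set of required headings, matching the template above.
REQUIRED_SECTIONS = ["What Changed", "Why", "Risk Areas", "Test Plan"]

COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
HEADING_RE = re.compile(r"^##[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def find_empty_sections(description: str) -> list[str]:
    """Return the required sections that are missing or have no text."""
    # Drop template comments so leftover placeholders don't count as content.
    text = COMMENT_RE.sub("", description)

    # Map each level-2 heading to the text between it and the next heading.
    sections: dict[str, str] = {}
    matches = list(HEADING_RE.finditer(text))
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[match.group(1)] = text[match.end():end].strip()

    # A section fails if its heading is absent or its body is empty.
    return [name for name in REQUIRED_SECTIONS if not sections.get(name)]
```

An empty result means every required section has some text. It says nothing about whether that text is useful, so this complements review and does not replace it.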

FAQ

Does writing a better PR description actually change what Paragon flags?

Yes. Paragon uses the description to understand intent, which changes what counts as a finding. If you write "this is a pure refactor with no behavioral changes," and the diff shows a conditional branch being removed, that's worth surfacing. If you hadn't written anything, Paragon might still flag it, but with less certainty about whether it's actually a problem. Clear intent narrows the review to what matters.

How long should a PR description be?

Long enough to cover the five elements, short enough that a reviewer will actually read it. For most PRs, that's 100 to 200 words. Large, high-risk changes warrant more. Small, isolated changes can get by with less. The goal is completeness, not length.
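
The 100 to 200 word range is a guideline, not a rule, but a rough count is simple to compute. The following is a minimal sketch: the helper names `description_word_count` and `length_hint` are hypothetical, and the default thresholds are taken from the range above. HTML comments and heading lines are excluded so an unfilled template does not inflate the count.

```python
import re

COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
HEADING_LINE_RE = re.compile(r"^\s*#{1,6}\s")


def description_word_count(description: str) -> int:
    """Count words in the description, ignoring comments and headings."""
    text = COMMENT_RE.sub("", description)
    body_lines = [
        line for line in text.splitlines() if not HEADING_LINE_RE.match(line)
    ]
    return len(" ".join(body_lines).split())


def length_hint(description: str, low: int = 100, high: int = 200) -> str:
    """Classify the description length against the assumed thresholds."""
    count = description_word_count(description)
    if count < low:
        return "short"
    if count > high:
        return "long"
    return "in range"
```

Treat the result as a prompt for the author, not a gate: a small, isolated change can reasonably come back as "short".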

We already have PR templates. Do we need to change them?

Probably not much. Most PR templates already ask for a description, motivation, and test plan. The main thing to check is whether your template asks for risk areas explicitly. That's the section most templates skip and the most useful one for AI reviewers. If your current template covers the five elements above, keep it and just make sure your team is actually filling it out.

If you want to start using Polarity, read the [docs](https://docs.paragon.run/) or watch our videos under news.

Category: Insights