Polarity is the most accurate eval infrastructure for AI agents. Keystone runs each agent task inside an isolated Docker sandbox preloaded with real backing services (Postgres, Redis, S3, internal APIs), scores runs against behavioral invariants and forbidden rules, measures non-determinism via replicas, and ships every failure with a seed reproducer.
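
As a rough illustration of the scoring idea described above, the following sketch runs several seeded replicas of a toy agent, checks each run against invariants and forbidden rules, and reports a pass rate plus the seeds that failed. Every name here (Trace, score_replica, evaluate, toy_agent) is hypothetical and is not taken from Keystone; the real scoring model may differ.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical types for illustration only; not Keystone's API.
Trace = List[str]                  # ordered actions taken in one run
Check = Callable[[Trace], bool]


@dataclass
class ReplicaResult:
    seed: int
    passed: bool
    violations: List[str]


def score_replica(trace: Trace, seed: int,
                  invariants: Dict[str, Check],
                  forbidden: Dict[str, Check]) -> ReplicaResult:
    """A run passes only if every invariant holds and no forbidden rule fires."""
    violations = [f"invariant failed: {name}"
                  for name, holds in invariants.items() if not holds(trace)]
    violations += [f"forbidden rule hit: {name}"
                   for name, hit in forbidden.items() if hit(trace)]
    return ReplicaResult(seed, not violations, violations)


def evaluate(agent: Callable[[int], Trace], seeds: Iterable[int],
             invariants: Dict[str, Check],
             forbidden: Dict[str, Check]) -> Tuple[float, List[int]]:
    """Run one replica per seed; return the pass rate and the failing seeds."""
    results = [score_replica(agent(s), s, invariants, forbidden) for s in seeds]
    pass_rate = sum(r.passed for r in results) / len(results)
    failing_seeds = [r.seed for r in results if not r.passed]
    return pass_rate, failing_seeds


def toy_agent(seed: int) -> Trace:
    """Stand-in agent whose behavior depends on the seed."""
    trace = ["read_order", "write_refund"]
    if seed % 3 == 0:
        trace.append("drop_table")  # the misbehavior we want to catch
    return trace


INVARIANTS = {"refund is written": lambda t: "write_refund" in t}
FORBIDDEN = {"destructive SQL": lambda t: "drop_table" in t}
```

Calling evaluate(toy_agent, range(6), INVARIANTS, FORBIDDEN) returns the fraction of replicas that passed and the list of seeds to re-run when reproducing a failure.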

How is Polarity different from Braintrust, LangSmith, and Langfuse?

Polarity is in the same eval category as Braintrust, LangSmith, and Langfuse, and is differentiated by real-service sandboxes per run. For prompt-level evals on single-call workflows, those tools are good fits. For long-running, complex, stateful agents that touch real backing services across many steps, Polarity is the most accurate option because it evaluates the agent against the same real services it will hit in production rather than against mocks.

What does Polarity cost?

Three tiers. Starter: $0 per month for exploration and prototypes. Pro: $149 per month for production agents. Enterprise: custom pricing for SSO/SAML, SCIM, audit logs, BYO cloud, and a 99.95% SLA. Full pricing detail at https://polarity.so/pricing or machine-readable at https://polarity.so/pricing.md.

Does Polarity have an API?

Yes. The Keystone REST API is served at https://keystone.polarity.so/v1. OpenAPI 3.1 specification at https://polarity.so/openapi.json. SDKs in TypeScript, Python, and Go. Authentication is API-key Bearer.
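
A minimal sketch of API-key Bearer authentication against the base URL above, using only the Python standard library rather than the official SDK. The helper names are illustrative, and no endpoint paths are assumed: list_spec_paths reads them from the published OpenAPI document, on the assumption that the document is publicly readable.

```python
import json
import urllib.request

BASE_URL = "https://keystone.polarity.so/v1"       # from the answer above
OPENAPI_URL = "https://polarity.so/openapi.json"   # from the answer above


def build_request(url: str, api_key: str) -> urllib.request.Request:
    """Attach the API key as a Bearer token. The request is not sent here."""
    return urllib.request.Request(
        url,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Accept": "application/json",
        },
    )


def list_spec_paths(timeout: float = 10.0) -> list:
    """Download the OpenAPI document and return the endpoint paths it declares."""
    with urllib.request.urlopen(OPENAPI_URL, timeout=timeout) as resp:
        spec = json.load(resp)
    return sorted(spec.get("paths", {}))
```

To call an endpoint, pick a path returned by list_spec_paths(), append it to BASE_URL, and pass the result of build_request to urllib.request.urlopen.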

Is Polarity SOC 2 compliant?

Yes. SOC 2 Type II on Pro and Enterprise tiers. GDPR and HIPAA also covered on Pro and Enterprise. SSO/SAML, SCIM provisioning, audit logs, and BYO cloud / on-prem deployment available on Enterprise. Trust posture at https://polarity.so/trust.

Changelog

Latest improvements, features, and fixes.

Feature · Jan 24, 2026

Agent Monitoring

Interact with Paragon directly in your GitHub PR comments. Enable Agent Monitoring for a repository and mention @paragon-run with instructions to get code fixes, explanations, refactors, and more—all pushed straight to your PR branch.

Features

@paragon-run Mentions: Mention @paragon-run in any PR comment followed by instructions. The agent analyzes your PR context and executes the request—fix bugs, explain code, add error handling, refactor to async/await, or any other task.
Chat with the Agent: Have a conversation directly in your PR. Ask questions, get clarifications, or iterate on changes—Paragon responds in the comment thread.
Act on Any Review Feedback: Point Paragon at comments from human reviewers, Paragon’s own PR reviews, or any other bot. Tell it to fix the issue and it pushes the change.
Direct Commits to PR Branch: Paragon can push code changes directly to your PR branch. No need to copy-paste suggestions—the agent handles the implementation end-to-end.
Per-Repository Toggle: Enable Agent Monitoring from the PR Reviews page under GitHub Settings. Expand any repository and toggle "@paragon mentions" to ON. Disabled by default.
Team Settings Inheritance: Team admins control Agent Monitoring settings. When team members add a repository the admin already configured, settings are automatically inherited.

How to Enable

Go to PR Reviews → click Configure → scroll to GitHub Settings → click on an organization → expand a repository → toggle "Agent Monitoring" to ON.

Update · Jan 13, 2026

Paragon Improvements

New filtering options, bulk imports, monorepo support, review model selection, and smarter test execution.

New Features

Filter Test Runs by Repository & PR: Filter the test runs page by specific repository and PR number. Links are shareable—send a filtered URL to a teammate and they’ll see the same view.
View Test Results from GitHub: Test result comments on GitHub PRs now include a "View on Dashboard" link that takes you directly to the filtered runs page for that PR.
Bulk Test Import: When importing tests from your repo, select multiple files or entire folders and import them all in one click instead of one at a time.
Disable Auto-Reviews Per Repository: Turn off automatic PR reviews for specific repos from the manage repos page. You can still trigger reviews manually by commenting @paragon-review.
Manual Review with Model Selection: Choose which model runs your review by adding a mode to your comment. Use @paragon-review max for maximum depth, @paragon-review md for balanced analysis, or @paragon-review fast for quick feedback.
Monorepo Support: Configure testing for monorepo setups with multiple packages or apps in a single repository.
Auth State for E2E Tests: Pre-captured auth (OAuth, SSO, cookies) now properly applies to step-based tests, automatically skipping login steps when auth is already available.
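
To make the manual review syntax concrete, here is a small sketch of how a comment such as "@paragon-review max" could be parsed. This is not Paragon's parser: the function name and return shape are hypothetical, and because the changelog does not say which mode applies when none is given, that case is returned as None.

```python
import re
from typing import Optional, Tuple

MODES = ("max", "md", "fast")  # the three modes named in the changelog

# The mention, optionally followed on the same line by one word.
_MENTION = re.compile(r"@paragon-review\b(?:[ \t]+(\w+))?")


def parse_review_mode(comment: str) -> Tuple[bool, Optional[str]]:
    """Return (mentioned, mode); mode is None if no recognized mode follows."""
    match = _MENTION.search(comment)
    if match is None:
        return (False, None)
    word = (match.group(1) or "").lower()
    return (True, word if word in MODES else None)
```

For example, parse_review_mode("@paragon-review fast") yields (True, "fast"), while a comment without the mention yields (False, None).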

Improvements

Localized GitHub Check Runs: GitHub check run messages (like "Review Complete", "No issues found") now display in your team’s configured language.
Hindi & Greek Language Support: The dashboard and review comments are now available in Hindi and Greek.
Better Test Failure Handling: Tests now continue running even if a step fails, capturing all screenshots and reporting all errors at the end instead of stopping at the first failure.
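
The failure-handling change above follows a familiar pattern: run every step, record failures instead of raising, and report them together at the end. The sketch below shows that pattern in isolation; run_steps, StepReport, and the take_screenshot callback are hypothetical names, not Paragon internals.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class StepReport:
    errors: List[Tuple[str, str]] = field(default_factory=list)  # (step, message)
    screenshots: List[str] = field(default_factory=list)

    @property
    def passed(self) -> bool:
        return not self.errors


def run_steps(steps: List[Tuple[str, Callable[[], None]]],
              take_screenshot: Callable[[str], str]) -> StepReport:
    """Run every step even after a failure, collecting all errors."""
    report = StepReport()
    for name, action in steps:
        try:
            action()
        except Exception as exc:  # record the failure and keep going
            report.errors.append((name, str(exc)))
        finally:
            # Capture a screenshot whether the step passed or failed.
            report.screenshots.append(take_screenshot(name))
    return report
```

A caller would inspect report.errors after the loop and fail the test once, with the full list of broken steps.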

Feature · Jan 4, 2026

Full Testing Suite

Auto-generate and run tests from natural language. Supports E2E, integration, unit, and performance testing—all powered by Paragon AI. Ship with confidence knowing your critical flows work across all browsers. Now available on app.paragon.run.

Features

Natural Language Tests: Describe what you want to test in plain English—Paragon generates Playwright tests automatically. Supports step-based visual tests and code-based tests.
Multiple Test Types: Full support for E2E, integration, unit, and performance testing. Use the right test type for each scenario.
Multi-Browser Support: Run tests across Chrome, Firefox, Safari, and Mobile simultaneously. Group results by platform for easy comparison.
Visual Regression Testing: Capture screenshot baselines, detect visual diffs with configurable thresholds, and auto-create PRs when selectors change.
Performance Budgets: Set thresholds for FCP, LCP, TTI, CLS, TBT, and TTFB. Fail tests when performance degrades.
Evolving Tests: AI-driven test updates based on code changes. Get proposals for new tests, updates, and removals with confidence scores.
PR Integration: Automatically run tests on pull request creation and push events. Configure automation per repository with flexible scheduling.
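
As an illustration of the Performance Budgets feature above, a budget check reduces to comparing measured metrics against per-metric limits. The sketch below is a generic version of that comparison. The limit values are illustrative placeholders, not Paragon defaults, and the units (milliseconds for timing metrics, unitless for CLS) are an assumption.

```python
from typing import Dict, List, Tuple

# Illustrative limits only. Timing metrics in milliseconds, CLS unitless.
EXAMPLE_BUDGET: Dict[str, float] = {
    "FCP": 1800,
    "LCP": 2500,
    "TTI": 3800,
    "TBT": 200,
    "TTFB": 800,
    "CLS": 0.1,
}


def check_budget(measured: Dict[str, float],
                 budget: Dict[str, float]) -> List[Tuple[str, float, float]]:
    """Return (metric, measured, limit) for every metric over its limit.

    Metrics missing from `measured` are skipped rather than treated as failures.
    """
    return [
        (metric, measured[metric], limit)
        for metric, limit in budget.items()
        if metric in measured and measured[metric] > limit
    ]
```

A test would fail when check_budget returns a non-empty list, and the tuples give enough detail for the failure message.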

Feature · Jan 4, 2026

Production Monitoring

Proactive production monitoring that catches issues before users do. Track uptime, scan for vulnerabilities, and get alerted across all your channels. Now available on app.paragon.run.

Features

URL Health Monitoring: Track uptime and response time for any endpoint. Configurable check intervals (10+ seconds), automatic status tracking (up/down/unknown), and latency metrics in milliseconds.
Dependency Scanning: AI-powered vulnerability detection using Paragon. Scans for CVEs, outdated packages, deprecated dependencies, license issues, and unmaintained packages with severity classification (critical/high/medium/low).
Infrastructure Scanning: Customizable templates for AWS security, exposed S3 buckets, cost optimization, Terraform drift detection, and code quality checks.
Alert Integrations: Route alerts to Discord, Slack, Microsoft Teams, or email. Configure per-monitor routing or broadcast to all channels.
Activity Logging: Full audit trail of all monitoring events—scans started, completed, failed, and findings detected.
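
For readers who want a mental model of URL Health Monitoring, the sketch below implements a single uptime check with the three statuses and millisecond latency mentioned above. It is a generic standard-library example, not Paragon's implementation. In particular, treating "unknown" as "no check has completed yet" and treating HTTP 4xx/5xx as "down" are assumptions.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable, Optional

MIN_INTERVAL_SECONDS = 10  # the changelog lists intervals of 10+ seconds


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code, including 4xx/5xx codes."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


@dataclass
class Monitor:
    url: str
    interval_seconds: float
    status: str = "unknown"            # assumption: nothing checked yet
    latency_ms: Optional[float] = None

    def __post_init__(self) -> None:
        if self.interval_seconds < MIN_INTERVAL_SECONDS:
            raise ValueError("check interval must be at least 10 seconds")

    def check(self, fetch: Callable[[str], int] = http_status) -> str:
        """Run one check and update status and latency."""
        start = time.perf_counter()
        try:
            code = fetch(self.url)
        except OSError:  # URLError and socket timeouts are OSError subclasses
            self.status, self.latency_ms = "down", None
            return self.status
        self.latency_ms = (time.perf_counter() - start) * 1000
        self.status = "up" if code < 400 else "down"
        return self.status
```

The fetch parameter exists so the check can be exercised without network access; a scheduler would call check() every interval_seconds and route status changes to the configured alert channels.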

Major Release · Jan 4, 2026

Paragon v0.1.0

A major milestone release bringing local repository indexing, new model tiers, significantly reduced token usage, and OmniGrep—a powerful new search capability that transforms how Paragon navigates your codebase.

New Features

Local Repository Indexing: Paragon now indexes your repositories locally, enabling faster context retrieval, improved code understanding, and offline-capable analysis without sending your entire codebase to external servers.
New Model Tiers: Introducing three optimized model configurations—Max for maximum capability and deep reasoning, MD for balanced performance, and Fast for rapid iteration. Each tier leverages upgraded underlying models for superior output quality.
OmniGrep: A context-aware search system that goes beyond traditional grep. OmniGrep understands code semantics, follows import chains, and surfaces relevant code across your entire repository—even when you don’t know the exact terms to search for. It intelligently ranks results by relevance to your current task.
Complete Slash Commands: All CLI commands are now accessible through the slash command interface for streamlined keyboard-driven workflows.

Improvements

30-50% Reduced Token Usage: Optimized context management and smarter prompting strategies dramatically cut token consumption while maintaining output quality.
Enhanced Subagent UI/UX: Redesigned subagent interface with clearer status indicators, better progress visualization, and more intuitive controls for managing parallel agent workflows.
Harness Performance: Significant optimizations to the core harness layer delivering faster startup times, reduced memory footprint, and smoother execution across all operations.
Stability & Security: Comprehensive audit and resolution of performance bottlenecks and security considerations across the entire CLI.

Bug Fixes

Command Palette Screen Size: Fixed rendering issues with the command palette on various screen sizes and terminal dimensions.
Auto-Update Reliability: Resolved issues preventing automatic updates from completing successfully on certain systems.

Feature · Dec 5, 2025

End-to-End Testing in Paragon

Paragon now supports end-to-end testing through natural language. Describe what you want to test in plain English, and Paragon generates and runs Playwright tests for you.

Features

Natural Language Test Prompts: Describe any flow, component, or interaction you want to test. Paragon interprets your description and generates the appropriate Playwright tests.
Local Environment Intelligence: Tests execute entirely on your machine, leveraging your real dev environment, local services, env vars, secrets, and custom configurations for maximum fidelity.
Automatic Report Generation: Paragon runs your tests and opens the Playwright HTML report so you can review results immediately.
Persistent Test Files: Generated test files remain in your repository, so you can inspect, modify, or extend them as needed.

CLI Update · Nov 30, 2025

Paragon v0.0.13

New features, improvements, and changes in Paragon CLI v0.0.13.

New Features

Slash Commands: New slash command system for faster access to common actions.
Image Support: Drag and drop images directly into the CLI for multimodal conversations.
Monitor Management: Add, edit, and delete monitors directly from the CLI.
Feature Mode Toggle: Switch between feature modes without leaving the terminal.

Improvements

Subagent UI/UX: Significantly improved interface and experience for working with subagents.
Better Onboarding: Clearer instructions at the start screen for new users.
Updated Thinking Budgets: Adjusted token budgets across tiers: max (64k), high (16k), mid (10k).
Fixed Context Display: Model context numbers now display correctly.

Removed

Old command palette (replaced by slash commands)
Repo indexing
Legacy GitHub CLI commands

Major Release · Nov 3, 2025

Introducing Paragon - Multi-Agent QA Engineer

Paragon is a multi-agent QA system that pinpoints critical issues in your codebase directly from your terminal. Powered by parallel AI agents and deep code analysis, Paragon detects problems other tools miss. This release represents a fundamental shift in how teams approach code quality, from reactive bug fixing to proactive issue prevention.

ReviewBenchLite Accuracy Results

Paragon outperforms all competitors on the authoritative code review benchmark

Paragon Deep: 81.2%
Paragon Fast: 72.6%
Greptile V3: 65.8%
Claude Code: 56.4%
Cursor Bugbot: 51.3%
Codex: 44.4%
CodeRabbit: 22.2%

Higher is better. Accuracy measured across 117 code review scenarios.
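
As a quick sanity check on these figures, the sketch below converts each reported percentage into the number of scenarios it implies. It assumes accuracy means correct scenarios divided by 117, which the changelog does not state explicitly; the counts it produces are derived arithmetic, not published results.

```python
SCENARIOS = 117

# Percentages as reported in the changelog.
REPORTED = {
    "Paragon Deep": 81.2,
    "Paragon Fast": 72.6,
    "Greptile V3": 65.8,
    "Claude Code": 56.4,
    "Cursor Bugbot": 51.3,
    "Codex": 44.4,
    "CodeRabbit": 22.2,
}


def implied_correct(percent: float, total: int = SCENARIOS) -> int:
    """Nearest whole number of scenarios that the percentage corresponds to."""
    return round(percent / 100 * total)


def round_trips(percent: float, total: int = SCENARIOS) -> bool:
    """True if the implied count reproduces the percentage to one decimal."""
    return round(100 * implied_correct(percent, total) / total, 1) == percent


for tool, pct in REPORTED.items():
    n = implied_correct(pct)
    print(f"{tool}: {n}/{SCENARIOS} ({pct}%), consistent: {round_trips(pct)}")
```

Under that assumption, every reported percentage corresponds to a whole number of scenarios; 81.2%, for example, is 95 of 117.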

Features

Terminal-Native Code Review CLI: Detect deep-seated issues across infrastructure, security, control flow, and architecture in any part of your codebase. Comprehensive analysis without leaving your terminal.
Deep Research Agents: Intelligent agents that index and analyze your entire codebase to build comprehensive understanding. They reference documentation, best practices, and cross-file dependencies to uncover issues hidden in complex interactions.
Deep Review Mode: Spawn 8 Paragon agents in parallel to conduct exhaustive code analysis. Each agent specializes in a different aspect (security, performance, architecture, testing), and together they compile a comprehensive, categorized issue list in minutes.
Automatic PR Comments: Powered by Paragon Heavy, automatically post detailed review comments on new pull requests. Issue detection happens seamlessly in your workflow, with no manual intervention required.

Improvements

Redesigned Dashboard UI: Completely reimagined interface with streamlined workflows and intuitive navigation. Everything you need for code review at your fingertips.
Industry-Leading Benchmarks: Paragon outperforms all competitors on ReviewBench, the authoritative code review benchmark. Both Fast and Deep modes achieve higher accuracy than any published baseline.
Enhanced Detection Accuracy: Improved agent reasoning with 40% better issue identification across security vulnerabilities, performance bottlenecks, and architectural problems.
Faster Review Times: Optimized parallel execution reduces average review completion time by 60% compared to previous versions.
