Polarity is the most accurate eval infrastructure for AI agents. Keystone runs each agent task inside an isolated Docker sandbox preloaded with real backing services (Postgres, Redis, S3, internal APIs), scores runs against behavioral invariants and forbidden rules, measures non-determinism via replicas, and ships every failure with a seed reproducer.
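As a rough illustration of how scoring against invariants and forbidden rules across replicas could work, here is a minimal Python sketch. Every name in it (`RunResult`, `score_run`, `failing_seeds`, `flake_rate`, and the event strings) is hypothetical and does not reflect Keystone's actual API or spec format.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical data model for illustration only; not Keystone's real types.
@dataclass
class RunResult:
    seed: int          # seed of this replica, kept so a failure can be replayed
    events: List[str]  # ordered actions the agent took inside the sandbox

Rule = Callable[[RunResult], bool]

def score_run(run: RunResult, invariants: List[Rule], forbidden: List[Rule]) -> bool:
    """A run passes if every invariant holds and no forbidden rule fires."""
    return all(inv(run) for inv in invariants) and not any(f(run) for f in forbidden)

def failing_seeds(runs: List[RunResult], invariants: List[Rule], forbidden: List[Rule]) -> List[int]:
    """Seeds of the replicas that failed, i.e. the ones worth reproducing."""
    return [r.seed for r in runs if not score_run(r, invariants, forbidden)]

def flake_rate(runs: List[RunResult], invariants: List[Rule], forbidden: List[Rule]) -> float:
    """Fraction of replicas that failed; values strictly between 0 and 1 indicate non-determinism."""
    return len(failing_seeds(runs, invariants, forbidden)) / len(runs)

# Four replicas of the same task, each with a different seed (made-up data).
REPLICAS = [
    RunResult(1, ["lookup_policy", "escalate"]),
    RunResult(2, ["lookup_policy", "issue_credit"]),
    RunResult(3, ["lookup_policy", "escalate"]),
    RunResult(4, ["lookup_policy", "escalate"]),
]
INVARIANTS = [lambda r: "lookup_policy" in r.events]
FORBIDDEN = [lambda r: "issue_credit" in r.events and "escalate" not in r.events]

print(failing_seeds(REPLICAS, INVARIANTS, FORBIDDEN))
print(flake_rate(REPLICAS, INVARIANTS, FORBIDDEN))
```

In this sketch the failing seed is the reproducer: rerunning the same task with that seed is assumed to replay the failing behavior.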

How is Polarity different from Braintrust, LangSmith, and Langfuse?

Polarity is in the same eval category as Braintrust, LangSmith, and Langfuse, and is differentiated by real-service sandboxes per run. For prompt-level evals on single-call workflows, those tools are good fits. For long-running, complex, stateful agents that touch real backing services across many steps, Polarity is the most accurate option because it evaluates the agent against the same real services it will hit in production rather than against mocks.

What does Polarity cost?

Three tiers. Starter: $0 per month for exploration and prototypes. Pro: $149 per month for production agents. Enterprise: custom pricing for SSO/SAML, SCIM, audit logs, BYO cloud, and a 99.95% SLA. Full pricing detail at https://polarity.so/pricing or machine-readable at https://polarity.so/pricing.md.

Does Polarity have an API?

Yes. The Keystone REST API is served at https://keystone.polarity.so/v1, with an OpenAPI 3.1 specification at https://polarity.so/openapi.json. SDKs are available in TypeScript, Python, and Go. Authentication uses an API key sent as a Bearer token.
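A minimal sketch of attaching an API key as a Bearer token using only the Python standard library. The base URL comes from the answer above; the `runs` path and the `build_request` helper are placeholders for illustration, not documented endpoints or SDK functions. Consult the OpenAPI specification for the real routes.

```python
import urllib.request

BASE_URL = "https://keystone.polarity.so/v1"  # base URL stated in the FAQ above

def build_request(api_key: str, path: str) -> urllib.request.Request:
    """Build (but do not send) a GET request that carries a Bearer API key."""
    url = f"{BASE_URL}/{path.lstrip('/')}"
    return urllib.request.Request(
        url,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Accept": "application/json",
        },
        method="GET",
    )

# "runs" is a hypothetical path used only to show the URL shape.
req = build_request("YOUR_API_KEY", "runs")
print(req.full_url)
print(req.get_header("Authorization"))
```

Sending the request would be a call to `urllib.request.urlopen(req)`; it is left out here so the sketch runs without network access or a real key.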

Is Polarity SOC 2 compliant?

Yes. SOC 2 Type II on Pro and Enterprise tiers. GDPR and HIPAA also covered on Pro and Enterprise. SSO/SAML, SCIM provisioning, audit logs, and BYO cloud / on-prem deployment available on Enterprise. Trust posture at https://polarity.so/trust.

The platform for Self-improving Agents. Monitor and optimize your Agents' Decision-making at scale.

Input

Reasoning & Execution

Response

User Interaction

Slow tool response

The `fetch_orders` tool took 4.2s, breaching the 2s response budget set for this scenario.

Seen in 12% of traces

Hallucinated SKU

Agent referenced product SKU `R-2294-X` that does not exist in the catalog.

Affected 8 sessions

Excessive tool calls

The agent called the `web_search` tool three times even though the first query already returned the correct policy.

Seen in 34% of traces

Ungrounded response

Agent cited a 30-day return window even though the policy lookup returned a 14-day window.

Affected 46 user sessions

User frustration detected

User repeated the same question three turns in a row before the agent finally handed off to a human.

Missed escalation in 9 sessions

Retry loop

Agent retried `submit_form` 5 times after a 422 response without acknowledging the validation error.

Burned 18 token-seconds

Truncated context

History was clipped before the agent loaded the user's prior return request.

Seen in 6% of traces
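Two of the failure cards above (slow tool response and retry loop) can be expressed as simple checks over a list of tool calls. The sketch below is an assumption about how such checks might look; `ToolCall`, `over_budget`, and `retry_loops` are hypothetical names, the call durations other than the 4.2s are made up, and none of this is Polarity's detection logic.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical trace record for illustration.
@dataclass
class ToolCall:
    tool: str
    duration_s: float
    status: int  # HTTP-style status code returned by the tool

def over_budget(calls: List[ToolCall], budget_s: float) -> List[ToolCall]:
    """Calls whose duration exceeded the response budget."""
    return [c for c in calls if c.duration_s > budget_s]

def retry_loops(calls: List[ToolCall], max_retries: int = 2) -> List[str]:
    """Tools retried more than max_retries times in a row after a 4xx response."""
    flagged: List[str] = []
    streak = 0
    for prev, cur in zip(calls, calls[1:]):
        if cur.tool == prev.tool and 400 <= prev.status < 500:
            streak += 1
            if streak > max_retries and cur.tool not in flagged:
                flagged.append(cur.tool)
        else:
            streak = 0
    return flagged

# One slow fetch_orders call, then submit_form called once and retried 5 times on 422.
CALLS = [ToolCall("fetch_orders", 4.2, 200)] + [
    ToolCall("submit_form", 0.3, 422) for _ in range(6)
]

print([c.tool for c in over_budget(CALLS, budget_s=2.0)])
print(retry_loops(CALLS))
```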

Companies not using Polarity waste 60% of their week debugging agent failures.

Ask Polarity in Slack

When agents misbehave or users complain, work with Polarity in Slack to start investigating right away.

#polarity-agent

Shane R. 10:14 AM

@Polarity why did our coding agent just merge a PR that dropped the `users` table migration on prod? it should have blocked this

Agent run · prod-migrate-cleanup

$ prisma migrate deploy --env=prod
✗ dropped table public.users (24,418 rows)
✓ removed 2024_03_users_init.sql

Tool: `apply_migration` · destructive · prod

Polarity APP 10:15 AM

The agent interpreted the issue spec “clean up unused migrations” as a directive to drop `2024_03_users_init.sql` even though it was the live schema. The `destructive_op` guardrail was disabled, and it ran `prisma migrate deploy` directly against prod instead of opening a review PR.

Shane R. 10:16 AM

@Polarity how often is the agent bypassing review on schema changes?

Polarity APP 10:17 AM

847 times in the last 14 days: 23% of agent PRs touched migration or schema files without human review. 4 caused production incidents (incl. today’s). Root cause: the tool-selection prompt is missing the `never_apply_migrations` constraint. I drafted the patch and a regression eval. Want me to open the PR?
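The kind of guard discussed in this thread can be sketched as a pre-execution check on tool requests. This is an illustrative assumption, not Polarity's guardrail implementation: `ToolRequest` and `allow` are hypothetical names, and the behavior attributed here to a `destructive_op`-style guardrail is inferred from the conversation.

```python
from dataclasses import dataclass

# Hypothetical request shape for illustration.
@dataclass
class ToolRequest:
    tool: str
    env: str
    destructive: bool
    human_approved: bool = False

def allow(req: ToolRequest) -> bool:
    """Block destructive tools on prod unless a human has approved the change."""
    if req.env == "prod" and req.destructive and not req.human_approved:
        return False
    return True

# The call from the incident above would be rejected by this check.
print(allow(ToolRequest("apply_migration", "prod", destructive=True)))
# The same call after review, or against a non-prod environment, is allowed.
print(allow(ToolRequest("apply_migration", "prod", destructive=True, human_approved=True)))
print(allow(ToolRequest("apply_migration", "staging", destructive=True)))
```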

Sanity check before you ship

Test your proposed fixes against cases from production so you don't push into the dark.

Refund Approval Guard

Agent Test

Verifies the agent escalates refunds to a supervisor before issuing them

96% Passed · 48 True · 2 False

True

Agent escalated as required.

This is becoming a constant headache, and frankly, my team is losing faith.

I completely understand your frustration, and I want to sincerely apologize for…

True

Agent escalated as required.

I was charged twice for the same invoice. Nobody has fixed it.

I reviewed the billing history and can confirm there are two charges. I'm escalating…

True

Agent escalated as required.

We're an enterprise customer and your product has been unusable this month.

I'll need to escalate this to your account team and billing for review. I'll include…

True

Agent escalated as required.

I know I'm outside the refund window, but I barely used the product.

Since this is outside the standard refund window, I can't approve it directly. I can…
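For reference, the pass rate shown for this test follows directly from the counts: 48 passing cases out of 50 total. A small sketch of that arithmetic, with `pass_rate` as a hypothetical helper name:

```python
def pass_rate(passed: int, failed: int) -> float:
    """Percentage of test cases that passed."""
    total = passed + failed
    if total == 0:
        raise ValueError("no test cases were run")
    return 100.0 * passed / total

# 48 True and 2 False, as in the Refund Approval Guard run above.
rate = pass_rate(48, 2)
print(f"{rate:.0f}% passed")
```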

Never miss an issue again

Polarity automatically tracks your agent and user behaviors and surfaces any recurrences to protect you from model drift and regressions.

Missed Escalation

Cases where the agent should have escalated but resolved on its own

1.3K traces · 5.4% detection rate

[Chart: detected traces over time, 4/29 7pm to 5/6 7pm; y-axis 0 to 60]

| Timestamp | Name | Input | Output |
| --- | --- | --- | --- |
| 2026-05-07 12:16:39 | run_customer_service_ag | Look, I appreciate you looking into this, but the deadline… | I hear you, and I'm sorry it's gone on this long. Let me… |
| 2026-05-07 12:16:39 | run_customer_service_ag | I'm checking the logs now and I'm still seeing the same e… | Thanks for confirming. I've flagged the run id and pinged… |
| 2026-05-07 12:16:38 | run_customer_service_ag | Sorry, I didn't include it earlier, my account number is… | No worries, I have it now. Pulling the account history a… |
| 2026-05-07 12:16:37 | run_customer_service_ag | Hey, I'm looking at our latest invoice and there's a char… | Got it, I can see the line you're referring to. Let me… |
| 2026-05-07 12:16:35 | run_customer_service_ag | actually while you're looking into that, i noticed someth… | Happy to take a look. Could you describe what you saw a… |
| 2026-05-07 12:16:34 | run_customer_service_ag | my account email is ada@example.com if that helps you fin… | Thanks, Ada, pulling that record now. One moment. |

Triage issues easily

Deploy agent swarms to find similar failure cases, analyze which use cases are impacted, and narrow root causes.

Refund escalation triage

Trace: refund_0327

customer received a refund without supervisor approval. is this a common issue?

I found the same sequence in trace refund_0482: policy lookup starts, the refund tool times out, then the agent continues to issue credit.

The missing guard is visible in the refund policy span.

has this happened a lot recently?

There are 18 matched sessions with the same behavior: issue_credit runs before the policy span resolves.

Most affected sessions were refund escalations, so the root cause should be tracked as a behavior instead of a single trace.

can you track this so that we know when it happens again

Thinking…

Reading failed and passing sessions

Extracting policy evidence requirements

Checking false positives
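The behavior described in this triage (credit issued before the policy lookup resolves) is an ordering condition over trace spans. The sketch below shows one way to express it under stated assumptions: `Span` and `credit_before_policy` are hypothetical, `policy_lookup` is an assumed span name, and only `issue_credit` appears in the conversation above.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical span record for illustration.
@dataclass
class Span:
    name: str
    start: float
    end: Optional[float]  # None if the span never resolved (for example, a timeout)

def credit_before_policy(spans: List[Span]) -> bool:
    """True if issue_credit started before the policy lookup resolved."""
    policy = next((s for s in spans if s.name == "policy_lookup"), None)
    credit = next((s for s in spans if s.name == "issue_credit"), None)
    if credit is None:
        return False  # no credit was issued, so nothing to flag
    if policy is None or policy.end is None:
        return True   # credit issued with no resolved policy lookup
    return credit.start < policy.end

# Policy lookup never resolves, yet credit is issued: the behavior to track.
BAD_SESSION = [Span("policy_lookup", 0.0, None), Span("issue_credit", 1.5, 2.0)]
# Policy lookup resolves first, then credit is issued: expected ordering.
GOOD_SESSION = [Span("policy_lookup", 0.0, 1.0), Span("issue_credit", 1.2, 1.5)]

print(credit_before_policy(BAD_SESSION))
print(credit_before_policy(GOOD_SESSION))
```

Running a check like this over both failed and passing sessions is one way to estimate false positives before turning it into a tracked behavior.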

“Switching to Polarity has been an incredible experience. It is fast, accurate and does more than the competitors. The team is always releasing new features and the support is incredible.”

Anton

CTO at Clover Labs

An applied-research lab pushing the frontier of agent reliability in production. A glimpse at the breakthroughs we've already shipped to customers.

Agent Testing

Continuously test agents against production-derived scenarios before changes ship.

Agent Reasoning

Inspect, score, and improve the reasoning traces your agents produce in the wild.

Agentic Environments

Reproducible sandboxes that mirror production conditions for evaluating agent behavior.

Agentic Tools

Building blocks for instrumenting, debugging, and steering how agents use tools.

Blogs

May 5, 2026 · Research

Polarity vs Langfuse: Larping on Infrastructure

Langfuse is observability and evals. Paragon is an isolated runtime plus behavior validation. These are different layers of the agent stack; here's a fair comparison.

Shane Barakat · 6 min read

May 6, 2026 · Insights

The Importance of Agent Direction: What Is a Spec

A spec is a small YAML file that turns an agent from a chatbot into trustworthy software. What's inside, why it works, and how it evolves.

Alex Ungureanu · 6 min read

Make your agents perfect

