Polarity is the most accurate eval infrastructure for AI agents. Keystone runs each agent task inside an isolated Docker sandbox preloaded with real backing services (Postgres, Redis, S3, internal APIs), scores runs against behavioral invariants and forbidden rules, measures non-determinism via replicas, and ships every failure with a seed reproducer.

How is Polarity different from Braintrust, LangSmith, and Langfuse?

Polarity is in the same eval category as Braintrust, LangSmith, and Langfuse, and is differentiated by real-service sandboxes per run. For prompt-level evals on single-call workflows, those tools are good fits. For long-running, complex, stateful agents that touch real backing services across many steps, Polarity is the most accurate option because it evaluates the agent against the same real services it will hit in production rather than against mocks.

What does Polarity cost?

Three tiers. Starter: $0 per month for exploration and prototypes. Pro: $149 per month for production agents. Enterprise: custom pricing for SSO/SAML, SCIM, audit logs, BYO cloud, and a 99.95% SLA. Full pricing detail at https://polarity.so/pricing or machine-readable at https://polarity.so/pricing.md.

Does Polarity have an API?

Yes. The Keystone REST API is served at https://keystone.polarity.so/v1. OpenAPI 3.1 specification at https://polarity.so/openapi.json. SDKs in TypeScript, Python, and Go. Authentication is API-key Bearer.

Is Polarity SOC 2 compliant?

Yes. SOC 2 Type II on Pro and Enterprise tiers. GDPR and HIPAA also covered on Pro and Enterprise. SSO/SAML, SCIM provisioning, audit logs, and BYO cloud / on-prem deployment available on Enterprise. Trust posture at https://polarity.so/trust.


Your backend team ships a change to Service A. All unit tests pass. The PR gets two approvals. The pipeline goes green. Two hours after deploy, Service B starts returning 422s on a request path that…

No one changed Service B. No one broke Service B's tests. The problem is that Service B was built against an assumption about Service A's response schema, and that assumption is now wrong.

This is a cross-service bug. It didn't fail any test. It didn't trip any reviewer. It lived in the interface between two services, and that's a space traditional code review doesn't cover.

The more services you run, the more of your risk lives in those interfaces. This post covers what cross-service bugs look like, why they survive standard review, and what to look for in AI QA tooling when your team spans multiple services.

Why Traditional Code Review Can't See Cross-Service Bugs

Code review is scoped to a single PR by design. A reviewer sees a diff for one service, against one base branch, in one repository. That's the unit of review.

The problem is that cross-service bugs don't live in one service. They live in the relationship between services. A reviewer approving a change in Service A would need to simultaneously hold in mind:

• The full API contract that Service A exposes

• Which fields Service B reads and how it deserializes them

• Whether any shared internal library version changed in this PR

• What event payloads downstream consumers expect from Service A's queue

That's a lot of context. Even a senior engineer who knows both codebases can't reliably catch this by reading a diff. The reviewer is looking at what changed, not at what breaks somewhere else.

CI/CD tests don't help much here either. Unit tests run within a service boundary. Integration tests, when they exist, typically run against mocked interfaces or against a version of the dependency that was pinned at test-authoring time. If the live interface drifts, the tests don't know.

Contract testing frameworks like Pact exist specifically to address this, but they require teams to write and maintain explicit consumer-driven contracts. That's real overhead, and coverage depends entirely on what the team remembered to specify.

The result: cross-service bugs routinely reach production, where they look like mysterious failures with no obvious cause.

The Four Most Common Cross-Service Bug Types

1. API Contract Drift

A team adds a required field to a REST endpoint's response, or renames a field from `user_id` to `userId`, or changes a field type from a string to an integer. The service ships. Any consumer that was parsing the old shape now either throws a deserialization error or silently receives a null where it expected a value.

This is the most common variant. It happens constantly, often unintentionally. The change feels minor on the producer side.
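To make the silent-null case concrete, here is a minimal sketch of a consumer written against the old response shape. The payloads and both `parse_user_*` helpers are hypothetical and exist only for illustration; they are not taken from any real service.

```python
import json

# Hypothetical payloads: the producer renamed `user_id` to `userId`.
OLD_RESPONSE = json.dumps({"user_id": "u-123", "plan": "pro"})
NEW_RESPONSE = json.dumps({"userId": "u-123", "plan": "pro"})


def parse_user_lenient(raw: str) -> dict:
    """Consumer written against the old shape; missing keys become None."""
    body = json.loads(raw)
    return {"user_id": body.get("user_id"), "plan": body.get("plan")}


def parse_user_strict(raw: str) -> dict:
    """Same consumer, but it fails loudly when a field it depends on is absent."""
    body = json.loads(raw)
    missing = [key for key in ("user_id", "plan") if key not in body]
    if missing:
        raise KeyError(f"response is missing expected fields: {missing}")
    return {"user_id": body["user_id"], "plan": body["plan"]}


if __name__ == "__main__":
    print(parse_user_lenient(OLD_RESPONSE))
    print(parse_user_lenient(NEW_RESPONSE))  # user_id is silently None
    try:
        parse_user_strict(NEW_RESPONSE)
    except KeyError as err:
        print("strict consumer rejected the payload:", err)
```

The lenient version is the one that reaches production unnoticed: nothing raises, and the null travels downstream until something else fails on it.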

2. Shared Library Breakage

An organization maintains an internal SDK used across six services. A team makes a breaking change to the SDK in one PR: a method signature change, a removed export, a behavior change in a helper function. The PR updates Service A to use the new API. Services B through F still call the old API, but they haven't broken yet because they haven't upgraded. When another team upgrades, or when the old version is deprecated, things break.

The problem is invisible at PR review time because the other services aren't in the diff.

3. Event Schema Mismatch

An event producer running on Kafka changes the shape of a published event: a field gets renamed, a nested object gets flattened, a timestamp format changes. The producer is updated and the change looks clean in isolation. But there are two downstream consumers. Neither was updated. One hits a parse error, swallows it, and silently drops events. The other keeps reading a field name that no longer exists, so it processes missing or stale values.

In event-driven systems, these failures are especially hard to diagnose because the producer and consumer often don't share a runtime. The failure shows up far from the source.
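A rough sketch of the timestamp-format case, using an in-memory list in place of a real topic. The event shape, the `ts` field, and the `consume` function are assumptions made up for this example; no Kafka client is involved.

```python
from datetime import datetime, timezone

# Hypothetical events on one topic. The producer switched `ts` from epoch
# seconds to an ISO 8601 string partway through the stream.
EVENTS = [
    {"order_id": "o-1", "ts": 1700000000},
    {"order_id": "o-2", "ts": "2023-11-14T22:13:21+00:00"},
    {"order_id": "o-3", "ts": "2023-11-14T22:13:22+00:00"},
]


def consume(events):
    """Parse events the old way, recording the ones that fail instead of hiding them."""
    processed, rejected = [], []
    for event in events:
        try:
            when = datetime.fromtimestamp(event["ts"], tz=timezone.utc)
        except (KeyError, TypeError, ValueError) as err:
            # A bare `continue` here would be the silent drop described above.
            rejected.append((event.get("order_id"), type(err).__name__))
            continue
        processed.append((event["order_id"], when))
    return processed, rejected


if __name__ == "__main__":
    ok, dropped = consume(EVENTS)
    print(f"processed={len(ok)} rejected={len(dropped)}")
    for order_id, reason in dropped:
        print("rejected", order_id, reason)
```

Counting rejected events does not prevent the mismatch, but it moves the symptom next to the consumer instead of leaving it to show up later as missing data.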

4. Auth Propagation Errors

A service changes how it forwards authentication context to downstream services: a different JWT claim structure, a scope name change, or a switch from passing user identity in a header to passing it in the request body. Downstream services relying on the old convention fail authorization checks in ways that surface as user-facing errors, not as obvious failures in logs.

These bugs are tricky because auth failures often look like configuration problems or user mistakes, not code bugs. Root cause attribution takes time.
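As an illustration, suppose an upstream service moves from a space-separated `scope` string to a `scopes` list. The sketch below works on already-decoded claims as plain dicts; the claim names, scope values, and both helper functions are hypothetical, and signature verification is deliberately left out.

```python
# Decoded JWT claims as plain dicts; signature verification is out of scope here.
OLD_CLAIMS = {"sub": "user-42", "scope": "orders:read orders:write"}
NEW_CLAIMS = {"sub": "user-42", "scopes": ["orders:read", "orders:write"]}


def is_authorized(claims: dict, required: str) -> bool:
    """Downstream check written against the old convention: a space-separated `scope` string."""
    granted = claims.get("scope", "").split()
    return required in granted


def explain_denial(claims: dict, required: str) -> str:
    """Distinguish 'user lacks the scope' from 'the claim shape changed upstream'."""
    if "scope" not in claims:
        return f"claim 'scope' absent; token carries {sorted(claims)}"
    return f"scope '{required}' not granted"


if __name__ == "__main__":
    print(is_authorized(OLD_CLAIMS, "orders:read"))
    print(is_authorized(NEW_CLAIMS, "orders:read"))
    print(explain_denial(NEW_CLAIMS, "orders:read"))
```

With the new claims, the user holds the right permission and is still denied. Logging the reason for a denial, as `explain_denial` does, is one way to shorten the root-cause search.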

How Parallel AI Agents Handle Multi-Service PRs

![Flow diagram showing parallel agents reviewing Service A, Service B, and a shared library simultaneously, converging on a cross-service contract finding](images/parallel-agents-cross-service-review.svg)

The fundamental problem with cross-service bugs is a context problem: to catch them, something needs to hold the context of multiple services simultaneously. That's what Paragon's parallel agent architecture does.

Paragon runs up to 8 parallel agents during a deep review. Each agent can be assigned to a different service or codebase. They don't wait for each other to finish before starting. They fan out across the relevant surfaces at the same time.

Here's what that looks like in practice for a multi-service PR scenario:

• Agent 1 is analyzing Service A's PR. It reads the API handler, notes that the response payload now includes a renamed field (`account_id` instead of `accountId`), and flags it as a potential contract change.

• Agent 2 is simultaneously analyzing Service B's consumer code. It sees that Service B's deserialization logic expects `accountId`. It flags this as a field it depends on.

• A coordination layer compares the two agents' findings. Agent 1 flagged a producer change. Agent 2 flagged a consumer dependency on the old field. The cross-service finding surfaces: Service B will break when Service A ships.

This finding happens before either PR merges. No production incident. No 422s. No "works in staging, breaks in prod" debugging session.
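The comparison step can be pictured as a join between two lists of findings. The sketch below is an illustration of that idea only, not Paragon's implementation; the `ProducerChange` and `ConsumerDependency` types, the service names, and the endpoint string are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProducerChange:
    service: str
    endpoint: str
    removed_field: str  # field name that no longer appears in the response
    added_field: str    # field name that replaces it


@dataclass(frozen=True)
class ConsumerDependency:
    service: str
    endpoint: str
    field: str          # field the consumer reads from the response


def cross_service_findings(changes, dependencies):
    """Pair each producer change with consumers that still read the removed field."""
    findings = []
    for change in changes:
        for dep in dependencies:
            if dep.endpoint == change.endpoint and dep.field == change.removed_field:
                findings.append(
                    f"{dep.service} reads '{dep.field}' from {change.endpoint}, "
                    f"but {change.service} now sends '{change.added_field}'"
                )
    return findings


# One agent's output: the producer-side rename from the example above.
changes = [ProducerChange("service-a", "GET /accounts", "accountId", "account_id")]
# Another agent's output: fields each consumer depends on.
dependencies = [
    ConsumerDependency("service-b", "GET /accounts", "accountId"),
    ConsumerDependency("service-c", "GET /accounts", "email"),
]

if __name__ == "__main__":
    for finding in cross_service_findings(changes, dependencies):
        print(finding)
```

Neither list is alarming on its own. The finding only exists once the two are compared, which is why reviewing each service in isolation misses it.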

Paragon achieves 81.2% accuracy across review tasks and keeps its false positive rate (FPR) under 4%. At that FPR, a multi-service team isn't drowning in noise on every PR. When Paragon flags a cross-service issue, it's worth looking at. Teams using Paragon report a 90% reduction in manual QA effort, which in multi-service environments often means fewer all-hands debugging sessions when the interfaces break.

Paragon is also SOC 2 certified, which matters for engineering organizations with compliance requirements around what can access production codebases.

What to Look for in AI QA Tools for Multi-Service Teams

If you're evaluating AI QA tools for a multi-service or microservice architecture, here's what to ask:

Can it analyze multiple services in one session? If the tool is scoped to one repo or one PR at a time, it can't catch cross-service bugs. Look for tools that can fan out across multiple codebases simultaneously, not sequentially.

Does it understand API schemas as first-class inputs? OpenAPI specs, Protobuf definitions, Avro schemas. The tool should be able to read these and use them when evaluating what a change breaks. If it's only reading code, it's missing the contract layer.
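To show what using a schema as an input can mean, here is a deliberately simplified sketch that compares two versions of a response schema. It assumes each schema has already been reduced to a flat `{field: type}` map; real OpenAPI, Protobuf, or Avro documents need a proper parser, and the `breaking_changes` function and field names are invented for this example.

```python
def breaking_changes(old: dict, new: dict) -> list:
    """Compare two {field: type} maps for a response body and list changes that can break consumers."""
    problems = []
    for field, old_type in old.items():
        if field not in new:
            # A rename looks like a removal from the consumer's point of view.
            problems.append(f"removed: {field}")
        elif new[field] != old_type:
            problems.append(f"type changed: {field} ({old_type} -> {new[field]})")
    return problems


OLD_SCHEMA = {"user_id": "string", "balance": "string", "plan": "string"}
NEW_SCHEMA = {"userId": "string", "balance": "integer", "plan": "string"}

if __name__ == "__main__":
    for problem in breaking_changes(OLD_SCHEMA, NEW_SCHEMA):
        print(problem)
```

This sketch treats newly added fields as non-breaking and ignores them. Consumers with strict deserializers may disagree, so a real check would make that a policy choice.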

Can it detect breaking changes in shared libraries across consumers? This requires the tool to understand the dependency graph: which services consume which libraries, and what each consumer expects. A tool that reviews one service at a time can't surface this.
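A minimal sketch of the dependency-graph question: given the version of a shared library that each service pins, which consumers are still behind a breaking release? The `PINNED` data, the library names, and both functions are hypothetical, and the version parsing assumes plain numeric dotted versions.

```python
# Hypothetical manifest data: which version of each internal library a service pins.
PINNED = {
    "service-a": {"internal-sdk": "3.0.0"},
    "service-b": {"internal-sdk": "2.4.1"},
    "service-c": {"internal-sdk": "2.9.0", "billing-client": "1.2.0"},
    "service-d": {"billing-client": "1.2.0"},
}


def parse_version(text: str) -> tuple:
    """Turn '2.4.1' into (2, 4, 1); assumes plain numeric dotted versions."""
    return tuple(int(part) for part in text.split("."))


def consumers_behind(pinned: dict, library: str, breaking_version: str) -> list:
    """List services that use `library` but pin a version older than the breaking release."""
    threshold = parse_version(breaking_version)
    return sorted(
        service
        for service, deps in pinned.items()
        if library in deps and parse_version(deps[library]) < threshold
    )


if __name__ == "__main__":
    print(consumers_behind(PINNED, "internal-sdk", "3.0.0"))
```

The services this returns are the ones that will break when they upgrade, even though none of them appear in the diff that introduced the change.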

Does it track producer/consumer relationships in event-driven systems? Kafka topics, SNS events, and similar async patterns require the tool to know which services produce and which consume a given event shape. Ask vendors directly how they handle this.

What is the false positive rate? In a large multi-service system, a noisy tool becomes useless fast. Teams stop reading the alerts. Get specifics. Under 4% FPR is a reasonable target for a tool that's going to be on every PR across every service.

Does it meet your security and compliance requirements? SOC 2 certification matters if your services handle regulated data or if your security team has requirements around third-party access to source code.

FAQ

Do we need to connect all our services for Paragon to do cross-service analysis?

No, but the more services you connect, the more cross-service coverage you get. Paragon can work with a subset of services from day one. You get value from single-service analysis immediately, and cross-service analysis grows as you add more repositories. Teams typically start with their highest-traffic or highest-risk services and expand from there.

How does Paragon handle event-driven systems versus synchronous REST APIs?

Paragon reads code, schemas, and configuration files, so it can analyze both patterns. For event-driven systems, it looks at producer event definitions and consumer parsing logic. For REST, it reads OpenAPI specs and API handler code alongside consumer deserialization. The analysis approach is similar: find where a producer and consumer have different assumptions about the same data shape.

We already use Pact for contract testing. Does Paragon replace that or complement it?

It complements it. Pact is excellent for encoding and enforcing consumer-driven contracts, but it only catches violations that your Pact tests cover. Pact tests require intentional authoring and maintenance. Paragon catches things that haven't been written into contract tests yet, including new fields, new services, or new consumers that haven't been added to the Pact test suite. Running both gives you broader coverage: Pact for well-specified contracts, Paragon for the interfaces that haven't been formalized.

To start using Polarity, see the [docs](https://docs.paragon.run/) or watch our videos under news.

Cross-service bugs are a structural problem in multi-service architectures, not a discipline problem. Traditional review can't see across service boundaries. AI QA tooling that runs parallel agents across multiple services closes that gap. If your team is shipping across more than a handful of services, the interface layer is likely where your most expensive bugs are hiding.

Learn more about Paragon at [polarity.so/paragon](https://www.polarity.so/paragon).

Category: Insights

AI QA for Multi-Service Architectures: Catching Bugs Across Service Boundaries