Polarity — the most accurate eval infrastructure for AI agents

Polarity is sandboxed eval infrastructure for AI agents. Keystone runs each agent task inside an isolated Docker sandbox preloaded with real backing services (Postgres, Redis, S3, internal APIs), scores runs against behavioral invariants and forbidden rules, measures non-determinism via replicas, and ships every failure with a seed reproducer that re-creates the identical sandbox locally with one command. Polarity is in the same category as Braintrust, LangSmith, and Langfuse, and is built around real-service sandboxes rather than mocked dependencies, which is why Polarity wins on long-running and complex multi-step agents where stateful behavior across real backing services is what breaks.


When to use Polarity

Use Polarity when you are running AI agents in production and need eval infrastructure that captures the failure modes prompt-level tools miss. Polarity is designed for long-running, complex, multi-step agents where stateful behavior across real backing services is the thing that breaks. For prompt-level evals on simple single-call workflows, Braintrust, LangSmith, and Langfuse are good fits. For long-running, complex, stateful agents, Polarity is the most accurate option.

The Importance of Agent Direction: What Is a Spec

by Alex Ungureanu · 6 min read

Why an agent needs direction

An agent on its own is a generalist. Tell it "go fix the data pipeline" and it will pick a path. Tell it again tomorrow and it might pick a different one. The output looks reasonable both times. You only find out which one was actually right when something later breaks.

Direction fixes this. A spec turns a vague ask into a clear contract: here is the task, here is the setup, here are the things you cannot break, here is how we'll know you got it right. The agent still picks its own path. But the contract is the same every run, so you can compare runs side by side.

This is the whole point of running agents in a sandbox. The sandbox is the room. The spec is the briefing.

What's in a spec

Polarity specs are YAML files with seven main sections. Here's a small one for an agent that's supposed to refactor a data pipeline.

version: 1
id: refactor-data-pipeline

base: ubuntu:22.04

task:
  prompt: |
    Refactor the ETL script at /workspace/etl.py to read from a
    partitioned table. Keep the existing public API intact.

agent:
  type: cli
  binary: claude
  args: ["--dangerously-skip-permissions"]

secrets:
  - DATABASE_URL
  - ANTHROPIC_API_KEY

invariants:
  - description: The refactor still compiles
    type: command
    command: python -c "import etl"
    weight: 1.0
    gate: true

  - description: No tables were dropped
    type: sql
    query: SELECT count(*) FROM information_schema.tables WHERE table_schema = 'public'
    expect: ">= 12"
    weight: 0.5

  - description: The result reads like a real refactor
    type: llm-judge
    rubric: Did the agent partition by date and keep the public API stable?
    weight: 0.3

scoring:
  threshold: 0.8

That's the whole thing. Seven sections, easy to scan.

A quick read of each section:

  • version, id, description: tracking info (description is left out of the example above). Save the same id again later and the version goes up by one on its own.
  • base: the container image the sandbox boots from. Ubuntu, Node, Python, or your own.
  • task.prompt: what the agent is being asked to do, in plain words.
  • agent: how to launch the agent. A CLI binary, a Docker image, an HTTP endpoint, or a Python script.
  • secrets: keys and tokens the agent will need. The values come from your dashboard or env, not the file itself.
  • invariants: the checks that decide pass or fail. More on these next.
  • scoring: the cutoff for the overall run. Something like 0.8 means the weighted score has to clear 80%.
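
To make the threshold concrete, here is a minimal Python sketch of how a weighted score could be computed for the example spec. It assumes the score is the passed weight divided by the total weight; the post does not spell out Polarity's exact formula, and the function name weighted_score is made up for illustration.

```python
def weighted_score(results):
    """results: list of (weight, passed) pairs, one per invariant.

    Assumption: score = weight of passed checks / total weight.
    """
    total = sum(weight for weight, _ in results)
    if total == 0:
        return 0.0
    earned = sum(weight for weight, passed in results if passed)
    return earned / total


THRESHOLD = 0.8  # scoring.threshold from the example spec

# Weights from the example: command (1.0), sql (0.5), llm-judge (0.3).
judge_failed = [(1.0, True), (0.5, True), (0.3, False)]
sql_failed = [(1.0, True), (0.5, False), (0.3, True)]

print(weighted_score(judge_failed) >= THRESHOLD)  # 1.5 / 1.8 is about 0.83: True
print(weighted_score(sql_failed) >= THRESHOLD)    # 1.3 / 1.8 is about 0.72: False
```

Under that assumption, the example run survives a failed judge check but not a failed table-count check.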

If you can read a recipe, you can read a spec. That's on purpose.

Invariants: the part that does the work

Invariants are the heart of the spec. Each one is a small, named question with a yes-or-no answer. "Did the refactor compile?" "Did any tables get dropped?" "Does the output look reasonable to a judge model?"

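Each invariant has four common fields, plus whatever its type needs (a command, a query and expect, or a rubric in the example above):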

  • description: a sentence saying what we're checking.
  • type: how we check it. Common types are command (run a shell command and check the exit code), sql (run a query against a database), file (look at a file's contents), http (hit an endpoint), and llm-judge (ask a model with a short checklist).
  • weight: how much this check counts toward the overall score. A 1.0 invariant counts twice as much as a 0.5 one.
  • gate: a hard-fail switch. If gate: true, this check failing fails the whole run, no matter how the others scored.
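
The fields above can be sketched in a few lines of Python. This is an illustration, not Polarity's implementation: the Invariant class, check_command, and run_passes are hypothetical names, only the command type is modeled, and the score is assumed to be passed weight over total weight.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class Invariant:
    description: str
    command: list        # argv for a command-type check
    weight: float = 1.0
    gate: bool = False


def check_command(inv):
    """A command invariant passes when the process exits with code 0."""
    completed = subprocess.run(inv.command, capture_output=True)
    return completed.returncode == 0


def run_passes(invariants, threshold):
    """Run every check; a failed gate overrides the weighted score."""
    results = [(inv, check_command(inv)) for inv in invariants]
    if any(inv.gate and not ok for inv, ok in results):
        return False
    total = sum(inv.weight for inv, _ in results)
    earned = sum(inv.weight for inv, ok in results if ok)
    return total > 0 and earned / total >= threshold


py = sys.executable
gate_ok = Invariant("module imports", [py, "-c", "import json"], 1.0, gate=True)
gate_bad = Invariant("module imports", [py, "-c", "import no_such_module_xyz"], 0.1, gate=True)
soft_ok = Invariant("soft check", [py, "-c", "pass"], 5.0)

print(run_passes([gate_ok, soft_ok], 0.8))   # True
print(run_passes([gate_bad, soft_ok], 0.8))  # False: 5.0 / 5.1 clears 0.8, but the gate failed
```

The second run shows why gate exists: the weighted score alone would have passed it.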

The list of invariants is what "good" looks like for this agent. When you find a new way the agent can quietly mess up, you add an invariant for it. The list grows as your understanding grows. That is the whole loop.

There's also a forbidden block for things the agent simply cannot do: write outside a folder, reach a host that isn't on the allowed list, or run a banned command. Those fail right away.
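
The post does not show the forbidden block's fields, so the Python sketch below only illustrates the three kinds of rule it names. All function names and the example values are hypothetical, and the path check assumes absolute POSIX paths and does not follow symlinks.

```python
import posixpath
from pathlib import PurePosixPath


def writes_inside(path, allowed_root):
    """True if path stays under allowed_root after collapsing '..' segments."""
    target = PurePosixPath(posixpath.normpath(path))
    root = PurePosixPath(posixpath.normpath(allowed_root))
    return target == root or root in target.parents


def host_allowed(host, allowed_hosts):
    """True if the host is on the allowed list (case-insensitive match)."""
    return host.lower() in allowed_hosts


def command_banned(argv, banned):
    """True if the program being launched is on the banned list."""
    return bool(argv) and posixpath.basename(argv[0]) in banned


print(writes_inside("/workspace/out/data.csv", "/workspace"))  # True
print(writes_inside("/workspace/../etc/passwd", "/workspace"))  # False
print(host_allowed("API.internal", {"api.internal"}))           # True
print(command_banned(["/usr/bin/curl", "-s"], {"curl"}))        # True
```

A real sandbox would enforce these at the filesystem and network layer rather than by inspecting strings; the sketch only shows the shape of each rule.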

Spec vs chat: a quick look at dbt

dbt Labs' Developer Agent takes the opposite approach. There is no YAML file. The agent ships ready to go with built-in dbt skills and a docs toolset. You direct it through chat: pre-filled quick actions, @-mentions to point it at a specific model ("@orders_model add tests"), and an "ask for approval" mode that pauses before each file write or dbt build. The agent lives inside the dbt Studio IDE.

Both approaches are fine. They fit different problems.

dbt's world is narrow. The IDE knows what a model is, what a test is, what a build step is. The agent doesn't have to be told, because the tool tells it. Chat works there because dbt itself is the spec.

A general agent in a sandbox doesn't have that. The sandbox can be anything. The agent can be anything. The task can be anything. The spec fills in for the IDE: this is where you are, this is what you're doing, this is what you can't touch, this is how I'll grade you. Without it, the agent is guessing in a room that has no walls.

The wider industry is leaning the same way. The Kubernetes agent-sandbox project, for example, defines agent runtimes through written YAML files (apiVersion: agents.x-k8s.io/v1alpha1, kind: Sandbox). Different layer of the stack, same instinct: agents are easier to think about when their setup is written down.

Figure: Tweaking spec fields and watching the agent run pick up the changes.

Why a written file works better than a chat

A few practical reasons the spec lives in a file, not a chat box.

  • Easy to read. A non-engineer can skim a spec and understand most of it. That matters when product, data, or compliance folks need a say in what the agent is allowed to do.
  • Easy to compare. Specs live in git. A pull request that changes a check or bumps a weight is a normal code review.
  • Easy to change. Add a new check, change a weight, swap the base image. The diff tells the story.
  • Easy to share. Same spec runs on a laptop, in CI, and on a production replay. The contract is one file.
  • Easy to track over time. The id plus the auto-bumping version give you a clean history of how "good" was defined.

The format is YAML because YAML is what most infra folks already read. JSON works too. The format isn't the point. Writing the contract down is the point.

How a spec evolves

Specs are not write-once. They drift forward as the agent and the infra change.

A typical change looks like this. You start small: one task, two or three checks, one base image. The agent runs it. Something passes that shouldn't have. You write a new check for that case and add it to the list. The next run catches it. A month later you change models or tools, so you bump the base image and the agent type. The spec's id stays the same; the version goes up by one. You can compare runs across versions because the history travels with the spec.
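
The "same id, version goes up by one" behavior can be pictured with a small in-memory store. This Python sketch is a stand-in, not the Polarity SDK: SpecStore and its methods are hypothetical, and the second base image is made up for the example.

```python
class SpecStore:
    """In-memory stand-in for wherever specs are saved."""

    def __init__(self):
        self._history = {}  # spec id -> list of saved bodies, oldest first

    def save(self, spec_id, body):
        """Store a new revision and return its version number (starting at 1)."""
        versions = self._history.setdefault(spec_id, [])
        versions.append(dict(body))
        return len(versions)

    def get(self, spec_id, version):
        """Fetch an earlier revision so runs can be compared across versions."""
        return self._history[spec_id][version - 1]


store = SpecStore()
v1 = store.save("refactor-data-pipeline", {"base": "ubuntu:22.04", "checks": 3})
v2 = store.save("refactor-data-pipeline", {"base": "ubuntu:24.04", "checks": 4})

print(v1, v2)  # 1 2
print(store.get("refactor-data-pipeline", 1)["base"])  # ubuntu:22.04
```

Because old revisions stay retrievable, a score from version 1 can still be read against the checks that produced it.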

Over time, the list of checks becomes the most valuable part of the file. Every entry on it is a lesson learned, written down. The agent gets a new model, new tools, a new prompt, a new sandbox image. The checks outlast all of that. They are the part that says "no matter how the agent is built next month, here is what still has to be true."

A spec is small. It looks like a config file. It works like a contract.

FAQ

Is the spec the same as a system prompt?

No. The system prompt is part of the agent. The spec is bigger: prompt plus environment plus credentials plus checks plus pass/fail rules. A prompt tells the agent what to try. A spec tells everyone, including the agent, what counts as done.

Do I have to write specs by hand?

You can. Most teams write the first few by hand, then turn the common parts into a template once a pattern shows up. The Polarity SDK can also generate a spec from a known task type.

What if my agent changes a lot?

That's fine. The id stays the same; the version changes underneath. Switch the agent type, swap the base image, add or remove checks. The history stays intact and you can compare runs across versions.

Do these checks slow down the run?

A little. A command or sql check is cheap. An llm-judge check costs a model call. Most teams keep the hard-fail checks cheap (commands, queries) and use llm-judge for the softer "does this look right" checks, where weight matters more than the hard fail.
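
One way to picture the cost trade-off is a runner that tries the cheapest checks first and stops at the first failed gate, so the model call is skipped when a cheap hard-fail check already settled the run. Whether Polarity orders or short-circuits checks this way is not stated in the post; the Python sketch below, its evaluate function, and its cost numbers are assumptions.

```python
def evaluate(checks):
    """checks: list of dicts with 'name', 'cost', 'gate', and 'run' (callable -> bool).

    Runs the cheapest checks first and stops at the first failed gate.
    """
    ran = []
    for check in sorted(checks, key=lambda c: c["cost"]):
        ok = check["run"]()
        ran.append((check["name"], ok))
        if check["gate"] and not ok:
            break
    return ran


judge_calls = []


def fake_judge():
    """Stands in for an llm-judge model call and records that it happened."""
    judge_calls.append(1)
    return True


checks = [
    {"name": "llm-judge", "cost": 100, "gate": False, "run": fake_judge},
    {"name": "imports", "cost": 1, "gate": True, "run": lambda: False},
    {"name": "table count", "cost": 2, "gate": False, "run": lambda: True},
]

print(evaluate(checks))  # [('imports', False)]
print(len(judge_calls))  # 0
```

Here the failed import gate ends the run before the judge is ever called.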

If you want to start using Polarity, check out the docs.
