The Importance of Agent Direction: What Is a Spec
Why an agent needs direction
An agent on its own is a generalist. Tell it "go fix the data pipeline" and it will pick a path. Tell it again tomorrow and it might pick a different one. The output looks reasonable both times. You only find out which one was actually right when something later breaks.
Direction fixes this. A spec turns a vague ask into a clear contract: here is the task, here is the setup, here are the things you cannot break, here is how we'll know you got it right. The agent still picks its own path. But the contract is the same every run, so you can compare runs side by side.
This is the whole point of running agents in a sandbox. The sandbox is the room. The spec is the briefing.
What's in a spec
Polarity specs are YAML files with seven main sections. Here's a small one for an agent that's supposed to refactor a data pipeline.
version: 1
id: refactor-data-pipeline
base: ubuntu:22.04
task:
  prompt: |
    Refactor the ETL script at /workspace/etl.py to read from a
    partitioned table. Keep the existing public API intact.
agent:
  type: cli
  binary: claude
  args: ["--dangerously-skip-permissions"]
secrets:
  - DATABASE_URL
  - OPENAI_API_KEY
invariants:
  - description: The refactor still compiles
    type: command
    command: python -c "import etl"
    weight: 1.0
    gate: true
  - description: No tables were dropped
    type: sql
    query: SELECT count(*) FROM information_schema.tables
    expect: ">= 12"
    weight: 0.5
  - description: The result reads like a real refactor
    type: llm-judge
    rubric: Did the agent partition by date and keep the public API stable?
    weight: 0.3
scoring:
  threshold: 0.8
That's the whole thing. Seven sections, easy to scan.
A quick read of each section:
- version, id, description: tracking info. Save the same id again later and the version goes up by one on its own.
- base: the container image the sandbox boots from. Ubuntu, Node, Python, or your own.
- task.prompt: what the agent is being asked to do, in plain words.
- agent: how to launch the agent. A CLI binary, a Docker image, an HTTP endpoint, or a Python script.
- secrets: keys and tokens the agent will need. The values come from your dashboard or env, not the file itself.
- invariants: the checks that decide pass or fail. More on these next.
- scoring: the cutoff for the overall run. Something like 0.8 means the weighted score has to clear 80%.
If you can read a recipe, you can read a spec. That's on purpose.
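To make the section list concrete, here is a rough Python sketch that treats a parsed spec as a plain dict, reports which top-level sections are missing, and looks up the declared secrets in the environment. The section names come from the example above. The helper names (missing_sections, resolve_secrets) are made up for illustration and are not part of any Polarity API, and treating every section as required is an assumption.

```python
import os

# Top-level keys taken from the example spec. Whether all of them are
# mandatory is an assumption made for this sketch.
EXPECTED_SECTIONS = (
    "version", "id", "base", "task", "agent",
    "secrets", "invariants", "scoring",
)


def missing_sections(spec):
    """Return the expected top-level sections that the spec does not define."""
    return [name for name in EXPECTED_SECTIONS if name not in spec]


def resolve_secrets(spec, env=None):
    """Map each declared secret name to its value from the environment.

    The spec only lists names; values come from outside the file.
    Raises KeyError naming every secret that is not set.
    """
    env = os.environ if env is None else env
    names = spec.get("secrets", [])
    missing = [name for name in names if name not in env]
    if missing:
        raise KeyError("missing secrets: " + ", ".join(missing))
    return {name: env[name] for name in names}


# A dict shaped like the example spec (prompt shortened).
spec = {
    "version": 1,
    "id": "refactor-data-pipeline",
    "base": "ubuntu:22.04",
    "task": {"prompt": "Refactor the ETL script at /workspace/etl.py."},
    "agent": {"type": "cli", "binary": "claude"},
    "secrets": ["DATABASE_URL", "OPENAI_API_KEY"],
    "invariants": [],
    "scoring": {"threshold": 0.8},
}

print(missing_sections(spec))
```

Checking for missing secrets up front means a run can fail before the agent starts, rather than halfway through the task.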
Invariants: the part that does the work
Invariants are the heart of the spec. Each one is a small, named question with a yes-or-no answer. "Did the refactor compile?" "Did any tables get dropped?" "Does the output look reasonable to a judge model?"
Each invariant has four things:
- description: a sentence saying what we're checking.
- type: how we check it. Common types are command (run a shell command and check the exit code), sql (run a query against a database), file (look at a file's contents), http (hit an endpoint), and llm-judge (ask a model with a short checklist).
- weight: how much this check counts toward the overall score. A 1.0 invariant counts twice as much as a 0.5 one.
- gate: a hard-fail switch. If gate: true, this check failing fails the whole run, no matter how the others scored.
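Here is one way weights, gates, and the threshold could combine into a pass or fail. This is a sketch, not Polarity's actual scoring code: it assumes the score is the weight of the passing checks divided by the total weight, and the names (Result, score_run) are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Result:
    description: str
    passed: bool
    weight: float = 1.0
    gate: bool = False


def score_run(results, threshold):
    """Return (score, passed) for one run.

    Assumed formula: score = passed weight / total weight.
    A failed gate fails the run no matter what the score is.
    """
    total = sum(r.weight for r in results)
    if total <= 0:
        raise ValueError("total weight must be positive")
    earned = sum(r.weight for r in results if r.passed)
    score = earned / total
    gate_failed = any(r.gate and not r.passed for r in results)
    return score, (not gate_failed) and score >= threshold


# The three checks from the example spec, with the llm-judge check failing.
results = [
    Result("The refactor still compiles", passed=True, weight=1.0, gate=True),
    Result("No tables were dropped", passed=True, weight=0.5),
    Result("The result reads like a real refactor", passed=False, weight=0.3),
]

score, passed = score_run(results, threshold=0.8)
print(round(score, 2), passed)  # 1.5 of 1.8 weight earned, which clears 0.8
```

Under this formula, the example spec tolerates a failed judge check (1.5 of 1.8) but not a failed SQL check (1.3 of 1.8, about 0.72). A failed compile check fails the run through the gate regardless of the number.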
The list of invariants is what "good" looks like for this agent. When you find a new way the agent can quietly mess up, you add an invariant for it. The list grows as your understanding grows. That is the whole loop.
There's also a forbidden block for things the agent simply cannot do: write outside a folder, reach a host that isn't on the allowed list, or run a banned command. Those fail right away.
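As an illustration of the "write outside a folder" rule, here is a minimal sketch of a path check. It is hypothetical: the function name and the /workspace default are assumptions, it only normalises the path text, and it does not follow symlinks, so a real sandbox would need to enforce this at the filesystem level as well.

```python
import posixpath


def is_write_allowed(path, allowed_root="/workspace"):
    """True if `path` stays inside `allowed_root` once '..' segments are resolved.

    Relative paths are interpreted relative to the allowed root.
    Purely textual: symlinks are not followed.
    """
    root = posixpath.normpath(allowed_root)
    # join() discards the root when `path` is absolute, which is what we want.
    target = posixpath.normpath(posixpath.join(root, path))
    prefix = root.rstrip("/") + "/"
    return target == root or target.startswith(prefix)


for candidate in ("etl.py", "/workspace/out/report.csv",
                  "/workspace/../etc/passwd", "/workspace-evil/x"):
    print(candidate, is_write_allowed(candidate))
```

The prefix comparison includes the trailing slash so that a sibling folder such as /workspace-evil does not pass as being inside /workspace.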
Spec vs chat: a quick look at dbt
dbt Labs' Developer Agent takes the opposite approach. There is no YAML file. The agent ships ready to go with built-in dbt skills and a docs toolset. You direct it through chat: pre-filled quick actions, @-mentions to point it at a specific model ("@orders_model add tests"), and an "ask for approval" mode that pauses before each file write or dbt build. The agent lives inside the dbt Studio IDE.
Both approaches are fine. They fit different problems.
dbt's world is narrow. The IDE knows what a model is, what a test is, what a build step is. The agent doesn't have to be told, because the tool tells it. Chat works there because dbt itself is the spec.
A general agent in a sandbox doesn't have that. The sandbox can be anything. The agent can be anything. The task can be anything. The spec fills in for the IDE: this is where you are, this is what you're doing, this is what you can't touch, this is how I'll grade you. Without it, the agent is guessing in a room that has no walls.
The wider industry is leaning the same way. The Kubernetes agent-sandbox project, for example, defines agent runtimes through written YAML files (apiVersion: agents.x-k8s.io/v1alpha1, kind: Sandbox). Different layer of the stack, same instinct: agents are easier to think about when their setup is written down.
Why a written file works better than a chat
A few practical reasons the spec lives in a file, not a chat box.
- Easy to read. A non-engineer can skim a spec and understand most of it. That matters when product, data, or compliance folks need a say in what the agent is allowed to do.
- Easy to compare. Specs live in git. A pull request that changes a check or bumps a weight is a normal code review.
- Easy to change. Add a new check, change a weight, swap the base image. The diff tells the story.
- Easy to share. Same spec runs on a laptop, in CI, and on a production replay. The contract is one file.
- Easy to track over time. The id plus the auto-bumping version give you a clean history of how "good" was defined.
The format is YAML because YAML is what most infra folks already read. JSON works too. The format isn't the point. Writing the contract down is the point.
How a spec evolves
Specs are not write-once. They drift forward as the agent and the infra change.
A typical change looks like this. You start small: one task, two or three checks, one base image. The agent runs it. Something passes that shouldn't have. You write a new check for that case and add it to the list. The next run catches it. A month later you change models or tools, so you bump the base image and the agent type. The spec's id stays the same; the version goes up by one. You can compare runs across versions because the history travels with the spec.
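The id-stays, version-bumps behaviour can be sketched with a small in-memory store. This is an illustration only: SpecRegistry and added_checks are invented names, and where Polarity actually keeps spec history is not covered here.

```python
class SpecRegistry:
    """In-memory stand-in for wherever saved specs live (hypothetical)."""

    def __init__(self):
        self._history = {}  # spec id -> list of saved versions, oldest first

    def save(self, spec):
        """Store a copy of the spec under its id with the next version number."""
        versions = self._history.setdefault(spec["id"], [])
        saved = {**spec, "version": len(versions) + 1}
        versions.append(saved)
        return saved

    def history(self, spec_id):
        """Return every saved version of a spec, oldest first."""
        return list(self._history.get(spec_id, []))


def added_checks(old, new):
    """Descriptions of invariants present in `new` but not in `old`."""
    known = {inv["description"] for inv in old.get("invariants", [])}
    return [inv["description"] for inv in new.get("invariants", [])
            if inv["description"] not in known]


registry = SpecRegistry()
first = registry.save({
    "id": "refactor-data-pipeline",
    "invariants": [{"description": "The refactor still compiles"}],
})
second = registry.save({
    "id": "refactor-data-pipeline",
    "invariants": [
        {"description": "The refactor still compiles"},
        {"description": "No tables were dropped"},
    ],
})

print(first["version"], second["version"])
print(added_checks(first, second))
```

Because every version is kept, the question "what did we learn between version 1 and version 2?" becomes a comparison of two lists of checks.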
Over time, the list of checks becomes the most valuable part of the file. Every entry on it is a lesson learned, written down. The agent gets a new model, new tools, a new prompt, a new sandbox image. The checks outlast all of that. They are the part that says "no matter how the agent is built next month, here is what still has to be true."
A spec is small. It looks like a config file. It works like a contract.
FAQ
Is the spec the same as a system prompt?
No. The system prompt is part of the agent. The spec is bigger: prompt plus environment plus credentials plus checks plus pass/fail rules. A prompt tells the agent what to try. A spec tells everyone, including the agent, what counts as done.
Do I have to write specs by hand?
You can. Most teams write the first few by hand, then turn the common parts into a template once a pattern shows up. The Polarity SDK can also generate a spec from a known task type.
What if my agent changes a lot?
That's fine. The id stays the same; the version changes underneath. Switch the agent type, swap the base image, add or remove checks. The history stays intact and you can compare runs across versions.
Do these checks slow down the run?
A little. A command or sql check is cheap. An llm-judge check costs a model call. Most teams keep the hard-fail checks cheap (commands, queries) and use llm-judge for the softer "does this look right" checks, where weight matters more than the hard fail.
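For a sense of why a command check is cheap, here is roughly what one amounts to: run the command, look at the exit code. This sketch uses Python's standard subprocess module. The function name and the 30-second default timeout are assumptions, not Polarity's implementation.

```python
import subprocess


def command_check(command, timeout=30.0):
    """Run a shell command and pass only if it exits with code 0.

    A command that runs past the timeout counts as a failure.
    """
    try:
        completed = subprocess.run(
            command,
            shell=True,           # the spec stores the command as one string
            capture_output=True,  # keep the check's output off the console
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return completed.returncode == 0


print(command_check("exit 0"))
print(command_check("exit 3"))
```

An llm-judge check replaces that single process call with a request to a model, which is where the extra latency and cost come from.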
If you want to start using Polarity, check out the docs.