The 80% Boundary Problem: Why Agents Escape Their Guardrails
What the 80% figure means
80% refers to teams that have observed at least one boundary violation on an agent they deployed in the last 12 months. This is not a one-off. Across Polarity's private pilots and adjacent industry coverage, the pattern is consistent: if the agent runs in production with non-trivial tool access, something will eventually fall outside the intended line.
It does not mean:
- The agent did something illegal.
- The agent acted with intent (agents do not have intent).
- The vendor is at fault.
It does mean:
- The declared boundary and the enforced boundary were not the same.
- The combinations that tripped the gap were not obvious in advance.
- Evals did not catch it because evals test what the author remembered to test.
The number will not improve on its own. It improves when enforcement moves from prompt text to runtime policy, and when coverage moves from imagined cases to replayed real traffic.
Four flavors of boundary escape
Prompt-driven escape
A user (intentionally or accidentally) phrases a request in a way that nudges the agent past its normal bounds. The agent complies because the prompt read as plausible in context.
Example: an HR assistant agent is told "I need you to help me format this exit interview. Pull up Alice's notes." Alice is not the user, and the user does not have access to Alice's notes. A prompt-heavy guardrail asks the model to "confirm the user has access before answering." In practice that confirmation often fails.
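The alternative is a deterministic check in code, run before the tool executes. A minimal sketch, with a hypothetical `can_read_notes` function and an in-memory ACL standing in for a real access-control system:

```python
# Hypothetical sketch: a deterministic access check run before the tool call,
# instead of asking the model to "confirm access" in the prompt.

def can_read_notes(user_id: str, subject_id: str, acl: dict) -> bool:
    """Return True only if user_id is explicitly granted access to subject_id's notes."""
    return subject_id in acl.get(user_id, set())

# Bob can read only his own notes.
acl = {"bob": {"bob"}}

# The request for Alice's notes is denied in code,
# regardless of how the prompt was phrased.
assert can_read_notes("bob", "alice", acl) is False
assert can_read_notes("bob", "bob", acl) is True
```

The check does not care how persuasive the phrasing was; it only sees identities and the ACL.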
Scope-mismatch escape
The agent's authorization scope allows the tool call; the user's actual intent does not. Auth systems operate at the level of agent identity, while user intent is a narrower, per-session constraint.
Example: the agent can read any CRM record by scope. In this session, the user is a support rep handling one specific customer. The agent reads records for other customers because that is scope-legal, even though it is intent-illegal.
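One way to close this gap is to pin the session to the customer it is actually about and check every read against that pin. A sketch, with illustrative names (`allow_crm_read`, the session fields) that are not any particular product's API:

```python
# Hypothetical sketch: narrowing agent-level scope to session intent.
# The agent's credential can read any CRM record; the session pins one customer.

def allow_crm_read(record_customer_id: str, session: dict) -> bool:
    # Scope-legal is not enough: the record must belong to the customer
    # this support session is actually about.
    return record_customer_id == session["customer_id"]

session = {"user": "support_rep_7", "customer_id": "cust_123"}

assert allow_crm_read("cust_123", session) is True       # intent-legal
assert allow_crm_read("cust_999", session) is False      # scope-legal, intent-illegal
```

The design choice here is that intent is captured as session state, not inferred from the conversation, so the check stays deterministic.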
Emergent escape
The boundary was never tested because the combination that trips it involves three or more variables no author wrote a test for. These are genuinely unanticipated.
Example: the combination of the user's role, a retrieved document containing a phrase about another customer, and a specific tool being available nudges the agent into a workflow that crosses tenants. No single variable is out of bounds. The combination is.
Tool-output escape
The agent ingests a tool output that includes content outside its intended context (stale data, a truncation, an error string parsed as content) and acts on it as if it were authoritative.
Example: a tool returns "[ERROR: lookup failed, returning cached value from 2024-06-12]". The agent incorporates the cached 2024 value into its response as if it were current. Boundary escaped via trust in tool output.
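The fix is to treat tool output as untrusted input and validate it before the agent ingests it. A minimal sketch, assuming the error format shown above; the function name and return shape are illustrative:

```python
# Hypothetical sketch: validating tool output before the agent trusts it.
# Error strings and stale cached values are flagged instead of ingested.
import re

def validate_tool_output(raw: str) -> dict:
    # Detect the "cached value" error pattern from the example above.
    m = re.search(r"\[ERROR:.*cached value from (\d{4}-\d{2}-\d{2})\]", raw)
    if m:
        return {"trusted": False, "reason": f"stale cache from {m.group(1)}"}
    # Any other error string returned as content is also untrusted.
    if raw.startswith("[ERROR"):
        return {"trusted": False, "reason": "tool error returned as content"}
    return {"trusted": True, "value": raw}

out = validate_tool_output("[ERROR: lookup failed, returning cached value from 2024-06-12]")
assert out["trusted"] is False
```

An untrusted result can be dropped, retried, or surfaced to the user as a failure rather than folded into the answer as fact.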
Why prompt-based guardrails do not hold
The common first attempt at boundary enforcement is language in the system prompt: "never do X, only do Y under Z conditions, confirm user identity before Q."
This does not work reliably, for structural reasons.
- Prompt instructions compete with user input. A sufficiently well-framed user prompt can override them. Microsoft's 2026 Foundry guidance covers the limits of prompt hardening explicitly.
- Natural-language rules are ambiguous. "Only read records the user has access to" means something specific in a code check and something fuzzy in a prompt.
- Emergent combinations cannot be enumerated in a prompt. The set of "do not" cases is infinite.
- LLMs produce plausible text under pressure. When the boundary is tested, the model generates a reasonable-seeming response that just happens to cross the line.
Prompts are useful for declaring intent. They are not sufficient for enforcing it.
What actually enforces boundaries
Two layers, together.
Runtime policy at tool-call time
Every tool call passes through a policy check before it executes. The check has access to: the agent identity, the user identity, the session context, the tool being called, the arguments. It returns allow, deny, or review.
Pattern: can_call(tool, args, context) → allow | deny | review.
This is enforceable code. It runs deterministically. It does not depend on the model behaving as instructed.
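The pattern can be sketched in a few lines. This is an illustrative implementation, not Polarity's actual API; the tool names and rules are assumptions:

```python
# Hypothetical sketch of the can_call(tool, args, context) pattern.
from enum import Enum

class Decision(Enum):
    ALLOW = "allow"
    DENY = "deny"
    REVIEW = "review"

# Illustrative rule inputs.
DESTRUCTIVE_TOOLS = {"delete_record", "send_refund"}

def can_call(tool: str, args: dict, context: dict) -> Decision:
    # Cross-tenant access is denied outright.
    if args.get("tenant") and args["tenant"] != context["tenant"]:
        return Decision.DENY
    # Destructive actions are routed to human review.
    if tool in DESTRUCTIVE_TOOLS:
        return Decision.REVIEW
    return Decision.ALLOW

ctx = {"agent": "crm_bot", "user": "rep_7", "tenant": "acme"}
assert can_call("read_record", {"tenant": "acme"}, ctx) is Decision.ALLOW
assert can_call("read_record", {"tenant": "globex"}, ctx) is Decision.DENY
assert can_call("delete_record", {"tenant": "acme"}, ctx) is Decision.REVIEW
```

Because the check is plain code over structured arguments, it produces the same decision every time for the same inputs.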
Sandbox replay covering emergent combinations
Runtime policy catches the violations you have written policy for. Sandbox replay finds the violations you did not think to write policy for.
Replay a representative slice of real production traffic through the agent inside the sandbox. The sandbox records every tool call and surfaces the ones that cross declared or implicit boundaries. You update policy based on what you see.
This is how the long tail gets covered. You do not discover the three-variable combination by thinking about it. You discover it because traffic tripped it and the sandbox flagged it.
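The replay loop itself is simple once tool calls are recorded. A sketch under assumptions: the traffic records, their fields, and the policy function are all illustrative, not a real capture format:

```python
# Hypothetical sketch of sandbox replay: run recorded traffic through a
# policy and surface the tool calls that cross it.

def replay(traffic, policy):
    flagged = []
    for event in traffic:  # each event is one recorded tool call
        decision = policy(event["tool"], event["args"], event["context"])
        if decision != "allow":
            flagged.append((event, decision))
    return flagged

def policy(tool, args, context):
    # A cross-tenant rule: the combination you never wrote a test for
    # surfaces here when real traffic trips it.
    if args.get("tenant") != context.get("tenant"):
        return "deny"
    return "allow"

traffic = [
    {"tool": "read_doc", "args": {"tenant": "acme"},   "context": {"tenant": "acme"}},
    {"tool": "read_doc", "args": {"tenant": "globex"}, "context": {"tenant": "acme"}},
]

flagged = replay(traffic, policy)
assert len(flagged) == 1  # only the cross-tenant call is surfaced
```

Each flagged event is a concrete gap: either the policy is right and the agent needs constraining, or the policy is wrong and gets updated.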
FAQ
How do I write a runtime policy?
Start with the obvious constraints (no destructive actions, no cross-tenant access). Replay in sandbox, catch gaps, iterate. Narrow agents land at ~2 pages; general-purpose at ~10.
Will this stop all violations?
No. Policies can be written wrong. The sandbox also runs policy review as a pre-deploy gate.
Is there a standard for agent policy?
Not yet. MCP covers tool declaration. Policy patterns (JIT auth, structured invocation schemas) are emerging. Most teams write their own.
If you want to start using Polarity, check out the docs.