Guardrails
Core Insight: An agent without guardrails is a liability. The model will do exactly what it's told, including what a prompt injection tells it to do. Guardrails are the permission layer between "the model wants to do X" and "the harness actually does X."
Why Guardrails Exist
The model generates text. That text includes tool calls. The harness executes those tool calls. This means anything that influences the model's output can influence what the harness does, including malicious content in files, web pages, or user messages.
This is prompt injection: an attacker embeds instructions in data the agent reads, and the model follows those instructions instead of the original task. Without guardrails, a prompt injection can:
- Delete files (rm -rf /)
- Exfiltrate environment variables (API keys, tokens)
- Execute arbitrary code on the host
- Send unauthorized messages on behalf of the user
Guardrails make the harness the final authority on what actions are permitted, regardless of what the model requests.
The Trust Boundary Model
Every harness has a trust boundary between the model and the operating environment. The harness mediates all crossings:
┌──────────────────────────────────┐
│           MODEL SPACE            │
│ (reasoning, tool call requests)  │
└────────────────┬─────────────────┘
                 │ tool call request
                 ▼
┌──────────────────────────────────┐
│         GUARDRAIL LAYER          │
│  Permission check → Allow/Deny   │
└────────────────┬─────────────────┘
                 │ approved call
                 ▼
┌──────────────────────────────────┐
│         EXECUTION SPACE          │
│   (filesystem, network, shell)   │
└──────────────────────────────────┘
The guardrail layer intercepts every tool call before execution. It can:
- Allow: execute as requested
- Deny: return an error to the model
- Modify: rewrite the call (e.g., restrict file path to a safe directory)
- Prompt: ask the human for approval before proceeding
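The four outcomes above can be sketched as a single dispatch function that sits in front of the tool executor. This is a minimal illustration, not a reference implementation: `Verdict`, `Decision`, `guard`, and `example_policy` are hypothetical names, and the policy rules are made up for the example.

```python
import posixpath
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable


class Verdict(Enum):
    ALLOW = "allow"
    DENY = "deny"
    MODIFY = "modify"
    PROMPT = "prompt"


@dataclass
class Decision:
    verdict: Verdict
    args: dict = field(default_factory=dict)  # rewritten arguments (MODIFY only)
    reason: str = ""                          # explanation returned on DENY


def example_policy(tool_name: str, args: dict) -> Decision:
    """Hypothetical policy, for illustration only."""
    if tool_name == "delete_file":
        return Decision(Verdict.DENY, reason="deletion is disabled")
    if tool_name == "run_command":
        return Decision(Verdict.PROMPT)
    if tool_name == "write_file":
        # Modify: keep only the file name and force it into /workspace.
        name = posixpath.basename(args.get("path", ""))
        return Decision(Verdict.MODIFY, args={**args, "path": f"/workspace/{name}"})
    return Decision(Verdict.ALLOW)


def guard(
    tool_name: str,
    args: dict,
    policy: Callable[[str, dict], Decision],
    execute: Callable[[str, dict], str],
    ask_human: Callable[[str, dict], bool],
) -> str:
    """Run one tool call through the policy before it reaches `execute`."""
    decision = policy(tool_name, args)
    if decision.verdict is Verdict.DENY:
        # The model sees an error string instead of a tool result.
        return f"ERROR: denied: {decision.reason}"
    if decision.verdict is Verdict.PROMPT and not ask_human(tool_name, args):
        return "ERROR: denied: user declined"
    final_args = decision.args if decision.verdict is Verdict.MODIFY else args
    return execute(tool_name, final_args)
```

The important property is that `execute` is reachable only through `guard`, so the model has no path to a tool that skips the policy.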
Permission Models
Allow-list (Strictest)
Only explicitly permitted actions are allowed. Everything else is denied by default:
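import posixpath
from fnmatch import fnmatch

# Deny by default: a tool that is not listed here is never executed.
ALLOWED_TOOLS = {
    "read_file": {"paths": ["/workspace/**"]},
    "write_file": {"paths": ["/workspace/**"]},
    "run_command": {"commands": ["npm test", "npm run build"]},
}

def check_permission(tool_name: str, args: dict) -> bool:
    if tool_name not in ALLOWED_TOOLS:
        return False
    policy = ALLOWED_TOOLS[tool_name]
    if "paths" in policy:
        # Collapse ".." first so "/workspace/../etc/passwd" cannot match.
        # Symlinks are not resolved here.
        path = posixpath.normpath(args.get("path", ""))
        return any(fnmatch(path, p) for p in policy["paths"])
    if "commands" in policy:
        # Exact string match only; no extra arguments may be appended.
        return args.get("command") in policy["commands"]
    return True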
Deny-list (Permissive)
Everything is allowed except explicitly blocked actions:
import re

BLOCKED_PATTERNS = [
    (r"rm\s+-rf\s+/", "Refusing to delete root filesystem"),
    (r"curl.*\|\s*sh", "Refusing to pipe remote script to shell"),
    # \b also catches a bare "env", which the form "env\s+" would let through.
    (r"\benv\b|\bprintenv\b|echo\s+\$", "Refusing to expose environment variables"),
]

def check_command(command: str) -> tuple[bool, str]:
    """Return (allowed, reason); reason is empty when the command is allowed."""
    for pattern, reason in BLOCKED_PATTERNS:
        if re.search(pattern, command):
            return False, reason
    return True, ""
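Pattern-based deny-lists are easy to evade, which is part of why this model is the permissive one. The sketch below reuses the first pattern from the list above and checks a few equivalent commands against it. The `evasions` list is illustrative, not exhaustive, and `is_blocked` is a hypothetical helper.

```python
import re

# Same regular expression as the first deny-list entry above.
RM_ROOT = re.compile(r"rm\s+-rf\s+/")

blocked = "rm -rf /"

# Variants with the same effect that the pattern does not cover.
evasions = [
    "rm -fr /",                  # flags in a different order
    "rm -r -f /",                # flags split apart
    "rm --recursive --force /",  # long options
]


def is_blocked(command: str) -> bool:
    """True when the deny-list pattern matches anywhere in the command."""
    return RM_ROOT.search(command) is not None


for cmd in [blocked, *evasions, "rm -rf /workspace/build"]:
    print(f"{cmd!r}: {'blocked' if is_blocked(cmd) else 'allowed'}")
```

The last command in the loop shows the opposite failure: the pattern also blocks deleting a directory under /workspace, because any absolute path starts with "/". A deny-list tends to be both too loose and too strict, so it works best as a supplement to an allow-list or a sandbox rather than as the only control.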
Tiered Approval
Different risk levels trigger different approval flows:
| Risk Level | Examples | Action |
|---|---|---|
| Low | Read files, search | Auto-approve |
| Medium | Write files, run tests | Auto-approve with logging |
| High | Execute shell commands, network requests | Require human approval |
| Critical | Delete files, push to git, send messages | Always require explicit approval |
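import re

# Whole-word patterns: a plain substring test for "rm" would also flag "npm run format".
CRITICAL_PATTERNS = [r"\brm\b", r"\bgit\s+push\b", r"\bcurl\b"]

def get_risk_level(tool_name: str, args: dict) -> str:
    if tool_name == "read_file":
        return "low"
    if tool_name == "write_file":
        return "medium"
    if tool_name == "run_command":
        cmd = args.get("command", "")
        if any(re.search(p, cmd) for p in CRITICAL_PATTERNS):
            return "critical"
        return "high"
    # Unknown tools default to medium; a stricter harness would return "high" here.
    return "medium"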
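A risk level only matters once it is mapped to an approval flow. The sketch below is one way to wire the four tiers from the table to actions. `Approver` and `ask_human` are hypothetical names. The split between "high" and "critical" is an assumption made for this example: a "high" approval is remembered per tool for the session, while "critical" always asks again.

```python
import logging
from typing import Callable

logger = logging.getLogger("guardrails")


class Approver:
    """Maps a risk level to an approval decision (illustrative sketch)."""

    def __init__(self, ask_human: Callable[[str], bool]):
        self.ask_human = ask_human
        # Tools the user has already approved at "high" risk this session.
        self.remembered: set[str] = set()

    def approve(self, risk: str, tool_name: str, args: dict) -> bool:
        if risk == "low":
            return True
        if risk == "medium":
            logger.info("auto-approved %s %r", tool_name, args)
            return True
        question = f"[{risk}] allow {tool_name} with {args!r}?"
        if risk == "high":
            if tool_name in self.remembered:
                return True
            approved = self.ask_human(question)
            if approved:
                self.remembered.add(tool_name)
            return approved
        # "critical" and any unrecognized level: always ask, never remember.
        return self.ask_human(question)
```

Sending unrecognized levels to the strictest branch keeps the approver fail-closed if a new risk level is added without updating this function.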
Sandboxing
Guardrails enforce policy; sandboxes enforce isolation. A sandbox is a restricted execution environment that limits what code can do at the OS level:
| Technology | Isolation Level | Overhead | Use Case |
|---|---|---|---|
| chroot | Filesystem only | Minimal | Basic path restriction |
| Docker | Process + filesystem + network | Low | Development, CI/CD |
| Firecracker microVM | Full VM | Medium | Production multi-tenant |
| gVisor | Syscall-level | Low-Medium | High-security workloads |
| WASM | Language-level | Minimal | In-browser agents |
A common setup is Docker for development and Firecracker (or an equivalent microVM) for production. The key principle: the agent's code execution should never have access to the host filesystem, network, or process space.
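As a rough illustration of the Docker row, the sketch below builds a `docker run` command line that removes network access and exposes only one host directory. `docker_argv` and `run_sandboxed` are hypothetical helpers. The resource limits are arbitrary example values, and the image and workspace path are supplied by the caller.

```python
import subprocess


def docker_argv(image: str, command: list[str], workspace: str) -> list[str]:
    """Build a locked-down `docker run` invocation (illustrative defaults)."""
    return [
        "docker", "run", "--rm",
        "--network", "none",    # no network access from inside the container
        "--read-only",          # root filesystem is mounted read-only
        "--cap-drop", "ALL",    # drop all Linux capabilities
        "--pids-limit", "128",  # example limit on process count
        "--memory", "512m",     # example memory limit
        "--volume", f"{workspace}:/workspace",  # the only host path exposed
        "--workdir", "/workspace",
        image, *command,
    ]


def run_sandboxed(image: str, command: list[str], workspace: str,
                  timeout: float = 60.0) -> subprocess.CompletedProcess:
    """Run the command in a container and capture its output.

    The timeout only stops waiting on the docker CLI process; cleaning up
    a container that is still running is not handled in this sketch.
    """
    return subprocess.run(
        docker_argv(image, command, workspace),
        capture_output=True, text=True, timeout=timeout,
    )
```

Passing the command as a list rather than a shell string means the harness never interprets the model's text with a host shell. Only the container does, and only inside the limits set above.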
Input Sanitization
Beyond tool-level guardrails, the harness should sanitize inputs to reduce prompt injection risk:
def sanitize_tool_result(result: str, max_length: int = 50_000) -> str:
    """Truncate and mark external content as untrusted."""
    if len(result) > max_length:
        result = result[:max_length] + "\n[TRUNCATED]"
    # Neutralize a closing marker inside the data so it cannot end the wrapper early.
    result = result.replace("</tool_result>", "&lt;/tool_result&gt;")
    # Wrap in markers so the model knows this is external data
    return f"<tool_result>\n{result}\n</tool_result>"
Content from external sources (web pages, user-uploaded files, API responses) should be clearly demarcated in the context so the model can distinguish instructions from data.
Common Pitfalls
- No guardrails at all: The default for most hobby harnesses. Fine for local development, disastrous for production.
- Guardrails in the prompt only: Telling the model "don't delete files" is not a guardrail. The model can be overridden by prompt injection. True guardrails are enforced in code, not in text.
- Overly restrictive permissions: An agent that can't do anything useful won't be used. Balance security with utility.
- Not logging denied actions: Understanding what the agent tried to do but was blocked from doing is critical for debugging and improving prompts.
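For the last pitfall, a denial log can be as simple as one structured record per blocked call. The sketch below uses the standard `logging` and `json` modules. The record fields and the logger name `guardrails.audit` are assumptions chosen for the example.

```python
import json
import logging
import time

audit_log = logging.getLogger("guardrails.audit")


def denial_record(tool_name: str, args: dict, reason: str) -> str:
    """Serialize one denied tool call as a JSON line."""
    return json.dumps(
        {
            "ts": time.time(),
            "event": "tool_call_denied",
            "tool": tool_name,
            "args": args,
            "reason": reason,
        },
        default=str,  # fall back to str() for values JSON cannot encode
    )


def log_denied(tool_name: str, args: dict, reason: str) -> None:
    """Write the denial to the audit logger at WARNING level."""
    audit_log.warning(denial_record(tool_name, args, reason))
```

Tool arguments can contain secrets or large payloads, so a real harness would likely redact or truncate `args` before writing the record.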
Further Reading
- Simon Willison: Prompt Injection. Comprehensive series on the threat model
- Anthropic: Mitigating Prompt Injection. Practical defense patterns