The Harness Is the Security Layer
AI agents are getting real permissions. They execute code, call APIs, modify infrastructure, and make decisions with downstream consequences. The security question has shifted from "can the model say something harmful" to "can the agent do something harmful." That's not a content moderation problem. That's access control, blast radius containment, and behavioral enforcement. Classic security engineering applied to a new kind of principal.
So where do you enforce security? There are four candidate layers: the prompt, the model weights, the VM sandbox, and the harness. The industry is converging on all four simultaneously under the banner of "defense in depth." But the layers are not equal. Three of them have structural limitations that make them necessary but insufficient. One of them is load-bearing.
What is harness engineering?
The term comes from Birgitta Bockeler's analysis of an experiment where a team built a 1M+ line codebase using AI agents over five months with "no manually typed code" as a forcing function. The harness is defined as "the tooling and practices we can use to keep AI agents in check," mixing deterministic and LLM-based approaches across three categories: context engineering (controlling what the model sees), architectural constraints (enforcing structural boundaries), and garbage collection (periodic agents detecting inconsistencies and violations).[1]
The security framing writes itself. Context engineering is information-theoretic least privilege. Architectural constraints are enforceable security policy. Garbage collection is continuous compliance monitoring. The harness isn't a new concept for security engineers. It's the control plane pattern applied to a new kind of compute.
Layer 1: prompt-level security
System prompts and prompt-level guardrails instruct the model to refuse dangerous actions. "Do not execute destructive commands." "Never access production credentials." The appeal is obvious: fast to implement, zero infrastructure cost, and easy to reason about. The problem is equally obvious: it doesn't work as a security boundary.
Prompt injection is ranked #1 on the OWASP Top 10 for LLM Applications, appearing in over 73% of production AI deployments assessed during security audits.[2] OpenAI has acknowledged that AI browsers "may always be vulnerable to prompt injection attacks."[3] In 2025, a major coding assistant suffered a CVE allowing remote code execution via prompt injection, potentially compromising millions of developer machines.[4] A separate vulnerability demonstrated second-order prompt injection: a low-privilege agent tricking a higher-privilege agent into performing unauthorized actions on its behalf.[4]
Indirect prompt injection hides malicious instructions in content retrieved via RAG, web pages, loaded files, or tool outputs. The model processes this external content as part of its context and executes the hidden instruction.[5][6] The attack surface isn't the prompt itself. It's everything the model reads.
The fundamental issue is well stated by NVIDIA's security guidance: "If a capability is dangerous, it should be removed via policy, not prompts."[7] A prompt is a suggestion to a probabilistic system. There is no deterministic guarantee that "do not execute rm -rf" in a system prompt will hold under adversarial input. When a prompt-level guard fails, there's no log, no linter output, no test failure. It just does the thing. You find out after the damage.
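The "removed via policy, not prompts" principle can be sketched concretely. In the sketch below, the tool names and registry API are hypothetical; the point is that a dangerous capability is never provisioned, so no prompt injection can reach it:

```python
# Sketch: removing a capability via policy rather than asking the model
# nicely. Tool names and the registry shape are illustrative assumptions.

ALLOWED_TOOLS = {"read_file", "run_tests", "search_docs"}  # policy, not prompt

def build_tool_registry(available_tools: dict) -> dict:
    """Expose only allowlisted tools. From the model's point of view,
    'shell_exec' does not exist, so no injection can invoke it."""
    return {name: fn for name, fn in available_tools.items()
            if name in ALLOWED_TOOLS}

def dispatch(registry: dict, tool_name: str, *args):
    """Deterministic refusal: logged, auditable, not up to the model."""
    if tool_name not in registry:
        raise PermissionError(f"tool {tool_name!r} is not provisioned")
    return registry[tool_name](*args)

registry = build_tool_registry({
    "read_file": lambda p: f"contents of {p}",
    "shell_exec": lambda cmd: f"ran {cmd}",  # never exposed to the agent
})
```

Compare this with a system-prompt rule like "never run shell commands": the prompt version can be argued with; the registry version cannot.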
And it doesn't scale with autonomy. As agents get longer-running and multi-step, the attack surface of each prompt interaction compounds. You can't prompt-engineer your way out of a 200-step agentic workflow. Prompt-level security is perimeter security for a system with no perimeter.
Layer 2: base model fine-tuning / RLHF
Baking safety into model weights via RLHF or Constitutional AI sounds ideal: make the model inherently safe. The tradeoffs are structural.
First, you almost certainly don't own the model. Most organizations consume models via API. You can't RLHF someone else's model. Fine-tuning open-weight models means maintaining a fork of a rapidly moving foundation, which is itself a security liability. Every upstream update requires re-evaluation. Every custom safety behavior needs re-validation.
Second, the alignment tax is real and documented. RLHF can cause "forgetting" of pretrained abilities, creating a measured trade-off between alignment performance and capability preservation.[8] Providers typically spend $8-15M in additional compute per major model release on alignment procedures. Runtime safety monitors add 10-30% latency.[9] Heavy safety fine-tuning degrades the very capability that makes the model useful.
Third, RLHF safety training can't encode organization-specific security policies. "Don't touch production databases." "Don't call external APIs without auth headers." "Don't write to /etc." These are your policies, not universal norms. RLHF operates at the level of broad behavioral dispositions. It either over-refuses benign requests or gets bypassed by novel framings. Neither failure mode is acceptable for security.
Fourth, model weights are frozen at training time. Your threat model changes weekly. A fine-tuning cycle takes days and costs thousands of dollars. A harness constraint deploys in minutes with a new linter rule or architectural boundary. Static defense against dynamic threats is a losing strategy every security engineer has learned the hard way.
Layer 3: VM sandboxing
MicroVM sandboxing is the most technically impressive of the insufficient layers. Firecracker boots in ~125ms with ~5MB memory overhead per instance, providing dedicated kernels per workload. A kernel exploit inside one VM cannot reach the host or other VMs. Docker is moving from containers to dedicated microVMs for coding agents. This is genuinely strong isolation.[10][11][12]
But sandboxing solves the wrong problem in isolation. A prompt-injected agent running inside a sandbox is still a compromised agent. It just can't escape the VM. If the sandbox has network access, database credentials, or API keys -- and it needs these to be useful -- the agent can exfiltrate data, corrupt state, or make unauthorized calls within the sandbox boundary.[13]
Data exfiltration from inside a sandbox is genuinely hard to prevent. A compromised sandbox can encode data in DNS queries, use ICMP tunneling, or exploit any application-layer protocol that's allowed. Egress allowlists help but don't eliminate the problem.[10]
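An egress allowlist, for all its limits, is a pure function the harness can apply at the proxy. A minimal sketch, with hypothetical hostnames (and remembering that DNS itself remains a covert channel the allowlist does not close):

```python
# Sketch of a harness-side egress check. Hostnames are illustrative
# assumptions; real deployments also need DNS policy, since allowlists
# alone don't stop DNS-based exfiltration.

ALLOWED_SUFFIXES = ("api.internal.example", "pypi.org")

def egress_allowed(host: str) -> bool:
    """Allow exact matches and subdomains of allowlisted hosts."""
    host = host.lower().rstrip(".")
    return any(host == s or host.endswith("." + s) for s in ALLOWED_SUFFIXES)
```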
Poisoned inputs work perfectly inside sandboxes. Malicious repository configuration files, git histories with embedded prompt injections, and malicious tool responses all operate inside the isolation boundary. The sandbox doesn't protect against content that the agent is supposed to read.[7]
Here's the core tension: the whole point of an agent is to do things. A perfectly sandboxed agent with no network, no filesystem, and no API access is useless. The moment you grant capabilities, the sandbox becomes a blast radius limiter, not a security control. It contains damage; it doesn't prevent it. Sandboxing is a containment strategy. What we need is a prevention strategy.
Layer 4: the harness
The harness sits between the model and the world. This is where security has always worked best: at the control plane, not the data plane or the compute layer.
Deterministic gates for security-critical operations. Deterministic guardrails are "hard rules that reject, redact, or add security context to inputs and outputs, providing absolute boundaries that cannot be linguistically manipulated." When the cost of failure is high, you want a pure function that returns false, not a probabilistic guess.[14][15]
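A minimal sketch of such a gate, placed in front of a shell tool. The deny patterns are illustrative, not a complete denylist; a real harness would pair this with allowlists and sandboxing:

```python
import re

# Sketch of a deterministic gate: same input, same verdict, no model in
# the loop. Patterns below are illustrative assumptions, not exhaustive.

DENY_PATTERNS = [
    re.compile(r"\brm\s+(-[a-z]*r[a-z]*f|-[a-z]*f[a-z]*r)\b"),  # rm -rf variants
    re.compile(r"\bcurl\b.*\|\s*(sh|bash)\b"),                  # pipe-to-shell
    re.compile(r">\s*/etc/"),                                   # writes under /etc
]

def command_allowed(cmd: str) -> bool:
    """Pure function over the command string; a False here is a logged,
    reviewable rejection, not a refusal the model decided on."""
    return not any(p.search(cmd) for p in DENY_PATTERNS)
```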
Context engineering as least privilege. The harness controls what the model sees. If the agent doesn't need production credentials to write a frontend component, the harness never surfaces them. This is network segmentation applied to the information plane. Most IAM failures occur because systems only know who is asking, not what is happening in the world. Context engineering closes that gap.[14]
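A sketch of that gap-closing in code. Task names, document names, and secret keys below are hypothetical; the mechanism is the point: the harness assembles context from a per-task policy, so secrets never enter the window in the first place:

```python
# Sketch: the harness decides what enters the context window based on the
# task, not on what the agent asks for. All names here are illustrative.

TASK_CONTEXT_POLICY = {
    "frontend_component": {"design_system.md", "component_api.md"},
    "db_migration": {"schema.sql", "migration_guide.md"},
}

SECRET_KEYS = {"DATABASE_URL", "AWS_SECRET_ACCESS_KEY", "PROD_API_KEY"}

def assemble_context(task: str, documents: dict, env: dict) -> dict:
    """Surface only documents the task needs; never surface secrets."""
    allowed = TASK_CONTEXT_POLICY.get(task, set())
    docs = {name: body for name, body in documents.items() if name in allowed}
    safe_env = {k: v for k, v in env.items() if k not in SECRET_KEYS}
    return {"documents": docs, "env": safe_env}
```

A prompt-injected agent can ask for `PROD_API_KEY` all it wants; the key was filtered out before the model ever ran.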
Architectural constraints as enforceable policy. Module boundaries, stable data structures, and structural tests map directly to blast radius containment, least privilege, and separation of concerns. If the agent structurally cannot reach production credentials because the harness enforces that boundary, no prompt injection matters. The attack doesn't fail because the model chose to refuse. It fails because the path doesn't exist.[1]
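The "path doesn't exist" failure mode can be sketched with a filesystem tool whose writes are structurally confined to a workspace. The workspace root is a hypothetical example; the check runs in the harness, before the model's output touches anything:

```python
from pathlib import Path

# Sketch: a file tool that structurally cannot escape its module boundary.
# The workspace path is an illustrative assumption.

WORKSPACE = Path("/workspace/frontend").resolve()

def resolve_write_target(rel_path: str, root: Path = WORKSPACE) -> Path:
    """Resolve a relative path and reject anything outside the boundary.
    A real implementation would then perform the write."""
    target = (root / rel_path).resolve()
    if root not in target.parents and target != root:
        # Traversal out of the boundary: from the agent's point of view,
        # this path simply does not exist.
        raise PermissionError(f"{rel_path!r} escapes {root}")
    return target
```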
The feedback loop is an audit trail. "When agents struggle, treat it as diagnostic feedback: identify what is missing -- tools, guardrails, documentation -- and feed it back." Every harness failure is a logged, reviewable event. This is security telemetry built into the development lifecycle, not bolted on after the fact.[1]
Garbage collection is continuous compliance. Periodic agents detecting inconsistencies and constraint violations is continuous security monitoring. It's SAST/DAST applied to the entire agent lifecycle, not just a CI gate.[1]
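The deterministic half of such a pass can be sketched as a scheduled scan for boundary violations. The module names and the single rule below are hypothetical; LLM-based reviewers would layer on top of a pass like this:

```python
import re

# Sketch of a deterministic "garbage collection" scan: find files outside
# billing/ that import the billing module. Names are illustrative.

FORBIDDEN = re.compile(r"^\s*(from|import)\s+billing\b", re.MULTILINE)

def scan_for_violations(files: dict) -> list:
    """Return (filename, rule) pairs for each boundary violation found."""
    return [
        (name, "frontend must not import billing")
        for name, source in files.items()
        if not name.startswith("billing/") and FORBIDDEN.search(source)
    ]
```

Run on a schedule, each non-empty result is a compliance finding with a file, a rule, and a timestamp, exactly the shape security telemetry should have.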
Defense in depth without alignment tax. Layered controls (architectural constraints + deterministic linters + LLM monitors + garbage collection) without degrading model capability. The model stays maximally capable; constraints are external. You get security without paying for it in performance.[16]
Adaptable at the speed of threats. A new linter rule or architectural constraint deploys in minutes. A model fine-tune takes days and costs thousands. A VM image rebuild takes hours. The harness matches the pace of the threat landscape because it's software, not weights or infrastructure.
The comparison
| Layer | Strength | Structural Weakness |
|---|---|---|
| Prompt | Fast to implement, zero infra cost | Not a security boundary; bypassed by injection |
| Model Weights | Broad behavioral shaping | Alignment tax, static, not org-specific, you don't own it |
| VM Sandbox | Strong blast radius containment | Doesn't prevent semantic attacks; useful agents need permissions |
| Harness | Deterministic + adaptive, org-specific, observable | Requires significant tooling investment; not a quick win |
The right mental model
Think about it like container security. You could try to make every application binary inherently secure (fine-tuning). You could tell the application "please don't access the network" (prompt-level guardrails). You could run it in a VM with no connectivity (sandbox). Or you could run it with network policies, read-only filesystems, seccomp profiles, and resource limits (the harness). The industry converged on the last option because it's the only one that provides deterministic enforcement without crippling the workload.
The harness is the kernel. The model is a userspace process. Security belongs in the kernel.
The three pillars of agentic AI safety -- guardrails, permissions, and auditability -- all have to be implemented somewhere.[17] And that somewhere must be a layer that can actually enforce them. Guardrails as prompts are suggestions. Guardrails as deterministic gates are policy. Permissions in model weights are dispositions. Permissions in the harness are access control. Auditability inside a sandbox is logs. Auditability in the harness is a security telemetry pipeline.
The investment problem
The honest caveat: harness engineering is expensive. The experiment analyzed by Bockeler took five months and produced over a million lines of code. This isn't a YAML file and a weekend.[1]
Organizations will need to decide whether to build harnesses in-house, adopt standardized frameworks, or wait for platform providers to build them. The "service template" analogy from the original article is apt: expect harness templates, forking challenges, and the same maintenance burden as any shared infrastructure. Retrofitting harnesses to legacy codebases may prove uneconomical, much like turning on a linter for the first time on a ten-year-old codebase and drowning in static analysis alerts.
But the alternative is relying on prompt-level guards, model-level alignment, or sandbox containment alone. 73% of assessed deployments are vulnerable to prompt injection. The alignment tax degrades capability. Sandboxes don't stop semantic attacks. The investment in the harness is the investment in actual security. Everything else is theater dressed up as defense in depth.
The good news is that harness engineering isn't alien to security teams. It's the same discipline we've always practiced -- access control, policy enforcement, telemetry, continuous monitoring -- applied to a new kind of principal. The patterns are familiar. The substrate is new. And the organizations that figure this out first will be the ones whose agents are actually trustworthy, not just sandboxed with fingers crossed.
Sources
- Bockeler, "Harness Engineering" -- Martin Fowler (2025). Link
- Obsidian Security -- "Prompt Injection Attacks: The Most Common AI Exploit" (2025). Link
- TechCrunch -- "OpenAI says AI browsers may always be vulnerable to prompt injection attacks" (December 2025). Link
- Sombrainc -- "LLM Security Risks in 2026: Prompt Injection, RAG, and Shadow AI." Link
- Lakera -- "Indirect Prompt Injection: The Hidden Threat Breaking Modern AI Systems." Link
- ScienceDirect -- "From prompt injections to protocol exploits: Threats in LLM-powered AI agents workflows." Link
- NVIDIA -- "Practical Security Guidance for Sandboxing Agentic Workflows and Managing Execution Risk." Link
- Lin & Tan -- "Mitigating the Alignment Tax of RLHF" (EMNLP 2024). Link
- Monetizely -- "The AI Alignment Tax: Understanding the Cost of Safety in AI Capability Development." Link
- Northflank -- "How to Sandbox AI Agents in 2026: MicroVMs, gVisor & Isolation Strategies." Link
- Docker -- "A New Approach for Coding Agent Safety." Link
- E2B -- Open-source secure cloud runtime for AI agents. Link
- Trend Micro -- "Unveiling AI Agent Vulnerabilities Part III: Data Exfiltration." Link
- Civic -- "You Need Deterministic Guardrails for AI Agent Security." Link
- DEV Community -- "Building Deterministic Guardrails for Autonomous Agents." Link
- Snyk -- "The Future of AI Agent Security Is Guardrails." Link
- Dextra Labs -- "Agentic AI Safety Playbook 2025: Guardrails, Permissions & Governance." Link
- OWASP -- "AI Agent Security Cheat Sheet." Link