Harness Engineering: The Discipline That Replaces Writing Code
A team of three engineers built a million-line codebase in five months. Zero hand-written code. They averaged 3.5 pull requests per engineer per day. Separately, a solo developer shipped 6,600+ commits per month running 5-10 agents simultaneously. Elsewhere, an internal system at a major payments company produces over 1,000 merged pull requests per week via Slack-based task automation.[1]
These aren't cherry-picked demos. They're converging signals from independent teams arriving at the same conclusion: the engineer's primary job is no longer writing code. It's designing the environment in which agents write code. The term for this is harness engineering, and it's the most important discipline in software that most people haven't heard of yet.
What is a harness?
The term was popularized by Mitchell Hashimoto and formalized in Birgitta Bockeler's analysis on Martin Fowler's site. The definition: "the tooling and practices we can use to keep AI agents in check," mixing deterministic and LLM-based approaches. The operational principle, per Hashimoto: "anytime you find an agent makes a mistake, you take the time to engineer a solution such that the agent never makes that mistake again."[2][1]
A harness is not a prompt. It's not a system message. It's not a YAML configuration file. It's the entire engineered environment surrounding an agent: the context it receives, the constraints it operates within, the tools it can access, the feedback loops that correct it, and the verification systems that validate its output. If the model is the engine, the harness is the car: steering, brakes, transmission, instruments, and road markings included.
The harness has three core components, each doing different work:
Component 1: context engineering
Context engineering is the practice of curating the optimal set of information available to the model during inference. Not just the prompt, but everything that lands in the context window: background knowledge, retrieved data, tool descriptions, structured inputs, memory from previous sessions, and the code itself. The goal is finding the smallest set of high-signal tokens that maximizes desired outcomes.[3]
This matters more than most people realize. LLMs face an architectural constraint from the transformer attention mechanism where every token attends to every other token, creating n-squared relationships. As context grows, models experience degraded recall and reasoning precision -- a phenomenon called "context rot." Larger context windows don't solve this; they just move the degradation curve. Context is a finite resource even when the window is a million tokens.[3]
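The "smallest set of high-signal tokens" framing can be made concrete with a toy context budgeter: score candidate items, then pack the densest ones into a fixed token budget. A minimal sketch; the item names, token counts, and relevance scores are illustrative assumptions, not any team's actual tooling.

```python
# Toy context budgeting: greedily pack the highest signal-per-token items
# into a fixed budget. Scores and items below are hypothetical.

def pack_context(items, budget_tokens):
    """Select items by signal density until the token budget is spent."""
    chosen, used = [], 0
    # Sort by relevance per token, highest first.
    for item in sorted(items, key=lambda i: i["relevance"] / i["tokens"],
                       reverse=True):
        if used + item["tokens"] <= budget_tokens:
            chosen.append(item["name"])
            used += item["tokens"]
    return chosen, used

candidates = [
    {"name": "conventions.md",   "tokens": 700,   "relevance": 0.9},
    {"name": "full_git_history", "tokens": 50000, "relevance": 0.3},
    {"name": "failing_test_log", "tokens": 1200,  "relevance": 0.95},
    {"name": "api_spec_excerpt", "tokens": 2000,  "relevance": 0.7},
]
selected, used = pack_context(candidates, budget_tokens=4000)
# The full git history never makes the cut: high bulk, low density.
```

The point of the sketch is the shape of the decision, not the scoring: a dense 700-token conventions file beats a sprawling history dump even when the window could technically hold more.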
The most sophisticated implementation I've seen is documented in a recent paper on codified context infrastructure. The team built a three-tier architecture for a 108,000-line C# distributed system:[4]
A quarter of the codebase exists solely to make agents effective. The hot memory tier -- a single ~660-line file loaded at the start of every session -- contains conventions, known failure modes, and trigger tables that route tasks to the right specialist agent based on file patterns. The cold memory tier holds 34 specification documents queryable on demand, including symptom-cause-fix tables that encode debugging knowledge the way a senior engineer carries it in their head.[4]
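A trigger table of the kind described above can be sketched in a few lines: file-pattern rules that route a task to a specialist agent. The patterns and agent names here are hypothetical; the paper's real tables live in its hot-memory file.

```python
# Hypothetical trigger table: route a changed file to a specialist agent.
import fnmatch

TRIGGER_TABLE = [
    ("*/Persistence/*.cs", "save-system-agent"),
    ("*/Network/*.cs",     "networking-agent"),
    ("*.csproj",           "build-agent"),
]

def route(changed_file, default="general-agent"):
    """Return the first specialist whose pattern matches the changed file."""
    for pattern, agent in TRIGGER_TABLE:
        if fnmatch.fnmatch(changed_file, pattern):
            return agent
    return default

specialist = route("src/Persistence/SaveManager.cs")  # save-system-agent
fallback = route("src/UI/Menu.cs")                    # general-agent
```

The table is deliberately dumb: deterministic matching, first rule wins, explicit fallback. That predictability is what makes it safe to load into every session.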
The results are striking. Their save-system specification, the most-referenced document at 74 sessions, prevented save-related bugs across four weeks without a single failure. When they found a subsystem lacking documentation, it was precisely the subsystem where agents introduced the most regressions.[4]
For long-running agents that span multiple context windows, the techniques get more interesting. Compaction summarizes conversation contents when approaching context limits, preserving architectural decisions while discarding redundant tool outputs. Structured note-taking uses persistent external files (progress trackers, to-do lists) consulted across context resets. Multi-agent architectures assign specialized sub-agents to focused tasks with clean context windows, returning condensed summaries of 1,000-2,000 tokens to the coordinator.[3]
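Structured note-taking is simple to sketch: a progress file the agent rewrites before a context reset and reloads afterward. A minimal sketch; the field names (`done`, `todo`, `decisions`) and file layout are illustrative assumptions.

```python
# Persistent notes that survive a context reset.
import json
import os
import tempfile

def save_notes(path, notes):
    with open(path, "w") as f:
        json.dump(notes, f, indent=2)

def load_notes(path):
    if not os.path.exists(path):
        return {"done": [], "todo": [], "decisions": []}
    with open(path) as f:
        return json.load(f)

# Session 1: the agent records progress before hitting the context limit.
path = os.path.join(tempfile.mkdtemp(), "progress.json")
notes = load_notes(path)
notes["done"].append("implemented save-file schema")
notes["todo"].append("add migration test")
notes["decisions"].append("schema version stored as integer, not string")
save_notes(path, notes)

# Session 2 (fresh context window): the agent reloads the notes.
restored = load_notes(path)
```

The decisions list is the important part: it is the architectural memory that compaction would otherwise have to preserve, made explicit and cheap to reload.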
One team found that initializer agents -- agents that run first to generate comprehensive feature requirement files before any coding begins -- transformed their workflow. The initializer creates a JSON file cataloging 200+ discrete features, each marked as "failing." The coding agent receives strict instructions prohibiting deletion of features from this list, preventing it from hiding functionality gaps by simply removing the requirement.[5]
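The initializer pattern reduces to a simple invariant: features may change status but never vanish. A minimal sketch, with the file format and field names assumed from the description above:

```python
# Feature list guard: statuses may change, entries may never be deleted.

def init_features(names):
    """Initializer agent output: every feature starts as failing."""
    return {name: "failing" for name in names}

def apply_update(old, new):
    """Accept status changes; reject any update that drops a feature."""
    missing = set(old) - set(new)
    if missing:
        raise ValueError(f"features deleted instead of completed: {sorted(missing)}")
    return new

features = init_features(["login", "logout", "password-reset"])
ok = apply_update(features, {**features, "login": "passing"})

# An agent "hiding" a gap by dropping a requirement is caught:
try:
    apply_update(features, {"login": "passing", "logout": "failing"})
    deleted_allowed = True
except ValueError:
    deleted_allowed = False
```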
Context engineering is where the term "prompt engineering" needed to go but couldn't because prompting implies a single interaction. Context engineering implies a system. And systems are what engineers build.
Component 2: architectural constraints
Here's the counterintuitive finding: constraining agents makes them more productive, not less. The teams with the highest throughput are the ones enforcing the strictest architectural boundaries.
One team enforced a rigid layered architecture where each business domain flows through a fixed set of layers: Types, Config, Repo, Service, Runtime, UI. Dependencies only flow in one direction. This is monitored by both custom deterministic linters and LLM-based review agents. When an agent tries to import a UI module from the Repo layer, a linter catches it before the code is ever committed.[2]
The linter error messages double as remediation instructions. The violation isn't just flagged; the error tells the agent exactly how to fix it. This creates a self-correcting loop: the agent makes an architectural mistake, the linter explains why it's wrong and how to fix it, and the agent corrects itself without human intervention. The linter is teaching the agent while it works.[1]
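A deterministic layer linter of this kind fits in a dozen lines. The layer names follow the article; the module paths and the remediation wording are illustrative assumptions:

```python
# Layer linter: dependencies may only flow downward, and every violation
# message doubles as a fix instruction for the agent.

LAYERS = ["Types", "Config", "Repo", "Service", "Runtime", "UI"]  # low to high

def layer_of(module):
    # Assumes module paths contain a layer directory, e.g. "billing/Repo/".
    return next(l for l in LAYERS if f"/{l}/" in module)

def check_import(importer, imported):
    """Return None if the import is legal, else a remediation message."""
    src, dst = layer_of(importer), layer_of(imported)
    if LAYERS.index(dst) <= LAYERS.index(src):
        return None
    return (f"{importer} ({src} layer) may not import {imported} ({dst} layer). "
            f"Dependencies flow {' -> '.join(reversed(LAYERS))}. "
            f"Fix: move the shared logic into {src} or below, "
            f"or invert the dependency via an interface in {src}.")

violation = check_import("billing/Repo/invoices.py", "billing/UI/invoice_view.py")
legal = check_import("billing/UI/invoice_view.py", "billing/Service/invoices.py")
```

The message format is the harness insight: the agent that triggered the violation can read the error, apply the stated fix, and retry without a human in the loop.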
This also extends to what Bockeler calls "taste invariants" -- a small set of non-negotiable conventions that aren't about correctness but about coherence: structured logging, naming conventions for schemas and types, platform-specific reliability requirements. These are the kind of things a senior engineer enforces in code review through sheer cultural pressure. In a harness, they're encoded as automated checks.[2]
The paradox is real: increasing trust and reliability in AI-generated code requires constraining the solution space rather than expanding it. Agents are most effective in environments with strict boundaries and predictable structure. This is the opposite of the "give the AI maximum flexibility" intuition that most people start with.
Component 3: garbage collection
Code generated by AI accumulates entropy differently than human-written code, but it still accumulates entropy. Between 2020 and 2024, there was an 8-fold increase in code blocks containing five or more duplicated lines. 2024 was the first year where the number of copy-pasted lines exceeded the number of refactored lines.[6] AI models prioritize local functional correctness over global architectural coherence, generating code that works in isolation but degrades the codebase holistically.[7]
Garbage collection in harness engineering means running periodic agents whose sole purpose is to find and fix decay: documentation inconsistencies, architectural constraint violations, dead code, duplicated patterns, naming drift. These aren't the coding agents. They're the cleanup crew. They run on schedules or triggers, not on developer prompts.[2]
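One concrete garbage-collection check is duplicate detection: flag any block of several identical consecutive lines that appears in more than one place. A sketch; the five-line window and the comparison scheme are illustrative choices, not a specific tool's algorithm:

```python
# Flag 5-line blocks that appear at more than one location.
from collections import defaultdict

def duplicated_blocks(files, window=5):
    """Map each repeated window of lines to all locations where it occurs."""
    seen = defaultdict(list)
    for name, text in files.items():
        lines = [l.strip() for l in text.splitlines()]
        for i in range(len(lines) - window + 1):
            block = tuple(lines[i:i + window])
            if any(block):  # skip all-blank windows
                seen[block].append((name, i + 1))
    return {b: locs for b, locs in seen.items() if len(locs) > 1}

snippet = "a = load()\nb = parse(a)\nc = validate(b)\nd = store(c)\nlog(d)\n"
dupes = duplicated_blocks({"jobs.py": snippet, "tasks.py": "x = 1\n" + snippet})
# One duplicated block, found in both files.
```

Run on a schedule, a check like this feeds the cleanup agents a worklist instead of waiting for a human to notice the drift.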
This is the component most teams skip and then regret. Without active entropy management, the codebase degrades until the agents themselves start struggling -- their context fills with inconsistent patterns, conflicting conventions, and dead ends. The garbage collection agents close the loop: they keep the codebase in a state where the coding agents can remain effective. The harness maintains itself.
The new job description
If the harness is doing this much work, what are the humans doing?
The job splits into two halves that operate simultaneously, not sequentially. The first half is building the environment: creating structure, tools, and feedback mechanisms so agents proceed reliably. The second half is managing the work: planning, directing, and reviewing agent output at the architectural level.[1]
The critical practice everyone converges on: separate planning from execution. One practitioner calls it "the single most important thing I do." The pattern is consistent across teams: spend significant time on planning and specification before any agent writes a line of code. Define the feature list. Specify the architectural boundaries. Write the context documents. Then let the agents execute within those constraints.[1]
The review bar matters too. The teams shipping the most code maintain the same review standards for agent output as for human output. One practitioner acts as architectural gatekeeper while trusting agents with implementation details, treating them as "experienced subcontractors" rather than junior developers. The trust is scoped: high trust for implementation within constraints, zero trust for architectural decisions.[1]
There's an emerging convention around this: AGENTS.md, a Markdown file at the repository root that coding agents automatically read at the start of every session. It tells agents about build steps, testing commands, coding conventions, architectural constraints, and common pitfalls. The critical pattern: update it every time agents fail. The file is a living document of institutional knowledge encoded for machine consumption.[1]
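A minimal AGENTS.md might look like the sketch below. Every command, path, and rule in it is a hypothetical example, not a standard:

```markdown
# AGENTS.md

## Build & test
- Build: `make build`
- Run the full suite before every commit: `make test`

## Conventions
- Structured logging only; never bare print statements.
- New modules follow the Types -> Config -> Repo -> Service -> Runtime -> UI layering.

## Known pitfalls (update after every agent failure)
- The save-system serializer requires a schema version bump on any field change.
- Integration tests must run against the local stub, never the staging API.
```

The "known pitfalls" section is where Hashimoto's operational principle lives: each entry is a mistake an agent made once and should never make again.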
Orchestration models
How humans manage agents varies, and two models are emerging:
Attended parallelization means actively managing 3-4 concurrent agent sessions. The developer is in the loop during execution, steering, course-correcting, and spawning new agents as tasks complete. This is the solo developer model: high throughput, high cognitive load, but maximum control over architectural coherence.
Unattended parallelization means posting tasks and re-entering only at the review stage. Developers describe tasks via Slack or a ticket interface, agents execute asynchronously, and the developer reviews completed PRs. This is the team model: lower per-developer throughput but massive parallelism across the organization. One company's system processes 1,000+ merged PRs per week this way.[1]
Both models require the same harness infrastructure. The difference is how tightly coupled the human is to the execution loop. Both models fail identically without strong constraints, context, and verification: the agents drift, the code decays, and the review burden overwhelms the humans.
The entropy problem
Here's the part nobody wants to talk about: AI-generated code might be making developers worse.
A randomized controlled trial with 52 junior engineers found that AI-assisted developers scored about 17 percentage points lower on comprehension tests than manual coders: the AI group averaged 50% on quiz scores versus 67% for the manual group. The largest gap was in debugging questions. Developers who delegated code generation to AI scored below 40%, while those who used AI for conceptual questions achieved 65% or higher.[8]
This matters for harness engineering because the harness assumes humans who can do architectural review, design constraints, and diagnose agent failures. If the next generation of engineers develops weaker debugging instincts because they delegated implementation to agents during their formative years, the review layer of the harness weakens. The constraint quality degrades. The garbage collection agents miss things because the humans overseeing them don't recognize the patterns.
The codebase-level entropy is measurable too. Research shows that of the quality issues found in AI-generated code, approximately 90-93% are code smells, 5-8% are bugs, and around 2% are security vulnerabilities. The incentive structure encourages accepting quick, duplicated snippets over refactored, coherent patterns.[6][7] Without a strong harness -- without linters that catch duplication, structural tests that enforce modularity, and garbage collection agents that clean up drift -- the codebase decays faster than humans can review it.
This is why harness engineering isn't optional. It's not a nice-to-have for teams that want to be extra careful. It's the structural requirement for AI-assisted development that doesn't collapse under its own entropy. The harness is load-bearing.
What's still unsolved
Harness engineering is a discipline that's maybe 18 months old as a named practice. The convergence is real but the gaps are significant.
Brownfield codebases. Everything documented so far works best on greenfield projects. Retrofitting a harness to a legacy system with inconsistent structure, undocumented conventions, and ten years of accumulated entropy is a different problem entirely. It's analogous to turning on a linter for the first time on a million-line codebase: the signal-to-noise ratio is immediately overwhelming.[1][2]
Verification at scale. Agents routinely mark features "complete" without proper end-to-end testing. Vision and tool access limitations create verification gaps. One team found that agents didn't reliably catch bugs until explicitly instructed to use browser automation via Puppeteer, shifting from unit tests and curl to actual end-to-end user workflows.[5]
Context infrastructure maintenance. The three-tier codified context system requires ~1-2 hours per week of maintenance, primarily updating specifications during code changes. Specification staleness was identified as the primary failure mode: when the docs drift from the code, the agents start making mistakes that the harness was designed to prevent.[4]
Cultural adoption. Success requires dedicated engineers building and maintaining harnesses. Not every engineer thrives in this mode. Engineers who love algorithmic puzzles and hands-on craft work struggle to go "agent-native." Product-focused developers who think in terms of outcomes rather than implementation adapt quickly. The personality fit for harness engineering is closer to technical program manager than to competitive programmer.[1]
Standardization. Every team is building harnesses from scratch. There's no shared framework, no standard toolchain, no common vocabulary beyond AGENTS.md. The "service template" analogy from Bockeler's article suggests organizations will eventually adopt standardized harness templates as starting points, but this brings the same forking and synchronization challenges as any shared infrastructure.[2]
Where this goes
Bockeler's article notes that the harness designers she interviewed acknowledge their guardrails will "almost surely dissolve over time as models improve." This is probably right for the low-level constraints: as models get better at following conventions and avoiding obvious mistakes, the linter rules that catch those mistakes become less necessary.[2]
But the higher-level components -- context architecture, verification systems, entropy management, feedback loops -- are not temporary scaffolding around weak models. They're permanent infrastructure for a new kind of software development. Better models will need less hand-holding but more sophisticated orchestration. The harness evolves; it doesn't disappear.
A coding agent running LangChain's harness improved from 52.8% to 66.5% on a benchmark by modifying only the harness while keeping the underlying model constant. A 13.7-point gain from environment design alone.[9] That's the signal. The frontier isn't just better models. It's better harnesses around the same models. And unlike model training, harness engineering is something every team can do, starting now, with the models they already have.
The engineer's job is no longer to write code. It's to design the environment in which code gets written correctly. That's not a demotion. It's the same shift that happened when we went from writing assembly to writing compilers, from managing servers to writing infrastructure-as-code. The abstraction layer moved up. The leverage moved up with it.
Sources
- [1] Guo, "The Emerging Harness Engineering Playbook" -- Ignorance.ai (2026). Link
- [2] Bockeler, "Harness Engineering" -- Martin Fowler (2025). Link
- [3] "Effective Context Engineering for AI Agents" -- Anthropic Engineering (2025). Link
- [4] "Codified Context: Infrastructure for AI Agents in a Complex Codebase" -- arXiv (2026). Link
- [5] Young, "Effective Harnesses for Long-Running Agents" -- Anthropic Engineering (2025). Link
- [6] Curlee, "The Inevitable Rise of Poor Code Quality in AI-Accelerated Codebases" -- SonarSource (2025). Link
- [7] "How AI Generated Code Compounds Technical Debt" -- LeadDev (2025). Link
- [8] "Anthropic Study: AI Coding Assistance Reduces Developer Skill Mastery by 17%" -- InfoQ (February 2026). Link
- [9] "Improving Deep Agents with Harness Engineering" -- LangChain Blog (2026). Link