Harness Engineering: Build AI Agents That Work in Prod

If you work in product, engineering, or anywhere near software delivery in 2026, you are about to hear "harness engineering" a lot. Here is the one-paragraph version before we go deep: a harness is everything around an AI agent except the model itself. It is the set of tools, context-loading logic, permissions, lints, tests, and review sub-agents that decide what information the model sees, what actions it can take, and what happens when it gets something wrong. Harness engineering is the discipline of building and iterating on that surrounding system. The model is increasingly a commodity. The harness is where the competitive advantage lives. If you are new to AI-powered development or are just starting to think about agents as part of your workflow, the rest of this post will give you the vocabulary and patterns you need.

Two events in early 2026 turned this from a blog-post concept into a professional discipline that teams are hiring for. The first was the accidental leak of Claude Code's entire source code on March 31. The second was Ryan Lopopolo's keynote at AI Engineer Europe on April 16, where he laid out, in concrete patterns any team can copy, exactly how OpenAI built a one-million-line codebase with zero manually written code. Together they give us the clearest picture yet of what serious agent infrastructure actually looks like, and what any builder, PM, or engineering lead should be doing differently this week.

Key Takeaways

The harness (tools, context-loading logic, permissions, memory, and review sub-agents) is where competitive advantage lives. The model is a commodity; the surrounding infrastructure is not.
The Claude Code source leak confirmed this: of 512,000 lines of TypeScript, the actual model call is a tiny fraction. The overwhelming majority is infrastructure.
Ryan Lopopolo's team at OpenAI built a 1-million-line codebase with zero manually written code by treating code as a compiled artifact of the spec and the LLM as a fuzzy compiler.
Every "continue" you type to an agent is a harness failure, not a model failure. Track those interventions as bugs in your context-loading or task specification.
Scheduling a weekly Garbage Collection Day (reviewing recurring errors and codifying fixes into lints or tests) compounds agent reliability over time. Teams that skip it repeat the same mistakes indefinitely.
Harness design is a PM-shaped problem. Writing machine-readable acceptance criteria, defining reviewer sub-agent personas, and deciding what requires human approval are product decisions that directly multiply agent output.

The Claude Code Leak: A Blueprint Nobody Expected

On March 31, 2026, Anthropic shipped an npm update for @anthropic-ai/claude-code that accidentally included a full source map. The result: 512,000 lines of unobfuscated TypeScript across roughly 1,906 files, readable by anyone who pulled version 2.1.88 between 00:21 and 03:29 UTC. The cause was mundane: a missing .npmignore entry combined with a Bun runtime that generates source maps by default. The outcome was one of the most consequential accidental disclosures in AI history.

What the leak revealed matters less as a security story and more as a reference architecture. Researchers and engineers who combed through the code over the following days documented:

Roughly 40 discrete permission-gated tools, each sandboxed by a multi-layer permission model
A 46,000-line query engine handling all LLM API calls, streaming, retries, token caching, and context management
A three-layer hierarchical memory architecture explicitly designed to fight "context entropy" (the degradation of agent performance as a context window fills with irrelevant history)
Sub-agent spawning logic for delegating subtasks to parallel agents
Context compaction algorithms that compress long conversation histories without losing task-critical facts
A streaming agent loop with continuous feedback, not a fire-and-forget request model

The biggest takeaway is not any single feature. It is the proportion of code. The model call itself, the actual API request to Claude, is a tiny fraction of 512,000 lines. The overwhelming majority is infrastructure: how context gets assembled, how permissions are enforced, how memory is structured, how errors are caught, and how the agent is guided back on track when it drifts. The Register's coverage described it simply: Anthropic's real product is not the model, it is the harness.

That framing is the core insight. If you want to build production-quality agents, you do not start with the model. You start with the harness.
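
To make the "permission-gated tools" item from the list above concrete, here is a minimal sketch of what a gate in front of a tool call can look like. It is an illustration only: the names (ToolPermission, AgentTool, callTool) are hypothetical and are not taken from the leaked source, and a production permission model would have more layers than this single check.

```typescript
// Illustrative only: hypothetical names, not Claude Code's actual implementation.
type ToolPermission = "read_files" | "write_files" | "run_shell";

interface AgentTool {
  toolName: string;
  requires: ToolPermission[];
  run: (input: string) => string;
}

type ToolResult =
  | { ok: true; output: string }
  | { ok: false; reason: string };

// The harness, not the model, decides whether a requested tool call may run.
function callTool(
  tool: AgentTool,
  input: string,
  granted: ReadonlySet<ToolPermission>
): ToolResult {
  const missing = tool.requires.filter((p) => !granted.has(p));
  if (missing.length > 0) {
    // The denial is written for the agent to read, so it can change course.
    return {
      ok: false,
      reason:
        `Tool "${tool.toolName}" denied: missing permission(s) ${missing.join(", ")}. ` +
        "Ask the user for approval or choose a read-only tool.",
    };
  }
  return { ok: true, output: tool.run(input) };
}

const readFileTool: AgentTool = {
  toolName: "read_file",
  requires: ["read_files"],
  run: (path) => `contents of ${path}`, // stub instead of real file access
};

const shellTool: AgentTool = {
  toolName: "shell",
  requires: ["run_shell"],
  run: (cmd) => `ran ${cmd}`, // stub instead of a real shell
};

const grantedPermissions = new Set<ToolPermission>(["read_files"]);
const allowedCall = callTool(readFileTool, "README.md", grantedPermissions);
const deniedCall = callTool(shellTool, "rm -rf build", grantedPermissions);
console.log(allowedCall, deniedCall);
```

The point of the sketch is the shape, not the details: the model only proposes a call, and the infrastructure around it decides what actually executes and explains any refusal in terms the agent can act on.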

Ryan Lopopolo and the Token Billionaire Mindset

Just over two weeks after the leak, Ryan Lopopolo from OpenAI's Frontier Product Exploration group took the stage at AI Engineer Europe in London for a talk titled "Harness Engineering: How to Build Software When Humans Steer, Agents Execute." The talk landed like a practical companion to the Claude Code leak: not "here is what a big company built," but "here is the pattern, here is the anti-pattern, here is how to copy it in your own team."

Lopopolo's team ran a five-month experiment: build and ship a real internal product with zero manually written code. The result was a codebase over one million lines across 1,500-plus pull requests, every line authored, reviewed, and merged by agents. His framing for why this became possible: three things aligned in late 2025 and early 2026. The first was GPT-5.2's capability jump, the second was a significant improvement in harness sophistication, and the third was a breakthrough in long-horizon reliability, where agents could hold context and pursue a goal across hours rather than minutes.

His opening argument is worth quoting directly: "Code is free." The scarce resources are human time, human attention, model context window, and model-generated tokens. Once you internalize that, your job as a builder or PM stops being "what code should I write?" and starts being "what is the best possible context to surface to the model at the right moment?" That shift, from code author to harness designer, is what he calls the token billionaire mindset.

According to a Gartner analysis from 2025, 40% of enterprise applications were predicted to feature task-specific AI agents by 2026, up from under 5% in 2025. The gap between pilot and production tells you where the problem sits: plenty of teams have agents running in a demo, but far fewer have harnesses good enough to trust agents in production. Harness engineering is the bridge.

7 Harness Patterns You Can Copy This Week

Lopopolo shared concrete patterns his team uses daily. Martin Fowler's companion piece on harness engineering for coding agent users adds a research-backed framing around each. Here is what the combined picture looks like for a real team.

1. Deep Skills, Not Broad Ones

Lopopolo's team has 5 to 10 deep, well-specified skills rather than a long list of shallow ones. Each skill is a focused capability with clear input/output contracts, not a generic "do coding things" prompt. The harness surfaces the right skill in the right context, and the model has detailed guidance for each one.

The practical implication: if you have been building a sprawling library of agent prompts, stop. Pick the five workflows that matter most for your product and make each one bulletproof. Depth beats breadth for reliability.
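
One way to picture a "deep" skill with a clear input/output contract is as a typed object that carries its own guidance and its own checks. The sketch below is an assumption about shape, not Lopopolo's actual format: the Skill interface, the write-db-migration example, and its guidance lines are all hypothetical.

```typescript
// Hypothetical shape for a deep skill: a narrow capability with an explicit contract.
interface Skill<I, O> {
  name: string;
  whenToUse: string; // surfaced to the model so it picks the right skill
  guidance: string[]; // detailed, skill-specific instructions
  checkInput: (input: I) => string[]; // returns problems; empty means OK
  checkOutput: (output: O) => string[];
}

interface MigrationRequest {
  table: string;
  change: string;
}

interface MigrationResult {
  upSql: string;
  downSql: string;
}

// Example skill (invented for illustration).
const writeMigration: Skill<MigrationRequest, MigrationResult> = {
  name: "write-db-migration",
  whenToUse: "The task adds, removes, or alters a database column or table.",
  guidance: [
    "Every migration must be reversible.",
    "Describe the change in one sentence before writing SQL.",
  ],
  checkInput: (input) => (input.table.trim() === "" ? ["table is required"] : []),
  checkOutput: (output) => {
    const problems: string[] = [];
    if (output.upSql.trim() === "") problems.push("upSql is empty");
    if (output.downSql.trim() === "") {
      problems.push("downSql is empty: migrations must be reversible");
    }
    return problems;
  },
};

// An agent draft that forgot the rollback step fails the output contract.
const draftMigration: MigrationResult = {
  upSql: "ALTER TABLE users ADD COLUMN plan TEXT;",
  downSql: "",
};
const contractProblems = writeMigration.checkOutput(draftMigration);
console.log(contractProblems);
```

Five skills written at this level of detail give the harness something to verify. Fifty one-line prompts do not.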

2. Reviewer Sub-Agents on Every Push

Every PR in Lopopolo's system is reviewed by a set of sub-agents, each embodying a specific expert persona: a frontend architecture reviewer, a reliability engineering reviewer, a scalability reviewer. Each persona has its own set of criteria. Each can block a merge if those criteria fail.

The insight here is persona-shaped knowledge. You do not need the best frontend engineer on every PR. You need their knowledge codified once, then available everywhere. Write down what "good frontend code" looks like from the perspective of your best reviewer, put it in a reviewer sub-agent, and now every agent-written PR gets reviewed with that standard. Institutional knowledge stops living in one person's head. A complementary approach is to encode your expertise directly into the tools the model uses via custom MCP servers, so the knowledge is available across every session without needing to re-specify it.
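
The merge-blocking mechanic can be sketched in a few lines. This is a simplified stand-in under stated assumptions: in a real harness each criterion would be evaluated by a model call using the persona's prompt, whereas here the checks are deterministic functions so the example runs on its own. The persona names and criteria are hypothetical.

```typescript
interface ChangedFile {
  path: string;
  content: string;
}

interface PullRequest {
  title: string;
  files: ChangedFile[];
}

interface Criterion {
  description: string;
  // Deterministic stand-in; a real reviewer sub-agent would use a model call here.
  passes: (pr: PullRequest) => boolean;
}

interface ReviewerPersona {
  name: string;
  criteria: Criterion[];
}

interface Verdict {
  reviewer: string;
  blocking: string[]; // descriptions of failed criteria
}

function reviewPullRequest(pr: PullRequest, personas: ReviewerPersona[]): Verdict[] {
  return personas.map((persona) => ({
    reviewer: persona.name,
    blocking: persona.criteria.filter((c) => !c.passes(pr)).map((c) => c.description),
  }));
}

// Any single persona can block the merge.
function canMerge(verdicts: Verdict[]): boolean {
  return verdicts.every((v) => v.blocking.length === 0);
}

const reviewerPersonas: ReviewerPersona[] = [
  {
    name: "frontend-architecture",
    criteria: [
      {
        description: "Components do not use inline style objects.",
        passes: (pr) => pr.files.every((f) => f.content.indexOf("style={{") === -1),
      },
    ],
  },
  {
    name: "reliability",
    criteria: [
      {
        description: "No empty catch blocks; handle or rethrow the error.",
        passes: (pr) => pr.files.every((f) => !/catch\s*(\([^)]*\))?\s*\{\s*\}/.test(f.content)),
      },
    ],
  },
];

const examplePr: PullRequest = {
  title: "Add retry to sync job",
  files: [{ path: "src/sync.ts", content: "try { sync(); } catch (e) {}" }],
};

const verdicts = reviewPullRequest(examplePr, reviewerPersonas);
console.log(canMerge(verdicts), verdicts);
```

Notice that the criteria descriptions double as feedback: when the reliability persona blocks, the agent that wrote the PR gets a sentence it can act on.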

3. Custom Lints That Prompt-Correct the Agent

Standard lints tell a human developer "you made an error, go fix it." Harness-engineered lints tell the agent "you made an error, here is exactly how to fix it and why." Lopopolo's example: a lint that triggers when an agent uses unknown in TypeScript, which outputs not just a failure message but a full explanation: "We parse-don't-validate at the edge using Zod. Use the inferred Zod type instead of unknown here."

The lint failure becomes the prompt correction. The agent reads the error message, understands the pattern, fixes the code. The developer never sees the issue. This is one of the highest-leverage investments in a harness: every lint you write with a good error message is a teacher the agent never forgets.
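
Here is a rough sketch of that idea. The guidance string is the one quoted above; everything else is an assumption for illustration. In particular, this uses a naive line-by-line text match, while a real rule would more likely be written against the AST in your linter of choice.

```typescript
interface LintFinding {
  line: number; // 1-based line number
  message: string; // written for the agent, not just for a human
}

// The message explains the team convention, so the failure doubles as a prompt.
const UNKNOWN_GUIDANCE =
  "We parse-don't-validate at the edge using Zod. " +
  "Use the inferred Zod type instead of unknown here.";

// Naive check: flags `: unknown` annotations and `as unknown` casts.
// A text match like this can misfire inside strings or comments.
function lintNoUnknown(source: string): LintFinding[] {
  const findings: LintFinding[] = [];
  source.split("\n").forEach((text, index) => {
    if (/(:\s*|\bas\s+)unknown\b/.test(text)) {
      findings.push({ line: index + 1, message: UNKNOWN_GUIDANCE });
    }
  });
  return findings;
}

const agentWrittenCode = [
  "function handleWebhook(payload: unknown) {",
  "  return payload;",
  "}",
].join("\n");

for (const finding of lintNoUnknown(agentWrittenCode)) {
  console.log(`line ${finding.line}: ${finding.message}`);
}
```

The only difference from an ordinary lint is the message. "Unexpected unknown" tells the agent what it did; the version above tells it what to do instead and why.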

4. Tests About Code Structure

Beyond functional tests, Lopopolo's team writes tests about the structure of the codebase itself. One example: assert that no file exceeds 350 lines. Why 350? Because files larger than that start to fill too much of the context window when the agent needs to reason about them. The harness adapts the codebase to the model's constraints, not the other way around.

This is a mindset shift for most developers. Tests are not just about whether the code does the right thing. They are also about whether the code is shaped correctly for the agent to reason about it efficiently. If your agent keeps making mistakes in large files, write a test that fails when files get too long. Context-window pressure is an architecture problem, not a model problem.
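
A structural test like this is small. The sketch below assumes the files have already been read into memory (the SourceFile shape and function names are hypothetical); in a real repo you would feed it from your test runner after walking the source tree.

```typescript
const MAX_LINES_PER_FILE = 350; // the limit mentioned in the talk

interface SourceFile {
  path: string;
  content: string;
}

// Counts lines without treating a trailing newline as an extra line.
function countLines(content: string): number {
  if (content.length === 0) return 0;
  const parts = content.split("\n").length;
  return content.endsWith("\n") ? parts - 1 : parts;
}

// Returns one agent-readable message per file that breaks the limit.
function findOversizedFiles(
  files: SourceFile[],
  maxLines: number = MAX_LINES_PER_FILE
): string[] {
  return files
    .filter((file) => countLines(file.content) > maxLines)
    .map(
      (file) =>
        `${file.path} has ${countLines(file.content)} lines (limit ${maxLines}). ` +
        "Split it into smaller modules so it fits comfortably in context."
    );
}

// Synthetic files standing in for a real source tree.
const sampleFiles: SourceFile[] = [
  { path: "src/ok.ts", content: "x\n".repeat(350) },
  { path: "src/big.ts", content: "x\n".repeat(351) },
];

const sizeViolations = findOversizedFiles(sampleFiles);
console.log(sizeViolations);
```

As with the lint pattern, the failure message tells the agent what to do next, so a red test turns into a refactor rather than a stalled task.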

5. The "Continue" Rule

This is perhaps the most diagnostic pattern in Lopopolo's talk: "Every time you have to type 'continue' to the agent is a failure of the harness." If the agent stalls, asks for clarification, or stops mid-task, it means the harness did not give it enough context to finish. The agent is not lazy. The harness is incomplete.

Track every time you manually prompt an agent back to completion. Each one is a bug report on your harness. What context was missing? What ambiguity existed in the task specification? Fix those systematically, and the agent runs further and further without intervention.
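
Tracking this does not need special tooling. The sketch below assumes you already log the prompts a human sends mid-task and tag each one with a suspected cause during review; the Intervention shape, the cause categories, and the list of nudge phrases are all hypothetical choices for illustration.

```typescript
type InterventionCause =
  | "missing_context"
  | "ambiguous_spec"
  | "missing_permission"
  | "not_yet_triaged";

interface Intervention {
  taskId: string;
  prompt: string; // what the human typed mid-task
  cause: InterventionCause; // assigned by a person during review
}

// A nudge carries no new information; it only restarts a stalled agent.
function isNudge(prompt: string): boolean {
  return /^(continue|keep going|go on|proceed)[.!]?$/i.test(prompt.trim());
}

// Counts nudges per cause so the biggest harness gap is obvious.
function nudgesByCause(log: Intervention[]): Record<InterventionCause, number> {
  const counts: Record<InterventionCause, number> = {
    missing_context: 0,
    ambiguous_spec: 0,
    missing_permission: 0,
    not_yet_triaged: 0,
  };
  for (const entry of log) {
    if (isNudge(entry.prompt)) counts[entry.cause] += 1;
  }
  return counts;
}

const interventionLog: Intervention[] = [
  { taskId: "T-1", prompt: "continue", cause: "missing_context" },
  { taskId: "T-2", prompt: "Keep going.", cause: "ambiguous_spec" },
  { taskId: "T-2", prompt: "Use the v2 billing API instead", cause: "ambiguous_spec" },
  { taskId: "T-3", prompt: "continue", cause: "missing_context" },
];

const nudgeCounts = nudgesByCause(interventionLog);
console.log(nudgeCounts);
```

The third entry is deliberately not counted: it is a human adding information, which is a different signal from a bare "continue". Both are worth reviewing, but only the bare nudge is the pure harness failure Lopopolo describes.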

6. Garbage Collection Day

Lopopolo's team schedules "Garbage Collection Day" every Friday. The ritual: review every PR-blocking pattern, every repeated error, every recurring "slop" output from the week. For each one, root-cause it and codify a fix into a lint, a test, or a skill so it never recurs.

The compound effect is significant. After three months of weekly garbage collection, the agents on Lopopolo's team make different mistakes than they did on day one. The harness gets smarter every week. Teams that skip this step have agents that make the same mistakes indefinitely, because there is no feedback loop from output quality back into the harness.
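
The review itself is a human ritual, but picking what to review can be automated. A minimal sketch, assuming each blocked PR or repeated error is logged with a short signature string (the FailureRecord shape and the example signatures are hypothetical):

```typescript
interface FailureRecord {
  signature: string; // short label for the pattern, e.g. a lint id or reviewer criterion
  prId: number;
}

interface RecurringFailure {
  signature: string;
  count: number;
  prIds: number[];
}

// Groups the week's failures and keeps only the ones that repeated.
function recurringFailures(records: FailureRecord[], minCount: number = 2): RecurringFailure[] {
  const bySignature = new Map<string, number[]>();
  for (const record of records) {
    const prIds = bySignature.get(record.signature) ?? [];
    prIds.push(record.prId);
    bySignature.set(record.signature, prIds);
  }
  return Array.from(bySignature, ([signature, prIds]) => ({
    signature,
    count: prIds.length,
    prIds,
  }))
    .filter((failure) => failure.count >= minCount)
    .sort((a, b) => b.count - a.count); // most frequent first
}

const thisWeek: FailureRecord[] = [
  { signature: "empty-catch-block", prId: 101 },
  { signature: "file-too-long", prId: 102 },
  { signature: "missing-zod-parse", prId: 103 },
  { signature: "empty-catch-block", prId: 104 },
  { signature: "file-too-long", prId: 107 },
  { signature: "empty-catch-block", prId: 109 },
];

const fridayAgenda = recurringFailures(thisWeek);
console.log(fridayAgenda);
```

Each entry on that agenda should leave the meeting as a new lint, test, or skill update. One-off failures stay off the list until they repeat.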

7. Code as a Compiled Artifact

The final pattern is a conceptual one, but it shapes everything else. Lopopolo frames the codebase as "a compiled artifact of the spec," with the LLM acting as a fuzzy compiler. The harness is the static analysis layer and the optimization pass. If you want to switch models, you are effectively swapping the codegen backend, not rewriting your engineering practices.

The practical implication: write your specs, acceptance criteria, and lint rules as if they are the source of truth. The code is output. The spec is input. If the output is wrong, the problem is almost always in the source: the spec was ambiguous, the context was incomplete, or the harness let a bad pattern through.
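
The compiler analogy maps onto a small loop: the spec goes in, a swappable backend produces code, harness checks produce feedback, and the loop repeats until the checks pass. The sketch below is one reading of that analogy, not Lopopolo's implementation. The backend here is a fake function standing in for a model call, and all names are hypothetical.

```typescript
interface Spec {
  feature: string;
  acceptance: string[];
}

// Stand-in for a model call. Swapping models means swapping this function.
type CodegenBackend = (spec: Spec, feedback: string[]) => string;

// A harness check returns null on success or a correction message on failure.
type HarnessCheck = (code: string) => string | null;

interface CompileResult {
  code: string;
  attempts: number;
  passed: boolean;
}

function compileSpec(
  spec: Spec,
  backend: CodegenBackend,
  checks: HarnessCheck[],
  maxAttempts: number = 3
): CompileResult {
  let feedback: string[] = [];
  let code = "";
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    code = backend(spec, feedback);
    // Failed checks become the feedback for the next attempt.
    feedback = checks
      .map((check) => check(code))
      .filter((message): message is string => message !== null);
    if (feedback.length === 0) return { code, attempts: attempt, passed: true };
  }
  return { code, attempts: maxAttempts, passed: false };
}

// Fake backend: gets it wrong first, then corrects itself when given feedback.
const fakeBackend: CodegenBackend = (spec, feedback) =>
  feedback.length === 0
    ? `export function ${spec.feature}(input: any) { return input; }`
    : `export function ${spec.feature}(input: string) { return input; }`;

const noAnyCheck: HarnessCheck = (code) =>
  /:\s*any\b/.test(code) ? "Do not use any; declare a concrete type." : null;

const exampleSpec: Spec = { feature: "normalizeEmail", acceptance: ["Accepts a string"] };
const compileResult = compileSpec(exampleSpec, fakeBackend, [noAnyCheck]);
console.log(compileResult);
```

The spec and the checks stay the same whichever backend is plugged in, which is the sense in which the engineering practice survives a model swap.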

Martin Fowler, who published his own deep-dive on harness engineering at martinfowler.com in April 2026, sums up the purpose of a harness this way: "A well-built outer harness serves two goals: it increases the probability that the agent gets it right in the first place, and it provides a feedback loop that self-corrects as many issues as possible before they even reach human eyes."

Why Harness Engineering Is a PM-Shaped Problem

Here is the part of this topic that most "harness engineering" content glosses over: the hardest problems in harness design are not engineering problems. They are product problems.

Writing a lint rule is engineering. Deciding what the lint rule should enforce requires product judgment. Defining the acceptance criteria a reviewer sub-agent checks against is a product requirements document, just one that gets evaluated by a model instead of a human. Deciding what context the agent needs at each step is the work of someone who understands both the user's goal and the system's constraints. That person is usually a PM or a Head of Product.

Martin Fowler's article on harness engineering makes this explicit. The harness externalizes the implicit knowledge that experienced developers carry in their heads. But it also externalizes the implicit knowledge that experienced PMs carry: what "done" looks like, what edge cases matter, what quality bar is acceptable for a given feature. Both need to be codified in the harness before the agents can work reliably at scale. This is part of why the PM-engineer boundary is already blurring in AI-native teams: the people who can specify outcomes clearly enough for agents to act on them are increasingly the most valuable contributors to the harness.

The shift this creates for product teams is real. A PM who can write a clear, testable specification, who can define what a reviewer sub-agent should check, who can identify which recurring errors represent spec ambiguity rather than model failure, is a direct multiplier on the team's agent output. That skill set does not require writing TypeScript. It requires the ability to make implicit product knowledge explicit and structured.

Gartner's 2025 analysis, cited earlier, predicted that the share of enterprise applications featuring task-specific AI agents would grow from under 5% to 40% in a single year. Adoption at that pace means most product teams are reconfiguring how PMs and engineers collaborate right now. The teams where PMs actively participate in harness design are shipping faster and more reliably than the teams where the harness is treated as a pure engineering concern.

A few concrete ways PMs and Heads of Product can engage with harness design today:

Write acceptance criteria that an agent can evaluate. If your current acceptance criteria require human judgment to interpret, they are too vague for a reviewer sub-agent. The discipline of making them machine-readable also makes them better for human reviewers.
Participate in Garbage Collection Day. Most of the patterns that cause recurring agent failures trace back to ambiguous requirements or missing context in the spec. PMs are the ones who can fix those upstream.
Define what the agent should refuse. The permission model in any harness is a product decision, not just an engineering one. What can the agent do without human approval? What requires a human in the loop? That is a risk and product scope decision, and it belongs in the product spec before it becomes a lint rule.
Track "continue" events as a product metric. Every manual intervention to keep an agent running is a data point about where your specs are incomplete. That metric is as valuable as any other leading indicator of delivery velocity.
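
To show what "acceptance criteria that an agent can evaluate" can mean in practice, compare "the export should feel fast" with criteria written as checkable statements. The sketch below is hypothetical throughout: the Evidence fields, thresholds, and criterion IDs are invented for illustration, and a PM would define their own with engineering.

```typescript
// Measurements a test run or reviewer sub-agent could collect for a feature.
interface Evidence {
  p95LatencyMs: number;
  errorRatePercent: number;
  hasEmptyState: boolean;
}

interface AcceptanceCriterion {
  id: string;
  statement: string; // readable by humans and by a reviewer sub-agent
  isMet: (evidence: Evidence) => boolean;
}

// Hypothetical criteria for an "export" feature.
const exportCriteria: AcceptanceCriterion[] = [
  {
    id: "AC-1",
    statement: "p95 latency of the export endpoint is at most 800 ms.",
    isMet: (e) => e.p95LatencyMs <= 800,
  },
  {
    id: "AC-2",
    statement: "Error rate stays below 1% of export requests.",
    isMet: (e) => e.errorRatePercent < 1,
  },
  {
    id: "AC-3",
    statement: "The list view renders an empty state when there are no exports.",
    isMet: (e) => e.hasEmptyState,
  },
];

// Returns the criteria that failed, phrased so they can be fed back to the agent.
function unmetCriteria(criteria: AcceptanceCriterion[], evidence: Evidence): string[] {
  return criteria
    .filter((criterion) => !criterion.isMet(evidence))
    .map((criterion) => `${criterion.id}: ${criterion.statement}`);
}

const unmet = unmetCriteria(exportCriteria, {
  p95LatencyMs: 950,
  errorRatePercent: 0.4,
  hasEmptyState: true,
});
console.log(unmet);
```

The PM's contribution is the statement column and the thresholds. Once those exist, wiring them into a reviewer sub-agent or a test is routine engineering.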

Where Harness Engineering Goes From Here

The Claude Code leak and Lopopolo's keynote arrived just over two weeks apart, and that timing matters. One showed us what a mature harness looks like from the inside. The other showed us how to build one with a small team and compound it week over week. The result is that the question is no longer "should we use agents?" but "how good is our harness?"

Martin Fowler published his original harness engineering memo in early 2026 and followed it with a full article in April. OpenAI published a companion blog post alongside Lopopolo's talk. The vocabulary is solidifying. The discipline is forming. The teams that get ahead of it now will have harnesses that compound for months before their competitors realize what is happening.

If you want a structured way to develop these skills on your own product, the Emily agent course at Vibe Coding Academy covers the full agent architecture stack: from the first prompt to production-grade context management, permission models, and multi-agent coordination. It is the fastest path from "I have an agent that works in demos" to "I have a harness that works in production."

Get Your Team Building With Agents in 3 Sessions

The discipline of harness engineering is landing in product teams faster than most people expected. If you are a Head of Product or a PM leading a team that builds with AI agents, the question you should be asking right now is not "what model should we use?" It is "how well-specified is our harness, and who owns improving it?"

The teams winning in this environment are the ones treating harness design as a first-class product problem: writing specs that agents can evaluate, codifying reviewer personas from their best engineers and PMs, and running garbage collection on their agent output every week.

At Vibe Coding Academy, we run a Team Training program built specifically for this. In 3 live sessions of 90 minutes each, spread over 2 weeks, each PM on your team ships a real feature with an agent, builds at least one agent harness of their own, and receives personalized written feedback on their approach. The program is structured around the patterns in this post, applied to your actual codebase and product context. If you want your team working this way within the month, the full details are at vibecodingacademy.ai/claude-code-for-pms.

Harness engineering is not going to stay a niche topic for long. The Claude Code leak gave everyone a blueprint. Ryan Lopopolo gave everyone a playbook. The only variable now is who moves first.