The Model Is Not the Agent
When an AI model solves a complex task autonomously — browsing the web, writing code, running tests, fixing errors, and iterating until the output passes review — it is easy to credit the model. The model reasoned well. The model wrote good code. The model figured it out.
But almost always, a second system made that possible. It decided what context to give the model. It routed the model's output to the right tool. It checked whether the result was acceptable. It handled the errors. It ran the loop again when the first attempt failed.
That second system is the agent harness.
The harness is why the same model can fail at a task when called once and succeed at the same task when wrapped in the right scaffolding. It is also why, when researchers report benchmark gains without changing the model, they almost always changed the harness.
What an Agent Harness Is
An agent harness is the orchestration layer that sits between your AI model and the environment it needs to act in. It manages the full execution lifecycle of an agentic task:
1. Receives the goal or task
2. Prepares the context the model will see (relevant memory, prior steps, available tools)
3. Calls the model with a structured prompt
4. Parses the model's output: tool calls, text, decisions
5. Executes tool calls: runs code, calls APIs, reads files, searches the web
6. Captures the results and feeds them back to the model
7. Checks a verification criterion: did the task succeed? did tests pass?
8. Loops back if not done, or exits if done or if the iteration limit is hit
9. Handles failures: timeouts, API errors, model refusals, unexpected output formats
10. Returns the final result to whatever called the harness
Without the harness, you have step 3 and step 4 only — a single prompt and a single response. The harness is what turns a language model into an agent.
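As a rough illustration of that lifecycle, here is a minimal sketch in Python. Everything in it is an assumption made for the example: `run_harness`, `fake_model`, and `calculator` are hypothetical names, and the "model" is a hard-coded stand-in rather than a real API call.

```python
from typing import Callable

def run_harness(
    goal: str,
    call_model: Callable[[str], str],
    run_tool: Callable[[str], str],
    verify: Callable[[str], bool],
    max_iterations: int = 5,
) -> dict:
    """Run the model/tool/verify loop until verification passes or the limit is hit."""
    history: list[str] = []
    for iteration in range(1, max_iterations + 1):
        # Prepare context: the goal plus everything observed so far.
        prompt = "\n".join([f"GOAL: {goal}", *history])
        # Call the model and treat its output as a tool request.
        action = call_model(prompt)
        # Execute the tool call; a tool failure becomes feedback, not a crash.
        try:
            result = run_tool(action)
        except Exception as exc:
            result = f"ERROR: {exc}"
        history.append(f"ACTION: {action}\nRESULT: {result}")
        # Check the verification criterion.
        if verify(result):
            return {"status": "success", "iterations": iteration, "result": result}
    return {"status": "iteration_limit", "iterations": max_iterations, "result": None}


# Toy stand-ins (hypothetical): the "model" proposes an expression, the "tool" evaluates it.
def fake_model(prompt: str) -> str:
    return "6 * 7" if "RESULT: 13" in prompt else "6 + 7"

def calculator(expression: str) -> str:
    left, op, right = expression.split()
    return str(int(left) * int(right) if op == "*" else int(left) + int(right))

outcome = run_harness("produce 42", fake_model, calculator, lambda r: r == "42")
print(outcome)  # {'status': 'success', 'iterations': 2, 'result': '42'}
```

The first attempt fails verification, the failed result is fed back into the next prompt, and the second attempt passes. A single call to `fake_model` alone would have stopped at the wrong answer.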
Why the Harness Matters More Than You Think
In 2026, the most striking evidence for harness importance comes from benchmarks. LangChain's Deep Agents team reported significant gains on Terminal-Bench 2.0 using the same underlying model; only the harness changed. The improvement came from the scaffolding around the model: how context was assembled, how tool outputs were formatted, and how retries were managed.
This is not an isolated finding. It is the pattern:
Better harness on the same model > same harness on a better model — in many real-world tasks.
The reason is structural. The model only sees what the harness gives it. If the harness gives the model noisy context, the model produces noisy output. If the harness truncates relevant information to fit a context window, the model reasons from an incomplete picture. If the harness has no verification step, the model has no signal that it was wrong. The model cannot compensate for harness failures with capability alone.
The Core Components
1. Task Definition Layer
The entry point. The harness receives a goal (sometimes called an objective, spec, or task) and converts it into the first prompt the model sees. Good task definitions:
- State the success criterion explicitly ("the function should pass all unit tests")
- Provide available tools and their schemas
- Specify constraints (budget, time limit, files that are off-limits)
- Include relevant context without noise
The task definition layer is where loop engineering starts — you define the exit condition before the loop begins.
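One way to make a task definition concrete is a small data structure that the harness renders into the first prompt. The sketch below is illustrative only; `TaskSpec`, its fields, and `build_initial_prompt` are hypothetical names, and the tool descriptions are plain strings rather than real tool schemas.

```python
from dataclasses import dataclass, field

@dataclass
class TaskSpec:
    goal: str
    success_criterion: str                                 # the exit condition, stated up front
    tools: dict[str, str] = field(default_factory=dict)    # tool name -> one-line description
    constraints: list[str] = field(default_factory=list)
    max_iterations: int = 10

def build_initial_prompt(spec: TaskSpec) -> str:
    """Render the spec into the first prompt the model sees."""
    lines = [f"Goal: {spec.goal}", f"Done when: {spec.success_criterion}"]
    if spec.tools:
        lines.append("Available tools:")
        lines += [f"- {name}: {desc}" for name, desc in spec.tools.items()]
    if spec.constraints:
        lines.append("Constraints:")
        lines += [f"- {c}" for c in spec.constraints]
    return "\n".join(lines)

spec = TaskSpec(
    goal="Fix the failing date parser",
    success_criterion="all unit tests pass",
    tools={"run_tests": "run the test suite and return its output"},
    constraints=["do not edit files under tests/"],
)
print(build_initial_prompt(spec))
```

Making `success_criterion` a required field means a task cannot be created without an exit condition.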
2. Context / Memory Manager
The model has a context window. The harness decides what fills it.
For short tasks, this is simple: put the task and prior tool outputs in the prompt. For long tasks spanning many tool calls or long documents, the harness must:
- Summarise earlier steps rather than including their full output
- Retrieve relevant memory from a store rather than keeping everything in context
- Prioritise recent results over older ones
- Chunk large tool outputs and include only the relevant sections
Poor context management is among the most common causes of harness failure on long tasks. The model loses track of the goal, repeats steps it already completed, or starts contradicting its own prior work.
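A minimal sketch of these ideas, assuming a character budget as a crude stand-in for a token budget. The function name `build_context` and the first-line "summary" are placeholders for the example; a real harness might summarise with a model call or retrieve from a memory store instead.

```python
def build_context(goal: str, steps: list[str], budget_chars: int = 400, keep_recent: int = 2) -> str:
    """Keep the goal and the newest steps verbatim; shrink older steps to one line each."""
    split = max(len(steps) - keep_recent, 0)
    older, recent = steps[:split], steps[split:]
    # Placeholder summariser: first line only, truncated to 60 characters.
    summaries = [f"(earlier) {s.split(chr(10), 1)[0][:60]}" for s in older]
    parts = [f"GOAL: {goal}", *summaries, *recent]
    # If still over budget, drop the oldest summaries first; never drop the goal
    # or the recent steps (so the result can exceed the budget if those are large).
    while len("\n".join(parts)) > budget_chars and len(parts) > 1 + len(recent):
        parts.pop(1)
    return "\n".join(parts)

steps = [f"step {i}: " + "x" * 200 for i in range(1, 6)]
context = build_context("refactor module", steps, budget_chars=600)
print(len(context))
print(context[:120])
```

The goal always stays at the top of the context. That is a cheap guard against the model losing track of it as older steps are compressed or dropped.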
3. Tool Execution Layer
The harness calls tools on behalf of the model. This includes:
- Code execution — running Python, bash, or other code and capturing stdout/stderr
- File operations — reading, writing, listing directory contents
- API calls — web search, database queries, external services
- Browser interaction — navigation, clicking, form submission
- Sub-agent calls — spawning another model call for a specialised subtask
The tool layer is responsible for sandboxing (ensuring tool calls can't cause unintended damage), timeout handling (a hanging subprocess shouldn't freeze the whole harness), and output normalisation (converting raw tool results into a format the model can use).
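Timeout handling and output normalisation can be sketched with Python's standard `subprocess` module. The wrapper name `run_command` and the shape of the returned dictionary are assumptions for this example. Running without a shell and bounding time is not sandboxing; real isolation (containers, restricted users, and so on) is out of scope here.

```python
import subprocess
import sys

def run_command(argv: list[str], timeout_s: float = 10.0, max_output_chars: int = 2000) -> dict:
    """Run a command without a shell, bounded in time, and normalise the result."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left hanging.
        return {"ok": False, "exit_code": None, "stdout": "", "stderr": f"timed out after {timeout_s}s"}
    except OSError as exc:  # for example, the executable does not exist
        return {"ok": False, "exit_code": None, "stdout": "", "stderr": str(exc)}
    return {
        "ok": proc.returncode == 0,
        "exit_code": proc.returncode,
        # Truncate so one noisy tool call cannot flood the model's context.
        "stdout": proc.stdout[:max_output_chars],
        "stderr": proc.stderr[:max_output_chars],
    }

print(run_command([sys.executable, "-c", "print('hello')"]))
print(run_command([sys.executable, "-c", "import time; time.sleep(5)"], timeout_s=0.5))
```

Every outcome, including a timeout or a missing executable, comes back in the same shape, so the code that feeds results to the model needs no special cases.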
4. Loop Controller
The harness decides when to call the model again and when to stop.
Iteration triggers:
- Tool call completed — feed results back for the next model call
- Model produced a plan but hasn't acted yet — prompt it to execute
- Verification failed — prompt it to correct the error
Exit conditions:
- Verification passes (tests green, spec met, review approved)
- Maximum iteration count reached
- Token budget exhausted
- Model explicitly signals completion
The loop controller is where the "agent-ness" lives. A model without a loop controller isn't an agent — it's an API call.
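The exit conditions above can be made explicit in code so that every run ends with a named reason. This is a sketch under assumptions: `ExitReason`, `control_loop`, and the `step` callback (which stands in for one model call plus tool execution) are hypothetical, and token counts are whatever the caller reports.

```python
from enum import Enum
from typing import Callable

class ExitReason(Enum):
    VERIFIED = "verified"
    MAX_ITERATIONS = "max_iterations"
    TOKEN_BUDGET = "token_budget"
    MODEL_DONE = "model_signalled_done"

def control_loop(
    step: Callable[[int], dict],
    max_iterations: int = 10,
    token_budget: int = 10_000,
) -> tuple[ExitReason, int]:
    """Call `step` until an exit condition fires. Returns (reason, iterations used).

    `step(i)` must return {"tokens": int, "verified": bool, "model_done": bool}.
    """
    tokens_used = 0
    for i in range(1, max_iterations + 1):
        outcome = step(i)
        tokens_used += outcome["tokens"]
        if outcome["verified"]:
            return ExitReason.VERIFIED, i
        if outcome["model_done"]:
            # The model claims completion without passing verification; report that distinctly.
            return ExitReason.MODEL_DONE, i
        if tokens_used >= token_budget:
            return ExitReason.TOKEN_BUDGET, i
    return ExitReason.MAX_ITERATIONS, max_iterations

reason, used = control_loop(lambda i: {"tokens": 500, "verified": i == 3, "model_done": False})
print(reason, used)  # ExitReason.VERIFIED 3
```

Returning the reason alongside the iteration count lets the caller tell "finished" from "gave up", which matters when the result is passed to a failure handler.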
5. Verification Layer
The most important component and the one most often skipped.
The verification layer checks whether the task is actually done. A good verification check is:
- Deterministic — produces the same result given the same input
- Cheap — doesn't cost significant tokens or time
- Meaningful — actually tests the success criterion, not a proxy
Examples of strong verification:
- Run the test suite. All tests pass = done.
- Compile the code. No errors = done.
- Call the API. Returns 200 = done.
- Diff the output against the spec. Zero diff = done.
Examples of weak verification:
- Ask the model "does this look right?" — this is expensive and unreliable
- Check that the model said "done" — models say "done" when they're not done
- Check that output is non-empty — trivially satisfied
Loop engineering is essentially the practice of designing good verification layers and connecting them to loop controllers.
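To illustrate what a deterministic, cheap, and meaningful check might look like, here is a sketch of two verifiers for generated Python code. The function names and the input/output case format are assumptions for the example. Note that `exec` runs the candidate in-process, so a real harness would run this step inside a sandbox.

```python
def verify_compiles(source: str) -> tuple[bool, str]:
    """Does the candidate parse as Python? Deterministic and cheap, but only a syntax check."""
    try:
        compile(source, "<candidate>", "exec")
    except SyntaxError as exc:
        return False, f"SyntaxError on line {exc.lineno}: {exc.msg}"
    return True, "ok"

def verify_behaviour(source: str, func_name: str, cases: list[tuple[tuple, object]]) -> tuple[bool, str]:
    """Run the candidate against known input/output pairs and report the first mismatch."""
    ok, message = verify_compiles(source)
    if not ok:
        return False, message
    namespace: dict = {}
    try:
        exec(source, namespace)  # untrusted code: sandbox this in a real harness
        for args, expected in cases:
            actual = namespace[func_name](*args)
            if actual != expected:
                return False, f"{func_name}{args} returned {actual!r}, expected {expected!r}"
    except Exception as exc:
        return False, f"{type(exc).__name__}: {exc}"
    return True, "ok"

cases = [((2, 3), 5), ((-1, 1), 0)]
print(verify_behaviour("def add(a, b): return a - b", "add", cases))
# (False, 'add(2, 3) returned -1, expected 5')
print(verify_behaviour("def add(a, b): return a + b", "add", cases))
# (True, 'ok')
```

The failure message is specific enough to feed straight back to the model as the correction prompt for the next iteration.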
6. Failure Handler / Exit Escalation
What happens when the loop can't converge? The harness needs explicit handling for:
- Hard exits: maximum iterations reached, token budget exhausted — return partial result with error state
- Unrecoverable errors: tool call returns an error the model can't fix — escalate to human or fail gracefully
- Model refusals: model declines to perform a step — log, try an alternative phrasing, or exit
- Output format failures: model produces output that doesn't parse — retry with a corrected format instruction
Without explicit failure handling, harnesses fail in opaque ways: infinite loops, silent partial results, or crashes that surface as confusing downstream errors.
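As one example, output format failures can be handled by retrying with a corrective instruction and returning an explicit error state when retries run out. The sketch below assumes the model is asked for a JSON object with a `"tool"` key. That convention, the function name `get_tool_call`, and the retry wording are all hypothetical.

```python
import json
from typing import Callable

def get_tool_call(call_model: Callable[[str], str], prompt: str, max_attempts: int = 3) -> dict:
    """Ask for a JSON tool call; on a parse failure, retry with a corrective instruction."""
    errors: list[str] = []
    current_prompt = prompt
    for attempt in range(1, max_attempts + 1):
        raw = call_model(current_prompt)
        try:
            parsed = json.loads(raw)
            if not isinstance(parsed, dict) or "tool" not in parsed:
                raise ValueError('expected a JSON object with a "tool" key')
            return {"status": "ok", "call": parsed, "attempts": attempt}
        except ValueError as exc:  # json.JSONDecodeError is a subclass of ValueError
            errors.append(str(exc))
            current_prompt = (
                f"{prompt}\n\nYour previous reply could not be parsed ({exc}). "
                'Reply with only a JSON object such as {"tool": "name", "args": {}}.'
            )
    # Explicit error state instead of a crash or a silent partial result.
    return {"status": "format_failure", "errors": errors, "attempts": max_attempts}

replies = iter(["Sure! I'll run the tests.", '{"tool": "run_tests", "args": {}}'])
result = get_tool_call(lambda p: next(replies), "Choose the next tool.")
print(result)
```

The same shape extends to the other failure classes: each one maps to a named status the caller can branch on, rather than an exception that surfaces somewhere downstream.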
Harness Patterns in the Wild
The Simple Retry Loop
The most basic harness: call the model, run the verification, loop if it fails.
goal → model call → tool execution → verify → exit (if pass)
       ↑                               │
       └───────────────────────────────┘ (if fail, retry)
This is what Claude Code's /loop command implements. It works well for tasks with fast, cheap verification (test suites, lint checks).
The Plan-Then-Execute Pattern
The model first generates a plan (a list of steps), then the harness executes each step in sequence, calling the model for each one.
goal → model (plan) → [step 1 → model → tool] → [step 2 → model → tool] → verify → exit
Used in agentic coding workflows where the task is complex enough to benefit from explicit decomposition.
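A minimal sketch of this pattern follows, with the planner, the per-step model, and the tool all passed in as plain callables. The names and signatures are assumptions for illustration; the lambdas in the demo are toy stand-ins, not real model calls.

```python
from typing import Callable

def plan_then_execute(
    goal: str,
    plan_model: Callable[[str], list[str]],
    step_model: Callable[[str, list[str]], str],
    run_tool: Callable[[str], str],
    verify: Callable[[list[str]], bool],
) -> dict:
    """One planning call, then one model call and one tool call per planned step."""
    plan = plan_model(goal)
    results: list[str] = []
    for step in plan:
        # The step model sees the current step plus the results of earlier steps.
        action = step_model(step, results)
        results.append(run_tool(action))
    return {"plan": plan, "results": results, "verified": verify(results)}

outcome = plan_then_execute(
    "greet and shout",
    plan_model=lambda goal: ["write greeting", "uppercase it"],
    step_model=lambda step, prior: "hello" if not prior else prior[-1].upper(),
    run_tool=lambda action: action,
    verify=lambda results: results[-1] == "HELLO",
)
print(outcome)
```

This version verifies once at the end. A variant could verify after each step and re-plan on failure, at the cost of more model calls.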
The Multi-Agent Harness
The harness coordinates multiple model calls in parallel or in sequence, each specialised for a subtask. A coordinator model routes work to specialist agents (coder, reviewer, tester, documenter) and aggregates results.
coordinator model
├── coder agent → code
├── reviewer agent → review
└── tester agent → test results
→ aggregate → verify → exit
This pattern is described in the Anthropic managed agents architecture. Some also see multi-agent collectives as a possible pathway to ASI, though that remains speculative.
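The coordination shape in the diagram can be sketched with a thread pool: the coder runs first, then the reviewer and tester run in parallel on its output, and the coordinator aggregates. This is an illustrative sketch only. The `coordinate` function and the three agent callables are hypothetical stand-ins, and it does not represent any particular vendor's architecture.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def coordinate(
    task: str,
    coder: Callable[[str], str],
    reviewer: Callable[[str], str],
    tester: Callable[[str], bool],
) -> dict:
    """Run the coder, then the reviewer and tester in parallel, then aggregate."""
    code = coder(task)
    with ThreadPoolExecutor(max_workers=2) as pool:
        review_future = pool.submit(reviewer, code)
        test_future = pool.submit(tester, code)
        review, tests_passed = review_future.result(), test_future.result()
    # Aggregate: the task counts as verified only if both specialists agree.
    verified = tests_passed and review == "approve"
    return {"code": code, "review": review, "tests_passed": tests_passed, "verified": verified}

outcome = coordinate(
    "implement add",
    coder=lambda task: "def add(a, b): return a + b",
    reviewer=lambda code: "approve" if "return" in code else "request changes",
    tester=lambda code: "a + b" in code,
)
print(outcome)
```

Threads suit this shape when each agent is an I/O-bound API call. The aggregation rule (both must agree) is one choice among several.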
The Meta-Harness (Self-Improving Loop)
A harness that can modify itself — updating its own tool list, memory schema, or verification criteria based on what worked and what didn't. This is what the self-harness research explores and what Matt Pocock cautioned against when it applies to auto-generated CLAUDE.md instructions.
Harness vs Framework vs Agent Platform
| | Custom Harness | Framework (LangChain, LangGraph) | Agent Platform (Claude Code, Devin) |
|---|---|---|---|
| What it is | Code you write from scratch | Library of harness components | Fully built harness with UI/CLI |
| Flexibility | Maximum | High (configurable) | Low (fixed patterns) |
| Time to first run | Days–weeks | Hours–days | Minutes |
| Best for | Unique verification logic, specific domains | Standard agentic patterns | Common dev tasks |
| Maintenance | Full ownership | Framework updates | Platform handles it |
The choice depends on how standard your task is. The more your task looks like "write code, run tests, fix until green," the more an existing platform handles it. The more you need custom verification, unusual tool combinations, or specific orchestration logic, the more you want a custom harness.
What the Harness Doesn't Do
Clarifying the boundary:
- The harness does not decide what the goal is — that is the task definition you provide
- The harness does not reason about the problem — that is the model
- The harness does not guarantee the model's output is correct — that is what the verification layer checks
- The harness does not improve the model's capability — it shapes what the model is asked to do and how many attempts it gets
A common misconception is that a good harness compensates for a weak model. It doesn't — it extracts more of what the model is capable of. There is a floor: if the model genuinely cannot solve the problem even with unlimited retries and perfect context, the harness cannot fix that.
Building Your First Harness
If you are building a harness for the first time, the sequence that works:
1. Define the success criterion first: what does "done" look like in machine-readable terms?
2. Write the verification check: can you test it independently before the loop exists?
3. Build the simplest loop: call model, run tools, check verification, repeat
4. Add a hard exit: maximum iterations, token budget, or time limit
5. Add context management: start with full context; only add summarisation when you hit window limits
6. Add failure handling: what happens when tools error? when the model refuses?
7. Instrument it: log iteration count, token usage, tool call results per iteration
Do not add planning layers, parallel execution, or multi-agent orchestration until the simple loop works reliably. Complexity in harnesses compounds — a subtle bug in a simple harness is easy to find; the same bug inside a planning layer inside a multi-agent system is not.
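For the instrumentation step in the list above, a small per-iteration log is often enough to start with. The sketch below is one possible shape; `IterationRecord`, `RunLog`, and the field names are hypothetical, and the token counts in the demo are made-up values.

```python
import json
import time
from dataclasses import asdict, dataclass

@dataclass
class IterationRecord:
    iteration: int
    tokens_used: int
    tool_name: str
    tool_ok: bool
    verified: bool
    elapsed_s: float

class RunLog:
    """Collects one record per loop iteration and serialises them as JSON lines."""

    def __init__(self) -> None:
        self.records: list[IterationRecord] = []
        self._started = time.monotonic()

    def record(self, iteration: int, tokens_used: int, tool_name: str, tool_ok: bool, verified: bool) -> None:
        elapsed = round(time.monotonic() - self._started, 3)
        self.records.append(IterationRecord(iteration, tokens_used, tool_name, tool_ok, verified, elapsed))

    def to_jsonl(self) -> str:
        return "\n".join(json.dumps(asdict(r)) for r in self.records)

    def total_tokens(self) -> int:
        return sum(r.tokens_used for r in self.records)

log = RunLog()
log.record(1, tokens_used=812, tool_name="run_tests", tool_ok=True, verified=False)
log.record(2, tokens_used=640, tool_name="run_tests", tool_ok=True, verified=True)
print(log.to_jsonl())
print("total tokens:", log.total_tokens())  # total tokens: 1452
```

With one line per iteration, it is easy to see where a run stalled: which tool failed, how many tokens each attempt cost, and on which iteration verification finally passed.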