What is Self-Harness?

Self-Harness is a new paradigm where LLM-based agents autonomously improve their own operating harnesses without relying on human engineers or stronger external models. It uses a three-stage iterative loop—Weakness Mining, Harness Proposal, and Proposal Validation—to identify model-specific failures and generate minimal, targeted harness modifications that address them.

How much does Self-Harness improve agent performance?

On Terminal-Bench 2.0, Self-Harness achieved substantial improvements across three diverse base models: MiniMax M2.5 improved from 40.5% to 61.9% (+21.4 points), Qwen3.5-35B-A3B from 23.8% to 38.1% (+14.3 points), and GLM-5 from 42.9% to 57.1% (+14.2 points). These gains came from harness-only modifications, with the base models held constant.

How does Self-Harness differ from traditional harness engineering?

Traditional harness engineering relies on human experts manually analyzing failures and designing improvements, which doesn't scale as models become more diverse. Self-Harness automates this process entirely—the agent itself identifies weaknesses, proposes fixes, and validates them through regression testing, without requiring human intervention or stronger external models.

What models were tested with Self-Harness?

The paper tested Self-Harness on three diverse base models from different families: MiniMax M2.5, Qwen3.5-35B-A3B, and GLM-5. These models were chosen to demonstrate that Self-Harness works across different model architectures and capabilities, not just on a single model family.

Self-Harness: AI Agents That Autonomously Improve Their Own Framework | explainx.ai Blog

What are the three stages of Self-Harness?

Self-Harness operates in three stages: (1) Weakness Mining—analyzes execution traces to identify model-specific failure patterns, (2) Harness Proposal—generates diverse yet minimal harness modifications tied to discovered weaknesses, and (3) Proposal Validation—accepts candidate edits only after regression testing to ensure improvements don't break existing capabilities.

TL;DR: Published June 8, 2026 on arXiv, "Self-Harness: Harnesses That Improve Themselves" introduces a paradigm where LLM-based agents autonomously optimize their own operating frameworks without human engineers or stronger external models. Using a three-stage loop (Weakness Mining, Harness Proposal, Proposal Validation), Self-Harness achieved consistent improvements on Terminal-Bench 2.0: MiniMax M2.5 rose from 40.5% to 61.9% (+21.4 points, a 52.8% relative gain), Qwen3.5-35B-A3B from 23.8% to 38.1% (+14.3 points, 60.1% relative), and GLM-5 from 42.9% to 57.1% (+14.2 points, 33.1% relative). The results show that agents can turn model-specific weaknesses into concrete, executable harness improvements.

The Harness Engineering Problem

The performance of LLM-based agents is jointly shaped by two critical factors:

Base Model Capabilities — The underlying LLM's reasoning, coding, and knowledge
Agent Harness — The scaffolding that mediates interaction with the environment

While much attention focuses on improving base models, recent evidence shows that harness engineering can yield 10-15 point improvements on benchmarks while keeping the base model fixed.

The Scaling Problem with Human-Designed Harnesses

Current State:

Harnesses are largely engineered by human experts
Effective harness design is inherently model-specific
Different models exhibit distinct failure patterns and behaviors
Modern LLMs are increasingly diverse and rapidly evolving

Why This Doesn't Scale:

mermaid

graph TD
    A[New Model Released] --> B[Human Engineers Analyze]
    B --> C[Design Model-Specific Harness]
    C --> D[Manual Testing & Iteration]
    D --> E{Performance OK?}
    E -->|No| B
    E -->|Yes| F[Deploy]
    G[Another Model Released] --> B
    style B fill:#ff6b6b
    style C fill:#ff6b6b
    style D fill:#ff6b6b

The Bottleneck: Human engineers can't keep pace with model diversity and evolution. Each new model family (GPT, Claude, Gemini, Qwen, GLM, MiniMax, etc.) exhibits unique behaviors requiring custom harness design.

Introducing Self-Harness: Agents That Fix Themselves

Core Innovation: What if the agent could improve its own harness, without relying on human engineers or stronger external models?

The Self-Harness Paradigm

Definition: Self-Harness is an iterative framework where an LLM-based agent autonomously:

Identifies its own failure patterns through execution trace analysis
Proposes targeted modifications to its operating harness
Validates improvements through regression testing
Iterates until performance converges

Key Insight: Instead of human experts manually engineering model-specific fixes, the model itself discovers and implements what it needs to succeed.

The Three-Stage Self-Harness Loop

Architecture Overview

mermaid

graph LR
    A[Execution Traces] --> B[Stage 1: Weakness Mining]
    B --> C[Identified Failure Patterns]
    C --> D[Stage 2: Harness Proposal]
    D --> E[Candidate Harness Modifications]
    E --> F[Stage 3: Proposal Validation]
    F --> G{Regression Tests Pass?}
    G -->|Yes| H[Accept Changes]
    G -->|No| I[Reject Changes]
    H --> J[Updated Harness]
    I --> D
    J --> K[Run Benchmark Tasks]
    K --> A

Stage 1: Weakness Mining

Purpose: Identify model-specific failure patterns from execution traces.

Process:

1. Collect Execution Traces. Run the agent on benchmark tasks, capturing:
   - Tool calls made
   - Responses received
   - Reasoning steps
   - Terminal output
   - Error messages
   - Success/failure status
2. Pattern Analysis. The agent analyzes its own traces to discover:
   - Recurring error types
   - Missing error handling
   - Inefficient tool usage patterns
   - Context management failures
   - Planning mistakes
   - Stuck loops or infinite retries
3. Failure Categorization. Weaknesses are grouped by type:
   - Tool Selection Errors: Wrong tool for the task
   - Context Errors: Lost or misinterpreted information
   - Planning Errors: Poor task decomposition
   - Error Handling: Failed to recover from failures
   - Verification Gaps: Didn't validate assumptions
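
The grouping step above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the Trace fields, the error_signature normalization, and the sample traces are all hypothetical.

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Trace:
    """Hypothetical record of one agent run (field names are assumptions)."""
    task_id: str
    success: bool
    error: str = ""


def error_signature(message: str) -> str:
    """Normalize an error message so similar failures share one key."""
    text = message.lower()
    text = re.sub(r"(/[\w.\-]+)+", "<path>", text)  # collapse file paths
    text = re.sub(r"\d+", "<n>", text)              # collapse numbers
    return text.strip()


def group_failures(traces):
    """Map each error signature to the task IDs that failed with it."""
    groups = defaultdict(list)
    for trace in traces:
        if not trace.success:
            groups[error_signature(trace.error)].append(trace.task_id)
    return dict(groups)


# Toy traces, invented for illustration
traces = [
    Trace("task-23", False, "fatal: unable to auto-detect email address"),
    Trace("task-45", False, "fatal: unable to auto-detect email address"),
    Trace("task-12", False, "cat: /app/config.json: No such file or directory"),
    Trace("task-07", True),
]
for signature, task_ids in group_failures(traces).items():
    print(f"{len(task_ids)} failure(s): {signature} -> {task_ids}")
```

Exact-match grouping like this only catches failures with near-identical messages. Fuzzier patterns, such as planning mistakes, would need the model-driven analysis the article describes.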

Example Weakness Discovery:

snippet

Weakness ID: W-042
Pattern: Agent frequently fails git operations by forgetting to configure user.name
Frequency: 12 failures across 89 tasks
Impact: Blocks commit-related tasks
Category: Tool prerequisite missing

Stage 2: Harness Proposal

Purpose: Generate diverse yet minimal harness modifications tied to discovered weaknesses.

Design Principles:

✅ Targeted: Each proposal addresses a specific weakness
✅ Minimal: Small, focused changes rather than large rewrites
✅ Diverse: Generate multiple candidate solutions per weakness
✅ Testable: Changes must be verifiable through benchmark tasks

Proposal Types:

1. System Prompt Modifications

diff

# Before
You are an AI agent with access to terminal commands.

# After (Self-Harness Proposal)
You are an AI agent with access to terminal commands.
+ Before running git commit, always verify git user.name and user.email are configured.
+ If not set, configure them using: git config user.name "Agent" && git config user.email "agent@localhost"

2. Tool Wrapper Additions

python

# Self-Harness proposes wrapping git commands
import subprocess

def execute_git_command(cmd):
    # Ensure git is configured before any commit operation
    # (check_git_config is a helper assumed to set user.name and user.email)
    if "commit" in cmd:
        check_git_config()
    return subprocess.run(cmd, shell=True)

3. Validation Step Injection

python

# Self-Harness proposes adding verification after file operations
import os

def create_file(path, content):
    # write_file and read_file are the harness's existing file helpers
    write_file(path, content)
    # Validate file was created successfully
    if not os.path.exists(path):
        raise FileNotFoundError(f"Failed to create {path}")
    # Validate content matches
    if read_file(path) != content:
        raise ValueError("File content mismatch")

4. Planning Template Updates

diff

# Before
Plan: {steps}

# After (Self-Harness Proposal)
Plan:
+ 1. Verify prerequisites (dependencies, configs, permissions)
{steps}
+ N+1. Verify expected outcomes
+ N+2. Clean up temporary resources

Diversity Mechanism: For each weakness, Self-Harness generates 3-5 candidate proposals using different approaches:

Prompt-based guidance
Tool wrapper modifications
Validation additions
Planning template changes
Error recovery procedures

Stage 3: Proposal Validation

Purpose: Accept candidate edits only after regression testing to prevent breaking existing capabilities.

Validation Pipeline:

1. Held-Out Test Set

snippet

Training Set: 70% of Terminal-Bench 2.0 tasks (62 tasks)
Validation Set: 30% held-out tasks (27 tasks)
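
A 70/30 split like the one above can be made reproducible with a seeded shuffle. The sketch below is an assumption about how such a split might be done, not the paper's procedure; the task IDs are placeholders.

```python
import random


def split_tasks(task_ids, train_ratio=0.7, seed=0):
    """Shuffle deterministically, then cut into train and validation sets."""
    shuffled = list(task_ids)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


task_ids = [f"task-{i:02d}" for i in range(89)]  # placeholder IDs
train, val = split_tasks(task_ids)
print(len(train), len(val))  # 62 27
```

Fixing the seed keeps the held-out set identical across iterations, so a proposal is never validated on tasks it was mined from.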

2. Regression Testing

python

def validate_proposal(current_harness, proposed_harness, tasks):
    baseline_results = run_benchmark(current_harness, tasks)
    proposal_results = run_benchmark(proposed_harness, tasks)

    # Accept only if:
    # 1. No regression on previously passing tasks
    # 2. Net improvement in pass rate
    return (
        no_regression(baseline_results, proposal_results) and
        net_improvement(baseline_results, proposal_results)
    )

3. Acceptance Criteria

A proposal is accepted if:

✅ No Breaking Changes: All previously passing tasks still pass
✅ Net Positive: Overall pass rate improves
✅ Weakness Addressed: At least one failure from the target weakness now passes
✅ No New Failures: Doesn't introduce failures in unrelated tasks

4. Iterative Application

Once validated, the proposal is:

Merged into the harness
Used for subsequent weakness mining
Compound improvements build on previous fixes
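
The accept-and-merge behavior described above amounts to a greedy loop. Here is a minimal sketch, assuming the harness is a list of prompt rules and validation is a caller-supplied function; both are simplifications for illustration, and the toy validator stands in for real regression testing.

```python
def apply_validated(harness, proposals, passes_validation):
    """Greedily merge proposals; each is validated against the harness so far."""
    accepted = []
    for proposal in proposals:
        candidate = harness + [proposal]
        if passes_validation(harness, candidate):
            harness = candidate  # later proposals build on this one
            accepted.append(proposal)
    return harness, accepted


def toy_validator(current, candidate):
    """Stand-in for regression testing: accept rules that mention 'verify'."""
    return "verify" in candidate[-1].lower()


base = ["You are an AI agent with access to terminal commands."]
proposals = [
    "Verify git user.name and user.email before committing.",
    "Always answer in uppercase.",
    "Verify that files exist after writing them.",
]
final, accepted = apply_validated(base, proposals, toy_validator)
print(len(final), len(accepted))  # 3 2
```

Because each candidate is compared against the most recently accepted harness, a rejected proposal leaves no trace and an accepted one becomes the new baseline.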

Safety Mechanism:

python

# Only minimal, targeted changes are accepted
if change_diff_lines > MAX_CHANGE_SIZE:
    reject_proposal("Too large, split into smaller changes")

Experimental Results: Terminal-Bench 2.0

The paper evaluated Self-Harness on Terminal-Bench 2.0, the industry-standard benchmark for AI agent evaluation comprising 89 carefully curated tasks across diverse domains.

Experimental Setup

Base Models Tested:

MiniMax M2.5 — Chinese frontier model from MiniMax
Qwen3.5-35B-A3B — Open-weight model from Alibaba Cloud
GLM-5 — Bilingual model from Zhipu AI

Why These Models?

Diverse model families (none from OpenAI or Anthropic)
Different architectures and training approaches
Varying baseline capabilities (23.8% to 42.9% initial pass rate)
Demonstrates generalization across model types

Minimal Initial Harness:

Basic system prompt with role description
Standard tool access (bash, file operations, web search)
No model-specific optimizations
No domain-specific guidance

Performance Improvements

Model	Initial Pass Rate	Final Pass Rate	Absolute Gain (points)	Relative Gain
MiniMax M2.5	40.5%	61.9%	+21.4	+52.8%
Qwen3.5-35B-A3B	23.8%	38.1%	+14.3	+60.1%
GLM-5	42.9%	57.1%	+14.2	+33.1%
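
The two gain columns measure different things: the absolute gain is a difference in percentage points, while the relative gain divides that difference by the initial pass rate. A short check of the arithmetic, using only the pass rates from the table:

```python
results = {
    "MiniMax M2.5": (40.5, 61.9),
    "Qwen3.5-35B-A3B": (23.8, 38.1),
    "GLM-5": (42.9, 57.1),
}


def gains(initial, final):
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = final - initial
    relative = absolute / initial * 100
    return round(absolute, 1), round(relative, 1)


for model, (initial, final) in results.items():
    absolute, relative = gains(initial, final)
    print(f"{model}: +{absolute} points, +{relative}% relative")
```

This is why the weakest baseline shows the largest relative gain despite a mid-sized absolute gain: the same number of points counts for more against a smaller starting value.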

Key Findings:

1. Consistent Improvements Across All Models

All three models showed substantial gains (14-21 absolute points)
Improvements were not limited to high-performing models
Even the weakest baseline (Qwen 23.8%) saw 60% relative improvement

2. Model-Specific Harness Modifications

Different models generated different harness changes
MiniMax M2.5 focused on error recovery patterns
Qwen3.5 added more verification steps
GLM-5 improved context management

3. Non-Generic Improvements

Qualitative analysis showed changes were not just adding generic instructions
Each model identified unique weaknesses specific to its behavior
Proposals directly addressed concrete failure modes

4. Compound Benefits

Improvements accumulated across iterations
Later proposals built on earlier harness enhancements
Performance gains plateaued after 5-7 iterations

Iteration Dynamics

snippet

MiniMax M2.5 Improvement Curve:
Iteration 0 (Baseline):  40.5%
Iteration 1:             45.2%  (+4.7%)
Iteration 2:             51.8%  (+6.6%)
Iteration 3:             56.3%  (+4.5%)
Iteration 4:             59.1%  (+2.8%)
Iteration 5:             61.2%  (+2.1%)
Iteration 6:             61.9%  (+0.7%)
Iteration 7:             61.9%  (+0.0%)  [Converged]

Convergence Behavior:

Most gains in first 3-4 iterations
Diminishing returns after iteration 5
Stable performance suggests the loop has converged, though a plateau alone does not rule out overfitting to the benchmark
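
A stopping rule for this kind of curve can be as simple as halting when the per-iteration gain drops below a threshold. The sketch below applies one such rule to the MiniMax M2.5 numbers above; the threshold and patience values are assumptions, not settings from the paper.

```python
def converged_at(pass_rates, min_gain=0.5, patience=1):
    """Return the iteration at which the stopping rule fires, or None.

    The rule fires once `patience` consecutive iterations each improve
    the pass rate by less than `min_gain` points.
    """
    stalled = 0
    for i in range(1, len(pass_rates)):
        if pass_rates[i] - pass_rates[i - 1] < min_gain:
            stalled += 1
            if stalled >= patience:
                return i
        else:
            stalled = 0
    return None


curve = [40.5, 45.2, 51.8, 56.3, 59.1, 61.2, 61.9, 61.9]
print(converged_at(curve))  # 7
```

A stricter threshold stops earlier: with min_gain=1.0 the same curve halts at iteration 6, trading the last 0.7 points for one fewer round of benchmark runs.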

Qualitative Analysis: What Changed?

The paper provides detailed examples of discovered weaknesses and resulting harness modifications.

Example 1: Git Configuration Failures (MiniMax M2.5)

Weakness Mining Discovery:

snippet

Pattern: 8 failures on tasks requiring git commits
Root Cause: Missing git user.name and user.email configuration
Example Traces:
  - Task 23: "Create repo and commit changes" → FAIL (git commit rejected)
  - Task 45: "Initialize project with git" → FAIL (identity not configured)

Harness Proposal Generated:

diff

System Prompt Addition:
+ Git Configuration Prerequisite:
+ Before any git commit operation, verify configuration:
+ - Check: git config user.name
+ - Check: git config user.email
+ If either is unset, configure defaults:
+   git config user.name "Agent"
+   git config user.email "agent@localhost"

Validation Results:

8 previously failing tasks now pass
0 regressions on other tasks
✅ Accepted and merged

Impact: +9.0% pass rate improvement on git-related tasks

Example 2: File Verification Gaps (Qwen3.5-35B-A3B)

Weakness Mining Discovery:

snippet

Pattern: 12 failures on tasks involving file operations
Root Cause: Agent assumes file operations succeed without verification
Example Traces:
  - Task 12: Created config.json but didn't verify, later steps failed
  - Task 34: Assumed mkdir succeeded, then tried to cd into non-existent directory

Harness Proposal Generated:

diff

Tool Wrapper Addition:
def create_file(path, content):
    execute_bash(f"cat > {path} << 'EOF'\n{content}\nEOF")
+   # Verify file was created
+   if not execute_bash(f"test -f {path}"):
+       raise FileNotFoundError(f"Failed to create {path}")
+   # Verify content matches
+   actual = execute_bash(f"cat {path}")
+   if actual.strip() != content.strip():
+       raise ValueError(f"Content mismatch in {path}")

Validation Results:

10 of 12 failing tasks now pass
2 tasks showed different failures (unrelated to file verification)
0 regressions
✅ Accepted and merged

Impact: +11.2% pass rate improvement on file operation tasks

Example 3: Context Loss in Multi-Step Tasks (GLM-5)

Weakness Mining Discovery:

snippet

Pattern: 7 failures on long, multi-step tasks
Root Cause: Model loses track of intermediate results and task state
Example Traces:
  - Task 56: Forgot database connection string from step 2 by step 5
  - Task 78: Lost API key after environment setup, failed authentication

Harness Proposal Generated:

diff

Planning Template Update:
Plan for completing task:
+ [State Tracking]
+ - Track: {key variables to remember}
+ - Update tracker after each step completion
+
1. {step 1}
+   → Record outcome: {what to remember}
2. {step 2}
+   → Record outcome: {what to remember}
...
N. {final step}
+   → Verify: All tracked variables are still accessible

Validation Results:

6 of 7 failing tasks now pass
1 task failed due to unrelated timeout issue
0 regressions
✅ Accepted and merged

Impact: +6.7% pass rate improvement on multi-step tasks

Comparison to Related Approaches

Self-Harness vs. Human Harness Engineering

Aspect	Human Engineering	Self-Harness
Speed	Days to weeks per model	Hours (automated)
Scalability	Limited by human expertise	Scales with compute
Model-Specificity	Requires manual analysis	Automatically discovers patterns
Consistency	Varies by engineer skill	Systematic and reproducible
Cost	High (expert time)	Low (compute only)
Adaptation	Manual updates needed	Continuous self-improvement

When Human Engineering Still Matters:

✅ Initial harness architecture design
✅ Domain-specific tool selection
✅ Safety and compliance guardrails
✅ Production deployment decisions

Self-Harness vs. Stronger External Models

Some approaches use stronger models (e.g., GPT-5.5) to improve weaker agents. Self-Harness differs:

External Scaffolding Approach:

Uses GPT-5.5 to analyze GPT-4 agent failures
GPT-5.5 proposes fixes for GPT-4 harness
Requires access to superior model
Not self-contained

Self-Harness Approach:

Agent improves its own harness using its own capabilities
No external models required
Truly autonomous improvement
Scales to any model

Philosophical Difference:

"A model should be able to identify and fix its own systematic weaknesses, not rely on a smarter model to tell it what's wrong."

Self-Harness vs. Microsoft SkillOpt

Microsoft's SkillOpt also addresses self-improvement but focuses on skill refinement rather than harness optimization:

Feature	Self-Harness	SkillOpt
Target	Agent harness (system-level)	Individual skills (task-level)
Scope	Cross-task patterns	Single-skill optimization
Method	Trace analysis + proposals	Skill execution feedback
Validation	Regression testing	Skill-specific metrics
Granularity	System prompts, tool wrappers	Skill code and parameters

Complementary Approaches: Both can be used together—SkillOpt optimizes individual skills while Self-Harness improves the overarching framework.

Technical Deep Dive: How Self-Harness Works

Weakness Mining Algorithm

Input: Execution traces from failed and successful tasks

Output: Ranked list of weakness patterns with their root causes

Pseudo-code:

python

def mine_weaknesses(traces, model):
    failures = [t for t in traces if not t.success]

    # Group failures by similarity
    clusters = cluster_by_error_pattern(failures)

    weaknesses = []
    for cluster in clusters:
        # Analyze common failure mode
        pattern = model.analyze_pattern(cluster)

        # Extract root cause
        root_cause = model.identify_root_cause(pattern, cluster)

        # Count frequency and impact
        frequency = len(cluster)
        impacted_tasks = extract_task_ids(cluster)

        weaknesses.append(Weakness(
            pattern=pattern,
            root_cause=root_cause,
            frequency=frequency,
            impacted_tasks=impacted_tasks
        ))

    # Rank by frequency, most frequent first
    return sorted(weaknesses, key=lambda w: w.frequency, reverse=True)

Key Techniques:

Error Clustering: Groups similar failures using embedding similarity
Pattern Extraction: Identifies recurring error types via LLM analysis
Root Cause Analysis: Traces failure back to harness gaps
Impact Assessment: Prioritizes high-frequency, high-impact weaknesses
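
Impact assessment needs a concrete scoring rule to produce a ranking. The sketch below assumes one possible rule, failures multiplied by the number of distinct tasks affected; the rule, the Weakness fields, and the sample data are illustrative and not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Weakness:
    pattern: str
    frequency: int  # number of failed traces in the cluster
    impacted_tasks: list = field(default_factory=list)

    @property
    def score(self) -> int:
        # Assumed scoring rule: failures x distinct tasks affected
        return self.frequency * len(set(self.impacted_tasks))


def rank(weaknesses):
    """Sort weaknesses so the highest-scoring one comes first."""
    return sorted(weaknesses, key=lambda w: w.score, reverse=True)


# Toy data, invented for illustration
weaknesses = [
    Weakness("missing git identity", 8, ["t23", "t45", "t61"]),
    Weakness("unverified file writes", 12, ["t12", "t34", "t50"]),
    Weakness("retry loop on timeout", 9, ["t70"]),
]
for w in rank(weaknesses):
    print(w.score, w.pattern)
```

Under this rule the retry loop ranks last even though it fails more often than the git issue, because all of its failures come from a single task.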

Harness Proposal Generation

Input: A single weakness with context

Output: 3-5 diverse candidate harness modifications

Prompt Template:

snippet

You are analyzing your own execution failures to improve your harness.

Weakness Pattern:
{weakness.pattern}

Root Cause:
{weakness.root_cause}

Failed Task Examples:
{weakness.example_traces}

Current Harness:
{current_harness}

Generate 3-5 diverse, minimal harness modifications that would prevent this failure pattern.
For each proposal:
1. Describe the change
2. Explain why it addresses the root cause
3. Provide concrete implementation (system prompt, tool wrapper, or planning template)
4. Estimate impact on other tasks

Keep changes minimal and targeted. Avoid large rewrites.
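
Filling a template like the one above is plain string formatting. The sketch below uses a shortened version of the template with flat placeholder names; the build_prompt helper and the sample values are assumptions for illustration.

```python
PROMPT_TEMPLATE = """\
Weakness Pattern:
{pattern}

Root Cause:
{root_cause}

Failed Task Examples:
{example_traces}

Current Harness:
{current_harness}
"""


def build_prompt(pattern, root_cause, example_traces, current_harness):
    """Fill the template; example traces are rendered one per line."""
    return PROMPT_TEMPLATE.format(
        pattern=pattern,
        root_cause=root_cause,
        example_traces="\n".join(f"- {t}" for t in example_traces),
        current_harness=current_harness,
    )


prompt = build_prompt(
    pattern="git commit fails on fresh repositories",
    root_cause="user.name and user.email are not configured",
    example_traces=["Task 23: git commit rejected", "Task 45: identity not configured"],
    current_harness="You are an AI agent with access to terminal commands.",
)
print(prompt)
```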

Diversity Enforcement:

Prompt Variation: Different temperature and top-p settings per proposal
Approach Variety: Require proposals use different modification types
Semantic Distance: Reject proposals too similar to existing ones
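
The semantic-distance check can be approximated without embeddings. The sketch below swaps in character-level similarity from Python's difflib as a stand-in for embedding distance, so it only catches near-verbatim repeats; the 0.8 threshold is an assumption.

```python
from difflib import SequenceMatcher


def is_near_duplicate(candidate, kept, threshold=0.8):
    """True if candidate is too similar to any already-kept proposal."""
    return any(
        SequenceMatcher(None, candidate.lower(), other.lower()).ratio() >= threshold
        for other in kept
    )


def filter_diverse(proposals, threshold=0.8):
    """Keep proposals in order, dropping ones that repeat an earlier one."""
    kept = []
    for proposal in proposals:
        if not is_near_duplicate(proposal, kept, threshold):
            kept.append(proposal)
    return kept


proposals = [
    "Before git commit, check that user.name is configured.",
    "Before git commit, check that user.email is configured.",
    "Wrap git commands so identity is set automatically.",
]
print(filter_diverse(proposals))
```

Here the second proposal is dropped as a near-copy of the first, while the third survives because it takes a different approach in different words.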

Proposal Validation System

Input: Current harness, proposed harness, validation task set

Output: Accept/reject decision with detailed metrics

Validation Workflow:

python

def validate_proposal(current_harness, proposed_harness, val_tasks):
    # Baseline performance
    baseline_results = run_tasks(current_harness, val_tasks)
    baseline_pass_rate = compute_pass_rate(baseline_results)

    # Proposed performance
    proposal_results = run_tasks(proposed_harness, val_tasks)
    proposal_pass_rate = compute_pass_rate(proposal_results)

    # Check for regressions
    regressions = [
        task for task in val_tasks
        if baseline_results[task].passed and not proposal_results[task].passed
    ]

    # Check for improvements
    improvements = [
        task for task in val_tasks
        if not baseline_results[task].passed and proposal_results[task].passed
    ]

    # Decision criteria
    if len(regressions) > 0:
        return Decision.REJECT, "Introduced regressions"

    if proposal_pass_rate <= baseline_pass_rate:
        return Decision.REJECT, "No net improvement"

    if len(improvements) == 0:
        return Decision.REJECT, "No tasks improved"

    return Decision.ACCEPT, f"Improved {len(improvements)} tasks"

Regression Testing:

Full Re-evaluation: All validation tasks re-run with proposed harness
Strict No-Regression: Even one new failure causes rejection
Net Positive Requirement: Overall pass rate must increase
Weakness-Specific Improvement: At least one targeted failure must pass

Limitations and Future Directions

Current Limitations

1. Computational Cost

Each iteration requires running full benchmark suite twice (baseline + proposal)
89 tasks × 2 runs × 5 iterations = 890 agent runs
For expensive models, costs can accumulate ($50-100+ per full optimization)
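
The run count above is a straight product, which makes the budget easy to estimate before starting. In the sketch below the per-run cost is a hypothetical figure chosen for illustration; actual costs depend on the model and task length.

```python
def total_agent_runs(num_tasks, runs_per_iteration, iterations):
    """Agent runs needed when every iteration re-runs every task."""
    return num_tasks * runs_per_iteration * iterations


def estimated_cost(total_runs, cost_per_run_usd):
    """Total spend for a given per-run cost (the per-run cost is an assumption)."""
    return total_runs * cost_per_run_usd


runs = total_agent_runs(num_tasks=89, runs_per_iteration=2, iterations=5)
print(runs)                                  # 890
print(f"${estimated_cost(runs, 0.10):.2f}")  # at a hypothetical $0.10 per run
```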

2. Local Optima Risk

Greedy acceptance of proposals may miss global optimal harness
No backtracking if early proposals lead to dead ends
Might plateau before reaching human-expert-level harness

3. Benchmark Overfitting

Optimizing specifically for Terminal-Bench 2.0
May not generalize to other domains or task types
Held-out validation set mitigates but doesn't eliminate risk

4. Limited to Harness-Fixable Failures

Cannot improve base model capabilities
Some failures are fundamental model limitations
Only helps with failures caused by harness gaps

5. Minimal Harness Assumption

Starts from very basic harness
May not apply to already-optimized production harnesses
Unclear how it composes with existing harness engineering

Promising Future Directions

1. Cross-Model Harness Transfer

snippet

Question: Can harness improvements from Model A transfer to Model B?
Approach: Train Self-Harness on cheap model, transfer to expensive model
Potential: Reduce optimization cost for expensive frontier models

2. Multi-Benchmark Generalization

snippet

Question: Can harness optimize across multiple benchmarks simultaneously?
Approach: Validate proposals on Terminal-Bench 2.0 + SWE-bench + GAIA
Potential: More generalizable harnesses that work across domains

3. Compositional Harness Modules

snippet

Question: Can we build libraries of reusable harness modules?
Approach: Extract successful patterns into plug-and-play components
Potential: Faster initial harness setup for new models

4. Human-in-the-Loop Validation

snippet

Question: Can human review improve Self-Harness proposals?
Approach: Expert reviews edge cases and suggests refinements
Potential: Combine automation speed with human insight

5. Continuous Online Improvement

snippet

Question: Can Self-Harness improve during production deployment?
Approach: Mine weaknesses from real user interactions, propose fixes
Potential: Agents that continuously adapt to real-world usage patterns

Practical Implications

For AI Agent Developers

What This Means:

Faster Iteration: Spend less time manually tuning harnesses for new models
Model-Agnostic Optimization: The framework is not tied to one model family (the paper tests MiniMax, Qwen, and GLM models)
Data-Driven Improvements: Systematic analysis replaces trial-and-error
Reproducible Results: Automated process ensures consistent optimization

How to Apply:

Instrument Your Agent: Capture execution traces (tool calls, errors, outcomes)
Build Validation Suite: Create held-out test set for regression testing
Implement Self-Harness Loop: Adapt the three-stage framework to your agent
Monitor Convergence: Track pass rate improvements across iterations
Merge Improvements: Integrate validated proposals into production harness
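
The first step, instrumenting the agent, can start as a small event log. The sketch below records one event per tool call and serializes the log as JSON Lines; the class and field names are assumptions, not part of any released Self-Harness code.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class TraceEvent:
    """One step of an agent run (field names are assumptions)."""
    task_id: str
    step: int
    tool: str
    command: str
    exit_code: int
    output: str = ""


@dataclass
class TraceLog:
    events: list = field(default_factory=list)

    def record(self, event: TraceEvent) -> None:
        self.events.append(event)

    def to_jsonl(self) -> str:
        """Serialize events as JSON Lines, one event per line."""
        return "\n".join(json.dumps(asdict(e)) for e in self.events)

    def failed_steps(self):
        """Return events whose command exited with a non-zero status."""
        return [e for e in self.events if e.exit_code != 0]


log = TraceLog()
log.record(TraceEvent("task-23", 1, "bash", "git init", 0))
log.record(TraceEvent("task-23", 2, "bash", "git commit -m init", 128,
                      "fatal: unable to auto-detect email address"))
print(log.to_jsonl())
print(len(log.failed_steps()))
```

Storing one JSON object per line keeps traces appendable during a run and easy to filter afterwards when mining for weaknesses.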

For AI Researchers

Research Questions Opened:

How do Self-Harness improvements compare to meta-learning approaches?
Can we predict which models will benefit most from Self-Harness?
What is the theoretical limit of harness-only improvements?
How do self-improved harnesses generalize to out-of-distribution tasks?

Benchmark Implications:

Harness Standardization: Should benchmarks specify minimal standard harnesses?
Leaderboard Fairness: How to account for harness engineering in rankings?
Evaluation Protocols: Should we report both minimal-harness and optimized-harness scores?

For Organizations Deploying Agents

Strategic Considerations:

When to Use Self-Harness:

✅ Deploying agents on new models frequently
✅ Operating at scale where manual tuning doesn't scale
✅ Have diverse task types requiring different optimizations
✅ Want systematic, reproducible improvement process

When to Stick with Human Engineering:

⚠️ Small-scale deployments with stable models
⚠️ Highly domain-specific tasks requiring expert knowledge
⚠️ Strict safety/compliance requirements needing human review
⚠️ Limited compute budget for iterative optimization

Hybrid Approach:

snippet

1. Human experts design initial harness architecture
2. Self-Harness optimizes model-specific details
3. Humans review and approve proposed changes
4. Deploy optimized harness to production
5. Continuous Self-Harness monitoring for new failure patterns

Connection to Broader Trends

The Agent Harness Engineering Movement

Self-Harness builds on the emerging discipline of agent harness engineering, where differentiation comes from the scaffolding around the model, not just the model itself.

Timeline:

Feb 2026: Mitchell Hashimoto coins "harness engineering"
Mar 2026: LangChain reports +13.7% Terminal-Bench gain (harness-only)
May 2026: Stanford IRIS meta-harness research published
Jun 2026: Self-Harness demonstrates autonomous harness improvement

Philosophical Shift:

"Frontier models are table stakes. Differentiation is the harness—the loop, tools, middleware, and verification around the model."

Comparison to Loop Engineering

Loop engineering focuses on designing effective agent execution loops—the repeated cycle of planning, action, observation, and refinement.

Self-Harness Contribution:

Loop engineering defines the architecture
Self-Harness optimizes the details automatically
Together: Human-designed loops + AI-optimized harnesses

Terminal-Bench 2.0 as Standard Testbed

The choice of Terminal-Bench 2.0 as the evaluation benchmark is significant:

Why Terminal-Bench 2.0 Works for Self-Harness:

✅ Diverse Tasks: 89 tasks across ML, systems, security, biology
✅ Real-World: Inspired by actual developer/sysadmin workflows
✅ Reproducible: Containerized environments ensure consistency
✅ Industry Standard: Used by virtually all frontier labs

Benchmark Scores Context:

Top agent+model combinations: ~80-82% (ForgeCode, TongAgents)
Top direct models: ~73% (GPT-5.5)
Self-Harness improved models: 38-62% (from minimal harness)
Gap shows headroom for further harness engineering

How to Access and Reproduce

Paper and Code

Paper:

arXiv: arXiv:2606.09498
PDF: arxiv.org/pdf/2606.09498
Published: June 8, 2026

Authors:

Hangfan Zhang, Shao Zhang, Kangcong Li, Chen Zhang
Yang Chen, Yiqun Zhang, Lei Bai, Shuyue Hu

Code Availability:

The paper mentions a code release, but the arXiv submission does not yet include a URL
Check authors' GitHub repositories or paper homepage for updates

Reproduction Guide

Prerequisites:

bash

# Terminal-Bench 2.0 setup
git clone https://github.com/laude-institute/terminal-bench-2.0
cd terminal-bench-2.0
pip install -r requirements.txt

# Base model access (choose one)
# - MiniMax M2.5 API
# - Qwen3.5-35B-A3B (via ollama or API)
# - GLM-5 API

# Harbor framework
pip install harbor-agents

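Running Self-Harness (illustrative only: the authors' code is not yet public, so the self_harness and terminal_bench modules and every name imported from them are hypothetical):

python

# Hypothetical API, shown to illustrate the workflow
from self_harness import MinimalHarness, SelfHarnessOptimizer, evaluate
from terminal_bench import load_benchmark, split_tasks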

# Load benchmark
tasks = load_benchmark("terminal-bench-2.0")
train_tasks, val_tasks = split_tasks(tasks, ratio=0.7)

# Initialize with minimal harness
minimal_harness = MinimalHarness(
    model="minimax-m2.5",
    system_prompt="You are an AI agent with access to terminal commands."
)

# Run Self-Harness optimization
optimizer = SelfHarnessOptimizer(
    base_harness=minimal_harness,
    max_iterations=10,
    validation_tasks=val_tasks
)

optimized_harness = optimizer.optimize(train_tasks)

# Evaluate
baseline_score = evaluate(minimal_harness, val_tasks)
optimized_score = evaluate(optimized_harness, val_tasks)

print(f"Baseline: {baseline_score:.1%}")
print(f"Optimized: {optimized_score:.1%}")
print(f"Gain: +{optimized_score - baseline_score:.1%}")

Expected Results (MiniMax M2.5):

snippet

Baseline:  40.5%
Optimized: 61.9%
Gain:     +21.4%

Sources and References

Primary Source

Paper:

Title: Self-Harness: Harnesses That Improve Themselves
Authors: Hangfan Zhang, Shao Zhang, Kangcong Li, Chen Zhang, Yang Chen, Yiqun Zhang, Lei Bai, Shuyue Hu
Published: June 8, 2026 (arXiv)
arXiv ID: 2606.09498 [cs.CL]
DOI: https://doi.org/10.48550/arXiv.2606.09498

Related Research

Terminal-Bench 2.0 Paper — The benchmark used for evaluation
Stanford IRIS Meta-Harness — Related harness optimization research
Harbor Framework — Agent evaluation framework

Self-Harness was published on arXiv on June 8, 2026, introducing a paradigm where LLM-based agents autonomously improve their own operating harnesses through weakness mining, harness proposals, and validation—achieving substantial performance gains on Terminal-Bench 2.0 across diverse base models without requiring human engineers or stronger external models.
