TL;DR: Published on arXiv on June 8, 2026, "Self-Harness: Harnesses That Improve Themselves" introduces a paradigm in which LLM-based agents autonomously optimize their own operating frameworks without human engineers or stronger external models. Using a three-stage loop (Weakness Mining, Harness Proposal, Proposal Validation), Self-Harness raised pass rates on Terminal-Bench 2.0 for all three models tested: MiniMax M2.5 from 40.5% to 61.9% (+52.8% relative), Qwen3.5-35B-A3B from 23.8% to 38.1% (+60.1%), and GLM-5 from 42.9% to 57.1% (+33.1%). The results suggest that agents can turn model-specific weaknesses into concrete, executable harness improvements.
The Harness Engineering Problem
The performance of LLM-based agents is jointly shaped by two critical factors:
- Base Model Capabilities — The underlying LLM's reasoning, coding, and knowledge
- Agent Harness — The scaffolding that mediates interaction with the environment
While much attention focuses on improving base models, recent evidence shows that harness engineering can yield 10-15 point improvements on benchmarks while keeping the base model fixed.
The Scaling Problem with Human-Designed Harnesses
Current State:
- Harnesses are largely engineered by human experts
- Effective harness design is inherently model-specific
- Different models exhibit distinct failure patterns and behaviors
- Modern LLMs are increasingly diverse and rapidly evolving
Why This Doesn't Scale:
graph TD
A[New Model Released] --> B[Human Engineers Analyze]
B --> C[Design Model-Specific Harness]
C --> D[Manual Testing & Iteration]
D --> E{Performance OK?}
E -->|No| B
E -->|Yes| F[Deploy]
G[Another Model Released] --> B
style B fill:#ff6b6b
style C fill:#ff6b6b
style D fill:#ff6b6b
The Bottleneck: Human engineers can't keep pace with model diversity and evolution. Each new model family (GPT, Claude, Gemini, Qwen, GLM, MiniMax, etc.) exhibits unique behaviors requiring custom harness design.
Introducing Self-Harness: Agents That Fix Themselves
Core Innovation: What if the agent could improve its own harness, without relying on human engineers or stronger external models?
The Self-Harness Paradigm
Definition: Self-Harness is an iterative framework where an LLM-based agent autonomously:
- Identifies its own failure patterns through execution trace analysis
- Proposes targeted modifications to its operating harness
- Validates improvements through regression testing
- Iterates until performance converges
Key Insight: Instead of human experts manually engineering model-specific fixes, the model itself discovers and implements what it needs to succeed.
The Three-Stage Self-Harness Loop
Architecture Overview
graph LR
A[Execution Traces] --> B[Stage 1: Weakness Mining]
B --> C[Identified Failure Patterns]
C --> D[Stage 2: Harness Proposal]
D --> E[Candidate Harness Modifications]
E --> F[Stage 3: Proposal Validation]
F --> G{Regression Tests Pass?}
G -->|Yes| H[Accept Changes]
G -->|No| I[Reject Changes]
H --> J[Updated Harness]
I --> D
J --> K[Run Benchmark Tasks]
K --> A
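The loop in the diagram can be sketched as a small driver function. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `self_harness_loop`, `mine`, `propose`, and `validate` are hypothetical, a harness is modeled as a plain list of instruction strings, and the three stages are injected as callables so the control flow can be read on its own.

```python
from typing import Callable, List

# Hypothetical modeling choice: a harness is a list of instruction strings,
# a weakness is a short description, and a proposal is one new instruction.
Harness = List[str]


def self_harness_loop(
    harness: Harness,
    mine: Callable[[Harness], List[str]],
    propose: Callable[[Harness, str], List[str]],
    validate: Callable[[Harness, Harness], bool],
    max_iterations: int = 10,
) -> Harness:
    """Repeat mine -> propose -> validate until no proposal is accepted."""
    for _ in range(max_iterations):
        accepted = False
        for weakness in mine(harness):                     # Stage 1
            for candidate in propose(harness, weakness):   # Stage 2
                updated = harness + [candidate]
                if validate(harness, updated):             # Stage 3
                    harness = updated
                    accepted = True
                    break  # first validated candidate wins for this weakness
        if not accepted:
            break  # converged: nothing passed validation this round
    return harness


# Toy stand-ins for the three stages (illustrative only).
def toy_mine(h: Harness) -> List[str]:
    return [] if any("git" in s for s in h) else ["git identity unset"]


def toy_propose(h: Harness, weakness: str) -> List[str]:
    return ["Configure git user.name and user.email before committing."]


def toy_validate(old: Harness, new: Harness) -> bool:
    return len(new) > len(old)


result = self_harness_loop(["You are an AI agent."], toy_mine, toy_propose, toy_validate)
print(result)
```

With the toy stages, the first round accepts one instruction and the second round finds no remaining weakness, so the loop stops on its own before `max_iterations`.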
Stage 1: Weakness Mining
Purpose: Identify model-specific failure patterns from execution traces.
Process:
- Collect Execution Traces: Run the agent on benchmark tasks, capturing:
  - Tool calls made
  - Responses received
  - Reasoning steps
  - Terminal output
  - Error messages
  - Success/failure status
- Pattern Analysis: The agent analyzes its own traces to discover:
  - Recurring error types
  - Missing error handling
  - Inefficient tool usage patterns
  - Context management failures
  - Planning mistakes
  - Stuck loops or infinite retries
- Failure Categorization: Weaknesses are grouped by type:
  - Tool Selection Errors: Wrong tool for the task
  - Context Errors: Lost or misinterpreted information
  - Planning Errors: Poor task decomposition
  - Error Handling: Failed to recover from failures
  - Verification Gaps: Didn't validate assumptions
Example Weakness Discovery:
Weakness ID: W-042
Pattern: Agent frequently fails git operations by forgetting to configure user.name
Frequency: 12 failures across 89 tasks
Impact: Blocks commit-related tasks
Category: Tool prerequisite missing
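A weakness record like the one above could be derived from raw traces roughly as follows. This is a sketch under assumptions: the `Trace` and `Weakness` dataclasses, their field names, and the category labels are hypothetical, and the task IDs are toy data. In the described system the pattern analysis is done by the model itself, so simple counting stands in for that step here.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class Trace:
    task_id: int
    success: bool
    error_category: str = ""  # hypothetical labels, e.g. "tool_prerequisite"


@dataclass
class Weakness:
    category: str
    frequency: int
    impacted_tasks: List[int]


def summarize_failures(traces: List[Trace]) -> List[Weakness]:
    """Group failed traces by category, most frequent category first."""
    failures = [t for t in traces if not t.success]
    counts = Counter(t.error_category for t in failures)
    return [
        Weakness(cat, n, sorted(t.task_id for t in failures if t.error_category == cat))
        for cat, n in counts.most_common()
    ]


# Toy traces for illustration only.
traces = [
    Trace(23, False, "tool_prerequisite"),
    Trace(45, False, "tool_prerequisite"),
    Trace(12, False, "verification_gap"),
    Trace(7, True),
]
weaknesses = summarize_failures(traces)
for w in weaknesses:
    print(w)
```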
Stage 2: Harness Proposal
Purpose: Generate diverse yet minimal harness modifications tied to discovered weaknesses.
Design Principles:
- ✅ Targeted: Each proposal addresses a specific weakness
- ✅ Minimal: Small, focused changes rather than large rewrites
- ✅ Diverse: Generate multiple candidate solutions per weakness
- ✅ Testable: Changes must be verifiable through benchmark tasks
Proposal Types:
1. System Prompt Modifications
# Before
You are an AI agent with access to terminal commands.
# After (Self-Harness Proposal)
You are an AI agent with access to terminal commands.
+ Before running git commit, always verify git user.name and user.email are configured.
+ If not set, configure them using: git config user.name "Agent" && git config user.email "agent@localhost"
2. Tool Wrapper Additions
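# Self-Harness proposes wrapping git commands
import subprocess

def execute_git_command(cmd):
    # Ensure git identity is configured before any commit operation
    if "commit" in cmd:
        check_git_config()  # harness helper, assumed to set user.name/user.email if unset
    # cmd is a shell string, so it is run through the shell
    return subprocess.run(cmd, shell=True)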
3. Validation Step Injection
# Self-Harness proposes adding verification after file operations
import os

def create_file(path, content):
    # write_file and read_file are the harness's existing file tools
    write_file(path, content)
    # Validate file was created successfully
    if not os.path.exists(path):
        raise FileNotFoundError(f"Failed to create {path}")
    # Validate content matches
    if read_file(path) != content:
        raise ValueError("File content mismatch")
4. Planning Template Updates
# Before
Plan: {steps}
# After (Self-Harness Proposal)
Plan:
+ 1. Verify prerequisites (dependencies, configs, permissions)
{steps}
+ N+1. Verify expected outcomes
+ N+2. Clean up temporary resources
Diversity Mechanism: For each weakness, Self-Harness generates 3-5 candidate proposals using different approaches:
- Prompt-based guidance
- Tool wrapper modifications
- Validation additions
- Planning template changes
- Error recovery procedures
Stage 3: Proposal Validation
Purpose: Accept candidate edits only after regression testing to prevent breaking existing capabilities.
Validation Pipeline:
1. Held-Out Test Set
Training Set: 70% of Terminal-Bench 2.0 tasks (62 tasks)
Validation Set: 30% held-out tasks (27 tasks)
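The 62/27 split follows from flooring 70% of 89 tasks. A minimal sketch of such a split is shown below; the function name `split_tasks`, the fixed seed, and the use of integer task IDs are assumptions for illustration, and the article does not state how the paper assigns tasks to each set.

```python
import random
from typing import List, Sequence, Tuple


def split_tasks(
    task_ids: Sequence[int], train_ratio: float = 0.7, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Shuffle deterministically, then cut at floor(n * train_ratio)."""
    shuffled = list(task_ids)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)  # int() floors: 89 * 0.7 -> 62
    return shuffled[:cut], shuffled[cut:]


train, val = split_tasks(range(89))
print(len(train), len(val))  # 62 27
```

Fixing the seed keeps the held-out set stable across iterations, which matters because regression checks are only meaningful when the same tasks are compared each time.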
2. Regression Testing
def validate_proposal(current_harness, proposed_harness, tasks):
    baseline_results = run_benchmark(current_harness, tasks)
    proposal_results = run_benchmark(proposed_harness, tasks)
    # Accept only if:
    # 1. No regression on previously passing tasks
    # 2. Net improvement in pass rate
    return (
        no_regression(baseline_results, proposal_results) and
        net_improvement(baseline_results, proposal_results)
    )
3. Acceptance Criteria
A proposal is accepted if:
- ✅ No Breaking Changes: All previously passing tasks still pass
- ✅ Net Positive: Overall pass rate improves
- ✅ Weakness Addressed: At least one failure from the target weakness now passes
- ✅ No New Failures: Doesn't introduce failures in unrelated tasks
4. Iterative Application
Once validated, the proposal is:
- Merged into the harness
- Used as the starting point for subsequent weakness mining
- Built upon by later proposals, so improvements compound
Safety Mechanism:
# Only minimal, targeted changes are accepted
if change_diff_lines > MAX_CHANGE_SIZE:
    reject_proposal("Too large, split into smaller changes")
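One way to obtain a changed-line count for the size check above is to diff the old and new harness text. The sketch below uses Python's standard `difflib`; the function name `changed_line_count` and the threshold value of 10 are assumptions, since the article does not say how the diff size is measured or what limit is used.

```python
import difflib

MAX_CHANGE_SIZE = 10  # assumed threshold, not taken from the paper


def changed_line_count(old: str, new: str) -> int:
    """Count added plus removed lines between two harness texts."""
    diff = list(
        difflib.unified_diff(old.splitlines(), new.splitlines(), lineterm="", n=0)
    )
    body = diff[2:]  # drop the '---' / '+++' file header lines
    # Hunk headers start with '@@', so only real additions/removals are counted.
    return sum(1 for line in body if line.startswith(("+", "-")))


old_prompt = "You are an AI agent with access to terminal commands."
new_prompt = (
    old_prompt
    + "\nBefore running git commit, verify git user.name and user.email are configured."
    + "\nIf not set, configure them first."
)

size = changed_line_count(old_prompt, new_prompt)
print(size, "accepted" if size <= MAX_CHANGE_SIZE else "rejected")
```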
Experimental Results: Terminal-Bench 2.0
The paper evaluated Self-Harness on Terminal-Bench 2.0, a widely used benchmark for AI agent evaluation comprising 89 curated tasks across diverse domains.
Experimental Setup
Base Models Tested:
- MiniMax M2.5 — Chinese frontier model from MiniMax
- Qwen3.5-35B-A3B — Open-weight model from Alibaba Cloud
- GLM-5 — Bilingual model from Zhipu AI
Why These Models?
- Diverse model families (not all OpenAI/Anthropic)
- Different architectures and training approaches
- Varying baseline capabilities (23.8% to 42.9% initial pass rate)
- Demonstrates generalization across model types
Minimal Initial Harness:
- Basic system prompt with role description
- Standard tool access (bash, file operations, web search)
- No model-specific optimizations
- No domain-specific guidance
Performance Improvements
| Model | Initial Pass Rate | Final Pass Rate | Absolute Gain (percentage points) | Relative Gain |
|---|---|---|---|---|
| MiniMax M2.5 | 40.5% | 61.9% | +21.4 | +52.8% |
| Qwen3.5-35B-A3B | 23.8% | 38.1% | +14.3 | +60.1% |
| GLM-5 | 42.9% | 57.1% | +14.2 | +33.1% |
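The two gain columns follow directly from the pass rates: the absolute gain is the difference in percentage points, and the relative gain divides that difference by the initial rate. The short check below recomputes both; the helper name `gains` is illustrative.

```python
# Pass rates (initial, final) in percent, taken from the table above.
results = {
    "MiniMax M2.5": (40.5, 61.9),
    "Qwen3.5-35B-A3B": (23.8, 38.1),
    "GLM-5": (42.9, 57.1),
}


def gains(initial: float, final: float) -> tuple:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = final - initial
    relative = absolute / initial * 100
    return round(absolute, 1), round(relative, 1)


table = {model: gains(*rates) for model, rates in results.items()}
for model, (absolute, relative) in table.items():
    print(f"{model}: +{absolute} points, +{relative}% relative")
# MiniMax M2.5: +21.4 points, +52.8% relative
# Qwen3.5-35B-A3B: +14.3 points, +60.1% relative
# GLM-5: +14.2 points, +33.1% relative
```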
Key Findings:
1. Consistent Improvements Across All Models
- All three models showed substantial gains (14-21 absolute points)
- Improvements were not limited to high-performing models
- Even the weakest baseline (Qwen 23.8%) saw 60% relative improvement
2. Model-Specific Harness Modifications
- Different models generated different harness changes
- MiniMax M2.5 focused on error recovery patterns
- Qwen3.5 added more verification steps
- GLM-5 improved context management
3. Non-Generic Improvements
- Qualitative analysis showed changes were not just adding generic instructions
- Each model identified unique weaknesses specific to its behavior
- Proposals directly addressed concrete failure modes
4. Compound Benefits
- Improvements accumulated across iterations
- Later proposals built on earlier harness enhancements
- Performance gains plateaued after 5-7 iterations
Iteration Dynamics
MiniMax M2.5 Improvement Curve:
Iteration 0 (Baseline): 40.5%
Iteration 1: 45.2% (+4.7%)
Iteration 2: 51.8% (+6.6%)
Iteration 3: 56.3% (+4.5%)
Iteration 4: 59.1% (+2.8%)
Iteration 5: 61.2% (+2.1%)
Iteration 6: 61.9% (+0.7%)
Iteration 7: 61.9% (+0.0%) [Converged]
Convergence Behavior:
- Most gains in first 3-4 iterations
- Diminishing returns after iteration 5
- Stable performance after convergence is consistent with the harness not overfitting, though it does not rule it out
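The per-iteration deltas and the convergence point can be recomputed from the curve above. This is a small sketch: the function names and the tolerance of 0.05 points are assumptions, and the article does not state the stopping rule the paper actually uses.

```python
from typing import List, Optional

# Pass rate (%) after each iteration, from the MiniMax M2.5 curve above.
curve = [40.5, 45.2, 51.8, 56.3, 59.1, 61.2, 61.9, 61.9]


def deltas(rates: List[float]) -> List[float]:
    """Point gain of each iteration over the previous one."""
    return [round(b - a, 1) for a, b in zip(rates, rates[1:])]


def converged_at(rates: List[float], tol: float = 0.05) -> Optional[int]:
    """First iteration whose gain is below tol, or None if there is none."""
    for i, d in enumerate(deltas(rates), start=1):
        if abs(d) < tol:
            return i
    return None


steps = deltas(curve)
share_first_four = round(sum(steps[:4]) / sum(steps), 2)

print(steps)                # [4.7, 6.6, 4.5, 2.8, 2.1, 0.7, 0.0]
print(converged_at(curve))  # 7
print(share_first_four)     # 0.87
```

On these numbers, roughly 87% of the total gain arrives in the first four iterations, which matches the diminishing-returns pattern described above.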
Qualitative Analysis: What Changed?
The paper provides detailed examples of discovered weaknesses and resulting harness modifications.
Example 1: Git Configuration Failures (MiniMax M2.5)
Weakness Mining Discovery:
Pattern: 8 failures on tasks requiring git commits
Root Cause: Missing git user.name and user.email configuration
Example Traces:
- Task 23: "Create repo and commit changes" → FAIL (git commit rejected)
- Task 45: "Initialize project with git" → FAIL (identity not configured)
Harness Proposal Generated:
System Prompt Addition:
+ Git Configuration Prerequisite:
+ Before any git commit operation, verify configuration:
+ - Check: git config user.name
+ - Check: git config user.email
+ If either is unset, configure defaults:
+ git config user.name "Agent"
+ git config user.email "agent@localhost"
Validation Results:
- 8 previously failing tasks now pass
- 0 regressions on other tasks
- ✅ Accepted and merged
Impact: +9.0% pass rate improvement on git-related tasks
Example 2: File Verification Gaps (Qwen3.5-35B-A3B)
Weakness Mining Discovery:
Pattern: 12 failures on tasks involving file operations
Root Cause: Agent assumes file operations succeed without verification
Example Traces:
- Task 12: Created config.json but didn't verify, later steps failed
- Task 34: Assumed mkdir succeeded, then tried to cd into non-existent directory
Harness Proposal Generated:
Tool Wrapper Addition:
def create_file(path, content):
    execute_bash(f"cat > {path} << 'EOF'\n{content}\nEOF")
+   # Verify file was created
+   if not execute_bash(f"test -f {path}"):
+       raise FileNotFoundError(f"Failed to create {path}")
+   # Verify content matches
+   actual = execute_bash(f"cat {path}")
+   if actual.strip() != content.strip():
+       raise ValueError(f"Content mismatch in {path}")
Validation Results:
- 10 of 12 failing tasks now pass
- 2 tasks showed different failures (unrelated to file verification)
- 0 regressions
- ✅ Accepted and merged
Impact: +11.2% pass rate improvement on file operation tasks
Example 3: Context Loss in Multi-Step Tasks (GLM-5)
Weakness Mining Discovery:
Pattern: 7 failures on long, multi-step tasks
Root Cause: Model loses track of intermediate results and task state
Example Traces:
- Task 56: Forgot database connection string from step 2 by step 5
- Task 78: Lost API key after environment setup, failed authentication
Harness Proposal Generated:
Planning Template Update:
Plan for completing task:
+ [State Tracking]
+ - Track: {key variables to remember}
+ - Update tracker after each step completion
+
1. {step 1}
+ → Record outcome: {what to remember}
2. {step 2}
+ → Record outcome: {what to remember}
...
N. {final step}
+ → Verify: All tracked variables are still accessible
Validation Results:
- 6 of 7 failing tasks now pass
- 1 task failed due to unrelated timeout issue
- 0 regressions
- ✅ Accepted and merged
Impact: +6.7% pass rate improvement on multi-step tasks
Comparison to Related Approaches
Self-Harness vs. Human Harness Engineering
| Aspect | Human Engineering | Self-Harness |
|---|---|---|
| Speed | Days to weeks per model | Hours (automated) |
| Scalability | Limited by human expertise | Scales with compute |
| Model-Specificity | Requires manual analysis | Automatically discovers patterns |
| Consistency | Varies by engineer skill | Systematic and reproducible |
| Cost | High (expert time) | Low (compute only) |
| Adaptation | Manual updates needed | Continuous self-improvement |
When Human Engineering Still Matters:
- ✅ Initial harness architecture design
- ✅ Domain-specific tool selection
- ✅ Safety and compliance guardrails
- ✅ Production deployment decisions
Self-Harness vs. Stronger External Models
Some approaches use stronger models (e.g., GPT-5.5) to improve weaker agents. Self-Harness differs:
External Scaffolding Approach:
- Uses GPT-5.5 to analyze GPT-4 agent failures
- GPT-5.5 proposes fixes for GPT-4 harness
- Requires access to superior model
- Not self-contained
Self-Harness Approach:
- Agent improves its own harness using its own capabilities
- No external models required
- Truly autonomous improvement
- Scales to any model
Philosophical Difference:
"A model should be able to identify and fix its own systematic weaknesses, not rely on a smarter model to tell it what's wrong."
Self-Harness vs. Microsoft SkillOpt
Microsoft's SkillOpt also addresses self-improvement but focuses on skill refinement rather than harness optimization:
| Feature | Self-Harness | SkillOpt |
|---|---|---|
| Target | Agent harness (system-level) | Individual skills (task-level) |
| Scope | Cross-task patterns | Single-skill optimization |
| Method | Trace analysis + proposals | Skill execution feedback |
| Validation | Regression testing | Skill-specific metrics |
| Granularity | System prompts, tool wrappers | Skill code and parameters |
Complementary Approaches: Both can be used together—SkillOpt optimizes individual skills while Self-Harness improves the overarching framework.
Technical Deep Dive: How Self-Harness Works
Weakness Mining Algorithm
Input: Execution traces from failed and successful tasks
Output: Ranked list of weakness patterns with root causes
Pseudo-code:
def mine_weaknesses(traces, model):
    failures = [t for t in traces if not t.success]
    # Group failures by similarity
    clusters = cluster_by_error_pattern(failures)
    weaknesses = []
    for cluster in clusters:
        # Analyze common failure mode
        pattern = model.analyze_pattern(cluster)
        # Extract root cause
        root_cause = model.identify_root_cause(pattern, cluster)
        # Count frequency and impact
        frequency = len(cluster)
        impacted_tasks = extract_task_ids(cluster)
        weaknesses.append(Weakness(
            pattern=pattern,
            root_cause=root_cause,
            frequency=frequency,
            impacted_tasks=impacted_tasks
        ))
    # Rank by frequency × impact (impact approximated here by the number of impacted tasks)
    return sorted(weaknesses, key=lambda w: w.frequency * len(w.impacted_tasks), reverse=True)
Key Techniques:
- Error Clustering: Groups similar failures using embedding similarity
- Pattern Extraction: Identifies recurring error types via LLM analysis
- Root Cause Analysis: Traces failure back to harness gaps
- Impact Assessment: Prioritizes high-frequency, high-impact weaknesses
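The error-clustering step can be illustrated with a dependency-free stand-in. The article describes embedding similarity; the sketch below swaps in Jaccard overlap on whitespace tokens so it runs without a model, and the function names, the 0.5 threshold, and the sample error messages are all assumptions for illustration.

```python
from typing import List, Set


def tokens(message: str) -> Set[str]:
    return set(message.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_errors(messages: List[str], threshold: float = 0.5) -> List[List[str]]:
    """Greedy clustering: join the first cluster whose seed is similar enough."""
    clusters: List[List[str]] = []
    for msg in messages:
        for cluster in clusters:
            if jaccard(tokens(msg), tokens(cluster[0])) >= threshold:
                cluster.append(msg)
                break
        else:
            clusters.append([msg])  # no similar cluster found: start a new one
    return clusters


# Sample messages for illustration only.
errors = [
    "fatal: unable to auto-detect email address",
    "fatal: unable to auto-detect email address (got 'root@box')",
    "bash: cd: build: No such file or directory",
    "bash: cd: dist: No such file or directory",
]
clusters = cluster_errors(errors)
print([len(c) for c in clusters])  # [2, 2]
```

Greedy single-pass clustering depends on message order and on the threshold; a real system using embeddings would likely need a tuned cutoff or a proper clustering algorithm.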
Harness Proposal Generation
Input: A single weakness with context
Output: 3-5 diverse candidate harness modifications
Prompt Template:
You are analyzing your own execution failures to improve your harness.
Weakness Pattern:
{weakness.pattern}
Root Cause:
{weakness.root_cause}
Failed Task Examples:
{weakness.example_traces}
Current Harness:
{current_harness}
Generate 3-5 diverse, minimal harness modifications that would prevent this failure pattern.
For each proposal:
1. Describe the change
2. Explain why it addresses the root cause
3. Provide concrete implementation (system prompt, tool wrapper, or planning template)
4. Estimate impact on other tasks
Keep changes minimal and targeted. Avoid large rewrites.
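Filling a template like this is mechanical: Python's `str.format` supports attribute lookups such as `{weakness.pattern}` directly. The sketch below uses a shortened copy of the template; the `Weakness` dataclass, the `render_prompt` helper, and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Weakness:
    pattern: str
    root_cause: str
    example_traces: str


# Shortened copy of the template above, keeping the same placeholders.
TEMPLATE = (
    "You are analyzing your own execution failures to improve your harness.\n"
    "Weakness Pattern:\n{weakness.pattern}\n"
    "Root Cause:\n{weakness.root_cause}\n"
    "Failed Task Examples:\n{weakness.example_traces}\n"
    "Current Harness:\n{current_harness}\n"
)


def render_prompt(weakness: Weakness, current_harness: str) -> str:
    # Only the template is parsed for placeholders, so braces inside the
    # substituted values are inserted verbatim.
    return TEMPLATE.format(weakness=weakness, current_harness=current_harness)


weakness = Weakness(
    pattern="git commit fails on fresh repositories",
    root_cause="Missing git user.name and user.email configuration",
    example_traces="Task 23: git commit rejected",
)
prompt = render_prompt(weakness, "You are an AI agent with {tools}.")
print(prompt)
```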
Diversity Enforcement:
- Prompt Variation: Different temperature and top-p settings per proposal
- Approach Variety: Require proposals use different modification types
- Semantic Distance: Reject proposals too similar to existing ones
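The semantic-distance filter can be approximated without an embedding model. The sketch below uses `difflib.SequenceMatcher` from the standard library as a character-level stand-in for semantic similarity; the function name `filter_similar`, the 0.8 cutoff, and the candidate strings are assumptions for illustration.

```python
from difflib import SequenceMatcher
from typing import List


def filter_similar(proposals: List[str], max_similarity: float = 0.8) -> List[str]:
    """Keep a proposal only if it is not too close to one already kept."""
    kept: List[str] = []
    for proposal in proposals:
        if all(
            SequenceMatcher(None, proposal, other).ratio() < max_similarity
            for other in kept
        ):
            kept.append(proposal)
    return kept


# Candidate proposals for illustration only.
candidates = [
    "Before git commit, check that user.name and user.email are set.",
    "Before git commit, check that user.email and user.name are set.",
    "Wrap git commands so identity is configured automatically.",
]
kept = filter_similar(candidates)
print(kept)
```

The second candidate is a near-duplicate of the first and is dropped, while the wrapper-based proposal survives. Character overlap will miss paraphrases that share meaning but not wording, which is where real embeddings would do better.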
Proposal Validation System
Input: Current harness, proposed harness, validation task set
Output: Accept/reject decision with detailed metrics
Validation Workflow:
def validate_proposal(current_harness, proposed_harness, val_tasks):
    # Baseline performance
    baseline_results = run_tasks(current_harness, val_tasks)
    baseline_pass_rate = compute_pass_rate(baseline_results)
    # Proposed performance
    proposal_results = run_tasks(proposed_harness, val_tasks)
    proposal_pass_rate = compute_pass_rate(proposal_results)
    # Check for regressions
    regressions = [
        task for task in val_tasks
        if baseline_results[task].passed and not proposal_results[task].passed
    ]
    # Check for improvements
    improvements = [
        task for task in val_tasks
        if not baseline_results[task].passed and proposal_results[task].passed
    ]
    # Decision criteria
    if len(regressions) > 0:
        return Decision.REJECT, "Introduced regressions"
    if proposal_pass_rate <= baseline_pass_rate:
        return Decision.REJECT, "No net improvement"
    if len(improvements) == 0:
        return Decision.REJECT, "No tasks improved"
    return Decision.ACCEPT, f"Improved {len(improvements)} tasks"
Regression Testing:
- Full Re-evaluation: All validation tasks re-run with proposed harness
- Strict No-Regression: Even one new failure causes rejection
- Net Positive Requirement: Overall pass rate must increase
- Weakness-Specific Improvement: At least one targeted failure must pass
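The four rules above, including the weakness-specific check, can be exercised on toy pass/fail data. This is a hedged sketch: `accept_proposal`, the dictionary-based result format, and the task names are hypothetical. Note that once regressions are ruled out, "net positive" reduces to "at least one task was fixed", which is how it is checked here.

```python
from typing import Dict, Set, Tuple


def accept_proposal(
    baseline: Dict[str, bool],
    proposal: Dict[str, bool],
    target_tasks: Set[str],
) -> Tuple[bool, str]:
    """Apply the no-regression, net-positive, and weakness-specific rules."""
    regressions = [t for t in baseline if baseline[t] and not proposal[t]]
    fixed = [t for t in baseline if not baseline[t] and proposal[t]]
    if regressions:
        return False, f"regressions: {sorted(regressions)}"
    if not fixed:
        return False, "no net improvement"
    if not target_tasks & set(fixed):
        return False, "target weakness not addressed"
    return True, f"fixed: {sorted(fixed)}"


# Toy results: task name -> passed?
baseline = {"t1": True, "t2": False, "t3": False}
good = {"t1": True, "t2": True, "t3": False}
bad = {"t1": False, "t2": True, "t3": True}

print(accept_proposal(baseline, good, {"t2"}))  # (True, "fixed: ['t2']")
print(accept_proposal(baseline, bad, {"t2"}))   # (False, "regressions: ['t1']")
print(accept_proposal(baseline, good, {"t3"}))  # (False, 'target weakness not addressed')
```

The second case fixes two tasks but breaks one, so the strict no-regression rule rejects it even though its overall pass rate is higher.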
Limitations and Future Directions
Current Limitations
1. Computational Cost
- Each iteration requires running the full benchmark suite twice (baseline + proposal)
- 89 tasks × 2 runs × 5 iterations = 890 agent runs
- For expensive models, costs can accumulate ($50-100+ per full optimization)
2. Local Optima Risk
- Greedy acceptance of proposals may miss the globally optimal harness
- No backtracking if early proposals lead to dead ends
- Might plateau before reaching human-expert-level harness
3. Benchmark Overfitting
- Optimizing specifically for Terminal-Bench 2.0
- May not generalize to other domains or task types
- Held-out validation set mitigates but doesn't eliminate risk
4. Limited to Harness-Fixable Failures
- Cannot improve base model capabilities
- Some failures are fundamental model limitations
- Only helps with failures caused by harness gaps
5. Minimal Harness Assumption
- Starts from a very basic harness
- May not apply to already-optimized production harnesses
- Unclear how it composes with existing harness engineering
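The run count quoted under Computational Cost is simple multiplication, and dividing the quoted dollar range by it gives an implied per-run cost. The sketch below only rearranges the article's own figures; the function name is illustrative, and the per-run values are derived estimates, not numbers reported by the paper.

```python
def agent_runs(num_tasks: int, runs_per_iteration: int, iterations: int) -> int:
    """Total agent executions for one optimization pass."""
    return num_tasks * runs_per_iteration * iterations


total = agent_runs(num_tasks=89, runs_per_iteration=2, iterations=5)

# Implied cost per run if a full optimization costs $50 to $100.
per_run_low = 50 / total
per_run_high = 100 / total

print(total)                                          # 890
print(f"${per_run_low:.3f} to ${per_run_high:.3f}")   # $0.056 to $0.112
```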
Promising Future Directions
1. Cross-Model Harness Transfer
Question: Can harness improvements from Model A transfer to Model B?
Approach: Run Self-Harness on a cheap model, then transfer the resulting harness to an expensive model
Potential: Reduce optimization cost for expensive frontier models
2. Multi-Benchmark Generalization
Question: Can a harness be optimized across multiple benchmarks simultaneously?
Approach: Validate proposals on Terminal-Bench 2.0 + SWE-bench + GAIA
Potential: More generalizable harnesses that work across domains
3. Compositional Harness Modules
Question: Can we build libraries of reusable harness modules?
Approach: Extract successful patterns into plug-and-play components
Potential: Faster initial harness setup for new models
4. Human-in-the-Loop Validation
Question: Can human review improve Self-Harness proposals?
Approach: Expert reviews edge cases and suggests refinements
Potential: Combine automation speed with human insight
5. Continuous Online Improvement
Question: Can Self-Harness improve during production deployment?
Approach: Mine weaknesses from real user interactions, propose fixes
Potential: Agents that continuously adapt to real-world usage patterns
Practical Implications
For AI Agent Developers
What This Means:
- Faster Iteration: Spend less time manually tuning harnesses for new models
- Model-Agnostic Optimization: Same framework works across GPT, Claude, Gemini, etc.
- Data-Driven Improvements: Systematic analysis replaces trial-and-error
- Reproducible Results: Automated process ensures consistent optimization
How to Apply:
- Instrument Your Agent: Capture execution traces (tool calls, errors, outcomes)
- Build Validation Suite: Create held-out test set for regression testing
- Implement Self-Harness Loop: Adapt the three-stage framework to your agent
- Monitor Convergence: Track pass rate improvements across iterations
- Merge Improvements: Integrate validated proposals into production harness
For AI Researchers
Research Questions Opened:
- How do Self-Harness improvements compare to meta-learning approaches?
- Can we predict which models will benefit most from Self-Harness?
- What is the theoretical limit of harness-only improvements?
- How do self-improved harnesses generalize to out-of-distribution tasks?
Benchmark Implications:
- Harness Standardization: Should benchmarks specify minimal standard harnesses?
- Leaderboard Fairness: How to account for harness engineering in rankings?
- Evaluation Protocols: Should we report both minimal-harness and optimized-harness scores?
For Organizations Deploying Agents
Strategic Considerations:
When to Use Self-Harness:
- ✅ Deploying agents on new models frequently
- ✅ Operating at scale where manual tuning doesn't scale
- ✅ Have diverse task types requiring different optimizations
- ✅ Want systematic, reproducible improvement process
When to Stick with Human Engineering:
- ⚠️ Small-scale deployments with stable models
- ⚠️ Highly domain-specific tasks requiring expert knowledge
- ⚠️ Strict safety/compliance requirements needing human review
- ⚠️ Limited compute budget for iterative optimization
Hybrid Approach:
1. Human experts design initial harness architecture
2. Self-Harness optimizes model-specific details
3. Humans review and approve proposed changes
4. Deploy optimized harness to production
5. Continuous Self-Harness monitoring for new failure patterns
Connection to Broader Trends
The Agent Harness Engineering Movement
Self-Harness builds on the emerging discipline of agent harness engineering, where differentiation comes from the scaffolding around the model, not just the model itself.
Timeline:
- Feb 2026: Mitchell Hashimoto coins "harness engineering"
- Mar 2026: LangChain reports +13.7% Terminal-Bench gain (harness-only)
- May 2026: Stanford IRIS meta-harness research published
- Jun 2026: Self-Harness demonstrates autonomous harness improvement
Philosophical Shift:
"Frontier models are table stakes. Differentiation is the harness—the loop, tools, middleware, and verification around the model."
Comparison to Loop Engineering
Loop engineering focuses on designing effective agent execution loops—the repeated cycle of planning, action, observation, and refinement.
Self-Harness Contribution:
- Loop engineering defines the architecture
- Self-Harness optimizes the details automatically
- Together: Human-designed loops + AI-optimized harnesses
Terminal-Bench 2.0 as Standard Testbed
The choice of Terminal-Bench 2.0 as the evaluation benchmark is significant:
Why Terminal-Bench 2.0 Works for Self-Harness:
- ✅ Diverse Tasks: 89 tasks across ML, systems, security, biology
- ✅ Real-World: Inspired by actual developer/sysadmin workflows
- ✅ Reproducible: Containerized environments ensure consistency
- ✅ Industry Standard: Used by virtually all frontier labs
Benchmark Scores Context:
- Top agent+model combinations: ~80-82% (ForgeCode, TongAgents)
- Top direct models: ~73% (GPT-5.5)
- Self-Harness improved models: 38-62% (from minimal harness)
- Gap shows headroom for further harness engineering
How to Access and Reproduce
Paper and Code
Paper:
- arXiv: arXiv:2606.09498
- PDF: arxiv.org/pdf/2606.09498
- Published: June 8, 2026
Authors:
- Hangfan Zhang, Shao Zhang, Kangcong Li, Chen Zhang
- Yang Chen, Yiqun Zhang, Lei Bai, Shuyue Hu
Code Availability:
- Paper mentions code release but URL not yet in arXiv submission
- Check authors' GitHub repositories or paper homepage for updates
Reproduction Guide
Prerequisites:
# Terminal-Bench 2.0 setup
git clone https://github.com/laude-institute/terminal-bench-2.0
cd terminal-bench-2.0
pip install -r requirements.txt
# Base model access (choose one)
# - MiniMax M2.5 API
# - Qwen3.5-35B-A3B (via ollama or API)
# - GLM-5 API
# Harbor framework
pip install harbor-agents
Running Self-Harness (illustrative only: the authors' code is not yet released, so the self_harness package and every name imported below are placeholders):
# NOTE: hypothetical API, shown to convey the workflow rather than a real package
from self_harness import MinimalHarness, SelfHarnessOptimizer, evaluate
from terminal_bench import load_benchmark, split_tasks
# Load benchmark
tasks = load_benchmark("terminal-bench-2.0")
train_tasks, val_tasks = split_tasks(tasks, ratio=0.7)
# Initialize with minimal harness
minimal_harness = MinimalHarness(
    model="minimax-m2.5",
    system_prompt="You are an AI agent with access to terminal commands."
)
# Run Self-Harness optimization
optimizer = SelfHarnessOptimizer(
    base_harness=minimal_harness,
    max_iterations=10,
    validation_tasks=val_tasks
)
optimized_harness = optimizer.optimize(train_tasks)
# Evaluate
baseline_score = evaluate(minimal_harness, val_tasks)
optimized_score = evaluate(optimized_harness, val_tasks)
print(f"Baseline: {baseline_score:.1%}")
print(f"Optimized: {optimized_score:.1%}")
print(f"Gain: +{optimized_score - baseline_score:.1%}")
Expected Results (MiniMax M2.5):
Baseline: 40.5%
Optimized: 61.9%
Gain: +21.4%
Sources and References
Primary Source
Paper:
- Title: Self-Harness: Harnesses That Improve Themselves
- Authors: Hangfan Zhang, Shao Zhang, Kangcong Li, Chen Zhang, Yang Chen, Yiqun Zhang, Lei Bai, Shuyue Hu
- Published: June 8, 2026 (arXiv)
- arXiv ID: 2606.09498 [cs.CL]
- DOI: https://doi.org/10.48550/arXiv.2606.09498
Related Research
- Terminal-Bench 2.0 Paper — The benchmark used for evaluation
- Stanford IRIS Meta-Harness — Related harness optimization research
- Harbor Framework — Agent evaluation framework
Related Reading
- Terminal-Bench 2.0: The AI Agent Benchmark That Actually Matters
- Agent Harness Engineering: When the Model Stays Fixed
- Loop Engineering: Coding Agents and Claude Code Guide
- Microsoft SkillOpt: Self-Improving Agent Skills
- Anthropic Engineer Loops and Harness Engineering
- Claude Fable 5 and Mythos 5: SOTA Autonomy
Self-Harness was published on arXiv on June 8, 2026, introducing a paradigm where LLM-based agents autonomously improve their own operating harnesses through weakness mining, harness proposals, and validation—achieving substantial performance gains on Terminal-Bench 2.0 across diverse base models without requiring human engineers or stronger external models.