Self-Harness: AI Agents That Improve Their Own Operating Framework
Self-Harness introduces a new paradigm where LLM-based agents autonomously improve their own harnesses without human engineers or stronger external models, achieving 33-60% relative pass-rate gains (14-21 absolute points) on Terminal-Bench 2.0.
TL;DR: Published June 8, 2026 on arXiv, "Self-Harness: Harnesses That Improve Themselves" introduces a paradigm where LLM-based agents autonomously optimize their own operating frameworks without human engineers or stronger external models. Using a three-stage loop (Weakness Mining, Harness Proposal, Proposal Validation), Self-Harness achieved consistent pass-rate improvements on Terminal-Bench 2.0: MiniMax M2.5 improved from 40.5% to 61.9% (+52.8% relative), Qwen3.5-35B-A3B from 23.8% to 38.1% (+60.1%), and GLM-5 from 42.9% to 57.1% (+33.1%). The results suggest that agents can turn model-specific weaknesses into concrete, executable harness improvements.
The Harness Engineering Problem
The performance of LLM-based agents is jointly shaped by two critical factors:
Base Model Capabilities — The underlying LLM's reasoning, coding, and knowledge
Agent Harness — The scaffolding that mediates interaction with the environment
While much attention focuses on improving base models, recent evidence shows that harness engineering can yield 10-15 point improvements on benchmarks while keeping the base model fixed.
The Scaling Problem with Human-Designed Harnesses
Current State:
Harnesses are largely engineered by human experts
Effective harness design is inherently model-specific
Different models exhibit distinct failure patterns and behaviors
Modern LLMs are increasingly diverse and rapidly evolving
Why This Doesn't Scale:
graph TD
A[New Model Released] --> B[Human Engineers Analyze]
B --> C[Design Model-Specific Harness]
C --> D[Manual Testing & Iteration]
D --> E{Performance OK?}
E -->|No| B
E -->|Yes| F[Deploy]
G[Another Model Released] --> B
style B fill:#ff6b6b
style C fill:#ff6b6b
style D fill:#ff6b6b
The Bottleneck: Human engineers can't keep pace with model diversity and evolution. Each new model family (GPT, Claude, Gemini, Qwen, GLM, MiniMax, etc.) exhibits unique behaviors requiring custom harness design.
Introducing Self-Harness: Agents That Fix Themselves
Core Innovation: What if the agent could improve its own harness, without relying on human engineers or stronger external models?
The Self-Harness Paradigm
Definition: Self-Harness is an iterative framework where an LLM-based agent autonomously:
Identifies its own failure patterns through execution trace analysis
Proposes targeted modifications to its operating harness
Validates improvements through regression testing
Iterates until performance converges
Key Insight: Instead of human experts manually engineering model-specific fixes, the model itself discovers and implements what it needs to succeed.
The Three-Stage Self-Harness Loop
Architecture Overview
graph LR
A[Execution Traces] --> B[Stage 1: Weakness Mining]
B --> C[Identified Failure Patterns]
C --> D[Stage 2: Harness Proposal]
D --> E[Candidate Harness Modifications]
E --> F[Stage 3: Proposal Validation]
F --> G{Regression Tests Pass?}
G -->|Yes| H[Accept Changes]
G -->|No| I[Reject Changes]
H --> J[Updated Harness]
I --> D
J --> K[Run Benchmark Tasks]
K --> A
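The loop in the diagram can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the harness is modeled as a plain string, and `run_tasks`, `mine`, `propose`, and `validate` are hypothetical callables standing in for the three stages.

```python
from typing import Callable, List


def self_harness_loop(
    harness: str,
    run_tasks: Callable[[str], List[dict]],
    mine: Callable[[List[dict]], List[str]],
    propose: Callable[[str, str], List[str]],
    validate: Callable[[str, str], bool],
    max_iterations: int = 5,
) -> str:
    """Repeat mine -> propose -> validate until no proposal is accepted."""
    for _ in range(max_iterations):
        traces = run_tasks(harness)               # collect execution traces
        accepted = False
        for weakness in mine(traces):             # Stage 1: weakness mining
            for candidate in propose(harness, weakness):  # Stage 2: proposals
                if validate(harness, candidate):  # Stage 3: regression check
                    harness = candidate           # merge the accepted change
                    accepted = True
                    break                         # move on to the next weakness
        if not accepted:
            break                                 # converged: nothing was accepted
    return harness
```

Rejected candidates are simply skipped here, whereas the diagram routes a rejection back to the proposal stage; a fuller version could feed the rejection reason into the next proposal round.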
Stage 1: Weakness Mining
Purpose: Identify model-specific failure patterns from execution traces.
Process:
Collect Execution Traces — Run the agent on benchmark tasks, capturing:
Tool calls made
Responses received
Reasoning steps
Terminal output
Error messages
Success/failure status
Pattern Analysis — The agent analyzes its own traces to discover:
Recurring error types
Missing error handling
Inefficient tool usage patterns
Context management failures
Planning mistakes
Stuck loops or infinite retries
Failure Categorization — Weaknesses are grouped by type:
Tool Selection Errors: Wrong tool for the task
Context Errors: Lost or misinterpreted information
Planning Errors: Poor task decomposition
Error Handling: Failed to recover from failures
Verification Gaps: Didn't validate assumptions
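One way to picture the data behind this stage is a trace record that carries the captured fields plus a category label, with a tally of failures per category. The sketch below is an assumption about shape only: `ExecutionTrace`, its field names, and `failure_counts` are hypothetical and do not come from the paper.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ExecutionTrace:
    """One benchmark run, holding the fields listed under 'Collect Execution Traces'."""
    task_id: str
    success: bool
    tool_calls: List[str] = field(default_factory=list)
    responses: List[str] = field(default_factory=list)
    reasoning_steps: List[str] = field(default_factory=list)
    terminal_output: str = ""
    error_messages: List[str] = field(default_factory=list)
    category: Optional[str] = None  # filled in during failure categorization


def failure_counts(traces: List[ExecutionTrace]) -> Counter:
    """Count failed traces per weakness category."""
    return Counter(t.category or "Uncategorized" for t in traces if not t.success)
```

Calling `failure_counts(traces).most_common()` would then give a frequency-ordered view of where the agent fails most often.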
Example Weakness Discovery:
Weakness ID: W-042
Pattern: Agent frequently fails git operations by forgetting to configure user.name
Frequency: 12 failures across 89 tasks
Impact: Blocks commit-related tasks
Category: Tool prerequisite missing
Stage 2: Harness Proposal
Purpose: Generate diverse yet minimal harness modifications tied to discovered weaknesses.
Design Principles:
✅ Targeted: Each proposal addresses a specific weakness
✅ Minimal: Small, focused changes rather than large rewrites
✅ Diverse: Generate multiple candidate solutions per weakness
✅ Testable: Changes must be verifiable through benchmark tasks
Proposal Types:
1. System Prompt Modifications
# Before
You are an AI agent with access to terminal commands.
# After (Self-Harness Proposal)
You are an AI agent with access to terminal commands.
+ Before running git commit, always verify git user.name and user.email are configured.
+ If not set, configure them using: git config user.name "Agent" && git config user.email "agent@localhost"
2. Tool Wrapper Additions
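import subprocess

# Self-Harness proposes wrapping git commands
def execute_git_command(cmd):
    # Ensure git is configured before any commit operation
    # (check_git_config is a helper assumed to be defined elsewhere in the harness)
    if "commit" in cmd:
        check_git_config()
    return subprocess.run(cmd, shell=True)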
3. Validation Step Injection
import os

# Self-Harness proposes adding verification after file operations
# (write_file and read_file are harness helpers assumed to exist)
def create_file(path, content):
    write_file(path, content)
    # Validate file was created successfully
    if not os.path.exists(path):
        raise FileNotFoundError(f"Failed to create {path}")
    # Validate content matches
    if read_file(path) != content:
        raise ValueError("File content mismatch")
4. Planning Template Updates
# Before
Plan: {steps}
# After (Self-Harness Proposal)
Plan:
+ 1. Verify prerequisites (dependencies, configs, permissions)
{steps}
+ N+1. Verify expected outcomes
+ N+2. Clean up temporary resources
Diversity Mechanism:
For each weakness, Self-Harness generates 3-5 candidate proposals using different approaches:
Prompt-based guidance
Tool wrapper modifications
Validation additions
Planning template changes
Error recovery procedures
Stage 3: Proposal Validation
Purpose: Accept candidate edits only after regression testing to prevent breaking existing capabilities.
Validation Pipeline:
1. Held-Out Test Set
Training Set: 70% of Terminal-Bench 2.0 tasks (62 tasks)
Validation Set: 30% held-out tasks (27 tasks)
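A reproducible split with these proportions could look like the sketch below. The function name, the seed, and the use of integer task IDs are illustrative assumptions; the source does not say how tasks were assigned to each set.

```python
import random


def split_tasks(task_ids, train_fraction=0.7, seed=0):
    """Shuffle task IDs deterministically and cut them into two sets."""
    shuffled = list(task_ids)
    random.Random(seed).shuffle(shuffled)        # fixed seed keeps the split stable
    cut = int(len(shuffled) * train_fraction)    # 89 * 0.7 truncates to 62
    return shuffled[:cut], shuffled[cut:]


train_tasks, val_tasks = split_tasks(range(89))
print(len(train_tasks), len(val_tasks))
```

Fixing the seed matters because the validation set has to stay the same across iterations for regression results to be comparable.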
2. Regression Testing
def validate_proposal(current_harness, proposed_harness, tasks):
    baseline_results = run_benchmark(current_harness, tasks)
    proposal_results = run_benchmark(proposed_harness, tasks)
    # Accept only if:
    # 1. No regression on previously passing tasks
    # 2. Net improvement in pass rate
    return (
        no_regression(baseline_results, proposal_results) and
        net_improvement(baseline_results, proposal_results)
    )
3. Acceptance Criteria
A proposal is accepted if:
✅ No Breaking Changes: All previously passing tasks still pass
✅ Net Positive: Overall pass rate improves
✅ Weakness Addressed: At least one failure from the target weakness now passes
✅ No New Failures: Doesn't introduce failures in unrelated tasks
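The four criteria can be checked against per-task pass/fail maps. The sketch below assumes results are dictionaries from task ID to a boolean and that the tasks tied to the target weakness are known; `accept_proposal` and its parameters are hypothetical names.

```python
from typing import Dict, Set


def accept_proposal(
    baseline: Dict[str, bool],
    proposal: Dict[str, bool],
    target_tasks: Set[str],
) -> bool:
    """Apply the acceptance criteria to two result maps over the same tasks."""
    # Tasks that passed before and fail now (covers both "No Breaking Changes"
    # and "No New Failures", which collapse to the same check here).
    regressions = [t for t, ok in baseline.items() if ok and not proposal[t]]
    # Tasks that failed before and pass now.
    fixed = {t for t, ok in baseline.items() if not ok and proposal[t]}

    no_regression = not regressions
    net_positive = sum(proposal.values()) > sum(baseline.values())
    weakness_addressed = bool(fixed & target_tasks)
    return no_regression and net_positive and weakness_addressed
```

Note that a proposal which fixes an unrelated task but none of the targeted ones is rejected under this reading of the criteria.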
4. Iterative Application
Once validated, the proposal is:
Merged into the harness
Used for subsequent weakness mining
Compound improvements build on previous fixes
Safety Mechanism:
# Only minimal, targeted changes are accepted
if change_diff_lines > MAX_CHANGE_SIZE:
    reject_proposal("Too large, split into smaller changes")
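The snippet above leaves open how the change size is measured. One plausible reading, sketched here with Python's standard `difflib`, counts added and removed lines between the old and new harness text. The threshold value and the function names are assumptions for illustration.

```python
import difflib

MAX_CHANGE_SIZE = 20  # hypothetical threshold; the source does not give a value


def count_changed_lines(old: str, new: str) -> int:
    """Count lines added or removed between two versions of a harness."""
    diff = difflib.ndiff(old.splitlines(), new.splitlines())
    # ndiff prefixes additions with "+ " and removals with "- ".
    return sum(1 for line in diff if line.startswith(("+ ", "- ")))


def is_small_enough(old: str, new: str, limit: int = MAX_CHANGE_SIZE) -> bool:
    """Return True when the proposal stays within the size limit."""
    return count_changed_lines(old, new) <= limit
```

An edited line counts twice under this measure (one removal plus one addition), which errs on the side of rejecting large rewrites.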
Experimental Results: Terminal-Bench 2.0
The paper evaluated Self-Harness on Terminal-Bench 2.0, the industry-standard benchmark for AI agent evaluation comprising 89 carefully curated tasks across diverse domains.
Experimental Setup
Base Models Tested:
MiniMax M2.5 — Chinese frontier model from MiniMax
Qwen3.5-35B-A3B — Open-weight model from Alibaba Cloud
GLM-5 — Bilingual model from Zhipu AI
Why These Models?
Diverse model families (not all OpenAI/Anthropic)
Different architectures and training approaches
Varying baseline capabilities (23.8% to 42.9% initial pass rate)
Demonstrates generalization across model types
Minimal Initial Harness:
Basic system prompt with role description
Standard tool access (bash, file operations, web search)
No model-specific optimizations
No domain-specific guidance
Performance Improvements
| Model | Initial Pass Rate | Final Pass Rate | Absolute Gain (points) | Relative Gain |
|---|---|---|---|---|
| MiniMax M2.5 | 40.5% | 61.9% | +21.4 | +52.8% |
| Qwen3.5-35B-A3B | 23.8% | 38.1% | +14.3 | +60.1% |
| GLM-5 | 42.9% | 57.1% | +14.2 | +33.1% |
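The two gain columns follow directly from the pass rates: the absolute gain is the difference in percentage points, and the relative gain divides that difference by the initial rate. A short check, using only the figures reported above:

```python
def gains(initial: float, final: float) -> tuple:
    """Return (absolute gain in points, relative gain in percent), rounded to 0.1."""
    absolute = final - initial
    relative = absolute / initial * 100
    return round(absolute, 1), round(relative, 1)


results = {
    "MiniMax M2.5": (40.5, 61.9),
    "Qwen3.5-35B-A3B": (23.8, 38.1),
    "GLM-5": (42.9, 57.1),
}

for model, (initial, final) in results.items():
    absolute, relative = gains(initial, final)
    print(f"{model}: +{absolute} points, +{relative}% relative")
```

Keeping the two measures apart matters here: Qwen3.5-35B-A3B has the largest relative gain but not the largest absolute one, because it starts from the lowest baseline.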
Key Findings:
1. Consistent Improvements Across All Models
All three models showed substantial gains (14-21 absolute points)
Improvements were not limited to high-performing models
Even the weakest baseline (Qwen 23.8%) saw 60% relative improvement
2. Model-Specific Harness Modifications
Different models generated different harness changes
MiniMax M2.5 focused on error recovery patterns
Qwen3.5 added more verification steps
GLM-5 improved context management
3. Non-Generic Improvements
Qualitative analysis showed changes were not just adding generic instructions
Each model identified unique weaknesses specific to its behavior
The paper provides detailed examples of discovered weaknesses and resulting harness modifications.
Example 1: Git Configuration Failures (MiniMax M2.5)
Weakness Mining Discovery:
Pattern: 8 failures on tasks requiring git commits
Root Cause: Missing git user.name and user.email configuration
Example Traces:
- Task 23: "Create repo and commit changes" → FAIL (git commit rejected)
- Task 45: "Initialize project with git" → FAIL (identity not configured)
Harness Proposal Generated:
System Prompt Addition:
+ Git Configuration Prerequisite:
+ Before any git commit operation, verify configuration:
+ - Check: git config user.name
+ - Check: git config user.email
+ If either is unset, configure defaults:
+ git config user.name "Agent"
+ git config user.email "agent@localhost"
Validation Results:
8 previously failing tasks now pass
0 regressions on other tasks
✅ Accepted and merged
Impact: +9.0% pass rate improvement on git-related tasks
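The accepted fix here is a prompt addition, but the same prerequisite could also be enforced in code. The sketch below is an illustrative alternative, not something the paper describes: `ensure_git_identity` is a hypothetical helper that relies on `git config --get` exiting non-zero when a key is unset, and it takes an injectable `run` callable so it can be exercised without a real repository.

```python
import subprocess


def ensure_git_identity(run=subprocess.run, name="Agent", email="agent@localhost"):
    """Set user.name / user.email in the current repo if either is unset.

    Returns the list of keys that had to be configured.
    """
    configured = []
    for key, default in (("user.name", name), ("user.email", email)):
        # `git config --get <key>` exits non-zero when the key has no value.
        probe = run(["git", "config", "--get", key], capture_output=True, text=True)
        if probe.returncode != 0 or not probe.stdout.strip():
            run(["git", "config", key, default], check=True)
            configured.append(key)
    return configured
```

Passing commands as argument lists avoids shell quoting issues, and writing to the repository-level config leaves any global identity untouched.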
Example 2: File Verification Gaps (Qwen3.5-35B-A3B)
Weakness Mining Discovery:
Pattern: 12 failures on tasks involving file operations
Root Cause: Agent assumes file operations succeed without verification
Example Traces:
- Task 12: Created config.json but didn't verify, later steps failed
- Task 34: Assumed mkdir succeeded, then tried to cd into non-existent directory
Harness Proposal Generated:
Tool Wrapper Addition:
  def create_file(path, content):
      execute_bash(f"cat > {path} << 'EOF'\n{content}\nEOF")
+     # Verify file was created
+     if not execute_bash(f"test -f {path}"):
+         raise FileNotFoundError(f"Failed to create {path}")
+     # Verify content matches
+     actual = execute_bash(f"cat {path}")
+     if actual.strip() != content.strip():
+         raise ValueError(f"Content mismatch in {path}")
Validation Results:
10 of 12 failing tasks now pass
2 tasks showed different failures (unrelated to file verification)
0 regressions
✅ Accepted and merged
Impact: +11.2% pass rate improvement on file operation tasks
Example 3: Context Loss in Multi-Step Tasks (GLM-5)
Weakness Mining Discovery:
Pattern: 7 failures on long, multi-step tasks
Root Cause: Model loses track of intermediate results and task state
Example Traces:
- Task 56: Forgot database connection string from step 2 by step 5
- Task 78: Lost API key after environment setup, failed authentication
Harness Proposal Generated:
Planning Template Update:
Plan for completing task:
+ [State Tracking]
+ - Track: {key variables to remember}
+ - Update tracker after each step completion
+
1. {step 1}
+ → Record outcome: {what to remember}
2. {step 2}
+ → Record outcome: {what to remember}
...
N. {final step}
+ → Verify: All tracked variables are still accessible
Validation Results:
6 of 7 failing tasks now pass
1 task failed due to unrelated timeout issue
0 regressions
✅ Accepted and merged
Impact: +6.7% pass rate improvement on multi-step tasks
Comparison to Related Approaches
Self-Harness vs. Human Harness Engineering
| Aspect | Human Engineering | Self-Harness |
|---|---|---|
| Speed | Days to weeks per model | Hours (automated) |
| Scalability | Limited by human expertise | Scales with compute |
| Model-Specificity | Requires manual analysis | Automatically discovers patterns |
| Consistency | Varies by engineer skill | Systematic and reproducible |
| Cost | High (expert time) | Low (compute only) |
| Adaptation | Manual updates needed | Continuous self-improvement |
When Human Engineering Still Matters:
✅ Initial harness architecture design
✅ Domain-specific tool selection
✅ Safety and compliance guardrails
✅ Production deployment decisions
Self-Harness vs. Stronger External Models
Some approaches use stronger models (e.g., GPT-5.5) to improve weaker agents. Self-Harness differs:
External Scaffolding Approach:
Uses GPT-5.5 to analyze GPT-4 agent failures
GPT-5.5 proposes fixes for GPT-4 harness
Requires access to superior model
Not self-contained
Self-Harness Approach:
Agent improves its own harness using its own capabilities
No external models required
Truly autonomous improvement
Scales to any model
Philosophical Difference:
"A model should be able to identify and fix its own systematic weaknesses, not rely on a smarter model to tell it what's wrong."
Self-Harness vs. Microsoft SkillOpt
Microsoft's SkillOpt also addresses self-improvement but focuses on skill refinement rather than harness optimization:
| Feature | Self-Harness | SkillOpt |
|---|---|---|
| Target | Agent harness (system-level) | Individual skills (task-level) |
| Scope | Cross-task patterns | Single-skill optimization |
| Method | Trace analysis + proposals | Skill execution feedback |
| Validation | Regression testing | Skill-specific metrics |
| Granularity | System prompts, tool wrappers | Skill code and parameters |
Complementary Approaches: Both can be used together—SkillOpt optimizes individual skills while Self-Harness improves the overarching framework.
Technical Deep Dive: How Self-Harness Works
Weakness Mining Algorithm
Input: Execution traces from failed and successful tasks
Output: Ranked list of weakness patterns with their root causes
Pseudo-code:
def mine_weaknesses(traces, model):
    failures = [t for t in traces if not t.success]
    # Group failures by similarity
    clusters = cluster_by_error_pattern(failures)
    weaknesses = []
    for cluster in clusters:
        # Analyze common failure mode
        pattern = model.analyze_pattern(cluster)
        # Extract root cause
        root_cause = model.identify_root_cause(pattern, cluster)
        # Count frequency and impact
        frequency = len(cluster)
        impacted_tasks = extract_task_ids(cluster)
        weaknesses.append(Weakness(
            pattern=pattern,
            root_cause=root_cause,
            frequency=frequency,
            impacted_tasks=impacted_tasks
        ))
    # Rank by frequency, most common first
    return sorted(weaknesses, key=lambda w: w.frequency, reverse=True)
Key Techniques:
Error Clustering: Groups similar failures using embedding similarity
Pattern Extraction: Identifies recurring error types via LLM analysis
Root Cause Analysis: Traces failure back to harness gaps
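To make the error-clustering step concrete, here is a small self-contained sketch of `cluster_by_error_pattern`, the helper named in the pseudo-code above. It operates on raw error strings and swaps the embedding model for a bag-of-words vector so that it runs with the standard library only; the greedy assignment and the 0.5 threshold are assumptions, not details from the paper.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster_by_error_pattern(errors, threshold=0.5):
    """Greedily assign each error to the first cluster whose seed is similar enough."""
    clusters = []
    for err in errors:
        vec = embed(err)
        for cluster in clusters:
            if cosine(vec, embed(cluster[0])) >= threshold:
                cluster.append(err)
                break
        else:
            clusters.append([err])  # no similar cluster found: start a new one
    return clusters
```

With a real embedding model, only `embed` would change; the similarity test and the greedy grouping stay the same.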
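Harness Proposal Generation
Input: A mined weakness (pattern, root cause, example traces) and the current harness
Output: 3-5 diverse candidate harness modifications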
Prompt Template:
You are analyzing your own execution failures to improve your harness.
Weakness Pattern:
{weakness.pattern}
Root Cause:
{weakness.root_cause}
Failed Task Examples:
{weakness.example_traces}
Current Harness:
{current_harness}
Generate 3-5 diverse, minimal harness modifications that would prevent this failure pattern.
For each proposal:
1. Describe the change
2. Explain why it addresses the root cause
3. Provide concrete implementation (system prompt, tool wrapper, or planning template)
4. Estimate impact on other tasks
Keep changes minimal and targeted. Avoid large rewrites.
Diversity Enforcement:
Prompt Variation: Different temperature and top-p settings per proposal
Approach Variety: Require proposals use different modification types
Semantic Distance: Reject proposals too similar to existing ones
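The semantic-distance filter could work roughly as follows. This sketch uses `difflib.SequenceMatcher` as a cheap character-level stand-in for semantic similarity, and the 0.8 cutoff is an arbitrary illustrative value; the source specifies neither the similarity measure nor the threshold.

```python
from difflib import SequenceMatcher


def filter_similar(proposals, max_similarity=0.8):
    """Keep a proposal only if it is sufficiently different from all kept ones."""
    kept = []
    for proposal in proposals:
        if all(
            SequenceMatcher(None, proposal, other).ratio() < max_similarity
            for other in kept
        ):
            kept.append(proposal)
    return kept
```

Because proposals are compared in order, earlier candidates win ties; sorting by estimated impact before filtering would be one way to control which near-duplicate survives.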
Proposal Validation System
Input: Current harness, proposed harness, validation task set
Output: Accept/reject decision with detailed metrics
Validation Workflow:
def validate_proposal(current_harness, proposed_harness, val_tasks):
    # Baseline performance
    baseline_results = run_tasks(current_harness, val_tasks)
    baseline_pass_rate = compute_pass_rate(baseline_results)
    # Proposed performance
    proposal_results = run_tasks(proposed_harness, val_tasks)
    proposal_pass_rate = compute_pass_rate(proposal_results)
    # Check for regressions
    regressions = [
        task for task in val_tasks
        if baseline_results[task].passed and not proposal_results[task].passed
    ]
    # Check for improvements
    improvements = [
        task for task in val_tasks
        if not baseline_results[task].passed and proposal_results[task].passed
    ]
    # Decision criteria
    if len(regressions) > 0:
        return Decision.REJECT, "Introduced regressions"
    if proposal_pass_rate <= baseline_pass_rate:
        return Decision.REJECT, "No net improvement"
    if len(improvements) == 0:
        return Decision.REJECT, "No tasks improved"
    return Decision.ACCEPT, f"Improved {len(improvements)} tasks"
Regression Testing:
Full Re-evaluation: All validation tasks re-run with proposed harness
Strict No-Regression: Even one new failure causes rejection
Net Positive Requirement: Overall pass rate must increase
Weakness-Specific Improvement: At least one targeted failure must pass
Limitations and Future Directions
Current Limitations
1. Computational Cost
Each iteration requires running full benchmark suite twice (baseline + proposal)
89 tasks × 2 runs × 5 iterations = 890 agent runs
For expensive models, costs can accumulate ($50-100+ per full optimization)
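The run count above is simple multiplication, and a small helper makes it easy to vary the inputs. The per-run price in the example is a made-up placeholder, not a figure from the paper.

```python
def total_agent_runs(num_tasks: int = 89, runs_per_iteration: int = 2, iterations: int = 5) -> int:
    """Number of agent runs needed for a full optimization."""
    return num_tasks * runs_per_iteration * iterations


def estimated_cost(total_runs: int, cost_per_run_usd: float) -> float:
    """Total spend for a given per-run price (the price is an assumption)."""
    return total_runs * cost_per_run_usd


runs = total_agent_runs()
print(runs)  # 890
print(round(estimated_cost(runs, cost_per_run_usd=0.10), 2))  # hypothetical price
```

Re-running only the validation tasks instead of the full suite would reduce the first factor, at the cost of a noisier accept/reject signal.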
2. Local Optima Risk
Greedy acceptance of proposals may miss the globally optimal harness
No backtracking if early proposals lead to dead ends
Might plateau before reaching a human-expert-level harness
3. Benchmark Overfitting
Optimizing specifically for Terminal-Bench 2.0
May not generalize to other domains or task types
Held-out validation set mitigates but doesn't eliminate risk
4. Limited to Harness-Fixable Failures
Cannot improve base model capabilities
Some failures are fundamental model limitations
Only helps with failures caused by harness gaps
5. Minimal Harness Assumption
Starts from very basic harness
May not apply to already-optimized production harnesses
Unclear how it composes with existing harness engineering
Promising Future Directions
1. Cross-Model Harness Transfer
Question: Can harness improvements from Model A transfer to Model B?
Approach: Train Self-Harness on cheap model, transfer to expensive model
Potential: Reduce optimization cost for expensive frontier models
2. Multi-Benchmark Generalization
Question: Can harness optimize across multiple benchmarks simultaneously?
Approach: Validate proposals on Terminal-Bench 2.0 + SWE-bench + GAIA
Potential: More generalizable harnesses that work across domains
3. Compositional Harness Modules
Question: Can we build libraries of reusable harness modules?
Approach: Extract successful patterns into plug-and-play components
Potential: Faster initial harness setup for new models
4. Human-in-the-Loop Validation
Question: Can human review improve Self-Harness proposals?
Approach: Expert reviews edge cases and suggests refinements
Potential: Combine automation speed with human insight
5. Continuous Online Improvement
Question: Can Self-Harness improve during production deployment?
Approach: Mine weaknesses from real user interactions, propose fixes
Potential: Agents that continuously adapt to real-world usage patterns
Practical Implications
For AI Agent Developers
What This Means:
Faster Iteration: Spend less time manually tuning harnesses for new models
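Model-Agnostic Optimization: The same framework is intended to work across model families such as GPT, Claude, and Gemini, although the paper evaluated only MiniMax, Qwen, and GLM models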
⚠️ Strict safety/compliance requirements needing human review
⚠️ Limited compute budget for iterative optimization
Hybrid Approach:
1. Human experts design initial harness architecture
2. Self-Harness optimizes model-specific details
3. Humans review and approve proposed changes
4. Deploy optimized harness to production
5. Continuous Self-Harness monitoring for new failure patterns
Connection to Broader Trends
The Agent Harness Engineering Movement
Self-Harness builds on the emerging discipline of agent harness engineering, where differentiation comes from the scaffolding around the model, not just the model itself.
Timeline:
Feb 2026: Mitchell Hashimoto coins "harness engineering"
Mar 2026: LangChain reports +13.7% Terminal-Bench gain (harness-only)
May 2026: Stanford IRIS meta-harness research published
Jun 2026: Self-Harness demonstrates autonomous harness improvement
Philosophical Shift:
"Frontier models are table stakes. Differentiation is the harness—the loop, tools, middleware, and verification around the model."
Comparison to Loop Engineering
Loop engineering focuses on designing effective agent execution loops—the repeated cycle of planning, action, observation, and refinement.
Self-Harness was published on arXiv on June 8, 2026, introducing a paradigm where LLM-based agents autonomously improve their own operating harnesses through weakness mining, harness proposals, and validation—achieving substantial performance gains on Terminal-Bench 2.0 across diverse base models without requiring human engineers or stronger external models.