Designing Loops with Claude Fable 5: Self-Correction and Memory Guide
Learn how to design effective loops with Claude Fable 5 for self-correction and memory, from Parameter Golf experiments to Continual Learning Bench, with verifier sub-agents and rubric design.
Status (June 27, 2026): Fable 5 remains suspended. The loop patterns below apply when access returns; use Opus 4.8 meanwhile.
TL;DR: Lance Martin from Anthropic shares practical guidance on designing loops with Claude Fable 5, showing that Mythos-class models excel at self-correction and memory when given proper feedback mechanisms. In experiments on Parameter Golf (an ML engineering challenge) and Continual Learning Bench 1.0, Martin reports that Fable 5 achieves roughly 6x larger improvements than Opus 4.7 by making structural changes and following the fail → investigate → verify → distill → consult progression. Key insights: use verifier sub-agents instead of self-critique, design honest rubrics that provide environmental feedback, and leverage memory across sessions for continual learning tasks.
The Shift to Loop Design with Fable 5
Mythos-class models like Claude Fable 5 have fundamentally changed how teams at Anthropic work. Instead of prompting directly, engineers now design loops that let the model self-correct based on environmental feedback.
The Core Philosophy
Traditional Approach:
Human → Prompt → Model → Output → Human evaluation → Repeat
Loop-Based Approach:
Human → Design Loop (goal/rubric) → Model runs autonomously →
Environment feedback → Model self-corrects → Repeat until goal satisfied
Why This Matters:
Autonomy: Model runs without constant human intervention
Resilience: Model learns to push through temporary failures
Scale: Enables long-running tasks (hours to days)
Lance Martin's Key Insight:
"Rather than directly prompting and steering Fable 5, it's often better to design loops that let the model self-correct in response to environment feedback (e.g., /goal or Outcomes) and manage its own context (e.g., via memory)."
Two Primitives for Self-Correction Loops
1. /goal in Claude Code
The /goal command enables loop engineering directly in Claude Code:
Usage:
# Define a goal that Claude will work toward
/goal "Reduce API response time to under 200ms while maintaining 99.9% uptime"

# Claude will:
# 1. Measure current performance
# 2. Identify bottlenecks
# 3. Implement optimizations
# 4. Test changes
# 5. Verify goal is met
# 6. Repeat steps 2-5 until goal satisfied or iteration limit reached
How It Works:
graph TD
A[/goal command issued] --> B[Claude analyzes current state]
B --> C[Proposes changes]
C --> D[Implements changes]
D --> E[Tests/measures results]
E --> F{Goal satisfied?}
F -->|No| G[Analyze what failed]
G --> C
F -->|Yes| H[Stop and report success]
Example Loop:
# Claude Code session
/goal "Fix all TypeScript strict mode errors in src/ directory"

# Iteration 1:
# - Scans src/ for TS errors
# - Finds 47 errors across 12 files
# - Fixes type annotations in user.ts
# - Runs tsc --strict
# - Result: 39 errors remaining

# Iteration 2:
# - Identifies missing return types
# - Adds explicit return types to 8 functions
# - Runs tsc --strict
# - Result: 31 errors remaining

# ... continues until 0 errors ...

# Iteration 8:
# - Fixes final implicit any in config.ts
# - Runs tsc --strict
# - Result: 0 errors ✓
# - Goal satisfied, stops
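The measure, change, re-measure cycle in the example above can be sketched as a small Python driver. This is only an illustration of the loop shape, not how /goal is implemented: `run_goal_loop`, `LoopResult`, and the toy error counter are hypothetical names, and the counter stands in for real `tsc --strict` output.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class LoopResult:
    satisfied: bool
    iterations: int
    history: List[int] = field(default_factory=list)


def run_goal_loop(
    measure: Callable[[], int],
    improve: Callable[[int], None],
    is_satisfied: Callable[[int], bool],
    max_iterations: int = 10,
) -> LoopResult:
    """Measure, check the goal, apply one improvement, and repeat."""
    history = [measure()]
    for iteration in range(1, max_iterations + 1):
        if is_satisfied(history[-1]):
            return LoopResult(True, iteration - 1, history)
        improve(history[-1])       # one change per iteration
        history.append(measure())  # feedback comes from the environment
    return LoopResult(is_satisfied(history[-1]), max_iterations, history)


# Toy environment (assumption): a counter standing in for compiler errors.
errors = {"count": 47}


def count_errors() -> int:
    return errors["count"]


def fix_some(current: int) -> None:
    errors["count"] = max(0, current - 8)  # pretend each pass fixes up to 8


result = run_goal_loop(count_errors, fix_some, lambda n: n == 0)
print(result.satisfied, result.iterations, result.history)
```

The iteration cap matters as much as the goal check: without it, a goal the environment can never confirm would keep the loop running indefinitely.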
2. Outcomes in Claude Managed Agents
Self-Hosted Resources: Connect your own GPUs, databases, etc.
Automatic Grader: Spawns verifier sub-agent for Outcomes evaluation
Multi-Session Memory: Persistent filesystem across agent sessions
Outcomes Workflow:
# Define rubric file (criteria.md)
"""
## Success Criteria
1. Baseline training run completes successfully
2. At least 20 experimental variations attempted
3. Each experiment logs metrics (loss, accuracy, training time)
4. Best model achieves >85% validation accuracy
5. Results documented in results.md with analysis
6. Training code is reproducible (requirements.txt, README)
"""

# Launch Managed Agent with Outcomes
agent = ManagedAgent(
    task="Optimize ML training pipeline for best validation accuracy",
    rubric_file="criteria.md",
    max_runtime="8 hours",
    resources={"gpus": "8xH100"}
)

# Agent runs autonomously
# Outcomes grader checks criteria after each major step
# Agent continues until all criteria satisfied
Case Study 1: Parameter Golf (Fable 5 vs Opus 4.7)
What is Parameter Golf?
Parameter Golf is an open-source ML engineering challenge:
Goal: Train the best model that fits in a 16MB artifact in < 10 minutes on 8xH100s
The Challenge:
Single training file (train_gpt.py)
Edit architecture, hyperparameters, training loop
Launch training, poll logs, read scores
Decide next experiment based on results
Repeat to maximize performance
Why It's Hard:
Requires understanding ML architecture
Trade-offs between model size and performance
Experimental design (what to try next)
Resilience through failed experiments
Experimental Setup
Environment:
Platform: Claude Managed Agents (CMA)
Resources: 8xH100 GPUs (self-hosted sandbox)
Runtime: Up to 8 hours per test
Models: Fable 5 vs Opus 4.7
Rubric (9 checkable criteria):
1. Run a baseline model (establish starting point)
2. Complete at least 20 distinct experiments
3. Each experiment must log metrics
4. Try at least 3 structural changes (architecture modifications)
5. Try at least 10 scalar changes (hyperparameter tuning)
6. Document failed experiments and learnings
7. Achieve >X improvement over baseline
8. Generate reproducible results (all code + configs saved)
9. Provide analysis of what worked and why
Grading:
Verifier Sub-Agent: Outcomes grader checks all 9 criteria
Independent Evaluation: Grader runs separately from main agent
Stop Condition: Agent cannot finish until all criteria met
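One way to picture the stop condition is a rubric whose criteria are executable checks over recorded run state. The sketch below is an assumption-laden illustration, not the Outcomes grader itself: `Criterion`, `grade`, `may_finish`, and the state keys are hypothetical, and only four of the nine criteria are modeled.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Criterion:
    name: str
    check: Callable[[Dict], bool]  # inspects recorded state, not agent claims


def _experiments(state: Dict) -> List[Dict]:
    return state.get("experiments", [])


# Hypothetical subset of the rubric above, expressed as checks.
RUBRIC: List[Criterion] = [
    Criterion("baseline run", lambda s: bool(s.get("baseline_done"))),
    Criterion(">= 20 experiments", lambda s: len(_experiments(s)) >= 20),
    Criterion("every experiment logs metrics",
              lambda s: all("metrics" in e for e in _experiments(s))),
    Criterion(">= 3 structural changes",
              lambda s: sum(e.get("kind") == "structural"
                            for e in _experiments(s)) >= 3),
]


def grade(state: Dict) -> Dict[str, bool]:
    """Return pass/fail per criterion."""
    return {c.name: c.check(state) for c in RUBRIC}


def may_finish(state: Dict) -> bool:
    """The agent may stop only when every criterion passes."""
    return all(grade(state).values())


partial = {"baseline_done": True,
           "experiments": [{"kind": "scalar", "metrics": {"loss": 1.2}}] * 5}
for name, passed in grade(partial).items():
    print(f"{'PASS' if passed else 'FAIL'}  {name}")
print("may finish:", may_finish(partial))
```

Because each check reads logged artifacts rather than the agent's own summary, the agent cannot satisfy the rubric by asserting that it is done.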
Fable 5: Made bold structural bets, recovered from failures, compounded gains
Opus 4.7: Found early win with scalars, then stayed in that template
Why Fable 5 Won
1. Resilience Through Failure
Fable 5 Quantization Regression:
Initial: 67% accuracy
After quantization: 62% (-5%)
Opus 4.7 would have: Rejected quantization, moved on
Fable 5 response:
"Quantization decreased accuracy. Investigating calibration..."
→ Adds post-training quantization calibration
→ Adjusts quantization-aware training schedule
→ Result: 75% accuracy (+8 points over the initial 67%, +13 over the quantized 62%)
→ Biggest single win
2. Structural Exploration
# Fable 5's diverse experiments
fable_experiments = [
    "Try Mixture-of-Experts architecture",
    "Reduce embedding dimension by 50%",
    "Implement rotary position embeddings (RoPE)",
    "Use grouped query attention (GQA)",
    "Apply knowledge distillation from larger model",
    "Quantize to int8 with calibration",
    "Experiment with sparse attention patterns",
]

# Opus 4.7's incremental changes
opus_experiments = [
    "Increase learning rate from 1e-4 to 2e-4",
    "Adjust learning rate to 1.8e-4",
    "Try batch size 64 instead of 32",
    "Adjust weight decay from 0.01 to 0.02",
    "Use cosine schedule instead of linear",
]
3. Compound Gains
Fable 5 stacked wins:
Smaller embeddings (+2%) +
Quantization with calibration (+8%) +
Better optimizer (+4%) +
RoPE instead of learned PE (+3%) =
+17% compound improvement
Opus 4.7 isolated wins:
Learning rate (+5%) [stopped exploring]
Why Verifier Sub-Agents Beat Self-Critique
The Self-Critique Problem
Anthropic Research Finding: Models struggle to critique their own outputs accurately.
Why Self-Critique Fails:
# Model generates code
model_output = """
def calculate_discount(price, discount_pct):
    return price - (price * discount_pct)
"""

# Same model self-critiques
self_critique = """
This code looks correct:
- Calculates discount amount properly ✓
- Returns final price after discount ✓
- No obvious bugs ✓
"""

# Actual bug: discount_pct should be divided by 100
# Self-critique missed it because model has bias toward its own output
Cognitive Bias in Self-Evaluation:
Confirmation Bias: Model looks for reasons why its code is correct
Blind Spots: Assumptions model made during generation carry to evaluation
Context Anchoring: Evaluation is anchored to generation reasoning
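The discount bug above is the kind of error that an independent, executable check exposes immediately. The sketch below is illustrative only: `verify` and the test cases are hypothetical, and it assumes `discount_pct` is a percentage such as 20 for 20%.

```python
def calculate_discount(price, discount_pct):
    # Buggy version from above: treats 20 (percent) as a multiplier of 20.
    return price - (price * discount_pct)


def calculate_discount_fixed(price, discount_pct):
    # Corrected: convert the percentage to a fraction first.
    return price - (price * discount_pct / 100)


def verify(fn):
    """Independent check: known input/output pairs, no access to the
    author's reasoning. Returns the list of failing cases."""
    cases = [((100, 20), 80), ((50, 0), 50), ((200, 100), 0)]
    failures = []
    for args, expected in cases:
        actual = fn(*args)
        if actual != expected:
            failures.append((args, actual, expected))
    return failures


print("buggy failures:", verify(calculate_discount))
print("fixed failures:", verify(calculate_discount_fixed))
```

The verifier never reads the generating model's explanation, so the assumption that produced the bug cannot carry over into the evaluation.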
The Verifier Sub-Agent Solution
Independent Grading:
# Main agent generates solution
main_agent.task = "Implement user authentication API"
solution = main_agent.execute()
# Verifier sub-agent evaluates independently
verifier_agent = spawn_verifier(solution, rubric)
evaluation = verifier_agent.grade(solution)
# Verifier has NO knowledge of main agent's reasoning
# Starts fresh, checks against rubric objectively
Claude Managed Agents Implementation:
# CMA automatically spawns grader sub-agent for Outcomes
outcomes_config = {
    "rubric": rubric_file,
    "grader": "independent"  # Spawns separate agent
}

# Main agent works
main_agent.run(task)

# After each major step, separate grader agent:
# 1. Reads current state (code, tests, logs)
# 2. Reads rubric criteria
# 3. Evaluates WITHOUT seeing main agent's reasoning
# 4. Returns pass/fail for each criterion
# 5. Main agent receives feedback and continues
Performance Improvement:
| Evaluation Method | Accuracy | False Positives |
| --- | --- | --- |
| Self-Critique | 62% | 28% |
| Verifier Sub-Agent | 89% | 7% |
Based on Anthropic's internal evaluations on code correctness tasks
Prithvi Rajasekaran's Analysis
From Anthropic's engineering blog:
"We tested self-critique versus independent verification across hundreds of coding tasks. Models consistently overrated their own outputs by 20-30 percentage points. The verifier sub-agent, having no stake in the original solution, identified failure modes the main agent overlooked."
Example from Production:
Task: Generate SQL migration for new features table
Main Agent Output:
```sql
ALTER TABLE features ADD COLUMN enabled BOOLEAN DEFAULT true;
ALTER TABLE features ADD COLUMN config JSON;
```
Self-Critique:
"Migration looks good ✓
- Adds required columns
- Sets sensible defaults
- Uses appropriate data types"
Verifier Sub-Agent:
"Migration has issues ✗
- Missing NOT NULL constraint on enabled
- JSON type not supported in all MySQL versions (need LONGTEXT)
- No index on enabled column (will slow feature flag queries)
- No down migration provided"
**Result:** Verifier caught 4 issues self-critique missed.
---
## Case Study 2: Continual Learning Bench 1.0
### What is Continual Learning Bench?
Released by Parth Asawa and team at Anthropic, [Continual Learning Bench 1.0](https://continuallearningbench.ai) is the first realistic benchmark for measuring how AI systems improve in online settings.
**Core Premise:**
- Most benchmarks assume models are stateless
- Real-world agents should learn across sessions
- Memory enables continual improvement
**Example Task:** Sequential SQL Database Questions
Session 1: "What is the total revenue for Q1 2024?"
→ Agent queries database
→ Gets error: column 'prc' not found
→ Stores: "prc column doesn't exist, try prc_usd"
Session 2: "What is the average order value in March 2024?"
→ Agent recalls: use prc_usd instead of prc
→ Successfully queries database
→ Correct answer
Session 3-30: Additional questions...
→ Agent builds knowledge base of schema details
→ Performance improves over time
### The Memory Progression
Effective memory use follows a progression:
**1. Fail**: Make a mistake and document it
Session 1 attempt:
SELECT SUM(revenue) FROM orders WHERE quarter = 'Q1'
Error: column 'quarter' doesn't exist
Memory note:
"Tried 'quarter' column in orders table, doesn't exist"
**2. Investigate**: Before moving on, figure out why
Session 1 continued:
"Why did quarter column fail? Let me examine schema."
DESCRIBE orders;
→ Finds: date column (DATE type), not quarter
Memory note (updated):
"Orders table uses 'date' column (DATE), not 'quarter'.
Need to extract quarter from date."
**3. Verify**: Turn the diagnosis into a checked fact
Session 1 verification:
SELECT date FROM orders LIMIT 1;
→ Returns: 2024-03-15
→ Confirms: date is DATE type, formatted YYYY-MM-DD
Memory note (updated):
"VERIFIED: orders.date is DATE type (YYYY-MM-DD format).
To get quarter: QUARTER(date) function."
**4. Distill**: Turn the verification into a general rule
Session 1 distillation:
Memory rule created:
"When filtering by time period in orders table:
Use date column (not quarter, month, year columns)
Extract quarters: QUARTER(date)
Extract months: MONTH(date)
Filter ranges: date BETWEEN 'start' AND 'end'
Verified in Session 1"
**5. Consult**: Read the rule instead of re-deriving it
Session 2:
Task: "What is average order value in March 2024?"
Agent thinks:
"Need to filter orders by month. Check memory rules..."
→ Finds: "Extract months: MONTH(date)"
→ Applies directly:
SELECT AVG(prc_usd)
FROM orders
WHERE MONTH(date) = 3 AND YEAR(date) = 2024;
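The five-step progression can be sketched as a small file-backed note store. Everything here is hypothetical (the `MemoryStore` class, the JSON layout, the topic key); it only illustrates the idea that notes advance through stages and that a later session consults distilled rules rather than raw failures.

```python
import json
import tempfile
from pathlib import Path

STAGES = ["fail", "investigate", "verify", "distill"]


class MemoryStore:
    """Hypothetical file-backed notes that survive across sessions."""

    def __init__(self, path: Path):
        self.path = path
        self.notes = json.loads(path.read_text()) if path.exists() else {}

    def record(self, topic: str, stage: str, text: str, session: int) -> None:
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        current = self.notes.get(topic)
        # Stages only move forward: a new failure note cannot overwrite
        # a rule that has already been verified or distilled.
        if current and STAGES.index(stage) < STAGES.index(current["stage"]):
            return
        self.notes[topic] = {"stage": stage, "text": text, "session": session}
        self.path.write_text(json.dumps(self.notes, indent=2))

    def consult(self, topic: str):
        """Return only distilled rules; earlier stages are still hypotheses."""
        note = self.notes.get(topic)
        return note["text"] if note and note["stage"] == "distill" else None


path = Path(tempfile.mkdtemp()) / "memory.json"

# Session 1: walk one topic through the first four stages.
s1 = MemoryStore(path)
s1.record("orders.time_filter", "fail", "Tried 'quarter' column; it doesn't exist", 1)
s1.record("orders.time_filter", "investigate", "orders has 'date' (DATE), not 'quarter'", 1)
s1.record("orders.time_filter", "verify", "VERIFIED: orders.date is YYYY-MM-DD", 1)
s1.record("orders.time_filter", "distill", "Use QUARTER(date) / MONTH(date) on orders.date", 1)

# Session 2: a fresh object reads the rule from disk instead of re-deriving it.
s2 = MemoryStore(path)
print(s2.consult("orders.time_filter"))
```

Returning nothing for undistilled notes is a deliberate choice in this sketch: an unverified guess that is consulted as if it were a rule would propagate the original mistake into later sessions.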
"A well-designed rubric is doing more work than the model. Fable self-correcting only matters if the feedback in the environment is honest. Garbage rubric + great model = a confidently wrong loop. Rubric design is the skill now, the model is the easy part."
The Problem:
# Bad rubric
rubric = """
1. Code should be good
2. Tests should pass
3. Performance should be acceptable
"""

# Fable 5 result:
# ✓ All criteria met! (but code is mediocre)
# Why? Rubric provides no honest feedback
The Solution:
# Good rubric
rubric = """
1. Code passes all existing tests (run: pytest -v)
2. Code adds no TypeScript errors (run: tsc --strict)
3. API response time < 200ms (measured via: wrk -t4 -c100 -d30s)
4. Memory usage < 512MB under load (measured via: /usr/bin/time -v)
5. Code coverage > 85% (run: pytest --cov --cov-report=term)
6. All functions have type hints (run: mypy --strict)
7. Linter passes (run: ruff check .)
8. No security issues (run: bandit -r src/)
"""

# Fable 5 result:
# Honest feedback on each criterion
# Self-corrects based on objective measurements
Principles of Good Rubrics
1. Checkable via Code/Commands
Bad:
"Database queries should be fast"
Good:
"All database queries complete in <50ms
Measure: EXPLAIN ANALYZE each query
Verify: max_exec_time < 50ms for each statement in pg_stat_statements"
2. Incremental Validation
Bad:
"Complete entire feature and pass all tests"
Good:
Step 1: Database schema migration runs without errors
Step 2: API endpoint returns 200 status for valid requests
Step 3: API endpoint returns 400 for invalid requests with helpful error
Step 4: Integration tests pass (run: pytest tests/integration/)
Step 5: Load test handles 100 RPS (run: locust -f loadtest.py)
3. Avoid Vague Judgments
Bad:
"Code should follow best practices"
Good:
Code quality checks:
- Cyclomatic complexity < 10 per function (run: radon cc src/)
- No duplicate code blocks > 6 lines (run: jscpd src/)
- All public functions have docstrings (run: pydocstyle src/)
- No TODO or FIXME comments in committed code (run: grep -r "TODO\|FIXME" src/)
4. Environment Provides Feedback
Bad:
"Implementation should be correct"
Good:
Correctness verified by:
1. Unit tests: 47 tests pass (run: pytest tests/unit/)
2. Property tests: No failures in 1000 trials (run: pytest tests/properties/)
3. Integration tests: API returns expected responses (run: pytest tests/integration/)
4. End-to-end test: User flow completes successfully (run: playwright test)
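The principles above share one mechanism: each criterion names a command, and the command's exit status decides pass or fail. A minimal sketch of that idea follows, assuming every check signals failure with a non-zero exit code; `run_checks` is a hypothetical helper, and the demo commands are stand-ins so the example runs without pytest or ruff installed.

```python
import subprocess
import sys
from typing import Dict, List


def run_checks(checks: Dict[str, List[str]], timeout: int = 600) -> Dict[str, bool]:
    """Run each rubric command; a criterion passes only on exit code 0."""
    results = {}
    for name, cmd in checks.items():
        try:
            proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
            results[name] = proc.returncode == 0
        except (FileNotFoundError, subprocess.TimeoutExpired):
            results[name] = False  # a missing tool or a hang counts as failure
    return results


# Stand-in commands (assumption). In a real project these might be
# ["pytest", "-v"] or ["ruff", "check", "."] from the rubric above.
demo_checks = {
    "passing check": [sys.executable, "-c", "raise SystemExit(0)"],
    "failing check": [sys.executable, "-c", "raise SystemExit(1)"],
}

for name, passed in run_checks(demo_checks).items():
    print(f"{'PASS' if passed else 'FAIL'}  {name}")
```

Treating a missing tool as a failure rather than skipping it keeps the feedback honest: a check that never ran should not count toward the goal.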
Example: ML Training Rubric (Parameter Golf)
# Parameter Golf Rubric
## Required Experiments (Checkable)
### 1. Baseline
- [ ] Train baseline GPT-2 model
- [ ] Log baseline metrics (loss, accuracy, params, training time)
- [ ] Baseline completes in < 10 minutes
- [ ] Model artifact < 16MB
Check: `ls -lh models/baseline.pt` shows < 16MB
### 2. Exploration (20 experiments minimum)
Structural experiments (at least 3):
- [ ] Experiment with MoE architecture
- [ ] Experiment with different embedding dimensions
- [ ] Experiment with attention mechanisms (MQA, GQA, MHA)
- [ ] Experiment with quantization (int8, int4)
Scalar experiments (at least 10):
- [ ] Learning rate variations (3 different values)
- [ ] Batch size variations (3 different values)
- [ ] Optimizer choices (AdamW, Lion, SGD)
- [ ] Learning rate schedules (cosine, linear, constant)
Check: `wc -l experiments.log` shows >= 20 lines
### 3. Measurement
- [ ] Each experiment logs: loss, accuracy, params, time
- [ ] Results stored in structured format (CSV/JSON)
- [ ] Can reproduce any experiment from logs
Check: `jq '. | length' experiments.json` shows >= 20
### 4. Analysis
- [ ] Document what worked and why
- [ ] Document what failed and lessons learned
- [ ] Identify best model and reasoning
Check: `cat analysis.md` has sections for successes, failures, insights
### 5. Improvement
- [ ] Best model improves on baseline by >= X%
- [ ] All constraints still satisfied (< 16MB, < 10min)
Check: Compare `best_model_accuracy` vs `baseline_accuracy`
## Grading
Pass if:
- All checkable criteria satisfied
- Improvement > threshold
- No constraint violations
Fail if:
- Any experiment violates constraints
- < 20 experiments attempted
- Missing logs or analysis
Rubric Anti-Patterns
1. Subjective Criteria
✗ "Code should be elegant and maintainable"
✓ "Code has no functions > 50 lines (run: scc --by-file src/)"
2. Unmeasurable Goals
✗ "System should scale well"
✓ "System handles 10,000 concurrent users with p95 latency < 500ms"
3. Missing Stop Conditions
✗ "Keep optimizing until performance is good"
✓ "Optimize until p95 latency < 200ms OR 20 experiments attempted"
4. No Incremental Checkpoints
✗ "Complete entire system and deploy to production"
✓ "Complete in order:
1. Local tests pass
2. Staging deployment succeeds
3. Smoke tests pass in staging
4. Load tests pass in staging
5. Production deployment (with rollback plan)"
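The stop condition from anti-pattern 3 ("p95 latency < 200ms OR 20 experiments attempted") can be made explicit in a few lines. This is a sketch under assumptions: `p95` uses the nearest-rank definition, and `should_stop` and its default thresholds are hypothetical names chosen to mirror the example.

```python
from typing import Sequence


def p95(samples_ms: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of latency samples."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    rank = max(1, (95 * len(ordered) + 99) // 100)  # integer ceil of 0.95 * n
    return ordered[rank - 1]


def should_stop(samples_ms: Sequence[float], experiments_run: int,
                target_ms: float = 200.0, max_experiments: int = 20) -> str:
    """Return the reason to stop, or an empty string to keep going."""
    if p95(samples_ms) < target_ms:
        return "goal met"
    if experiments_run >= max_experiments:
        return "budget exhausted"
    return ""


fast = [120.0] * 95 + [450.0] * 5   # 5% slow tail
slow = [450.0] * 100

print(repr(should_stop(fast, experiments_run=3)))
print(repr(should_stop(slow, experiments_run=3)))
print(repr(should_stop(slow, experiments_run=20)))
```

Returning the reason rather than a bare boolean lets the final report distinguish a loop that succeeded from one that merely ran out of budget.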
Practical Implementation Guide
Setting Up Self-Correction Loops
Option 1: Claude Code with /goal
# Start Claude Code session
claude-code
# Define goal with specific success criteria
/goal "Refactor src/api/ to use async/await throughout
Success criteria:
- All API routes use async/await (no callbacks)
- All tests pass (npm test)
- No new TypeScript errors (npm run type-check)
- API response times improve by >= 10%
Stop when: All criteria met OR 10 refactor iterations attempted"

# Claude will run autonomous loop
# You can monitor progress and interrupt if needed
Option 2: Claude Managed Agents with Outcomes
# File: train_model_task.py
from anthropic import ManagedAgent
# Define rubric
rubric = """
## ML Training Optimization
Success criteria:
1. Baseline model trained and metrics logged
2. At least 15 experimental variations attempted
3. Best model achieves >= 90% validation accuracy
4. Training time < 30 minutes per experiment
5. All experiments documented with reasoning
6. Final model saved and reproducible
"""

# Create managed agent
agent = ManagedAgent(
    model="claude-fable-5",
    task="Optimize ML training pipeline for MNIST classification",
    rubric=rubric,
    max_iterations=25,
    timeout_hours=8,
    resources={
        "gpus": "2xA100",
        "memory_gb": 64
    },
    memory_enabled=True  # Enable cross-session memory
)
# Run agent (returns when Outcomes satisfied or timeout)
result = agent.run()
print(f"Task completed: {result.success}")
print(f"Iterations used: {result.iterations}")
print(f"Final metrics: {result.metrics}")
Implementing Verifier Sub-Agents
Manual Implementation (for custom harnesses):
# File: verifier_pattern.py
from anthropic import Anthropic

client = Anthropic(api_key="...")


def run_with_verification(task, rubric):
    """Implements verifier sub-agent pattern"""
    # Main agent works on task
    main_response = client.messages.create(
        model="claude-fable-5",
        max_tokens=8000,
        messages=[{
            "role": "user",
            "content": f"Complete this task:\n\n{task}"
        }]
    )
    solution = main_response.content[0].text

    # Verifier sub-agent evaluates in a separate request, so it shares
    # none of the main agent's context
    verifier_response = client.messages.create(
        model="claude-fable-5",  # Can use same or different model
        max_tokens=2000,
        messages=[{
            "role": "user",
            "content": f"""Evaluate this solution against the rubric.
Rubric:
{rubric}
Solution to evaluate:
{solution}
For each rubric criterion:
1. State criterion
2. Check if solution satisfies it
3. Provide evidence (quote relevant parts)
4. Mark as PASS or FAIL
Final verdict: Overall PASS or FAIL"""
        }]
    )
    evaluation = verifier_response.content[0].text
    return {
        "solution": solution,
        "evaluation": evaluation,
        "passed": "Final verdict: Overall PASS" in evaluation
    }


# Usage
task = "Implement binary search in Python with full test coverage"
rubric = """
1. Function signature: binary_search(arr, target) -> int
2. Returns index if found, -1 if not found
3. Handles empty arrays
4. Handles single-element arrays
5. Test coverage >= 95%
6. All tests pass
"""
result = run_with_verification(task, rubric)
if result["passed"]:
    print("Solution verified")
else:
    print("Verification failed:", result["evaluation"])
Memory Management Best Practices
1. Structured Memory Files
# File: .claude/memory/sql_schema_facts.md
## Verified Schema Details
Last updated: Session 23
### orders table
- id: INT PRIMARY KEY AUTO_INCREMENT [Verified S1]
- date: DATE (YYYY-MM-DD format) [Verified S2]
- prc_usd: DECIMAL(10,2) - prices in dollars, NOT cents [Verified S4]
- customer_id: INT - FK to customers.id [Verified S5]
- status: ENUM('pending','processing','shipped','delivered') [Verified S7]
### customers table
- id: INT PRIMARY KEY AUTO_INCREMENT [Verified S3]
- email: VARCHAR(255) UNIQUE [Verified S6]
- created_at: TIMESTAMP DEFAULT CURRENT_TIMESTAMP [Verified S8]
## Query Patterns
### Filtering by date
✓ CORRECT: WHERE MONTH(date) = 3 AND YEAR(date) = 2024
✗ WRONG: WHERE quarter = 'Q1' (column doesn't exist)
### Revenue calculations
✓ CORRECT: SUM(prc_usd) -- already in dollars
✗ WRONG: SUM(prc_usd) / 100 -- don't divide, not in cents
### Joins
✓ CORRECT:
FROM orders o
JOIN customers c ON o.customer_id = c.id
Performance: Index exists on orders.customer_id [Verified S10]
2. Session Templates
# File: session_template.py
import os
from datetime import datetime


def start_session(session_num):
    """Initialize new session with memory access"""
    memory_dir = ".claude/memory"
    os.makedirs(memory_dir, exist_ok=True)

    # Read previous learnings
    facts_file = f"{memory_dir}/facts.md"
    if os.path.exists(facts_file):
        with open(facts_file) as f:
            previous_facts = f.read()
    else:
        previous_facts = "No prior facts"

    # Create session log
    session_log = f"{memory_dir}/session_{session_num}.md"
    with open(session_log, "w") as f:
        f.write(f"# Session {session_num}\n")
        f.write(f"Started: {datetime.now()}\n\n")
        f.write("## Prior Knowledge\n")
        f.write(previous_facts + "\n\n")
        f.write("## New Discoveries\n\n")
    return session_log


def end_session(session_num, new_facts):
    """Update memory with session learnings"""
    memory_dir = ".claude/memory"
    os.makedirs(memory_dir, exist_ok=True)
    facts_file = f"{memory_dir}/facts.md"

    # Append new facts
    with open(facts_file, "a") as f:
        f.write(f"\n## Session {session_num} ({datetime.now().date()})\n")
        f.write(new_facts + "\n")
3. Memory Retrieval
# In agent prompt
system_prompt = f"""You are working on a multi-session task.
MEMORY ACCESS:
Before answering, check your memory files in .claude/memory/:
- facts.md: Verified facts from previous sessions
- patterns.md: Successful approaches and patterns
- failures.md: Things that didn't work (avoid repeating)
MEMORY UPDATE:
After each significant discovery:
1. Verify it's actually correct (run tests, check docs)
2. Add to appropriate memory file with session number
3. Mark as [VERIFIED] or [UNVERIFIED]
Use memory to avoid re-learning same lessons.
"""
Advanced Patterns
1. Hierarchical Loops (Loop-of-Loops)
# Outer loop: Daily maintenance
# (pseudocode: /goal is a Claude Code command, not Python)
while True:
    # Inner loop 1: Monitor and fix build
    /goal "Ensure main branch builds successfully"
    # Inner loop 2: Review PRs
    /goal "Review open PRs, approve or request changes"
    # Inner loop 3: Update dependencies
    /goal "Check for security updates, apply if safe"
    time.sleep(24 * 3600)  # Run daily
2. Parallel Verification
# Multiple verifiers for different aspects
verifiers = [
("security", security_rubric),
("performance", performance_rubric),
("correctness", correctness_rubric)
]
evaluations = await asyncio.gather(*[
verify_async(solution, rubric)
for name, rubric in verifiers
])
overall_pass = all(e["passed"] for e in evaluations)
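The snippet above leaves `verify_async` undefined. A self-contained sketch of the same fan-out is shown below; the signature used here (name, verifier, solution) and the two stub verifiers are assumptions for illustration, with trivial string checks standing in for model or tool calls.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Tuple

Verifier = Callable[[str], Awaitable[bool]]


async def check_security(solution: str) -> bool:
    await asyncio.sleep(0)  # stands in for a model or tool call
    return "eval(" not in solution


async def check_correctness(solution: str) -> bool:
    await asyncio.sleep(0)
    return "def binary_search" in solution


async def verify_async(name: str, verifier: Verifier, solution: str) -> Dict:
    """Run one verifier and keep its name with the verdict."""
    return {"name": name, "passed": await verifier(solution)}


async def verify_all(solution: str,
                     verifiers: List[Tuple[str, Verifier]]) -> List[Dict]:
    # gather runs the verifiers concurrently and preserves input order
    return await asyncio.gather(
        *[verify_async(name, fn, solution) for name, fn in verifiers]
    )


solution = "def binary_search(arr, target):\n    return -1\n"
evaluations = asyncio.run(verify_all(solution, [
    ("security", check_security),
    ("correctness", check_correctness),
]))
overall_pass = all(e["passed"] for e in evaluations)
print(evaluations, overall_pass)
```

Keeping the verifier name in each result makes a failed run actionable, since the main agent can see which aspect failed instead of receiving a single combined verdict.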
3. Progressive Rubric Tightening
# Start with loose rubric, tighten over iterations
rubrics = [
    "Response time < 1000ms",  # Iteration 1-5
    "Response time < 500ms",   # Iteration 6-10
    "Response time < 200ms",   # Iteration 11-15
    "Response time < 100ms",   # Iteration 16+
]
for i, rubric in enumerate(rubrics):
    /goal f"Optimize API performance: {rubric}"
    # Agent has 5 iterations per rubric level
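The tightening schedule in the comments (five iterations per level, with the last level open-ended) reduces to a small lookup. The function name and constants below are hypothetical, chosen only to mirror that schedule.

```python
THRESHOLDS_MS = [1000, 500, 200, 100]  # loosest to tightest
ITERATIONS_PER_LEVEL = 5


def threshold_for(iteration: int) -> int:
    """Map a 1-based iteration number to its latency threshold in ms."""
    if iteration < 1:
        raise ValueError("iteration is 1-based")
    level = min((iteration - 1) // ITERATIONS_PER_LEVEL, len(THRESHOLDS_MS) - 1)
    return THRESHOLDS_MS[level]


for it in (1, 5, 6, 11, 16, 40):
    print(it, threshold_for(it))
```

Clamping to the last level keeps the tightest threshold in force for any iteration past 16, which matches the "16+" row.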
Limitations and Caveats
1. Rubric Quality is Paramount
Problem:
rubric = "Make the code better"
# Result: Infinite loop of cosmetic changes
# Agent thinks it's improving, rubric can't say no
Update (June 27, 2026): Fable 5 remains suspended; zero traffic confirmed. GPT 5.6 gated preview reported.
Lance Martin's insights on designing loops with Claude Fable 5 were shared via X on June 9, 2026, demonstrating that Mythos-class models excel at self-correction and memory when given proper environmental feedback through well-designed rubrics, verifier sub-agents, and structured memory systems, achieving 6x improvements over earlier models on challenging ML engineering and continual learning tasks.