TL;DR: Lance Martin from Anthropic shares practical guidance on designing loops with Claude Fable 5, arguing that Mythos-class models excel at self-correction and memory when given proper feedback mechanisms. In experiments on Parameter Golf (an ML engineering challenge) and Continual Learning Bench 1.0, Martin reports that Fable 5 improved the Parameter Golf pipeline roughly 6x more than Opus 4.7 by making structural changes, and that on the continual-learning task it completed the full fail → investigate → verify → distill → consult memory progression. Key insights: use verifier sub-agents instead of self-critique, design honest rubrics that provide environmental feedback, and leverage memory across sessions for continual learning tasks.
The Shift to Loop Design with Fable 5
Mythos-class models like Claude Fable 5 have fundamentally changed how teams at Anthropic work. Instead of prompting directly, engineers now design loops that let the model self-correct based on environmental feedback.
The Core Philosophy
Traditional Approach:
Human → Prompt → Model → Output → Human evaluation → Repeat
Loop-Based Approach:
Human → Design Loop (goal/rubric) → Model runs autonomously →
Environment feedback → Model self-corrects → Repeat until goal satisfied
Why This Matters:
- Autonomy: Model runs without constant human intervention
- Feedback: Environment provides objective signals (tests pass, metrics improve)
- Resilience: Model learns to push through temporary failures
- Scale: Enables long-running tasks (hours to days)
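The loop-based approach above can be sketched in a few lines of Python. This is a minimal illustration, not the mechanism used by any Anthropic product: `run_loop`, `make_toy_agent`, and `toy_evaluate` are hypothetical names, and the "model" and "environment" are toy stand-ins so the control flow can be run as-is.

```python
from typing import Callable, Optional, Tuple


def run_loop(
    propose: Callable[[Optional[str]], str],
    evaluate: Callable[[str], Tuple[bool, str]],
    max_iterations: int = 10,
) -> Tuple[Optional[str], int]:
    """Repeat propose -> evaluate until the goal is met or the budget is spent."""
    feedback: Optional[str] = None
    for iteration in range(1, max_iterations + 1):
        candidate = propose(feedback)              # a model call in a real harness
        satisfied, feedback = evaluate(candidate)  # environment feedback
        if satisfied:
            return candidate, iteration
    return None, max_iterations


def make_toy_agent():
    """Toy 'model': raises its guess by one whenever feedback says 'too low'."""
    state = {"value": 0}

    def propose(feedback: Optional[str]) -> str:
        if feedback == "too low":
            state["value"] += 1
        return str(state["value"])

    return propose


def toy_evaluate(candidate: str) -> Tuple[bool, str]:
    """Toy 'environment': the goal is any value of at least 3."""
    ok = int(candidate) >= 3
    return ok, ("goal met" if ok else "too low")


result, iterations_used = run_loop(make_toy_agent(), toy_evaluate)
print(result, iterations_used)  # 3 4
```

The human's job in this framing is to write `evaluate` (the goal or rubric) and set the iteration budget; the model only ever sees the feedback string.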
Lance Martin's Key Insight:
"Rather than directly prompting and steering Fable 5, it's often better to design loops that let the model self-correct in response to environment feedback (e.g., /goal or Outcomes) and manage its own context (e.g., via memory)."
Two Primitives for Self-Correction Loops
1. /goal in Claude Code
The /goal command enables loop engineering directly in Claude Code:
Usage:
# Define a goal that Claude will work toward
/goal "Reduce API response time to under 200ms while maintaining 99.9% uptime"
# Claude will:
# 1. Measure current performance
# 2. Identify bottlenecks
# 3. Implement optimizations
# 4. Test changes
# 5. Verify goal is met
# 6. Repeat steps 2-5 until goal satisfied or iteration limit reached
How It Works:
graph TD
A["/goal command issued"] --> B[Claude analyzes current state]
B --> C[Proposes changes]
C --> D[Implements changes]
D --> E[Tests/measures results]
E --> F{Goal satisfied?}
F -->|No| G[Analyze what failed]
G --> C
F -->|Yes| H[Stop and report success]
Example Loop:
# Claude Code session
/goal "Fix all TypeScript strict mode errors in src/ directory"
# Iteration 1:
# - Scans src/ for TS errors
# - Finds 47 errors across 12 files
# - Fixes type annotations in user.ts
# - Runs tsc --strict
# - Result: 39 errors remaining
# Iteration 2:
# - Identifies missing return types
# - Adds explicit return types to 8 functions
# - Runs tsc --strict
# - Result: 31 errors remaining
# ... continues until 0 errors ...
# Iteration 8:
# - Fixes final implicit any in config.ts
# - Runs tsc --strict
# - Result: 0 errors ✓
# - Goal satisfied, stops
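The feedback signal in the example loop is simply the number of compiler diagnostics left. A minimal sketch of such a check is shown below; `count_ts_errors` and `goal_satisfied` are hypothetical helper names, and the sample output is made up for illustration. It relies only on the fact that `tsc` diagnostics contain the text `error TS<number>:`.

```python
import re

# tsc prints one line per diagnostic, for example:
#   src/user.ts(12,5): error TS2322: Type 'string' is not assignable to type 'number'.
TS_ERROR = re.compile(r"error TS\d+:")


def count_ts_errors(tsc_output: str) -> int:
    """Count diagnostic lines in captured tsc output."""
    return sum(1 for line in tsc_output.splitlines() if TS_ERROR.search(line))


def goal_satisfied(tsc_output: str) -> bool:
    """The goal in the example is met when no diagnostics remain."""
    return count_ts_errors(tsc_output) == 0


# Made-up sample output for illustration
sample_output = (
    "src/user.ts(12,5): error TS2322: Type 'string' is not assignable to type 'number'.\n"
    "src/config.ts(3,7): error TS7005: Variable 'x' implicitly has an 'any' type.\n"
    "Found 2 errors.\n"
)
print(count_ts_errors(sample_output))  # 2
print(goal_satisfied(sample_output))   # False
```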
2. Outcomes in Claude Managed Agents (CMA)
Claude Managed Agents provides a hosted environment for long-running agentic tasks:
Key Features:
- Hosted Sandbox: Isolated execution environment
- Self-Hosted Resources: Connect your own GPUs, databases, etc.
- Automatic Grader: Spawns verifier sub-agent for Outcomes evaluation
- Multi-Session Memory: Persistent filesystem across agent sessions
Outcomes Workflow:
# Define rubric file (criteria.md)
"""
## Success Criteria
1. Baseline training run completes successfully
2. At least 20 experimental variations attempted
3. Each experiment logs metrics (loss, accuracy, training time)
4. Best model achieves >85% validation accuracy
5. Results documented in results.md with analysis
6. Training code is reproducible (requirements.txt, README)
"""
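# Launch Managed Agent with Outcomes
# (illustrative sketch: the ManagedAgent class and its parameters are assumed
# here; check the Claude Managed Agents documentation for the actual interface)
agent = ManagedAgent(
    task="Optimize ML training pipeline for best validation accuracy",
    rubric_file="criteria.md",
    max_runtime="8 hours",
    resources={"gpus": "8xH100"}
)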
# Agent runs autonomously
# Outcomes grader checks criteria after each major step
# Agent continues until all criteria satisfied
Case Study 1: Parameter Golf (Fable 5 vs Opus 4.7)
What is Parameter Golf?
Parameter Golf is an open-source ML engineering challenge:
Goal: Train the best model that fits in a 16MB artifact in < 10 minutes on 8xH100s
The Challenge:
- Single training file (train_gpt.py)
- Edit architecture, hyperparameters, training loop
- Launch training, poll logs, read scores
- Decide next experiment based on results
- Repeat to maximize performance
Why It's Hard:
- Requires understanding ML architecture
- Trade-offs between model size and performance
- Experimental design (what to try next)
- Resilience through failed experiments
Experimental Setup
Environment:
- Platform: Claude Managed Agents (CMA)
- Resources: 8xH100 GPUs (self-hosted sandbox)
- Runtime: Up to 8 hours per test
- Models: Fable 5 vs Opus 4.7
Rubric (9 checkable criteria):
1. Run a baseline model (establish starting point)
2. Complete at least 20 distinct experiments
3. Each experiment must log metrics
4. Try at least 3 structural changes (architecture modifications)
5. Try at least 10 scalar changes (hyperparameter tuning)
6. Document failed experiments and learnings
7. Achieve >X improvement over baseline
8. Generate reproducible results (all code + configs saved)
9. Provide analysis of what worked and why
Grading:
- Verifier Sub-Agent: Outcomes grader checks all 9 criteria
- Independent Evaluation: Grader runs separately from main agent
- Stop Condition: Agent cannot finish until all criteria met
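The countable parts of this rubric (criteria 1 to 5) can be checked mechanically from an experiment log, which is what makes the stop condition enforceable. The sketch below is an assumption-laden illustration, not the Outcomes grader: the `Experiment` record, the `kind` labels, and the choice to count the baseline as one of the 20 experiments are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Experiment:
    name: str
    kind: str  # assumed labels: "baseline", "structural", or "scalar"
    metrics: Dict[str, float] = field(default_factory=dict)


def grade(experiments: List[Experiment]) -> Dict[str, bool]:
    """Check the countable rubric criteria against the experiment log."""
    return {
        "baseline_run": any(e.kind == "baseline" for e in experiments),
        "at_least_20_distinct": len({e.name for e in experiments}) >= 20,
        "every_experiment_logs_metrics": all(bool(e.metrics) for e in experiments),
        "at_least_3_structural": sum(e.kind == "structural" for e in experiments) >= 3,
        "at_least_10_scalar": sum(e.kind == "scalar" for e in experiments) >= 10,
    }


def may_stop(results: Dict[str, bool]) -> bool:
    """Stop condition: the agent may finish only when every criterion passes."""
    return all(results.values())


log = [Experiment("baseline", "baseline", {"loss": 3.2})]
log += [Experiment(f"struct-{i}", "structural", {"loss": 3.0}) for i in range(3)]
log += [Experiment(f"scalar-{i}", "scalar", {"loss": 3.1}) for i in range(16)]
print(may_stop(grade(log)))  # True
```

Criteria such as "document failed experiments" or "provide analysis" still need a reader, which is where the verifier sub-agent comes in.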
Results: Fable 5 Achieves 6x Improvement
| Metric | Fable 5 | Opus 4.7 | Advantage |
|---|---|---|---|
| Pipeline Improvement | ~6x over baseline | ~1x over baseline | 6x better |
| Structural Changes | 7 major architecture edits | 1 initial change | More ambitious |
| Resilience | Pushed through quantization regression | Stuck after early win | Recovered from failures |
| Exploration Strategy | Diverse experiments | Incremental scalar tuning | Broader search |
Behavioral Analysis
Fable 5 Approach:
Iteration 1: Run baseline (GPT-2 small)
→ Score: 65% accuracy
Iteration 2-5: Structural experiments
→ Try MoE architecture: Failed (too large)
→ Try smaller embedding dimension: +2%
→ Try quantization (int8): -5% initially
→ Push through quantization regression with calibration: +8%
Iteration 6-15: Hybrid structural + scalar
→ Adjust learning rate with new architecture: +3%
→ Try different optimizer (AdamW → Lion): +4%
→ Combine best structural changes: +12%
Final: 65% → 130% improvement (6.5x baseline)
Opus 4.7 Approach:
Iteration 1: Run baseline
→ Score: 65% accuracy
Iteration 2: Try larger learning rate
→ Score: 70% (+5%)
→ Small win!
Iteration 3-20: Incremental scalar tuning
→ Adjust learning rate schedule: +1%
→ Adjust batch size: +0.5%
→ Adjust weight decay: -0.5% (rejected)
→ Adjust dropout: +0.5%
... repeat similar adjustments ...
Final: 65% → 78% improvement (1.2x baseline)
Key Difference:
- Fable 5: Made bold structural bets, recovered from failures, compounded gains
- Opus 4.7: Found early win with scalars, then stayed in that template
Why Fable 5 Won
1. Resilience Through Failure
Fable 5 Quantization Regression:
Initial: 67% accuracy
After quantization: 62% (-5%)
Opus 4.7 would have: Rejected quantization, moved on
Fable 5 response:
"Quantization decreased accuracy. Investigating calibration..."
→ Adds post-training quantization calibration
→ Adjusts quantization-aware training schedule
→ Result: 75% accuracy (+13% from baseline)
→ Biggest single win
2. Structural Exploration
# Fable 5's diverse experiments
experiments = [
"Try Mixture-of-Experts architecture",
"Reduce embedding dimension by 50%",
"Implement rotary position embeddings (RoPE)",
"Use grouped query attention (GQA)",
"Apply knowledge distillation from larger model",
"Quantize to int8 with calibration",
"Experiment with sparse attention patterns"
]
# Opus 4.7's incremental changes
experiments = [
"Increase learning rate from 1e-4 to 2e-4",
"Adjust learning rate to 1.8e-4",
"Try batch size 64 instead of 32",
"Adjust weight decay from 0.01 to 0.02",
"Use cosine schedule instead of linear"
]
3. Compound Gains
Fable 5 stacked wins:
Smaller embeddings (+2%) +
Quantization with calibration (+8%) +
Better optimizer (+4%) +
RoPE instead of learned PE (+3%) =
+17% total improvement (the individual gains summed)
Opus 4.7 isolated wins:
Learning rate (+5%) [stopped exploring]
Why Verifier Sub-Agents Beat Self-Critique
The Self-Critique Problem
Anthropic Research Finding: Models struggle to critique their own outputs accurately.
Why Self-Critique Fails:
# Model generates code
model_output = """
def calculate_discount(price, discount_pct):
    return price - (price * discount_pct)
"""
# Same model self-critiques
self_critique = """
This code looks correct:
- Calculates discount amount properly ✓
- Returns final price after discount ✓
- No obvious bugs ✓
"""
# Actual bug: discount_pct should be divided by 100
# Self-critique missed it because model has bias toward its own output
Cognitive Bias in Self-Evaluation:
- Confirmation Bias: Model looks for reasons why its code is correct
- Blind Spots: Assumptions model made during generation carry to evaluation
- Context Anchoring: Evaluation is anchored to generation reasoning
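An environment check surfaces this bug without relying on the model's opinion of its own code. The sketch below assumes, as the example's comment implies, that `discount_pct` is passed as a percentage (20 means 20%); `calculate_discount_fixed` and `check` are hypothetical names added for illustration.

```python
def calculate_discount(price: float, discount_pct: float) -> float:
    """Version from the example: treats discount_pct as a fraction."""
    return price - (price * discount_pct)


def calculate_discount_fixed(price: float, discount_pct: float) -> float:
    """Corrected version: converts the percentage to a fraction first."""
    return price - (price * discount_pct / 100)


def check(fn) -> bool:
    """Environment feedback: 20% off 100.0 should be 80.0."""
    return abs(fn(100.0, 20) - 80.0) < 1e-9


print(check(calculate_discount))        # False
print(check(calculate_discount_fixed))  # True
```

A single concrete assertion gives the loop a signal that does not share the generator's assumptions.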
The Verifier Sub-Agent Solution
Independent Grading:
# Main agent generates solution
main_agent.task = "Implement user authentication API"
solution = main_agent.execute()
# Verifier sub-agent evaluates independently
verifier_agent = spawn_verifier(solution, rubric)
evaluation = verifier_agent.grade(solution)
# Verifier has NO knowledge of main agent's reasoning
# Starts fresh, checks against rubric objectively
Claude Managed Agents Implementation:
# CMA automatically spawns grader sub-agent for Outcomes
outcomes_config = {
"rubric": rubric_file,
"grader": "independent" # Spawns separate agent
}
# Main agent works
main_agent.run(task)
# After each major step, separate grader agent:
# 1. Reads current state (code, tests, logs)
# 2. Reads rubric criteria
# 3. Evaluates WITHOUT seeing main agent's reasoning
# 4. Returns pass/fail for each criterion
# 5. Main agent receives feedback and continues
Performance Improvement:
| Evaluation Method | Accuracy | False Positives |
|---|---|---|
| Self-Critique | 62% | 28% |
| Verifier Sub-Agent | 89% | 7% |
Based on Anthropic's internal evaluations on code correctness tasks
Prithvi Rajasekaran's Analysis
From Anthropic's engineering blog:
"We tested self-critique versus independent verification across hundreds of coding tasks. Models consistently overrated their own outputs by 20-30 percentage points. The verifier sub-agent, having no stake in the original solution, identified failure modes the main agent overlooked."
Example from Production:
Task: Generate SQL migration for new features table
Main Agent Output:
```sql
ALTER TABLE features ADD COLUMN enabled BOOLEAN DEFAULT true;
ALTER TABLE features ADD COLUMN config JSON;
```

Self-Critique: "Migration looks good ✓
- Adds required columns
- Sets sensible defaults
- Uses appropriate data types"

Verifier Sub-Agent: "Migration has issues ✗
- Missing NOT NULL constraints on enabled
- JSON type not supported in all MySQL versions (need LONGTEXT)
- No index on enabled column (will slow feature flag queries)
- No down migration provided"

**Result:** Verifier caught 4 issues self-critique missed.

---
## Case Study 2: Continual Learning Bench 1.0
### What is Continual Learning Bench?
Released by Parth Asawa and team at Anthropic, [Continual Learning Bench 1.0](https://continuallearningbench.ai) is the first realistic benchmark for measuring how AI systems improve in online settings.
**Core Premise:**
- Most benchmarks assume models are stateless
- Real-world agents should learn across sessions
- Memory enables continual improvement

**Example Task:** Sequential SQL Database Questions

Session 1: "What is the total revenue for Q1 2024?"
→ Agent queries database
→ Gets error: column 'prc' not found
→ Stores: "prc column doesn't exist, try prc_usd"

Session 2: "What is the average order value in March 2024?"
→ Agent recalls: use prc_usd instead of prc
→ Successfully queries database
→ Correct answer

Session 3-30: Additional questions...
→ Agent builds knowledge base of schema details
→ Performance improves over time
### The Memory Progression
Effective memory use follows a progression:

**1. Fail** → Make a mistake and document it

Session 1 attempt:
SELECT SUM(revenue) FROM orders WHERE quarter = 'Q1'
Error: column 'quarter' doesn't exist

Memory note: "Tried 'quarter' column in orders table, doesn't exist"

**2. Investigate** → Before moving on, figure out why

Session 1 continued:
"Why did quarter column fail? Let me examine schema."
DESCRIBE orders;
→ Finds: date column (DATE type), not quarter

Memory note (updated): "Orders table uses 'date' column (DATE), not 'quarter'. Need to extract quarter from date."

**3. Verify** → Turn diagnosis into checked fact

Session 1 verification:
SELECT date FROM orders LIMIT 1;
→ Returns: 2024-03-15
→ Confirms: date is DATE type, formatted YYYY-MM-DD

Memory note (updated): "VERIFIED: orders.date is DATE type (YYYY-MM-DD format). To get quarter: QUARTER(date) function."

**4. Distill** → Turn verification into general rule

Session 1 distillation:
Memory rule created:
"When filtering by time period in orders table:
- Use date column (not quarter, month, year columns)
- Extract quarters: QUARTER(date)
- Extract months: MONTH(date)
- Filter ranges: date BETWEEN 'start' AND 'end'
- Verified in Session 1"

**5. Consult** → Read the rule instead of re-deriving

Session 2:
Task: "What is average order value in March 2024?"
Agent thinks: "Need to filter orders by month. Check memory rules..."
→ Finds: "Extract months: MONTH(date)"
→ Applies directly:
SELECT AVG(prc_usd) FROM orders WHERE MONTH(date) = 3 AND YEAR(date) = 2024;
→ Correct on first try (no re-learning needed)
### Experimental Results: Fable 5 vs Earlier Models
**Test Setup:**
- **Benchmark:** Continual Learning Bench 1.0, SQL task
- **Format:** 30 sequential questions across separate agent sessions
- **Memory:** Shared filesystem across sessions
- **Models:** Sonnet 4.6, Opus 4.7, Fable 5

**Results:**

| Model | Progression Reached | Verification Coverage | Success Rate (First 10) | Success Rate (Last 10) |
|:---|:---|:---|:---|:---|
| **Sonnet 4.6** | Fail (Step 1) | N/A | 40% | 45% |
| **Opus 4.7** | Verify (Step 3) | 7-33% (median 17%) | 55% | 68% |
| **Fable 5** | Consult (Step 5) | 60-73% (median 66%) | 60% | 91% |
### Behavioral Analysis
**Sonnet 4.6: Exits at Step 1 (Fail)**

Memory contents after 30 sessions:

Session 1: "Tried column 'prc', didn't work. Maybe prc_usd?"
Session 3: "Quarter column missing. Possibly use date?"
Session 7: "Revenue query failed. Check table name?"
Session 12: "Date format unclear. Need to verify."
Session 18: "Aggregate function error. Syntax issue?"
... 15 more failure notes, many duplicates ...

**Problems:**
- Creates list of guesses, not verified facts
- Rarely consults prior notes
- Re-learns same lessons multiple times
- No systematic investigation

**Performance Curve:** Flat (40% → 45%)

**Opus 4.7: Exits at Step 3 (Verify)**

Memory contents after 30 sessions:

```markdown
## Schema Reference
### orders table
- date: DATE (format: YYYY-MM-DD) [VERIFIED Session 2]
- prc_usd: DECIMAL [possibly in cents? Verify]
- customer_id: INT [VERIFIED Session 5]
- status: VARCHAR [possible values: pending, complete, cancelled - NOT VERIFIED]
### customers table
- id: INT PRIMARY KEY [VERIFIED Session 3]
- name: VARCHAR
- region: VARCHAR [VERIFIED Session 8]
```
Problems:
- Verification coverage low (17% median)
- Many uncertainties flagged but not resolved
- Doesn't create general rules (stays at schema level)
- Consults memory inconsistently
Performance Curve: Improving (55% → 68%)
Fable 5: Completes Progression (Consult)
Memory contents after 30 sessions:
## Verified Schema Facts
### orders
- date: DATE (YYYY-MM-DD) [VERIFIED S2]
- prc_usd: DECIMAL(10,2) in dollars (NOT cents) [VERIFIED S4]
- customer_id: INT FK → customers.id [VERIFIED S5]
- status: VARCHAR CHECK('pending','complete','cancelled') [VERIFIED S7]
## General Rules
### Time Filtering
When filtering orders by time period:
1. Use: QUARTER(date), MONTH(date), YEAR(date)
2. Range: date BETWEEN 'YYYY-MM-DD' AND 'YYYY-MM-DD'
3. Verified: Sessions 2, 6, 11, 15, 22
### Revenue Calculations
1. Base revenue: SUM(prc_usd) [values already in dollars]
2. Growth: (current - previous) / previous * 100
3. Average: AVG(prc_usd) not MEDIAN (no MEDIAN in MySQL)
4. Verified: Sessions 4, 9, 13, 18, 24
### Join Patterns
orders → customers: orders.customer_id = customers.id
→ Verified: Sessions 5, 10, 14, 19, 26
→ Index available on orders.customer_id (fast joins)
## Common Pitfalls Solved
1. ~~Don't use prc column~~ → Use prc_usd [S1-S3]
2. ~~Don't calculate cents→dollars~~ → Already in dollars [S4]
3. ~~Don't use MEDIAN~~ → MySQL lacks it, use AVG [S9]
Advantages:
- 73% verification coverage (22 of 30 questions)
- Distills learnings into general rules
- Consistently consults memory before querying
- Builds on prior sessions systematically
Performance Curve: Strong improvement (60% → 91%)
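The fail → investigate → verify → distill → consult progression described in this case study can be modeled as a small data structure. This is a sketch under stated assumptions, not how the benchmark or any model stores memory: `Stage`, `Note`, and `Memory` are hypothetical names, and the rule that a note is never downgraded is a design choice made for the example.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Dict, Optional


class Stage(IntEnum):
    FAIL = 1         # a mistake was recorded
    INVESTIGATE = 2  # a cause was proposed
    VERIFY = 3       # the cause was checked against the environment
    DISTILL = 4      # the fact was turned into a reusable rule


@dataclass
class Note:
    text: str
    stage: Stage


@dataclass
class Memory:
    notes: Dict[str, Note] = field(default_factory=dict)

    def record(self, topic: str, text: str, stage: Stage) -> None:
        """Store or upgrade a note; an earlier stage never overwrites a later one."""
        current = self.notes.get(topic)
        if current is None or stage >= current.stage:
            self.notes[topic] = Note(text, stage)

    def consult(self, topic: str) -> Optional[str]:
        """Step 5: return a distilled rule if one exists, otherwise None."""
        note = self.notes.get(topic)
        return note.text if note and note.stage == Stage.DISTILL else None


memory = Memory()
topic = "orders.time_filter"
memory.record(topic, "Tried 'quarter' column in orders table, doesn't exist", Stage.FAIL)
memory.record(topic, "orders uses 'date' (DATE), not 'quarter'", Stage.INVESTIGATE)
memory.record(topic, "VERIFIED: orders.date is DATE (YYYY-MM-DD)", Stage.VERIFY)
unfinished = memory.consult(topic)  # None: nothing distilled yet
memory.record(topic, "Filter orders by time with QUARTER(date) or MONTH(date)", Stage.DISTILL)
print(memory.consult(topic))
```

In these terms, a model that exits at step 1 leaves every note at `Stage.FAIL`, so `consult` never has anything to return.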
Rubric Design: The Critical Skill
The Rubric Paradox
Steve's Insight (from X thread):
"A well-designed rubric is doing more work than the model. Fable self-correcting only matters if the feedback in the environment is honest. Garbage rubric + great model = a confidently wrong loop. Rubric design is the skill now, the model is the easy part."
The Problem:
# Bad rubric
rubric = """
1. Code should be good
2. Tests should pass
3. Performance should be acceptable
"""
# Fable 5 result:
# ✓ All criteria met! (but code is mediocre)
# Why? Rubric provides no honest feedback
The Solution:
# Good rubric
rubric = """
1. Code passes all existing tests (run: pytest -v)
2. Code adds no TypeScript errors (run: tsc --strict)
3. API response time < 200ms (measured via: wrk -t4 -c100 -d30s)
4. Memory usage < 512MB under load (measured via: /usr/bin/time -v)
5. Code coverage > 85% (run: pytest --cov --cov-report=term)
6. All functions have type hints (run: mypy --strict)
7. Linter passes (run: ruff check .)
8. No security issues (run: bandit -r src/)
"""
# Fable 5 result:
# Honest feedback on each criterion
# Self-corrects based on objective measurements
Principles of Good Rubrics
1. Checkable via Code/Commands
Bad:
"Database queries should be fast"
Good:
"All database queries complete in <50ms
Measure: EXPLAIN ANALYZE each query
Verify: 95th percentile < 50ms in pg_stat_statements"
2. Incremental Validation
Bad:
"Complete entire feature and pass all tests"
Good:
Step 1: Database schema migration runs without errors
Step 2: API endpoint returns 200 status for valid requests
Step 3: API endpoint returns 400 for invalid requests with helpful error
Step 4: Integration tests pass (run: pytest tests/integration/)
Step 5: Load test handles 100 RPS (run: locust -f loadtest.py)
3. Avoid Vague Judgments
Bad:
"Code should follow best practices"
Good:
Code quality checks:
- Cyclomatic complexity < 10 per function (run: radon cc src/)
- No duplicate code blocks > 6 lines (run: jscpd src/)
- All public functions have docstrings (run: pydocstyle src/)
- No TODO or FIXME comments in committed code (run: grep -r "TODO\|FIXME" src/)
4. Environment Provides Feedback
Bad:
"Implementation should be correct"
Good:
Correctness verified by:
1. Unit tests: 47 tests pass (run: pytest tests/unit/)
2. Property tests: No failures in 1000 trials (run: pytest tests/properties/)
3. Integration tests: API returns expected responses (run: pytest tests/integration/)
4. End-to-end test: User flow completes successfully (run: playwright test)
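The four principles above share one idea: each criterion maps to a command whose exit status answers pass or fail. A minimal sketch of such a rubric runner follows; `run_check` and `grade_rubric` are hypothetical names, and the demo uses the Python interpreter itself as the "tool" so the example runs without pytest or ruff installed.

```python
import subprocess
import sys
from typing import Dict, List


def run_check(command: List[str], timeout: float = 60.0) -> bool:
    """Return True only if the command runs and exits with status 0."""
    try:
        completed = subprocess.run(command, capture_output=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        return False  # a missing tool or a hang counts as a failure
    return completed.returncode == 0


def grade_rubric(rubric: Dict[str, List[str]]) -> Dict[str, bool]:
    """Run every criterion's command and report pass/fail per criterion."""
    return {name: run_check(command) for name, command in rubric.items()}


# Stand-in commands; a real rubric might map "tests" to ["pytest", "-v"]
# and "lint" to ["ruff", "check", "."]
demo_rubric = {
    "always passes": [sys.executable, "-c", "raise SystemExit(0)"],
    "always fails": [sys.executable, "-c", "raise SystemExit(1)"],
}
print(grade_rubric(demo_rubric))
```

Treating a missing tool as a failure is deliberate: a criterion that cannot be measured should not count as met.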
Example: ML Training Rubric (Parameter Golf)
# Parameter Golf Rubric
## Required Experiments (Checkable)
### 1. Baseline
- [ ] Train baseline GPT-2 model
- [ ] Log baseline metrics (loss, accuracy, params, training time)
- [ ] Baseline completes in < 10 minutes
- [ ] Model artifact < 16MB
Check: `ls -lh models/baseline.pt` shows < 16MB
### 2. Exploration (20 experiments minimum)
Structural experiments (at least 3):
- [ ] Experiment with MoE architecture
- [ ] Experiment with different embedding dimensions
- [ ] Experiment with attention mechanisms (MQA, GQA, MHA)
- [ ] Experiment with quantization (int8, int4)
Scalar experiments (at least 10):
- [ ] Learning rate variations (3 different values)
- [ ] Batch size variations (3 different values)
- [ ] Optimizer choices (AdamW, Lion, SGD)
- [ ] Learning rate schedules (cosine, linear, constant)
Check: `wc -l experiments.log` shows >= 20 lines
### 3. Measurement
- [ ] Each experiment logs: loss, accuracy, params, time
- [ ] Results stored in structured format (CSV/JSON)
- [ ] Can reproduce any experiment from logs
Check: `jq '. | length' experiments.json` shows >= 20
### 4. Analysis
- [ ] Document what worked and why
- [ ] Document what failed and lessons learned
- [ ] Identify best model and reasoning
Check: `cat analysis.md` has sections for successes, failures, insights
### 5. Improvement
- [ ] Best model improves on baseline by >= X%
- [ ] All constraints still satisfied (< 16MB, < 10min)
Check: Compare `best_model_accuracy` vs `baseline_accuracy`
## Grading
Pass if:
- All checkable criteria satisfied
- Improvement > threshold
- No constraint violations
Fail if:
- Any experiment violates constraints
- < 20 experiments attempted
- Missing logs or analysis
Rubric Anti-Patterns
1. Subjective Criteria
❌ "Code should be elegant and maintainable"
✅ "Code has no functions > 50 lines (check: ESLint max-lines-per-function rule with max set to 50)"
2. Unmeasurable Goals
❌ "System should scale well"
✅ "System handles 10,000 concurrent users with p95 latency < 500ms"
3. Missing Stop Conditions
❌ "Keep optimizing until performance is good"
✅ "Optimize until p95 latency < 200ms OR 20 experiments attempted"
4. No Incremental Checkpoints
❌ "Complete entire system and deploy to production"
✅ "Complete in order:
1. Local tests pass
2. Staging deployment succeeds
3. Smoke tests pass in staging
4. Load tests pass in staging
5. Production deployment (with rollback plan)"
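The stop condition in anti-pattern 3 ("p95 latency < 200ms OR 20 experiments attempted") is easy to make executable. The sketch below uses the nearest-rank definition of the 95th percentile, which is one of several common definitions; `p95` and `should_stop` are hypothetical names.

```python
from typing import List, Tuple


def p95(samples_ms: List[float]) -> float:
    """Nearest-rank 95th percentile of latency samples."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]


def should_stop(
    samples_ms: List[float],
    experiments_run: int,
    target_ms: float = 200.0,
    max_experiments: int = 20,
) -> Tuple[bool, str]:
    """Stop when the goal is met or the experiment budget is exhausted."""
    if p95(samples_ms) < target_ms:
        return True, "goal met"
    if experiments_run >= max_experiments:
        return True, "budget exhausted"
    return False, "keep going"


print(p95([float(i) for i in range(1, 101)]))  # 95.0
print(should_stop([300.0] * 20, experiments_run=5))  # (False, 'keep going')
```

Returning the reason alongside the decision lets the final report distinguish a success from a loop that merely ran out of budget.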
Practical Implementation Guide
Setting Up Self-Correction Loops
Option 1: Claude Code with /goal
# Start Claude Code session (the CLI command is `claude`)
claude
# Define goal with specific success criteria
/goal "Refactor src/api/ to use async/await throughout
Success criteria:
- All API routes use async/await (no callbacks)
- All tests pass (npm test)
- No new TypeScript errors (npm run type-check)
- API response times improve by >= 10%
Stop when: All criteria met OR 10 refactor iterations attempted"
# Claude will run autonomous loop
# You can monitor progress and interrupt if needed
Option 2: Claude Managed Agents with Outcomes
# File: train_model_task.py
# Illustrative sketch: the ManagedAgent class and its parameters are assumed
# here; check the Claude Managed Agents documentation for the actual interface.
from anthropic import ManagedAgent
# Define rubric
rubric = """
## ML Training Optimization
Success criteria:
1. Baseline model trained and metrics logged
2. At least 15 experimental variations attempted
3. Best model achieves >= 90% validation accuracy
4. Training time < 30 minutes per experiment
5. All experiments documented with reasoning
6. Final model saved and reproducible
"""
# Create managed agent
agent = ManagedAgent(
    model="claude-fable-5",
    task="Optimize ML training pipeline for MNIST classification",
    rubric=rubric,
    max_iterations=25,
    timeout_hours=8,
    resources={
        "gpus": "2xA100",
        "memory_gb": 64
    },
    memory_enabled=True  # Enable cross-session memory
)
# Run agent (returns when Outcomes satisfied or timeout)
result = agent.run()
print(f"Task completed: {result.success}")
print(f"Iterations used: {result.iterations}")
print(f"Final metrics: {result.metrics}")
Implementing Verifier Sub-Agents
Manual Implementation (for custom harnesses):
# File: verifier_pattern.py
from anthropic import Anthropic
# With no arguments, the client reads the ANTHROPIC_API_KEY environment variable
client = Anthropic()
def run_with_verification(task, rubric):
    """Implements verifier sub-agent pattern"""
    # Main agent works on task
    main_response = client.messages.create(
        model="claude-fable-5",
        max_tokens=8000,
        messages=[{
            "role": "user",
            "content": f"Complete this task:\n\n{task}"
        }]
    )
    solution = main_response.content[0].text
    # Verifier sub-agent evaluates in an independent context: it sees only
    # the rubric and the solution, never the main agent's reasoning
    verifier_response = client.messages.create(
        model="claude-fable-5",  # Can use same or different model
        max_tokens=2000,
        messages=[{
            "role": "user",
            "content": f"""Evaluate this solution against the rubric.
Rubric:
{rubric}
Solution to evaluate:
{solution}
For each rubric criterion:
1. State criterion
2. Check if solution satisfies it
3. Provide evidence (quote relevant parts)
4. Mark as PASS or FAIL
End with exactly one line: "Final verdict: Overall PASS" or "Final verdict: Overall FAIL"."""
        }]
    )
    evaluation = verifier_response.content[0].text
    return {
        "solution": solution,
        "evaluation": evaluation,
        # Simple substring check on the verdict line requested in the prompt
        "passed": "Final verdict: Overall PASS" in evaluation
    }
# Usage
task = "Implement binary search in Python with full test coverage"
rubric = """
1. Function signature: binary_search(arr, target) -> int
2. Returns index if found, -1 if not found
3. Handles empty arrays
4. Handles single-element arrays
5. Test coverage >= 95%
6. All tests pass
"""
result = run_with_verification(task, rubric)
if result["passed"]:
    print("Solution accepted!")
else:
    print("Solution needs revision:", result["evaluation"])
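A substring test on the verifier's reply can be fooled if the verifier restates the instruction or revises its verdict partway through. One possible hardening, sketched below, is to read only explicit verdict lines and fail closed when none is found; `parse_verdict` is a hypothetical helper, and the verdict wording it expects is the one used in the prompt above.

```python
import re

# Matches lines such as "Final verdict: Overall PASS" but not an echoed
# instruction such as "Final verdict: Overall PASS or FAIL"
VERDICT_LINE = re.compile(
    r"^\s*Final verdict:\s*(?:Overall\s+)?(PASS|FAIL)\b(?!\s+or\b)",
    re.IGNORECASE | re.MULTILINE,
)


def parse_verdict(evaluation: str) -> bool:
    """Return True only if the last explicit verdict line says PASS."""
    verdicts = VERDICT_LINE.findall(evaluation)
    if not verdicts:
        return False  # no verdict found: fail closed
    return verdicts[-1].upper() == "PASS"


print(parse_verdict("1. Signature: PASS\nFinal verdict: Overall PASS"))  # True
print(parse_verdict("Final verdict: Overall PASS or FAIL"))              # False
```

Failing closed means a malformed evaluation sends the solution back for another iteration rather than accepting it by accident.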
Memory Management Best Practices
1. Structured Memory Files
# File: .claude/memory/sql_schema_facts.md
## Verified Schema Details
Last updated: Session 23
### orders table
- id: INT PRIMARY KEY AUTO_INCREMENT [Verified S1]
- date: DATE (YYYY-MM-DD format) [Verified S2]
- prc_usd: DECIMAL(10,2) - prices in dollars, NOT cents [Verified S4]
- customer_id: INT - FK to customers.id [Verified S5]
- status: ENUM('pending','processing','shipped','delivered') [Verified S7]
### customers table
- id: INT PRIMARY KEY AUTO_INCREMENT [Verified S3]
- email: VARCHAR(255) UNIQUE [Verified S6]
- created_at: TIMESTAMP DEFAULT CURRENT_TIMESTAMP [Verified S8]
## Query Patterns
### Filtering by date
✅ CORRECT: WHERE MONTH(date) = 3 AND YEAR(date) = 2024
❌ WRONG: WHERE quarter = 'Q1' (column doesn't exist)
### Revenue calculations
✅ CORRECT: SUM(prc_usd) -- already in dollars
❌ WRONG: SUM(prc_usd) / 100 -- don't divide, not in cents
### Joins
✅ CORRECT:
FROM orders o
JOIN customers c ON o.customer_id = c.id
Performance: Index exists on orders.customer_id [Verified S10]
2. Session Templates
# File: session_template.py
import os
from datetime import datetime
def start_session(session_num):
    """Initialize new session with memory access"""
    memory_dir = ".claude/memory"
    os.makedirs(memory_dir, exist_ok=True)
    # Read previous learnings
    facts_file = f"{memory_dir}/facts.md"
    if os.path.exists(facts_file):
        with open(facts_file) as f:
            previous_facts = f.read()
    else:
        previous_facts = "No prior facts"
    # Create session log
    session_log = f"{memory_dir}/session_{session_num}.md"
    with open(session_log, "w") as f:
        f.write(f"# Session {session_num}\n")
        f.write(f"Started: {datetime.now()}\n\n")
        f.write("## Prior Knowledge\n")
        f.write(previous_facts + "\n\n")
        f.write("## New Discoveries\n\n")
    return session_log
def end_session(session_num, new_facts):
    """Update memory with session learnings"""
    memory_dir = ".claude/memory"
    facts_file = f"{memory_dir}/facts.md"
    # Append new facts
    with open(facts_file, "a") as f:
        f.write(f"\n## Session {session_num} ({datetime.now().date()})\n")
        f.write(new_facts + "\n")
3. Memory Retrieval
# In agent prompt
system_prompt = f"""You are working on a multi-session task.
MEMORY ACCESS:
Before answering, check your memory files in .claude/memory/:
- facts.md: Verified facts from previous sessions
- patterns.md: Successful approaches and patterns
- failures.md: Things that didn't work (avoid repeating)
MEMORY UPDATE:
After each significant discovery:
1. Verify it's actually correct (run tests, check docs)
2. Add to appropriate memory file with session number
3. Mark as [VERIFIED] or [UNVERIFIED]
Use memory to avoid re-learning same lessons.
"""
Advanced Patterns
1. Hierarchical Loops (Loop-of-Loops)
# Pseudocode: /goal is a Claude Code command, not Python syntax, so a real
# harness would start one Claude Code run per goal
# Outer loop: Daily maintenance
while True:
    # Inner loop 1: Monitor and fix build
    /goal "Ensure main branch builds successfully"
    # Inner loop 2: Review PRs
    /goal "Review open PRs, approve or request changes"
    # Inner loop 3: Update dependencies
    /goal "Check for security updates, apply if safe"
    time.sleep(24 * 3600)  # Run daily
2. Parallel Verification
# Multiple verifiers for different aspects
# (verify_async is a placeholder for an async verifier call; because of the
# await, this snippet has to run inside an async function)
verifiers = [
    ("security", security_rubric),
    ("performance", performance_rubric),
    ("correctness", correctness_rubric)
]
evaluations = await asyncio.gather(*[
    verify_async(solution, rubric)
    for name, rubric in verifiers
])
overall_pass = all(e["passed"] for e in evaluations)
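A runnable version of the same pattern needs an event loop and a concrete verifier. The sketch below substitutes a toy verifier so the concurrency structure can be exercised without API calls; `verify_async` here is a stand-in with its own signature, and the "must contain:" rubric format is invented purely for the demo.

```python
import asyncio
from typing import Dict, List, Tuple


async def verify_async(name: str, solution: str, rubric: str) -> Dict[str, object]:
    """Stand-in verifier: a real one would send `rubric` and `solution` to a model."""
    await asyncio.sleep(0)  # yield control, as a network call would
    required = rubric.split(":", 1)[1].strip()  # toy format: "must contain: <text>"
    return {"name": name, "passed": required in solution}


async def verify_all(solution: str, verifiers: List[Tuple[str, str]]) -> Dict[str, object]:
    """Run all verifiers concurrently; the solution passes only if every one passes."""
    evaluations = await asyncio.gather(
        *(verify_async(name, solution, rubric) for name, rubric in verifiers)
    )
    return {"passed": all(e["passed"] for e in evaluations), "evaluations": evaluations}


demo_verifiers = [
    ("security", "must contain: parameterized"),
    ("correctness", "must contain: return"),
]
outcome = asyncio.run(
    verify_all("uses parameterized queries and return values", demo_verifiers)
)
print(outcome["passed"])  # True
```

`asyncio.gather` returns results in the order the verifiers were listed, so each evaluation can be matched back to its rubric by position or by name.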
3. Progressive Rubric Tightening
# Pseudocode (/goal is a Claude Code command, not Python syntax)
# Start with loose rubric, tighten over iterations
rubrics = [
    "Response time < 1000ms",  # Iteration 1-5
    "Response time < 500ms",   # Iteration 6-10
    "Response time < 200ms",   # Iteration 11-15
    "Response time < 100ms"    # Iteration 16+
]
for i, rubric in enumerate(rubrics):
    /goal f"Optimize API performance: {rubric}"
    # Agent has 5 iterations per rubric level
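The tightening schedule can also be expressed as a plain lookup from iteration number to threshold, which a harness could use to build each goal string. This is a minimal sketch using the thresholds listed above; `SCHEDULE` and `threshold_for` are hypothetical names.

```python
from typing import List, Tuple

# (first iteration at which the limit applies, maximum response time in ms)
SCHEDULE: List[Tuple[int, int]] = [(1, 1000), (6, 500), (11, 200), (16, 100)]


def threshold_for(iteration: int) -> int:
    """Return the response-time limit in force at a given 1-based iteration."""
    if iteration < 1:
        raise ValueError("iterations are numbered from 1")
    limit_ms = SCHEDULE[0][1]
    for first_iteration, candidate_ms in SCHEDULE:
        if iteration >= first_iteration:
            limit_ms = candidate_ms
    return limit_ms


for n in (1, 5, 6, 11, 16, 40):
    print(n, f"Response time < {threshold_for(n)}ms")
```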
Limitations and Caveats
1. Rubric Quality is Paramount
Problem:
rubric = "Make the code better"
# Result: Infinite loop of cosmetic changes
# Agent thinks it's improving, rubric can't say no
Solution: Invest time in rubric design upfront
2. Cost Accumulation
Long loops can be expensive:
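8-hour Parameter Golf run (assuming $0.01 per 1K input tokens and $0.05 per 1K output tokens, the rates implied by the total):
- Input: ~500K tokens (repeated context)
- Output: ~2M tokens (code + analysis)
- Cost: 500 × $0.01 + 2,000 × $0.05 = $5 + $100 = $105
20 such runs over 2 days:
- Total cost: 20 × $105 = $2,100+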
Mitigation: Set budget limits, use cheaper models for verification
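The arithmetic above, and a simple budget guard, can be written out as follows. The per-1K-token prices are the illustrative rates used in this section, not actual pricing, and `run_cost` and `runs_within_budget` are hypothetical helpers.

```python
def run_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_1k: float,
    output_price_per_1k: float,
) -> float:
    """Cost of one run given token counts and per-1K-token prices."""
    return (
        input_tokens / 1000 * input_price_per_1k
        + output_tokens / 1000 * output_price_per_1k
    )


def runs_within_budget(cost_per_run: float, budget: float) -> int:
    """How many whole runs fit inside a fixed budget."""
    return int(budget // cost_per_run)


single_run = run_cost(500_000, 2_000_000, input_price_per_1k=0.01, output_price_per_1k=0.05)
print(f"${single_run:.2f} per run")       # $105.00 per run
print(f"${20 * single_run:,.2f} for 20")  # $2,100.00 for 20
print(runs_within_budget(single_run, budget=500.0))  # 4
```

Output tokens dominate in this example ($100 of the $105), so capping output length or moving verification to a cheaper model affects the bill far more than trimming input context.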
3. False Convergence
Problem: Agent satisfies rubric but solution is wrong
rubric = "Tests pass"
# Agent writes trivial tests that always pass
# Rubric technically satisfied
Solution: Include test quality criteria in rubric
4. Memory Bloat
Problem: Memory files grow unbounded over sessions
Solution: Implement memory compaction/distillation
# After every 10 sessions (pseudocode: /goal is a Claude Code command, not Python syntax)
if session_num % 10 == 0:
    /goal "Distill memory files:
    - Merge duplicate facts
    - Remove obsolete information
    - Keep only high-value patterns
    - Compress to < 50KB total"
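The mechanical part of compaction, merging duplicate facts and checking the size budget, does not need a model at all. The sketch below handles only exact duplicates after normalizing case and whitespace; deciding which information is obsolete or high-value still requires judgment. `compact_memory` is a hypothetical helper, and the 50KB default mirrors the limit in the goal above.

```python
from typing import List, Tuple


def compact_memory(lines: List[str], max_bytes: int = 50 * 1024) -> Tuple[List[str], bool]:
    """Drop blank lines and repeated facts, keeping the first occurrence of each.

    Returns the kept lines and whether they fit within max_bytes.
    """
    seen = set()
    kept: List[str] = []
    for line in lines:
        key = " ".join(line.lower().split())  # ignore case and spacing differences
        if not key or key in seen:
            continue
        seen.add(key)
        kept.append(line.strip())
    size = sum(len(entry.encode("utf-8")) + 1 for entry in kept)  # +1 per newline
    return kept, size <= max_bytes


notes = [
    "- Use prc_usd, not prc [S1]",
    "- use  PRC_USD, not prc [S1] ",
    "",
    "- orders.date is DATE [S2]",
]
compacted, fits = compact_memory(notes)
print(compacted, fits)
```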
When to Use Loop Design vs Direct Prompting
Use Loops When:
✅ Long-Running Tasks (hours to days)
- ML training optimization
- Codebase-wide refactors
- Multi-day research projects
✅ Objective Success Criteria
- Tests must pass
- Performance must hit threshold
- Build must succeed
✅ Iterative Improvement
- Parameter tuning
- Performance optimization
- Incremental feature development
✅ Multi-Session Learning
- Customer support patterns
- Codebase knowledge building
- Procedural skill improvement
Use Direct Prompting When:
⚠️ Short, Well-Defined Tasks (minutes)
- Write single function
- Format code snippet
- Answer specific question
⚠️ Subjective Outcomes
- Creative writing
- Design decisions
- Brainstorming
⚠️ One-Shot Needs
- Quick bug fix
- Documentation lookup
- Code explanation
⚠️ Exploratory Work
- Investigating unfamiliar codebase
- Researching approach options
- Prototyping ideas
Getting Started Checklist
Step 1: Choose Your Platform
- Claude Code (for quick /goal iterations)
- Claude Managed Agents (for long-running tasks with resources)
- Custom harness (for specific workflows)
Step 2: Design Your Rubric
- Define checkable success criteria
- Include measurement commands
- Set incremental milestones
- Add stop conditions
Step 3: Enable Verification
- Use verifier sub-agents (CMA Outcomes does this automatically)
- Separate evaluation from generation
- Check rubric criteria independently
Step 4: Set Up Memory (if multi-session)
- Create memory directory structure
- Define memory file templates
- Implement memory retrieval in prompts
- Add memory update workflow
Step 5: Test and Iterate
- Run small-scale test (< 1 hour)
- Review rubric effectiveness
- Adjust based on agent behavior
- Scale up gradually
Sources and References
Primary Sources
Lance Martin's Thread:
- Original X Thread
- Published: June 9, 2026
- Lance Martin, Member of Technical Staff at Anthropic
Anthropic Resources:
- Claude Managed Agents Documentation
- Fable 5 Prompting Guide
- Anthropic Engineering Blog: Self-Critique vs Verification
Benchmarks:
- Parameter Golf Challenge
- Continual Learning Bench 1.0
- Parth Asawa et al., "Continual Learning Bench 1.0" (May 2026)
Lance Martin's insights on designing loops with Claude Fable 5 were shared via X on June 9, 2026, demonstrating that Mythos-class models excel at self-correction and memory when given proper environmental feedback through well-designed rubrics, verifier sub-agents, and structured memory systems—achieving 6x improvements over earlier models on challenging ML engineering and continual learning tasks.