Designing Loops with Claude Fable 5: Self-Correction and Memory Guide

Learn how to design effective loops with Claude Fable 5 for self-correction and memory—from Parameter Golf experiments to Continual Learning Bench, with verifier sub-agents and rubric design.

9 min read · Yash Thakker
Claude Fable 5 · Loop Engineering · AI Agents · Self-Correction · Agent Memory

TL;DR: Lance Martin from Anthropic shares practical guidance on designing loops with Claude Fable 5, showing that Mythos-class models excel at self-correction and memory when given proper feedback mechanisms. In experiments on Parameter Golf (an ML engineering challenge), Fable 5 improved on its baseline roughly 6x more than Opus 4.7 by making structural changes. On Continual Learning Bench 1.0, it followed the full fail → investigate → verify → distill → consult memory progression. Key insights: use verifier sub-agents instead of self-critique, design honest rubrics that provide environmental feedback, and use memory across sessions for continual learning tasks.


The Shift to Loop Design with Fable 5

Mythos-class models like Claude Fable 5 have fundamentally changed how teams at Anthropic work. Instead of prompting directly, engineers now design loops that let the model self-correct based on environmental feedback.

The Core Philosophy

Traditional Approach:

Human → Prompt → Model → Output → Human evaluation → Repeat

Loop-Based Approach:

Human → Design Loop (goal/rubric) → Model runs autonomously →
Environment feedback → Model self-corrects → Repeat until goal satisfied

Why This Matters:

  • Autonomy: Model runs without constant human intervention
  • Feedback: Environment provides objective signals (tests pass, metrics improve)
  • Resilience: Model learns to push through temporary failures
  • Scale: Enables long-running tasks (hours to days)

Lance Martin's Key Insight:

"Rather than directly prompting and steering Fable 5, it's often better to design loops that let the model self-correct in response to environment feedback (e.g., /goal or Outcomes) and manage its own context (e.g., via memory)."


Two Primitives for Self-Correction Loops

1. /goal in Claude Code

The /goal command enables loop engineering directly in Claude Code:

Usage:

# Define a goal that Claude will work toward
/goal "Reduce API response time to under 200ms while maintaining 99.9% uptime"

# Claude will:
# 1. Measure current performance
# 2. Identify bottlenecks
# 3. Implement optimizations
# 4. Test changes
# 5. Verify goal is met
# 6. Repeat steps 2-5 until goal satisfied or iteration limit reached

How It Works:

graph TD
    A["/goal command issued"] --> B[Claude analyzes current state]
    B --> C[Proposes changes]
    C --> D[Implements changes]
    D --> E[Tests/measures results]
    E --> F{Goal satisfied?}
    F -->|No| G[Analyze what failed]
    G --> C
    F -->|Yes| H[Stop and report success]
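
The flow above can be sketched as a plain loop. This is a minimal illustration of the control flow only, not how /goal is implemented: `run_goal_loop`, `measure`, and `apply_change` are hypothetical names, and the toy "environment" is just an error counter that drops by a fixed, made-up amount per change.

```python
from typing import Callable


def run_goal_loop(
    measure: Callable[[], int],
    apply_change: Callable[[int], None],
    max_iterations: int = 10,
) -> dict:
    """Repeat change -> measure until the goal (zero errors) is met or the budget runs out."""
    history = [measure()]  # baseline measurement before any change
    for iteration in range(1, max_iterations + 1):
        if history[-1] == 0:  # goal satisfied: stop
            break
        apply_change(iteration)
        history.append(measure())  # environment feedback after the change
    return {
        "satisfied": history[-1] == 0,
        "iterations": len(history) - 1,
        "history": history,
    }


# Toy environment (hypothetical numbers): each change fixes up to 8 errors.
state = {"errors": 47}


def measure() -> int:
    return state["errors"]


def apply_change(_iteration: int) -> None:
    state["errors"] = max(0, state["errors"] - 8)


result = run_goal_loop(measure, apply_change, max_iterations=10)
print(result)
```

The iteration limit matters as much as the goal check: without it, a goal the environment can never confirm would keep the loop running indefinitely.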

Example Loop:

# Claude Code session
/goal "Fix all TypeScript strict mode errors in src/ directory"

# Iteration 1:
# - Scans src/ for TS errors
# - Finds 47 errors across 12 files
# - Fixes type annotations in user.ts
# - Runs tsc --strict
# - Result: 39 errors remaining

# Iteration 2:
# - Identifies missing return types
# - Adds explicit return types to 8 functions
# - Runs tsc --strict
# - Result: 31 errors remaining

# ... continues until 0 errors ...

# Iteration 8:
# - Fixes final implicit any in config.ts
# - Runs tsc --strict
# - Result: 0 errors ✓
# - Goal satisfied, stops

2. Outcomes in Claude Managed Agents (CMA)

Claude Managed Agents provides a hosted environment for long-running agentic tasks:

Key Features:

  • Hosted Sandbox: Isolated execution environment
  • Self-Hosted Resources: Connect your own GPUs, databases, etc.
  • Automatic Grader: Spawns verifier sub-agent for Outcomes evaluation
  • Multi-Session Memory: Persistent filesystem across agent sessions

Outcomes Workflow:

# Define rubric file (criteria.md)
"""
## Success Criteria

1. Baseline training run completes successfully
2. At least 20 experimental variations attempted
3. Each experiment logs metrics (loss, accuracy, training time)
4. Best model achieves >85% validation accuracy
5. Results documented in results.md with analysis
6. Training code is reproducible (requirements.txt, README)
"""

# Launch Managed Agent with Outcomes
agent = ManagedAgent(
    task="Optimize ML training pipeline for best validation accuracy",
    rubric_file="criteria.md",
    max_runtime="8 hours",
    resources={"gpus": "8xH100"}
)

# Agent runs autonomously
# Outcomes grader checks criteria after each major step
# Agent continues until all criteria satisfied

Case Study 1: Parameter Golf (Fable 5 vs Opus 4.7)

What is Parameter Golf?

Parameter Golf is an open-source ML engineering challenge:

Goal: Train the best model that fits in a 16MB artifact in < 10 minutes on 8xH100s

The Challenge:

  • Single training file (train_gpt.py)
  • Edit architecture, hyperparameters, training loop
  • Launch training, poll logs, read scores
  • Decide next experiment based on results
  • Repeat to maximize performance

Why It's Hard:

  • Requires understanding ML architecture
  • Trade-offs between model size and performance
  • Experimental design (what to try next)
  • Resilience through failed experiments

Experimental Setup

Environment:

  • Platform: Claude Managed Agents (CMA)
  • Resources: 8xH100 GPUs (self-hosted sandbox)
  • Runtime: Up to 8 hours per test
  • Models: Fable 5 vs Opus 4.7

Rubric (9 checkable criteria):

1. Run a baseline model (establish starting point)
2. Complete at least 20 distinct experiments
3. Each experiment must log metrics
4. Try at least 3 structural changes (architecture modifications)
5. Try at least 10 scalar changes (hyperparameter tuning)
6. Document failed experiments and learnings
7. Achieve >X improvement over baseline
8. Generate reproducible results (all code + configs saved)
9. Provide analysis of what worked and why

Grading:

  • Verifier Sub-Agent: Outcomes grader checks all 9 criteria
  • Independent Evaluation: Grader runs separately from main agent
  • Stop Condition: Agent cannot finish until all criteria met
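
One way to picture the grading setup is a rubric expressed as named predicates over the run's artifacts, with a stop condition that requires every criterion to pass. The sketch below is an assumption-laden illustration, not the Outcomes grader: `Criterion`, `grade`, `may_stop`, and the artifact layout are all hypothetical, and only four of the nine criteria are modeled.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Criterion:
    name: str
    # The check sees only the run's artifacts, never the main agent's reasoning.
    check: Callable[[dict], bool]


def grade(artifacts: dict, rubric: list[Criterion]) -> dict[str, bool]:
    """Return a pass/fail verdict per criterion."""
    return {c.name: bool(c.check(artifacts)) for c in rubric}


def may_stop(report: dict[str, bool]) -> bool:
    """The agent may finish only when every criterion passes."""
    return all(report.values())


rubric = [
    Criterion("baseline run", lambda a: a.get("baseline_score") is not None),
    Criterion(">= 20 experiments", lambda a: len(a["experiments"]) >= 20),
    Criterion("every experiment logs metrics",
              lambda a: all("score" in e for e in a["experiments"])),
    Criterion(">= 3 structural changes",
              lambda a: sum(e["kind"] == "structural" for e in a["experiments"]) >= 3),
]

# Hypothetical artifacts: 20 experiments, but only 2 structural ones.
artifacts = {
    "baseline_score": 1.0,
    "experiments": [{"kind": "scalar", "score": 1.0}] * 18
    + [{"kind": "structural", "score": 1.1}] * 2,
}

report = grade(artifacts, rubric)
print(report, "-> may stop:", may_stop(report))
```

Here the run cannot finish on scalar tuning alone, because the structural-change criterion still fails.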

Results: Fable 5 Achieves 6x Improvement

| Metric | Fable 5 | Opus 4.7 | Advantage |
|:---|:---|:---|:---|
| Pipeline Improvement | ~6x over baseline | ~1x over baseline | 6x better |
| Structural Changes | 7 major architecture edits | 1 initial change | More ambitious |
| Resilience | Pushed through quantization regression | Stuck after early win | Recovered from failures |
| Exploration Strategy | Diverse experiments | Incremental scalar tuning | Broader search |

Behavioral Analysis

Fable 5 Approach:

Iteration 1: Run baseline (GPT-2 small)
  → Score: 65% accuracy

Iteration 2-5: Structural experiments
  → Try MoE architecture: Failed (too large)
  → Try smaller embedding dimension: +2%
  → Try quantization (int8): -5% initially
  → Push through quantization regression with calibration: +8%

Iteration 6-15: Hybrid structural + scalar
  → Adjust learning rate with new architecture: +3%
  → Try different optimizer (AdamW → Lion): +4%
  → Combine best structural changes: +12%

Final: roughly 6x the improvement over baseline that Opus 4.7 reached

Opus 4.7 Approach:

Iteration 1: Run baseline
  → Score: 65% accuracy

Iteration 2: Try larger learning rate
  → Score: 70% (+5%)
  → Small win!

Iteration 3-20: Incremental scalar tuning
  → Adjust learning rate schedule: +1%
  → Adjust batch size: +0.5%
  → Adjust weight decay: -0.5% (rejected)
  → Adjust dropout: +0.5%
  ... repeat similar adjustments ...

Final: 65% → 78% accuracy (1.2x baseline)

Key Difference:

  • Fable 5: Made bold structural bets, recovered from failures, compounded gains
  • Opus 4.7: Found early win with scalars, then stayed in that template

Why Fable 5 Won

1. Resilience Through Failure

Fable 5 Quantization Regression:
Initial: 67% accuracy
After quantization: 62% (-5%)

Opus 4.7 would have: Rejected quantization, moved on

Fable 5 response:
"Quantization decreased accuracy. Investigating calibration..."
→ Adds post-training quantization calibration
→ Adjusts quantization-aware training schedule
→ Result: 75% accuracy (+13 points over the post-quantization 62%, +8 over the pre-quantization 67%)
→ Biggest single win

2. Structural Exploration

# Fable 5's diverse experiments
experiments = [
    "Try Mixture-of-Experts architecture",
    "Reduce embedding dimension by 50%",
    "Implement rotary position embeddings (RoPE)",
    "Use grouped query attention (GQA)",
    "Apply knowledge distillation from larger model",
    "Quantize to int8 with calibration",
    "Experiment with sparse attention patterns"
]

# Opus 4.7's incremental changes
experiments = [
    "Increase learning rate from 1e-4 to 2e-4",
    "Adjust learning rate to 1.8e-4",
    "Try batch size 64 instead of 32",
    "Adjust weight decay from 0.01 to 0.02",
    "Use cosine schedule instead of linear"
]

3. Compound Gains

Fable 5 stacked wins:
Smaller embeddings (+2%) +
Quantization with calibration (+8%) +
Better optimizer (+4%) +
RoPE instead of learned PE (+3%) =
+17% compound improvement

Opus 4.7 isolated wins:
Learning rate (+5%) [stopped exploring]

Why Verifier Sub-Agents Beat Self-Critique

The Self-Critique Problem

Anthropic Research Finding: Models struggle to critique their own outputs accurately.

Why Self-Critique Fails:

# Model generates code
model_output = """
def calculate_discount(price, discount_pct):
    return price - (price * discount_pct)
"""

# Same model self-critiques
self_critique = """
This code looks correct:
- Calculates discount amount properly ✓
- Returns final price after discount ✓
- No obvious bugs ✓
"""

# Actual bug: discount_pct should be divided by 100
# Self-critique missed it because model has bias toward its own output
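
An executable check makes the same point without relying on the model's judgment. The sketch below assumes the caller passes `discount_pct` as a percentage (20 for 20%), as the bug description implies; `calculate_discount_fixed` and `check` are illustrative names added here.

```python
def calculate_discount(price, discount_pct):
    # Original snippet: silently treats discount_pct as a fraction (0.2), not a percentage (20).
    return price - (price * discount_pct)


def calculate_discount_fixed(price, discount_pct):
    # Divide by 100 so that discount_pct=20 means "20% off".
    return price - (price * discount_pct / 100)


def check(fn) -> bool:
    """Environment feedback: 20% off 100.00 must come to 80.00."""
    return abs(fn(100.0, 20) - 80.0) < 1e-9


print("original passes:", check(calculate_discount))       # False: returns -1900.0
print("fixed passes:", check(calculate_discount_fixed))    # True: returns 80.0
```

The check has no access to the reasoning that produced the code, so it cannot inherit the assumption that caused the bug.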

Cognitive Bias in Self-Evaluation:

  • Confirmation Bias: Model looks for reasons why its code is correct
  • Blind Spots: Assumptions model made during generation carry to evaluation
  • Context Anchoring: Evaluation is anchored to generation reasoning

The Verifier Sub-Agent Solution

Independent Grading:

# Main agent generates solution
main_agent.task = "Implement user authentication API"
solution = main_agent.execute()

# Verifier sub-agent evaluates independently
verifier_agent = spawn_verifier(solution, rubric)
evaluation = verifier_agent.grade(solution)

# Verifier has NO knowledge of main agent's reasoning
# Starts fresh, checks against rubric objectively

Claude Managed Agents Implementation:

# CMA automatically spawns grader sub-agent for Outcomes

outcomes_config = {
    "rubric": rubric_file,
    "grader": "independent"  # Spawns separate agent
}

# Main agent works
main_agent.run(task)

# After each major step, separate grader agent:
# 1. Reads current state (code, tests, logs)
# 2. Reads rubric criteria
# 3. Evaluates WITHOUT seeing main agent's reasoning
# 4. Returns pass/fail for each criterion
# 5. Main agent receives feedback and continues

Performance Improvement:

| Evaluation Method | Accuracy | False Positives |
|:---|:---|:---|
| Self-Critique | 62% | 28% |
| Verifier Sub-Agent | 89% | 7% |

Based on Anthropic's internal evaluations on code correctness tasks

Prithvi Rajasekaran's Analysis

From Anthropic's engineering blog:

"We tested self-critique versus independent verification across hundreds of coding tasks. Models consistently overrated their own outputs by 20-30 percentage points. The verifier sub-agent, having no stake in the original solution, identified failure modes the main agent overlooked."

Example from Production:

Task: Generate SQL migration for new features table

Main Agent Output:
```sql
ALTER TABLE features ADD COLUMN enabled BOOLEAN DEFAULT true;
ALTER TABLE features ADD COLUMN config JSON;
```

Self-Critique: "Migration looks good ✓

  • Adds required columns
  • Sets sensible defaults
  • Uses appropriate data types"

Verifier Sub-Agent: "Migration has issues ✗

  • Missing NOT NULL constraints on enabled
  • JSON type not supported in all MySQL versions (need LONGTEXT)
  • No index on enabled column (will slow feature flag queries)
  • No down migration provided"

Result: Verifier caught 4 issues self-critique missed.


Case Study 2: Continual Learning Bench 1.0

What is Continual Learning Bench?

Released by Parth Asawa and team at Anthropic, Continual Learning Bench 1.0 (https://continuallearningbench.ai) is the first realistic benchmark for measuring how AI systems improve in online settings.

Core Premise:

  • Most benchmarks assume models are stateless
  • Real-world agents should learn across sessions
  • Memory enables continual improvement

Example Task: Sequential SQL Database Questions

Session 1: "What is the total revenue for Q1 2024?"
  → Agent queries database
  → Gets error: column 'prc' not found
  → Stores: "prc column doesn't exist, try prc_usd"

Session 2: "What is the average order value in March 2024?"
  → Agent recalls: use prc_usd instead of prc
  → Successfully queries database
  → Correct answer

Session 3-30: Additional questions...
  → Agent builds knowledge base of schema details
  → Performance improves over time


The Memory Progression

Effective memory use follows a progression:

1. Fail → Make a mistake and document it

Session 1 attempt:
SELECT SUM(revenue) FROM orders WHERE quarter = 'Q1'
Error: column 'quarter' doesn't exist

Memory note: "Tried 'quarter' column in orders table, doesn't exist"

2. Investigate → Before moving on, figure out why

Session 1 continued:
"Why did quarter column fail? Let me examine schema."
DESCRIBE orders;
→ Finds: date column (DATE type), not quarter

Memory note (updated): "Orders table uses 'date' column (DATE), not 'quarter'. Need to extract quarter from date."

3. Verify → Turn diagnosis into checked fact

Session 1 verification:
SELECT date FROM orders LIMIT 1;
→ Returns: 2024-03-15
→ Confirms: date is DATE type, formatted YYYY-MM-DD

Memory note (updated): "VERIFIED: orders.date is DATE type (YYYY-MM-DD format). To get quarter: QUARTER(date) function."

4. Distill → Turn verification into general rule

Session 1 distillation:
Memory rule created: "When filtering by time period in orders table:

  • Use date column (not quarter, month, year columns)
  • Extract quarters: QUARTER(date)
  • Extract months: MONTH(date)
  • Filter ranges: date BETWEEN 'start' AND 'end'
  • Verified in Session 1"

5. Consult → Read the rule instead of re-deriving

Session 2:
Task: "What is average order value in March 2024?"

Agent thinks: "Need to filter orders by month. Check memory rules..."
→ Finds: "Extract months: MONTH(date)"
→ Applies directly:

SELECT AVG(prc_usd) FROM orders WHERE MONTH(date) = 3 AND YEAR(date) = 2024;

→ Correct on first try (no re-learning needed)
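
The five stages can be modeled as a small state machine over memory notes. This is a sketch under stated assumptions rather than anything the benchmark or Claude prescribes: `Stage`, `MemoryNote`, and `consult` are hypothetical names, and the rule that only distilled notes are reusable is a design choice made for this example.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Stage(IntEnum):
    FAIL = 1
    INVESTIGATE = 2
    VERIFY = 3
    DISTILL = 4
    CONSULT = 5


@dataclass
class MemoryNote:
    topic: str
    text: str
    stage: Stage = Stage.FAIL
    history: list = field(default_factory=list)

    def advance(self, new_text: str) -> None:
        """Move one stage forward, keeping the earlier wording for audit."""
        if self.stage is Stage.CONSULT:
            return
        self.history.append((self.stage, self.text))
        self.stage = Stage(self.stage + 1)
        self.text = new_text


def consult(notes: list, topic: str) -> list:
    """Return distilled rules for a topic; raw guesses are never reused."""
    rules = []
    for note in notes:
        if note.topic == topic and note.stage >= Stage.DISTILL:
            note.stage = Stage.CONSULT
            rules.append(note.text)
    return rules


note = MemoryNote("orders.time_filter", "Tried 'quarter' column in orders table, doesn't exist")
note.advance("orders uses 'date' (DATE), not 'quarter'")                  # investigate
note.advance("VERIFIED: orders.date is DATE (YYYY-MM-DD)")                # verify
note.advance("Filter orders by time with QUARTER(date) / MONTH(date)")    # distill

guess = MemoryNote("orders.price", "Maybe prc_usd is in cents?")          # still at FAIL

rules = consult([note, guess], "orders.time_filter")
price_rules = consult([note, guess], "orders.price")
print(rules, price_rules)
```

Keeping the stage on each note makes it easy to see where a memory file stalls: a file full of `FAIL`-stage notes is a list of guesses, not reusable knowledge.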


Experimental Results: Fable 5 vs Earlier Models

Test Setup:

  • Benchmark: Continual Learning Bench 1.0, SQL task
  • Format: 30 sequential questions across separate agent sessions
  • Memory: Shared filesystem across sessions
  • Models: Sonnet 4.6, Opus 4.7, Fable 5

Results:

| Model | Progression Reached | Verification Coverage | Success Rate (First 10) | Success Rate (Last 10) |
|:---|:---|:---|:---|:---|
| **Sonnet 4.6** | Fail (Step 1) | N/A | 40% | 45% |
| **Opus 4.7** | Verify (Step 3) | 7-33% (median 17%) | 55% | 68% |
| **Fable 5** | Consult (Step 5) | 60-73% (median 66%) | 60% | 91% |

Behavioral Analysis

Sonnet 4.6: Exits at Step 1 (Fail)

Memory contents after 30 sessions:

Session 1: "Tried column 'prc', didn't work. Maybe prc_usd?"
Session 3: "Quarter column missing. Possibly use date?"
Session 7: "Revenue query failed. Check table name?"
Session 12: "Date format unclear. Need to verify."
Session 18: "Aggregate function error. Syntax issue?"
... 15 more failure notes, many duplicates ...

Problems:

  • Creates list of guesses, not verified facts
  • Rarely consults prior notes
  • Re-learns same lessons multiple times
  • No systematic investigation

Performance Curve: Flat (40% → 45%)

Opus 4.7: Exits at Step 3 (Verify)

Memory contents after 30 sessions:
```markdown
## Schema Reference

### orders table
- date: DATE (format: YYYY-MM-DD) [VERIFIED Session 2]
- prc_usd: DECIMAL [possibly in cents? Verify]
- customer_id: INT [VERIFIED Session 5]
- status: VARCHAR [possible values: pending, complete, cancelled - NOT VERIFIED]

### customers table
- id: INT PRIMARY KEY [VERIFIED Session 3]
- name: VARCHAR
- region: VARCHAR [VERIFIED Session 8]
```

Problems:

  • Verification coverage low (17% median)
  • Many uncertainties flagged but not resolved
  • Doesn't create general rules (stays at schema level)
  • Consults memory inconsistently

Performance Curve: Improving (55% → 68%)
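
A rough way to audit a memory file like the one above is to count how many entries carry a `[VERIFIED ...]` tag. The sketch below is a file-level proxy only: the post does not define how the benchmark computes "verification coverage", so `verification_share` is a hypothetical helper and its output should not be compared with the table's figures.

```python
import re


def verification_share(memory_markdown: str) -> float:
    """Fraction of bullet entries tagged [VERIFIED ...] (case-insensitive)."""
    entries = [ln for ln in memory_markdown.splitlines() if ln.lstrip().startswith("- ")]
    if not entries:
        return 0.0
    # Match only tags that START with VERIFIED, so "... - NOT VERIFIED]" is not counted.
    verified = [ln for ln in entries if re.search(r"\[VERIFIED\b", ln, re.IGNORECASE)]
    return len(verified) / len(entries)


opus_memory = """\
### orders table
- date: DATE (format: YYYY-MM-DD) [VERIFIED Session 2]
- prc_usd: DECIMAL [possibly in cents? Verify]
- customer_id: INT [VERIFIED Session 5]
- status: VARCHAR [possible values: pending, complete, cancelled - NOT VERIFIED]

### customers table
- id: INT PRIMARY KEY [VERIFIED Session 3]
- name: VARCHAR
- region: VARCHAR [VERIFIED Session 8]
"""

share = verification_share(opus_memory)
print(f"{share:.0%} of entries are verified")
```

Entries that fail the check are the open questions to resolve before the next session relies on them.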

Fable 5: Completes Progression (Consult)

Memory contents after 30 sessions:

## Verified Schema Facts

### orders
- date: DATE (YYYY-MM-DD) [VERIFIED S2]
- prc_usd: DECIMAL(10,2) in dollars (NOT cents) [VERIFIED S4]
- customer_id: INT FK → customers.id [VERIFIED S5]
- status: VARCHAR CHECK('pending','complete','cancelled') [VERIFIED S7]

## General Rules

### Time Filtering
When filtering orders by time period:
1. Use: QUARTER(date), MONTH(date), YEAR(date)
2. Range: date BETWEEN 'YYYY-MM-DD' AND 'YYYY-MM-DD'
3. Verified: Sessions 2, 6, 11, 15, 22

### Revenue Calculations
1. Base revenue: SUM(prc_usd) [values already in dollars]
2. Growth: (current - previous) / previous * 100
3. Average: AVG(prc_usd) not MEDIAN (no MEDIAN in MySQL)
4. Verified: Sessions 4, 9, 13, 18, 24

### Join Patterns
orders → customers: orders.customer_id = customers.id
→ Verified: Sessions 5, 10, 14, 19, 26
→ Index available on orders.customer_id (fast joins)

## Common Pitfalls Solved
1. ~~Don't use prc column~~ → Use prc_usd [S1-S3]
2. ~~Don't calculate cents→dollars~~ → Already in dollars [S4]
3. ~~Don't use MEDIAN~~ → MySQL lacks it, use AVG [S9]

Advantages:

  • 73% verification coverage (22 of 30 questions)
  • Distills learnings into general rules
  • Consistently consults memory before querying
  • Builds on prior sessions systematically

Performance Curve: Strong improvement (60% → 91%)


Rubric Design: The Critical Skill

The Rubric Paradox

Steve's Insight (from X thread):

"A well-designed rubric is doing more work than the model. Fable self-correcting only matters if the feedback in the environment is honest. Garbage rubric + great model = a confidently wrong loop. Rubric design is the skill now, the model is the easy part."

The Problem:

# Bad rubric
rubric = """
1. Code should be good
2. Tests should pass
3. Performance should be acceptable
"""

# Fable 5 result:
# ✓ All criteria met! (but code is mediocre)
# Why? Rubric provides no honest feedback

The Solution:

# Good rubric
rubric = """
1. Code passes all existing tests (run: pytest -v)
2. Code adds no TypeScript errors (run: tsc --strict)
3. API response time < 200ms (measured via: wrk -t4 -c100 -d30s)
4. Memory usage < 512MB under load (measured via: /usr/bin/time -v)
5. Code coverage > 85% (run: pytest --cov --cov-report=term)
6. All functions have type hints (run: mypy --strict)
7. Linter passes (run: ruff check .)
8. No security issues (run: bandit -r src/)
"""

# Fable 5 result:
# Honest feedback on each criterion
# Self-corrects based on objective measurements
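
A rubric written as commands can be graded mechanically: run each command and treat exit status 0 as a pass. The sketch below assumes that convention; `run_rubric` is a hypothetical helper, and the demo commands are stand-ins built from the current Python interpreter so the example runs without pytest, ruff, or other tools installed.

```python
import subprocess
import sys


def run_rubric(checks: dict[str, list[str]], timeout_s: int = 600) -> dict[str, bool]:
    """Run each command; a criterion passes only if its command exits with status 0."""
    report = {}
    for name, argv in checks.items():
        try:
            proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
            report[name] = proc.returncode == 0
        except (OSError, subprocess.TimeoutExpired):
            # A missing tool or a hung command is a failure, never a silent pass.
            report[name] = False
    return report


# Stand-in commands; in a real project these might be ["pytest", "-v"] or ["ruff", "check", "."].
demo_checks = {
    "tests pass": [sys.executable, "-c", "raise SystemExit(0)"],
    "linter passes": [sys.executable, "-c", "raise SystemExit(1)"],
    "tool installed": ["definitely-not-a-real-tool-xyz"],
}

report = run_rubric(demo_checks)
for name, passed in report.items():
    print(f"{'PASS' if passed else 'FAIL'}  {name}")
```

Threshold criteria need a command that converts the threshold into an exit code; pytest-cov's `--cov-fail-under` option does this for coverage, for example.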

Principles of Good Rubrics

1. Checkable via Code/Commands

Bad:

"Database queries should be fast"

Good:

"All database queries complete in <50ms
Measure: EXPLAIN ANALYZE each query
Verify: max_exec_time < 50ms per statement in pg_stat_statements (PostgreSQL 13+; the view reports min/mean/max/stddev, not percentiles)"

2. Incremental Validation

Bad:

"Complete entire feature and pass all tests"

Good:

Step 1: Database schema migration runs without errors
Step 2: API endpoint returns 200 status for valid requests
Step 3: API endpoint returns 400 for invalid requests with helpful error
Step 4: Integration tests pass (run: pytest tests/integration/)
Step 5: Load test handles 100 RPS (run: locust -f loadtest.py)

3. Avoid Vague Judgments

Bad:

"Code should follow best practices"

Good:

Code quality checks:
- Cyclomatic complexity < 10 per function (run: radon cc src/)
- No duplicate code blocks > 6 lines (run: jscpd src/)
- All public functions have docstrings (run: pydocstyle src/)
- No TODO or FIXME comments in committed code (run: grep -r "TODO\|FIXME" src/)

4. Environment Provides Feedback

Bad:

"Implementation should be correct"

Good:

Correctness verified by:
1. Unit tests: 47 tests pass (run: pytest tests/unit/)
2. Property tests: No failures in 1000 trials (run: pytest tests/properties/)
3. Integration tests: API returns expected responses (run: pytest tests/integration/)
4. End-to-end test: User flow completes successfully (run: playwright test)

Example: ML Training Rubric (Parameter Golf)

# Parameter Golf Rubric

## Required Experiments (Checkable)

### 1. Baseline
- [ ] Train baseline GPT-2 model
- [ ] Log baseline metrics (loss, accuracy, params, training time)
- [ ] Baseline completes in < 10 minutes
- [ ] Model artifact < 16MB

Check: `ls -lh models/baseline.pt` shows < 16MB

### 2. Exploration (20 experiments minimum)

Structural experiments (at least 3):
- [ ] Experiment with MoE architecture
- [ ] Experiment with different embedding dimensions
- [ ] Experiment with attention mechanisms (MQA, GQA, MHA)
- [ ] Experiment with quantization (int8, int4)

Scalar experiments (at least 10):
- [ ] Learning rate variations (3 different values)
- [ ] Batch size variations (3 different values)
- [ ] Optimizer choices (AdamW, Lion, SGD)
- [ ] Learning rate schedules (cosine, linear, constant)

Check: `wc -l experiments.log` shows >= 20 lines

### 3. Measurement
- [ ] Each experiment logs: loss, accuracy, params, time
- [ ] Results stored in structured format (CSV/JSON)
- [ ] Can reproduce any experiment from logs

Check: `jq '. | length' experiments.json` shows >= 20

### 4. Analysis
- [ ] Document what worked and why
- [ ] Document what failed and lessons learned
- [ ] Identify best model and reasoning

Check: `cat analysis.md` has sections for successes, failures, insights

### 5. Improvement
- [ ] Best model improves on baseline by >= X%
- [ ] All constraints still satisfied (< 16MB, < 10min)

Check: Compare `best_model_accuracy` vs `baseline_accuracy`

## Grading

Pass if:
- All checkable criteria satisfied
- Improvement > threshold
- No constraint violations

Fail if:
- Any experiment violates constraints
- < 20 experiments attempted
- Missing logs or analysis

Rubric Anti-Patterns

1. Subjective Criteria

❌ "Code should be elegant and maintainable"
✅ "No function exceeds 50 statements (run: pylint --disable=all --enable=too-many-statements --max-statements=50 src/)"

2. Unmeasurable Goals

❌ "System should scale well"
✅ "System handles 10,000 concurrent users with p95 latency < 500ms"

3. Missing Stop Conditions

❌ "Keep optimizing until performance is good"
✅ "Optimize until p95 latency < 200ms OR 20 experiments attempted"

4. No Incremental Checkpoints

❌ "Complete entire system and deploy to production"
✅ "Complete in order:
     1. Local tests pass
     2. Staging deployment succeeds
     3. Smoke tests pass in staging
     4. Load tests pass in staging
     5. Production deployment (with rollback plan)"
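
The stop condition from anti-pattern 3 ("p95 latency < 200ms OR 20 experiments attempted") is easy to make executable. The sketch below uses the nearest-rank percentile definition, which is one of several common conventions; `p95` and `should_stop` are illustrative names.

```python
def p95(samples_ms):
    """Nearest-rank 95th percentile of a non-empty sequence of latencies."""
    ordered = sorted(samples_ms)
    if not ordered:
        raise ValueError("need at least one latency sample")
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def should_stop(samples_ms, experiments_done, target_ms=200.0, max_experiments=20):
    """Stop when the goal is met OR the experiment budget is spent."""
    if p95(samples_ms) < target_ms:
        return True, "goal met"
    if experiments_done >= max_experiments:
        return True, "budget exhausted"
    return False, "continue"


fast = list(range(1, 101))   # 1..100 ms
slow = [300] * 10            # every request takes 300 ms

print(should_stop(fast, experiments_done=3))
print(should_stop(slow, experiments_done=3))
print(should_stop(slow, experiments_done=20))
```

Returning the reason alongside the decision lets the final report distinguish a loop that succeeded from one that simply ran out of budget.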

Practical Implementation Guide

Setting Up Self-Correction Loops

Option 1: Claude Code with /goal

# Start Claude Code session
claude

# Define goal with specific success criteria
/goal "Refactor src/api/ to use async/await throughout
Success criteria:
- All API routes use async/await (no callbacks)
- All tests pass (npm test)
- No new TypeScript errors (npm run type-check)
- API response times improve by >= 10%
Stop when: All criteria met OR 10 refactor iterations attempted"

# Claude will run autonomous loop
# You can monitor progress and interrupt if needed

Option 2: Claude Managed Agents with Outcomes

# File: train_model_task.py
from anthropic import ManagedAgent

# Define rubric
rubric = """
## ML Training Optimization

Success criteria:
1. Baseline model trained and metrics logged
2. At least 15 experimental variations attempted
3. Best model achieves >= 90% validation accuracy
4. Training time < 30 minutes per experiment
5. All experiments documented with reasoning
6. Final model saved and reproducible
"""

# Create managed agent
agent = ManagedAgent(
    model="claude-fable-5",
    task="Optimize ML training pipeline for MNIST classification",
    rubric=rubric,
    max_iterations=25,
    timeout_hours=8,
    resources={
        "gpus": "2xA100",
        "memory_gb": 64
    },
    memory_enabled=True  # Enable cross-session memory
)

# Run agent (returns when Outcomes satisfied or timeout)
result = agent.run()

print(f"Task completed: {result.success}")
print(f"Iterations used: {result.iterations}")
print(f"Final metrics: {result.metrics}")

Implementing Verifier Sub-Agents

Manual Implementation (for custom harnesses):

# File: verifier_pattern.py
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def run_with_verification(task, rubric):
    """Implements verifier sub-agent pattern"""

    # Main agent works on task
    main_response = client.messages.create(
        model="claude-fable-5",
        max_tokens=8000,
        messages=[{
            "role": "user",
            "content": f"Complete this task:\n\n{task}"
        }]
    )

    solution = main_response.content[0].text

    # Verifier sub-agent evaluates (independent context)
    verifier_response = client.messages.create(
        model="claude-fable-5",  # Can use same or different model
        max_tokens=2000,
        messages=[{
            "role": "user",
            "content": f"""Evaluate this solution against the rubric.

Rubric:
{rubric}

Solution to evaluate:
{solution}

For each rubric criterion:
1. State criterion
2. Check if solution satisfies it
3. Provide evidence (quote relevant parts)
4. Mark as PASS or FAIL

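Finish with a single final line that reads exactly
Final verdict: Overall PASS
or
Final verdict: Overall FAIL"""
        }]
    )

    evaluation = verifier_response.content[0].text

    # Read the verdict from the last non-empty line only, so a "PASS" quoted
    # earlier in the evaluation cannot flip the result. Anything else fails closed.
    lines = [line.strip() for line in evaluation.splitlines() if line.strip()]
    verdict_line = lines[-1] if lines else ""

    return {
        "solution": solution,
        "evaluation": evaluation,
        "passed": verdict_line == "Final verdict: Overall PASS"
    }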

# Usage
task = "Implement binary search in Python with full test coverage"
rubric = """
1. Function signature: binary_search(arr, target) -> int
2. Returns index if found, -1 if not found
3. Handles empty arrays
4. Handles single-element arrays
5. Test coverage >= 95%
6. All tests pass
"""

result = run_with_verification(task, rubric)
if result["passed"]:
    print("Solution accepted!")
else:
    print("Solution needs revision:", result["evaluation"])

Memory Management Best Practices

1. Structured Memory Files

# File: .claude/memory/sql_schema_facts.md

## Verified Schema Details

Last updated: Session 23

### orders table
- id: INT PRIMARY KEY AUTO_INCREMENT [Verified S1]
- date: DATE (YYYY-MM-DD format) [Verified S2]
- prc_usd: DECIMAL(10,2) - prices in dollars, NOT cents [Verified S4]
- customer_id: INT - FK to customers.id [Verified S5]
- status: ENUM('pending','processing','shipped','delivered') [Verified S7]

### customers table
- id: INT PRIMARY KEY AUTO_INCREMENT [Verified S3]
- email: VARCHAR(255) UNIQUE [Verified S6]
- created_at: TIMESTAMP DEFAULT CURRENT_TIMESTAMP [Verified S8]

## Query Patterns

### Filtering by date
✅ CORRECT: WHERE MONTH(date) = 3 AND YEAR(date) = 2024
❌ WRONG: WHERE quarter = 'Q1' (column doesn't exist)

### Revenue calculations
✅ CORRECT: SUM(prc_usd) -- already in dollars
❌ WRONG: SUM(prc_usd) / 100 -- don't divide, not in cents

### Joins
✅ CORRECT:
FROM orders o
JOIN customers c ON o.customer_id = c.id

Performance: Index exists on orders.customer_id [Verified S10]

2. Session Templates

# File: session_template.py
import os
from datetime import datetime

def start_session(session_num):
    """Initialize new session with memory access"""

    memory_dir = ".claude/memory"
    os.makedirs(memory_dir, exist_ok=True)

    # Read previous learnings
    facts_file = f"{memory_dir}/facts.md"
    if os.path.exists(facts_file):
        with open(facts_file) as f:
            previous_facts = f.read()
    else:
        previous_facts = "No prior facts"

    # Create session log
    session_log = f"{memory_dir}/session_{session_num}.md"
    with open(session_log, "w") as f:
        f.write(f"# Session {session_num}\n")
        f.write(f"Started: {datetime.now()}\n\n")
        f.write("## Prior Knowledge\n")
        f.write(previous_facts + "\n\n")
        f.write("## New Discoveries\n\n")

    return session_log

def end_session(session_num, new_facts):
    """Update memory with session learnings"""

    memory_dir = ".claude/memory"
    facts_file = f"{memory_dir}/facts.md"

    # Append new facts
    with open(facts_file, "a") as f:
        f.write(f"\n## Session {session_num} ({datetime.now().date()})\n")
        f.write(new_facts + "\n")

3. Memory Retrieval

# In agent prompt
system_prompt = f"""You are working on a multi-session task.

MEMORY ACCESS:
Before answering, check your memory files in .claude/memory/:
- facts.md: Verified facts from previous sessions
- patterns.md: Successful approaches and patterns
- failures.md: Things that didn't work (avoid repeating)

MEMORY UPDATE:
After each significant discovery:
1. Verify it's actually correct (run tests, check docs)
2. Add to appropriate memory file with session number
3. Mark as [VERIFIED] or [UNVERIFIED]

Use memory to avoid re-learning same lessons.
"""

Advanced Patterns

1. Hierarchical Loops (Loop-of-Loops)

# Outer loop: Daily maintenance (pseudocode: /goal is a Claude Code command, not Python syntax)
while True:
    # Inner loop 1: Monitor and fix build
    /goal "Ensure main branch builds successfully"

    # Inner loop 2: Review PRs
    /goal "Review open PRs, approve or request changes"

    # Inner loop 3: Update dependencies
    /goal "Check for security updates, apply if safe"

    time.sleep(24 * 3600)  # Run daily

2. Parallel Verification

# Multiple verifiers for different aspects
verifiers = [
    ("security", security_rubric),
    ("performance", performance_rubric),
    ("correctness", correctness_rubric)
]

evaluations = await asyncio.gather(*[
    verify_async(solution, rubric)
    for name, rubric in verifiers
])

overall_pass = all(e["passed"] for e in evaluations)
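
A runnable version of the parallel pattern needs a concrete `verify_async`. The sketch below substitutes a stand-in verifier that only checks whether required phrases appear in the solution; a real verifier would call a model or run tools at that point. The rubrics, the sample solution, and `verify_all` are all hypothetical.

```python
import asyncio


async def verify_async(name, solution, rubric):
    """Stand-in verifier: passes if every required phrase appears in the solution."""
    await asyncio.sleep(0)  # placeholder for a real model or tool call
    missing = [item for item in rubric if item not in solution]
    return {"name": name, "passed": not missing, "missing": missing}


async def verify_all(solution, verifiers):
    """Run all verifiers concurrently and combine their verdicts."""
    results = await asyncio.gather(
        *(verify_async(name, solution, rubric) for name, rubric in verifiers)
    )
    return {"passed": all(r["passed"] for r in results), "results": results}


solution = "def login(user): validate(user); rate_limit(user); return token(user)"
verifiers = [
    ("security", ["validate(", "rate_limit("]),
    ("correctness", ["return token("]),
    ("performance", ["cache("]),
]

summary = asyncio.run(verify_all(solution, verifiers))
for r in summary["results"]:
    print(r["name"], "PASS" if r["passed"] else f"FAIL (missing {r['missing']})")
print("overall:", summary["passed"])
```

Because `asyncio.gather` preserves input order, each result can be matched back to the verifier that produced it.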

3. Progressive Rubric Tightening

# Start with loose rubric, tighten over iterations
rubrics = [
    "Response time < 1000ms",  # Iteration 1-5
    "Response time < 500ms",   # Iteration 6-10
    "Response time < 200ms",   # Iteration 11-15
    "Response time < 100ms"    # Iteration 16+
]

for i, rubric in enumerate(rubrics):
    /goal f"Optimize API performance: {rubric}"
    # Agent has 5 iterations per rubric level
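
The schedule in the comments (five iterations per level, with the last level applying from iteration 16 onward) can be written as a small lookup. `threshold_for_iteration` is a hypothetical helper; the thresholds are the ones listed above.

```python
def threshold_for_iteration(iteration, thresholds_ms=(1000, 500, 200, 100), per_level=5):
    """Map a 1-based iteration number to the latency threshold in force."""
    if iteration < 1:
        raise ValueError("iterations are 1-based")
    # Clamp to the last level so the tightest threshold applies to all later iterations.
    level = min((iteration - 1) // per_level, len(thresholds_ms) - 1)
    return thresholds_ms[level]


for i in (1, 5, 6, 11, 16, 40):
    print(f"iteration {i}: response time < {threshold_for_iteration(i)}ms")
```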

Limitations and Caveats

1. Rubric Quality is Paramount

Problem:

rubric = "Make the code better"

# Result: Infinite loop of cosmetic changes
# Agent thinks it's improving, rubric can't say no

Solution: Invest time in rubric design upfront

2. Cost Accumulation

Long loops can be expensive:

8-hour Parameter Golf run (prices read as per 1K tokens):
- Input: ~500K tokens (repeated context)
- Output: ~2M tokens (code + analysis)
- Cost: (500K / 1K) × $0.01 + (2M / 1K) × $0.05 = $5 + $100 = $105

20 experiments over 2 days:
- Total cost: 20 × $105 = $2,100+

Mitigation: Set budget limits, use cheaper models for verification
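
The arithmetic above, plus a simple budget guard, might look like the sketch below. It assumes the quoted prices are per 1K tokens, which is the reading that reproduces the $105 figure; the prices are the post's example numbers, not published rates, and both function names are hypothetical.

```python
def run_cost_usd(input_tokens, output_tokens, input_price_per_1k, output_price_per_1k):
    """Cost of one run, with prices quoted per 1,000 tokens."""
    return (input_tokens / 1000) * input_price_per_1k + (output_tokens / 1000) * output_price_per_1k


def exceeds_budget(spent_usd, next_run_usd, budget_usd):
    """True if starting the next run would push total spend past the budget."""
    return spent_usd + next_run_usd > budget_usd


cost = run_cost_usd(500_000, 2_000_000, input_price_per_1k=0.01, output_price_per_1k=0.05)
print(f"one run: ${cost:.2f}")
print(f"20 runs: ${20 * cost:,.2f}")
print("second run fits a $200 budget:", not exceeds_budget(cost, cost, 200))
```

Checking the budget before launching each run, rather than after, keeps the loop from overshooting by a full run.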

3. False Convergence

Problem: Agent satisfies rubric but solution is wrong

rubric = "Tests pass"

# Agent writes trivial tests that always pass
# Rubric technically satisfied

Solution: Include test quality criteria in rubric
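
One test-quality criterion borrows from mutation testing: a suite only counts if it passes the real implementation and fails a deliberately broken one. The sketch below is a hand-rolled illustration with a single hard-coded mutant, not a mutation-testing tool; all names are hypothetical.

```python
def add(a, b):
    return a + b


def broken_add(a, b):
    return a - b  # deliberate mutant: wrong operator


def trivial_suite(fn):
    """Always passes, whatever fn does (the false-convergence case)."""
    return True


def real_suite(fn):
    """Checks actual behavior."""
    return fn(2, 3) == 5 and fn(-1, 1) == 0


def suite_is_meaningful(suite, good, mutant):
    """A suite is useful only if it accepts the good code and rejects the mutant."""
    return suite(good) and not suite(mutant)


print("trivial suite meaningful:", suite_is_meaningful(trivial_suite, add, broken_add))
print("real suite meaningful:", suite_is_meaningful(real_suite, add, broken_add))
```

A rubric line such as "tests must fail against at least one seeded bug" gives the grader something to check beyond "tests pass".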

4. Memory Bloat

Problem: Memory files grow unbounded over sessions

Solution: Implement memory compaction/distillation

# After every 10 sessions
if session_num % 10 == 0:
    /goal "Distill memory files:
    - Merge duplicate facts
    - Remove obsolete information
    - Keep only high-value patterns
    - Compress to < 50KB total"
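
The "merge duplicate facts" step can also be done deterministically before handing the file to the model. The sketch below treats two lines as duplicates when they match after removing `[...]` tags, case, and extra whitespace, and prefers a verified copy; that matching rule and the name `compact_facts` are assumptions made for this example.

```python
import re


def compact_facts(lines):
    """Drop duplicate facts, keeping first-seen order and preferring [VERIFIED] copies."""
    best = {}    # normalized key -> (is_verified, original line)
    order = []   # keys in first-seen order
    for line in lines:
        verified = bool(re.search(r"\[VERIFIED\b", line, re.IGNORECASE))
        key = re.sub(r"\[[^\]]*\]", "", line)   # strip [..] tags
        key = " ".join(key.lower().split())     # normalize case and whitespace
        if not key:
            continue
        if key not in best:
            order.append(key)
            best[key] = (verified, line.strip())
        elif verified and not best[key][0]:
            best[key] = (verified, line.strip())  # upgrade a guess to its verified copy
    return [best[k][1] for k in order]


notes = [
    "- prc column missing, use prc_usd",
    "- PRC column missing, use prc_usd [VERIFIED S3]",
    "- orders.date is DATE [VERIFIED S2]",
    "- orders.date is DATE [VERIFIED S6]",
]

compacted = compact_facts(notes)
print("\n".join(compacted))
```

Exact-match deduplication only catches restated facts; merging paraphrases or retiring obsolete rules still needs judgment, which is where the distillation goal above comes in.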

When to Use Loop Design vs Direct Prompting

Use Loops When:

Long-Running Tasks (hours to days)

  • ML training optimization
  • Codebase-wide refactors
  • Multi-day research projects

Objective Success Criteria

  • Tests must pass
  • Performance must hit threshold
  • Build must succeed

Iterative Improvement

  • Parameter tuning
  • Performance optimization
  • Incremental feature development

Multi-Session Learning

  • Customer support patterns
  • Codebase knowledge building
  • Procedural skill improvement

Use Direct Prompting When:

⚠️ Short, Well-Defined Tasks (minutes)

  • Write single function
  • Format code snippet
  • Answer specific question

⚠️ Subjective Outcomes

  • Creative writing
  • Design decisions
  • Brainstorming

⚠️ One-Shot Needs

  • Quick bug fix
  • Documentation lookup
  • Code explanation

⚠️ Exploratory Work

  • Investigating unfamiliar codebase
  • Researching approach options
  • Prototyping ideas

Getting Started Checklist

Step 1: Choose Your Platform

  • Claude Code (for quick /goal iterations)
  • Claude Managed Agents (for long-running tasks with resources)
  • Custom harness (for specific workflows)

Step 2: Design Your Rubric

  • Define checkable success criteria
  • Include measurement commands
  • Set incremental milestones
  • Add stop conditions

Step 3: Enable Verification

  • Use verifier sub-agents (CMA Outcomes does this automatically)
  • Separate evaluation from generation
  • Check rubric criteria independently

Step 4: Set Up Memory (if multi-session)

  • Create memory directory structure
  • Define memory file templates
  • Implement memory retrieval in prompts
  • Add memory update workflow

Step 5: Test and Iterate

  • Run small-scale test (< 1 hour)
  • Review rubric effectiveness
  • Adjust based on agent behavior
  • Scale up gradually

Sources and References

Primary Sources

Lance Martin's Thread:

  • Original X Thread
  • Published: June 9, 2026
  • Lance Martin, Member of Technical Staff at Anthropic


Lance Martin shared these insights on designing loops with Claude Fable 5 via X on June 9, 2026. They show that Mythos-class models excel at self-correction and memory when given proper environmental feedback through well-designed rubrics, verifier sub-agents, and structured memory systems, with roughly 6x the improvement of Opus 4.7 on Parameter Golf and a steeper learning curve on Continual Learning Bench 1.0.
