Error Propagation in Multi-Agent Systems: Structured Context Over Generic Failures

In a multi-agent system, the coordinator's ability to recover from failures depends entirely on what the subagent tells it about the failure. A generic status message — "search unavailable" or "operation failed" — gives the coordinator no basis for choosing a recovery path. Structured error context is not a nice-to-have; it is the interface contract between subagents and coordinators.

This is the core subject of Domain 5 of the Claude Certified Architect – Foundations exam (Context Management & Reliability, 15% weight). Task Statements 5.3-5.4 test your ability to design error propagation that enables intelligent recovery and to manage context across long multi-agent sessions.

The three anti-patterns

The exam identifies three anti-patterns that appear frequently in agentic system designs:

Anti-pattern 1: Returning empty results as success

json

{
  "status": "success",
  "results": []
}

A failed search that returns an empty result set with a success status looks identical to a search that ran and found nothing. The coordinator cannot distinguish "no results exist" from "the search failed to execute." This produces silent data gaps in synthesized outputs: the coordinator proceeds as if the search ran successfully and creates a report with genuinely missing coverage and no indication that anything is missing.

Anti-pattern 2: Terminating the entire workflow on a single subagent failure

A subagent that raises an unhandled exception or returns an unstructured error forces the coordinator to abort the entire workflow. For a research system with five subagents, one failing PDF parser should not terminate the web search, document analysis, and synthesis passes. Subagents should handle failures locally where possible and return structured failure context when they cannot continue — not propagate exceptions to the coordinator.
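
One way to picture this is a thin wrapper at the subagent boundary that converts any exception into a structured result, so the remaining subagents still run. This is a minimal sketch, not a prescribed implementation: `run_subagent_safely`, the exception-to-category mapping, and the `UNCLASSIFIED` fallback are all hypothetical names introduced here for illustration.

```python
from typing import Any, Callable

# Hypothetical mapping from Python exception types to the article's categories.
EXCEPTION_CATEGORIES: dict[type, tuple[str, bool]] = {
    TimeoutError: ("TRANSIENT", True),
    PermissionError: ("PERMISSION", False),
    ValueError: ("VALIDATION", False),
}


def run_subagent_safely(name: str, task: Callable[[], list[Any]]) -> dict[str, Any]:
    """Run one subagent and always return a structured result instead of raising."""
    try:
        return {"subagent": name, "status": "success", "results": task()}
    except Exception as exc:  # boundary catch so one failure cannot abort the workflow
        # "UNCLASSIFIED" is an assumed fallback, not one of the article's categories.
        category, retryable = EXCEPTION_CATEGORIES.get(
            type(exc), ("UNCLASSIFIED", False)
        )
        return {
            "subagent": name,
            "status": "failure",
            "errorCategory": category,
            "isRetryable": retryable,
            "partialResults": [],
            "coverageGap": f"{name} produced no output: {exc}",
        }


def failing_pdf_parser() -> list[Any]:
    # Stand-in for a subagent whose backing service times out.
    raise TimeoutError("PDF service timed out")


outcomes = [
    run_subagent_safely("web_search", lambda: ["result A"]),
    run_subagent_safely("pdf_parser", failing_pdf_parser),
    run_subagent_safely("document_analysis", lambda: ["result B"]),
]
```

Here the PDF parser fails, but the web search and document analysis results are still available to the coordinator, along with a structured record of what is missing.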

Anti-pattern 3: Generic status strings

json

{ "status": "error", "message": "External service unavailable" }

"External service unavailable" tells the coordinator nothing it can act on. Is this transient? Should it retry? Is there an alternative data source? Does the user need to be notified? Generic strings produce generic responses — the coordinator falls back to a default recovery path regardless of what actually happened.

What structured error context must include (Task Statement 5.3)

Every subagent failure response should include a set of fields that enable coordinator-level decision making:

json

{
  "status": "failure",
  "errorCategory": "RATE_LIMIT",
  "isRetryable": true,
  "attemptedQuery": "Q3 2025 earnings reports for semiconductor companies",
  "partialResults": [
    {
      "source": "internal_docs",
      "confidence": "high",
      "excerpt": "Intel Q3 2025 guidance from internal analysis..."
    }
  ],
  "potentialAlternatives": [
    {
      "approach": "search_internal_docs",
      "rationale": "Internal document store may have relevant historical earnings data"
    },
    {
      "approach": "retry_after_delay",
      "retryAfterSeconds": 30,
      "rationale": "Rate limit resets on 30-second window"
    }
  ]
}

Each field serves a specific coordinator purpose:

errorCategory: Machine-readable type that enables coordinator routing logic without string parsing. A coordinator can switch on error categories to choose recovery paths.
isRetryable: Boolean that prevents unnecessary retries on permanent failures (permission denied, validation errors) and enables appropriate retry on transient failures (rate limits, timeouts).
attemptedQuery: Prevents the coordinator from asking the same subagent to attempt the same query. Without this, a coordinator under context pressure may rediscover the same failed approach.
partialResults: What was retrieved before the failure. The coordinator can include this in synthesis with a coverage gap annotation rather than treating the entire subagent output as void.
potentialAlternatives: Concrete alternative approaches. The coordinator does not need to reason about recovery from scratch — the subagent that ran the operation knows what alternatives exist.
coverageGap: What is missing from this subagent's output. Used in synthesis output to annotate gaps explicitly rather than silently omitting coverage.
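
The fields above could be captured in a small typed container so that every subagent emits the same shape. The following is a sketch under assumptions: the `SubagentFailure` class and its defaults are hypothetical, and the `TRANSIENT` value follows the four-category scheme described in the next section.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class SubagentFailure:
    """Failure payload a subagent returns to the coordinator.

    Field names are camelCase on purpose so they match the JSON keys above.
    """

    errorCategory: str
    isRetryable: bool
    attemptedQuery: str
    partialResults: list[dict[str, Any]] = field(default_factory=list)
    potentialAlternatives: list[dict[str, Any]] = field(default_factory=list)
    coverageGap: str = ""
    status: str = "failure"

    def to_json(self) -> str:
        """Serialize to the JSON string sent back to the coordinator."""
        return json.dumps(asdict(self), indent=2)


failure = SubagentFailure(
    errorCategory="TRANSIENT",
    isRetryable=True,
    attemptedQuery="Q3 2025 earnings reports for semiconductor companies",
    potentialAlternatives=[{"approach": "retry_after_delay", "retryAfterSeconds": 30}],
    coverageGap="External earnings data not retrieved",
)
payload = json.loads(failure.to_json())
```

Making the three decision fields required and the rest optional is a design choice for this sketch: a subagent cannot construct a failure without stating its category, whether it is retryable, and what was attempted.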

Error categories and their recovery implications

The exam tests four error categories with different coordinator responses:

Category	`isRetryable`	Coordinator response
`TRANSIENT` (timeout, network, rate limit)	`true`	Retry after delay; use `retryAfterSeconds` if provided
`VALIDATION` (malformed input, schema mismatch)	`false`	Fix input before retrying; escalate if fix is not possible
`BUSINESS` (no results found, out of scope)	`false`	Proceed with partial results; annotate coverage gap
`PERMISSION` (access denied, auth failure)	`false`	Escalate to human; do not retry with same credentials

The isRetryable field is the first branch in coordinator recovery logic. A coordinator that retries a PERMISSION failure wastes latency and API cost. A coordinator that does not retry a TRANSIENT failure degrades system reliability unnecessarily.

The business error category is the most nuanced. An empty result set from a database query is a business error — it is not a failure of the system, it is a valid answer ("no results matching your criteria"). The coordinator should treat this differently from a transient failure: proceed with what is available, annotate the gap, and report to the user that specific coverage was unavailable rather than retrying.
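
The table above can be read as a routing function on the coordinator side. The sketch below is illustrative only: `choose_recovery`, the action names, and the default delay are assumptions made for this example, and unknown categories are escalated as a conservative choice.

```python
from typing import Any

DEFAULT_RETRY_DELAY_SECONDS = 5  # assumed default, not specified in the article


def retry_delay(failure: dict[str, Any]) -> int:
    """Use retryAfterSeconds from a suggested alternative if the subagent provided one."""
    for alternative in failure.get("potentialAlternatives", []):
        if "retryAfterSeconds" in alternative:
            return alternative["retryAfterSeconds"]
    return DEFAULT_RETRY_DELAY_SECONDS


def choose_recovery(failure: dict[str, Any]) -> dict[str, Any]:
    """Map a structured failure to a coordinator action without parsing message strings."""
    category = failure.get("errorCategory")
    if category == "TRANSIENT" and failure.get("isRetryable", False):
        return {"action": "retry", "delaySeconds": retry_delay(failure)}
    if category == "VALIDATION":
        return {"action": "fix_input_or_escalate"}
    if category == "BUSINESS":
        return {
            "action": "proceed_with_partial",
            "coverageGap": failure.get("coverageGap", ""),
        }
    # PERMISSION and any unrecognized category: never retry with the same credentials.
    return {"action": "escalate_to_human"}
```

For example, a `TRANSIENT` failure carrying `retryAfterSeconds: 30` yields a retry after 30 seconds, while a `PERMISSION` failure is escalated without any retry.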

Local recovery before propagation

Subagents should attempt local recovery before returning failure context to the coordinator. The sequence:

Primary approach fails → try fallback approach locally
Fallback also fails → try simplified query (reduced scope)
All local options exhausted → return structured failure with partial results and alternatives

This prevents the coordinator from being involved in routine recovery decisions that subagents are better positioned to handle. A web search subagent that fails on a complex query should try a simpler version locally — not immediately propagate the failure and wait for the coordinator to decide to retry with a simpler query.

The exam tests this in the multi-agent research system scenario: a coordinator delegates to web search, document analysis, and synthesis. When web search fails, what should happen? The subagent should:

Retry with simplified query
Try alternative search strategy (different keywords, different time range)
Return structured failure with partial results if both fail

What the subagent should NOT do: propagate the first failure immediately to the coordinator, leaving the coordinator to reason about search query reformulation it is not positioned to perform.
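
The recovery sequence can be sketched as a loop over an ordered list of strategies inside the subagent. Everything named here is hypothetical: `SearchError`, `search_with_local_recovery`, the `attemptedStrategies` and `strategyUsed` keys, and the stand-in strategy functions exist only to illustrate the pattern.

```python
from typing import Any, Callable, Optional


class SearchError(Exception):
    """Hypothetical error raised by a search strategy; may carry partial results."""

    def __init__(self, message: str, category: str = "TRANSIENT",
                 partial: Optional[list[Any]] = None) -> None:
        super().__init__(message)
        self.category = category
        self.partial = partial or []


def search_with_local_recovery(
    query: str, strategies: list[tuple[str, Callable[[str], list[Any]]]]
) -> dict[str, Any]:
    """Try each strategy in order; return a structured failure only when all fail."""
    attempted: list[str] = []
    partial: list[Any] = []
    last_category = "TRANSIENT"
    for label, run in strategies:
        attempted.append(label)
        try:
            results = run(query)
        except SearchError as exc:
            partial.extend(exc.partial)  # keep whatever was retrieved before failing
            last_category = exc.category
            continue
        return {"status": "success", "searchExecuted": True,
                "results": results, "strategyUsed": label}
    return {
        "status": "failure",
        "errorCategory": last_category,
        "isRetryable": last_category == "TRANSIENT",
        "attemptedQuery": query,
        "attemptedStrategies": attempted,
        "partialResults": partial,
    }


def primary(query: str) -> list[Any]:
    raise SearchError("complex query timed out", partial=["cached hit"])


def simplified(query: str) -> list[Any]:
    # Reduced scope: keep only the first two terms of the query.
    return [f"result for: {' '.join(query.split()[:2])}"]


def always_down(query: str) -> list[Any]:
    raise SearchError("backend unreachable")


recovered = search_with_local_recovery(
    "semiconductor earnings Q3 2025", [("primary", primary), ("simplified", simplified)]
)
exhausted = search_with_local_recovery(
    "semiconductor earnings Q3 2025", [("primary", primary), ("fallback", always_down)]
)
```

In the first call the coordinator never sees the primary failure because the simplified query succeeds locally. In the second, it receives one structured failure that lists every strategy tried and the partial results collected along the way.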

Distinguishing access failures from valid empty results (Task Statement 5.4)

The structural difference between "nothing found" and "search failed" must be explicit in the response:

json

// Valid empty result — search ran successfully
{
  "status": "success",
  "results": [],
  "searchExecuted": true,
  "queryInterpreted": "semiconductor earnings Q3 2025",
  "note": "No matching documents found in the indexed corpus"
}

// Access failure — search did not run
{
  "status": "failure",
  "errorCategory": "PERMISSION",
  "isRetryable": false,
  "searchExecuted": false,
  "coverageGap": "External earnings data not retrieved due to API access failure"
}

The searchExecuted flag is the minimum needed to make this distinction clear without coordinator-level string parsing. A coordinator that receives results: [] with searchExecuted: true knows the search ran and found nothing — a valid business outcome. searchExecuted: false signals that coverage is missing due to a system issue.

In synthesis, the coordinator annotates these differently:

results: [], searchExecuted: true → "No external results found for this query"
searchExecuted: false → "External search coverage unavailable — results reflect internal sources only"

The downstream reader (human or another system) gets accurate coverage information either way.
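
As a rough sketch, the two annotation rules can be expressed as a single function over the subagent response. The function name `coverage_annotation` is hypothetical, and treating a missing `searchExecuted` flag as "did not run" is a conservative assumption made here, not something the article specifies.

```python
from typing import Any, Optional


def coverage_annotation(response: dict[str, Any]) -> Optional[str]:
    """Return the synthesis note for a search response, or None if no note is needed."""
    # A missing flag is treated as "search did not run" (assumed conservative default).
    if not response.get("searchExecuted", False):
        return "External search coverage unavailable: results reflect internal sources only"
    if not response.get("results"):
        return "No external results found for this query"
    return None


valid_empty = {"status": "success", "results": [], "searchExecuted": True}
access_failure = {"status": "failure", "errorCategory": "PERMISSION",
                  "isRetryable": False, "searchExecuted": False}

notes = [coverage_annotation(valid_empty), coverage_annotation(access_failure)]
```

The check on `searchExecuted` comes first so that a failure response without a `results` key is never mistaken for a valid empty result.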

Coverage gap annotations in synthesis output

When a coordinator produces a final synthesized report from multiple subagents, partial failures must be surfaced in the output itself — not hidden in logs or left as silent omissions:

markdown

## Q3 2025 Semiconductor Market Analysis

### Key Findings

[Synthesis content based on available data...]

---

### Coverage Notes

- **External earnings data**: Retrieved for Intel and TSMC only. NVIDIA, AMD, Qualcomm 
  external data unavailable due to rate limit at time of analysis. Internal estimates 
  used for those companies (confidence: medium).
  
- **Market share data**: Full coverage from internal document store.

- **Recent news**: Limited to articles indexed before June 15, 2026. Events after 
  that date not reflected in this analysis.

Coverage gap annotations transform silent omissions into explicit, auditable limitations. The reader knows what the report covers, what it does not cover, and why. This is the difference between a report that looks complete and a report that is honest about its completeness.

The exam frames this as a reliability requirement: "An analysis report is produced from five subagents, two of which partially failed. How should the final report handle the missing coverage?" The answer is explicit coverage gap annotations in the report body, not just error logging in the system.
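
One possible way to guarantee that gaps reach the report body is to render the Coverage Notes section directly from the accumulated gap records. This is a sketch under assumptions: `render_coverage_notes` and the `area` key are hypothetical, while `coverageGap` reuses the field name from the failure responses.

```python
def render_coverage_notes(gaps: list[dict[str, str]]) -> str:
    """Render accumulated coverage gap records as a markdown section for the report."""
    lines = ["### Coverage Notes", ""]
    if not gaps:
        lines.append("- No coverage gaps were recorded for this report.")
    for gap in gaps:
        lines.append(f"- **{gap['area']}**: {gap['coverageGap']}")
    return "\n".join(lines)


section = render_coverage_notes([
    {"area": "External earnings data",
     "coverageGap": "Retrieved for Intel and TSMC only."},
    {"area": "Market share data",
     "coverageGap": "Full coverage from internal document store."},
])
```

Emitting the section even when no gaps exist is a deliberate choice in this sketch: the reader can tell that coverage was checked rather than assuming it from silence.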

Context management for long multi-agent sessions (Task Statement 5.4)

Multi-agent research tasks can accumulate large contexts across many tool calls and subagent exchanges. Two patterns that fail at scale:

Progressive summarization without anchoring: Summarizing previous steps as the context grows sounds reasonable until you summarize away a detail that later turns out to be critical. The summary loses the exact wording of a source that the synthesis pass needs to cite accurately.

Lost-in-the-middle at scale: Claude's attention is less reliable on content in the middle of a very long context. Critical subagent outputs in the middle of a 100k-token coordinator context get less attention than outputs at the beginning or end.

The correct pattern for long sessions: scratchpad files. Instead of accumulating all subagent outputs in the conversation context, write intermediate results to structured files and reference them explicitly:

snippet

coordinator-session/
  search-results-web.json      # Web search subagent output
  search-results-internal.json # Internal docs subagent output
  analysis-notes.md            # Synthesis subagent working notes
  coverage-gaps.json           # Accumulated coverage gap records
  final-report.md              # Output being assembled

The coordinator's context contains references to these files rather than the full content. Specific content is read into context only when needed for the current step. This keeps coordinator context lean and prevents both progressive summarization loss and lost-in-the-middle degradation.
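
A minimal sketch of the scratchpad pattern follows, assuming JSON files on local disk. The `Scratchpad` class, its method names, and the shape of the reference record are hypothetical; only the directory and file names come from the layout above.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


class Scratchpad:
    """Stores subagent outputs on disk and hands back small references for the context."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def write(self, filename: str, payload: Any, summary: str) -> dict[str, str]:
        """Persist the full payload; return only a short reference to keep in context."""
        path = self.root / filename
        path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
        return {"file": filename, "summary": summary}

    def read(self, filename: str) -> Any:
        """Load full content only when the current step actually needs it."""
        return json.loads((self.root / filename).read_text(encoding="utf-8"))


# A temporary directory stands in for the real session location.
session_dir = Path(tempfile.mkdtemp()) / "coordinator-session"
pad = Scratchpad(session_dir)

ref = pad.write(
    "search-results-web.json",
    {"results": ["doc-1", "doc-2"]},
    summary="2 web results",
)
context_refs = [ref]  # this short list is what stays in the coordinator context
loaded = pad.read("search-results-web.json")
```

Because the file holds the exact subagent output, the synthesis step can quote sources verbatim instead of relying on a summary that may have dropped the wording.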

What the exam tests in Domain 5

Task Statements 5.3-5.4 map to:

5.3: Structured error context design — what fields enable coordinator recovery, error categories and isRetryable, local recovery before propagation
5.4: Coverage gap handling in synthesis, distinguishing access failures from valid empty results, context management for long sessions (progressive summarization risks, scratchpad files)

The multi-agent research system scenario is the primary exam frame for Domain 5: coordinator, web search subagent, document analysis subagent, synthesis subagent, and report generation. Questions present failure scenarios and ask you to identify the correct error response design, recovery path, or synthesis annotation.

Key takeaways

Returning empty results as success prevents the coordinator from knowing coverage is missing. Use searchExecuted or isError to distinguish.
Handle single subagent failures locally rather than terminating the entire workflow. Return structured partial results with coverage gap context.
Generic error strings give coordinators nothing to act on. Structured error responses with errorCategory, isRetryable, attemptedQuery, partialResults, and potentialAlternatives enable intelligent recovery.
Four error categories: transient (retry), validation (fix input), business (proceed with partial), permission (escalate).
Subagents should attempt local recovery (fallback, simplified query) before propagating failure to the coordinator.
Annotate coverage gaps explicitly in synthesis output — do not silently omit failed subagent contributions.
For long multi-agent sessions, use scratchpad files to prevent context bloat and lost-in-the-middle degradation.

This is a core topic in Domain 5 of the Claude Certified Architect – Foundations exam. Drill the error propagation scenarios with CCA practice tests on explainx.ai — Domain 5 questions often combine error design with synthesis output decisions in a single scenario.

Exam domain weights and task statements are based on the Claude Certified Architect – Foundations Certification Exam Guide published by Anthropic Academy. Verify current content on Anthropic Academy before your exam date.
