Production Incident Resolver
A coding agent loop designed to diagnose and resolve production incidents through iterative investigation, targeted fixes, and continuous health monitoring until system stability is restored.
Goal
Resolve production incident
How to Run
Provide a production incident description to this loop. The agent will iteratively investigate, apply fixes, and validate system health until the exit condition is met or max iterations are reached.
- 01
Initiate Incident Response
Run the kickoff prompt with a detailed description of the production issue including any error messages, affected components, and initial observations.
- 02
Monitor Progress
The agent automatically runs the check command after each action. Review the results and approve or deny each proposed change so the resolution stays safe.
- 03
Verify Resolution
Once monitoring shows healthy status, confirm that the fix addresses the root cause and does not introduce new issues.
Workflow Steps
- 01
Understand incident context and scope from user-provided description
Validate understanding with clarifying questions
- 02
Triage by identifying most likely root causes and failure points
Prioritize potential issues based on impact and evidence
- 03
Investigate logs, metrics, and recent changes to confirm diagnosis
Analyze monitoring data and correlate with symptoms
- 04
Implement targeted fix for identified root cause
Apply change and run health check command
- 05
Validate fix doesn't break other functionality
Run comprehensive tests and monitor for regressions
- 06
Document resolution and update incident records
Confirm knowledge capture and prepare handoff notes
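The triage step above (prioritize potential issues based on impact and evidence) can be sketched as a simple ranking. This is a minimal illustration, not part of the loop itself: the `Hypothesis` class, the 1 to 5 and 0 to 5 scales, and the impact-times-evidence score are all assumptions chosen for the example, and the candidate causes are made up.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """One suspected root cause (hypothetical structure for illustration)."""
    cause: str
    impact: int    # assumed scale: 1 (minor) to 5 (full outage)
    evidence: int  # assumed scale: 0 (pure guess) to 5 (confirmed by logs or metrics)


def prioritize(hypotheses):
    """Order hypotheses from most to least worth investigating.

    The score is impact * evidence; ties are broken in favor of stronger evidence.
    """
    return sorted(
        hypotheses,
        key=lambda h: (h.impact * h.evidence, h.evidence),
        reverse=True,
    )


# Made-up candidates, only to show the ordering.
candidates = [
    Hypothesis("expired TLS certificate", impact=5, evidence=1),
    Hypothesis("bad deploy an hour ago", impact=4, evidence=4),
    Hypothesis("noisy neighbor on shared host", impact=2, evidence=2),
]

for h in prioritize(candidates):
    print(f"{h.impact * h.evidence:>2}  {h.cause}")
```

A multiplicative score means a severe but unsupported guess does not outrank a moderately severe cause with strong evidence; a different weighting may suit a team that prefers to rule out worst cases first.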
Kickoff Prompt
Start the "Production Incident Resolver" loop.
Goal: Resolve production incident
Max iterations: 10
Between iterations run: health check
Exit when: Monitoring healthy

I'm experiencing a production incident. Here's what I know so far: [DESCRIBE INCIDENT]. Please guide me through resolving this systematically while keeping our services stable.

Self-pace this loop. After each iteration, run `health check` and evaluate the output, and only continue if the exit condition is not met (Monitoring healthy). Stop when the exit condition passes or 10 iterations are reached. Give a short status update each pass.
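The control flow that the kickoff prompt describes (act, run the health check, stop when healthy or after 10 iterations) could be sketched roughly as follows. The function name `run_incident_loop` and its callable parameters are hypothetical; in the actual loop the agent performs these steps itself, and `health check` stands in for whatever command your environment uses.

```python
from typing import Callable


def run_incident_loop(
    act: Callable[[int], None],
    is_healthy: Callable[[], bool],
    max_iterations: int = 10,
) -> str:
    """Run act, then the health check, until healthy or out of iterations.

    Returns "resolved" if the exit condition passed, otherwise "escalate".
    """
    for iteration in range(1, max_iterations + 1):
        act(iteration)          # investigate or apply one approved fix
        healthy = is_healthy()  # the between-iterations check
        # Short status update each pass, as the prompt requests.
        print(f"iteration {iteration}: {'healthy' if healthy else 'still unhealthy'}")
        if healthy:
            return "resolved"
    return "escalate"


# Stand-in callables: the fake system becomes healthy after the third action.
state = {"actions": 0}


def fake_act(iteration: int) -> None:
    state["actions"] += 1


def fake_health_check() -> bool:
    return state["actions"] >= 3


result = run_incident_loop(fake_act, fake_health_check)
print(result)
```

Passing the action and the check in as callables keeps the sketch testable without a live system; a real `is_healthy` would more likely invoke the health check command and inspect its exit status.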
Guardrails
- ·No code changes to production systems without explicit user approval (hardcoded)
- ·Avoid accessing or modifying sensitive configuration/data
- ·All proposed fixes must pass static analysis before implementation
- ·Maintain full context of previous actions to prevent redundant work
- ·Stop and escalate if max iterations reached without resolution
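Three of the guardrails above act as preconditions on a proposed fix, so they can be expressed as a gate that runs before anything is applied. The sketch below is one possible shape, assuming a hypothetical `ProposedChange` record whose flags are filled in by the user and by whatever static analysis tool is in use; none of these names come from the loop definition.

```python
from dataclasses import dataclass


@dataclass
class ProposedChange:
    """A fix awaiting application (hypothetical record for illustration)."""
    description: str
    user_approved: bool = False
    passed_static_analysis: bool = False
    touches_sensitive_config: bool = False


def blocking_reasons(change: ProposedChange) -> list:
    """Return every guardrail the change currently violates (empty means allowed)."""
    reasons = []
    if not change.user_approved:
        reasons.append("needs explicit user approval")
    if change.touches_sensitive_config:
        reasons.append("touches sensitive configuration or data")
    if not change.passed_static_analysis:
        reasons.append("has not passed static analysis")
    return reasons


# Example: an unreviewed change is blocked, and the reasons are listed.
draft = ProposedChange("raise connection pool size")
print(blocking_reasons(draft))

# After approval and a clean static analysis run, nothing blocks it.
ready = ProposedChange(
    "raise connection pool size",
    user_approved=True,
    passed_static_analysis=True,
)
print(blocking_reasons(ready))
```

Returning all violated guardrails at once, instead of failing on the first, gives the reviewer a complete picture in a single status update.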
Related loops — Debugging
Debugging
Reproduce and Fix
This loop guides you through reproducing a reported bug, identifying its root cause, implementing a fix, and verifying the solution through automated testing. The agent will iteratively work to resolve the issue while maintaining system integrity.
Debugging
Error Log Reduction
This loop analyzes application error logs to identify and fix recurring errors, reducing their frequency over time through iterative debugging and targeted code improvements.
Debugging
Root Cause Finder
A systematic loop for identifying the root cause of code issues, bugs, or unexpected behavior through iterative investigation and analysis, ensuring developers address foundational problems rather than surface-level symptoms.