Confirm successful installation by checking the skill directory location:
.cursor/skills/postmortem-writing
Restart Cursor to activate postmortem-writing. Access via /postmortem-writing in your agent's command palette.
β
Security Notice
We perform automated surface-level scans (Gen AI Scanner, Socket, Snyk) during installation. These checks detect common vulnerabilities but do not guarantee complete security. Always review skill source code and verify the publisher's reputation before production use.
Skills execute code in your environment. Always review source, verify the publisher, and test in isolation before production.
Comprehensive guide to writing effective, blameless postmortems that drive organizational learning and prevent incident recurrence.
Do not use this skill when
The task is unrelated to postmortem writing
You need a different domain or tool outside this scope
Instructions
Clarify goals, constraints, and required inputs.
Apply relevant best practices and validate outcomes.
Provide actionable steps and verification.
If detailed examples are required, open resources/implementation-playbook.md.
Use this skill when
Conducting post-incident reviews
Writing postmortem documents
Facilitating blameless postmortem meetings
Identifying root causes and contributing factors
Creating actionable follow-up items
Building organizational learning culture
Core Concepts
1. Blameless Culture
Blame-Focused
Blameless
"Who caused this?"
"What conditions allowed this?"
"Someone made a mistake"
"The system allowed this mistake"
Punish individuals
Improve systems
Hide information
Share learnings
Fear of speaking up
Psychological safety
2. Postmortem Triggers
SEV1 or SEV2 incidents
Customer-facing outages > 15 minutes
Data loss or security incidents
Near-misses that could have been severe
Novel failure modes
Incidents requiring unusual intervention
Quick Start
Postmortem Timeline
Day 0: Incident occurs
Day 1-2: Draft postmortem document
Day 3-5: Postmortem meeting
Day 5-7: Finalize document, create tickets
Week 2+: Action item completion
Quarterly: Review patterns across incidents
Templates
Template 1: Standard Postmortem
# Postmortem: [Incident Title]**Date**: 2024-01-15
**Authors**: @alice, @bob
**Status**: Draft | In Review | Final
**Incident Severity**: SEV2
**Incident Duration**: 47 minutes
## Executive SummaryOn January 15, 2024, the payment processing service experienced a 47-minute outage affecting approximately 12,000 customers. The root cause was a database connection pool exhaustion triggered by a configuration change in deployment v2.3.4. The incident was resolved by rolling back to v2.3.3 and increasing connection pool limits.
**Impact**:
- 12,000 customers unable to complete purchases
- Estimated revenue loss: $45,000
- 847 support tickets created
- No data loss or security implications
## Timeline (All times UTC)| Time | Event ||------|-------|| 14:23 | Deployment v2.3.4 completed to production || 14:31 | First alert: `payment_error_rate > 5%`|| 14:33 | On-call engineer @alice acknowledges alert || 14:35 | Initial investigation begins, error rate at 23% || 14:41 | Incident declared SEV2, @bob joins || 14:45 | Database connection exhaustion identified || 14:52 | Decision to rollback deployment || 14:58 | Rollback to v2.3.3 initiated || 15:10 | Rollback complete, error rate dropping || 15:18 | Service fully recovered, incident resolved |## Root Cause Analysis### What HappenedThe v2.3.4 deployment included a change to the database query pattern that inadvertently removed connection pooling for a frequently-called endpoint. Each request opened a new database connection instead of reusing pooled connections.
### Why It Happened1.**Proximate Cause**: Code change in `PaymentRepository.java` replaced pooled `DataSource` with direct `DriverManager.getConnection()` calls.
2.**Contributing Factors**:
- Code review did not catch the connection handling change
- No integration tests specifically for connection pool behavior
- Staging environment has lower traffic, masking the issue
- Database connection metrics alert threshold was too high (90%)
3.**5 Whys Analysis**:
- Why did the service fail? β Database connections exhausted
- Why were connections exhausted? β Each request opened new connection
- Why did each request open new connection? β Code bypassed connection pool
- Why did code bypass connection pool? β Developer unfamiliar with codebase patterns
- Why was developer unfamiliar? β No documentation on connection management patterns
### System Diagram
[Client] β [Load Balancer] β [Payment Service] β [Database]
β
Connection Pool (broken)
β
Direct connections (cause)
## Detection
### What Worked
- Error rate alert fired within 8 minutes of deployment
- Grafana dashboard clearly showed connection spike
- On-call response was swift (2 minute acknowledgment)
### What Didn't Work
- Database connection metric alert threshold too high
- No deployment-correlated alerting
- Canary deployment would have caught this earlier
### Detection Gap
The deployment completed at 14:23, but the first alert didn't fire until 14:31 (8 minutes). A deployment-aware alert could have detected the issue faster.
## Response
### What Worked
- On-call engineer quickly identified database as the issue
- Rollback decision was made decisively
- Clear communication in incident channel
### What Could Be Improved
- Took 10 minutes to correlate issue with recent deployment
- Had to manually check deployment history
- Rollback took 12 minutes (could be faster)
## Impact
### Customer Impact
- 12,000 unique customers affected
- Average impact duration: 35 minutes
- 847 support tickets (23% of affected users)
- Customer satisfaction score dropped 12 points
### Business Impact
- Estimated revenue loss: $45,000
- Support cost: ~$2,500 (agent time)
- Engineering time: ~8 person-hours
### Technical Impact
- Database primary experienced elevated load
- Some replica lag during incident
- No permanent damage to systems
## Lessons Learned
### What Went Well
1. Alerting detected the issue before customer reports
2. Team collaborated effectively under pressure
3. Rollback procedure worked smoothly
4. Communication was clear and timely
### What Went Wrong
1. Code review missed critical change
2. Test coverage gap for connection pooling
3. Staging environment doesn't reflect production traffic
4. Alert thresholds were not tuned properly
### Where We Got Lucky
1. Incident occurred during business hours with full team available
2. Database handled the load without failing completely
3. No other incidents occurred simultaneously
## Action Items
| Priority | Action | Owner | Due Date | Ticket |
|----------|--------|-------|----------|--------|
| P0 | Add integration test for connection pool behavior | @alice | 2024-01-22 | ENG-1234 |
| P0 | Lower database connection alert threshold to 70% | @bob | 2024-01-17 | OPS-567 |
| P1 | Document connection management patterns | @alice | 2024-01-29 | DOC-89 |
| P1 | Implement deployment-correlated alerting | @bob | 2024-02-05 | OPS-568 |
| P2 | Evaluate canary deployment strategy | @charlie | 2024-02-15 | ENG-1235 |
| P2 | Load test staging with production-like traffic | @dave | 2024-02-28 | QA-123 |
## Appendix
### Supporting Data
#### Error Rate Graph
[Link to Grafana dashboard snapshot]
#### Database Connection Graph
[Link to metrics]
### Related Incidents
- 2023-11-02: Similar connection issue in User Service (POSTMORTEM-42)
### References
- Connection Pool Best Practices
- Deployment Runbook
Template 2: 5 Whys Analysis
# 5 Whys Analysis: [Incident]## Problem StatementPayment service experienced 47-minute outage due to database connection exhaustion.
## Analysis### Why #1: Why did the service fail?**Answer**: Database connections were exhausted, causing all new requests to fail.
**Evidence**: Metrics showed connection count at 100/100 (max), with 500+ pending requests.
---### Why #2: Why were database connections exhausted?**Answer**: Each incoming request opened a new database connection instead of using the connection pool.
**Evidence**: Code diff shows direct `DriverManager.getConnection()` instead of pooled `DataSource`.
---### Why #3: Why did the code bypass the connection pool?**Answer**: A developer refactored the repository class and inadvertently changed the connection acquisition method.
**Evidence**: PR #1234 shows the change, made while fixing a different bug.
---### Why #4: Why wasn't this caught in code review?**Answer**: The reviewer focused on the functional change (the bug fix) and didn't notice the infrastructure change.
**Evidence**: Review comments only discuss business logic.
---### Why #5: Why isn't there a safety net for this type of change?**Answer**: We lack automated tests that verify connection pool behavior and lack documentation about our connection patterns.
**Evidence**: Test suite has no tests for connection handling; wiki has no article on database connections.
β
Make data-driven prioritization decisions faster
Stakeholder Communication
Draft PRDs, status updates, and stakeholder presentations
βΊAccess to product documentation and roadmap tools (Jira, Notion, etc.)
βΊUnderstanding of product management frameworks (RICE, Jobs-to-be-Done, etc.)
βΊStakeholder contact information and communication channels
Time Estimate
30-60 minutes to see productivity improvements
Steps
1Install product management skill
2Start with user story generation for known feature
3Progress to competitive analysis: research 2-3 competitors
4Use for roadmap prioritization: apply RICE/ICE scoring
5Draft stakeholder communications and refine based on feedback
6Build template library for recurring PM tasks
7Share effective prompts with product team
Common Pitfalls
β Not validating competitive researchβverify facts before sharing
β Accepting user stories without involving engineering team
β Over-relying on frameworks without qualitative judgment
β Not customizing outputs to company culture and communication style
β Skipping stakeholder validation of generated requirements
Best Practices
β Do
+Validate research and competitive analysis with real data
+Collaborate with engineering when generating technical requirements
+Customize frameworks and templates to your company context
+Use skill for first drafts, refine with stakeholder input
+Document successful prompt patterns for PM tasks
+Combine AI efficiency with human judgment and intuition
β Don't
βDon't publish competitive analysis without fact-checking
βDon't finalize user stories without engineering review
βDon't make prioritization decisions solely on AI scoring
βDon't skip customer validation of generated requirements
βDon't ignore company-specific context and culture
π‘ Pro Tips
β Provide context: company goals, constraints, customer feedback
β Ask for alternatives: 'Show 3 ways to prioritize this roadmap'
β Request stakeholder-specific formatting: 'Executive summary vs. engineering spec'
β Use skill for 70% generation + 30% customization to company needs
When to Use This
β Use when
Use for user story writing, competitive research, roadmap prioritization, stakeholder communication, and PRD drafting. Best for reducing repetitive documentation and research work.
β Avoid when
Avoid for strategic product vision (requires deep customer empathy), pricing decisions (needs market and financial expertise), or when face-to-face customer discovery is more valuable than speed.
Learning Path
1Basic: user stories, feature specs, status updates