on-call-handoff-patterns

wshobson/agents · updated Apr 8, 2026

$npx skills add https://github.com/wshobson/agents --skill on-call-handoff-patterns
0 commentsdiscussion
summary

Structured patterns and templates for seamless on-call shift handoffs with full context transfer.

  • Provides shift handoff document template covering active incidents, ongoing investigations, recent changes, known issues, and upcoming events
  • Includes recommended 30-minute overlap timing with split responsibilities for outgoing and incoming engineers
  • Offers quick async handoff template for rapid transitions and mid-incident handoff template for continuity during active incidents
  • Feat
skill.md

On-Call Handoff Patterns

Effective patterns for on-call shift transitions, ensuring continuity, context transfer, and reliable incident response across shifts.

When to Use This Skill

  • Transitioning on-call responsibilities
  • Writing shift handoff summaries
  • Documenting ongoing investigations
  • Establishing on-call rotation procedures
  • Improving handoff quality
  • Onboarding new on-call engineers

Core Concepts

1. Handoff Components

Component Purpose
Active Incidents What's currently broken
Ongoing Investigations Issues being debugged
Recent Changes Deployments, configs
Known Issues Workarounds in place
Upcoming Events Maintenance, releases

2. Handoff Timing

Recommended: 30 min overlap between shifts

Outgoing:
├── 15 min: Write handoff document
└── 15 min: Sync call with incoming

Incoming:
├── 15 min: Review handoff document
├── 15 min: Sync call with outgoing
└── 5 min: Verify alerting setup

Templates

Template 1: Shift Handoff Document

# On-Call Handoff: Platform Team

**Outgoing**: @alice (2024-01-15 to 2024-01-22)
**Incoming**: @bob (2024-01-22 to 2024-01-29)
**Handoff Time**: 2024-01-22 09:00 UTC

---

## 🔴 Active Incidents

### None currently active

No active incidents at handoff time.

---

## 🟡 Ongoing Investigations

### 1. Intermittent API Timeouts (ENG-1234)

**Status**: Investigating
**Started**: 2024-01-20
**Impact**: ~0.1% of requests timing out

**Context**:

- Timeouts correlate with database backup window (02:00-03:00 UTC)
- Suspect backup process causing lock contention
- Added extra logging in PR #567 (deployed 01/21)

**Next Steps**:

- [ ] Review new logs after tonight's backup
- [ ] Consider moving backup window if confirmed

**Resources**:

- Dashboard: [API Latency](https://grafana/d/api-latency)
- Thread: #platform-eng (01/20, 14:32)

---

### 2. Memory Growth in Auth Service (ENG-1235)

**Status**: Monitoring
**Started**: 2024-01-18
**Impact**: None yet (proactive)

**Context**:

- Memory usage growing ~5% per day
- No memory leak found in profiling
- Suspect connection pool not releasing properly

**Next Steps**:

- [ ] Review heap dump from 01/21
- [ ] Consider restart if usage > 80%

**Resources**:

- Dashboard: [Auth Service Memory](https://grafana/d/auth-memory)
- Analysis doc: [Memory Investigation](https://docs/eng-1235)

---

## 🟢 Resolved This Shift

### Payment Service Outage (2024-01-19)

- **Duration**: 23 minutes
- **Root Cause**: Database connection exhaustion
- **Resolution**: Rolled back v2.3.4, increased pool size
- **Postmortem**: [POSTMORTEM-89](https://docs/postmortem-89)
- **Follow-up tickets**: ENG-1230, ENG-1231

---

## 📋 Recent Changes

### Deployments

| Service      | Version | Time        | Notes                      |
| ------------ | ------- | ----------- | -------------------------- |
| api-gateway  | v3.2.1  | 01/21 14:00 | Bug fix for header parsing |
| user-service | v2.8.0  | 01/20 10:00 | New profile features       |
| auth-service | v4.1.2  | 01/19 16:00 | Security patch             |

### Configuration Changes

- 01/21: Increased API rate limit from 1000 to 1500 RPS
- 01/20: Updated database connection pool max from 50 to 75

### Infrastructure

- 01/20: Added 2 nodes to Kubernetes cluster
- 01/19: Upgraded Redis from 6.2 to 7.0

---

## ⚠️ Known Issues & Workarounds

### 1. Slow Dashboard Loading

**Issue**: Grafana dashboards slow on Monday mornings
**Workaround**: Wait 5 min after 08:00 UTC for cache warm-up
**Ticket**: OPS-456 (P3)

### 2. Flaky Integration Test

**Issue**: `test_payment_flow` fails intermittently in CI
**Workaround**: Re-run failed job (usually passes on retry)
**Ticket**: ENG-1200 (P2)

---

## 📅 Upcoming Events

| Date        | Event                | Impact              | Contact       |
| ----------- | -------------------- | ------------------- | ------------- |
| 01/23 02:00 | Database maintenance | 5 min read-only     | @dba-team     |
| 01/24 14:00 | Major release v5.0   | Monitor closely     | @release-team |
| 01/25       | Marketing campaign   | 2x traffic expected | @platform     |

---

## 📞 Escalation Reminders

| Issue Type      | First Escalation     | Second Escalation |
| --------------- | -------------------- | ----------------- |
| Payment issues  | @payments-oncall     | @payments-manager |
| Auth issues     | @auth-oncall         | @security-team    |
| Database issues | @dba-team            | @infra-manager    |
| Unknown/severe  | @engineering-manager | @vp-engineering   |

---

## 🔧 Quick Reference

### Common Commands

```bash
# Check service health
kubectl get pods -A | grep -v Running

# Recent deployments
kubectl get events --sort-by='.lastTimestamp' | tail -20

# Database connections
psql -c "SELECT count(*) FROM pg_stat_activity;"

# Clear cache (emergency only)
redis-cli FLUSHDB
```

Important Links


Handoff Checklist

Outgoing Engineer

  • Document active incidents
  • Document ongoing investigations
  • List recent changes
  • Note known issues
  • Add upcoming events
  • Sync with incoming engineer

Incoming Engineer

  • Read this document
  • Join sync call
  • Verify PagerDuty is routing to you
  • Verify Slack notifications working
  • Check VPN/access working
  • Review critical dashboards

### Template 2: Quick Handoff (Async)

```markdown
# Quick Handoff: @alice → @bob

## TL;DR
- No active incidents
- 1 investigation ongoing (API timeouts, see ENG-1234)
- Major release tomorrow (01/24) - be ready for issues

## Watch List
1. API latency around 02:00-03:00 UTC (backup window)
2. Auth service memory (restart if > 80%)

## Recent
- Deployed api-gateway v3.2.1 yesterday (stable)
- Increased rate limits to 1500 RPS

## Coming Up
- 01/23 02:00 - DB maintenance (5 min read-only)
- 01/24 14:00 - v5.0 release

## Questions?
I'll be available on Slack until 17:00 today.

Template 3: Incident Handoff (Mid-Incident)

# INCIDENT HANDOFF: Payment Service Degradation

**Incident Start**: 2024-01-22 08:15 UTC
**Current Status**: Mitigating
**Severity**: SEV2

---

## Current State

- Error rate: 15% (down from 40%)
- Mitigation in progress: scaling up pods
- ETA to resolution: ~30 min

## What We Know

1. Root cause: Memory pressure on payment-service pods
2. Triggered by: Unusual traffic spike (3x normal)
3. Contributing: Inefficient query in checkout flow

## What We've Done

- Scaled payment-service from 5 → 15 pods
- Enabled rate limiting on checkout endpoint
- Disabled non-critical features

## What Needs to Happen

1. Monitor error rate - should reach <1% in ~15 min
2. If not improving, escalate to @payments-manager
3. Once stable, begin root cause investigation

## Key People

- Incident Commander: @alice (handing off)
- Comms Lead: @charlie
- Technical Lead: @bob (incoming)

## Communication

- Status page: Updated at 08:45
- Customer support: Notified
- Exec team: Aware

## Troubleshooting

**Incoming engineer misses a critical issue because the handoff document was incomplete.**
Use the outgoing checklist as a gate: do not mark handoff complete until every section has at least one entry (or an explicit "none"). Make incomplete handoffs a blameless postmortem action item.

**A 30-minute sync call is not possible due to timezone gaps.**
Fall back to the async quick handoff template (Template 2). Supplement with a short Loom or voice memo walking through the watch list. Ensure the incoming engineer has a direct contact method if they have follow-up questions.

**The incoming engineer inherits a mid-incident and is immediately overwhelmed.**
Use the incident handoff template (Template 3) specifically. The outgoing engineer should remain available on Slack for 15 minutes after handoff, even if off-call, to answer clarifying questions.

**On-call handoff documents are inconsistently formatted across teams.**
Adopt the shift handoff template organization-wide and store completed handoffs in a shared location (wiki, Notion, Confluence). Link each handoff from the on-call schedule entry in PagerDuty.

**Incoming engineer cannot verify their alerting is working before the outgoing engineer logs off.**
Add a standard step: outgoing engineer fires a test alert and confirms incoming engineer receives it in PagerDuty and Slack before ending the overlap window.

## Related Skills

- [incident-classification](../../skills/incident-classification/SKILL.md) — Classify and prioritize incidents that need to be included in the handoff document
- [postmortem-facilitation](../../skills/postmortem-facilitation/SKILL.md) — Turn resolved incidents from the shift into structured postmortems

Discussion

Product Hunt–style comments (not star reviews)
  • No comments yet — start the thread.
general reviews

Ratings

4.568 reviews
  • Valentina Taylor· Dec 24, 2024

    on-call-handoff-patterns fits our agent workflows well — practical, well scoped, and easy to wire into existing repos.

  • William Farah· Dec 12, 2024

    on-call-handoff-patterns is among the better-maintained entries we tried; worth keeping pinned for repeat workflows.

  • Hassan Verma· Dec 8, 2024

    Keeps context tight: on-call-handoff-patterns is the kind of skill you can hand to a new teammate without a long onboarding doc.

  • Pratham Ware· Dec 4, 2024

    on-call-handoff-patterns fits our agent workflows well — practical, well scoped, and easy to wire into existing repos.

  • Ishan Desai· Dec 4, 2024

    Useful defaults in on-call-handoff-patterns — fewer surprises than typical one-off scripts, and it plays nicely with `npx skills` flows.

  • Amelia Liu· Nov 27, 2024

    I recommend on-call-handoff-patterns for anyone iterating fast on agent tooling; clear intent and a small, reviewable surface area.

  • Yash Thakker· Nov 23, 2024

    Registry listing for on-call-handoff-patterns matched our evaluation — installs cleanly and behaves as described in the markdown.

  • Soo Menon· Nov 23, 2024

    on-call-handoff-patterns is among the better-maintained entries we tried; worth keeping pinned for repeat workflows.

  • William Flores· Nov 23, 2024

    Solid pick for teams standardizing on skills: on-call-handoff-patterns is focused, and the summary matches what you get after install.

  • Zara Martin· Nov 19, 2024

    on-call-handoff-patterns has been reliable in day-to-day use. Documentation quality is above average for community skills.

showing 1-10 of 68

1 / 7