it-operations

davila7/claude-code-templates · updated Apr 8, 2026

$ npx skills add https://github.com/davila7/claude-code-templates --skill it-operations
summary

A comprehensive skill for managing IT infrastructure operations, ensuring service reliability, implementing monitoring and alerting strategies, managing incidents, and maintaining operational excellence through automation and best practices.

skill.md

IT Operations Expert

Core Principles

1. Service Reliability First

  • Proactive Monitoring: Implement comprehensive observability before incidents occur
  • Incident Management: Structured response processes with clear escalation paths
  • SLA/SLO Management: Define and maintain service level objectives aligned with business needs
  • Continuous Improvement: Learn from incidents through blameless post-mortems

2. Automation Over Manual Processes

  • Infrastructure as Code: Manage infrastructure configuration through version-controlled code
  • Runbook Automation: Convert manual procedures into automated workflows
  • Self-Healing Systems: Implement automated remediation for common issues
  • Configuration Management: Maintain consistency across environments

3. ITIL Service Management

  • Service Strategy: Align IT services with business objectives
  • Service Design: Design resilient, scalable services
  • Service Transition: Manage changes with minimal disruption
  • Service Operation: Deliver and support services effectively
  • Continual Service Improvement: Iteratively enhance service quality

4. Operational Excellence

  • Documentation: Maintain current runbooks, procedures, and architecture diagrams
  • Knowledge Management: Build searchable knowledge bases from incident resolutions
  • Capacity Planning: Forecast and provision resources proactively
  • Cost Optimization: Balance performance requirements with infrastructure costs

Core Workflow

Infrastructure Operations Workflow

1. MONITORING & OBSERVABILITY
   ├─ Define SLIs/SLOs/SLAs for critical services
   ├─ Implement metrics collection (infrastructure, application, business)
   ├─ Configure alerting with proper thresholds and escalation
   ├─ Build dashboards for different audiences (ops, devs, executives)
   └─ Establish on-call rotation and escalation procedures

2. INCIDENT MANAGEMENT
   ├─ Receive alert or user report
   ├─ Assess severity and impact (P1/P2/P3/P4)
   ├─ Engage appropriate responders
   ├─ Investigate and diagnose root cause
   ├─ Implement fix or workaround
   ├─ Communicate status to stakeholders
   ├─ Document resolution in knowledge base
   └─ Conduct post-incident review

3. CHANGE MANAGEMENT
   ├─ Submit change request with impact assessment
   ├─ Review and approve through CAB (Change Advisory Board)
   ├─ Schedule change window
   ├─ Execute change with rollback plan ready
   ├─ Validate success criteria
   ├─ Document actual vs planned results
   └─ Close change ticket

4. CAPACITY PLANNING
   ├─ Collect resource utilization trends
   ├─ Analyze growth patterns
   ├─ Forecast future requirements
   ├─ Plan procurement or provisioning
   ├─ Execute capacity additions
   └─ Monitor effectiveness

5. AUTOMATION & OPTIMIZATION
   ├─ Identify repetitive manual tasks
   ├─ Document current process
   ├─ Design automated solution
   ├─ Implement and test automation
   ├─ Deploy to production
   ├─ Measure time/cost savings
   └─ Iterate and improve
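
Step 4 of the workflow (collect trends, analyze growth, forecast requirements) can be sketched as a simple linear trend fit. This is a minimal illustration, not part of the skill itself: the function name `days_until_threshold`, the daily sampling interval, and the choice of a straight-line fit are all assumptions, and real capacity data often needs seasonality handling.

```python
from typing import Optional, Sequence


def days_until_threshold(daily_usage_pct: Sequence[float],
                         threshold_pct: float) -> Optional[float]:
    """Fit a least-squares line to daily utilization samples and estimate how
    many days after the last sample the trend line reaches threshold_pct.
    Returns None when usage is flat or falling (hypothetical helper)."""
    n = len(daily_usage_pct)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(daily_usage_pct) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_usage_pct))
    var = sum((x - mean_x) ** 2 for x in range(n))
    slope = cov / var
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    crossing_day = (threshold_pct - intercept) / slope
    # Days remaining, measured from the most recent sample.
    return max(0.0, crossing_day - (n - 1))


# Disk usage growing 2 points per day, starting at 50%.
print(days_until_threshold([50, 52, 54, 56, 58], threshold_pct=70))
```

A result of `None` means the fitted trend never reaches the threshold, so no capacity addition is implied by this data alone.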

Decision Frameworks

Alert Configuration Decision Matrix

Scenario                | Alert Type | Threshold       | Response Time | Escalation
Service completely down | Page       | Immediate       | < 5 min       | Immediate to on-call
Service degraded        | Page       | 2-3 failures    | < 15 min      | After 15 min to on-call
High resource usage     | Warning    | > 80% sustained | < 1 hour      | After 2 hours to team lead
Approaching capacity    | Info       | > 70% trend     | < 24 hours    | Weekly capacity review
Configuration drift     | Ticket     | Any deviation   | < 7 days      | Monthly review
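
The two resource-usage rows of the matrix can be expressed as a small check over a window of utilization samples. This is a sketch under stated assumptions: "sustained" is read as every sample in the window exceeding 80%, "trend" is read as the window average exceeding 70%, and the function name `usage_alert` is hypothetical.

```python
from typing import Optional, Sequence


def usage_alert(samples_pct: Sequence[float]) -> Optional[str]:
    """Map a window of utilization samples to an alert type from the matrix.
    Returns "Warning", "Info", or None when no alert applies."""
    if not samples_pct:
        return None
    # Assumed reading of "> 80% sustained": every sample is above 80%.
    if min(samples_pct) > 80:
        return "Warning"
    # Assumed reading of "> 70% trend": the window average is above 70%.
    if sum(samples_pct) / len(samples_pct) > 70:
        return "Info"
    return None


print(usage_alert([85, 90, 82]))  # all samples above 80%
print(usage_alert([72, 75, 85]))  # average above 70%, not sustained above 80%
print(usage_alert([50, 60, 95]))  # a single spike, low average
```

Requiring the whole window to exceed the threshold keeps a single spike from raising a warning, which is one way to reduce noisy alerts.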

Incident Severity Classification

Priority 1 (Critical)

  • Complete service outage affecting all users
  • Data loss or security breach
  • Financial impact > $10K/hour
  • Response: Immediate, 24/7, all hands on deck

Priority 2 (High)

  • Partial service outage affecting many users
  • Significant performance degradation
  • Financial impact $1K-$10K/hour
  • Response: < 30 minutes during business hours

Priority 3 (Medium)

  • Service degradation affecting some users
  • Non-critical functionality impaired
  • Workaround available
  • Response: < 4 hours during business hours

Priority 4 (Low)

  • Minor issues with minimal impact
  • Cosmetic problems
  • Enhancement requests
  • Response: Next business day
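
The classification above can be encoded as a first-match rule set. The sketch below is illustrative only: the `Incident` fields and the scope labels (`"all"`, `"many"`, `"some"`, `"minimal"`) are hypothetical simplifications of the criteria listed, and a real triage would weigh more signals.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    scope: str                        # "all", "many", "some", or "minimal" (assumed labels)
    data_loss_or_breach: bool = False
    cost_per_hour: float = 0.0        # estimated financial impact in USD


RESPONSE_TARGETS = {
    "P1": "Immediate, 24/7",
    "P2": "< 30 minutes during business hours",
    "P3": "< 4 hours during business hours",
    "P4": "Next business day",
}


def classify(incident: Incident) -> str:
    """Return the highest priority whose criteria the incident meets."""
    if (incident.scope == "all" or incident.data_loss_or_breach
            or incident.cost_per_hour > 10_000):
        return "P1"
    if incident.scope == "many" or incident.cost_per_hour >= 1_000:
        return "P2"
    if incident.scope == "some":
        return "P3"
    return "P4"


incident = Incident(scope="some", cost_per_hour=2_500)
priority = classify(incident)
print(priority, "->", RESPONSE_TARGETS[priority])
```

Checking the most severe criteria first means an incident that affects only some users but costs $2,500 per hour is still escalated to P2.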

Change Management Risk Assessment

Risk Level = Impact × Likelihood × Complexity

Impact (1-5):
1 = Single user
2 = Team
3 = Department
4 = Company-wide
5 = Customer-facing

Likelihood of Issues (1-5):
1 = Routine, tested
2 = Familiar, documented
3 = Some uncertainty
4 = New territory
5 = Never done before

Complexity (1-5):
1 = Single component
2 = Few components
3 = Multiple systems
4 = Cross-platform
5 = Enterprise-wide

Risk Score Interpretation:
1-20: Standard change (pre-approved)
21-50: Normal change (CAB review)
51-75: High-risk change (extensive testing, senior approval)
76-125: Emergency change only (executive approval)
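
The scoring formula and interpretation bands translate directly into code. A minimal sketch, with hypothetical function names `risk_score` and `change_category`:

```python
def risk_score(impact: int, likelihood: int, complexity: int) -> int:
    """Risk Level = Impact x Likelihood x Complexity, each rated 1-5."""
    for name, value in (("impact", impact),
                        ("likelihood", likelihood),
                        ("complexity", complexity)):
        if not 1 <= value <= 5:
            raise ValueError(f"{name} must be between 1 and 5, got {value}")
    return impact * likelihood * complexity


def change_category(score: int) -> str:
    """Map a risk score (1-125) to the change type from the bands above."""
    if not 1 <= score <= 125:
        raise ValueError(f"score must be between 1 and 125, got {score}")
    if score <= 20:
        return "Standard change (pre-approved)"
    if score <= 50:
        return "Normal change (CAB review)"
    if score <= 75:
        return "High-risk change (extensive testing, senior approval)"
    return "Emergency change only (executive approval)"


# Company-wide impact (4), some uncertainty (3), multiple systems (3).
score = risk_score(impact=4, likelihood=3, complexity=3)
print(score, change_category(score))
```

Because the three factors are multiplied, a single low rating pulls the score down sharply: a customer-facing change (5) that is routine (1) and touches one component (1) scores only 5.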

Monitoring Tool Selection

Requirement    | Prometheus + Grafana | Datadog   | New Relic | ELK Stack | Splunk
Cost           | Free (self-hosted)   | $$$$      | $$$$      | Free-$$   | $$$$$
Metrics        | Excellent            | Excellent | Excellent | Good      | Good
Logs           | Via Loki             | Excellent | Excellent | Excellent | Excellent
Traces         | Via Tempo            | Excellent | Excellent | Limited   | Good
Learning Curve | Steep                | Moderate  | Moderate  | Steep     | Steep
Cloud-Native   | Excellent            | Excellent | Excellent | Good      | Good
On-Premises    | Excellent            | Good      | Good      | Excellent | Excellent
APM            | Via exporters        | Excellent | Excellent | Limited   | Good

Common Operational Challenges

Challenge 1: Alert Fatigue

Problem: Too many false positive alerts causing team burnout

Solution:

Alert Tuning Process:
1. Measure baseline alert volume and false positive rate
2. Categorize alerts by actionability:
   - Actionable + Urgent = Keep as page
   - Actionable + Not Urgent = Ticket
   - Not Actionable = Remove or convert to dashboard metric
3. Implement alert aggregation (group similar alerts)
4. Add context to alerts (runbook links, relevant metrics)
5. Regular review meetings (weekly) to tune thresholds
6. Track metrics:
   - MTTA (Mean Time to Acknowledge): < 5 min target
   - False Positive Rate: < 20% target
   - Alert Volume per Week: Trending down
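
Steps 1 and 2 of the tuning process can be sketched as a triage function plus a false-positive-rate calculation. The `Alert` record and its field names are assumptions made for illustration; they do not correspond to any particular alerting tool's data model.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Alert:
    name: str
    actionable: bool
    urgent: bool
    false_positive: bool


def triage(actionable: bool, urgent: bool) -> str:
    """Apply the actionability rules from step 2."""
    if not actionable:
        return "remove-or-dashboard"
    return "page" if urgent else "ticket"


def false_positive_rate(alerts: Sequence[Alert]) -> float:
    """Percentage of alerts that were false positives (target: < 20%)."""
    if not alerts:
        return 0.0
    return sum(a.false_positive for a in alerts) / len(alerts) * 100


week = [
    Alert("api-5xx", actionable=True, urgent=True, false_positive=False),
    Alert("disk-80pct", actionable=True, urgent=False, false_positive=False),
    Alert("cpu-spike", actionable=False, urgent=False, false_positive=True),
    Alert("cert-expiry-30d", actionable=True, urgent=False, false_positive=False),
]
print(Counter(triage(a.actionable, a.urgent) for a in week))
print(f"false positive rate: {false_positive_rate(week):.1f}%")
```

Running the same tally each week gives the trend line that the weekly review in step 5 needs.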

Challenge 2: Incident Documentation During Crisis

Problem: Teams skip documentation during high-pressure incidents

Solution:

  • Assign dedicated scribe role (not the incident commander)
  • Use incident management tools (PagerDuty, Opsgenie) with automatic timeline
  • Template-based incident reports with required fields
  • Post-incident review scheduled automatically (within 48 hours)
  • Gamify documentation (track and recognize thorough documentation)

Challenge 3: Knowledge Silos

Problem: Critical knowledge trapped in individual team members' heads

Solution:

Knowledge Transfer Strategy:
- Pair Programming/Shadowing: 20% of sprint capacity
- Runbook Requirements: Every system must have runbook
- Lunch & Learn Sessions: Weekly 30-min knowledge sharing
- Cross-Training Matrix: Track who knows what, identify gaps
- On-Call Rotation: Everyone rotates to spread knowledge
- Post-Incident Reviews: Mandatory team sharing
- Documentation Sprints: Quarterly focus on doc completion

Challenge 4: Balancing Stability vs Innovation

Problem: Operations team resists change to maintain stability

Solution:

  • Implement change windows (planned maintenance periods)
  • Use blue-green or canary deployments for lower risk
  • Establish "innovation time" (Google 20% time model)
  • Create sandbox environments for experimentation
  • Measure and reward both stability AND improvement metrics
  • Include "toil reduction" as OKR target

Key Metrics & KPIs

Service Reliability Metrics

Availability:
  Formula: (Total Time - Downtime) / Total Time × 100
  Target: 99.9% (43.8 min/month downtime)
  Measurement: Per service, monthly

MTTR (Mean Time to Recovery):
  Formula: Sum of recovery times / Number of incidents
  Target: < 30 minutes for P1, < 4 hours for P2
  Measurement: Per severity level, monthly

MTBF (Mean Time Between Failures):
  Formula: Total operational time / Number of failures
  Target: > 720 hours (30 days)
  Measurement: Per service, quarterly

MTTA (Mean Time to Acknowledge):
  Formula: Sum of acknowledgment times / Number of alerts
  Target: < 5 minutes for pages
  Measurement: Per on-call engineer, weekly

Change Success Rate:
  Formula: Successful changes / Total changes × 100
  Target: > 95%
  Measurement: Monthly

Incident Recurrence Rate:
  Formula: Repeat incidents / Total incidents × 100
  Target: < 10%
  Measurement: Quarterly (same root cause within 90 days)
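
The formulas above are simple ratios, and a short sketch makes the arithmetic concrete. Function names are hypothetical. The 43.8 min/month figure quoted for 99.9% availability matches an average month of 365.25 / 12 days (43,830 minutes); a 30-day month would give 43.2 minutes.

```python
from typing import Sequence

MINUTES_PER_AVG_MONTH = 365.25 / 12 * 24 * 60  # 43,830 minutes


def availability_pct(total_minutes: float, downtime_minutes: float) -> float:
    """(Total Time - Downtime) / Total Time x 100."""
    return (total_minutes - downtime_minutes) / total_minutes * 100


def downtime_budget_minutes(target_pct: float,
                            total_minutes: float = MINUTES_PER_AVG_MONTH) -> float:
    """Downtime allowed in the period while still meeting target_pct."""
    return total_minutes * (1 - target_pct / 100)


def mttr_minutes(recovery_times: Sequence[float]) -> float:
    """Sum of recovery times / number of incidents."""
    return sum(recovery_times) / len(recovery_times)


def mtbf_hours(operational_hours: float, failures: int) -> float:
    """Total operational time / number of failures."""
    return operational_hours / failures


print(f"99.9% budget: {downtime_budget_minutes(99.9):.1f} min/month")
print(f"MTTR: {mttr_minutes([20, 40, 30]):.0f} min")
print(f"MTBF: {mtbf_hours(2160, 3):.0f} h")
```

The downtime budget is the same quantity that SLO-based teams track as an error budget: once it is spent for the period, further risky changes put the target at risk.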

Operational Efficiency Metrics

Toil Percentage:
  Definition: Time spent on manual, repetitive tasks
  Target: < 30% of team capacity
  Measurement: Weekly time tracking

Automation Coverage:
  Formula: Automated tasks / Total repetitive tasks × 100
  Target: > 70%
  Measurement: Quarterly audit

On-Call Load:
  Formula: Alerts per on-call shift
  Target: < 5 actionable alerts per shift
  Measurement: Per engineer, weekly

Runbook Coverage:
  Formula: Services with runbooks / Total services × 100
  Target: 100%
  Measurement: Monthly audit

Knowledge Base Utilization:
  Formula: Incidents resolved via KB / Total incidents × 100
  Target: > 40%
  Measurement: Monthly
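
The efficiency targets above can be checked mechanically once the underlying counts are collected. A minimal scorecard sketch follows; the metric keys and the `meets_target` helper are hypothetical names, and the thresholds are copied from the targets listed.

```python
import operator

# (comparison, threshold) per metric, taken from the targets above.
TARGETS = {
    "toil_pct": (operator.lt, 30),
    "automation_coverage_pct": (operator.gt, 70),
    "alerts_per_shift": (operator.lt, 5),
    "runbook_coverage_pct": (operator.ge, 100),
    "kb_utilization_pct": (operator.gt, 40),
}


def pct(part: float, whole: float) -> float:
    """Percentage helper for the coverage-style formulas."""
    return part / whole * 100


def meets_target(metric: str, value: float) -> bool:
    compare, threshold = TARGETS[metric]
    return compare(value, threshold)


observed = {
    "toil_pct": pct(12, 40),                # 12 of 40 team hours on manual work
    "automation_coverage_pct": pct(18, 24),  # 18 of 24 repetitive tasks automated
    "alerts_per_shift": 6,
    "runbook_coverage_pct": pct(45, 50),     # 45 of 50 services have runbooks
    "kb_utilization_pct": pct(21, 50),       # 21 of 50 incidents resolved via KB
}
for metric, value in observed.items():
    status = "ok" if meets_target(metric, value) else "MISS"
    print(f"{metric:26s} {value:6.1f}  {status}")
```

The sample numbers are invented for the example and carry no benchmark value.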

Integration Points

With Development Teams

  • Participate in design reviews for operational requirements
  • Provide deployment automation and CI/CD pipeline support
  • Share monitoring and logging requirements
  • Collaborate on incident response and post-mortems
  • Joint ownership of SLOs and error budgets

With Security Teams

  • Implement security monitoring and alerting
  • Manage access controls and authentication systems
  • Coordinate vulnerability patching and remediation
  • Conduct security incident response
  • Maintain compliance with security policies

With Business Stakeholders

  • Report on service availability and performance
  • Communicate planned maintenance windows
  • Provide capacity planning forecasts
  • Translate technical metrics to business impact
  • Participate in business continuity planning

Best Practices

1. Blameless Post-Mortems

Post-Incident Review Template:
- Incident Summary (what happened, when, impact)
- Timeline of Events (detailed chronology)
- Root Cause Analysis (5 Whys or Fishbone)
- What Went Well (strengths during response)
- What Could Be Improved (opportunities)
- Action 

How to use it-operations on Cursor

AI-first code editor with Composer

1. Prerequisites

Before installing skills in Cursor, ensure your development environment meets these requirements:

  • Cursor installed and configured on your development machine
  • Node.js version 16.0+ with npm package manager (verify with node --version)
  • Active project directory or workspace where you want to add it-operations

2. Execute installation command

Execute the skills CLI command in your project's root directory to begin installation:

$ npx skills add https://github.com/davila7/claude-code-templates --skill it-operations

The skills CLI fetches it-operations from the GitHub repository davila7/claude-code-templates and configures it for Cursor.

3. Select Cursor when prompted

The CLI will show a list of available agents. Use arrow keys to navigate and space to select Cursor:

◆ Which agents do you want to install to?
│ ── Universal (.agents/skills) ── always included ────
│ • Amp
│ • Antigravity
│ • Cline
│ • Codex
│ ● Cursor (selected)
│ • Windsurf

4. Verify installation

Confirm successful installation by checking the skill directory location:

.cursor/skills/it-operations

Reload or restart Cursor to activate it-operations. Access the skill through slash commands (e.g., /it-operations) or your agent's skill management interface.

Security & Verification Notice

We perform automated surface-level scans (Gen AI Scanner, Socket, Snyk) during installation. These checks detect common vulnerabilities but do not guarantee complete security. Always review skill source code and verify the publisher's reputation before production use.

Skills execute code in your development environment. Always verify the publisher's identity, review recent commits, and test in isolated environments before production deployment.

Use Cases

User Story & Requirements Generation

Create detailed user stories, acceptance criteria, and feature specs

Example

Generate user stories for 'password reset feature' with acceptance criteria, edge cases, and test scenarios

Reduce spec writing time by 50%, ensure comprehensive coverage

Competitive Analysis

Research competitors, compare features, identify gaps

Example

Analyze 5 competitor products, create feature comparison matrix, suggest differentiation opportunities

Complete competitive research in 2 hours instead of 2 days

Roadmap Prioritization

Evaluate features using frameworks (RICE, ICE, Kano) and create prioritized backlogs

Example

Score 20 feature ideas using RICE framework, generate prioritized roadmap with rationale

Make data-driven prioritization decisions faster

Stakeholder Communication

Draft PRDs, status updates, and stakeholder presentations

Example

Create executive summary of Q3 roadmap, monthly progress report, feature launch announcement

Save 3-5 hours/week on communication overhead

Implementation Guide

Prerequisites

  • Claude Desktop or compatible AI client
  • Access to product documentation and roadmap tools (Jira, Notion, etc.)
  • Understanding of product management frameworks (RICE, Jobs-to-be-Done, etc.)
  • Stakeholder contact information and communication channels

Time Estimate

30-60 minutes to see productivity improvements

Installation Steps

  1. Install product management skill
  2. Start with user story generation for known feature
  3. Progress to competitive analysis: research 2-3 competitors
  4. Use for roadmap prioritization: apply RICE/ICE scoring
  5. Draft stakeholder communications and refine based on feedback
  6. Build template library for recurring PM tasks
  7. Share effective prompts with product team

Common Pitfalls

  • Not validating competitive research—verify facts before sharing
  • Accepting user stories without involving engineering team
  • Over-relying on frameworks without qualitative judgment
  • Not customizing outputs to company culture and communication style
  • Skipping stakeholder validation of generated requirements

Best Practices

✓ Do

  • Validate research and competitive analysis with real data
  • Collaborate with engineering when generating technical requirements
  • Customize frameworks and templates to your company context
  • Use skill for first drafts, refine with stakeholder input
  • Document successful prompt patterns for PM tasks
  • Combine AI efficiency with human judgment and intuition

✗ Don't

  • Don't publish competitive analysis without fact-checking
  • Don't finalize user stories without engineering review
  • Don't make prioritization decisions solely on AI scoring
  • Don't skip customer validation of generated requirements
  • Don't ignore company-specific context and culture

💡 Pro Tips

  • Provide context: company goals, constraints, customer feedback
  • Ask for alternatives: 'Show 3 ways to prioritize this roadmap'
  • Request stakeholder-specific formatting: 'Executive summary vs. engineering spec'
  • Use skill for 70% generation + 30% customization to company needs

When to Use This

✓ Use When

Use for user story writing, competitive research, roadmap prioritization, stakeholder communication, and PRD drafting. Best for reducing repetitive documentation and research work.

✗ Avoid When

Avoid for strategic product vision (requires deep customer empathy), pricing decisions (needs market and financial expertise), or when face-to-face customer discovery is more valuable than speed.

Learning Path

  1. Basic: user stories, feature specs, status updates
  2. Intermediate: competitive analysis, prioritization frameworks, PRDs
  3. Advanced: product strategy, go-to-market planning, OKR setting
  4. Expert: product vision, market positioning, business model innovation

general reviews

Ratings

4.7 · 26 reviews
  • Mateo Gill · Nov 3, 2024

    I recommend it-operations for anyone iterating fast on agent tooling; clear intent and a small, reviewable surface area.

  • Mateo Rao · Oct 22, 2024

    it-operations reduced setup friction for our internal harness; good balance of opinion and flexibility.

  • Oshnikdeep · Sep 25, 2024

    it-operations reduced setup friction for our internal harness; good balance of opinion and flexibility.

  • Soo Nasser · Sep 25, 2024

    Registry listing for it-operations matched our evaluation — installs cleanly and behaves as described in the markdown.

  • Aditi Desai · Sep 9, 2024

    Solid pick for teams standardizing on skills: it-operations is focused, and the summary matches what you get after install.

  • Piyush G · Sep 1, 2024

    it-operations fits our agent workflows well — practical, well scoped, and easy to wire into existing repos.

  • Min Rahman · Aug 28, 2024

    We added it-operations from the explainx registry; install was straightforward and the SKILL.md answered most questions upfront.

  • Shikha Mishra · Aug 20, 2024

    it-operations is among the better-maintained entries we tried; worth keeping pinned for repeat workflows.

  • Ganesh Mohane · Aug 16, 2024

    I recommend it-operations for anyone iterating fast on agent tooling; clear intent and a small, reviewable surface area.

  • Maya Martinez · Aug 16, 2024

    Keeps context tight: it-operations is the kind of skill you can hand to a new teammate without a long onboarding doc.
