observability-service-health

elastic/agent-skills · updated Apr 8, 2026


$ npx skills add https://github.com/elastic/agent-skills --skill observability-service-health

APM Service Health

Assess APM service health using Observability APIs, ES|QL against APM indices, Elasticsearch APIs, and (for correlation and APM-specific logic) the Kibana repo. Use SLOs, firing alerts, ML anomalies, throughput, latency (avg/p95/p99), error rate, and dependency health.

Where to look

  • Observability APIs: Use the SLOs API (Stack and Serverless) to get SLO definitions, status, burn rate, and error budget. Use the Alerting API (Stack and Serverless) to list and manage alerting rules and their alerts for the service. Use the APM annotations API to create or search annotations when needed.
  • ES|QL and Elasticsearch: Query traces*apm*,traces*otel* and metrics*apm*,metrics*otel* with ES|QL (see Using ES|QL for APM metrics) for throughput, latency, error rate, and dependency-style aggregations. Use Elasticsearch APIs (e.g. POST _query for ES|QL, or Query DSL) as documented in the Elasticsearch repo for indices and search.
  • APM Correlations: Run the apm-correlations script to get attributes that correlate with high-latency or failed transactions for a given service. It tries the Kibana internal APM correlations API first, then falls back to Elasticsearch significant_terms on traces*apm*,traces*otel*. See APM Correlations script.
  • Infrastructure: Correlate via resource attributes (e.g. k8s.pod.name, container.id, host.name) in traces; query infrastructure or metrics indices with ES|QL/Elasticsearch for CPU and memory. OOM and CPU throttling directly impact APM health.
  • Logs: Use ES|QL or Elasticsearch search on log indices filtered by service.name or trace.id to explain behavior and root cause.
  • Observability Labs: See Observability Labs and its APM tag for patterns and troubleshooting guidance.

Health criteria

Synthesize health from all of the following when available:

| Signal | What to check |
| --- | --- |
| SLOs | Burn rate, status (healthy/degrading/violated), error budget. |
| Firing alerts | Open or recently fired alerts for the service or dependencies. |
| ML anomalies | Anomaly jobs; score and severity for latency, throughput, or error rate. |
| Throughput | Request rate; compare to baseline or previous period. |
| Latency | Avg, p95, p99; compare to SLO targets or history. |
| Error rate | Failed/total requests; spikes or sustained elevation. |
| Dependency health | Downstream latency, error rate, availability (ES|QL, APIs, Kibana repo). |
| Infrastructure | CPU usage, memory; OOM and CPU throttling on pods/containers/hosts. |
| Logs | App logs filtered by service or trace ID for context and root cause. |

Treat a service as unhealthy if SLOs are violated, critical alerts are firing, or ML anomalies indicate severe degradation. Correlate with infrastructure (OOM, CPU throttling), dependencies, and logs (service/trace context) to explain why and suggest next steps.
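The rule above can be sketched as a small classifier. This is a minimal illustration, not part of the skill: the `Signals` shape, the anomaly score cut-off of 75, and the conditions for "degraded" are all assumptions made for the example; only the three "unhealthy" triggers come from the text.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """Hypothetical container for signals gathered from the SLOs API, Alerting API, and ML results."""
    slo_statuses: list            # e.g. ["healthy", "degrading", "violated"]
    critical_alerts_firing: int   # number of critical rules currently active
    max_anomaly_score: float      # highest ML anomaly score in the time range
    elevated_error_rate: bool = False
    elevated_latency: bool = False


# Assumed cut-off for "severe" ML degradation; tune to your own jobs.
SEVERE_ANOMALY_SCORE = 75.0


def classify_health(s: Signals) -> str:
    """Return "unhealthy", "degraded", or "healthy" from the collected signals."""
    # Documented rule: violated SLOs, critical alerts, or severe anomalies mean unhealthy.
    if ("violated" in s.slo_statuses
            or s.critical_alerts_firing > 0
            or s.max_anomaly_score >= SEVERE_ANOMALY_SCORE):
        return "unhealthy"
    # Assumed rule for this sketch: softer signals downgrade to "degraded".
    if "degrading" in s.slo_statuses or s.elevated_error_rate or s.elevated_latency:
        return "degraded"
    return "healthy"
```

The verdict alone is not the answer: infrastructure, dependency, and log evidence still has to explain why the service landed in that state.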

Using ES|QL for APM metrics

When querying APM data from Elasticsearch (traces*apm*,traces*otel*, metrics*apm*,metrics*otel*), use ES|QL by default where available.

  • Availability: ES|QL is available in Elasticsearch 8.11+ (technical preview; GA in 8.14). It is always available in Elastic Observability Serverless Complete tier.
  • Scoping to a service: Always filter by service.name (and service.environment when relevant). Combine with a time range on @timestamp:
WHERE service.name == "my-service-name" AND service.environment == "production"
  AND @timestamp >= "2025-03-01T00:00:00Z" AND @timestamp <= "2025-03-01T23:59:59Z"
  • Example patterns: Throughput, latency, and error rate over time: see Kibana trace_charts_definition.ts (getThroughputChart, getLatencyChart, getErrorRateChart). Chain from(index), where(...), and stats(...) or evaluate(...), with BUCKET(@timestamp, ...) and WHERE service.name == "<service_name>".
  • Performance: Add LIMIT n to cap rows and token usage. Prefer coarser BUCKET(@timestamp, ...) (e.g. 1 hour) when only trends are needed; finer buckets increase work and result size.
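The scoping and performance advice above can be combined in one query builder. This is a sketch under assumptions: the function names and defaults (1-hour buckets, 500 rows) are illustrative, and the helper only assembles the query text; it does not send it anywhere.

```python
def esql_literal(value: str) -> str:
    """Quote a value as an ES|QL string literal, escaping backslashes and double quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def scoped_throughput_query(service, start, end, environment=None,
                            bucket="1 hour", limit=500):
    """Build a service-scoped, time-bounded, row-capped throughput query."""
    where = [
        f"service.name == {esql_literal(service)}",
        f"@timestamp >= {esql_literal(start)}",
        f"@timestamp <= {esql_literal(end)}",
    ]
    if environment:
        # service.environment is optional; add it only when relevant.
        where.insert(1, f"service.environment == {esql_literal(environment)}")
    return "\n".join([
        "FROM traces*apm*,traces*otel*",
        "| WHERE " + "\n  AND ".join(where),
        # Name the bucket column so it can be sorted on afterwards.
        f"| STATS request_count = COUNT(*) BY bucket = BUCKET(@timestamp, {bucket})",
        "| SORT bucket",
        f"| LIMIT {int(limit)}",
    ])
```

Quoting every user-supplied value through one helper keeps a service name containing a quote from breaking the query.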

APM Correlations script

When only a subpopulation of transactions has high latency or failures, run the apm-correlations script to list attributes that correlate with those transactions (e.g. host, service version, pod, region). The script tries the Kibana internal APM correlations API first; if unavailable (e.g. 404), it falls back to Elasticsearch significant_terms on traces*apm*,traces*otel*.

# Latency correlations (attributes over-represented in slow transactions)
node skills/observability/service-health/scripts/apm-correlations.js latency-correlations --service-name <name> [--start <iso>] [--end <iso>] [--last-minutes 60] [--transaction-type <t>] [--transaction-name <n>] [--space <id>] [--json]

# Failed transaction correlations
node skills/observability/service-health/scripts/apm-correlations.js failed-correlations --service-name <name> [--start <iso>] [--end <iso>] [--last-minutes 60] [--transaction-type <t>] [--transaction-name <n>] [--space <id>] [--json]

# Test Kibana connection
node skills/observability/service-health/scripts/apm-correlations.js test [--space <id>]

Environment: KIBANA_URL and KIBANA_API_KEY (or KIBANA_USERNAME/KIBANA_PASSWORD) for Kibana; for fallback, ELASTICSEARCH_URL and ELASTICSEARCH_API_KEY. Use the same time range as the investigation.
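Before running the script it can help to check which backends the environment actually allows. The following pre-flight check is a hypothetical helper, not the script's own logic; it only reads the variable names listed above.

```python
def resolve_backends(env):
    """Return the backends usable with the given environment mapping.

    "kibana" needs KIBANA_URL plus an API key or a username/password pair;
    "elasticsearch" (the fallback) needs ELASTICSEARCH_URL and ELASTICSEARCH_API_KEY.
    """
    kibana_auth = bool(env.get("KIBANA_API_KEY")) or (
        bool(env.get("KIBANA_USERNAME")) and bool(env.get("KIBANA_PASSWORD"))
    )
    backends = []
    if env.get("KIBANA_URL") and kibana_auth:
        backends.append("kibana")
    if env.get("ELASTICSEARCH_URL") and env.get("ELASTICSEARCH_API_KEY"):
        backends.append("elasticsearch")
    return backends
```

In practice this would be called as `resolve_backends(os.environ)`; an empty list means neither the Kibana path nor the fallback can work.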

Workflow

Service health progress:
- [ ] Step 1: Identify the service (and time range)
- [ ] Step 2: Check SLOs and firing alerts
- [ ] Step 3: Check ML anomalies (if configured)
- [ ] Step 4: Review throughput, latency (avg/p95/p99), error rate
- [ ] Step 5: Assess dependency health (ES|QL/APIs / Kibana repo)
- [ ] Step 6: Correlate with infrastructure and logs
- [ ] Step 7: Summarize health and recommend actions

Step 1: Identify the service

Confirm service name and time range. Resolve the service from the request; if multiple are in scope, target the most relevant. Use ES|QL on traces*apm*,traces*otel* or metrics*apm*,metrics*otel* (e.g. WHERE service.name == "<name>") or Kibana repo APM routes to obtain service-level data. If the user has not provided the time range, assume last hour.
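The time-range default can be made explicit in code. A minimal sketch, assuming ISO 8601 inputs and treating timestamps without an offset as UTC; the function name is illustrative.

```python
from datetime import datetime, timedelta, timezone

FMT = "%Y-%m-%dT%H:%M:%SZ"


def _parse(ts):
    """Parse an ISO 8601 timestamp; a trailing Z or a missing offset is treated as UTC."""
    dt = datetime.fromisoformat(ts.replace("Z", "+00:00"))
    return dt if dt.tzinfo else dt.replace(tzinfo=timezone.utc)


def resolve_time_range(start=None, end=None, now=None):
    """Return (start, end) as UTC strings, defaulting to the hour before `end`."""
    now = now or datetime.now(timezone.utc)
    end_dt = _parse(end) if end else now
    # Only fall back to "last hour" when the user gave no start.
    start_dt = _parse(start) if start else end_dt - timedelta(hours=1)
    if start_dt >= end_dt:
        raise ValueError("start must be before end")
    return (start_dt.astimezone(timezone.utc).strftime(FMT),
            end_dt.astimezone(timezone.utc).strftime(FMT))
```

Resolving the range once and reusing the same pair in every later step keeps SLO, alert, metric, and log evidence aligned.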

Step 2: Check SLOs and firing alerts

  • SLOs: Call the SLOs API to get SLO definitions and status for the service (latency, availability): healthy/degrading/violated, burn rate, and error budget.
  • Alerts: For active APM alerts, call /api/alerting/rules/_find?search=apm&search_fields=tags&per_page=100&filter=alert.attributes.executionStatus.status:active. When checking one service, include both rules where params.serviceName matches the service and rules where params.serviceName is absent (all-services rules). Do not query .alerts* indices for active-state checks. Correlate with SLO violations or metric changes.
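The rule-selection logic for a single service can be sketched as a filter over the rules returned by the find call. The rule dictionaries below are assumed to carry a `params` object as described above; treating an empty `serviceName` the same as an absent one is an assumption of this sketch.

```python
# Path and query string quoted from the step above.
FIND_ACTIVE_APM_RULES = (
    "/api/alerting/rules/_find?search=apm&search_fields=tags&per_page=100"
    "&filter=alert.attributes.executionStatus.status:active"
)


def rules_for_service(rules, service_name):
    """Keep rules that target `service_name` or that apply to all services."""
    applicable = []
    for rule in rules:
        target = (rule.get("params") or {}).get("serviceName")
        # No serviceName means an all-services rule, which also covers this service.
        if not target or target == service_name:
            applicable.append(rule)
    return applicable
```

Dropping the all-services rules is the easy mistake here: a service can look alert-free while a global rule is firing for it.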

Step 3: Check ML anomalies

If ML anomaly detection is used, query ML job results or anomaly records (via Elasticsearch ML APIs or indices) for the service and time range. Note high-severity anomalies (latency, throughput, error rate); use anomaly time windows to narrow Steps 4–5.
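One way to fetch high-severity anomalies is the Elasticsearch ML get-records endpoint. The sketch below only builds the request; the job ID is a placeholder you must supply, the score threshold of 75 is an assumed cut-off, and matching the service against the record's partition/by/over field values is an assumption about how your job is configured.

```python
def anomaly_records_request(job_id, start, end, min_score=75.0, size=50):
    """Build a get-records request for anomalies at or above `min_score`."""
    path = f"/_ml/anomaly_detectors/{job_id}/results/records"
    body = {
        "start": start,
        "end": end,
        "record_score": min_score,   # only records at or above this score
        "sort": "record_score",
        "desc": True,                # highest scores first
        "page": {"from": 0, "size": size},
    }
    return "POST", path, body


def records_for_service(records, service_name):
    """Keep records whose partition, by, or over field value equals the service name."""
    keys = ("partition_field_value", "by_field_value", "over_field_value")
    return [r for r in records if any(r.get(k) == service_name for k in keys)]
```

The timestamps of the surviving records give the windows to reuse when narrowing Steps 4 and 5.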

Step 4: Review throughput, latency, and error rate

Use ES|QL against traces*apm*,traces*otel* or metrics*apm*,metrics*otel* for the service and time range to get throughput (e.g. req/min), latency (avg, p95, p99), error rate (failed/total or 5xx/total). Example: FROM traces*apm*,traces*otel* | WHERE service.name == "<service_name>" AND @timestamp >= ... AND @timestamp <= ... | STATS .... Compare to prior period or SLO targets. See Using ES|QL for APM metrics.
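Comparing to the prior period needs an equal-length window immediately before the investigated one, plus a relative change. A minimal sketch with illustrative helper names, assuming UTC timestamps in the format used elsewhere in this document:

```python
from datetime import datetime, timezone

FMT = "%Y-%m-%dT%H:%M:%SZ"


def _parse(ts):
    """Parse a UTC timestamp such as 2025-03-01T12:00:00Z."""
    return datetime.strptime(ts, FMT).replace(tzinfo=timezone.utc)


def prior_window(start, end):
    """Return the window of the same length that ends where `start` begins."""
    s, e = _parse(start), _parse(end)
    return (s - (e - s)).strftime(FMT), s.strftime(FMT)


def pct_change(current, previous):
    """Percentage change from `previous` to `current`; None when there is no baseline."""
    if previous == 0:
        return None
    return (current - previous) / previous * 100.0
```

Run the same ES|QL query once per window and feed the two results to `pct_change`; a p99 that rose 50% reads very differently from one that rose 2%.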

Step 5: Assess dependency health

Obtain dependency and service-map data via ES|QL on traces*apm*,traces*otel*/metrics*apm*,metrics*otel* (e.g. downstream service/span aggregations) or via APM route handlers in the Kibana repo that expose dependency/service-map data. For the service and time range, note downstream latency and error rate; flag slow or failing dependencies as likely causes.
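Once downstream aggregations are available, flagging slow or failing dependencies is a simple threshold pass. The row shape and both thresholds below are hypothetical; substitute whatever your dependency query returns and whatever limits fit the service.

```python
def flag_dependencies(rows, latency_ms_threshold=500.0, error_rate_threshold=0.05):
    """Return (dependency, reasons) pairs for dependencies that look slow or failing.

    Each row is assumed to look like:
    {"dependency": "postgres", "avg_latency_ms": 820.0, "error_rate": 0.01}
    """
    flagged = []
    for row in rows:
        reasons = []
        if row["avg_latency_ms"] > latency_ms_threshold:
            reasons.append("slow")
        if row["error_rate"] > error_rate_threshold:
            reasons.append("failing")
        if reasons:
            flagged.append((row["dependency"], reasons))
    return flagged
```

Fixed thresholds are a blunt instrument; comparing each dependency to its own prior period is usually a better signal when history is available.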

Step 6: Correlate with infrastructure and logs

  • APM Correlations (when only a subpopulation is affected): Run node skills/observability/service-health/scripts/apm-correlations.js latency-correlations|failed-correlations --service-name <name> [--start ...] [--end ...] to get correlated attributes. Filter by those attributes and fetch trace samples or errors to confirm root cause. See APM Correlations script.
  • Infrastructure: Use resource attributes from traces (e.g. k8s.pod.name, container.id, host.name) and query infrastructure/metrics indices with ES|QL or Elasticsearch for CPU and memory. OOM and CPU throttling directly impact APM health; correlate their time windows with APM degradation.
  • Logs: Use ES|QL or Elasticsearch on log indices with service.name == "<service_name>" or trace.id == "<trace_id>" to explain behavior and root cause (exceptions, timeouts, restarts).
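When the correlations script is driven from code, building the command as an argument list avoids shell-quoting problems. This sketch uses only the subcommands and flags shown in this document; the function name and defaults are illustrative.

```python
SCRIPT = "skills/observability/service-health/scripts/apm-correlations.js"


def correlations_command(kind, service_name, start=None, end=None,
                         last_minutes=None, as_json=True):
    """Build the argv list for a latency or failed-transaction correlation run."""
    if kind not in ("latency-correlations", "failed-correlations"):
        raise ValueError(f"unknown correlation kind: {kind}")
    cmd = ["node", SCRIPT, kind, "--service-name", service_name]
    if start and end:
        # Prefer the explicit investigation window when both bounds are known.
        cmd += ["--start", start, "--end", end]
    elif last_minutes is not None:
        cmd += ["--last-minutes", str(last_minutes)]
    if as_json:
        cmd.append("--json")
    return cmd
```

The resulting list can be passed directly to `subprocess.run(cmd, capture_output=True, text=True)` without `shell=True`.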

Step 7: Summarize and recommend

State health (healthy / degraded / unhealthy) with reasons; list concrete next steps.
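The summary shape (one verdict, then evidence and next steps as bullets) can be kept consistent with a small formatter. The layout and labels are an illustrative choice, not a format the skill prescribes.

```python
VERDICTS = ("healthy", "degraded", "unhealthy")


def format_summary(verdict, evidence, next_steps):
    """Render a verdict with bulleted evidence and next steps."""
    if verdict not in VERDICTS:
        raise ValueError(f"verdict must be one of {VERDICTS}")
    lines = [f"Health: {verdict}", "Evidence:"]
    lines += [f"  • {item}" for item in evidence]
    lines.append("Next steps:")
    lines += [f"  • {item}" for item in next_steps]
    return "\n".join(lines)
```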

Examples

Example: ES|QL for a specific service

Scope with WHERE service.name == "<service_name>" and time range. Throughput and error rate (1-hour buckets; LIMIT caps rows and tokens):

FROM traces*apm*,traces*otel*
| WHERE service.name == "api-gateway"
  AND @timestamp >= "2025-03-01T00:00:00Z" AND @timestamp <= "2025-03-01T23:59:59Z"
| STATS request_count = COUNT(*), failures = COUNT(*) WHERE event.outcome == "failure" BY bucket = BUCKET(@timestamp, 1 hour)
// Convert to double so the ratio is not truncated by integer division
| EVAL error_rate = TO_DOUBLE(failures) / request_count
| SORT bucket
| LIMIT 500

Latency percentiles and exact field names: see Kibana trace_charts_definition.ts.
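For sending a query through POST _query, positional parameters keep user input out of the query text. The sketch below builds a request body for latency percentiles; the field name transaction.duration.us is an assumption (confirm it against the Kibana definitions mentioned above, especially for OpenTelemetry data), and parameter support should be checked against your Elasticsearch version.

```python
import json


def latency_request_body(service, start, end):
    """Build a POST _query body with `?` placeholders bound through `params`."""
    query = "\n".join([
        "FROM traces*apm*,traces*otel*",
        "| WHERE service.name == ?",
        "  AND @timestamp >= TO_DATETIME(?) AND @timestamp <= TO_DATETIME(?)",
        # Assumed latency field; see the lead-in.
        "| STATS avg_us = AVG(transaction.duration.us),",
        "        p95_us = PERCENTILE(transaction.duration.us, 95),",
        "        p99_us = PERCENTILE(transaction.duration.us, 99)",
    ])
    # Parameters are bound in order of appearance.
    return {"query": query, "params": [service, start, end]}


if __name__ == "__main__":
    body = latency_request_body("api-gateway", "2025-03-01T00:00:00Z", "2025-03-01T23:59:59Z")
    print(json.dumps(body, indent=2))
```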

Example: "Is service X healthy?"

  1. Resolve service X and time range. Call SLOs API and Alerting API; run ES|QL on traces*apm*,traces*otel*/metrics*apm*,metrics*otel* for throughput, latency, error rate; query dependency/service-map data (ES|QL or Kibana repo).
  2. Evaluate SLO status (violated/degrading?), firing rules, ML anomalies, and dependency health.
  3. Answer: Healthy / Degraded / Unhealthy with reasons and next steps (e.g. Observability Labs).

Example: "Why is service Y slow?"

  1. Resolve service Y and the time range of the slowness. Call SLOs API and Alerting API; run ES|QL for Y and dependencies; query ML anomaly results.
  2. Compare latency (avg/p95/p99) to prior period via ES|QL; from dependency data identify high-latency or failing deps.
  3. Summarize (e.g. p99 up; dependency Z elevated) and recommend (investigate Z; Observability Labs for latency).

Example: Correlate service to infrastructure (OpenTelemetry)

Use resource attributes on spans/traces to get the runtimes (pods, containers, hosts) for the service. Then check CPU and memory for those resources in the same time window as the APM issue:

  • From the service’s traces or metrics, read resource attributes such as k8s.pod.name, k8s.namespace.name, container.id, or host.name.
  • Run ES|QL or Elasticsearch search on infrastructure/metrics indices filtered by those resource values and the incident time range. Check CPU usage and memory consumption (e.g. system.cpu.total.norm.pct); look for OOMKilled events, CPU throttling, or sustained high CPU/memory that align with APM latency or error spikes.
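The two steps above can be sketched as: collect the distinct resource values seen on the service's trace documents, then turn them into an ES|QL filter for the infrastructure query. The documents are assumed to be flat dictionaries keyed by dotted field name (as ES|QL rows are); the helper names are illustrative.

```python
RESOURCE_FIELDS = ("k8s.pod.name", "k8s.namespace.name", "container.id", "host.name")


def collect_resources(docs):
    """Return the sorted distinct values of each resource attribute present in `docs`."""
    found = {field: set() for field in RESOURCE_FIELDS}
    for doc in docs:
        for field in RESOURCE_FIELDS:
            value = doc.get(field)
            if value:
                found[field].add(value)
    # Drop attributes that never appeared.
    return {field: sorted(values) for field, values in found.items() if values}


def in_clause(field, values):
    """Build an ES|QL `field IN (...)` condition with quoted, escaped values."""
    quoted = ", ".join(
        '"' + v.replace("\\", "\\\\").replace('"', '\\"') + '"' for v in values
    )
    return f"{field} IN ({quoted})"
```

The resulting condition goes into the WHERE clause of the metrics query together with the incident time range.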

Example: Filter logs by service or trace ID

To understand behavior for a specific service or a single trace, filter logs accordingly:

  • By service: Run ES|QL or Elasticsearch search on log indices with service.name == "<service_name>" and time range to get application logs (errors, warnings, restarts) in the service context.
  • By trace ID: When investigating a specific request, take the trace.id from the APM trace and filter logs by trace.id == "<trace_id>" (or equivalent field in your log schema). Logs with that trace ID show the full request path and help explain failures or latency.
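Both filters can share one query builder. This is a sketch under assumptions: the `logs-*` index pattern and the default row cap are placeholders, and the field names follow the ones used in this document; adjust them to your log schema.

```python
def _lit(value):
    """Quote a value as an ES|QL string literal."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def logs_query(start, end, service_name=None, trace_id=None,
               index="logs-*", limit=100):
    """Build an ES|QL log query filtered by service, trace ID, or both."""
    if not (service_name or trace_id):
        raise ValueError("provide service_name or trace_id")
    conditions = []
    if trace_id:
        conditions.append(f"trace.id == {_lit(trace_id)}")
    if service_name:
        conditions.append(f"service.name == {_lit(service_name)}")
    conditions += [f"@timestamp >= {_lit(start)}", f"@timestamp <= {_lit(end)}"]
    return "\n".join([
        f"FROM {index}",
        "| WHERE " + "\n  AND ".join(conditions),
        "| SORT @timestamp ASC",   # chronological order follows the request path
        f"| LIMIT {int(limit)}",
    ])
```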

Guidelines

  • Use Observability APIs (SLOs API, Alerting API) and ES|QL on traces*apm*,traces*otel*/metrics*apm*,metrics*otel* (8.11+ or Serverless), filtering by service.name (and service.environment when relevant).
  • For active APM alerts, call /api/alerting/rules/_find?search=apm&search_fields=tags&per_page=100&filter=alert.attributes.executionStatus.status:active. When checking one service, evaluate both rule types: rules where params.serviceName matches the target service, and rules where params.serviceName is absent (all-services rules). Treat either as applicable to the service before declaring health. Do not query .alerts* indices when determining currently active alerts; use the Alerting API response above as the source of truth.
  • For APM correlations, run the apm-correlations script (see APM Correlations script); for dependency/service-map data, use ES|QL or Kibana repo route handlers. For Elasticsearch index and search behavior, see the Elasticsearch APIs in the Elasticsearch repo.
  • Always use the user's time range when one is given. Fall back to the last hour only when none is provided, and do not assume it when the issue is historical.
  • When SLOs exist, anchor the health summary to SLO status and burn rate; when they do not, rely on alerts, anomalies, throughput, latency, error rate, and dependencies.
  • When analyzing only application metrics ingested via OpenTelemetry, use the ES|QL TS (time series) command for efficient metrics queries. The TS command is available in Elasticsearch 9.3+ and is always available in Elastic Observability Serverless.
  • Summary: one short health verdict plus bullet points for evidence and next steps.

How to use observability-service-health on Cursor

AI-first code editor with Composer

1. Prerequisites

Before installing skills in Cursor, ensure your development environment meets these requirements:

  • Cursor installed and configured on your development machine
  • Node.js version 16.0+ with npm package manager (verify with node --version)
  • Active project directory or workspace where you want to add observability-service-health

2. Execute installation command

Execute the skills CLI command in your project's root directory to begin installation:

$ npx skills add https://github.com/elastic/agent-skills --skill observability-service-health

The skills CLI fetches observability-service-health from GitHub repository elastic/agent-skills and configures it for Cursor.

3. Select Cursor when prompted

The CLI will show a list of available agents. Use arrow keys to navigate and space to select Cursor:

◆ Which agents do you want to install to?
│ ── Universal (.agents/skills) ── always included ────
│ • Amp
│ • Antigravity
│ • Cline
│ • Codex
│ ● Cursor (selected)
│ • Windsurf

4. Verify installation

Confirm successful installation by checking the skill directory location:

.cursor/skills/observability-service-health

Reload or restart Cursor to activate observability-service-health. Access the skill through slash commands (e.g., /observability-service-health) or your agent's skill management interface.

Security & Verification Notice

We perform automated surface-level scans (Gen AI Scanner, Socket, Snyk) during installation. These checks detect common vulnerabilities but do not guarantee complete security. Always review skill source code and verify the publisher's reputation before production use.

Skills execute code in your development environment. Always verify the publisher's identity, review recent commits, and test in isolated environments before production deployment.


