observability-llm-obs

elastic/agent-skills · updated Apr 8, 2026

MDX-style export adds YAML metadata + attribution linking explainx.ai and this canonical listing URL.

$npx skills add https://github.com/elastic/agent-skills --skill observability-llm-obs
0 commentsdiscussion
summary

Answer user questions about monitoring LLMs and agentic components using data ingested into Elastic only. Focus on

  • LLM performance, cost and token utilization, response quality, and call chaining or agentic workflow orchestration. Use
  • ES|QL, Elasticsearch APIs, and (where needed) Kibana APIs. Do not rely on Kibana UI; the skill works without it. A
  • given deployment typically uses one or more ingestion paths (APM/OTLP traces and/or integration metrics/logs)—
  • discover what is available before query
skill.md

LLM and Agentic Observability

Answer user questions about monitoring LLMs and agentic components using data ingested into Elastic only. Focus on LLM performance, cost and token utilization, response quality, and call chaining or agentic workflow orchestration. Use ES|QL, Elasticsearch APIs, and (where needed) Kibana APIs. Do not rely on Kibana UI; the skill works without it. A given deployment typically uses one or more ingestion paths (APM/OTLP traces and/or integration metrics/logs)— discover what is available before querying.

Where to look

  • Trace and metrics data (APM / OTel): Trace data in Elastic is stored in traces* when collected by the Elastic APM Agent, and in traces-generic.otel-default (and similar) when collected by OpenTelemetry. Use the generic pattern traces* to find all trace data regardless of source. When the application is instrumented with OpenTelemetry (e.g. Elastic Distributions of OpenTelemetry (EDOT), OpenLLMetry, OpenLIT, Langtrace exporting to OTLP), LLM and agent spans land in these trace data streams; metrics may land in metrics-apm* or metrics-generic. Query traces* and metrics* data streams for per-request and aggregated LLM signals.
  • Integration metrics and logs: When the user collects data via Elastic LLM integrations (OpenAI, Azure OpenAI, Azure AI Foundry, Amazon Bedrock, Bedrock AgentCore, GCP Vertex AI, etc.), metrics and logs go to integration data streams (e.g. metrics*, logs* with dataset/namespace per integration). Check which data streams exist.
  • Discover first: Use Elasticsearch to list data streams or indices (e.g. GET _data_stream, or GET traces*/_mapping, GET metrics*/_mapping) and optionally sample a document to see which LLM-related fields are present. Do not assume both APM and integration data exist.
  • ES|QL: Use the elasticsearch-esql skill for ES|QL syntax, commands, and query patterns when building queries against traces* or metrics data streams.
  • Alerts and SLOs: Use the Observability APIs SLOs API (Stack | Serverless) and Alerting API (Stack | Serverless) to find SLOs and alerting rules that target LLM-related data (e.g. services backed by traces*, or integration metrics). Firing alerts or violated/degrading SLOs point to potential degraded performance.

Data available in Elastic

From traces and metrics (traces*, metrics-apm* / metrics-generic)

Spans from OTel/EDOT (and compatible SDKs) carry span attributes that may follow OpenTelemetry GenAI semantic conventions or provider-specific names. In Elasticsearch, attributes typically appear under span.attributes (exact key names depend on ingestion). Common attributes:

Purpose Example attribute names (OTel GenAI)
Operation / provider gen_ai.operation.name, gen_ai.provider.name
Model gen_ai.request.model, gen_ai.response.model
Token usage gen_ai.usage.input_tokens, gen_ai.usage.output_tokens
Request config gen_ai.request.temperature, gen_ai.request.max_tokens
Errors error.type
Conversation / agent gen_ai.conversation.id; tool/agent spans as child spans

Cost is not in the OTel spec; some instrumentations add custom attributes (e.g. llm.response.cost.usd_estimate). Discover actual field names from the index mapping or a sample document (e.g. span.attributes.* or flattened keys).

Use duration and event.outcome on spans for latency and success/failure. Use trace.id, span.id, and parent/child span relationships to analyze call chaining and agentic workflows (e.g. one root span, multiple LLM or tool-call child spans).

From LLM integrations

Integrations (OpenAI, Azure OpenAI, Azure AI Foundry, Bedrock, Bedrock AgentCore, Vertex AI, etc.) ship metrics (and where supported logs) to Elastic. Metrics typically include token usage, request counts, latency, and—where the integration supports it—cost-related fields. Logs may include prompt/response or guardrail events. Exact field names and data streams are defined by each integration package; discover them from the integration docs or from the target data stream mapping.

Determine what data is available

  1. List data streams: GET _data_stream and filter for traces*, metrics-apm* (or metrics*), and metrics-* / logs-* that match known LLM integration datasets (e.g. from Elastic LLM observability).
  2. Inspect trace indices: For traces*, run a small search or use mapping to see if spans contain gen_ai.* or llm.* (or similar) attributes. Confirm presence of token, model, and duration fields.
  3. Inspect integration indices: For metrics/logs data streams, check mapping or one document to see token, cost, latency, and model dimensions.
  4. Use one source per use case: If both APM and integration data exist, prefer one consistent source for a given question (e.g. use traces for per-request chain analysis, integration metrics for aggregate token/cost).
  5. Check alerts and SLOs: Use the SLOs API and Alerting API to list SLOs and alerting rules that target LLM-related services or integration metrics, and to get open or recently fired alerts. Firing alerts or SLOs in degrading/violated status point to potential degraded performance.

Use cases and query patterns

LLM performance (latency, throughput, errors)

  • Traces: ES|QL on traces* filtered by span attributes (e.g. gen_ai.operation.name or gen_ai.provider.name when present). Compute throughput (count per time bucket), latency (e.g. duration.us or span duration), and error rate (event.outcome == "failure") by model, service, or time.
  • Integrations: Query integration metrics for request rate, latency, and error metrics by model/dimension as exposed by the integration.

Cost and token utilization

  • Traces: Aggregate from spans in traces*: sum gen_ai.usage.input_tokens and gen_ai.usage.output_tokens (or equivalent attribute names) by time, model, or service. If a cost attribute exists (e.g. custom llm.response.cost.*), sum it for cost views.
  • Integrations: Use integration metrics that expose token counts and/or cost; aggregate by time and model.

Response quality and safety

  • Traces: Use event.outcome, error.type, and span attributes (e.g. gen_ai.response.finish_reasons) in traces* to identify failures, timeouts, or content filters. Correlate with prompts/responses if captured in attributes (e.g. gen_ai.input.messages, gen_ai.output.messages) and not redacted.
  • Integrations: Query integration logs for guardrail blocks, content filter events, or policy violations (e.g. Bedrock Guardrails) using the fields defined by that integration.

Call chaining and agentic workflow orchestration

  • Traces only: Use trace hierarchy in traces*. Filter by root service or trace attributes; group by trace.id and use parent/child span relationships (e.g. parent.id, span.id) to reconstruct chains (e.g. orchestration span → multiple LLM or tool-call spans). Aggregate by span name or gen_ai.operation.name to see distribution of steps (e.g. retrieval, LLM, tool use). Duration per span and per trace gives bottleneck and end-to-end latency.

Using ES|QL for LLM data

  • Availability: ES|QL is available in Elasticsearch 8.11+ (GA in 8.14) and in Elastic Observability Serverless.
  • Scoping: Always restrict by time range (@timestamp). When present, add service.name and optionally service.environment. For LLM-specific spans, filter by span attributes once you know the field names (e.g. a keyword field for gen_ai.provider.name or gen_ai.operation.name).
  • Performance: Use LIMIT, coarse time buckets when only trends are needed, and avoid full scans over large windows.

Workflow

LLM observability progress:
- [ ] Step 1: Determine available data (traces*, metrics-apm* or metrics*, or integration data streams)
- [ ] Step 2: Discover LLM-related field names (mapping or sample doc)
- [ ] Step 3: Run ES|QL or Elasticsearch queries for the user's question (performance, cost, quality, orchestration)
- [ ] Step 4: Check for active alerts or SLOs defined on LLM-related data (Alerting API, SLOs API); field names from
        Step 2 help identify related rules; firing alerts or violated/degrading SLOs indicate potential degraded performance
- [ ] Step 5: Summarize findings from ingested data only; include alert/SLO status when relevant

Examples

Example: Token usage over time from traces

Assume span attributes are available as span.attributes.gen_ai.usage.input_tokens and span.attributes.gen_ai.usage.output_tokens (adjust to actual field names from mapping):

FROM traces*
| WHERE @timestamp >= "2025-03-01T00:00:00Z" AND @timestamp <= "2025-03-01T23:59:59Z"
  AND span.attributes.gen_ai.provider.name IS NOT NULL
| STATS
    input_tokens = SUM(span.attributes.gen_ai.usage.input_tokens),
    output_tokens = SUM(span.attributes.gen_ai.usage.output_tokens)
  BY BUCKET(@timestamp, 1 hour), span.attributes.gen_ai.request.model
| SORT @timestamp
| LIMIT 500

Example: Latency and error rate by model

FROM traces*
| WHERE @timestamp >= "2025-03-01T00:00:00Z" AND @timestamp <= "2025-03-01T23:59:59Z"
  AND span.attributes.gen_ai.request.model IS NOT NULL
| STATS
    request_count = COUNT(*),
    failures = COUNT(*) WHERE event.outcome == "failure",
    avg_duration_us = AVG(span.duration.us)
  BY span.attributes.gen_ai.request.model
| EVAL error_rate = failures / request_count
| LIMIT 100

Example: Agentic workflow (trace-level view)

Get trace IDs that contain at least one LLM span and count spans per trace to see chain length:

FROM traces*
| WHERE @timestamp >= "2025-03-01T00:00:00Z" AND @timestamp <= "2025-03-01T23:59:59Z"
  AND span.attributes.gen_ai.operation.name IS NOT NULL
| STATS span_count = COUNT(*), total_duration_us = SUM(span.duration.us) BY trace.id
| WHERE span_count > 1
| SORT total_duration_us DESC
| LIMIT 50

Example: Integration metrics (Amazon Bedrock AgentCore)

The Amazon Bedrock AgentCore integration ships metrics to the metrics-aws_bedrock_agentcore.metrics-* data stream (time series index). Use TS for aggregations on time series data streams (Elasticsearch 9.2+); use a time range with TRANGE (9.3+). The integration’s dashboards and alerting rule templates Example: token usage (counter), invocations (counter), and average latency (gauge) by hour and agent:

TS metrics-aws_bedrock_agentcore.metrics-*
| WHERE TRANGE(7 days)
  AND aws.dimensions.Operation == "InvokeAgentRuntime"
| STATS
    total_tokens = SUM(RATE(aws.bedrock_agentcore.metrics.TokenCount.sum)),
    total_invocations = SUM(RATE(aws.bedrock_agentcore.metrics.Invocations.sum)),
    avg_latency_ms = AVG(AVG_OVER_TIME(aws.bedrock_agentcore.metrics.Latency.avg))
  BY TBUCKET(1 hour), aws.bedrock_agentcore.agent_name
| SORT TBUCKET(1 hour) DESC

For Elasticsearch 8.x or when TS is not available, use FROM with BUCKET(@timestamp, 1 hour) and SUM/AVG over the metric fields (as in the integration's alert rule templates). For other LLM integrations (OpenAI, Azure OpenAI, Vertex AI, etc.), use that integration’s data stream index pattern and field names from its package (see Elastic LLM observability).

Guidelines

  • Data only in Elastic: Use only data collected and stored in Elastic (traces in traces*, metrics, or integration metrics/logs). Do not describe or rely on other vendors’ UIs or products.
  • One technology per customer: Assume a single ingestion path per deployment when answering; discover which (traces vs integration) exists and use it consistently for the question.
  • Discover field names: Before writing ES|QL or Query DSL, confirm LLM-related attribute or metric names from _mapping or a sample document; naming may differ (e.g. gen_ai.* vs llm.* or integration-specific fields).
  • No Kibana UI dependency: Prefer ES|QL and Elasticsearch APIs; use Kibana APIs only when needed (e.g. SLO, alerting). Do not instruct the user to open Kibana UI.
  • References: LLM and agentic AI observability, Observability Labs – LLM Observability, OpenTelemetry GenAI spans. For ES|QL syntax and query patterns, use the elasticsearch-esql skill, or look through ES|QL TS command reference for Elastic v9.3 or higher and for Serverless, and look through ES|QL FROM command reference for other Elastic versions.
how to use observability-llm-obs

How to use observability-llm-obs on Cursor

AI-first code editor with Composer

1

Prerequisites

Before installing skills in Cursor, ensure your development environment meets these requirements:

  • Cursor installed and configured on your development machine
  • Node.js version 16.0+ with npm package manager (verify with node --version)
  • Active project directory or workspace where you want to add observability-llm-obs
2

Execute installation command

Execute the skills CLI command in your project's root directory to begin installation:

$npx skills add https://github.com/elastic/agent-skills --skill observability-llm-obs

The skills CLI fetches observability-llm-obs from GitHub repository elastic/agent-skills and configures it for Cursor.

3

Select Cursor when prompted

The CLI will show a list of available agents. Use arrow keys to navigate and space to select Cursor:

◆ Which agents do you want to install to?
│ ── Universal (.agents/skills) ── always included ────
│ • Amp
│ • Antigravity
│ • Cline
│ • Codex
│ ●Cursor(selected)
│ • Cursor
│ • Windsurf
4

Verify installation

Confirm successful installation by checking the skill directory location:

.cursor/skills/observability-llm-obs

Reload or restart Cursor to activate observability-llm-obs. Access the skill through slash commands (e.g., /observability-llm-obs) or your agent's skill management interface.

Security & Verification Notice

We perform automated surface-level scans (Gen AI Scanner, Socket, Snyk) during installation. These checks detect common vulnerabilities but do not guarantee complete security. Always review skill source code and verify the publisher's reputation before production use.

Skills execute code in your development environment. Always verify the publisher's identity, review recent commits, and test in isolated environments before production deployment.

List & Monetize Your Skill

Submit your Claude Code skill and start earning

GET_STARTED →

Use Cases

Task Automation & Efficiency

Automate repetitive workflows and reduce manual effort

Example

Generate reports, summarize documents, draft communications

Save 3-5 hours per week on routine tasks

Knowledge Enhancement

Learn new skills, understand complex topics, get expert guidance

Example

Explain concepts, provide examples, suggest learning resources

Accelerate learning and skill development by 2x

Quality Improvement

Enhance output quality through reviews, suggestions, and refinements

Example

Review drafts, suggest improvements, catch errors

Improve work quality by 30-40% with less effort

Implementation Guide

Prerequisites

  • Claude Desktop or compatible AI client with skill support
  • Clear understanding of task or problem to solve
  • Willingness to iterate and refine outputs

Time Estimate

15-45 minutes depending on use case complexity

Installation Steps

  1. 1.Install skill using provided installation command
  2. 2.Test with simple use case relevant to your work
  3. 3.Evaluate output quality and relevance
  4. 4.Iterate on prompts to improve results
  5. 5.Integrate into regular workflow if valuable

Common Pitfalls

  • Expecting perfect results without iteration
  • Not providing enough context in prompts
  • Using skill for tasks outside its intended scope
  • Accepting outputs without review and validation

Best Practices

✓ Do

  • +Start with clear, specific prompts
  • +Provide relevant context and constraints
  • +Review and refine all outputs before using
  • +Iterate to improve output quality
  • +Document successful prompt patterns

✗ Don't

  • Don't use without understanding skill limitations
  • Don't skip validation of outputs
  • Don't share sensitive information in prompts
  • Don't expect skill to replace human judgment

💡 Pro Tips

  • Be specific about desired format and style
  • Ask for multiple options to choose from
  • Request explanations to understand reasoning
  • Combine AI efficiency with human expertise

When to Use This

✓ Use When

Use when skill capabilities match your task, clear ROI on time saved, and you can validate outputs. Best for repetitive tasks, learning, and quality improvement.

✗ Avoid When

Avoid when task requires deep expertise you can't validate, involves sensitive decisions, or when learning process is more valuable than speed of completion.

Learning Path

  1. 1Familiarize yourself with skill capabilities and limitations
  2. 2Start with low-risk, non-critical tasks
  3. 3Progress to more complex and valuable use cases
  4. 4Build expertise through regular use and experimentation

Discussion

Product Hunt–style comments (not star reviews)
  • No comments yet — start the thread.
general reviews

Ratings

4.769 reviews
  • Jin Khan· Dec 28, 2024

    Solid pick for teams standardizing on skills: observability-llm-obs is focused, and the summary matches what you get after install.

  • Neel Martinez· Dec 24, 2024

    observability-llm-obs has been reliable in day-to-day use. Documentation quality is above average for community skills.

  • Alexander Dixit· Dec 24, 2024

    observability-llm-obs reduced setup friction for our internal harness; good balance of opinion and flexibility.

  • Isabella Kapoor· Dec 20, 2024

    Keeps context tight: observability-llm-obs is the kind of skill you can hand to a new teammate without a long onboarding doc.

  • Amina Iyer· Dec 16, 2024

    observability-llm-obs fits our agent workflows well — practical, well scoped, and easy to wire into existing repos.

  • Ishan Shah· Dec 12, 2024

    observability-llm-obs has been reliable in day-to-day use. Documentation quality is above average for community skills.

  • Isabella Verma· Dec 8, 2024

    observability-llm-obs is among the better-maintained entries we tried; worth keeping pinned for repeat workflows.

  • Ganesh Mohane· Dec 4, 2024

    Solid pick for teams standardizing on skills: observability-llm-obs is focused, and the summary matches what you get after install.

  • Sakshi Patil· Nov 23, 2024

    We added observability-llm-obs from the explainx registry; install was straightforward and the SKILL.md answered most questions upfront.

  • Jin Torres· Nov 19, 2024

    We added observability-llm-obs from the explainx registry; install was straightforward and the SKILL.md answered most questions upfront.

showing 1-10 of 69

1 / 7