Productivity

distributed-tracing

sickn33/antigravity-awesome-skills · updated Apr 8, 2026

$npx skills add https://github.com/sickn33/antigravity-awesome-skills --skill distributed-tracing
summary

Implement distributed tracing with Jaeger and Tempo for request flow visibility across microservices.

skill.md

Distributed Tracing

Implement distributed tracing with Jaeger and Tempo for request flow visibility across microservices.

Do not use this skill when

  • The task is unrelated to distributed tracing
  • You need a different domain or tool outside this scope

Instructions

  • Clarify goals, constraints, and required inputs.
  • Apply relevant best practices and validate outcomes.
  • Provide actionable steps and verification.
  • If detailed examples are required, open resources/implementation-playbook.md.

Purpose

Track requests across distributed systems to understand latency, dependencies, and failure points.

Use this skill when

  • Debug latency issues
  • Understand service dependencies
  • Identify bottlenecks
  • Trace error propagation
  • Analyze request paths

Distributed Tracing Concepts

Trace Structure

Trace (Request ID: abc123)
Span (frontend) [100ms]
Span (api-gateway) [80ms]
  ├→ Span (auth-service) [10ms]
  └→ Span (user-service) [60ms]
      └→ Span (database) [40ms]

Key Components

  • Trace - End-to-end request journey
  • Span - Single operation within a trace
  • Context - Metadata propagated between services
  • Tags - Key-value pairs for filtering
  • Logs - Timestamped events within a span

Jaeger Setup

Kubernetes Deployment

# Deploy Jaeger Operator
kubectl create namespace observability
kubectl create -f https://github.com/jaegertracing/jaeger-operator/releases/download/v1.51.0/jaeger-operator.yaml -n observability

# Deploy Jaeger instance
kubectl apply -f - <<EOF
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger
  namespace: observability
spec:
  strategy: production
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: http://elasticsearch:9200
  ingress:
    enabled: true
EOF

Docker Compose

version: '3.8'
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "5775:5775/udp"
      - "6831:6831/udp"
      - "6832:6832/udp"
      - "5778:5778"
      - "16686:16686"  # UI
      - "14268:14268"  # Collector
      - "14250:14250"  # gRPC
      - "9411:9411"    # Zipkin
    environment:
      - COLLECTOR_ZIPKIN_HOST_PORT=:9411

Reference: See references/jaeger-setup.md

Application Instrumentation

OpenTelemetry (Recommended)

Python (Flask)

from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from flask import Flask

# Initialize tracer
resource = Resource(attributes={SERVICE_NAME: "my-service"})
provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(JaegerExporter(
    agent_host_name="jaeger",
    agent_port=6831,
))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

# Instrument Flask
app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)

@app.route('/api/users')
def get_users():
    tracer = trace.get_tracer(__name__)

    with tracer.start_as_current_span("get_users") as span:
        span.set_attribute("user.count", 100)
        # Business logic
        users = fetch_users_from_db()
        return {"users": users}

def fetch_users_from_db():
    tracer = trace.get_tracer(__name__)

    with tracer.start_as_current_span("database_query") as span:
        span.set_attribute("db.system", "postgresql")
        span.set_attribute("db.statement", "SELECT * FROM users")
        # Database query
        return query_database()

Node.js (Express)

const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { ExpressInstrumentation } = require('@opentelemetry/instrumentation-express');

// Initialize tracer
const provider = new NodeTracerProvider({
  resource: { attributes: { 'service.name': 'my-service' } }
});

const exporter = new JaegerExporter({
  endpoint: 'http://jaeger:14268/api/traces'
});

provider.addSpanProcessor(new BatchSpanProcessor(exporter));
provider.register();

// Instrument libraries
registerInstrumentations({
  instrumentations: [
    new HttpInstrumentation(),
    new ExpressInstrumentation(),
  ],
});

const express = require('express');
const app = express();

app.get('/api/users', async (req, res) => {
  const tracer = trace.getTracer('my-service');
  const span = tracer.startSpan('get_users');

  try {
    const users = await fetchUsers();
    span.setAttributes({ 'user.count': users.length });
    res.json({ users });
  } finally {
    span.end();
  }
});

Go

package main

import (
    "context"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/jaeger"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.4.0"
)

func initTracer() (*sdktrace.TracerProvider, error) {
    exporter, err := jaeger.New(jaeger.WithCollectorEndpoint(
        jaeger.WithEndpoint("http://jaeger:14268/api/traces"),
    ))
    if err != nil {
        return nil, err
    }

    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exporter),
        sdktrace.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceNameKey.String("my-service"),
        )),
    )

    otel.SetTracerProvider(tp)
    return tp, nil
}

func getUsers(ctx context.Context) ([]User, error) {
    tracer := otel.Tracer("my-service")
    ctx, span := tracer.Start(ctx, "get_users")
    defer span.End()

    span.SetAttributes(attribute.String("user.filter", "active"))

    users, err := fetchUsersFromDB(ctx)
    if err != nil {
        span.RecordError(err)
        return nil, err
    }

    span.SetAttributes(attribute.Int("user.count", len(users)))
    return users, nil
}

Reference: See references/instrumentation.md

Context Propagation

HTTP Headers

traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: congo=t61rcWkgMzE

Propagation in HTTP Requests

Python

from opentelemetry.propagate import inject

headers = {}
inject(headers)  # Injects trace context

response = requests.get('http://downstream-service/api', headers=headers)

Node.js

const { propagation } = require('@opentelemetry/api');

const headers = {};
propagation.inject(context.active(), headers);

axios.get('http://downstream-service/api', { headers });

Tempo Setup (Grafana)

Kubernetes Deployment

apiVersion: v1
kind: ConfigMap
metadata:
  name: tempo-config
data:
  tempo.yaml: |
    server:
      http_listen_port: 3200

    distributor:
      receivers:
        jaeger:
          protocols:
            thrift_http:
            grpc:
        otlp:
          protocols:
            http:
            grpc:

    storage:
      trace:
        backend: s3
        s3:
          bucket: tempo-traces
          endpoint: s3.amazonaws.com

    querier:
      frontend_worker:
        frontend_address: tempo-query-frontend:9095
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tempo
spec:
  replicas: 1
  template:
    spec:
      containers:
      - name: tempo
        image: grafana/tempo:latest
        args:
          - -config.file=/etc/tempo/tempo.yaml
        volumeMounts:
        - name: config
          mountPath: /etc/tempo
      volumes:
      - name: config
        configMap:
          name: tempo-config

Reference: See assets/jaeger-config.yaml.template

Sampling Strategies

Probabilistic Sampling

# Sample 1% of traces
sampler:
  type: probabilistic
  param: 0.01

Rate Limiting Sampling

# Sample max 100 traces per second
sampler:
  type: ratelimiting
  param: 100

Adaptive Sampling

from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample based on trace ID (deterministic)
sampler = ParentBased(root=TraceIdRatioBased(0.01))

Trace Analysis

Finding Slow Requests

Jaeger Query:

service=my-service
duration > 1s

Finding Errors

Jaeger Query:

service=my-service
error=true
tags.http.status_code >= 500

Service Dependency Graph

Jaeger automatically generates service dependency graphs showing:

  • Service relationships
  • Request rates
  • Error rates
  • Average latencies

Best Practices

  1. Sample appropriately (1-10% in production)
  2. Add meaningful tags (user_id, request_id)
  3. Propagate context across all service boundaries
  4. Log exceptions in spans
  5. Use consistent naming for operations
  6. Monitor tracing overhead (<1% CPU impact)
  7. Set up alerts for trace errors
  8. Implement distributed context (baggage)
  9. Use span events for important milestones
  10. Document instrumentation standards

Integration with Logging

Correlated Logs

import logging
from opentelemetry import trace

logger = logging.getLogger(__name__)

def process_request():
    span = trace.get_current_span()
    trace_id = span.get_span_context().trace_id

    logger.info(
        "Processing request",
        extra={"trace_id": format(trace_id, '032x')}
    )

Troubleshooting

No traces appearing:

  • Check collector endpoint
  • Verify network connectivity
  • Check sampling configuration
  • Review application logs

High latency overhead:

  • Reduce sampling rate
  • Use batch span processor
  • Check exporter configuration

Reference Files

  • references/jaeger-setup.md - Jaeger installation
  • references/instrumentation.md - Instrumentation patterns
  • assets/jaeger-config.yaml.template - Jaeger configuration

Related Skills

  • prometheus-configuration - For metrics
  • grafana-dashboards - For visualization
  • slo-implementation - For latency SLOs
general reviews

Ratings

4.510 reviews
  • Shikha Mishra· Oct 10, 2024

    distributed-tracing is among the better-maintained entries we tried; worth keeping pinned for repeat workflows.

  • Piyush G· Sep 9, 2024

    Keeps context tight: distributed-tracing is the kind of skill you can hand to a new teammate without a long onboarding doc.

  • Chaitanya Patil· Aug 8, 2024

    Registry listing for distributed-tracing matched our evaluation — installs cleanly and behaves as described in the markdown.

  • Sakshi Patil· Jul 7, 2024

    distributed-tracing reduced setup friction for our internal harness; good balance of opinion and flexibility.

  • Ganesh Mohane· Jun 6, 2024

    I recommend distributed-tracing for anyone iterating fast on agent tooling; clear intent and a small, reviewable surface area.

  • Oshnikdeep· May 5, 2024

    Useful defaults in distributed-tracing — fewer surprises than typical one-off scripts, and it plays nicely with `npx skills` flows.

  • Dhruvi Jain· Apr 4, 2024

    distributed-tracing has been reliable in day-to-day use. Documentation quality is above average for community skills.

  • Rahul Santra· Mar 3, 2024

    Solid pick for teams standardizing on skills: distributed-tracing is focused, and the summary matches what you get after install.

  • Pratham Ware· Feb 2, 2024

    We added distributed-tracing from the explainx registry; install was straightforward and the SKILL.md answered most questions upfront.

  • Yash Thakker· Jan 1, 2024

    distributed-tracing fits our agent workflows well — practical, well scoped, and easy to wire into existing repos.