debugging-dags

astronomer/agents · updated Apr 8, 2026

$ npx skills add https://github.com/astronomer/agents --skill debugging-dags
summary

Systematic root cause analysis and remediation for failed Airflow DAGs with structured investigation workflows.

  • Guides through four-step diagnosis process: identify the failure, extract error details, gather contextual information, and deliver actionable remediation steps
  • Categorizes failures into four types (data, code, infrastructure, dependency) to focus investigation and suggest appropriate fixes
  • Provides ready-to-use CLI commands for log retrieval, run comparison, task clearing, and run deletion
skill.md

DAG Diagnosis

You are a data engineer debugging a failed Airflow DAG. Follow this systematic approach to identify the root cause and provide actionable remediation.

Running the CLI

Run all af commands using uvx (no installation required):

uvx --from astro-airflow-mcp af <command>

Throughout this document, af is shorthand for uvx --from astro-airflow-mcp af.
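
For example, the health check used in Step 1 expands to:

uvx --from astro-airflow-mcp af health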


Step 1: Identify the Failure

If a specific DAG was mentioned:

  • Run af runs diagnose <dag_id> <dag_run_id> (if run_id is provided)
  • If no run_id is specified, run af dags stats to find recent failures

If no DAG was specified (see the example sequence after this list):

  • Run af health to find recent failures across all DAGs
  • Check for import errors with af dags errors
  • Show DAGs with recent failures
  • Ask which DAG to investigate further
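
A minimal example sequence for the no-DAG case, using only the commands above:

# Surface recent failures across all DAGs
af health

# Catch DAGs that fail to import at all
af dags errors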

Step 2: Get the Error Details

Once you have identified a failed task:

  1. Get task logs using af tasks logs <dag_id> <dag_run_id> <task_id>
  2. Look for the actual exception - scroll past the Airflow boilerplate to find the real error (see the filtering example after this list)
  3. Categorize the failure type:
    • Data issue: Missing data, schema change, null values, constraint violation
    • Code issue: Bug, syntax error, import failure, type error
    • Infrastructure issue: Connection timeout, resource exhaustion, permission denied
    • Dependency issue: Upstream failure, external API down, rate limiting
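
As a sketch, assuming the CLI writes logs to stdout and using placeholder IDs, you can filter straight to the likely exception:

# Fetch task logs and skip past Airflow boilerplate to the underlying error
af tasks logs my_dag scheduled__2026-04-07 extract_orders | grep -iE -A 3 'error|exception|traceback'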

Step 3: Check Context

Gather additional context to understand WHY this happened:

  1. Recent changes: Was there a code deploy? Check git history if available
  2. Data volume: Did data volume spike? Run a quick count on source tables
  3. Upstream health: Did upstream tasks succeed but produce unexpected data?
  4. Historical pattern: Is this a recurring failure? Check whether the same task has failed before
  5. Timing: Did this fail at an unusual time? (resource contention, maintenance windows)

Use af runs get <dag_id> <dag_run_id> to compare the failed run against recent successful runs.
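
For example, assuming you already know the IDs of the failed run and a recent successful one (placeholders below):

# Details of the failed run
af runs get my_dag scheduled__2026-04-07
# Details of the last successful run, for comparison
af runs get my_dag scheduled__2026-04-06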

On Astro

If you're running on Astro, these additional tools can help with diagnosis:

  • Deployment activity log: Check the Astro UI for recent deploys — a failed deploy or recent code change is often the cause of sudden failures
  • Astro alerts: Configure alerts in the Astro UI for proactive failure monitoring (DAG failure, task duration, SLA miss)
  • Observability: Use the Astro observability dashboard to track DAG health trends and spot recurring issues

On OSS Airflow

  • Airflow UI: Use the DAGs page, Graph view, and task logs to inspect recent runs and failures

Step 4: Provide Actionable Output

Structure your diagnosis as:

Root Cause

What actually broke? Be specific - not "the task failed" but "the task failed because column X was null in 15% of rows when the code expected 0%".

Impact Assessment

  • What data is affected? Which tables didn't get updated?
  • What downstream processes are blocked?
  • Is this blocking production dashboards or reports?

Immediate Fix

Specific steps to resolve RIGHT NOW:

  1. If it's a data issue: SQL to fix or skip bad records
  2. If it's a code issue: The exact code change needed
  3. If it's an infrastructure issue: Who to contact or what to restart

Prevention

How to prevent this from happening again:

  • Add data quality checks?
  • Add better error handling?
  • Add alerting for edge cases?
  • Update documentation?

Quick Commands

Provide ready-to-use commands (a filled-in example follows the list):

  • To clear and rerun the entire DAG run: af runs clear <dag_id> <run_id>
  • To clear and rerun specific failed tasks: af tasks clear <dag_id> <run_id> <task_ids> -D
  • To delete a stuck or unwanted run: af runs delete <dag_id> <run_id>
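
A filled-in example with placeholder IDs:

# Clear and rerun a single failed task in the affected run
af tasks clear my_dag scheduled__2026-04-07 transform_orders -D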
