debugging-dags
astronomer/agents · updated Apr 8, 2026
Systematic root cause analysis and remediation for failed Airflow DAGs with structured investigation workflows.
- Guides through four-step diagnosis process: identify the failure, extract error details, gather contextual information, and deliver actionable remediation steps
- Categorizes failures into four types (data, code, infrastructure, dependency) to focus investigation and suggest appropriate fixes
- Provides ready-to-use CLI commands for log retrieval, run comparison, task clearing, and run deletion
DAG Diagnosis
You are a data engineer debugging a failed Airflow DAG. Follow this systematic approach to identify the root cause and provide actionable remediation.
Running the CLI
Run all `af` commands using `uvx` (no installation required):
`uvx --from astro-airflow-mcp af <command>`
Throughout this document, `af` is shorthand for `uvx --from astro-airflow-mcp af`.
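For instance, checking overall instance health through `uvx` might look like the sketch below; the subcommands shown are the ones referenced later in this guide, and no additional flags are assumed:

```bash
# Check instance health without a permanent install;
# uvx fetches astro-airflow-mcp on demand and exposes the af CLI
uvx --from astro-airflow-mcp af health

# Any other af subcommand follows the same pattern, e.g. DAG-level stats
uvx --from astro-airflow-mcp af dags stats
```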
Step 1: Identify the Failure
If a specific DAG was mentioned:
- Run `af runs diagnose <dag_id> <dag_run_id>` (if run_id is provided)
- If no run_id specified, run `af dags stats` to find recent failures

If no DAG was specified:
- Run `af health` to find recent failures across all DAGs
- Check for import errors with `af dags errors`
- Show DAGs with recent failures
- Ask which DAG to investigate further
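As a rough illustration, the Step 1 triage might run as follows; the DAG name `daily_sales_etl` and the run ID are made-up placeholders:

```bash
# No DAG specified: look for recent failures across the whole instance
af health
af dags errors   # surface any import errors

# No run ID given: find recent failures with DAG stats
af dags stats

# Both DAG and run ID are known: diagnose that run directly
af runs diagnose daily_sales_etl manual__2026-04-08T09:00:00+00:00
```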
Step 2: Get the Error Details
Once you have identified a failed task:
- Get task logs using `af tasks logs <dag_id> <dag_run_id> <task_id>`
- Look for the actual exception - scroll past the Airflow boilerplate to find the real error
- Categorize the failure type:
  - Data issue: Missing data, schema change, null values, constraint violation
  - Code issue: Bug, syntax error, import failure, type error
  - Infrastructure issue: Connection timeout, resource exhaustion, permission denied
  - Dependency issue: Upstream failure, external API down, rate limiting
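A minimal log pull for the same placeholder run is sketched below; the DAG, run, and task IDs are hypothetical, and the `grep` filter is just one rough way to skip boilerplate, assuming the command writes logs to stdout:

```bash
# Full task log for the failed task
af tasks logs daily_sales_etl manual__2026-04-08T09:00:00+00:00 load_orders

# Same log, coarsely filtered to jump to the traceback;
# tune the pattern to whatever the real exception contains
af tasks logs daily_sales_etl manual__2026-04-08T09:00:00+00:00 load_orders \
  | grep -iE -A 5 "error|exception|traceback"
```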
Step 3: Check Context
Gather additional context to understand WHY this happened:
- Recent changes: Was there a code deploy? Check git history if available
- Data volume: Did data volume spike? Run a quick count on source tables
- Upstream health: Did upstream tasks succeed but produce unexpected data?
- Historical pattern: Is this a recurring failure? Check if same task failed before
- Timing: Did this fail at an unusual time? (resource contention, maintenance windows)
Use `af runs get <dag_id> <dag_run_id>` to compare the failed run against recent successful runs.
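To make the comparison concrete, a sketch with made-up run IDs (one failed, one previously successful) could be:

```bash
# The failed run
af runs get daily_sales_etl manual__2026-04-08T09:00:00+00:00

# A recent successful run, for side-by-side comparison
# (substitute a real run ID from your own recent runs)
af runs get daily_sales_etl scheduled__2026-04-07T09:00:00+00:00

# Diff the two outputs to spot what changed between the runs
```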
On Astro
If you're running on Astro, these additional tools can help with diagnosis:
- Deployment activity log: Check the Astro UI for recent deploys — a failed deploy or recent code change is often the cause of sudden failures
- Astro alerts: Configure alerts in the Astro UI for proactive failure monitoring (DAG failure, task duration, SLA miss)
- Observability: Use the Astro observability dashboard to track DAG health trends and spot recurring issues
On OSS Airflow
- Airflow UI: Use the DAGs page, Graph view, and task logs to inspect recent runs and failures
Step 4: Provide Actionable Output
Structure your diagnosis as:
Root Cause
What actually broke? Be specific - not "the task failed" but "the task failed because column X was null in 15% of rows when the code expected 0%".
Impact Assessment
- What data is affected? Which tables didn't get updated?
- What downstream processes are blocked?
- Is this blocking production dashboards or reports?
Immediate Fix
Specific steps to resolve RIGHT NOW:
- If it's a data issue: SQL to fix or skip bad records
- If it's a code issue: The exact code change needed
- If it's infra: Who to contact or what to restart
Prevention
How to prevent this from happening again:
- Add data quality checks?
- Add better error handling?
- Add alerting for edge cases?
- Update documentation?
Quick Commands
Provide ready-to-use commands:
- To clear and rerun the entire DAG run: `af runs clear <dag_id> <run_id>`
- To clear and rerun specific failed tasks: `af tasks clear <dag_id> <run_id> <task_ids> -D`
- To delete a stuck or unwanted run: `af runs delete <dag_id> <run_id>`
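Filled in with the same placeholder DAG and run IDs used in the earlier sketches, these look like:

```bash
# Clear and rerun the entire DAG run
af runs clear daily_sales_etl manual__2026-04-08T09:00:00+00:00

# Clear and rerun specific failed tasks
af tasks clear daily_sales_etl manual__2026-04-08T09:00:00+00:00 load_orders -D

# Delete a stuck or unwanted run
af runs delete daily_sales_etl manual__2026-04-08T09:00:00+00:00
```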