tao✦ Official

tao-run-on-slurm

Remote SLURM GPU cluster execution over SSH with sbatch/srun, Pyxis/Enroot containers, and Lustre-backed

nvidia/skills · Updated Jun 23, 2026

Works with

Claude Code, Cursor, Cline, Windsurf, Codex, Goose, GitHub Copilot, Zed

0 total installs · 0 this week · 1.7K GitHub stars · 0 upvotes

Install Skill

Run in your terminal

$ npx skills install nvidia/skills/tao-run-on-slurm

Installation Guide

How to use tao-run-on-slurm on Cursor

AI-first code editor with Composer

1. Prerequisites

Before installing skills in Cursor, ensure your development environment meets these requirements:

  • Cursor installed and configured on your machine
  • Node.js 16+ with npm — verify with node --version
  • Active project directory where you want to add tao-run-on-slurm

2. Run the install command

Execute the skills CLI command in your project's root directory to begin installation:

$ npx skills install nvidia/skills/tao-run-on-slurm

Fetches tao-run-on-slurm from nvidia/skills and configures it for Cursor.

3. Select Cursor when prompted

The CLI shows a list of agents. Use arrow keys and space to select Cursor:

◆ Which agents do you want to install to?
│ ── Universal (.agents/skills) ────────────────
│ · Cline · Codex · Goose · Windsurf
│ ● Cursor (selected)
│ · Cursor · Aider · Continue

4. Verify installation

Confirm successful installation by checking the skill directory location:

.cursor/skills/tao-run-on-slurm

Restart Cursor to activate tao-run-on-slurm. Access via /tao-run-on-slurm in your agent's command palette.

Security Notice

We perform automated surface-level scans (Gen AI Scanner, Socket, Snyk) during installation. These checks detect common vulnerabilities but do not guarantee complete security. Always review skill source code and verify the publisher's reputation before production use.

Skills execute code in your environment. Always review source, verify the publisher, and test in isolation before production.

Documentation

name: tao-run-on-slurm
description: Remote SLURM GPU cluster execution over SSH with sbatch/srun, Pyxis/Enroot containers, and Lustre-backed results. Use when running TAO training/eval/inference jobs on an on-prem or DGX SLURM cluster. Trigger phrases include "run on SLURM", "submit sbatch", "DGX SLURM cluster", "Pyxis/Enroot container", "Lustre dataset".
license: Apache-2.0
compatibility: Requires SSH access to a SLURM login node (passwordless via key auth) and SLURM_USER + SLURM_HOSTNAME env vars. The TAO SDK with the slurm extra (pip install 'nvidia-tao-sdk[slurm]') is needed only if you want Job handles, S3 I/O wrapping, or run-folder durability via ActionWorkflow.
metadata:
  author: NVIDIA Corporation
  version: "0.1.0"
allowed-tools: Read Bash
tags:
  - platform
  - slurm

SLURM

Remote GPU compute platform for clusters managed by SLURM. Jobs are submitted from the TAO service or SDK host to a login node over SSH, staged on a shared filesystem, submitted with sbatch, and executed with srun container support.

When to use

Use SLURM when the user has access to a managed GPU cluster, shared Lustre storage, and scheduler-owned GPU allocation. Do not use SLURM for local files that exist only on the agent machine; data and outputs must be reachable from the cluster.

Preflight + SSH

Confirm that SLURM_USER and SLURM_HOSTNAME are exported and that passwordless SSH to a login host works (ssh -o BatchMode=yes). Optionally install the TAO SDK wrapper (nvidia-tao-sdk[slurm], on public PyPI) for Job handles and S3 wrapping.

For private nvcr.io images, install ~/.config/enroot/.credentials on the cluster once per (cluster, user) pair. Pyxis/Enroot does not read NGC_KEY from the job environment, so without persistent credentials, auth-gated pulls fail with "Could not process JSON input" at job startup. Install the file via the printf | ssh heredoc so that the NGC_KEY value never lands in shell history, intermediate files, or chat output; never cat or echo the value.

If a preflight check fails, the agent prompts the user to authorize the install/fix via Bash. Pip-installable Python requirements are the exception: install them automatically, then rerun preflight.
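The environment and SSH checks described above can be sketched in Python as follows. This is a minimal illustration, not the preflight script from references/slurm-ssh-credentials.md: the function names, the ConnectTimeout value, and the remote `true` probe command are assumptions made for the example.

```python
import os
import subprocess

REQUIRED_ENV = ("SLURM_USER", "SLURM_HOSTNAME")


def missing_env(env):
    """Return the required variable names that are unset or empty."""
    return [name for name in REQUIRED_ENV if not env.get(name)]


def ssh_check_command(user, host):
    """Build a non-interactive SSH probe; BatchMode=yes disables password prompts."""
    return ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=10",
            f"{user}@{host}", "true"]


def preflight(env=None, run=subprocess.run):
    """Return (ok, message). Only host names are reported, never secrets."""
    env = os.environ if env is None else env
    missing = missing_env(env)
    if missing:
        return False, "missing env vars: " + ", ".join(missing)
    # SLURM_HOSTNAME may hold several comma-separated login hosts.
    hosts = [h.strip() for h in env["SLURM_HOSTNAME"].split(",") if h.strip()]
    for host in hosts:
        result = run(ssh_check_command(env["SLURM_USER"], host))
        if result.returncode == 0:
            return True, f"ssh ok: {host}"
    return False, "passwordless ssh failed on all login hosts"
```

Passing `run` as a parameter keeps the check testable without a real cluster; in normal use the default `subprocess.run` executes the probe.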

See references/slurm-ssh-credentials.md for the full preflight script, the enroot-credentials heredoc, prerequisite key setup (keypair, ssh-copy-id, known_hosts, container key mounts, 2FA handling), and the SSH failure remediation prompt.

Storage

Use shared-filesystem URIs, not local or file:// paths; tao-core rejects local/file paths for remote backends.

  • lustre:///absolute/path for user-provided datasets on Lustre.
  • slurm:// paths may appear in microservices metadata and are converted to Lustre paths before the container starts.

Accept either dataset roots (model skills map them to required files) or direct spec-key paths. After SSH succeeds and before generating scripts, test -e each required dataset path from the login host; if it fails, stop and ask for corrected paths or staged data rather than producing scripts that fail in the first training job. See references/slurm-ssh-credentials.md for root vs. direct-spec modes, backend details, and the results-dir default.
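A small sketch of the storage rule and the `test -e` check described above, assuming a helper that only accepts lustre:/// URIs. The function names are hypothetical and this is not the validation tao-core performs; slurm:// paths are left out because the source does not describe how they are converted.

```python
import shlex

LUSTRE_PREFIX = "lustre:///"


def lustre_uri_to_path(uri):
    """Map lustre:///abs/path to /abs/path; reject anything else."""
    if not uri.startswith(LUSTRE_PREFIX):
        raise ValueError(f"not a lustre:/// URI: {uri!r}")
    path = "/" + uri[len(LUSTRE_PREFIX):]
    if path == "/":
        raise ValueError("lustre URI has an empty path")
    return path


def existence_check_command(uri):
    """Build the remote `test -e` command string for one dataset path."""
    return "test -e " + shlex.quote(lustre_uri_to_path(uri))
```

The returned string is meant to be run on the login host over SSH; a non-zero exit status is the signal to stop and ask for corrected paths.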

Container execution

tao-core runs TAO containers through Pyxis/Enroot:

  1. Stage compact JSON files for specs, environment, and cloud metadata under <job_dir>/specs, <job_dir>/env, and <job_dir>/meta.
  2. Optionally convert the Docker image to a cached SQSH image with srun -n1 -p <conversion_partition> enroot import.
  3. Write an sbatch script under <job_dir>/sbatch/job_<job_id>.sbatch.
  4. Submit sbatch --export=ALL <script>.
  5. Run the container with srun --container-image=<image> --container-mounts=/lustre.
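Steps 3 to 5 can be pictured with the sketch below, which renders a small sbatch script as text. It is an illustration under assumptions, not the script tao-core writes: the function names and the exact set of #SBATCH directives are chosen for the example, although each flag shown is a standard sbatch or Pyxis option.

```python
import shlex


def hours_to_slurm_time(hours):
    """Convert fractional hours to the HH:MM:SS form accepted by --time."""
    total_seconds = int(round(hours * 3600))
    h, rem = divmod(total_seconds, 3600)
    m, s = divmod(rem, 60)
    return f"{h:02d}:{m:02d}:{s:02d}"


def render_sbatch(job_id, image, command, partition, num_nodes=1,
                  gpus_per_node=4, cpus_per_task=16, time_hours=4,
                  mounts="/lustre", account=None, requeue=True):
    """Return sbatch script text; the directive set is illustrative only."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name=job_{job_id}",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --nodes={num_nodes}",
        f"#SBATCH --gres=gpu:{gpus_per_node}",
        f"#SBATCH --cpus-per-task={cpus_per_task}",
        f"#SBATCH --time={hours_to_slurm_time(time_hours)}",
    ]
    if requeue:
        lines.append("#SBATCH --requeue")
    if account:
        lines.append(f"#SBATCH --account={account}")
    # Quote the image because the Pyxis form contains '#'.
    lines.append(
        "srun --container-image=" + shlex.quote(image)
        + " --container-mounts=" + shlex.quote(mounts)
        + " " + command
    )
    return "\n".join(lines) + "\n"
```

The `command` argument is inserted verbatim, so the caller is responsible for quoting it.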

Accepted image formats: /path/to/image.sqsh, registry#image:tag, docker://registry#image:tag, and ordinary registry/image:tag (converted to Pyxis form when needed). SQSH conversion is cached by image name; for :latest images the cached SQSH is reused unless force_reconvert_latest is enabled.
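One way to picture the image handling is the sketch below. It uses a simple heuristic (the first path component is treated as a registry host if it contains a dot or a port, or is localhost), which is an assumption for illustration and may differ from the conversion tao-core applies. The image name registry.example.com/team/image:1.0 in the usage note is a placeholder.

```python
def to_pyxis_image(image):
    """Return an image reference in a form Pyxis accepts (heuristic sketch)."""
    if image.startswith("/") and image.endswith(".sqsh"):
        return image  # local SQSH file, used as-is
    if image.startswith("docker://") or "#" in image:
        return image  # already in Pyxis/Enroot form
    head, sep, rest = image.partition("/")
    # Assumption: a first component with '.' or ':' (or localhost) is a registry host.
    if sep and ("." in head or ":" in head or head == "localhost"):
        return f"{head}#{rest}"
    return image


def needs_conversion(image, cache_exists, force_reconvert_latest=False):
    """Decide whether to (re)build the cached SQSH for this image name."""
    if not cache_exists:
        return True
    # A cached :latest image is reused unless reconversion is forced.
    return force_reconvert_latest and image.endswith(":latest")
```

For example, `to_pyxis_image("registry.example.com/team/image:1.0")` yields `registry.example.com#team/image:1.0`, while a path ending in .sqsh passes through unchanged.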

Monitoring and cancellation

  • Scheduler status comes from the stored SLURM job id via squeue/sacct; TAO terminal status comes from status.json in the shared results folder.
  • While chat monitoring is enabled, keep polling at the requested interval for any non-terminal job (PENDING, RUNNING, or otherwise). Do not stop after a fixed elapsed time such as 30 minutes; long queue waits are normal on shared GPU partitions.
  • Do not send a final response for a non-terminal SLURM job when chat monitoring is enabled. A final response is a detach action; use it only if the user asked to detach/stop or the job reached terminal state.
  • Logs are read over SSH from <job_dir>/slurm-logs/<slurm_job_name>-<slurm_job_id>/main.out and .err.
  • Cancel by looking up backend_details.slurm_metadata.slurm_job_id and running scancel <slurm_job_id> over SSH. Treat missing or already terminated jobs as successful cancellation.
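The polling rule above (no elapsed-time cap, stop only on a terminal state) can be sketched as follows. The set of terminal states is an assumption derived from the status mapping in this document, and `fetch_state` is a hypothetical callable that would wrap squeue/sacct over SSH.

```python
import time

# Assumption: states treated as terminal for polling purposes.
TERMINAL_STATES = frozenset({
    "COMPLETED", "FAILED", "BOOT_FAIL", "DEADLINE", "OUT_OF_MEMORY",
    "NODE_FAIL", "CANCELLED", "PREEMPTED", "REVOKED", "TIMEOUT",
})


def poll_until_terminal(fetch_state, interval_s, sleep=time.sleep,
                        on_update=print):
    """Poll at a fixed interval; only a terminal state ends the loop."""
    while True:
        state = fetch_state()
        on_update(state)
        if state in TERMINAL_STATES:
            return state
        sleep(interval_s)


def scancel_command(slurm_job_id):
    """Build the cancellation command to run over SSH."""
    return ["scancel", str(slurm_job_id)]
```

Because there is deliberately no timeout, a job that sits in PENDING for hours keeps being reported rather than being abandoned.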

Status mapping:

  • PENDING -> Pending
  • RUNNING or COMPLETING -> Running
  • COMPLETED -> check status.json
  • FAILED, BOOT_FAIL, DEADLINE, OUT_OF_MEMORY, NODE_FAIL -> retry if logs match retriable infrastructure patterns, otherwise Error
  • CANCELLED, PREEMPTED, REVOKED -> Canceled
  • TIMEOUT -> Error
  • SUSPENDED, STOPPED -> Paused
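The mapping above translates directly into a lookup, sketched here. The "Retry" and "Unknown" labels and the function signature are assumptions for the example; the source only says that retriable infrastructure failures are retried and that COMPLETED defers to status.json.

```python
SLURM_TO_TAO = {
    "PENDING": "Pending",
    "RUNNING": "Running",
    "COMPLETING": "Running",
    "CANCELLED": "Canceled",
    "PREEMPTED": "Canceled",
    "REVOKED": "Canceled",
    "TIMEOUT": "Error",
    "SUSPENDED": "Paused",
    "STOPPED": "Paused",
}
RETRY_CANDIDATES = {"FAILED", "BOOT_FAIL", "DEADLINE", "OUT_OF_MEMORY",
                    "NODE_FAIL"}


def map_status(slurm_state, tao_status=None, retriable=False):
    """Map a scheduler state to a TAO-facing status.

    tao_status is the value read from status.json (passed through for
    COMPLETED); retriable says whether the logs matched an
    infrastructure-failure pattern.
    """
    parts = slurm_state.split()
    if not parts:
        raise ValueError("empty SLURM state")
    # sacct can report "CANCELLED by 1234" or "CANCELLED+"; keep the bare state.
    state = parts[0].rstrip("+").upper()
    if state == "COMPLETED":
        return tao_status if tao_status is not None else "Unknown"
    if state in RETRY_CANDIDATES:
        return "Retry" if retriable else "Error"
    if state in SLURM_TO_TAO:
        return SLURM_TO_TAO[state]
    raise ValueError(f"unmapped SLURM state: {slurm_state!r}")
```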

Required inputs

Ask for these in the SLURM intake; see references/slurm-ssh-credentials.md for the full credential list, microservices schema keys, and defaults.

  • SLURM_USER (required): SSH username for the login node.
  • SLURM_HOSTNAME (required): Comma-separated login hostnames for failover.
  • SLURM_PARTITION (required): Partition list for GPU submission. The packaged default is polar,polar3,polar4,grizzly; all four are treated as 4-hour queues.
  • SSH_KEY_PATH (preferred, expected before launch): private key for non-interactive public-key auth. Ask for this first in remediation; prefer it over the SSH_AUTH_SOCK agent-socket fallback.
  • SLURM_BASE_RESULTS_DIR (optional): base shared-filesystem path; default /lustre/fsw/portfolios/edgeai/users/<your-dir> (your per-user Lustre dir).
  • SLURM_ACCOUNT (usually required by site policy): account for #SBATCH --account.

Do not ask for SLURM_ACCOUNT or SLURM_BASE_RESULTS_DIR in the initial intake unless the user says their site requires an account, wants a custom results root, or the workflow cannot proceed without overriding defaults.
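A minimal sketch of collecting these inputs from the environment is shown below. The `parse_intake` function and the returned dictionary keys are hypothetical; optional values are left as None so that packaged defaults apply downstream.

```python
DEFAULT_PARTITIONS = "polar,polar3,polar4,grizzly"


def _split_csv(value):
    """Split a comma-separated string, dropping blanks and whitespace."""
    return [item.strip() for item in value.split(",") if item.strip()]


def parse_intake(env):
    """Collect SLURM intake values from an environment-like mapping."""
    missing = [k for k in ("SLURM_USER", "SLURM_HOSTNAME") if not env.get(k)]
    if missing:
        raise ValueError("missing required inputs: " + ", ".join(missing))
    return {
        "user": env["SLURM_USER"],
        # Several login hosts may be listed for failover.
        "hostnames": _split_csv(env["SLURM_HOSTNAME"]),
        "partitions": _split_csv(env.get("SLURM_PARTITION")
                                 or DEFAULT_PARTITIONS),
        "ssh_key_path": env.get("SSH_KEY_PATH"),
        # Optional: None means "not asked for in the initial intake".
        "account": env.get("SLURM_ACCOUNT"),
        "base_results_dir": env.get("SLURM_BASE_RESULTS_DIR"),
    }
```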

Resource defaults

Defaults from tao-core:

  • num_nodes: 1
  • num_gpus: 4
  • max_num_gpus_per_node: 8
  • cpus_per_task: 16
  • time_hours: 4
  • timeout_hours: 3.8
  • max_time_hours: 4
  • container_mounts: /lustre
  • use_requeue: true
  • use_sqsh: true

When generating launchers or wrapper scripts for SLURM, set the wall-time defaults explicitly from the packaged platform resource defaults:

export SLURM_TIME_HOURS="${SLURM_TIME_HOURS:-4}"
export SLURM_TIMEOUT_HOURS="${SLURM_TIMEOUT_HOURS:-3.8}"

Do not default to 12 hours on SLURM. If the user supplies a longer SLURM_TIME_HOURS, verify that the selected partition supports it before submitting. For the packaged default partition list polar,polar3,polar4,grizzly, reject requests above 4 hours and ask for a different partition only if the user actually wants a longer wall time.
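The wall-time rule can be expressed as a small check, sketched here in Python. It mirrors the shell defaults above (an unset or empty variable falls back to the default) and rejects requests above 4 hours on the packaged default partitions; the function name and error text are assumptions, and limits for other partitions are not modeled.

```python
DEFAULT_PARTITIONS = frozenset({"polar", "polar3", "polar4", "grizzly"})
DEFAULT_PARTITION_MAX_HOURS = 4.0


def resolve_wall_time(env, partitions):
    """Return (time_hours, timeout_hours), enforcing the 4-hour queue limit."""
    # Like ${VAR:-default} in the shell, empty values fall back to the default.
    time_hours = float(env.get("SLURM_TIME_HOURS") or 4)
    timeout_hours = float(env.get("SLURM_TIMEOUT_HOURS") or 3.8)
    on_default_queue = any(p in DEFAULT_PARTITIONS for p in partitions)
    if on_default_queue and time_hours > DEFAULT_PARTITION_MAX_HOURS:
        raise ValueError(
            f"{time_hours}h exceeds the 4-hour limit of the default "
            "partitions; choose a partition that allows longer jobs"
        )
    return time_hours, timeout_hours
```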

When num_gpus is greater than or equal to max_num_gpus_per_node, the handler treats the request as exclusive per node and computes additional nodes from total GPU count when necessary.
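As a rough illustration of that rule, the sketch below assumes num_gpus is the total GPU count for the job and that extra nodes are obtained by rounding up; the actual handler in tao-core may compute this differently.

```python
import math


def plan_nodes(num_gpus=4, num_nodes=1, max_num_gpus_per_node=8):
    """Return (nodes, gpus_per_node, exclusive) under the stated assumptions."""
    exclusive = num_gpus >= max_num_gpus_per_node
    if exclusive:
        # Whole nodes are requested; add nodes until the total GPU count fits.
        nodes = max(num_nodes, math.ceil(num_gpus / max_num_gpus_per_node))
        gpus_per_node = max_num_gpus_per_node
    else:
        nodes, gpus_per_node = num_nodes, num_gpus
    return nodes, gpus_per_node, exclusive
```

With the defaults, a 4-GPU request stays on one shared node, while a 16-GPU request becomes two exclusive 8-GPU nodes.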

Multi-node, SDK, and retries

For multi-node jobs (num_nodes > 1), the SDK builds the sbatch directives and exports the PyTorch-distributed rendezvous env vars automatically: WORLD_SIZE, NUM_GPU_PER_NODE, NODE_RANK, MASTER_ADDR, and MASTER_PORT (29500). TAO entrypoints read WORLD_SIZE + NUM_GPU_PER_NODE and build torchrun internally. Cosmos-RL has special multi-node role handling for controller, policy, and rollout workers.

Use Lustre, not S3, for SLURM job inputs. The GPU allocation starts the moment the job is dispatched, so a long s3:// download at the top of the script burns the allocation, can get the job killed for GPU-idle, and is billed either way. Stage training data on the shared filesystem first and reference it as lustre:///.... S3/HF/NGC pre-fetch is fine for small auxiliary inputs (checkpoints, configs), not training datasets. K8s/Brev do not share this scheduler-idle constraint.

The SDK automatically retries infrastructure failures (NODE_FAIL, BOOT_FAIL, NCCL transport timeouts, CUDA driver init failures, GPU/IB link-down, OOM-killer node reaping, Xid errors) and keeps a stable user-facing Job.id across retries. Plain training failures surface immediately so that a broken spec does not consume the retry budget. #SBATCH --requeue is enabled by default via SLURM_USE_REQUEUE=true.
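The distinction between infrastructure failures and plain training failures can be sketched as a classifier with a retry budget. The regular expressions below are hypothetical stand-ins, not the patterns the SDK matches, and `max_retries` stands for the MAX_JOB_RETRIES budget whose value the source does not state.

```python
import re

# Scheduler states treated as infrastructure failures on their own.
INFRA_STATES = frozenset({"NODE_FAIL", "BOOT_FAIL"})

# Hypothetical log patterns, for illustration only.
INFRA_PATTERNS = [
    re.compile(pattern, re.IGNORECASE)
    for pattern in (
        r"NCCL.*timeout",
        r"\bXid\b",
        r"CUDA driver.*(init|initialization)",
        r"link.?down",
    )
]


def should_retry(slurm_state, log_text, attempts, max_retries):
    """Retry only infrastructure failures, and only within the budget."""
    if attempts >= max_retries:
        return False
    if slurm_state in INFRA_STATES:
        return True
    return any(p.search(log_text) for p in INFRA_PATTERNS)
```

A failure such as a bad spec key matches none of the patterns, so it is reported immediately and the retry count is left untouched.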

See references/slurm-container-execution.md for the full multi-node env-var/sbatch directive detail and table, cluster requirements, the optional TAO SDK path (SlurmSDK, build_entrypoint, ActionWorkflow) with code, the Lustre-not-S3 rule in full, and the failure-mode checklist; references/slurm-execution-sdk.md covers the MAX_JOB_RETRIES retry budget. When the SDK is in scope, read tao-skill-bank:tao-run-platform for the SlurmSDK kwarg reference.

References

  • references/slurm-ssh-credentials.md — preflight script, SSH/key setup, enroot credentials, full credential list, backend details, storage rules, SSH remediation prompt.
  • references/slurm-container-execution.md — container execution steps, monitoring, status mapping, cancellation, multi-node detail, SDK use, Lustre-not-S3, auto-retry, failure modes.
  • references/slurm-preflight-storage.md — extended preflight/storage notes.
  • references/slurm-execution-sdk.md — extended execution/SDK notes.
  • references/detailed-guide.md — navigation map for the split references.

Use Cases

Task Automation & Efficiency

Automate repetitive workflows and reduce manual effort

Example

Generate reports, summarize documents, draft communications

Save 3-5 hours per week on routine tasks

Knowledge Enhancement

Learn new skills, understand complex topics, get expert guidance

Example

Explain concepts, provide examples, suggest learning resources

Accelerate learning and skill development by 2x

Quality Improvement

Enhance output quality through reviews, suggestions, and refinements

Example

Review drafts, suggest improvements, catch errors

Improve work quality by 30-40% with less effort

Implementation Guide

Prerequisites

  • Claude Desktop or compatible AI client with skill support
  • Clear understanding of task or problem to solve
  • Willingness to iterate and refine outputs

Time Estimate

15-45 minutes depending on use case complexity

Steps

  1. Install the skill using the provided installation command
  2. Test with a simple use case relevant to your work
  3. Evaluate output quality and relevance
  4. Iterate on prompts to improve results
  5. Integrate into your regular workflow if valuable

Common Pitfalls

  • Expecting perfect results without iteration
  • Not providing enough context in prompts
  • Using skill for tasks outside its intended scope
  • Accepting outputs without review and validation

Best Practices

✓ Do

  • Start with clear, specific prompts
  • Provide relevant context and constraints
  • Review and refine all outputs before using
  • Iterate to improve output quality
  • Document successful prompt patterns

✗ Don't

  • Don't use without understanding skill limitations
  • Don't skip validation of outputs
  • Don't share sensitive information in prompts
  • Don't expect skill to replace human judgment

💡 Pro Tips

  • Be specific about desired format and style
  • Ask for multiple options to choose from
  • Request explanations to understand reasoning
  • Combine AI efficiency with human expertise

When to Use This

✓ Use when

Use when skill capabilities match your task, clear ROI on time saved, and you can validate outputs. Best for repetitive tasks, learning, and quality improvement.

✗ Avoid when

Avoid when task requires deep expertise you can't validate, involves sensitive decisions, or when learning process is more valuable than speed of completion.

Learning Path

  1. Familiarize yourself with skill capabilities and limitations
  2. Start with low-risk, non-critical tasks
  3. Progress to more complex and valuable use cases
  4. Build expertise through regular use and experimentation

Reviews

4.6 (63 reviews)
  • Dhruvi Jain, Dec 28, 2024

    Solid pick for teams standardizing on skills: tao-run-on-slurm is focused, and the summary matches what you get after install.

  • Soo Anderson, Dec 20, 2024

    tao-run-on-slurm is among the better-maintained entries we tried; worth keeping pinned for repeat workflows.

  • Aditi Chen, Dec 20, 2024

    Solid pick for teams standardizing on skills: tao-run-on-slurm is focused, and the summary matches what you get after install.

  • Luis Khanna, Dec 8, 2024

    tao-run-on-slurm has been reliable in day-to-day use. Documentation quality is above average for community skills.

  • Anaya Iyer, Dec 4, 2024

    I recommend tao-run-on-slurm for anyone iterating fast on agent tooling; clear intent and a small, reviewable surface area.

  • Sakura Rahman, Dec 4, 2024

    Registry listing for tao-run-on-slurm matched our evaluation — installs cleanly and behaves as described in the markdown.

  • Luis Tandon, Nov 27, 2024

    tao-run-on-slurm fits our agent workflows well — practical, well scoped, and easy to wire into existing repos.

  • Anaya Gill, Nov 27, 2024

    Keeps context tight: tao-run-on-slurm is the kind of skill you can hand to a new teammate without a long onboarding doc.

  • Henry Lopez, Nov 23, 2024

    tao-run-on-slurm reduced setup friction for our internal harness; good balance of opinion and flexibility.

  • Isabella Bansal, Nov 23, 2024

    Useful defaults in tao-run-on-slurm — fewer surprises than typical one-off scripts, and it plays nicely with `npx skills` flows.
