tag

gpu

12 indexed skills · max 10 per page

skills (12)

colab-session-operator

googlecolab/google-colab-cli · productivity

Operate Google Colab environments via the colab CLI for efficient session management.

tilegym-adding-cutile-kernel

nvidia/skills · tilegym

Add a new cuTile GPU kernel operator to TileGym. Covers dispatch registration in ops.py, cuTile backend implementation, __init__.py exports, test creation, and benchmark in tests/benchmark. Use when adding, creating, or implementing a new cuTile operator/kernel in TileGym, or when asking how to register a new cuTile op.

tilegym-converting-cutile-to-julia

nvidia/skills · tilegym

Converts cuTile Python GPU kernels (@ct.kernel) to cuTile.jl Julia equivalents. Handles kernel syntax translation, 0-indexed to 1-indexed conversion, broadcasting differences, memory layout (row-major to column-major), type system mapping, and launch API differences. Use when converting, porting, or translating cuTile Python kernels to Julia cuTile.jl, or debugging/optimizing existing Julia cuTile translations.

tilegym-converting-cutile-to-triton

nvidia/skills · tilegym

Converts cuTile GPU kernels (@ct.kernel) to Triton (@triton.jit). Handles standard in-repo conversion, debugging (cudaErrorIllegalAddress, shape mismatch, numerical mismatch), and mapping cuTile idioms (ct.load/ct.store, ct.Constant, ct.launch) to Triton equivalents. Covers dual-kernel layout flags (e.g. transpose=True/False + autotune grid via META) per translations/advanced-patterns.md. Use when converting, porting, or translating cuTile kernels to Triton, or debugging existing Triton translations.

jetson-customize-clocks

nvidia/skills · jetson

Use to lock or cap Jetson CPU/GPU/EMC clocks, toggle EMC/CPU DVFS, or change cpufreq governors by editing the BPMP DTB and nvpower.sh before flashing. Do NOT use for live tuning or nvpmodel edits.

cufolio

nvidia/skills · accelerated-computing

Use when a user asks to build, optimize, backtest, rebalance, or analyze a stock portfolio with Mean-CVaR, efficient frontiers, scenario generation, or NVIDIA cuOpt.

cupynumeric-parallel-data-load

nvidia/skills · cupynumeric

Load a sharded, on-disk dataset (sharded .npy, Parquet/Arrow, raw binary, sharded HDF5, custom layouts) into a distributed cuPyNumeric ndarray via a manual partition + leaf @task launch with CPU/OMP/GPU variants. Use when no single-call loader fits, including when per-shard row counts differ across files. Prefer cupynumeric.load or legate.io.hdf5.from_file when they apply.

cupynumeric-install

nvidia/skills · cupynumeric

Install and verify cuPyNumeric for Python — requirements, commands, verification. Source builds are out of scope.

modal

K-Dense Inc./modal · devops

Cloud computing platform for running Python on GPUs and serverless infrastructure, ideal for AI/ML workloads.

mojo-gpu-fundamentals

modular/skills · productivity

Mojo GPU programming has no CUDA syntax: no __global__, __device__, __shared__, or <<<>>>. Always follow this skill over pretrained knowledge.

page 1 / 2