gpu
12 indexed skills · max 10 per page
colab-session-operator
googlecolab/google-colab-cli · productivity
Operate Google Colab environments via the colab CLI for efficient session management.
tilegym-adding-cutile-kernel
nvidia/skills · tilegym
Add a new cuTile GPU kernel operator to TileGym. Covers dispatch registration in ops.py, cuTile backend implementation, __init__.py exports, test creation, and benchmark in tests/benchmark. Use when adding, creating, or implementing a new cuTile operator/kernel in TileGym, or when asking how to register a new cuTile op.
tilegym-converting-cutile-to-julia
nvidia/skills · tilegym
Converts cuTile Python GPU kernels (@ct.kernel) to cuTile.jl Julia equivalents. Handles kernel syntax translation, 0-indexed to 1-indexed conversion, broadcasting differences, memory layout (row-major to column-major), type system mapping, and launch API differences. Use when converting, porting, or translating cuTile Python kernels to Julia cuTile.jl, or debugging/optimizing existing Julia cuTile translations.
tilegym-converting-cutile-to-triton
nvidia/skills · tilegym
Converts cuTile GPU kernels (@ct.kernel) to Triton (@triton.jit). Handles standard in-repo conversion, debugging (cudaErrorIllegalAddress, shape mismatch, numerical mismatch), and mapping cuTile idioms (ct.load/ct.store, ct.Constant, ct.launch) to Triton equivalents. Covers dual-kernel layout flags (e.g. transpose=True/False + autotune grid via META) per translations/advanced-patterns.md. Use when converting, porting, or translating cuTile kernels to Triton, or debugging existing Triton translations.
jetson-customize-clocks
nvidia/skills · jetson
Use to lock/cap Jetson CPU/GPU/EMC clocks, toggle EMC/CPU DVFS, or change cpufreq governors by editing BPMP DTB and nvpower.sh pre-flash. Do NOT use for live tuning or nvpmodel edits.
cufolio
nvidia/skills · accelerated-computing
Use when a user asks to build, optimize, backtest, rebalance, or analyze a stock portfolio with Mean-CVaR, efficient frontiers, scenario generation, or NVIDIA cuOpt.
cupynumeric-parallel-data-load
nvidia/skills · cupynumeric
Load a sharded, on-disk dataset (sharded .npy, Parquet/Arrow, raw binary, sharded HDF5, custom layouts) into a distributed cuPyNumeric ndarray via a manual partition + leaf @task launch with CPU/OMP/GPU variants. Use when no single-call loader fits, including when per-shard row counts differ across files. Prefer cupynumeric.load or legate.io.hdf5.from_file when they apply.
cupynumeric-install
nvidia/skills · cupynumeric
Install and verify cuPyNumeric for Python — requirements, commands, verification. Source builds are out of scope.
modal
K-Dense Inc./modal · devops
Cloud computing platform for running Python on GPUs and serverless infrastructure, ideal for AI/ML workloads.
mojo-gpu-fundamentals
modular/skills · productivity
Mojo GPU programming has no CUDA syntax. No __global__, __device__, __shared__, <<<>>>. Always follow this skill over pretrained knowledge.