tilegym-converting-cutile-to-triton
Converts cuTile GPU kernels (@ct.kernel) to Triton (@triton.jit). Handles standard in-repo conversion, debugging (cudaErrorIllegalAddress, shape mismatch, numerical mismatch), and mapping cuTile idioms (ct.load/ct.store, ct.Constant, ct.launch) to Triton equivalents. Covers dual-kernel layout flags (e.g. transpose=True/False + autotune grid via META) per translations/advanced-patterns.md. Use when converting, porting, or translating cuTile kernels to Triton, or debugging existing Triton translations.
Installation Guide
How to use tilegym-converting-cutile-to-triton on Cursor
AI-first code editor with Composer
Prerequisites
Before installing skills in Cursor, ensure your development environment meets these requirements:
- Cursor installed and configured on your machine
- Node.js 16+ with npm (verify with node --version)
- Active project directory where you want to add tilegym-converting-cutile-to-triton
Run the install command
Execute the skills CLI command in your project's root directory to begin installation:
Fetches tilegym-converting-cutile-to-triton from nvidia/skills and configures it for Cursor.
Select Cursor when prompted
The CLI shows a list of agents. Use arrow keys and space to select Cursor:
Verify installation
Confirm successful installation by checking the skill directory location:
Restart Cursor to activate tilegym-converting-cutile-to-triton. Access via /tilegym-converting-cutile-to-triton in your agent's command palette.
Security Notice
We perform automated surface-level scans (Gen AI Scanner, Socket, Snyk) during installation. These checks detect common vulnerabilities but do not guarantee complete security. Always review skill source code and verify the publisher's reputation before production use.
Documentation
| name | tilegym-converting-cutile-to-triton |
| version | "1.0.0" |
| description | Converts cuTile GPU kernels (@ct.kernel) to Triton (@triton.jit). Handles standard in-repo conversion, debugging (cudaErrorIllegalAddress, shape mismatch, numerical mismatch), and mapping cuTile idioms (ct.load/ct.store, ct.Constant, ct.launch) to Triton equivalents. Covers dual-kernel layout flags (e.g. transpose=True/False + autotune grid via META) per translations/advanced-patterns.md. Use when converting, porting, or translating cuTile kernels to Triton, or debugging existing Triton translations. |
| license | CC-BY-4.0 AND Apache-2.0 |
| tools | Read, Write, Grep, Glob, Bash |
| metadata | author: "TileGym Team <[email protected]>"; tags: cutile, triton, conversion, gpu, kernel |
cuTile → Triton Conversion
Convert @ct.kernel kernels to @triton.jit. API mapping: references/api-mapping.md (cuTile → Triton).
In this skill’s Markdown, Triton launch syntax kernel[grid](…) uses Unicode brackets so link checkers do not parse [grid](…) as a hyperlink; use normal ASCII brackets in real Triton code.
Instructions
Follow the phase-gated workflow in translations/workflow.md. Every conversion should go through analyze → convert → validate → test → benchmark, with explicit gates before moving on. Use the documents in Workflow Selection when the task matches a special case (errors, layout flags, perf).
- Optimization strategy (perf-sensitive / attention): If the op is attention, FMHA, sliding window, soft cap, or GQA (e.g. Gemma gemma_attention), read references/optimization-strategy.md before converting the inner loop, then apply the §4 Gemma FMHA checklist. For other GEMM/BMM/attention-adjacent kernels, still skim §2–§3 of that file after TMA is done.
- Select path: For an existing TileGym op, use standard mode in translations/workflow.md. If the cuTile source uses transpose/transpose_v, dual layouts, or MLA-style paths, read translations/advanced-patterns.md before writing Triton (two kernels + META grid, not one kernel + tl.trans).
- Pre-flight: Run the Pre-flight Analysis grep commands on the cuTile source. Count @ct.kernel definitions; note TMA-relevant ct.load/ct.store, ct.launch, Constant, and layout flags.
- Read mapping: Keep references/api-mapping.md open for cuTile → Triton API pairs. For runtime failures (illegal address, dtype, strides), use references/debugging.md.
- Convert: Copy the Conversion Checklist into a todo list and execute it in order. Structure and file placement: translations/file-structure.md. Mandatory: any 2D+ block-shaped tile load/store uses tl.make_tensor_descriptor (TMA), not raw tl.load(ptr+offs, mask=…) for full tiles; skipping this is the most common source of large regressions. Host side: Triton bracket launch kernel[grid](args) with a tuple or lambda META: (…) for autotune; no ct.launch.
- Validate: Syntax-check the new Triton module, then run the relevant TileGym pytest targets for the op: pytest tests/ops/test_<op>.py -k "triton" -vs. Fix failures before benchmarking.
- Benchmark: Compare Triton vs cuTile on perf tests. If Triton is clearly slower, follow PERFORMANCE ANALYSIS (Phase c2t-5) in translations/workflow.md and references/optimizing-reference.md for GEMM/BMM/attention; use references/optimization-strategy.md as the ordered checklist. If you see 10–50× slowdowns, read CRITICAL PERFORMANCE PATTERNS in that same workflow file first.
Execution rules (MUST):
- Create and track the conversion checklist (e.g. TodoWrite) before editing kernel code; complete steps in order—do not skip pre-flight or TMA decisions.
- For attention / FMHA / Gemma / GQA / soft cap / sliding window: read references/optimization-strategy.md and apply §4 before treating the conversion as optimized.
- Do not ship raw pointer+mask 2D+ tile loads where TMA applies; document any intentional exception.
- If tests or benchmarks fail a gate, stop and fix before declaring the conversion done—do not stack unverified changes.
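The gate rule above can be sketched as a small driver that runs the phases in order and stops at the first failure. The phase names come from the workflow described in Instructions; the `run_gated` helper and the callables passed to it are hypothetical and not part of TileGym.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Phase order taken from the workflow description; everything else is illustrative.
PHASES = ["analyze", "convert", "validate", "test", "benchmark"]


def run_gated(checks: Dict[str, Callable[[], bool]]) -> Tuple[List[str], Optional[str]]:
    """Run phases in order; return (passed phases, first failed phase or None)."""
    passed: List[str] = []
    for phase in PHASES:
        check = checks.get(phase)
        # A phase with no registered check counts as a failure, so it cannot be skipped.
        if check is None or not check():
            return passed, phase
        passed.append(phase)
    return passed, None


if __name__ == "__main__":
    checks = {name: (lambda: True) for name in PHASES}
    checks["test"] = lambda: False  # simulate a failing pytest run
    done, failed = run_gated(checks)
    print("passed:", done, "stopped at:", failed)
```

Because the driver returns at the first failing phase, a red test run never reaches the benchmark step, which is the behavior the last execution rule asks for.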
Workflow Selection
- Existing TileGym op → Standard Mode: translations/workflow.md
- Errors (cudaErrorIllegalAddress, shape mismatch, numerical mismatch) → references/debugging.md
- Advanced patterns (TMA, dual layout flags transpose, autotune + META grid, Array.slice, ct.gather().item()) → translations/advanced-patterns.md (MLA-style two kernels, avoid 3–15× regression on transpose=False).
- Performance (Triton kernel slower than cuTile, autotuning, profiling) → translations/workflow.md (section PERFORMANCE ANALYSIS (Phase c2t-5))
- Optimization strategy hub (ordered checklist: advanced-patterns + optimizing-reference) → references/optimization-strategy.md: read first for attention/FMHA/Gemma; then drill into the two source docs as needed
- Optimizing GEMM/BMM/attention (after TMA, or Triton 10–20% slower) → references/optimizing-reference.md: EVEN_K fast path, transpose via pointer arithmetic, grid layout, autotune breadth, epilogue subtile; use these patterns during conversion and before perf sign-off (summarized in optimization-strategy §2–§3)
- Gemma attention / GQA FMHA conversion → references/optimization-strategy.md §4
- Blackwell optimization (complex kernels with iterative algorithms, register pressure, loop unrolling) → references/optimizing-reference.md §9: TMA descriptors, loop_unroll_factor, occupancy autotuning, TMEM-friendly block sizes, slab allocator, dual-path kernel design
- ⚠️ 10-50x REGRESSION (catastrophic slowdown after conversion) → translations/workflow.md, section CRITICAL PERFORMANCE PATTERNS (AVOID 10-50x REGRESSION)
- ⚠️ Good perf on transpose=True only, collapse on transpose=False (or opposite) → translations/advanced-patterns.md, §1 Dual layout flag; two @triton.jit kernels + grid = lambda META: (... META["BLOCK_H"] ...)
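The `grid = lambda META: (...)` form in the last bullet matters because the launch grid depends on whichever block sizes autotune selects. The sketch below shows the arithmetic in plain Python so it runs without Triton: the local `cdiv` stands in for `triton.cdiv`, and the sizes `H`, `N`, the `BLOCK_N` key, and both candidate configs are hypothetical.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division; same arithmetic as triton.cdiv, written locally for portability."""
    return (a + b - 1) // b


# Hypothetical problem sizes.
H, N = 100, 4096

# Grid expressed as a function of the chosen meta-parameters, as in the bullet above.
grid = lambda META: (cdiv(H, META["BLOCK_H"]), cdiv(N, META["BLOCK_N"]))

# Two hypothetical autotune candidates: the launch grid differs for each one.
for config in ({"BLOCK_H": 16, "BLOCK_N": 128}, {"BLOCK_H": 64, "BLOCK_N": 256}):
    print(config, "->", grid(config))
```

In real Triton code the same lambda goes inside the bracket launch, and Triton calls it with the selected config's meta-parameters. A grid computed once from a fixed `BLOCK_H` would be wrong for every other candidate.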
Pre-flight Analysis (Run BEFORE converting)
# Count kernels (only main kernel gets @triton.jit, helpers stay plain def)
grep "@ct\.kernel" source.py | wc -l
# Check for patterns needing special handling
grep "ct\.transpose\|ct\.permute" source.py # → use tl.trans/tl.permute
grep "ct\.astype" source.py # → use .to(dtype)
grep "ct\.load\|ct\.store" source.py # → TMA for 2D+ (tl.make_tensor_descriptor), NOT raw tl.load(ptr+offs)
grep "ct\.launch" source.py # → bracket launch: kernel then [grid] then (args)
grep "ct\.Constant\|ct\.ConstInt" source.py # → tl.constexpr
grep "ct\.cdiv" source.py # → triton.cdiv (host) or Python (a+b-1)//b
grep "ct\.bid\|ct\.num_blocks" source.py # → tl.program_id/tl.num_programs
grep "1 << .*\.bit_length" source.py # → triton.next_power_of_2 if needed
grep "transpose\|transpose_v" source.py # → if hit, read translations/advanced-patterns.md (dual kernels + META grid)
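Where grep is not available, the same pre-flight counts can be collected with Python's `re` module. This is a minimal sketch: the patterns mirror a subset of the grep commands above, while the `preflight` helper and the `SAMPLE` snippet are hypothetical (the snippet is only scanner input and is not claimed to be valid cuTile code).

```python
import re

# Patterns mirror the grep commands above; hints are abbreviated from their comments.
PREFLIGHT_PATTERNS = {
    r"@ct\.kernel": "kernel count",
    r"ct\.(transpose|permute)": "tl.trans / tl.permute",
    r"ct\.astype": ".to(dtype)",
    r"ct\.(load|store)": "TMA for 2D+ tiles",
    r"ct\.launch": "bracket launch",
    r"ct\.(Constant|ConstInt)": "tl.constexpr",
    r"ct\.(bid|num_blocks)": "tl.program_id / tl.num_programs",
}


def preflight(source: str) -> dict:
    """Count matching lines per pattern, like `grep PATTERN file | wc -l`."""
    lines = source.splitlines()
    return {
        hint: sum(1 for line in lines if re.search(pattern, line))
        for pattern, hint in PREFLIGHT_PATTERNS.items()
    }


# Hypothetical cuTile-like text used only as scanner input.
SAMPLE = """\
@ct.kernel
def add(x, y, out, N: ct.Constant[int]):
    i = ct.bid(0)
    a = ct.load(x, index=(i,), shape=(N,))
    b = ct.load(y, index=(i,), shape=(N,))
    ct.store(out, index=(i,), tile=a + b)
"""

if __name__ == "__main__":
    for hint, count in preflight(SAMPLE).items():
        print(f"{count:2d}  {hint}")
```

Like the grep pipeline, this counts matching lines rather than individual matches, so a line with two `ct.load` calls is counted once.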
Conversion Checklist
Copy this checklist and track progress:
Conversion Progress:
[ ] Step 0 (attention / Gemma FMHA / GQA / soft cap / sliding window): Read [references/optimization-strategy.md](./references/optimization-strategy.md) and apply §4 checklist before inner-loop Triton
[ ] Step 1: Pre-flight — run grep commands above, note special patterns and 2D+ loads (→ TMA)
[ ] Step 2: Analyze source cuTile kernel (identify patterns, shapes, dtypes)
[ ] Step 3: Create Triton file with correct structure (see translations/file-structure.md)
[ ] Step 4: Convert kernel signature (tensor args → pointer args, Constant → constexpr)
[ ] Step 4b: TMA (MANDATORY for 2D+ loads) — use tl.make_tensor_descriptor for every 2D+ tile load/store; do NOT ship raw tl.load(ptr+offs,mask) for block-shaped access (see workflow.md § TMA OPTIMIZATION)
[ ] Step 5: Convert kernel body (apply gotchas table below + API mapping)
[ ] Step 6: Convert host wrapper (grid tuple/lambda, bracket-style launch: kernel, grid, then arguments; no ct.launch); call triton.set_allocator(alloc_fn) if using TMA
[ ] Step 7: Validate — run pytest or syntax check on Triton file
[ ] Step 8: Test — run pytest, verify X passed 0 failed
[ ] Step 9: If test fails → fix → re-validate → re-test (loop until green)
[ ] Step 10: Benchmark — run perf test, compare vs cuTile (see workflow.md § PERFORMANCE ANALYSIS)
[ ] Step 10b: If GEMM/BMM/attention and Triton >20% slower → walk [references/optimization-strategy.md](./references/optimization-strategy.md) §2–§3 then [references/optimizing-reference.md](./references/optimizing-reference.md) (EVEN_K, transpose, grid, autotune, epilogue subtile), then re-benchmark
[ ] Step 10c: If op has `transpose` / layout flag → read [translations/advanced-patterns.md](./translations/advanced-patterns.md); verify **separate kernels** per layout (not transpose-kernel + `tl.trans`); **autotuned** launches use `lambda META: (triton.cdiv(..., META["BLOCK_H"]), ...)` — no fixed `BLOCK_H`/`BLOCK_N` through `apply()` unless autotune is disabled
Post-conversion Verification (TMA is mandatory for 2D+ loads):
[ ] TMA: All 2D+ tile loads use tl.make_tensor_descriptor(...).load([...]); no raw ptr+mask for block-shaped 2D+ access (else 5x-20x regression)
[ ] Grid is a tuple or a lambda (Triton does not require the 3-tuple that cuTile does)
[ ] Triton autotune added if cuTile op used kernel_configs/autotune (see workflow § PERFORMANCE ANALYSIS)
[ ] Host grid uses triton.cdiv where appropriate (not (a+b-1)//b only)
[ ] Pointer/offset indexing: Triton uses element offsets (ptr + offs), not block index in tl.load (or use TMA descriptor)
[ ] ct.astype(x, dtype) → x.to(dtype) in Triton
[ ] ct.mma(a, b, acc=acc) → tl.dot(a, b, acc) (no keyword in Triton)
[ ] Optional/None args: Triton allows None in kernel args if desired (cuTile required dummy+flag)
[ ] Masking applied when BLOCK_SIZE > actual dimension (same as cuTile); with TMA, masks can often be removed for full tiles
[ ] Reduction divisor uses actual_size, NOT BLOCK_SIZE
[ ] fp32/tf32: Triton defaults allow_tf32=True; match cuTile behavior if you had explicit tf32 cast
[ ] If any 2D+ load uses raw ptr+mask (exception only): document WHY TMA was not used
[ ] tl.assume() alignment hints added for strides and pointers
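Two items in the verification list, masking when BLOCK_SIZE exceeds the actual dimension and dividing reductions by actual_size, are easiest to check with concrete numbers. The following plain-Python sketch emulates a single padded block; `block_mean` is a hypothetical helper, and the local `next_power_of_2` uses the `1 << bit_length` idiom that the pre-flight grep looks for.

```python
def next_power_of_2(n: int) -> int:
    """Smallest power of two >= n (for n >= 1), via the `1 << bit_length` idiom."""
    return 1 << (n - 1).bit_length()


def block_mean(values, wrong_divisor=False):
    """Emulate a single-block mean where the block is padded to a power of two."""
    actual_size = len(values)
    block_size = next_power_of_2(actual_size)
    # Masked load: lanes at or past actual_size read 0.0 instead of out-of-bounds memory.
    block = [values[i] if i < actual_size else 0.0 for i in range(block_size)]
    total = sum(block)
    # Correct: divide by the number of real elements, not the padded block width.
    return total / (block_size if wrong_divisor else actual_size)


if __name__ == "__main__":
    row = [2.0, 4.0, 6.0, 8.0, 10.0]  # actual_size = 5, padded BLOCK_SIZE = 8
    print(block_mean(row))                      # 6.0
    print(block_mean(row, wrong_divisor=True))  # 3.75
```

The masked lanes contribute zero to the sum, so the only way the padding leaks into the result is through the divisor. This is why the mistake tends to show up as a numerical mismatch and not as a crash.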
Gotchas (Most Common Translation Errors)
Comprehensive table of patterns that frequently break or regress when porting @ct.kernel to @triton.jit — mma accumulator, type cast, grid, TMA usage, dtype handling, layout flags, batched matmul, etc.
See: references/gotchas.md — read this BEFORE writing the Triton kernel.
Performance Gotchas (10-50x Regression Risk)
⚠️ These cause CATASTROPHIC slowdowns. Check BEFORE benchmarking.
Patterns and their impact: TMA vs raw ptr+mask (5-20×), autotune vs fixed tile sizes (2-3×), broadcast_to + tl.dot (10-50×), extract_slice chains (2-5×), and more.
See: references/performance-gotchas.md — full regression-risk table.
Full details: translations/workflow.md — section CRITICAL PERFORMANCE PATTERNS (AVOID 10-50x REGRESSION).
Full API mapping: references/api-mapping.md.
Triton math dtype (erf/erfc/exp/log/sqrt) and the "don't substitute erf with tanh" pattern: references/debugging.md — section Triton Math Function Dtype Requirements (CRITICAL).
Optimization strategy (hub)
File: references/optimization-strategy.md
Summarizes translations/advanced-patterns.md (layout flags, dual kernels, autotune+META, batched launch, Blackwell pointers) and references/optimizing-reference.md (post-TMA micro-opts, §9) into §1–§3 plus a mandatory §4 Gemma FMHA checklist.
Rule: For attention / FMHA / Gemma-style conversions, open optimization-strategy in the same session as workflow — do not rely on TMA alone for perf sign-off.
Reference Documents
Read from cuTile → Triton perspective. Core files live in this skill under ``.
| Category | Document | Content |
|---|---|---|
| Strategy | optimization-strategy.md | Ordered hub: advanced-patterns + optimizing-reference; §4 Gemma FMHA mandatory checklist |
| Workflows | translations/workflow.md | Standard c2t conversion (phases + checklist) |
| | translations/file-structure.md | Where to place Triton files when converting from cuTile |
| | translations/advanced-patterns.md | Dual layout flags (transpose), autotune + META grid, MLA-style two kernels |
| API | api-mapping.md | cuTile → Triton mapping |
| | optimizing-reference.md | GEMM/BMM/attention optimizations (EVEN_K, transpose, grid, autotune, epilogue subtile) |
| Gotchas | gotchas.md | Common cuTile→Triton translation errors (mma, dtype, grid, TMA, layout flags) |
| | performance-gotchas.md | 10-50× regression-risk table (TMA vs ptr+mask, broadcast_to, extract_slice chains, autotune) |
| Testing & errors | references/debugging.md | Triton runtime errors (cudaErrorIllegalAddress, pointer type, stride overflow) |
Worked Examples
Use cutile_kernel.py as source and triton_kernel.py as target:
| Example | Directory | Complexity |
|---|---|---|
| Vector Add | examples/01_vector_add/ | Basic |
| Softmax | examples/02_softmax/ | Intermediate |
| LayerNorm | examples/03_layernorm/ | Intermediate |
| MatMul | examples/04_matmul/ | Advanced |
| Attention | examples/05_attention/ | Advanced |
Read cutile_kernel.py first, then triton_kernel.py, to see how each cuTile construct maps to its Triton counterpart.
⚠️ MANDATORY COMPLETION CHECKLIST (DO NOT SKIP)
A conversion is NOT COMPLETE until ALL items are checked. Copy and complete:
MANDATORY COMPLETION GATES:
[ ] 1. CORRECTNESS: pytest passes with 0 failures
Command: python -m pytest {test_path} -k "test_op and triton" -vs --tb=short
Gate: "X passed, 0 failed"
[ ] 2. TMA OPTIMIZATION: All 2D+ tile loads use tl.make_tensor_descriptor
Verify: grep -n "tl.load.*mask" triton_file.py | wc -l # Should be 0 for 2D+ ops
Skip = 5-20x performance regression
[ ] 3. PERFORMANCE TEST: Triton within 20% of cuTile baseline
Command: python -m pytest {test_path} -k "test_perf" --print-record -v
OR: Run benchmark script: cd tests/benchmark && python bench_{op}.py
Gate: Triton TFLOPS >= 0.8 * CuTile TFLOPS
[ ] 4. PERFORMANCE COMPARISON RECORDED:
Document results:
| Config | Triton (TFLOPS) | CuTile (TFLOPS) | Ratio |
|--------|-----------------|-----------------|-------|
| [fill] | [fill] | [fill] | [fill]|
CONVERSION COMPLETE: All 4 gates passed? → YES / NO
Why this matters:
- Gate 1 catches functional bugs
- Gate 2 prevents catastrophic 5-20x regressions (most common mistake)
- Gate 3 validates that optimization was effective
- Gate 4 creates accountability record
If any gate fails: Fix and re-verify before declaring complete.
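Gates 3 and 4 reduce to a ratio check and a table, which can be sketched as below. The `perf_gate` helper is hypothetical, the 0.8 threshold is the one stated in Gate 3, and the numbers in the usage section are placeholders for illustration, not measurements.

```python
def perf_gate(rows, threshold=0.8):
    """Return (markdown table, all_pass) for (config, triton_tflops, cutile_tflops) rows."""
    lines = [
        "| Config | Triton (TFLOPS) | CuTile (TFLOPS) | Ratio |",
        "|--------|-----------------|-----------------|-------|",
    ]
    all_pass = True
    for config, triton_tflops, cutile_tflops in rows:
        ratio = triton_tflops / cutile_tflops
        # Gate 3: Triton TFLOPS >= 0.8 * CuTile TFLOPS for every recorded config.
        all_pass = all_pass and ratio >= threshold
        lines.append(f"| {config} | {triton_tflops:.1f} | {cutile_tflops:.1f} | {ratio:.2f} |")
    return "\n".join(lines), all_pass


if __name__ == "__main__":
    # Placeholder numbers for illustration only; they are not measurements.
    table, ok = perf_gate([("M=N=K=4096", 90.0, 100.0), ("M=N=K=1024", 30.0, 50.0)])
    print(table)
    print("Gate 3:", "PASS" if ok else "FAIL")
```

Requiring every config to clear the threshold, not just the average, keeps one fast shape from hiding a regression on another.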