tilegym✦ Official

tilegym-converting-cutile-to-triton

Converts cuTile GPU kernels (@ct.kernel) to Triton (@triton.jit). Handles standard in-repo conversion, debugging (cudaErrorIllegalAddress, shape mismatch, numerical mismatch), and mapping cuTile idioms (ct.load/ct.store, ct.Constant, ct.launch) to Triton equivalents. Covers dual-kernel layout flags (e.g. transpose=True/False + autotune grid via META) per translations/advanced-patterns.md. Use when converting, porting, or translating cuTile kernels to Triton, or debugging existing Triton translations.

nvidia/skills · Updated Jun 23, 2026

Works with

Claude Code, Cursor, Cline, Windsurf, Codex, Goose, GitHub Copilot, Zed

0 total installs · 0 this week · 1.7K GitHub stars · 0 upvotes

Install Skill

Run in your terminal

$ npx skills install nvidia/skills/tilegym-converting-cutile-to-triton


Installation Guide

How to use tilegym-converting-cutile-to-triton on Cursor

AI-first code editor with Composer

1. Prerequisites

Before installing skills in Cursor, ensure your development environment meets these requirements:

  • Cursor installed and configured on your machine
  • Node.js 16+ with npm — verify with node --version
  • Active project directory where you want to add tilegym-converting-cutile-to-triton

2. Run the install command

Execute the skills CLI command in your project's root directory to begin installation:

$ npx skills install nvidia/skills/tilegym-converting-cutile-to-triton

Fetches tilegym-converting-cutile-to-triton from nvidia/skills and configures it for Cursor.

3. Select Cursor when prompted

The CLI shows a list of agents. Use arrow keys and space to select Cursor:

◆ Which agents do you want to install to?
│ ── Universal (.agents/skills) ────────────────
│ · Cline · Codex · Goose · Windsurf
│ ● Cursor (selected)
│ · Cursor · Aider · Continue

4. Verify installation

Confirm successful installation by checking the skill directory location:

.cursor/skills/tilegym-converting-cutile-to-triton

Restart Cursor to activate tilegym-converting-cutile-to-triton. Access via /tilegym-converting-cutile-to-triton in your agent's command palette.

Security Notice

We perform automated surface-level scans (Gen AI Scanner, Socket, Snyk) during installation. These checks detect common vulnerabilities but do not guarantee complete security. Always review skill source code and verify the publisher's reputation before production use.

Skills execute code in your environment. Always review source, verify the publisher, and test in isolation before production.

Documentation

name: tilegym-converting-cutile-to-triton
version: "1.0.0"
description: Converts cuTile GPU kernels (@ct.kernel) to Triton (@triton.jit). Handles standard in-repo conversion, debugging (cudaErrorIllegalAddress, shape mismatch, numerical mismatch), and mapping cuTile idioms (ct.load/ct.store, ct.Constant, ct.launch) to Triton equivalents. Covers dual-kernel layout flags (e.g. transpose=True/False + autotune grid via META) per translations/advanced-patterns.md. Use when converting, porting, or translating cuTile kernels to Triton, or debugging existing Triton translations.
license: CC-BY-4.0 AND Apache-2.0
tools: Read, Write, Grep, Glob, Bash
metadata:
  author: "TileGym Team <[email protected]>"
  tags: cutile, triton, conversion, gpu, kernel

cuTile → Triton Conversion

Convert @ct.kernel kernels to @triton.jit. API mapping: references/api-mapping.md (cuTile → Triton).

In this skill’s Markdown, Triton launch syntax kernel[grid](…) uses Unicode brackets so link checkers do not parse [grid](…) as a hyperlink; use normal ASCII brackets in real Triton code.

Instructions

Follow the phase-gated workflow in translations/workflow.md. Every conversion should go through analyze → convert → validate → test → benchmark, with explicit gates before moving on. Use the documents in Workflow Selection when the task matches a special case (errors, layout flags, perf).

  1. Optimization strategy (perf-sensitive / attention) — If the op is attention, FMHA, sliding window, soft cap, or GQA (e.g. Gemma gemma_attention), read references/optimization-strategy.md before converting the inner loop, then apply §4 Gemma FMHA checklist. For other GEMM/BMM/attention-adjacent kernels, still skim §2–§3 of that file after TMA is done.

  2. Select path — Existing TileGym op: standard mode in translations/workflow.md. If the cuTile source uses transpose / transpose_v, dual layouts, or MLA-style paths, read translations/advanced-patterns.md before writing Triton (two kernels + META grid, not one kernel + tl.trans).

  3. Pre-flight — Run the Pre-flight Analysis grep commands on the cuTile source. Count @ct.kernel definitions; note TMA-relevant ct.load/ct.store, ct.launch, Constant, and layout flags.

  4. Read mapping — Keep references/api-mapping.md open for cuTile → Triton API pairs. For runtime failures (illegal address, dtype, strides), use references/debugging.md.

  5. Convert — Copy the Conversion Checklist into a todo list and execute it in order. Structure and file placement: translations/file-structure.md. Mandatory: any 2D+ block-shaped tile load/store uses tl.make_tensor_descriptor (TMA), not raw tl.load(ptr+offs, mask=…) for full tiles; skipping this is the most common source of large regressions. Host side: use the Triton bracket launch kernel[grid](args), where grid is a tuple or a lambda META: (…) for autotune; no ct.launch.

  6. Validate — Syntax-check the new Triton module; run the relevant TileGym pytest targets for the op: pytest tests/ops/test_<op>.py -k "triton" -vs. Fix failures before benchmarking.

  7. Benchmark — Compare Triton vs cuTile on perf tests. If Triton is clearly slower, follow PERFORMANCE ANALYSIS (Phase c2t-5) in translations/workflow.md and references/optimizing-reference.md for GEMM/BMM/attention; use references/optimization-strategy.md as the ordered checklist. If you see 10–50× slowdowns, read CRITICAL PERFORMANCE PATTERNS in that same workflow file first.
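For the syntax check in step 6, one lightweight option is to parse the new module with Python's standard `ast` module, which catches syntax errors without importing Triton or touching a GPU. The sketch below is only an illustration: `syntax_check` is a hypothetical helper name, not part of TileGym, and it does not replace the pytest gate.

```python
import ast
from typing import List


def syntax_check(source: str, filename: str = "<triton_module>") -> List[str]:
    """Return a list of syntax-error messages (empty if the source parses).

    Hypothetical helper: it only parses the text and never imports or runs
    the module, so no GPU or Triton install is required.
    """
    try:
        ast.parse(source, filename=filename)
    except SyntaxError as exc:
        return [f"{filename}:{exc.lineno}: {exc.msg}"]
    return []


good = "def add(x, y):\n    return x + y\n"
bad = "def add(x, y)\n    return x + y\n"  # missing colon after the signature

print(syntax_check(good))      # []
print(len(syntax_check(bad)))  # 1
```

In practice the source string would come from reading the new Triton file; a clean parse says nothing about kernel correctness, which is what the pytest targets are for.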

Execution rules (MUST):

  • Create and track the conversion checklist (e.g. TodoWrite) before editing kernel code; complete steps in order—do not skip pre-flight or TMA decisions.
  • For attention / FMHA / Gemma / GQA / soft cap / sliding window: read references/optimization-strategy.md and apply §4 before treating the conversion as optimized.
  • Do not ship raw pointer+mask 2D+ tile loads where TMA applies; document any intentional exception.
  • If tests or benchmarks fail a gate, stop and fix before declaring the conversion done—do not stack unverified changes.

Workflow Selection

  • Existing TileGym op → Standard Mode: translations/workflow.md
  • Errors (cudaErrorIllegalAddress, shape mismatch, numerical mismatch) → references/debugging.md
  • Advanced patterns (TMA, dual layout flags transpose, autotune + META grid, Array.slice, ct.gather().item()) → translations/advanced-patterns.md (MLA-style two kernels, avoid 3–15× regression on transpose=False).
  • Performance (Triton kernel slower than cuTile, autotuning, profiling) → translations/workflow.md (section PERFORMANCE ANALYSIS (Phase c2t-5))
  • Optimization strategy hub (ordered checklist: advanced-patterns + optimizing-reference) → references/optimization-strategy.md — read first for attention/FMHA/Gemma; then drill into the two source docs as needed
  • Optimizing GEMM/BMM/attention (after TMA, or Triton 10–20% slower) → references/optimizing-reference.md — EVEN_K fast path, transpose via pointer arithmetic, grid layout, autotune breadth, epilogue subtile; use these patterns during conversion and before perf sign-off (summarized in optimization-strategy §2–§3)
  • Gemma attention / GQA FMHA conversion → references/optimization-strategy.md §4
  • Blackwell optimization (complex kernels with iterative algorithms, register pressure, loop unrolling) → references/optimizing-reference.md §9 — TMA descriptors, loop_unroll_factor, occupancy autotuning, TMEM-friendly block sizes, slab allocator, dual-path kernel design
  • ⚠️ 10-50x REGRESSION (catastrophic slowdown after conversion) → translations/workflow.md — section CRITICAL PERFORMANCE PATTERNS (AVOID 10-50x REGRESSION)
  • ⚠️ Good perf on transpose=True only, collapse on transpose=False (or opposite) → translations/advanced-patterns.md — §1 Dual layout flag; two @triton.jit kernels + grid = lambda META: (... META["BLOCK_H"] ...)
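The `grid = lambda META: (...)` form mentioned in the last item lets the launch grid follow whichever block sizes autotune selects. The following pure-Python sketch shows the idea without a GPU; `cdiv` is a local stand-in for `triton.cdiv`, and `FakeKernel` is a hypothetical mock that only mimics the shape of the bracket launch, not Triton's real launcher.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division; local stand-in for triton.cdiv."""
    return (a + b - 1) // b


class FakeKernel:
    """Hypothetical mock: mimics kernel[grid](args) without a GPU."""

    def __init__(self, meta):
        self.meta = meta  # stands in for the config autotune selected

    def __getitem__(self, grid):
        def launch(*args):
            # A callable grid is resolved against the chosen meta-parameters;
            # a plain tuple is used as-is.
            resolved = grid(self.meta) if callable(grid) else grid
            return tuple(resolved)

        return launch


H, N = 100, 4096  # illustrative problem sizes
grid = lambda META: (cdiv(H, META["BLOCK_H"]), cdiv(N, META["BLOCK_N"]))

small = FakeKernel({"BLOCK_H": 16, "BLOCK_N": 128})[grid]()
large = FakeKernel({"BLOCK_H": 64, "BLOCK_N": 256})[grid]()
print(small)  # (7, 32)
print(large)  # (2, 16)
```

A grid computed once from a fixed `BLOCK_H` would be wrong for every other autotune config, which is why the lambda form is paired with autotuned launches.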

Pre-flight Analysis (Run BEFORE converting)

# Count kernels (each @ct.kernel becomes a @triton.jit kernel; helpers called from inside a kernel must also be @triton.jit)
grep "@ct\.kernel" source.py | wc -l

# Check for patterns needing special handling
grep "ct\.transpose\|ct\.permute" source.py   # → use tl.trans/tl.permute
grep "ct\.astype" source.py                    # → use .to(dtype)
grep "ct\.load\|ct\.store" source.py          # → TMA for 2D+ (tl.make_tensor_descriptor), NOT raw tl.load(ptr+offs)
grep "ct\.launch" source.py                    # → bracket launch: kernel then [grid] then (args)
grep "ct\.Constant\|ct\.ConstInt" source.py    # → tl.constexpr
grep "ct\.cdiv" source.py                      # → triton.cdiv (host) or Python (a+b-1)//b
grep "ct\.bid\|ct\.num_blocks" source.py       # → tl.program_id/tl.num_programs
grep "1 << .*\.bit_length" source.py           # → triton.next_power_of_2 if needed
grep "transpose\|transpose_v" source.py       # → if hit, read translations/advanced-patterns.md (dual kernels + META grid)
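The same pre-flight scan can be scripted when grep is inconvenient. This is a minimal sketch using only the standard library; the pattern table mirrors a subset of the grep commands above, `preflight` is a hypothetical helper name, and the sample kernel text is illustrative input that has not been checked against the real cuTile API.

```python
import re

# Patterns mirror the grep commands above; values are shortened reminders.
PATTERNS = {
    r"@ct\.kernel": "kernel definitions",
    r"ct\.(transpose|permute)": "tl.trans / tl.permute",
    r"ct\.astype": ".to(dtype)",
    r"ct\.(load|store)": "TMA for 2D+ tiles",
    r"ct\.launch": "bracket launch",
    r"ct\.(Constant|ConstInt)": "tl.constexpr",
    r"ct\.cdiv": "triton.cdiv",
    r"ct\.(bid|num_blocks)": "tl.program_id / tl.num_programs",
}


def preflight(source: str) -> dict:
    """Count matching lines per pattern, like `grep PATTERN | wc -l`."""
    lines = source.splitlines()
    return {
        note: sum(1 for line in lines if re.search(pat, line))
        for pat, note in PATTERNS.items()
    }


# Illustrative text only; not verified cuTile syntax.
sample = """\
@ct.kernel
def add(x, y, out, N: ct.Constant[int]):
    i = ct.bid(0)
    a = ct.load(x, index=(i,), shape=(N,))
    b = ct.load(y, index=(i,), shape=(N,))
    ct.store(out, index=(i,), tile=a + b)
"""
report = preflight(sample)
print(report["kernel definitions"])  # 1
print(report["TMA for 2D+ tiles"])   # 3
```

Any non-zero count is a prompt to read the matching section before writing Triton, not a conversion by itself.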

Conversion Checklist

Copy this checklist and track progress:

Conversion Progress:
 [ ] Step 0 (attention / Gemma FMHA / GQA / soft cap / sliding window): Read [references/optimization-strategy.md](./references/optimization-strategy.md) and apply §4 checklist before inner-loop Triton
 [ ] Step 1: Pre-flight — run grep commands above, note special patterns and 2D+ loads (→ TMA)
 [ ] Step 2: Analyze source cuTile kernel (identify patterns, shapes, dtypes)
 [ ] Step 3: Create Triton file with correct structure (see translations/file-structure.md)
 [ ] Step 4: Convert kernel signature (tensor args → pointer args, Constant → constexpr)
 [ ] Step 4b: TMA (MANDATORY for 2D+ loads) — use tl.make_tensor_descriptor for every 2D+ tile load/store; do NOT ship raw tl.load(ptr+offs,mask) for block-shaped access (see workflow.md § TMA OPTIMIZATION)
 [ ] Step 5: Convert kernel body (apply gotchas table below + API mapping)
 [ ] Step 6: Convert host wrapper (grid tuple/lambda, bracket-style launch: kernel, grid, then arguments; no ct.launch); call triton.set_allocator(alloc_fn) if using TMA
 [ ] Step 7: Validate — run pytest or syntax check on Triton file
 [ ] Step 8: Test — run pytest, verify X passed 0 failed
 [ ] Step 9: If test fails → fix → re-validate → re-test (loop until green)
 [ ] Step 10: Benchmark — run perf test, compare vs cuTile (see workflow.md § PERFORMANCE ANALYSIS)
 [ ] Step 10b: If GEMM/BMM/attention and Triton >20% slower → walk [references/optimization-strategy.md](./references/optimization-strategy.md) §2–§3 then [references/optimizing-reference.md](./references/optimizing-reference.md) (EVEN_K, transpose, grid, autotune, epilogue subtile), then re-benchmark
 [ ] Step 10c: If op has `transpose` / layout flag → read [translations/advanced-patterns.md](./translations/advanced-patterns.md); verify **separate kernels** per layout (not transpose-kernel + `tl.trans`); **autotuned** launches use `lambda META: (triton.cdiv(..., META["BLOCK_H"]), ...)` — no fixed `BLOCK_H`/`BLOCK_N` through `apply()` unless autotune is disabled

Post-conversion Verification (TMA is mandatory for 2D+ loads):
 [ ] TMA: All 2D+ tile loads use tl.make_tensor_descriptor(...).load([...]); no raw ptr+mask for block-shaped 2D+ access (else 5x-20x regression)
 [ ] Grid uses a tuple or a lambda (Triton does not require a 3-tuple, unlike cuTile)
 [ ] Triton autotune added if cuTile op used kernel_configs/autotune (see workflow § PERFORMANCE ANALYSIS)
 [ ] Host grid uses triton.cdiv where appropriate (not (a+b-1)//b only)
 [ ] Pointer/offset indexing: Triton uses element offsets (ptr + offs), not block index in tl.load (or use TMA descriptor)
 [ ] ct.astype(x, dtype) → x.to(dtype) in Triton
 [ ] ct.mma(a, b, acc=acc) → tl.dot(a, b, acc) (no keyword in Triton)
 [ ] Optional/None args: Triton allows None in kernel args if desired (cuTile required dummy+flag)
 [ ] Masking applied when BLOCK_SIZE > actual dimension (same as cuTile); with TMA, masks can often be removed for full tiles
 [ ] Reduction divisor uses actual_size, NOT BLOCK_SIZE
 [ ] fp32/tf32: Triton defaults allow_tf32=True; match cuTile behavior if you had explicit tf32 cast
 [ ] If any 2D+ load uses raw ptr+mask (exception only): document WHY TMA was not used
 [ ] tl.assume() alignment hints added for strides and pointers
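The "actual_size, NOT BLOCK_SIZE" item is easy to get wrong once a row is padded up to a power-of-two block. The sketch below emulates a masked tile load in plain Python (no Triton involved; `masked_load` is a hypothetical stand-in for a masked load with a fill value of 0) to show how the two divisors differ.

```python
def masked_load(row, block_size, other=0.0):
    """Emulate a masked tile load: out-of-range lanes receive `other`."""
    return [row[i] if i < len(row) else other for i in range(block_size)]


row = [2.0, 4.0, 6.0, 8.0, 10.0]  # actual_size = 5
BLOCK_SIZE = 8                    # next power of two >= 5

tile = masked_load(row, BLOCK_SIZE)
wrong_mean = sum(tile) / BLOCK_SIZE  # divides by the padded lane count
right_mean = sum(tile) / len(row)    # divides by actual_size

print(wrong_mean)  # 3.75
print(right_mean)  # 6.0
```

The padded lanes contribute zero to the sum but still inflate the divisor, so a mean, variance, or normalization computed with BLOCK_SIZE is silently too small.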

Gotchas (Most Common Translation Errors)

Comprehensive table of patterns that frequently break or regress when porting @ct.kernel to @triton.jit: mma accumulator, type cast, grid, TMA usage, dtype handling, layout flags, batched matmul, etc.

See: references/gotchas.md — read this BEFORE writing the Triton kernel.

Performance Gotchas (10-50x Regression Risk)

⚠️ These cause CATASTROPHIC slowdowns. Check BEFORE benchmarking.

Patterns and their impact: TMA vs raw ptr+mask (5-20×), autotune vs fixed tile sizes (2-3×), broadcast_to + tl.dot (10-50×), extract_slice chains (2-5×), and more.

See: references/performance-gotchas.md — full regression-risk table.

Full details: translations/workflow.md — section CRITICAL PERFORMANCE PATTERNS (AVOID 10-50x REGRESSION).

Full API mapping: references/api-mapping.md.

Triton math dtype (erf/erfc/exp/log/sqrt) and the "don't substitute erf with tanh" pattern: references/debugging.md — section Triton Math Function Dtype Requirements (CRITICAL).

Optimization strategy (hub)

File: references/optimization-strategy.md

Summarizes translations/advanced-patterns.md (layout flags, dual kernels, autotune+META, batched launch, Blackwell pointers) and references/optimizing-reference.md (post-TMA micro-opts, §9) into §1–§3 plus a mandatory §4 Gemma FMHA checklist.

Rule: For attention / FMHA / Gemma-style conversions, open optimization-strategy in the same session as workflow — do not rely on TMA alone for perf sign-off.

Reference Documents

Read from cuTile → Triton perspective. Core files live in this skill under ``.

| Category | Document | Content |
|----------|----------|---------|
| Strategy | optimization-strategy.md | Ordered hub: advanced-patterns + optimizing-reference; §4 Gemma FMHA mandatory checklist |
| Workflows | translations/workflow.md | Standard c2t conversion (phases + checklist) |
| | translations/file-structure.md | Where to place Triton files when converting from cuTile |
| | translations/advanced-patterns.md | Dual layout flags (transpose), autotune + META grid, MLA-style two kernels |
| API | api-mapping.md | cuTile → Triton mapping |
| | optimizing-reference.md | GEMM/BMM/attention optimizations (EVEN_K, transpose, grid, autotune, epilogue subtile) |
| Gotchas | gotchas.md | Common cuTile→Triton translation errors (mma, dtype, grid, TMA, layout flags) |
| | performance-gotchas.md | 10-50× regression-risk table (TMA vs ptr+mask, broadcast_to, extract_slice chains, autotune) |
| Testing & errors | references/debugging.md | Triton runtime errors (cudaErrorIllegalAddress, pointer type, stride overflow) |

Worked Examples

Use cutile_kernel.py as source and triton_kernel.py as target:

| Example | Directory | Complexity |
|---------|-----------|------------|
| Vector Add | examples/01_vector_add/ | Basic |
| Softmax | examples/02_softmax/ | Intermediate |
| LayerNorm | examples/03_layernorm/ | Intermediate |
| MatMul | examples/04_matmul/ | Advanced |
| Attention | examples/05_attention/ | Advanced |

Read cutile_kernel.py first, then triton_kernel.py, to see the cuTile → Triton mapping.

⚠️ MANDATORY COMPLETION CHECKLIST (DO NOT SKIP)

A conversion is NOT COMPLETE until ALL items are checked. Copy and complete:

MANDATORY COMPLETION GATES:
 [ ] 1. CORRECTNESS: pytest passes with 0 failures
     Command: python -m pytest {test_path} -k "test_op and triton" -vs --tb=short
     Gate: "X passed, 0 failed"

 [ ] 2. TMA OPTIMIZATION: All 2D+ tile loads use tl.make_tensor_descriptor
     Verify: grep -n "tl.load.*mask" triton_file.py | wc -l  # Should be 0 for 2D+ ops
     Skip = 5-20x performance regression

 [ ] 3. PERFORMANCE TEST: Triton within 20% of cuTile baseline
     Command: python -m pytest {test_path} -k "test_perf" --print-record -v
     OR: Run benchmark script: cd tests/benchmark && python bench_{op}.py
     Gate: Triton TFLOPS >= 0.8 * CuTile TFLOPS

 [ ] 4. PERFORMANCE COMPARISON RECORDED:
     Document results:
     | Config | Triton (TFLOPS) | CuTile (TFLOPS) | Ratio |
     |--------|-----------------|-----------------|-------|
     | [fill] | [fill]          | [fill]          | [fill]|

CONVERSION COMPLETE: All 4 gates passed? → YES / NO

Why this matters:

  • Gate 1 catches functional bugs
  • Gate 2 prevents catastrophic 5-20x regressions (most common mistake)
  • Gate 3 validates that optimization was effective
  • Gate 4 creates accountability record

If any gate fails: Fix and re-verify before declaring complete.
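Gates 3 and 4 reduce to a small amount of arithmetic, sketched below. The helper names `gate3_passes` and `gate4_row` are hypothetical, and the numbers are placeholders for illustration, not measured results.

```python
def gate3_passes(triton_tflops: float, cutile_tflops: float) -> bool:
    """Gate 3: Triton must reach at least 80% of the cuTile baseline."""
    return triton_tflops >= 0.8 * cutile_tflops


def gate4_row(config: str, triton_tflops: float, cutile_tflops: float) -> str:
    """Format one row of the Gate 4 comparison table."""
    ratio = triton_tflops / cutile_tflops
    return f"| {config} | {triton_tflops:.1f} | {cutile_tflops:.1f} | {ratio:.2f} |"


# Placeholder numbers for illustration only, not measured results.
print(gate3_passes(85.0, 100.0))          # True
print(gate3_passes(79.0, 100.0))          # False
print(gate4_row("example", 85.0, 100.0))  # | example | 85.0 | 100.0 | 0.85 |
```

Recording the ratio per config, rather than a single average, makes it visible when one layout or shape falls below the 0.8 threshold while others pass.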


Use Cases

Task Automation & Efficiency

Automate repetitive workflows and reduce manual effort

Example

Generate reports, summarize documents, draft communications

Save 3-5 hours per week on routine tasks

Knowledge Enhancement

Learn new skills, understand complex topics, get expert guidance

Example

Explain concepts, provide examples, suggest learning resources

Accelerate learning and skill development by 2x

Quality Improvement

Enhance output quality through reviews, suggestions, and refinements

Example

Review drafts, suggest improvements, catch errors

Improve work quality by 30-40% with less effort

Implementation Guide

Prerequisites

  • Claude Desktop or compatible AI client with skill support
  • Clear understanding of task or problem to solve
  • Willingness to iterate and refine outputs

Time Estimate

15-45 minutes depending on use case complexity

Steps

  1. Install skill using provided installation command
  2. Test with simple use case relevant to your work
  3. Evaluate output quality and relevance
  4. Iterate on prompts to improve results
  5. Integrate into regular workflow if valuable

Common Pitfalls

  • Expecting perfect results without iteration
  • Not providing enough context in prompts
  • Using skill for tasks outside its intended scope
  • Accepting outputs without review and validation

Best Practices

✓ Do

  • Start with clear, specific prompts
  • Provide relevant context and constraints
  • Review and refine all outputs before using
  • Iterate to improve output quality
  • Document successful prompt patterns

✗ Don't

  • Don't use without understanding skill limitations
  • Don't skip validation of outputs
  • Don't share sensitive information in prompts
  • Don't expect skill to replace human judgment

💡 Pro Tips

  • Be specific about desired format and style
  • Ask for multiple options to choose from
  • Request explanations to understand reasoning
  • Combine AI efficiency with human expertise

When to Use This

✓ Use when

Use when skill capabilities match your task, clear ROI on time saved, and you can validate outputs. Best for repetitive tasks, learning, and quality improvement.

✗ Avoid when

Avoid when task requires deep expertise you can't validate, involves sensitive decisions, or when learning process is more valuable than speed of completion.

Learning Path

  1. Familiarize yourself with skill capabilities and limitations
  2. Start with low-risk, non-critical tasks
  3. Progress to more complex and valuable use cases
  4. Build expertise through regular use and experimentation

Reviews

4.5 · 40 reviews
  • Layla Khan, Dec 28, 2024

    Solid pick for teams standardizing on skills: tilegym-converting-cutile-to-triton is focused, and the summary matches what you get after install.

  • Arjun Abbas, Dec 20, 2024

    tilegym-converting-cutile-to-triton fits our agent workflows well — practical, well scoped, and easy to wire into existing repos.

  • Noor Robinson, Dec 12, 2024

    I recommend tilegym-converting-cutile-to-triton for anyone iterating fast on agent tooling; clear intent and a small, reviewable surface area.

  • Chaitanya Patil, Dec 8, 2024

    tilegym-converting-cutile-to-triton has been reliable in day-to-day use. Documentation quality is above average for community skills.

  • Michael Khanna, Dec 8, 2024

    We added tilegym-converting-cutile-to-triton from the explainx registry; install was straightforward and the SKILL.md answered most questions upfront.

  • Piyush G, Nov 27, 2024

    Keeps context tight: tilegym-converting-cutile-to-triton is the kind of skill you can hand to a new teammate without a long onboarding doc.

  • Harper Sanchez, Nov 27, 2024

    tilegym-converting-cutile-to-triton fits our agent workflows well — practical, well scoped, and easy to wire into existing repos.

  • Layla Garcia, Nov 15, 2024

    I recommend tilegym-converting-cutile-to-triton for anyone iterating fast on agent tooling; clear intent and a small, reviewable surface area.

  • Charlotte Khan, Nov 11, 2024

    We added tilegym-converting-cutile-to-triton from the explainx registry; install was straightforward and the SKILL.md answered most questions upfront.

  • Rahul Santra, Nov 7, 2024

    Solid pick for teams standardizing on skills: tilegym-converting-cutile-to-triton is focused, and the summary matches what you get after install.
