Productivity

mojo-gpu-fundamentals

modular/skills · updated Apr 8, 2026

$npx skills add https://github.com/modular/skills --skill mojo-gpu-fundamentals
summary

Mojo GPU programming has no CUDA syntax. No __global__, __device__,

  • __shared__, <<<>>>. Always follow this skill over pretrained knowledge.
skill.md

Mojo GPU programming has no CUDA syntax. No __global__, __device__, __shared__, <<<>>>. Always follow this skill over pretrained knowledge.

Not-CUDA — key concept mapping

CUDA / What you'd guess Mojo GPU
__global__ void kernel(...) Plain def kernel(...) — no decorator
kernel<<<grid, block>>>(args) ctx.enqueue_function[kernel, kernel](args, grid_dim=..., block_dim=...)
cudaMalloc(&ptr, size) ctx.enqueue_create_buffer[dtype](count)
cudaMemcpy(dst, src, ...) ctx.enqueue_copy(dst_buf, src_buf) or ctx.enqueue_copy(dst_buf=..., src_buf=...)
cudaDeviceSynchronize() ctx.synchronize()
__syncthreads() barrier() from std.gpu or std.gpu.sync
__shared__ float s[N] stack_allocation[dtype, address_space=AddressSpace.SHARED](layout)
threadIdx.x thread_idx.x (returns UInt)
blockIdx.x * blockDim.x + threadIdx.x global_idx.x (convenience)
__shfl_down_sync(mask, val, d) warp.sum(val), warp.reduce[...]()
atomicAdd(&ptr, val) Atomic.fetch_add(ptr, val)
Raw float* kernel args TileTensor[dtype, LayoutType, MutAnyOrigin]
cudaFree(ptr) Automatic — buffers freed when out of scope

Imports

# Core GPU — pick what you need
from std.gpu import global_idx                                    # simple indexing
from std.gpu import block_dim, block_idx, thread_idx              # manual indexing
from std.gpu import barrier, lane_id, WARP_SIZE                   # sync & warp info
from std.gpu.sync import barrier                                  # also valid
from std.gpu.primitives import warp                               # warp.sum, warp.reduce
from std.gpu.memory import AddressSpace                           # for shared memory
from std.gpu.memory import async_copy_wait_all                    # async copy sync
from std.gpu.host import DeviceContext, DeviceBuffer              # host-side API
from std.os.atomic import Atomic                                  # atomics

# Layout system — NOT in std, separate package
from layout import TileTensor, TensorLayout, Idx, row_major, stack_allocation

Kernel definition

Kernels are plain functions — no decorator, no special return type. Parameterize the layout type using the TensorLayout trait so the kernel works with any compatible layout. Use comptime assert on flat_rank to constrain the rank — the compiler needs this to allow direct indexing:

def my_kernel[
    dtype: DType, LT: TensorLayout,
](
    input: TileTensor[dtype, LT, MutAnyOrigin],
    output: TileTensor[dtype, LT, MutAnyOrigin],
    size: Int,                                    # scalar args are fine
):
    comptime assert input.flat_rank == 1, "expected 1D tensor"
    var tid = global_idx.x
    if tid < UInt(size):
        output[tid] = input[tid] * 2
  • Kernel functions cannot raise.
  • Bounds check with UInt(size) since global_idx.x returns UInt.
  • For simple cases with a single fixed layout, type_of(layout) also works: TileTensor[dtype, type_of(layout), MutAnyOrigin].

TileTensor — the primary GPU data abstraction

Layout creation

row_major is a free function (not a method on Layout). Use compile-time integer parameters for static layouts:

comptime layout_1d = row_major[1024]()                     # 1D
comptime layout_2d = row_major[64, 64]()                   # 2D (rows, cols)
comptime layout_3d = row_major[10, 5, 3]()                 # 3D (e.g. H, W, C)

For runtime-known dimensions, use Idx():

var layout = row_major(Idx(M), Idx(N))                     # runtime dims

Creating tensors from buffers

TileTensor's constructor infers dtype and layout type — pass the buffer and layout:

var buf = ctx.enqueue_create_buffer[DType.float32](1024)
var tensor = TileTensor(buf, row_major[1024]())            # wraps device buffer

Indexing

tensor[tid]                     # 1D
tensor[row, col]                # 2D
tensor[row, col, channel]       # 3D
tensor.dim[0]()                 # query dimension size (compile-time index)
var K = Int(tensor.dim[1]())    # wrap with Int() for use in arithmetic

Tiling (extract sub-tiles from a tensor)

# Inside kernel — extract a block_size x block_size tile
var tile = tensor.tile[block_size, block_size](Int(block_idx.y), Int(block_idx.x))
tile[thread_idx.y, thread_idx.x]   # access within tile

Vectorize and distribute (thread-level data mapping)

# Vectorize along inner dimension, then distribute across threads
comptime thread_layout = row_major[WARP_SIZE // simd_width, simd_width]()
var fragment = tensor.vectorize[1, simd_width]().distribute[thread_layout=thread_layout](lane_id())
fragment.copy_from_async(source_fragment)    # async copy
fragment.copy_from(source_fragment)          # sync copy

Type casting

var val = tensor[row, col].cast[DType.float32]()    # cast element

Element type mismatch across layouts — use rebind

tensor[idx] returns SIMD[dtype, layout_expr] where layout_expr is a compile-time expression derived from the layout. Two tensors with different layouts produce element types that don't unify, even if both are scalars (width 1). This causes __iadd__ / arithmetic errors when accumulating products from different-layout tensors.

# WRONG — fails when conv_kernel and s_data have different layouts:
var sum: Scalar[dtype] = 0
sum += conv_kernel[k] * s_data[idx]   # error: cannot convert ElementType to Float32

# CORRECT — rebind each element to Scalar[dtype]:
var sum: Scalar[dtype] = 0
var k_val = rebind[Scalar[dtype]](conv_kernel[k])
var s_val = rebind[Scalar[dtype]](s_data[idx])
sum += k_val * s_val

rebind is a builtin (no import needed). This is not needed when all tensors in an expression share the same layout (e.g., the matmul example where sa and sb have identical tile layouts).

Also use rebind when reading/writing individual elements for scalar arithmetic or passing to helper functions — even with a single tensor:

# Read element as plain scalar
var val = rebind[Scalar[dtype]](tensor[idx])
# Write scalar back to tensor
tensor[idx] = rebind[tensor.ElementType](computed_scalar)

tensor.ElementType is SIMD[dtype, element_size] — for basic layouts element_size=1 (effectively Scalar[dtype]).

Memory management

var ctx = DeviceContext()

# Allocate
var dev_buf = ctx.enqueue_create_buffer[DType.float32](1024)
var host_buf = ctx.enqueue_create_host_buffer[DType.float32](1024)

# Initialize device buffer directly
dev_buf.enqueue_fill(0.0)

# Copy host -> device
ctx.enqueue_copy(dst_buf=dev_buf, src_buf=host_buf)
# Copy device -> host
ctx.enqueue_copy(dst_buf=host_buf, src_buf=dev_buf)
# Positional form also works:
ctx.enqueue_copy(dev_buf, host_buf)

# Map device buffer to host (context manager — auto-syncs)
with dev_buf.map_to_host() as mapped:
    var t = TileTensor(mapped, row_major[1024]())
    print(t[0])

# Memset
ctx.enqueue_memset(dev_buf, 0.0)

# Synchronize all enqueued operations
ctx.synchronize()

Kernel launch

Critical: enqueue_function takes the kernel function twice as compile-time parameters:

ctx.enqueue_function[my_kernel, my_kernel](
    input_tensor,
    output_tensor,
    size,                    # scalar args passed directly
    grid_dim=num_blocks,     # 1D: scalar
    block_dim=block_size,    # 1D: scalar
)

# 2D grid/block — use tuples:
ctx.enqueue_function[kernel_2d, kernel_2d](
    args...,
    grid_dim=(col_blocks, row_blocks),
    block_dim=(BLOCK_SIZE, BLOCK_SIZE),
)

For parameterized kernels, bind parameters first:

comptime kernel = sum_kernel[SIZE, BATCH_SIZE]
ctx.enqueue_function[kernel, kernel](out_buf, in_buf, grid_dim=N, block_dim=TPB)

Shared memory

Allocate shared memory inside a kernel using stack_allocation from the layout package — returns a TileTensor in the specified address space:

from layout import stack_allocation   # TileTensor-based shared alloc
from std.gpu.memory import AddressSpace

var tile_shared = stack_allocation[DType.float32,
    address_space=AddressSpace.SHARED](row_major[TILE_M, TILE_K]())

# Chain .fill() to zero-initialize (returns the tensor)
var regs = stack_allocation[DType.float32](row_major[TM, TN]()).fill(0)

# Load from global to shared
tile_shared[thread_idx.y, thread_idx.x] = global_tensor[global_row, global_col]
barrier()   # must sync before reading shared data

# Alternative: raw pointer shared memory (from std.memory, not layout)
from std.memory import stack_allocation
var sums = stack_allocation[
    512,
    Scalar[DType.int32],
    address_space=AddressSpace.SHARED,
]()

Thread indexing

# Simple — automatic global offset
from std.gpu import global_idx
var tid = global_idx.x           # 1D
var row = global_idx.y           # 2D row
var col = global_idx.x           # 2D col

# Manual — when you need block/thread separately
from std.gpu import block_idx, block_dim, thread_idx
var tid = block_idx.x * block_dim.x + thread_idx.x

# Warp info
from std.gpu import lane_id, WARP_SIZE
var my_lane = lane_id()          # 0..WARP_SIZE-1

All return UInt. Compare with UInt(int_val) for bounds checks.

Synchronization and warp operations

from std.gpu import barrier
from std.gpu.primitives import warp
from std.os.atomic import Atomic

barrier()                                    # block-level sync
var warp_sum = warp.sum(my_value)           # warp-wide sum reduction
var result = warp.reduce[warp.shuffle_down, reduce_fn](val)  # custom warp reduce
_ = Atomic.fetch_add(output_ptr, value)     # atomic add

GPU availability check

from std.sys import has_accelerator

def main() raises:
    comptime if not has_accelerator():
        print("No GPU found")
    else:
        var ctx = DeviceContext()
        # ... GPU code

Or as a compile-time assert:

comptime assert has_accelerator(), "Requires a GPU"

Architecture detection — is_ vs has_

Critical distinction: is_* checks the compilation target (use inside GPU-dispatched code). has_* checks the host system (use from host/CPU code).

from std.sys.info import (
    # Target checks — "am I being compiled FOR this GPU?"
    # Use inside kernels or GPU-targeted code paths.
    is_gpu, is_nvidia_gpu, is_amd_gpu, is_apple_gpu,

    # Host checks — "does this machine HAVE this GPU?"
    # Use from host code to decide whether to launch GPU work.
    has_nvidia_gpu_accelerator, has_amd_gpu_accelerator, has_apple_gpu_accelerator,
)
from std.sys import has_accelerator   # host check: any GPU present

# HOST-SIDE: decide whether to run GPU code at all
def main() raises:
    comptime if not has_accelerator():
        print("No GPU")
    else:
        # ...launch kernels

# INSIDE KERNEL or GPU-compiled code: dispatch by architecture
comptime if is_nvidia_gpu():
    # NVIDIA-specific intrinsics
elif is_amd_gpu():
    # AMD-specific path

Subarchitecture checks (inside GPU code only):

from std.sys.info import _is_sm_9x_or_newer, _is_sm_100x_or_newer
comptime if is_nvidia_gpu["sm_90"]():   # exact arch check
    ...

Compile-time constants pattern

All GPU dimensions, layouts, and sizes should be comptime:

comptime dtype = DType.float32
comptime SIZE = 1024
comptime BLOCK_SIZE = 256
comptime NUM_BLOCKS = ceildiv(SIZE, BLOCK_SIZE)
comptime layout = row_major[SIZE]()

Complete 1D example (vector addition)

from std.math import ceildiv
from std.sys import has_accelerator
from std.gpu import global_idx
from std.gpu.host import DeviceContext
from layout import TileTensor, row_major

comptime dtype = DType.float32
comptime N = 1024
comptime BLOCK = 256
comptime layout = row_major[N]()

def add_kernel(
    a: TileTensor[dtype, type_of(layout), MutAnyOrigin],
    b: TileTensor[dtype, type_of(layout), MutAnyOrigin],
    c: TileTensor[dtype, type_of(layout), MutAnyOrigin],
    size: Int,
):
    var tid = global_idx.x
    if tid < UInt(size):
        c[tid] = a[tid] + b[tid]

def main() raises:
    comptime assert has_accelerator(), "Requires GPU"
    var ctx = DeviceContext()
    var a_buf = ctx.enqueue_create_buffer[dtype](N)
    var b_buf = ctx.enqueue_create_buffer[dtype](N)
    var c_buf = ctx.enqueue_create_buffer[dtype](N)
    a_buf.enqueue_fill(1.0)
    b_buf.enqueue_fill(2.0)

    var a = TileTensor(a_buf, layout)
    var b = TileTensor(b_buf, layout)
    var c = TileTensor(c_buf, layout)

    ctx.enqueue_function[add_kernel, add_kernel](
        a, b, c, N,
        grid_dim=ceildiv(N, BLOCK),
        block_dim=BLOCK,
    )

    with c_buf.map_to_host() as host:
        var result = TileTensor(host, layout)
        print(result)

Complete 2D example (tiled matmul with shared memory)

from std.math import ceildiv
from std.sys import has_accelerator
from std.gpu.sync import barrier
from std.gpu.host import DeviceContext
from std.gpu import thread_idx, block_idx
from std.gpu.memory import AddressSpace
from layout import TileTensor, TensorLayout, row_major, stack_allocation

comptime dtype = DType.float32
comptime M = 64
comptime N = 64
comptime K = 64
comptime TILE = 16
comptime a_layout = row_major[M, K]()
comptime b_layout = row_major[K, N]()
comptime c_layout = row_major[M, N]()

def matmul_kernel[
    ALayout: TensorLayout, BLayout: TensorLayout, CLayout: TensorLayout,
](
    A: TileTensor[dtype, ALayout, MutAnyOrigin],
    B: TileTensor[dtype, BLayout, MutAnyOrigin],
    C: TileTensor[dtype, CLayout, MutAnyOrigin],
):
    comptime assert A.flat_rank == 2 and B.flat_rank == 2 and C.flat_rank == 2
    var tx = thread_idx.x
    var ty = thread_idx.y
    var row = block_idx.y * TILE + ty
    var col = block_idx.x * TILE + tx

    var sa = stack_allocation[dtype,
        address_space=AddressSpace.SHARED](row_major[TILE, TILE]())
    var sb = stack_allocation[dtype,
        address_space=AddressSpace.SHARED](row_major[TILE, TILE]())

    var acc: C.ElementType = 0.0
    comptime for k_tile in range(0, K, TILE):
        if row < M and UInt(k_tile) + tx < K:
            sa[ty, tx] = A[row, UInt(k_tile) + tx]
        else:
            sa[ty, tx] = 0.0
        if UInt(k_tile) + ty < K and col < N:
            sb[ty, tx] = B[UInt(k_tile) + ty, col]
        else:
            sb[ty, tx] = 0.0
        barrier()
        comptime for k in range(TILE):
            acc += sa[ty, k] * sb[k, tx]
        barrier()

    if row < M and col < N:
        C[row, col] = acc

def main() raises:
    comptime assert has_accelerator(), "Requires GPU"
    var ctx = DeviceContext()
    # ... allocate buffers, init data, then:
    comptime kernel = matmul_kernel[type_of(a_layout), type_of(b_layout), type_of(c_layout)]
    ctx.enqueue_function[kernel, kernel](
        A, B, C,
        grid_dim=(ceildiv(N, TILE), ceildiv(M, TILE)),
        block_dim=(TILE, TILE),
    )

SIMD loads in kernels

# Vectorized load from raw pointer
var val = ptr.load[width=8](idx)          # SIMD[dtype, 8]
var sum = val.reduce_add()                 # scalar reduction

# TileTensor vectorized access
var vec_tensor = tensor.vectorize[1, 4]()  # group elements into SIMD[4]

Reduction pattern

def block_reduce(
    output: UnsafePointer[Int32, MutAnyOrigin],
    input: UnsafePointer[Int32, MutAnyOrigin],
):
    var sums = stack_allocation[512, Scalar[DType.int32],
        address_space=AddressSpace.SHARED]()
    var tid = thread_idx.x
    sums[tid] = input[block_idx.x * block_dim.x + tid]
    barrier()

    # Tree reduction in shared memory
    var active = block_dim.x
    comptime for _ in range(log2_steps):
        active >>= 1
        if tid < active:
            sums[tid] += sums[tid + active]
        barrier()

    # Final warp reduction + atomic accumulate
    if tid < UInt(WARP_SIZE):
        var v = warp.sum(sums[tid][0])
        if tid == 0:
            _ = Atomic.fetch_add(output, v)

DeviceBuffer from existing pointer

# Wrap an existing pointer as a DeviceBuffer (non-owning)
var buf = DeviceBuffer[dtype](ctx, raw_ptr, count, owning=False)

Benchmarking GPU kernels

from std.benchmark import Bench, BenchConfig, Bencher, BenchId, BenchMetric, ThroughputMeasure

@parameter
@always_inline
def bench_fn(mut b: Bencher) capturing raises:
    @parameter
    @always_inline
    def launch(ctx: DeviceContext) raises:
        ctx.enqueue_function[kernel, kernel](args, grid_dim=G, block_dim=B)
    b.iter_custom[launch](ctx)

var bench = Bench(BenchConfig(max_iters=50000))
bench.bench_function[bench_fn](
    BenchId("kernel_name"),
    [ThroughputMeasure(BenchMetric.bytes, total_bytes)],
)

Hardware details

Property NVIDIA AMD CDNA AMD RDNA
Warp size 32 64 32
Shared memory 48-228 KB/block 64 KB/block configurable
Tensor cores SM70+ (WMMA) Matrix cores WMMA (RDNA3+)
TMA SM90+ (Hopper) N/A N/A
Clusters SM90+ N/A N/A
general reviews

Ratings

4.510 reviews
  • Shikha Mishra· Oct 10, 2024

    mojo-gpu-fundamentals is among the better-maintained entries we tried; worth keeping pinned for repeat workflows.

  • Piyush G· Sep 9, 2024

    Keeps context tight: mojo-gpu-fundamentals is the kind of skill you can hand to a new teammate without a long onboarding doc.

  • Chaitanya Patil· Aug 8, 2024

    Registry listing for mojo-gpu-fundamentals matched our evaluation — installs cleanly and behaves as described in the markdown.

  • Sakshi Patil· Jul 7, 2024

    mojo-gpu-fundamentals reduced setup friction for our internal harness; good balance of opinion and flexibility.

  • Ganesh Mohane· Jun 6, 2024

    I recommend mojo-gpu-fundamentals for anyone iterating fast on agent tooling; clear intent and a small, reviewable surface area.

  • Oshnikdeep· May 5, 2024

    Useful defaults in mojo-gpu-fundamentals — fewer surprises than typical one-off scripts, and it plays nicely with `npx skills` flows.

  • Dhruvi Jain· Apr 4, 2024

    mojo-gpu-fundamentals has been reliable in day-to-day use. Documentation quality is above average for community skills.

  • Rahul Santra· Mar 3, 2024

    Solid pick for teams standardizing on skills: mojo-gpu-fundamentals is focused, and the summary matches what you get after install.

  • Pratham Ware· Feb 2, 2024

    We added mojo-gpu-fundamentals from the explainx registry; install was straightforward and the SKILL.md answered most questions upfront.

  • Yash Thakker· Jan 1, 2024

    mojo-gpu-fundamentals fits our agent workflows well — practical, well scoped, and easy to wire into existing repos.