tag

nemo-mbridge

20 indexed skills · max 10 per page

skills (20)

nemo-mbridge-perf-moe-dispatcher-selection

nvidia/skills · nemo-mbridge

0

Choose the right MoE token dispatcher (`alltoall`, DeepEP, or HybridEP) for the hardware, EP degree, and optimization stage. Summarizes patterns from DSV3, Qwen3, Qwen3-Next, and VLM bring-up work.

nemo-mbridge-perf-moe-long-context

nvidia/skills · nemo-mbridge

0

Long-context MoE training guidance for Megatron Bridge. Covers CP sizing, selective recompute, dispatcher choices, and practical patterns from DSV3, Qwen3, and Qwen3-Next long-context experiments.

nemo-mbridge-perf-moe-comm-overlap

nvidia/skills · nemo-mbridge

0

MoE expert-parallel communication overlap in Megatron Bridge. Covers dispatch/combine overlap, flex dispatcher backends, and expert wgrad scheduling.

nemo-mbridge-perf-moe-hardware-configs

nvidia/skills · nemo-mbridge

0

Representative MoE training playbooks by hardware platform and model family. Summarizes rounded throughput bands, parallelism patterns, and common tuning stacks.

nemo-mbridge-recipe-recommender

nvidia/skills · nemo-mbridge

0

Recommend and customize Megatron Bridge recipes for a user's model, GPU count, and training goal. Indexes library recipes (pretrain/SFT/PEFT) and performance recipes.

nemo-mbridge-perf-tp-dp-comm-overlap

nvidia/skills · nemo-mbridge

0

Operational guide for enabling TP, DP, and PP communication overlap in Megatron-Bridge, including config knobs, code anchors, pitfalls, and verification.

nemo-mbridge-perf-sequence-packing

nvidia/skills · nemo-mbridge

0

Validate and use packed sequences and long-context training in Megatron-Bridge, distinguishing offline packed SFT for LLMs from in-batch packing for VLMs, and applying the right CP constraints.

nemo-mbridge-perf-expert-parallel-overlap

nvidia/skills · nemo-mbridge

0

Validate and use MoE expert-parallel communication overlap in Megatron-Bridge, including `overlap_moe_expert_parallel_comm`, `delay_wgrad_compute`, and flex dispatcher backends such as DeepEP and HybridEP.

nemo-mbridge-resiliency

nvidia/skills · nemo-mbridge

0

Resiliency features in Megatron Bridge, including fault tolerance, straggler detection, in-process restart, preemption, and the re-run state machine.

nemo-mbridge-perf-hierarchical-context-parallel

nvidia/skills · nemo-mbridge

0

Operational guide for enabling hierarchical context parallelism in Megatron-Bridge, including config knobs, code anchors, pitfalls, and verification.
