nemo-mbridge
20 indexed skills · max 10 per page
nemo-mbridge-perf-moe-dispatcher-selection
nvidia/skills · nemo-mbridge
Choose the right MoE token dispatcher (`alltoall`, DeepEP, or HybridEP) for the hardware, EP degree, and optimization stage. Summarizes patterns from DSV3, Qwen3, Qwen3-Next, and VLM bring-up work.
nemo-mbridge-perf-moe-long-context
nvidia/skills · nemo-mbridge
Long-context MoE training guidance for Megatron Bridge. Covers CP sizing, selective recompute, dispatcher choices, and practical patterns from DSV3, Qwen3, and Qwen3-Next long-context experiments.
nemo-mbridge-perf-moe-comm-overlap
nvidia/skills · nemo-mbridge
MoE expert-parallel communication overlap in Megatron Bridge. Covers dispatch/combine overlap, flex dispatcher backends, and expert wgrad scheduling.
nemo-mbridge-perf-moe-hardware-configs
nvidia/skills · nemo-mbridge
Representative MoE training playbooks by hardware platform and model family. Summarizes rounded throughput bands, parallelism patterns, and common tuning stacks.
nemo-mbridge-recipe-recommender
nvidia/skills · nemo-mbridge
Recommend and customize Megatron Bridge recipes for a user's model, GPU count, and training goal. Indexes library recipes (pretrain/SFT/PEFT) and performance recipes.
nemo-mbridge-perf-tp-dp-comm-overlap
nvidia/skills · nemo-mbridge
Operational guide for enabling TP, DP, and PP communication overlap in Megatron-Bridge, including config knobs, code anchors, pitfalls, and verification.
nemo-mbridge-perf-sequence-packing
nvidia/skills · nemo-mbridge
Validate and use packed sequences and long-context training in Megatron-Bridge, distinguishing offline packed SFT for LLMs from in-batch packing for VLMs, and applying the right CP constraints.
nemo-mbridge-perf-expert-parallel-overlap
nvidia/skills · nemo-mbridge
Validate and use MoE expert-parallel communication overlap in Megatron-Bridge, including `overlap_moe_expert_parallel_comm`, `delay_wgrad_compute`, and flex dispatcher backends such as DeepEP and HybridEP.
nemo-mbridge-resiliency
nvidia/skills · nemo-mbridge
Resiliency features in Megatron Bridge, including fault tolerance, straggler detection, in-process restart, preemption, and the re-run state machine.
nemo-mbridge-perf-hierarchical-context-parallel
nvidia/skills · nemo-mbridge
Operational guide for enabling hierarchical context parallelism in Megatron-Bridge, including config knobs, code anchors, pitfalls, and verification.