nemo-mbridge✦ Official

nemo-mbridge-mlm-bridge-training

Run Megatron-LM (MLM) and Megatron Bridge training with mock or real data. Covers correlation testing, available recipes, and multi-GPU examples.

nvidia/skills · Updated Jun 23, 2026

Works with

Claude Code, Cursor, Cline, Windsurf, Codex, Goose, GitHub Copilot, Zed

0 total installs · 0 this week · 1.7K GitHub stars · 0 upvotes

Install Skill

Run in your terminal

$ npx skills install nvidia/skills/nemo-mbridge-mlm-bridge-training

Installation Guide

How to use nemo-mbridge-mlm-bridge-training on Cursor

AI-first code editor with Composer

1. Prerequisites

Before installing skills in Cursor, ensure your development environment meets these requirements:

  • Cursor installed and configured on your machine
  • Node.js 16+ with npm — verify with node --version
  • Active project directory where you want to add nemo-mbridge-mlm-bridge-training
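
The Node.js requirement above can be checked mechanically. The following is a minimal sketch, assuming `node --version` prints a string of the form `v18.19.0`; `node_version_ok` is a hypothetical helper name, not part of the skills CLI.

```shell
# Succeed when a Node.js version string such as "v18.19.0" has major version >= 16.
node_version_ok() {
  version="${1#v}"          # strip the leading "v", if present
  major="${version%%.*}"    # keep everything before the first dot
  case "$major" in
    ''|*[!0-9]*) return 1 ;;  # empty or non-numeric: treat as unsupported
  esac
  [ "$major" -ge 16 ]
}

# Usage: check the locally installed Node.js, if there is one.
if command -v node >/dev/null 2>&1; then
  if node_version_ok "$(node --version)"; then
    echo "Node.js version is sufficient"
  else
    echo "Node.js 16+ is required" >&2
  fi
fi
```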

2. Run the install command

Execute the skills CLI command in your project's root directory to begin installation:

$ npx skills install nvidia/skills/nemo-mbridge-mlm-bridge-training

Fetches nemo-mbridge-mlm-bridge-training from nvidia/skills and configures it for Cursor.

3. Select Cursor when prompted

The CLI shows a list of agents. Use arrow keys and space to select Cursor:

◆ Which agents do you want to install to?
│ ── Universal (.agents/skills) ────────────────
│ · Cline · Codex · Goose · Windsurf
│ ● Cursor (selected)
│ · Cursor · Aider · Continue

4. Verify installation

Confirm successful installation by checking the skill directory location:

.cursor/skills/nemo-mbridge-mlm-bridge-training

Restart Cursor to activate nemo-mbridge-mlm-bridge-training. Access via /nemo-mbridge-mlm-bridge-training in your agent's command palette.

Security Notice

Automated surface-level scans (Gen AI Scanner, Socket, Snyk) run during installation. These checks detect common vulnerabilities but do not guarantee complete security. Skills execute code in your environment, so always review the skill's source code, verify the publisher's reputation, and test in isolation before production use.

Documentation

name: nemo-mbridge-mlm-bridge-training
description: Run Megatron-LM (MLM) and Megatron Bridge training with mock or real data. Covers correlation testing, available recipes, and multi-GPU examples.
license: Apache-2.0
when_to_use: Running training, comparing MLM vs Bridge loss curves, translating MLM CLI args to Bridge config, or investigating why loss curves diverged after a commit; 'how do I run training', 'MLM vs Bridge', 'correlation test'.

MLM vs Bridge Training

For how they differ, the arg mapping tables, gotchas, and translation script, see:

  • @docs/megatron-lm-to-megatron-bridge.md

First Answer Checklist

For MLM-vs-Bridge correlation questions, always name these items up front:

  1. Bridge recipe: vanilla_gpt_pretrain_config.
  2. Bridge entry point: scripts/training/run_recipe.py.
  3. MLM entry point: 3rdparty/Megatron-LM/pretrain_gpt.py.
  4. Launch wrapper for both: uv run python -m torch.distributed.run.
  5. Fresh-run cleanup: rm -rf nemo_experiments before the Bridge run.

Also state three things: MLM needs PYTHONPATH=3rdparty/Megatron-LM:$PYTHONPATH; with matched settings, Bridge and MLM losses should agree to within BF16 rounding; and files under 3rdparty/Megatron-LM/ should not be modified from this repo.
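
Before launching either run, it can help to confirm that the two entry points named in the checklist are actually present. This is a small sketch, assuming it is invoked with the Bridge repository root as its argument; `check_entry_points` is a hypothetical helper, not a script shipped with the repo.

```shell
# Check that both entry points from the checklist exist under a repo root.
# Prints each missing path and returns non-zero if any is missing.
check_entry_points() {
  root="${1:-.}"
  missing=0
  for path in \
    scripts/training/run_recipe.py \
    3rdparty/Megatron-LM/pretrain_gpt.py
  do
    if [ ! -f "$root/$path" ]; then
      echo "missing: $path"
      missing=1
    fi
  done
  return "$missing"
}
```

From the repository root, `check_entry_points . && echo "entry points found"` gives a quick go/no-go. A missing pretrain_gpt.py usually points at an uninitialized submodule rather than a file to create by hand.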

Correlation Testing

Use vanilla_gpt_pretrain_config for loss-correlation testing. This recipe uses the bare GPTModelProvider defaults (LayerNorm, GeLU, learned_absolute position embeddings, and vocab_size inherited from the tokenizer), which match the defaults of MLM's pretrain_gpt.py when no arguments override them.

MLM Correlation Run (2L/256H, 1 GPU)

PYTHONPATH=3rdparty/Megatron-LM:$PYTHONPATH \
uv run python -m torch.distributed.run --nproc_per_node=1 \
  3rdparty/Megatron-LM/pretrain_gpt.py \
  --num-layers 2 --hidden-size 256 --num-attention-heads 4 \
  --ffn-hidden-size 1024 --seq-length 512 --max-position-embeddings 512 \
  --micro-batch-size 4 --global-batch-size 32 \
  --train-iters 10 --eval-iters 2 --eval-interval 10 \
  --mock-data --bf16 --use-mcore-models \
  --tokenizer-type NullTokenizer --vocab-size 32000 \
  --lr 3e-4 --min-lr 3e-5 --seed 1234 --log-interval 1

Bridge Correlation Run (same config, 1 GPU)

rm -rf nemo_experiments && \
uv run python -m torch.distributed.run --nproc_per_node=1 \
  scripts/training/run_recipe.py \
  --recipe vanilla_gpt_pretrain_config \
  model.num_layers=2 model.hidden_size=256 \
  model.num_attention_heads=4 model.ffn_hidden_size=1024 \
  model.seq_length=512 dataset.sequence_length=512 \
  train.train_iters=10 train.global_batch_size=32 train.micro_batch_size=4 \
  validation.eval_interval=10 validation.eval_iters=2 \
  optimizer.lr=3e-4 optimizer.min_lr=3e-5 \
  scheduler.lr_warmup_iters=1 scheduler.lr_decay_iters=10 \
  rng.seed=1234 logger.log_interval=1
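
The two correlation commands pair each MLM flag with a Bridge override. As a rough aid, the pairs visible in this document can be collected into a lookup. The sketch below covers only flags that appear in the commands here; `mlm_flag_to_bridge_key` is a hypothetical helper name, and the mapping tables in @docs/megatron-lm-to-megatron-bridge.md remain the authority.

```shell
# Map one MLM flag to the Bridge override key(s) used in the paired runs
# shown in this document. Returns non-zero for flags with no pair shown here.
mlm_flag_to_bridge_key() {
  case "$1" in
    --num-layers)                 echo "model.num_layers" ;;
    --hidden-size)                echo "model.hidden_size" ;;
    --num-attention-heads)        echo "model.num_attention_heads" ;;
    --ffn-hidden-size)            echo "model.ffn_hidden_size" ;;
    # The Bridge runs set the sequence length in two places.
    --seq-length)                 echo "model.seq_length dataset.sequence_length" ;;
    --micro-batch-size)           echo "train.micro_batch_size" ;;
    --global-batch-size)          echo "train.global_batch_size" ;;
    --train-iters)                echo "train.train_iters" ;;
    --eval-iters)                 echo "validation.eval_iters" ;;
    --eval-interval)              echo "validation.eval_interval" ;;
    --lr)                         echo "optimizer.lr" ;;
    --min-lr)                     echo "optimizer.min_lr" ;;
    --seed)                       echo "rng.seed" ;;
    --log-interval)               echo "logger.log_interval" ;;
    --tensor-model-parallel-size) echo "model.tensor_model_parallel_size" ;;
    --sequence-parallel)          echo "model.sequence_parallel" ;;
    *) return 1 ;;
  esac
}
```

For example, `mlm_flag_to_bridge_key --seq-length` prints both keys, which is a reminder that a Bridge override of the sequence length touches the model and the dataset together.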

Verification

With matched parameters the LM losses should be nearly identical at each iteration. Compare lm loss values from both logs — they should agree to within BF16 rounding.
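
The comparison can be scripted once both runs have been captured to files. The sketch below assumes each training log line carries a field of the form `lm loss: 1.046353E+01`, as the phrase "lm loss values" suggests; the helper names and the default tolerance of 0.01 are placeholders to adjust, not values taken from either codebase.

```shell
# Print the "lm loss" value of every log line that has one, one per line.
extract_lm_loss() {
  sed -n 's/.*lm loss: *\([0-9.eE+-]*\).*/\1/p' "$1"
}

# Compare per-iteration losses from two logs.
# Usage: compare_lm_loss MLM_LOG BRIDGE_LOG [TOLERANCE]
# Prints each mismatch and returns non-zero if any is found.
compare_lm_loss() {
  mlm_log="$1"; bridge_log="$2"; tol="${3:-0.01}"
  mlm_losses="$(mktemp)"; bridge_losses="$(mktemp)"
  extract_lm_loss "$mlm_log" > "$mlm_losses"
  extract_lm_loss "$bridge_log" > "$bridge_losses"
  # paste puts the two loss columns side by side, one row per iteration.
  paste "$mlm_losses" "$bridge_losses" | awk -v tol="$tol" '
    BEGIN { bad = 0 }
    NF != 2 { print "row " NR ": value missing in one log"; bad = 1; next }
    {
      diff = $1 - $2
      if (diff < 0) diff = -diff
      if (diff > tol) { printf "row %d: %s vs %s\n", NR, $1, $2; bad = 1 }
    }
    END {
      if (NR == 0) { print "no lm loss values found"; bad = 1 }
      exit bad
    }'
  status=$?
  rm -f "$mlm_losses" "$bridge_losses"
  return "$status"
}
```

A typical use is to redirect each run's output to a file (for example with `2>&1 | tee mlm.log`) and then call `compare_lm_loss mlm.log bridge.log`. A mismatch on the very first row usually points at configuration, while drift that grows over iterations is more consistent with a numerical difference.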

Multi-GPU Examples

MLM 2-GPU with TP=2

PYTHONPATH=3rdparty/Megatron-LM:$PYTHONPATH \
uv run python -m torch.distributed.run --nproc_per_node=2 \
  3rdparty/Megatron-LM/pretrain_gpt.py \
  --tensor-model-parallel-size 2 --sequence-parallel \
  --num-layers 4 --hidden-size 256 --num-attention-heads 4 \
  --seq-length 1024 --max-position-embeddings 1024 \
  --micro-batch-size 2 --global-batch-size 16 \
  --train-iters 10 --eval-iters 2 --eval-interval 10 \
  --mock-data --bf16 --use-mcore-models \
  --tokenizer-type NullTokenizer --vocab-size 1024 \
  --lr 1e-4 --log-interval 1

Bridge 2-GPU with TP=2

rm -rf nemo_experiments && \
uv run python -m torch.distributed.run --nproc_per_node=2 \
  scripts/training/run_recipe.py \
  --recipe vanilla_gpt_pretrain_config \
  model.tensor_model_parallel_size=2 model.sequence_parallel=true \
  model.num_layers=4 model.hidden_size=256 \
  model.num_attention_heads=4 model.ffn_hidden_size=1024 \
  model.seq_length=1024 dataset.sequence_length=1024 \
  train.train_iters=10 train.global_batch_size=16 train.micro_batch_size=2 \
  validation.eval_interval=10 validation.eval_iters=2 \
  scheduler.lr_warmup_iters=2 scheduler.lr_decay_iters=10 \
  logger.log_interval=1
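
When changing GPU counts or parallelism, the batch sizes have to stay consistent: the data-parallel size is the world size divided by the model-parallel size, and the global batch size must be a multiple of the micro-batch size times the data-parallel size. The sketch below works through that arithmetic, assuming context parallelism is not in use; `num_microbatches` is a hypothetical helper name.

```shell
# Micro-batches accumulated per optimizer step:
#   data_parallel    = world_size / (TP * PP)
#   num_microbatches = GBS / (MBS * data_parallel)
# Usage: num_microbatches WORLD_SIZE TP PP GBS MBS
num_microbatches() {
  world_size="$1"; tp="$2"; pp="$3"; gbs="$4"; mbs="$5"
  model_parallel=$((tp * pp))
  if [ $((world_size % model_parallel)) -ne 0 ]; then
    echo "world size must be divisible by TP*PP" >&2
    return 1
  fi
  dp=$((world_size / model_parallel))
  if [ $((gbs % (mbs * dp))) -ne 0 ]; then
    echo "GBS must be divisible by MBS*DP" >&2
    return 1
  fi
  echo $((gbs / (mbs * dp)))
}

num_microbatches 2 2 1 16 2   # the 2-GPU TP=2 example: prints 8
num_microbatches 1 1 1 32 4   # the 1-GPU correlation run: prints 8
```

In the TP=2 example both GPUs hold shards of one model replica, so the data-parallel size is 1 and each step accumulates 16 / 2 = 8 micro-batches.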

Available Recipes

Common recipes (use with --recipe):

  • vanilla_gpt_pretrain_config — Minimal GPT (bare GPTModelProvider defaults, ideal for correlation testing and custom configs)
  • llama32_1b_pretrain_config — Llama 3.2 1B (16L, 2048H, GBS=512, seq=8192)
  • llama3_8b_pretrain_config — Llama 3 8B
  • qwen3_8b_pretrain_config — Qwen3 8B
  • deepseek_v2_lite_pretrain_config — DeepSeek-V2-Lite 16B MoE

SFT and PEFT variants use the _sft_config and _peft_config suffixes.

Megatron-Core Submodule

For what the submodule is and why two versions exist, see @docs/megatron-lm-to-megatron-bridge.md.

Check current version

./scripts/switch_mcore.sh status

Switch to dev for testing newer MCore features

./scripts/switch_mcore.sh dev

# uv sync (without --locked) since lockfile is for main
uv sync

Switch back to main

./scripts/switch_mcore.sh main

After pulling latest main

When you pull the latest Bridge main branch, the submodule pointer may have been updated. Re-sync the submodule:

git submodule update --init 3rdparty/Megatron-LM
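
To tell whether a re-sync is needed, the first character of `git submodule status` output is enough: git prints `-` for an uninitialized submodule, `+` when the checked-out commit differs from the one recorded in the superproject, and `U` for merge conflicts. A small sketch that classifies that output follows; `submodule_state` is a hypothetical helper name.

```shell
# Classify one line of `git submodule status` output by its prefix character.
submodule_state() {
  case "$1" in
    '') echo "unknown" ;;        # no output, e.g. not inside a git repository
    -*) echo "uninitialized" ;;  # run: git submodule update --init <path>
    +*) echo "out-of-sync" ;;    # checked-out commit differs from the recorded one
    U*) echo "conflict" ;;       # submodule has merge conflicts
    *)  echo "in-sync" ;;        # leading space: matches the recorded commit
  esac
}

# Usage, from the repository root (quotes preserve the leading space):
#   submodule_state "$(git submodule status 3rdparty/Megatron-LM)"
```

Note that after `./scripts/switch_mcore.sh dev` an out-of-sync result may be expected rather than a problem, since the checkout then intentionally differs from what main records.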

Pitfalls

  1. Always rm -rf nemo_experiments before a fresh correlation run. Bridge auto-resumes from stale checkpoints silently.

  2. uv run required: Always use uv run python -m torch.distributed.run (not bare torchrun or python).

  3. MLM PYTHONPATH: Must include 3rdparty/Megatron-LM so gpt_builders.py is importable.

  4. Scheduler overrides: When overriding train.train_iters to a small value, also set scheduler.lr_warmup_iters and scheduler.lr_decay_iters or you get an assertion error.

  5. Use dataset.sequence_length in CLI overrides, not dataset.seq_length.

  6. MoE OOM: Large MoE models require full activation recomputation and typically multi-node EP. TP does NOT reduce per-GPU expert memory.

  7. uv sync --locked fails after switching to dev: The lockfile is generated against the main MCore commit. Use uv sync (without --locked) when on dev.
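
Pitfall 4 is easy to trip over when shortening a run, so it can help to generate the three overrides together. The sketch below mirrors the correlation run (warmup of 1, decay equal to the iteration count); `short_run_overrides` is a hypothetical helper, and the check that warmup must be smaller than the decay length is an assumption about what the scheduler asserts, not something confirmed here.

```shell
# Emit train/scheduler overrides that stay consistent for a short run.
# Usage: short_run_overrides TRAIN_ITERS [WARMUP_ITERS]
short_run_overrides() {
  iters="$1"; warmup="${2:-1}"
  # Assumption: the scheduler requires warmup < decay; decay is set to iters.
  if [ "$warmup" -ge "$iters" ]; then
    echo "warmup iters must be smaller than train iters" >&2
    return 1
  fi
  echo "train.train_iters=$iters scheduler.lr_warmup_iters=$warmup scheduler.lr_decay_iters=$iters"
}

short_run_overrides 10
```

The output can be spliced into a Bridge command line, for example by appending `$(short_run_overrides 10)` unquoted after the other overrides so that the shell splits it into three arguments.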

Reviews

4.6 (36 reviews)

  • Ganesh Mohane, Dec 24, 2024

    nemo-mbridge-mlm-bridge-training reduced setup friction for our internal harness; good balance of opinion and flexibility.

  • Luis Singh, Dec 16, 2024

    We added nemo-mbridge-mlm-bridge-training from the explainx registry; install was straightforward and the SKILL.md answered most questions upfront.

  • Hassan Iyer, Dec 12, 2024

    nemo-mbridge-mlm-bridge-training reduced setup friction for our internal harness; good balance of opinion and flexibility.

  • Amelia Gill, Nov 27, 2024

    Solid pick for teams standardizing on skills: nemo-mbridge-mlm-bridge-training is focused, and the summary matches what you get after install.

  • Rahul Santra, Nov 15, 2024

    I recommend nemo-mbridge-mlm-bridge-training for anyone iterating fast on agent tooling; clear intent and a small, reviewable surface area.

  • Kabir Bansal, Nov 7, 2024

    Keeps context tight: nemo-mbridge-mlm-bridge-training is the kind of skill you can hand to a new teammate without a long onboarding doc.

  • Carlos Torres, Nov 3, 2024

    I recommend nemo-mbridge-mlm-bridge-training for anyone iterating fast on agent tooling; clear intent and a small, reviewable surface area.

  • Henry Abebe, Oct 26, 2024

    nemo-mbridge-mlm-bridge-training is among the better-maintained entries we tried; worth keeping pinned for repeat workflows.

  • Carlos Reddy, Oct 22, 2024

    Useful defaults in nemo-mbridge-mlm-bridge-training — fewer surprises than typical one-off scripts, and it plays nicely with `npx skills` flows.

  • Carlos Harris, Oct 18, 2024

    nemo-mbridge-mlm-bridge-training has been reliable in day-to-day use. Documentation quality is above average for community skills.
