On May 20, 2026, NemoStation released Marlin-2B—a 2B parameter video VLM that answers the two questions developers actually ask their videos: "What is happening?" and "When?" Fine-tuned from Qwen3.5-2B, Marlin produces structured Scene + Event captions with second-precise timestamps and resolves natural-language queries to span-grounded (start, end) ranges. At 2B params, it is the strongest open model in its weight class on dense captioning (DREAM-1K, CaReBench) and natural-language temporal grounding (TimeLens-Bench), matching Gemini-2.0-Flash on grounding and beating Qwen2.5-VL-7B by +6.4 mIoU—all while running on a single consumer GPU with vLLM and swift-deploy compatibility.
This article is a field guide: what Marlin is, how it works, benchmarks, usage examples, training details, and when to choose Marlin over larger VLMs.
TL;DR
| Question | Short answer |
|---|---|
| What is it? | A 2B parameter video VLM fine-tuned from Qwen3.5-2B to extract structured information from videos—dense captions with timestamps + temporal grounding. |
| Announced | May 20, 2026 by NemoStation team (Shubham Sharma). |
| Two modes | (1) .caption() → Scene + Events JSON with second-precise timestamps. (2) .find(event) → (start, end) tuple for natural-language queries. |
| Key strength | Best-in-class temporal grounding at 2B—beats Qwen2.5-VL-7B by +6.4 mIoU on TimeLens-Bench, matches Gemini-2.0-Flash. |
| Benchmarks | Tops CaReBench leaderboard, sits between Tarsier-34B and Gemini-1.5-Pro on DREAM-1K. |
| Training | Two-stage: (1) SFT on ~400K curated clips (ActivityNet, LSMDC, Charades, + Gemini-3-Flash teacher). (2) SimPO preference optimization. |
| System requirements | Single consumer GPU (RTX 3090/4090). Requires transformers ≥ 5.7.0, torch ≥ 2.11.0, torchcodec, qwen-vl-utils ≥ 0.0.14. |
| Open source | Yes—Apache 2.0 license. Hugging Face: NemoStation/Marlin-2B. |
| Related ecosystem | Pairs with Qwen3.5 models, agent skills, and video indexing pipelines. |
Primary source: Hugging Face model card · Shubham Sharma on X
What is Marlin-2B?
Marlin is a 2B video VLM tuned for the two questions developers actually ask their videos: what is happening, and when?
Core capabilities:
- Dense captioning with second-precise timestamps (Scene + Events structure)
- Temporal grounding to span-grounded (start, end) ranges for natural-language queries
- Character and object consistency across events
- Atomic event detection with explicit boundaries (<start-end>)
Example workflow:
- Input: A 2-minute cooking video.
- marlin.caption() returns:
  - Scene: "A modern kitchen with stainless steel appliances, marble countertop, natural light from window."
  - Events:
    - <5.2 - 8.7> "Chef places cutting board on counter and arranges vegetables."
    - <8.7 - 14.3> "Chef dices onions with chef's knife."
    - <14.3 - 22.1> "Chef heats olive oil in pan on stovetop."
    - <22.1 - 30.5> "Chef sautés onions in pan, stirring occasionally."
    - ...
- marlin.find(event="chef starts cooking") returns (14.3, 22.1), the span where cooking begins (heating oil).
Why this matters: Most VLMs generate free-form prose that's hard to parse programmatically. Marlin produces typed Python dicts with explicit timestamps, making it ideal for video indexing, agent context, and downstream automation.
Feature 01: Caption mode—structured Scene + Events with timestamps
Problem: Existing video captioning models return unstructured prose like "A person is cooking in a kitchen." No timestamps, no event boundaries, no programmatic access.
Solution: Marlin's .caption() method returns a parsed Python dict:
result = marlin.caption("video.mp4")
print(result["caption"]) # full raw caption text (Scene: ... Events: ...)
print(result["scene"]) # parsed Scene paragraph
for ev in result["events"]:
    print(f"<{ev['start']:.1f} - {ev['end']:.1f}> {ev['description']}")
Output structure:
{
  "caption": "Scene: A modern kitchen...\nEvents: <5.2 - 8.7> Chef places...",
  "scene": "A modern kitchen with stainless steel appliances...",
  "events": [
    {"start": 5.2, "end": 8.7, "description": "Chef places cutting board..."},
    {"start": 8.7, "end": 14.3, "description": "Chef dices onions..."},
    ...
  ]
}
Training format: The model was trained on a canonical prompt that forces Scene: <paragraph> followed by Events: <X.X - Y.Y> <description> format. At inference, the custom modeling code (modeling_marlin.py) wraps the prompt automatically and parses the structured output into typed Python dicts—no regex wrangling.
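To make the Scene/Events format concrete, here is a minimal sketch of how such a caption string could be split into the dict shown above. It is an illustration only: `parse_caption` and `EVENT_RE` are hypothetical names, and the parser that actually ships in modeling_marlin.py may work differently.

```python
import re

# Hypothetical pattern for one "<start - end> description" event.
# Assumes descriptions never contain a "<" character.
EVENT_RE = re.compile(r"<\s*(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)\s*>\s*([^<]*)")


def parse_caption(raw: str) -> dict:
    """Split a 'Scene: ... Events: ...' caption into a scene string and event dicts."""
    scene_part, _, events_part = raw.partition("Events:")
    scene = scene_part.replace("Scene:", "", 1).strip()
    events = [
        {"start": float(start), "end": float(end), "description": desc.strip()}
        for start, end, desc in EVENT_RE.findall(events_part)
    ]
    return {"caption": raw, "scene": scene, "events": events}


raw = (
    "Scene: A modern kitchen with a marble countertop.\n"
    "Events: <5.2 - 8.7> Chef places cutting board on counter. "
    "<8.7 - 14.3> Chef dices onions."
)
parsed = parse_caption(raw)
for ev in parsed["events"]:
    print(ev["start"], ev["end"], ev["description"])
```

In practice you would rely on .caption() for this; the sketch is mainly useful if you call .generate() directly and need to post-process the raw text yourself.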
Use cases:
- Video indexing (e.g., "find all videos where someone enters a room")
- Agent context (LLM agents can read Marlin's output to understand what happened)
- Content moderation (flag specific events by timestamp)
- Training data generation (dense captions for future video models)
Feature 02: Find mode—natural-language temporal grounding
Problem: You want to locate a specific moment in a video—e.g., "when does the person start running?"—but scrubbing through manually takes forever.
Solution: Marlin's .find() method resolves queries to (start, end) tuples:
result = marlin.find("video.mp4", event="a person enters the room")
print(result["raw"]) # "From 14.3 to 18.2." (raw model output)
print(result["span"]) # (14.3, 18.2) tuple in seconds, or None on parse failure
print(result["format_ok"]) # True if output matched the trained format
Example queries:
- "person starts running" → (23.5, 28.1)
- "car door opens" → (10.2, 11.7)
- "dog catches frisbee" → (45.3, 47.8)
Training format: Marlin was trained on ground-truth spans from HC-STVG, VidSTG, and TimeLens datasets, producing "From X.X to Y.Y." format. The custom modeling code parses this into (start, end) tuples automatically.
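The "From X.X to Y.Y." convention is simple enough to parse by hand. The following is a rough sketch of that step, assuming the output keys described above (raw, span, format_ok); `parse_span` and `SPAN_RE` are hypothetical names, not the code in modeling_marlin.py.

```python
import re

# Hypothetical pattern for the trained "From X.X to Y.Y." answer format.
SPAN_RE = re.compile(r"From\s+(\d+(?:\.\d+)?)\s+to\s+(\d+(?:\.\d+)?)\.?")


def parse_span(raw: str) -> dict:
    """Return the raw text, a (start, end) tuple in seconds or None, and a format flag."""
    match = SPAN_RE.search(raw)
    if match is None:
        return {"raw": raw, "span": None, "format_ok": False}
    start, end = float(match.group(1)), float(match.group(2))
    return {"raw": raw, "span": (start, end), "format_ok": True}


print(parse_span("From 14.3 to 18.2."))
print(parse_span("The event does not appear in the video."))
```

Checking format_ok (or span is None) before using the result keeps a malformed answer from silently turning into a bogus timestamp downstream.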
Use cases:
- Agent loops (fast enough to run inline—agent asks "when does X happen?" and gets immediate answer)
- Video search (locate specific moments across a library)
- Highlight reels (programmatically extract key moments)
- Data labeling (bootstrap annotations for new datasets)
Feature 03: Benchmarks—best in class at 2B
Marlin sits at the Pareto frontier for 2B models on both dense captioning and temporal grounding:
Dense Captioning
| Benchmark | Metric | Marlin-2B | Qwen2.5-VL-7B | Tarsier-34B | Gemini-1.5-Pro | Gemini-2.5-Flash |
|---|---|---|---|---|---|---|
| CaReBench | CIDEr | 1st place | — | — | — | Teacher model |
| DREAM-1K | CIDEr | Between Tarsier-34B and Gemini-1.5-Pro | — | Lower | Higher | 0.21-0.43 above Marlin |
Key takeaway: Marlin closes the gap to its Gemini-2.5-Flash teacher to within 0.21 / 0.43 points on a 10-point scale on dense captioning, despite having only 2B parameters.
Temporal Grounding
| Benchmark | Metric | Marlin-2B | Qwen2.5-VL-7B | Gemini-2.0-Flash | TimeLens-8B | MiMo-VL |
|---|---|---|---|---|---|---|
| TimeLens-Charades | mIoU | Matches Gemini-2.0-Flash | 6.4 mIoU below Marlin | Tied with Marlin | A few points above Marlin | Higher (task-specific) |
| TimeLens-ActivityNet | mIoU | Matches Gemini-2.0-Flash | 6.4 mIoU below Marlin | Tied with Marlin | A few points above Marlin | Higher (task-specific) |
| TimeLens-QVHighlights | mIoU | Matches Gemini-2.0-Flash | 6.4 mIoU below Marlin | Tied with Marlin | A few points above Marlin | Higher (task-specific) |
Key takeaway: Marlin beats Qwen2.5-VL-7B (3.5× larger) by +6.4 mIoU and matches Gemini-2.0-Flash on temporal grounding. Specialized 7B-8B models (TimeLens-7B/8B, MiMo-VL, Time-R1) hold the upper frontier because they have task-specific data during training—Marlin is the strongest general-purpose model at 2B.
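For readers unfamiliar with the metric, mIoU here is the mean temporal intersection-over-union between predicted and ground-truth spans, usually reported on a 0-100 scale. A minimal sketch follows; the function names are illustrative, and scoring an unparseable prediction as 0 is an assumption rather than the documented TimeLens-Bench protocol.

```python
def temporal_iou(pred, gt):
    """IoU of two (start, end) spans in seconds."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def mean_iou(preds, gts):
    """Mean IoU over queries, as a percentage. A None prediction scores 0 (assumption)."""
    scores = [temporal_iou(p, g) if p is not None else 0.0 for p, g in zip(preds, gts)]
    return 100.0 * sum(scores) / len(scores)


# Spans overlap for 5 s out of a 15 s union, so IoU is one third.
print(temporal_iou((10.0, 20.0), (15.0, 25.0)))
# One perfect prediction and one parse failure average to 50.
print(mean_iou([(0.0, 10.0), None], [(0.0, 10.0), (5.0, 6.0)]))
```

Read this way, a +6.4 mIoU gap means the predicted spans overlap the ground truth by 6.4 percentage points more on average.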
Trajectory chart (from model card): The three-panel figure shows progression from Qwen3.5-2B base → Marlin-SFT → Marlin-SimPO (release checkpoint):
- CaReBench: Steady climb to top of leaderboard.
- DREAM-1K: Closes gap to Gemini-2.5-Flash teacher.
- TimeLens-Charades: Reaches Pareto frontier in 2B band, matches Gemini-2.5-Flash (non-thinking).
Feature 04: Training—two-stage SFT + SimPO
Stage 1: Supervised Fine-Tuning (SFT)
Data: ~400K high-quality clip-level annotations:
- Public datasets: ActivityNet, LSMDC, Charades, Charades-Ego, TREC-VTT, WebVid-10M, HC-STVG, VidSTG, TimeLens
- Teacher annotations: Dense re-annotations from Gemini-3-Flash in thinking mode, producing temporally grounded atomic events with explicit <start-end> boundaries (not free-form prose)
- Human review: Targeted review on highest-impact splits
Training setup:
- Base model: Qwen3.5-2B with video-capable visual tower kept intact
- Prompt: Fixed canonical prompt per mode (caption vs find), with Tarsier-schema output formatting
- Compute: Single H100
Stage 2: Preference Optimization (SimPO)
Why SimPO? Cheaper and more stable than DPO at this scale—no reference model required.
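The "no reference model" point is easiest to see in the loss itself. Below is a small sketch of the SimPO objective for a single preference pair as defined in the SimPO paper: the reward is the length-normalized log-probability under the policy alone, and the chosen response must beat the rejected one by a margin gamma. The beta and gamma values are placeholders, not Marlin's training hyperparameters, which the source does not state.

```python
import math


def simpo_loss(logp_chosen, len_chosen, logp_rejected, len_rejected,
               beta=2.0, gamma=1.0):
    """SimPO loss for one pair.

    logp_* are summed token log-probabilities under the policy being trained;
    len_* are response lengths in tokens. No reference model appears anywhere.
    beta and gamma are placeholder values (assumption).
    """
    reward_chosen = beta * logp_chosen / len_chosen
    reward_rejected = beta * logp_rejected / len_rejected
    margin = reward_chosen - reward_rejected - gamma
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Rewards are -2 and -3, so the margin is exactly 0 and the loss is log(2).
print(simpo_loss(-10.0, 10, -15.0, 10))
# A more likely chosen response widens the margin and lowers the loss.
print(simpo_loss(-5.0, 10, -15.0, 10))
```

DPO, by contrast, computes each reward as a log-ratio against a frozen reference policy, which means a second model has to be kept in memory during training.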
Preference dataset:
- Candidate completions from the SFT checkpoint are scored by a Gemini-3-Flash judge using a rich rubric (factual accuracy, completeness, temporal alignment)
- Resulting win/lose pairs align Marlin without needing a reference model
Result: Marlin-SimPO (the release checkpoint) improves over Marlin-SFT on all three benchmarks—see trajectory chart in model card.
Recipe paper: "Coming soon" according to the model card (as of May 21, 2026).
Feature 05: Developer-friendly API—transformers + convenience methods
Standard HF transformers API with two convenience methods (.caption, .find) added directly to the model object:
import torch
from transformers import AutoModelForCausalLM
marlin = AutoModelForCausalLM.from_pretrained(
    "NemoStation/Marlin-2B",
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map={"": "cuda"},
)
marlin.compile() # optional — wraps torch.compile, faster after first call
Caption mode:
result = marlin.caption("video.mp4")
print(result["scene"])
for ev in result["events"]:
    print(f"<{ev['start']:.1f} - {ev['end']:.1f}> {ev['description']}")
Find mode:
result = marlin.find("video.mp4", event="person enters the room")
print(result["span"]) # (14.3, 18.2) or None
Optional kwargs:
- max_new_tokens=2048 (default): generation token cap
- prompt=None: override the canonical prompt (almost always leave as None)
- do_sample=False, temperature=1.0, top_p=1.0: sampling controls
Advanced—raw inference:
If you want to bypass helper methods and call .generate() directly:
from transformers import AutoProcessor
processor = AutoProcessor.from_pretrained("NemoStation/Marlin-2B", trust_remote_code=True)
messages = [{"role": "user", "content": [
    {"type": "video", "video": "video.mp4"},
    {"type": "text", "text": "Your custom prompt here"},
]}]
inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_tensors="pt", return_dict=True,
).to(marlin.device)
with torch.inference_mode():
    out = marlin.generate(**inputs, max_new_tokens=512, do_sample=False)
out = out[:, inputs["input_ids"].shape[1]:]
text = processor.batch_decode(out, skip_special_tokens=True)[0]
print(text)
Note: The model emits a <think> token at the start of every response (training artifact with add_non_thinking_prefix=True). The .caption() and .find() methods strip this automatically. If using .generate() directly, strip <think>...</think> from the start.
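If you do go the raw .generate() route, the stripping step might look like the sketch below. `strip_think_prefix` is a hypothetical helper, and the exact whitespace the model emits around the tags is an assumption; the pattern only touches a block at the very start of the text.

```python
import re

# Matches a leading <think>...</think> block plus surrounding whitespace.
THINK_PREFIX_RE = re.compile(r"^\s*<think>.*?</think>\s*", flags=re.DOTALL)


def strip_think_prefix(text: str) -> str:
    """Remove one leading <think>...</think> block, if present."""
    return THINK_PREFIX_RE.sub("", text, count=1)


print(strip_think_prefix("<think>\n\n</think>\n\nFrom 14.3 to 18.2."))
print(strip_think_prefix("From 1.0 to 2.0."))
```

Text without the prefix passes through unchanged, so the helper is safe to apply unconditionally.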
Video preprocessing—defaults match training
The custom modeling code sets these env vars internally (matching training-time setup):
| Env var | Default | What it does |
|---|---|---|
| FORCE_QWENVL_VIDEO_READER | torchcodec | Video decoder backend |
| VIDEO_MAX_PIXELS | 200704 | Max pixels per frame (~448×448) |
| FPS | 2.0 | Frame sampling rate |
| FPS_MAX_FRAMES | 240 | Cap on total frames (covers ~2 min videos) |
| FPS_MIN_FRAMES | 4 | Floor for very short videos |
Override: Set env vars in your shell before importing transformers if you need different values.
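To get a feel for how FPS, FPS_MIN_FRAMES, and FPS_MAX_FRAMES interact, here is a simplified sketch of the frame-budget arithmetic. It assumes the frame count is simply clamped between the floor and the cap; the real sampling logic in qwen-vl-utils applies additional rounding and is bounded by the number of frames the file actually contains, so treat the numbers as approximate.

```python
def sampled_frames(duration_s, fps=2.0, min_frames=4, max_frames=240):
    """Approximate number of frames sampled from a clip (simplified, assumed clamping)."""
    wanted = int(round(duration_s * fps))
    return max(min_frames, min(wanted, max_frames))


def effective_fps(duration_s, **kwargs):
    """Frames actually sampled per second of footage."""
    return sampled_frames(duration_s, **kwargs) / duration_s


for duration in (1, 30, 120, 600):
    print(duration, sampled_frames(duration), round(effective_fps(duration), 2))
```

Under these assumptions a 2-minute clip uses the full 240-frame budget at 2 FPS, while a 10-minute clip would be spread over the same 240 frames at roughly 0.4 FPS.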
System requirements
Hardware:
- Single consumer GPU (NVIDIA RTX 3090, RTX 4090, or equivalent)
- Runs in BF16 with ~4GB VRAM
Software:
- transformers >= 5.7.0 (for native qwen3_5 architecture)
- torch >= 2.11.0
- torchcodec (video decoding)
- qwen-vl-utils >= 0.0.14
- av (torchcodec system dependency)
- pillow
Install:
pip install "transformers>=5.7.0" "torch>=2.11.0" torchcodec "qwen-vl-utils>=0.0.14" av pillow
Optional:
- torch.compile() for faster inference after first call
- vLLM for batch inference and deployment
- swift-deploy for production serving
Use cases: video indexing, agent context, content moderation
01. Video library indexing
Problem: You have 10,000 hours of footage and need to find all instances of "person opens a door."
Solution:
# video_library (an iterable of paths) and index (any store with .add) are placeholders
for video_path in video_library:
    result = marlin.find(video_path, event="person opens a door")
    if result["span"]:
        index.add(video_path, result["span"])
Result: Searchable index of time-stamped events across your entire library.
02. Agent context for multimodal workflows
Problem: An LLM agent needs to understand what happened in a video feed to decide next actions.
Solution:
result = marlin.caption("security_feed.mp4")
agent_context = {
    "scene": result["scene"],
    "events": result["events"],
}
agent.process(agent_context) # agent can now reason over structured timeline
Result: The agent sees a structured timeline (not raw pixels or prose) and can reason: "Event 3 shows a person entering the restricted area at 14.3s—trigger alert."
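If the agent consumes plain text rather than a dict, the structured result can be flattened into a numbered timeline first. The helper below is a hypothetical sketch of that step; `render_timeline` is not part of Marlin's API, and the sample result is made up for illustration.

```python
def render_timeline(result: dict) -> str:
    """Turn a .caption()-style dict into a numbered, prompt-friendly timeline."""
    lines = [f"Scene: {result['scene']}"]
    for i, ev in enumerate(result["events"], start=1):
        lines.append(
            f"Event {i} [{ev['start']:.1f}s to {ev['end']:.1f}s]: {ev['description']}"
        )
    return "\n".join(lines)


# Made-up sample in the shape returned by .caption().
sample = {
    "scene": "A loading dock at night.",
    "events": [
        {"start": 2.0, "end": 6.5, "description": "A truck reverses toward the dock."},
        {"start": 14.3, "end": 18.2, "description": "A person enters the restricted area."},
    ],
}
print(render_timeline(sample))
```

Numbering the events gives the agent stable handles ("Event 2") to cite in its reasoning or in follow-up .find() calls.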
03. Content moderation at scale
Problem: Flag videos containing specific events (violence, nudity, etc.) by timestamp.
Solution:
result = marlin.caption("user_upload.mp4")
for ev in result["events"]:
    if moderation_model.is_flagged(ev["description"]):
        flag(video_id, start=ev["start"], end=ev["end"])
Result: Timestamped moderation flags for human review or automated takedowns.
04. Training data generation for future models
Problem: You need dense captions with timestamps to train the next generation of video models.
Solution: Run Marlin over a large video corpus, export results as training data.
Result: High-quality annotations at scale—cheaper and faster than human labelers.
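One common way to package such output is JSON Lines, with one record per video. The sketch below assumes results shaped like the .caption() dict described earlier; the function names, record fields, and file layout are illustrative choices, not a format prescribed by the model card.

```python
import json


def to_jsonl_record(video_path: str, result: dict) -> str:
    """Serialize one caption result as a single JSON line."""
    record = {
        "video": video_path,
        "scene": result["scene"],
        "events": [
            {"start": ev["start"], "end": ev["end"], "description": ev["description"]}
            for ev in result["events"]
        ],
    }
    return json.dumps(record, ensure_ascii=False)


def export_captions(results, out_path: str) -> None:
    """Write (video_path, result) pairs to a JSONL file, one record per line."""
    with open(out_path, "w", encoding="utf-8") as f:
        for video_path, result in results:
            f.write(to_jsonl_record(video_path, result) + "\n")


# Made-up sample in the shape returned by .caption().
sample = {
    "scene": "A modern kitchen.",
    "events": [{"start": 5.2, "end": 8.7, "description": "Chef places cutting board."}],
}
line = to_jsonl_record("cooking.mp4", sample)
print(line)
```

Given the hallucination caveat that applies to any VLM, it is worth spot-checking a sample of records before training on them.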
Marlin-2B vs larger VLMs
| Model | Params | Dense Captioning (DREAM-1K) | Temporal Grounding (TimeLens mIoU) | Runs on single GPU | Open source |
|---|---|---|---|---|---|
| Marlin-2B | 2B | Between Tarsier-34B and Gemini-1.5-Pro | Matches Gemini-2.0-Flash | Yes (RTX 3090/4090) | Yes (Apache 2.0) |
| Qwen2.5-VL-7B | 7B | Lower than Marlin | 6.4 mIoU below Marlin | Yes | Yes |
| Tarsier-34B | 34B | Lower than Gemini-1.5-Pro | — | No (multi-GPU) | Yes |
| Gemini-1.5-Pro | Large | Higher than Marlin | — | No (API only) | No |
| Gemini-2.0-Flash | Medium | 0.21-0.43 above Marlin | Tied with Marlin | No (API only) | No |
| Gemini-2.5-Flash | Medium | Teacher for Marlin (higher) | Teacher for Marlin (higher) | No (API only) | No |
| TimeLens-8B | 8B | — | A few points above Marlin (task-specific) | Yes | Yes |
When to choose Marlin:
- You need structured output (JSON with timestamps), not prose
- You want to run locally on consumer hardware
- You need temporal grounding (find when X happens)
- You want open-source under Apache 2.0
- You need fast inference for agent loops or real-time applications
When to choose larger models:
- You need peak visual quality over all else (Gemini-2.5-Flash, GPT-4o)
- You're fine with API calls and don't need local deployment
- You need long-form prose captions (not structured events)
Limitations and future work
Two-minute frame budget: Marlin samples at 2 FPS with a 240-frame cap, which covers about 2 minutes of video at the full sampling rate. Longer videos exceed that budget, so events may be missed.
Multichunk reasoning limited: The model has <think>-style chunked-video reasoning (chunk-time → source-time arithmetic), but this is not directly exposed via .caption() / .find(). Use raw prompts if needed.
No audio transcription: Marlin processes video frames only and does not transcribe speech. For speech-to-text, use a separate ASR model (e.g., Whisper, Cohere Transcribe).
Bias and hallucination: Like all VLMs, Marlin can hallucinate events or exhibit biases from training data. Validate outputs in safety-critical applications.
Future work (from model card):
- Longer video support (>2 min)
- Multichunk reasoning exposed via helper methods
- Audio transcription integration
- Recipe paper publication
Sources
- Hugging Face model card: huggingface.co/NemoStation/Marlin-2B
- Shubham Sharma (creator) on X: x.com/HappyyPablo
- NemoStation website: nemostation.com
- TimeLens-Bench paper: arXiv:2512.14698 — benchmark for temporal grounding
- Tarsier paper: arXiv:2407.00634 — DREAM-1K evaluation protocol
- CaReBench paper: arXiv:2501.00513 — fine-grained video captioning benchmark
Model capabilities, benchmark rankings, and hardware requirements may change with future releases. Treat this as May 21, 2026 context—verify performance claims on the latest leaderboards before production deployment. Marlin-2B is Apache 2.0 licensed; commercial use is permitted.