Marlin-2B is a 2B parameter video VLM released by NemoStation on May 20, 2026. It's fine-tuned from Qwen3.5-2B and designed to extract structured information from videos by answering two questions: 'what is happening' (dense captioning with timestamps) and 'when' (temporal grounding to specific time spans). It runs on a single consumer GPU and is compatible with vLLM and transformers.

What can Marlin-2B do?

Marlin has two modes: (1) .caption() returns a structured Scene + Events result with second-precise timestamps for each event, and (2) .find() takes a video path plus a natural-language event query and resolves the query to a (start, end) span in seconds. For example, find('video.mp4', event='person enters the room') returns a result whose span is (14.3, 18.2).

How does Marlin-2B compare to larger models?

At 2B params, Marlin tops DREAM-1K and CaReBench in its weight class, sits between Tarsier-34B and Gemini-1.5-Pro on dense captioning, and beats Qwen2.5-VL-7B by +6.4 mIoU on TimeLens-Bench (Charades/ActivityNet/QVHighlights). It matches Gemini-2.0-Flash on temporal grounding—at a fraction of the cost and compute.

What is the training data for Marlin-2B?

Marlin-2B was trained on ~400K high-quality clip-level annotations. These combine sparse public datasets (ActivityNet, LSMDC, Charades, Charades-Ego, TREC-VTT, WebVid-10M, HC-STVG, VidSTG, TimeLens) with dense re-annotations from Gemini-3-Flash in thinking mode, followed by targeted human review. The teacher pipeline was tuned to produce temporally grounded atomic events with explicit <start-end> boundaries.

How do I use Marlin-2B?

Install transformers ≥ 5.7.0, torch ≥ 2.11.0, torchcodec, and qwen-vl-utils ≥ 0.0.14. Load the model with AutoModelForCausalLM.from_pretrained('NemoStation/Marlin-2B', trust_remote_code=True). The loaded model object exposes two convenience methods: marlin.caption('video.mp4') for dense captioning and marlin.find('video.mp4', event='query') for temporal grounding.

What are the system requirements for Marlin-2B?

Marlin-2B runs on a single consumer GPU (NVIDIA RTX 3090/4090 or equivalent). It requires transformers ≥ 5.7.0, torch ≥ 2.11.0, torchcodec (video decoding), qwen-vl-utils ≥ 0.0.14, av (torchcodec dependency), and pillow. Optionally, call marlin.compile(), which wraps torch.compile, for faster inference after the first call.
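As a quick sanity check, the minimum versions listed above can be compared against what is installed. This is a minimal sketch using only the standard library; the helper names (`parse_version`, `check_requirements`) are illustrative and not part of Marlin, and the version parsing is deliberately naive (leading numeric components only).

```python
import re
from importlib import metadata

# Minimum versions as listed above; packages with no stated minimum map to None.
REQUIREMENTS = {
    "transformers": "5.7.0",
    "torch": "2.11.0",
    "qwen-vl-utils": "0.0.14",
    "torchcodec": None,
    "av": None,
    "pillow": None,
}


def parse_version(text):
    """Naive parse: keep the leading dotted numeric part ('2.11.0+cu128' -> (2, 11, 0))."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        return ()
    return tuple(int(part) for part in match.group(0).split("."))


def check_requirements(requirements=REQUIREMENTS):
    """Return {package: status}; status is 'ok', 'missing', or 'too old (found X)'."""
    report = {}
    for name, minimum in requirements.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if minimum is not None and parse_version(installed) < parse_version(minimum):
            report[name] = f"too old (found {installed})"
        else:
            report[name] = "ok"
    return report


if __name__ == "__main__":
    for package, status in check_requirements().items():
        print(f"{package}: {status}")
```

Comparing tuples of integers avoids the string-comparison trap where "0.0.9" would sort above "0.0.14".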

Is Marlin-2B open source?

Yes. Marlin-2B is fully open source under the Apache 2.0 license and is available on Hugging Face at NemoStation/Marlin-2B. The repository includes custom modeling code (modeling_marlin.py) that wraps the canonical prompts and parses structured output into Python dicts.

Marlin-2B: the 2B video VLM that answers 'what is | explainx.ai Blog

On May 20, 2026, NemoStation released Marlin-2B—a 2B parameter video VLM that answers the two questions developers actually ask their videos: "What is happening?" and "When?" Fine-tuned from Qwen3.5-2B, Marlin produces structured Scene + Event captions with second-precise timestamps and resolves natural-language queries to span-grounded (start, end) ranges. At 2B params, it is the strongest open model in its weight class on dense captioning (DREAM-1K, CaReBench) and natural-language temporal grounding (TimeLens-Bench), matching Gemini-2.0-Flash on grounding and beating Qwen2.5-VL-7B by +6.4 mIoU—all while running on a single consumer GPU with vLLM and swift-deploy compatibility.

This article is a field guide: what Marlin is, how it works, benchmarks, usage examples, training details, and when to choose Marlin over larger VLMs.

TL;DR

Question	Short answer
What is it?	A 2B parameter video VLM fine-tuned from Qwen3.5-2B to extract structured information from videos—dense captions with timestamps + temporal grounding.
Announced	May 20, 2026 by the NemoStation team (Shubham Sharma).
Two modes	(1) .caption() → Scene + Events JSON with second-precise timestamps. (2) .find(event) → (start, end) tuple for natural-language queries.
Key strength	Best-in-class temporal grounding at 2B—beats Qwen2.5-VL-7B by +6.4 mIoU on TimeLens-Bench, matches Gemini-2.0-Flash.

python

result = marlin.caption("video.mp4")

print(result["caption"])  # full raw caption text (Scene: ... Events: ...)
print(result["scene"])    # parsed Scene paragraph
for ev in result["events"]:
    print(f"<{ev['start']:.1f} - {ev['end']:.1f}> {ev['description']}")

json

{
  "caption": "Scene: A modern kitchen...\nEvents: <5.2 - 8.7> Chef places...",
  "scene": "A modern kitchen with stainless steel appliances...",
  "events": [
    {"start": 5.2, "end": 8.7, "description": "Chef places cutting board..."},
    {"start": 8.7, "end": 14.3, "description": "Chef dices onions..."},
    ...
  ]
}
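To make the relationship between the raw caption text and the parsed fields concrete, here is a minimal sketch of a parser for that text. It is not Marlin's own parsing code: the function name `parse_caption`, the regular expression, and the shortened sample string are all assumptions based on the sample output above.

```python
import re

# Matches an event marker such as "<5.2 - 8.7>". The exact marker syntax is
# assumed from the sample output, not taken from Marlin's own parser.
EVENT_MARKER = re.compile(r"<\s*(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)\s*>")


def parse_caption(raw):
    """Split raw caption text into a scene paragraph and a list of timed events."""
    scene_part, _, events_part = raw.partition("Events:")
    scene = scene_part.replace("Scene:", "", 1).strip()

    events = []
    markers = list(EVENT_MARKER.finditer(events_part))
    for i, marker in enumerate(markers):
        # A description runs from the end of its marker to the start of the next one.
        stop = markers[i + 1].start() if i + 1 < len(markers) else len(events_part)
        events.append({
            "start": float(marker.group(1)),
            "end": float(marker.group(2)),
            "description": events_part[marker.end():stop].strip(),
        })
    return {"caption": raw, "scene": scene, "events": events}


# Illustrative input only: a shortened, made-up variant of the sample above.
sample = (
    "Scene: A modern kitchen.\n"
    "Events: <5.2 - 8.7> Chef places cutting board.\n"
    "<8.7 - 14.3> Chef dices onions."
)
parsed = parse_caption(sample)
for ev in parsed["events"]:
    print(f"<{ev['start']:.1f} - {ev['end']:.1f}> {ev['description']}")
```

Slicing between consecutive markers keeps the sketch indifferent to whether events are separated by newlines or spaces, which the sample output does not make clear.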

python

result = marlin.find("video.mp4", event="a person enters the room")

print(result["raw"])        # "From 14.3 to 18.2." (raw model output)
print(result["span"])       # (14.3, 18.2) tuple in seconds, or None on parse failure
print(result["format_ok"])  # True if output matched the trained format
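The three fields can be illustrated with a small stand-alone parser for the raw answer string. This is a sketch under assumptions, not the code shipped with the model: `parse_span` and its pattern are hypothetical, and it treats "matched the format" and "span parsed" as the same condition, which Marlin's own implementation may not do.

```python
import re

# Pattern for the answer format shown above ("From 14.3 to 18.2.").
# Assumption: both numbers are plain decimals in seconds.
SPAN_PATTERN = re.compile(
    r"^\s*From\s+(\d+(?:\.\d+)?)\s+to\s+(\d+(?:\.\d+)?)\s*\.?\s*$"
)


def parse_span(raw):
    """Return a dict with 'raw', 'span', and 'format_ok' for one raw answer."""
    match = SPAN_PATTERN.match(raw)
    if match is None:
        # Anything that does not match the expected sentence is a parse failure.
        return {"raw": raw, "span": None, "format_ok": False}
    span = (float(match.group(1)), float(match.group(2)))
    return {"raw": raw, "span": span, "format_ok": True}


print(parse_span("From 14.3 to 18.2."))
print(parse_span("The event is not visible."))
```

Checking `span is not None` before using the result is a reasonable habit, since a free-form answer cannot be turned into a time range.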

Benchmark	Metric	Marlin-2B	Qwen2.5-VL-7B	Tarsier-34B	Gemini-1.5-Pro	Gemini-2.5-Flash
CaReBench	CIDEr	1st place	—	—	—	Teacher model
DREAM-1K	CIDEr	Between Tarsier-34B and Gemini-1.5-Pro	—	Lower	Higher	0.21-0.43 above Marlin

Benchmark	Metric	Marlin-2B	Qwen2.5-VL-7B	Gemini-2.0-Flash	TimeLens-8B	MiMo-VL
TimeLens-Charades	mIoU	Matches Gemini-2.0-Flash	6.4 mIoU below Marlin	Tied with Marlin	A few points above Marlin	Higher (task-specific)
TimeLens-ActivityNet	mIoU	Matches Gemini-2.0-Flash	6.4 mIoU below Marlin	Tied with Marlin	A few points above Marlin	Higher (task-specific)
TimeLens-QVHighlights	mIoU	Matches Gemini-2.0-Flash	6.4 mIoU below Marlin	Tied with Marlin	A few points above Marlin	Higher (task-specific)
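For readers unfamiliar with the metric, mIoU in temporal grounding is conventionally the mean, over all queries, of the intersection-over-union between the predicted and ground-truth time spans. The sketch below shows that standard definition; how TimeLens-Bench handles details such as unparseable predictions is not stated here, so scoring a missing prediction as 0 is an assumption, and the function names are illustrative.

```python
def temporal_iou(pred, gt):
    """IoU of two (start, end) spans in seconds; 0.0 when they do not overlap."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def mean_iou(preds, gts):
    """Average IoU over paired spans; a None prediction is scored as 0 (assumption)."""
    if not gts or len(preds) != len(gts):
        raise ValueError("preds and gts must be non-empty and the same length")
    scores = [
        temporal_iou(pred, gt) if pred is not None else 0.0
        for pred, gt in zip(preds, gts)
    ]
    return sum(scores) / len(scores)


# Predicted span (14.3, 18.2) against a hypothetical ground truth (15.0, 19.0):
# the overlap is about 3.2 s and the union about 4.7 s.
print(round(temporal_iou((14.3, 18.2), (15.0, 19.0)), 3))
```

Benchmarks usually report this value multiplied by 100, so a gap of 6.4 mIoU corresponds to 0.064 on the raw 0 to 1 scale.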

python

import torch
from transformers import AutoModelForCausalLM

marlin = AutoModelForCausalLM.from_pretrained(
    "NemoStation/Marlin-2B",
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map={"": "cuda"},
)
marlin.compile()  # optional — wraps torch.compile, faster after first call

python

import torch
from transformers import AutoProcessor

# Assumes `marlin` is the model loaded in the previous snippet.
processor = AutoProcessor.from_pretrained("NemoStation/Marlin-2B", trust_remote_code=True)

messages = [{"role": "user", "content": [
    {"type": "video", "video": "video.mp4"},
    {"type": "text", "text": "Your custom prompt here"},
]}]
inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_tensors="pt", return_dict=True,
).to(marlin.device)

with torch.inference_mode():
    out = marlin.generate(**inputs, max_new_tokens=512, do_sample=False)
# Drop the prompt tokens so only the newly generated text is decoded.
out = out[:, inputs["input_ids"].shape[1]:]
text = processor.batch_decode(out, skip_special_tokens=True)[0]
print(text)

Env var	Default	What it does
`FORCE_QWENVL_VIDEO_READER`	`torchcodec`	Video decoder backend
`VIDEO_MAX_PIXELS`	`200704`	Max pixels per frame (~448×448)
`FPS`	`2.0`	Frame sampling rate
`FPS_MAX_FRAMES`	`240`	Cap on total frames (covers ~2 min videos)
`FPS_MIN_FRAMES`	`4`	Floor for very short videos
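The defaults in this table interact in a simple way that is worth working through. The sketch below reads the same variable names from the environment and estimates how many frames a video of a given length yields. The clamping formula is an assumption about how the rate, floor, and cap combine, it ignores any extra rounding the real preprocessing may apply, and the helper names are illustrative.

```python
import math
import os

# Defaults copied from the table above; each can be overridden via the environment.
FPS = float(os.environ.get("FPS", "2.0"))
FPS_MIN_FRAMES = int(os.environ.get("FPS_MIN_FRAMES", "4"))
FPS_MAX_FRAMES = int(os.environ.get("FPS_MAX_FRAMES", "240"))
VIDEO_MAX_PIXELS = int(os.environ.get("VIDEO_MAX_PIXELS", "200704"))


def sampled_frames(duration_s, fps=FPS, min_frames=FPS_MIN_FRAMES,
                   max_frames=FPS_MAX_FRAMES):
    """Estimate frames sampled: duration * fps, clamped to [min_frames, max_frames]."""
    return max(min_frames, min(max_frames, math.floor(duration_s * fps)))


def effective_fps(duration_s, **kwargs):
    """Sampling rate actually achieved once the floor and cap are applied."""
    return sampled_frames(duration_s, **kwargs) / duration_s


for seconds in (1, 60, 120, 300):
    print(seconds, sampled_frames(seconds), effective_fps(seconds))

# 200704 is exactly 448 * 448, which is where the "~448×448" note comes from.
print(math.isqrt(VIDEO_MAX_PIXELS))
```

With the defaults, the cap is reached at 120 seconds (240 frames at 2 fps). A 300-second video is still limited to 240 frames, which works out to 0.8 frames per second, so timestamp precision on longer videos is likely to be coarser.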

python

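result = marlin.caption("security_feed.mp4")
# Keep only the parsed fields; the raw caption text is not needed downstream.
agent_context = {
    "scene": result["scene"],
    "events": result["events"],
}
# `agent` is a placeholder for your own agent object; it is not part of Marlin.
agent.process(agent_context)  # agent can now reason over structured timeline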
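If the downstream agent is a text-only LLM, the structured result usually has to be flattened into prompt text first. Here is one possible way to do that, as a sketch: `format_timeline` and `events_overlapping` are hypothetical helpers, and the sample data is made up in the shape of the caption output shown earlier.

```python
def format_timeline(scene, events):
    """Render a scene paragraph and timed events as plain text for an LLM prompt."""
    lines = [f"Scene: {scene}", "Timeline:"]
    for ev in sorted(events, key=lambda e: e["start"]):
        lines.append(f"[{ev['start']:.1f}s - {ev['end']:.1f}s] {ev['description']}")
    return "\n".join(lines)


def events_overlapping(events, start, end):
    """Return the events that overlap the window (start, end), in seconds."""
    return [ev for ev in events if ev["start"] < end and ev["end"] > start]


# Made-up data in the same shape as the caption result.
example_events = [
    {"start": 8.7, "end": 14.3, "description": "Chef dices onions."},
    {"start": 5.2, "end": 8.7, "description": "Chef places cutting board."},
]
print(format_timeline("A modern kitchen.", example_events))
print(events_overlapping(example_events, 9.0, 10.0))
```

Filtering by time window before formatting keeps the prompt short when only part of a long video is relevant to the question being asked.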

Model	Params	Dense Captioning (DREAM-1K)	Temporal Grounding (TimeLens mIoU)	Runs on single GPU	Open source
Marlin-2B	2B	Between Tarsier-34B and Gemini-1.5-Pro	Matches Gemini-2.0-Flash	Yes (RTX 3090/4090)	Yes (Apache 2.0)
Qwen2.5-VL-7B	7B	Lower than Marlin	6.4 mIoU below Marlin	Yes	Yes
Tarsier-34B	34B	Lower than Gemini-1.5-Pro	—	No (multi-GPU)	Yes
Gemini-1.5-Pro	Large	Higher than Marlin	—	No (API only)	No
Gemini-2.0-Flash	Medium	0.21-0.43 above Marlin	Tied with Marlin	No (API only)	No
Gemini-2.5-Flash	Medium	Teacher for Marlin (higher)	Teacher for Marlin (higher)	No (API only)	No
TimeLens-8B	8B	—	A few points above Marlin (task-specific)	Yes	Yes

Marlin-2B: the 2B video VLM that answers 'what is happening' and 'when' with structured timestamps (NemoStation, 2026)

TL;DR

What is Marlin-2B?

Feature 01: Caption mode—structured Scene + Events with timestamps

Feature 02: Find mode—natural-language temporal grounding

Feature 03: Benchmarks—best in class at 2B

Dense Captioning

Temporal Grounding

Feature 04: Training—two-stage SFT + SimPO

Feature 05: Developer-friendly API—transformers + convenience methods

Video preprocessing—defaults match training

System requirements

Use cases: video indexing, agent context, content moderation

01. Video library indexing

02. Agent context for multimodal workflows

03. Content moderation at scale

04. Training data generation for future models

Marlin-2B vs larger VLMs

Limitations and future work

Sources
