Marlin-2B: the 2B video VLM that answers 'what is happening' and 'when' with structured timestamps (NemoStation, 2026)

NemoStation released Marlin-2B on May 20, 2026—a 2B parameter video VLM fine-tuned from Qwen3.5 that extracts structured Scene + Event captions with second-precise timestamps and resolves natural-language queries to span-grounded (start, end) ranges. Beats Qwen2.5-VL-7B by +6.4 mIoU on TimeLens-Bench, matches Gemini-2.0-Flash, and tops DREAM-1K in its weight class.

11 min read · Yash Thakker
Tags: Video VLM, Marlin-2B, NemoStation, Dense Captioning, Temporal Grounding, Qwen3.5
On May 20, 2026, NemoStation released Marlin-2B—a 2B parameter video VLM that answers the two questions developers actually ask their videos: "What is happening?" and "When?" Fine-tuned from Qwen3.5-2B, Marlin produces structured Scene + Event captions with second-precise timestamps and resolves natural-language queries to span-grounded (start, end) ranges. At 2B params, it is the strongest open model in its weight class on dense captioning (DREAM-1K, CaReBench) and natural-language temporal grounding (TimeLens-Bench), matching Gemini-2.0-Flash on grounding and beating Qwen2.5-VL-7B by +6.4 mIoU—all while running on a single consumer GPU with vLLM and swift-deploy compatibility.

This article is a field guide: what Marlin is, how it works, benchmarks, usage examples, training details, and when to choose Marlin over larger VLMs.

TL;DR

| Question | Short answer |
| --- | --- |
| What is it? | A 2B parameter video VLM fine-tuned from Qwen3.5-2B to extract structured information from videos: dense captions with timestamps plus temporal grounding. |
| Announced | May 20, 2026 by NemoStation team (Shubham Sharma). |
| Two modes | (1) .caption() → Scene + Events JSON with second-precise timestamps. (2) .find(event) → (start, end) tuple for natural-language queries. |
| Key strength | Best-in-class temporal grounding at 2B: beats Qwen2.5-VL-7B by +6.4 mIoU on TimeLens-Bench, matches Gemini-2.0-Flash. |
| Benchmarks | Tops CaReBench leaderboard, sits between Tarsier-34B and Gemini-1.5-Pro on DREAM-1K. |
| Training | Two-stage: (1) SFT on ~400K curated clips (ActivityNet, LSMDC, Charades, + Gemini-3-Flash teacher). (2) SimPO preference optimization. |
| System requirements | Single consumer GPU (RTX 3090/4090). Requires transformers ≥ 5.7.0, torch ≥ 2.11.0, torchcodec, qwen-vl-utils ≥ 0.0.14. |
| Open source | Yes, Apache 2.0 license. Hugging Face: NemoStation/Marlin-2B. |
| Related ecosystem | Pairs with Qwen3.5 models, agent skills, and video indexing pipelines. |

Primary source: Hugging Face model card · Shubham Sharma on X


What is Marlin-2B?

Marlin is a 2B video VLM tuned for the two questions developers actually ask their videos: what is happening, and when?

Core capabilities:

  • Dense captioning with second-precise timestamps (Scene + Events structure)
  • Temporal grounding to span-grounded (start, end) ranges for natural-language queries
  • Character and object consistency across events
  • Atomic event detection with explicit boundaries (<start-end>)

Example workflow:

  1. Input: A 2-minute cooking video.
  2. marlin.caption() returns:
    • Scene: "A modern kitchen with stainless steel appliances, marble countertop, natural light from window."
    • Events:
      • <5.2 - 8.7> "Chef places cutting board on counter and arranges vegetables."
      • <8.7 - 14.3> "Chef dices onions with chef's knife."
      • <14.3 - 22.1> "Chef heats olive oil in pan on stovetop."
      • <22.1 - 30.5> "Chef sautés onions in pan, stirring occasionally."
      • ...
  3. marlin.find(event="chef starts cooking") returns:
    • (14.3, 22.1) — the span where cooking begins (heating oil).

Why this matters: Most VLMs generate free-form prose that's hard to parse programmatically. Marlin produces typed Python dicts with explicit timestamps, making it ideal for video indexing, agent context, and downstream automation.


Feature 01: Caption mode—structured Scene + Events with timestamps

Problem: Existing video captioning models return unstructured prose like "A person is cooking in a kitchen." No timestamps, no event boundaries, no programmatic access.

Solution: Marlin's .caption() method returns a parsed Python dict (shown below in JSON form):

result = marlin.caption("video.mp4")

print(result["caption"])  # full raw caption text (Scene: ... Events: ...)
print(result["scene"])    # parsed Scene paragraph
for ev in result["events"]:
    print(f"<{ev['start']:.1f} - {ev['end']:.1f}> {ev['description']}")

Output structure:

{
  "caption": "Scene: A modern kitchen...\nEvents: <5.2 - 8.7> Chef places...",
  "scene": "A modern kitchen with stainless steel appliances...",
  "events": [
    {"start": 5.2, "end": 8.7, "description": "Chef places cutting board..."},
    {"start": 8.7, "end": 14.3, "description": "Chef dices onions..."},
    ...
  ]
}

Training format: The model was trained on a canonical prompt that forces Scene: <paragraph> followed by Events: <X.X - Y.Y> <description> format. At inference, the custom modeling code (modeling_marlin.py) wraps the prompt automatically and parses the structured output into typed Python dicts—no regex wrangling.
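If you ever need to parse the raw caption text yourself (for example, cached output from a direct .generate() call), the format described above is simple enough to handle in a few lines. The sketch below is a minimal illustration, assuming the raw text follows the "Scene: ... Events: <X.X - Y.Y> description" layout from the output example. The name parse_caption is hypothetical and this is not the parser shipped in modeling_marlin.py.

```python
import re

# Hypothetical helper for illustration; NOT the parser in modeling_marlin.py.
# Matches "<5.2 - 8.7> some description" and captures start, end, text.
EVENT_RE = re.compile(r"<\s*(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)\s*>\s*([^<]*)")

def parse_caption(raw: str) -> dict:
    """Split 'Scene: ... Events: <s - e> text ...' into scene and events."""
    scene_part, _, events_part = raw.partition("Events:")
    scene = scene_part.replace("Scene:", "", 1).strip()
    events = [
        {"start": float(s), "end": float(e), "description": d.strip()}
        for s, e, d in EVENT_RE.findall(events_part)
    ]
    return {"caption": raw, "scene": scene, "events": events}

# Sample text modeled on the article's cooking example.
raw = (
    "Scene: A modern kitchen with a marble countertop.\n"
    "Events: <5.2 - 8.7> Chef places cutting board on counter.\n"
    "<8.7 - 14.3> Chef dices onions."
)
parsed = parse_caption(raw)
print(parsed["scene"])
for ev in parsed["events"]:
    print(ev)
```

In normal use the .caption() helper makes this unnecessary; the sketch only shows why the fixed training format makes the output easy to consume.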

Use cases:

  • Video indexing (e.g., "find all videos where someone enters a room")
  • Agent context (LLM agents can read Marlin's output to understand what happened)
  • Content moderation (flag specific events by timestamp)
  • Training data generation (dense captions for future video models)

Feature 02: Find mode—natural-language temporal grounding

Problem: You want to locate a specific moment in a video—e.g., "when does the person start running?"—but scrubbing through manually takes forever.

Solution: Marlin's .find() method resolves queries to (start, end) tuples:

result = marlin.find("video.mp4", event="a person enters the room")

print(result["raw"])        # "From 14.3 to 18.2." (raw model output)
print(result["span"])       # (14.3, 18.2) tuple in seconds, or None on parse failure
print(result["format_ok"])  # True if output matched the trained format

Example queries:

  • "person starts running" → (23.5, 28.1)
  • "car door opens" → (10.2, 11.7)
  • "dog catches frisbee" → (45.3, 47.8)

Training format: Marlin was trained on ground-truth spans from HC-STVG, VidSTG, and TimeLens datasets, producing "From X.X to Y.Y." format. The custom modeling code parses this into (start, end) tuples automatically.
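For readers working with raw model output, the "From X.X to Y.Y." format can be turned into a span with a single regular expression. This is a rough sketch, not the shipped parser: parse_span is a hypothetical name, and returning None for a reversed span (end before start) is an assumption about sensible behavior, not something the model card specifies.

```python
import re
from typing import Optional, Tuple

# Hypothetical helper; the shipped .find() method already does this for you.
SPAN_RE = re.compile(r"From\s+(\d+(?:\.\d+)?)\s+to\s+(\d+(?:\.\d+)?)")

def parse_span(raw: str) -> Optional[Tuple[float, float]]:
    """Return (start, end) in seconds, or None if the text does not match."""
    m = SPAN_RE.search(raw)
    if m is None:
        return None
    start, end = float(m.group(1)), float(m.group(2))
    # Assumption: treat a reversed span as a parse failure.
    return (start, end) if end >= start else None

print(parse_span("From 14.3 to 18.2."))
print(parse_span("I could not find that event."))
```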

Use cases:

  • Agent loops (fast enough to run inline—agent asks "when does X happen?" and gets immediate answer)
  • Video search (locate specific moments across a library)
  • Highlight reels (programmatically extract key moments)
  • Data labeling (bootstrap annotations for new datasets)

Feature 03: Benchmarks—best in class at 2B

Marlin sits at the Pareto frontier for 2B models on both dense captioning and temporal grounding:

Dense Captioning

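| Benchmark | Metric | Marlin-2B | Qwen2.5-VL-7B | Tarsier-34B | Gemini-1.5-Pro | Gemini-2.5-Flash |
| --- | --- | --- | --- | --- | --- | --- |
| CaReBench | CIDEr | 1st place | - | - | - | Teacher model |
| DREAM-1K | CIDEr | Between Tarsier-34B and Gemini-1.5-Pro | - | Lower | Higher | 0.21-0.43 above Marlin |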

Key takeaway: Marlin closes the gap to its Gemini-2.5-Flash teacher to within 0.21 and 0.43 points (on a 10-point scale) on dense captioning, despite having only 2B parameters.

Temporal Grounding

| Benchmark | Metric | Marlin-2B | Qwen2.5-VL-7B | Gemini-2.0-Flash | TimeLens-8B | MiMo-VL |
| --- | --- | --- | --- | --- | --- | --- |
| TimeLens-Charades | mIoU | Matches Gemini-2.0-Flash | 6.4 mIoU below Marlin | Tied with Marlin | A few points above Marlin | Higher (task-specific) |
| TimeLens-ActivityNet | mIoU | Matches Gemini-2.0-Flash | 6.4 mIoU below Marlin | Tied with Marlin | A few points above Marlin | Higher (task-specific) |
| TimeLens-QVHighlights | mIoU | Matches Gemini-2.0-Flash | 6.4 mIoU below Marlin | Tied with Marlin | A few points above Marlin | Higher (task-specific) |

Key takeaway: Marlin beats Qwen2.5-VL-7B (3.5× larger) by +6.4 mIoU and matches Gemini-2.0-Flash on temporal grounding. Specialized 7B-8B models (TimeLens-7B/8B, MiMo-VL, Time-R1) hold the upper frontier because they are trained on task-specific data; Marlin is the strongest general-purpose model at 2B.
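For context on the metric: mIoU is the mean temporal intersection-over-union between predicted and ground-truth spans. The sketch below uses the standard IoU definition for 1-D intervals. Scoring an unparseable prediction (None) as 0 is an assumption made here for illustration; the exact TimeLens-Bench evaluation protocol may differ.

```python
from typing import Optional, Sequence, Tuple

Span = Tuple[float, float]

def temporal_iou(pred: Optional[Span], gt: Span) -> float:
    """Intersection-over-union of two (start, end) spans in seconds."""
    if pred is None:  # assumption: a parse failure scores 0
        return 0.0
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def mean_iou(preds: Sequence[Optional[Span]], gts: Sequence[Span]) -> float:
    """Average IoU over paired predictions and ground-truth spans."""
    scores = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    return sum(scores) / len(scores)

# Made-up spans: one partial overlap, one failed prediction.
preds = [(14.0, 18.0), None]
gts = [(15.0, 19.0), (2.0, 4.0)]
print(mean_iou(preds, gts))
```

Benchmarks usually report this value as a percentage, so a "+6.4 mIoU" gap corresponds to 0.064 on the 0-1 scale used in this sketch.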

Trajectory chart (from model card): The three-panel figure shows progression from Qwen3.5-2B base → Marlin-SFT → Marlin-SimPO (release checkpoint):

  • CaReBench: Steady climb to top of leaderboard.
  • DREAM-1K: Closes gap to Gemini-2.5-Flash teacher.
  • TimeLens-Charades: Reaches Pareto frontier in 2B band, matches Gemini-2.5-Flash (non-thinking).

Feature 04: Training—two-stage SFT + SimPO

Stage 1: Supervised Fine-Tuning (SFT)

Data: ~400K high-quality clip-level annotations:

  • Public datasets: ActivityNet, LSMDC, Charades, Charades-Ego, TREC-VTT, WebVid-10M, HC-STVG, VidSTG, TimeLens
  • Teacher annotations: Dense re-annotations from Gemini-3-Flash in thinking mode, producing temporally grounded atomic events with explicit <start-end> boundaries (not free-form prose)
  • Human review: Targeted review on highest-impact splits

Training setup:

  • Base model: Qwen3.5-2B with video-capable visual tower kept intact
  • Prompt: Fixed canonical prompt per mode (caption vs find), with Tarsier-schema output formatting
  • Compute: Single H100

Stage 2: Preference Optimization (SimPO)

Why SimPO? Cheaper and more stable than DPO at this scale—no reference model required.
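To make the "no reference model" point concrete, here is a small sketch of the SimPO objective for a single preference pair. SimPO uses the length-normalized log-probability of each response under the policy as its implicit reward and adds a target margin, so no frozen reference model appears anywhere in the loss. The beta and gamma values below are illustrative placeholders, not Marlin's hyperparameters (the source does not state them), and the function name is hypothetical.

```python
import math

def simpo_loss(logp_chosen: float, len_chosen: int,
               logp_rejected: float, len_rejected: int,
               beta: float = 2.0, gamma: float = 1.0) -> float:
    """SimPO loss for one (chosen, rejected) pair.

    logp_* are summed token log-probabilities under the policy being trained;
    len_* are response lengths in tokens. beta and gamma are illustrative.
    """
    reward_chosen = beta * logp_chosen / len_chosen        # length-normalized reward
    reward_rejected = beta * logp_rejected / len_rejected
    margin = reward_chosen - reward_rejected - gamma
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# The loss is small when the chosen response is clearly more likely...
good = simpo_loss(-10.0, 10, -30.0, 10)
# ...and large when the policy prefers the rejected response.
bad = simpo_loss(-30.0, 10, -10.0, 10)
print(good < bad)
```

By contrast, DPO's implicit reward is a log-ratio against a reference policy, which means keeping a second copy of the model in memory during training.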

Preference dataset:

  • Candidate completions from SFT checkpoint scored against Gemini-3-Flash judge using a rich rubric (factual accuracy, completeness, temporal alignment)
  • Resulting win/lose pairs align Marlin without needing a reference model

Result: Marlin-SimPO (the release checkpoint) improves over Marlin-SFT on all three benchmarks—see trajectory chart in model card.

Recipe paper: "Coming soon" according to the model card (as of May 21, 2026).


Feature 05: Developer-friendly API—transformers + convenience methods

Standard HF transformers API with two convenience methods (.caption, .find) added directly to the model object:

import torch
from transformers import AutoModelForCausalLM

marlin = AutoModelForCausalLM.from_pretrained(
    "NemoStation/Marlin-2B",
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map={"": "cuda"},
)
marlin.compile()  # optional — wraps torch.compile, faster after first call

Caption mode:

result = marlin.caption("video.mp4")
print(result["scene"])
for ev in result["events"]:
    print(f"<{ev['start']:.1f} - {ev['end']:.1f}> {ev['description']}")

Find mode:

result = marlin.find("video.mp4", event="person enters the room")
print(result["span"])  # (14.3, 18.2) or None

Optional kwargs:

  • max_new_tokens=2048 (default) — generation token cap
  • prompt=None — override canonical prompt (almost always leave as None)
  • do_sample=False, temperature=1.0, top_p=1.0 — sampling controls

Advanced—raw inference: If you want to bypass helper methods and call .generate() directly:

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("NemoStation/Marlin-2B", trust_remote_code=True)

messages = [{"role": "user", "content": [
    {"type": "video", "video": "video.mp4"},
    {"type": "text", "text": "Your custom prompt here"},
]}]
inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_tensors="pt", return_dict=True,
).to(marlin.device)

with torch.inference_mode():
    out = marlin.generate(**inputs, max_new_tokens=512, do_sample=False)
out = out[:, inputs["input_ids"].shape[1]:]
text = processor.batch_decode(out, skip_special_tokens=True)[0]
print(text)

Note: The model emits a <think> token at the start of every response (training artifact with add_non_thinking_prefix=True). The .caption() and .find() methods strip this automatically. If using .generate() directly, strip <think>...</think> from the start.
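When calling .generate() directly, the leading think block can be removed with a small helper. This is a sketch under one assumption: that the literal <think> and </think> tags survive decoding in your setup. The name strip_think is hypothetical and is not part of the model's API.

```python
import re

# Matches one optional leading "<think> ... </think>" block plus surrounding whitespace.
THINK_PREFIX_RE = re.compile(r"^\s*<think>.*?</think>\s*", re.DOTALL)

def strip_think(text: str) -> str:
    """Remove a leading <think>...</think> block, if present."""
    return THINK_PREFIX_RE.sub("", text, count=1)

print(strip_think("<think>\n\n</think>\n\nFrom 14.3 to 18.2."))
print(strip_think("Scene: A modern kitchen."))
```

Text without a leading think block passes through unchanged, so the helper is safe to apply unconditionally.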


Video preprocessing—defaults match training

The custom modeling code sets these env vars internally (matching training-time setup):

| Env var | Default | What it does |
| --- | --- | --- |
| FORCE_QWENVL_VIDEO_READER | torchcodec | Video decoder backend |
| VIDEO_MAX_PIXELS | 200704 | Max pixels per frame (~448×448) |
| FPS | 2.0 | Frame sampling rate |
| FPS_MAX_FRAMES | 240 | Cap on total frames (covers ~2 min videos) |
| FPS_MIN_FRAMES | 4 | Floor for very short videos |

Override: Set env vars in your shell before importing transformers if you need different values.
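The same override can be done from Python, provided it happens before transformers (and therefore the custom modeling code) is imported. The sketch below uses the variable names from the table; the non-default FPS value is purely illustrative, and it takes the source's statement that pre-set values are respected at face value.

```python
import os

# Set overrides BEFORE importing transformers or loading Marlin.
# FPS=1.0 is an illustrative value, not a recommendation.
os.environ["FPS"] = "1.0"
os.environ["FPS_MAX_FRAMES"] = "240"
os.environ["VIDEO_MAX_PIXELS"] = str(448 * 448)  # same as the 200704 default

# Rough coverage arithmetic: how many seconds fit in the frame budget.
fps = float(os.environ["FPS"])
max_frames = int(os.environ["FPS_MAX_FRAMES"])
coverage_seconds = max_frames / fps
print(f"{max_frames} frames at {fps} FPS cover about {coverage_seconds:.0f} s")

# import transformers  # load the model only after this point
```

Halving the sampling rate doubles the covered duration for the same frame cap, at the cost of coarser temporal resolution; since these defaults match the training setup, deviating from them may affect accuracy.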


System requirements

Hardware:

  • Single consumer GPU (NVIDIA RTX 3090, RTX 4090, or equivalent)
  • Runs in BF16 with ~4GB VRAM

Software:

  • transformers >= 5.7.0 (for native qwen3_5 architecture)
  • torch >= 2.11.0
  • torchcodec (video decoding)
  • qwen-vl-utils >= 0.0.14
  • av (torchcodec system dependency)
  • pillow

Install:

pip install "transformers>=5.7.0" "torch>=2.11.0" torchcodec "qwen-vl-utils>=0.0.14" av pillow

Optional:

  • torch.compile() for faster inference after first call
  • vLLM for batch inference and deployment
  • swift-deploy for production serving

Use cases: video indexing, agent context, content moderation

01. Video library indexing

Problem: You have 10,000 hours of footage and need to find all instances of "person opens a door."

Solution:

for video_path in video_library:
    result = marlin.find(video_path, event="person opens a door")
    if result["span"]:
        index.add(video_path, result["span"])

Result: Searchable index of time-stamped events across your entire library.


02. Agent context for multimodal workflows

Problem: An LLM agent needs to understand what happened in a video feed to decide next actions.

Solution:

result = marlin.caption("security_feed.mp4")
agent_context = {
    "scene": result["scene"],
    "events": result["events"],
}
agent.process(agent_context)  # agent can now reason over structured timeline

Result: The agent sees a structured timeline (not raw pixels or prose) and can reason: "Event 3 shows a person entering the restricted area at 14.3s—trigger alert."


03. Content moderation at scale

Problem: Flag videos containing specific events (violence, nudity, etc.) by timestamp.

Solution:

result = marlin.caption("user_upload.mp4")
for ev in result["events"]:
    if moderation_model.is_flagged(ev["description"]):
        flag(video_id, start=ev["start"], end=ev["end"])

Result: Timestamped moderation flags for human review or automated takedowns.


04. Training data generation for future models

Problem: You need dense captions with timestamps to train the next generation of video models.

Solution: Run Marlin over a large video corpus, export results as training data.

Result: High-quality annotations at scale—cheaper and faster than human labelers.
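One possible shape for the export step is a JSONL file with one record per clip. The sketch below is an assumption-laden illustration: export_captions and the record layout are hypothetical, and the sample data stands in for real (path, marlin.caption(path)) pairs so the snippet runs without a GPU.

```python
import io
import json

def export_captions(results, fh) -> int:
    """Write one JSON object per line; skip clips with no parsed events."""
    written = 0
    for video_path, result in results:
        if not result.get("events"):
            continue  # nothing usable for training
        record = {
            "video": video_path,
            "scene": result["scene"],
            "events": result["events"],
        }
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
        written += 1
    return written

# Stand-in for (path, marlin.caption(path)) pairs.
sample = [
    ("clip_0001.mp4", {"scene": "A modern kitchen.",
                       "events": [{"start": 5.2, "end": 8.7,
                                   "description": "Chef places cutting board."}]}),
    ("clip_0002.mp4", {"scene": "An empty hallway.", "events": []}),
]
buf = io.StringIO()
count = export_captions(sample, buf)
print(count)
```

In a real pipeline you would pass an open file handle instead of the StringIO buffer, and likely add a human or model-based quality filter before using the records for training.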


Marlin-2B vs larger VLMs

| Model | Params | Dense Captioning (DREAM-1K) | Temporal Grounding (TimeLens mIoU) | Runs on single GPU | Open source |
| --- | --- | --- | --- | --- | --- |
| Marlin-2B | 2B | Between Tarsier-34B and Gemini-1.5-Pro | Matches Gemini-2.0-Flash | Yes (RTX 3090/4090) | Yes (Apache 2.0) |
| Qwen2.5-VL-7B | 7B | Lower than Marlin | 6.4 mIoU below Marlin | Yes | Yes |
| Tarsier-34B | 34B | Lower than Gemini-1.5-Pro | - | No (multi-GPU) | Yes |
| Gemini-1.5-Pro | Large | Higher than Marlin | - | No (API only) | No |
| Gemini-2.0-Flash | Medium | 0.21-0.43 above Marlin | Tied with Marlin | No (API only) | No |
| Gemini-2.5-Flash | Medium | Teacher for Marlin (higher) | Teacher for Marlin (higher) | No (API only) | No |
| TimeLens-8B | 8B | - | A few points above Marlin (task-specific) | Yes | Yes |

When to choose Marlin:

  • You need structured output (JSON with timestamps), not prose
  • You want to run locally on consumer hardware
  • You need temporal grounding (find when X happens)
  • You want open-source under Apache 2.0
  • You need fast inference for agent loops or real-time applications

When to choose larger models:

  • You need peak visual quality over all else (Gemini-2.5-Flash, GPT-4o)
  • You're fine with API calls and don't need local deployment
  • You need long-form prose captions (not structured events)

Limitations and future work

Two-minute frame budget: Marlin samples at 2 FPS with a 240-frame cap, which covers about 2 minutes of video (240 frames / 2 FPS = 120 s). Events in longer videos (>2 min) may be missed.

Multichunk reasoning limited: The model has <think>-style chunked-video reasoning (chunk-time → source-time arithmetic), but this is not directly exposed via .caption() / .find(). Use raw prompts if needed.
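Until multichunk support is exposed, one client-side workaround is to split a long video into chunks of at most 120 seconds, run .find() or .caption() on each chunk, and shift the returned timestamps back to source time. The sketch below covers only the arithmetic; the function names are hypothetical, cutting the video itself is left to your tooling, and this is not the model's built-in chunked reasoning.

```python
from typing import List, Tuple

Span = Tuple[float, float]

def chunk_bounds(duration_s: float, chunk_s: float = 120.0) -> List[Span]:
    """Split [0, duration_s] into consecutive chunks of at most chunk_s seconds."""
    bounds = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        bounds.append((start, end))
        start = end
    return bounds

def to_source_time(span: Span, chunk_start_s: float) -> Span:
    """Convert a span measured within a chunk to source-video time."""
    return (chunk_start_s + span[0], chunk_start_s + span[1])

# A 5-minute video becomes three chunks; the last one is shorter.
for chunk_start, chunk_end in chunk_bounds(300.0):
    print(chunk_start, chunk_end)

# A span found 14.3 s into the second chunk, mapped back to the full video.
print(to_source_time((14.3, 18.2), chunk_start_s=120.0))
```

Events that straddle a chunk boundary will be split or missed with this approach; overlapping chunks would reduce that risk at the cost of extra inference.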

No audio transcription: Marlin processes video frames only and does not transcribe speech. For speech-to-text, use a separate ASR model (e.g., Whisper, Cohere Transcribe).

Bias and hallucination: Like all VLMs, Marlin can hallucinate events or exhibit biases from training data. Validate outputs on safety-critical applications.

Future work (from model card):

  • Longer video support (>2 min)
  • Multichunk reasoning exposed via helper methods
  • Audio transcription integration
  • Recipe paper publication


Model capabilities, benchmark rankings, and hardware requirements may change with future releases. Treat this as May 21, 2026 context—verify performance claims on the latest leaderboards before production deployment. Marlin-2B is Apache 2.0 licensed; commercial use is permitted.
