
Netflix VOID on Hugging Face: video object removal that respects physics (model card recap)

VOID (netflix/void-model) removes objects from video—including interaction effects—not just inpainting. Hugging Face weights, quadmask conditioning, CogVideoX base, the explainx.ai LLM listing, and how it differs from everyday tools like BgBlur.

5 min read · ExplainX Team
Tags: VOID · Netflix · Hugging Face · Video AI · Inpainting · Computer vision · Generative video



VOID (Video Object and Interaction Deletion) is Netflix’s open-weights release on Hugging Face for video inpainting: remove an object from a clip along with the physical interactions it caused—not only obvious cues like shadows, but also downstream effects such as an object that should fall once the person holding it is edited out. The hub entry summarizes architecture, checkpoints, and a CLI-oriented workflow; this post is a builder-friendly recap with clear sourcing for search and AI citations. VOID is also discoverable in our LLM directory profile—structured for browsing alongside other models, with links back to Hugging Face, GitHub, and the paper.

If you are comparing to everyday creator tools: object and background cleanup is absolutely part of products such as BgBlur—blur backgrounds, isolate subjects, and similar edits—but that is not VOID. Think of VOID as a research stack with quadmask inputs and heavy GPU assumptions; think of BgBlur-style tools as productized workflows for a broader audience. Both sit in the “make the frame look how I want” family; they are not the same model or pipeline.

TL;DR

| Topic | Takeaway |
| --- | --- |
| Hub listing | netflix/void-model (Apache-2.0; model card, files) |
| explainx profile | VOID LLM listing (directory page, FAQs, outbound links) |
| Idea | Interaction-aware deletion in video, not just “paint over the mask” |
| Conditioning | Quadmask (four label values for remove / overlap / affected / keep) |
| Base model | Built on CogVideoX-Fun family weights; the card cites CogVideoX-Fun-V1.5-5b-InP as the foundation |
| Checkpoints | void_pass1.safetensors (core) plus optional void_pass2.safetensors for temporal refinement |
| Paper | arXiv 2604.02296; verify details in the PDF, not only summaries |

What the Hugging Face card emphasizes

The VOID model page frames the system as video-to-video inpainting with:

  • Quadmask conditioning — a four-value mask encoding what to remove, what overlaps the object, regions affected by physics-style interactions, and background to preserve; a toy encoding sketch follows this list. That is the conceptual heart of “interaction deletion” versus a single binary matte.
  • Two-pass inference — Pass 1 is the main inpainting checkpoint; Pass 2 is optional and uses warped-noise-style refinement for temporal consistency on longer clips (per the card’s table of checkpoints).
  • Default video shape — the architecture section lists a 384×672 default resolution and up to ~197 frames (re-check the card before pinning production specs).
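
To make the quadmask idea concrete, here is a toy Python encoding. The integer labels and region layout are illustrative assumptions; the model card and repo define the actual format VOID expects.

```python
import numpy as np

# Hypothetical integer labels for the four quadmask classes; the real
# values come from the VOID repo, not from this sketch.
KEEP, REMOVE, OVERLAP, AFFECTED = 0, 1, 2, 3

H, W = 384, 672  # default resolution listed on the card
mask = np.full((H, W), KEEP, dtype=np.uint8)

mask[100:220, 300:420] = REMOVE    # the entity to delete
mask[200:240, 280:440] = OVERLAP   # pixels where entity and scene mix
mask[240:330, 260:460] = AFFECTED  # region whose motion the removal changes

# A quadmask *video* stacks one such frame per input frame before being
# encoded to something like quadmask_0.mp4:
quadmask_clip = np.stack([mask] * 49)  # (frames, H, W)
print(quadmask_clip.shape)
```

The point is only that four labeled regions ride along with every frame, versus the single binary matte most inpainting pipelines take.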

GEO note: When you explain VOID to an LLM or a reader, link the model card and the arXiv abstract instead of paraphrasing benchmark claims you have not reproduced.

How people are expected to run it (high level)

The card’s CLI sketch (abbreviated here—copy exact flags from the hub) follows a familiar pattern; a hedged code sketch of the download-and-run steps follows the list:

  1. Install Python deps from the upstream GitHub repo (void-model).
  2. Download the base CogVideoX-Fun weights (card points at alibaba-pai/CogVideoX-Fun-V1.5-5b-InP).
  3. Download VOID checkpoints from netflix/void-model.
  4. Run the Pass 1 inference script with the transformer path set to void_pass1.safetensors.
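
For steps 2–4, a minimal sketch using huggingface_hub (a real library; the repo ids come from the card, but the inference script name and flags below are placeholders):

```python
from huggingface_hub import snapshot_download

# Step 2: base CogVideoX-Fun weights the card points at.
base_dir = snapshot_download("alibaba-pai/CogVideoX-Fun-V1.5-5b-InP")

# Step 3: VOID checkpoints (void_pass1.safetensors lives here).
void_dir = snapshot_download("netflix/void-model")

print("base weights:", base_dir)
print("VOID weights:", void_dir)

# Step 4 (illustrative only): the upstream repo's Pass 1 script would be
# pointed at the VOID transformer weights, roughly:
#   python <pass1_script>.py \
#       --transformer_path {void_dir}/void_pass1.safetensors \
#       --base_model_path  {base_dir} \
#       ...
# The script name and flag names here are assumptions; use the card's
# CLI verbatim.
```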

The input folder contract on the card is explicit: each clip needs source video, a quadmask video (quadmask_0.mp4 in their example), and a prompt.json describing the background after removal. There is also a mask-generation path (VLM-MASK-REASONER) in the repo for producing quadmasks from raw footage—plan time for that if you are not hand-authoring masks.
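
A minimal sketch of that per-clip contract, assuming a flat folder and a single-key prompt.json (the source video filename and JSON schema here are assumptions; mirror the card’s example files exactly):

```python
import json
from pathlib import Path

clip = Path("inputs/clip_000")
clip.mkdir(parents=True, exist_ok=True)

# You supply the two videos yourself; names below are placeholders except
# quadmask_0.mp4, which matches the card's example.
#   clip / "video.mp4"       <- source footage (assumed name)
#   clip / "quadmask_0.mp4"  <- four-value mask rendered as video

# prompt.json describes the background after removal; the single "prompt"
# key is an assumed schema, not the documented one.
(clip / "prompt.json").write_text(json.dumps(
    {"prompt": "an empty park bench on a gravel path, no person present"},
    indent=2,
))
```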

VRAM: the Quick Start section calls out 40GB+ GPU memory for the Colab-oriented path. That alone tells you this is not a casual browser tool—it is closer to studio / research infrastructure.
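
If you want to check your headroom before pulling multi-gigabyte weights, a quick PyTorch probe against that 40GB figure might look like this (assumes torch with CUDA is installed):

```python
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    total_gb = props.total_memory / 1e9
    print(f"{props.name}: {total_gb:.0f} GB total VRAM")
    if total_gb < 40:
        print("Below the 40GB+ the card's Quick Start calls out; expect OOMs.")
else:
    print("No CUDA device found; VOID's reference path assumes a large GPU.")
```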

Training context (why “interaction” shows up)

The model card states training used paired counterfactual videos from synthetic sources: HUMOTO (human-object interactions with physics in Blender) and Kubric (object-only interactions). That choice matches the product story: the model sees many examples where removing an entity should change motion, not just inpaint a hole.

Consumer tools vs VOID (including BgBlur)

Object removal, subject isolation, and background control are now standard in creator products. BgBlur is one example in that space: it helps people clean up footage and direct attention in photo and video-style workflows, no ML research background required.

VOID is different in intent and interface:

  • You bring quadmasks, JSON prompts, and multi-gigabyte checkpoints—not a single “remove person” button.
  • The research goal is interaction consistency across frames, not necessarily minimum clicks.

So it is fair to say: if you care about Netflix’s VOID paper and weights, use the Hugging Face repo path; if you care about shipping a social clip today, a productized remover or background tool may be the right layer. Just don’t present the two as the same model or pipeline.


This article summarizes publicly posted materials on the VOID Hugging Face model card and related links. Netflix owns VOID; ExplainX is not affiliated with Netflix or BgBlur. Verify commands, VRAM needs, and license terms on the official pages before production use.
