Netflix VOID on Hugging Face: video object removal that respects physics (model card recap)

VOID (netflix/void-model) removes objects from video—including interaction effects—not just inpainting. Hugging Face weights, quadmask conditioning, CogVideoX base, the explainx.ai LLM listing, and how it differs from everyday tools like BgBlur.

11 min read · Yash Thakker
Tags: VOID, Netflix, Hugging Face, Video AI, Inpainting, Computer vision, Generative video

VOID (Video Object and Interaction Deletion) is Netflix’s open-weights release on Hugging Face for video inpainting: it removes an object from a clip along with the physical interactions it caused. That covers not only obvious cues like shadows, but also effects such as objects that should fall once a person is edited out. The hub entry summarizes architecture, checkpoints, and a CLI-oriented workflow; this post is a builder-friendly recap with clear sourcing for search and AI citations. VOID is also discoverable in our LLM directory profile, which is structured for browsing alongside other models and links back to Hugging Face, GitHub, and the paper.

If you are comparing to everyday creator tools: object and background cleanup is absolutely part of products such as BgBlur—blur backgrounds, isolate subjects, and similar edits—but that is not VOID. Think of VOID as a research stack with quadmask inputs and heavy GPU assumptions; think of BgBlur-style tools as productized workflows for a broader audience. Both sit in the “make the frame look how I want” family; they are not the same model or pipeline.

TL;DR

| Topic | Takeaway |
| --- | --- |
| Hub listing | netflix/void-model: Apache-2.0, model card, files. |
| explainx profile | VOID LLM listing: directory page, FAQs, outbound links. |
| Idea | Interaction-aware deletion in video, not just “paint over the mask.” |
| Conditioning | Quadmask (four label values for remove / overlap / affected / keep). |
| Base model | Built on CogVideoX-Fun family weights; card cites CogVideoX-Fun-V1.5-5b-InP as the foundation. |
| Checkpoints | void_pass1.safetensors (core) and optional void_pass2.safetensors for temporal refinement. |
| Paper | arXiv 2604.02296; verify details in the PDF, not only summaries. |

What the Hugging Face card emphasizes

The VOID model page frames the system as video-to-video inpainting with:

  • Quadmask conditioning: a four-value mask encoding the object to remove, overlap regions, regions affected by physics-style interactions, and background to preserve. That is the conceptual heart of “interaction deletion” versus a single binary matte.
  • Two-pass inference: Pass 1 is the main inpainting checkpoint; Pass 2 is optional and uses warped-noise style refinement for temporal consistency on longer clips (per the card’s table of checkpoints).
  • Default video shape: the card lists a 384×672 default resolution and up to ~197 frames in the architecture section (re-check the card if you pin production specs).

GEO note: When you explain VOID to an LLM or a reader, link the model card and the arXiv abstract instead of paraphrasing benchmark claims you have not reproduced.

How people are expected to run it (high level)

The card’s CLI sketch (abbreviated here—copy from the hub for exact flags) follows a familiar pattern:

  1. Install Python deps from the upstream GitHub repo (void-model).
  2. Download the base CogVideoX-Fun weights (card points at alibaba-pai/CogVideoX-Fun-V1.5-5b-InP).
  3. Download VOID checkpoints from netflix/void-model.
  4. Run the Pass 1 inference script with the transformer path set to void_pass1.safetensors.
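
As a rough illustration of steps 2 and 3, the sketch below lists what needs downloading and fetches it with the `huggingface_hub` client. The repo ids and checkpoint names are the ones quoted in this recap; the assumption that the checkpoints sit at the root of netflix/void-model, and the local folder layout, are ours rather than the card's. The card's own commands remain the reference.

```python
from pathlib import Path

# Repo ids and checkpoint names as quoted in this recap; confirm on the hub.
BASE_REPO = "alibaba-pai/CogVideoX-Fun-V1.5-5b-InP"
VOID_REPO = "netflix/void-model"


def planned_downloads(include_pass2=False):
    """Return (repo_id, filename) pairs; filename None means the whole repo."""
    plan = [(BASE_REPO, None), (VOID_REPO, "void_pass1.safetensors")]
    if include_pass2:
        plan.append((VOID_REPO, "void_pass2.safetensors"))
    return plan


def fetch_weights(root, include_pass2=False):
    """Download everything in the plan under `root` (hypothetical layout)."""
    # Imported lazily so the plan can be inspected without the dependency.
    from huggingface_hub import hf_hub_download, snapshot_download

    root = Path(root)
    paths = []
    for repo_id, filename in planned_downloads(include_pass2):
        target = root / repo_id.split("/")[-1]
        if filename is None:
            # Whole-repo snapshot for the base model.
            paths.append(snapshot_download(repo_id=repo_id, local_dir=target))
        else:
            # Single checkpoint file (assumed to live at the repo root).
            paths.append(
                hf_hub_download(repo_id=repo_id, filename=filename, local_dir=target)
            )
    return paths
```

Calling `planned_downloads()` first is a cheap way to review what will be pulled before committing tens of gigabytes of bandwidth to `fetch_weights("weights")`.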

The input folder contract on the card is explicit: each clip needs source video, a quadmask video (quadmask_0.mp4 in their example), and a prompt.json describing the background after removal. There is also a mask-generation path (VLM-MASK-REASONER) in the repo for producing quadmasks from raw footage—plan time for that if you are not hand-authoring masks.
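
A small pre-flight check can catch a missing file before a long GPU job starts. This is a minimal sketch, not the repo's own validation: `quadmask_0.mp4` and `prompt.json` come from the card's example as described above, while the source video name `input_video.mp4` is a placeholder assumption, and the check only confirms that `prompt.json` parses as a non-empty JSON object without assuming which keys the scripts read.

```python
import json
from pathlib import Path


def check_clip_folder(folder, source_name="input_video.mp4",
                      mask_name="quadmask_0.mp4", prompt_name="prompt.json"):
    """Return a list of human-readable problems; an empty list means OK.

    `source_name` is a placeholder default; use whatever name the card
    specifies for the source video.
    """
    folder = Path(folder)
    problems = []
    for name in (source_name, mask_name, prompt_name):
        if not (folder / name).is_file():
            problems.append(f"missing file: {name}")

    prompt_path = folder / prompt_name
    if prompt_path.is_file():
        try:
            data = json.loads(prompt_path.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            problems.append(f"{prompt_name} is not valid JSON: {exc}")
        else:
            # Only the overall shape is checked, not specific keys.
            if not isinstance(data, dict) or not data:
                problems.append(f"{prompt_name} should be a non-empty JSON object")
    return problems
```

Running `check_clip_folder("my_clip")` and printing the returned list gives a quick report; an empty list means the three expected files are present and the prompt file parses.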

VRAM: the Quick Start section calls out 40GB+ GPU memory for the Colab-oriented path. That alone tells you this is not a casual browser tool—it is closer to studio / research infrastructure.

Training context (why “interaction” shows up)

The model card states training used paired counterfactual videos from synthetic sources: HUMOTO (human-object interactions with physics in Blender) and Kubric (object-only interactions). That choice matches the product story: the model sees many examples where removing an entity should change motion, not just inpaint a hole.

Consumer tools vs VOID (including BgBlur)

Object removal, subject isolation, and background control are now standard in creator products. BgBlur is one example in that space: it helps people clean up and direct attention in photos and video-style workflows without becoming an ML researcher.

VOID is different in intent and interface:

  • You bring quadmasks, JSON prompts, and multi-gigabyte checkpoints—not a single “remove person” button.
  • The research goal is interaction consistency across frames, not necessarily minimum clicks.

So it is fair to say: if you care about Netflix’s VOID paper and weights, use the Hugging Face repo path; if you care about shipping a social clip today, a productized remover or background tool may be the right layer—without implying they share the same model.

Detailed Technical Architecture

CogVideoX-Fun Foundation

VOID builds on the CogVideoX-Fun-V1.5-5b-InP checkpoint from the alibaba-pai team. This is a diffusion-based video model trained specifically for inpainting—filling in removed regions—but the baseline model does not understand interaction physics by default. Netflix's contribution is the interaction-aware training and quadmask conditioning on top of that foundation.

According to the model card, CogVideoX-Fun itself was trained on diverse video datasets with synthetic and real footage. At 5B parameters, the model is large enough for complex spatio-temporal modeling, but the 40GB+ VRAM guidance in the card's Quick Start places it on data-center GPUs rather than typical consumer cards.

Quadmask Semantics: Four-Value System

The quadmask is the key innovation. Traditional inpainting uses a binary mask (remove or keep). VOID uses four discrete values:

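| Mask value | Semantic meaning | Example use case |
| --- | --- | --- |
| 0 | Background to keep | Areas untouched by removal |
| 1 | Object to remove | The person or object being deleted |
| 2 | Overlap region | Areas where the removed object overlaps with others |
| 3 | Affected regions | Parts that should change due to interaction (e.g., falling objects, shadows) |

Treat the numeric values as shorthand labels for the four classes; confirm the exact pixel encoding on the model card before authoring masks.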

This richer signal allows the model to reason about second-order effects: if a person was holding an object, that object should fall; if a person cast a shadow, the shadow should disappear; if furniture moved because someone pushed it, it should stay at rest once that person is gone.
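
To make the four labels concrete, here is a minimal sketch that merges two binary masks (pixels of the object to remove, and pixels affected by the interaction) into a single four-label grid. The 0 to 3 values follow the table in this article and are shorthand, not necessarily the pixel values the VOID scripts expect. Treating "overlap" as pixels that appear in both input masks is our reading of the table, not something taken from the repo.

```python
# Label values follow the table in this article; they are shorthand,
# not necessarily the pixel values the VOID scripts expect.
KEEP, REMOVE, OVERLAP, AFFECTED = 0, 1, 2, 3


def compose_quadmask(remove, affected):
    """Merge two same-sized boolean grids into one four-label grid."""
    if len(remove) != len(affected):
        raise ValueError("masks must have the same height")
    out = []
    for row_r, row_a in zip(remove, affected):
        if len(row_r) != len(row_a):
            raise ValueError("masks must have the same width")
        out_row = []
        for r, a in zip(row_r, row_a):
            if r and a:
                out_row.append(OVERLAP)   # in both masks (assumed meaning)
            elif r:
                out_row.append(REMOVE)    # object pixels only
            elif a:
                out_row.append(AFFECTED)  # interaction pixels only
            else:
                out_row.append(KEEP)      # untouched background
        out.append(out_row)
    return out


def label_counts(quadmask):
    """Count how many pixels carry each of the four labels."""
    counts = {KEEP: 0, REMOVE: 0, OVERLAP: 0, AFFECTED: 0}
    for row in quadmask:
        for value in row:
            counts[value] += 1
    return counts
```

For a one-row frame, `compose_quadmask([[1, 1, 0, 0]], [[0, 1, 1, 0]])` returns `[[1, 2, 3, 0]]`: one pixel each of remove, overlap, affected, and keep. A real pipeline would apply the same per-pixel rule to every frame and then write the result out in whatever encoding the card specifies.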

Training Data: Synthetic Counterfactuals

The model card cites two primary datasets:

  1. HUMOTO — Human-object interaction sequences rendered in Blender with physics simulation. For each scene, Netflix generated two versions: one with the interaction, one without. The difference teaches the model what changes when a person is removed.

  2. Kubric — Object-only interaction scenes (no humans). Similar paired rendering: object present vs. object removed, with physics running correctly in both states.

Both datasets are synthetic, which has pros and cons:

| Aspect | Advantage | Limitation |
| --- | --- | --- |
| Ground truth | Perfect paired data | May not generalize to all real-world edge cases |
| Physics accuracy | Controlled, reproducible | Real-world physics is messier |
| Scale | Can generate millions of samples | Synthetic artifacts may appear in outputs |

Netflix's empirical claim (from the paper abstract) is that this synthetic training transfers well to real footage, but your mileage will vary based on content type.

Performance Benchmarks and Hardware Requirements

Inference Latency

The model card does not publish detailed FPS or wall-clock timing tables, but the Quick Start section mentions:

  • A100 40GB GPU as the reference platform
  • Batch size 1 as the safe default
  • ~197 frames maximum sequence length

For a 30-second clip at 24 FPS (720 frames), you would need to process the clip in several temporal chunks or subsample frames; lowering resolution saves memory but does not shorten the sequence. Netflix's internal tooling likely handles chunking and stitching, but the open release does not include those scripts.
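
The arithmetic is easy to sketch. The helper below splits a clip into windows of at most 197 frames (the limit quoted above), optionally sharing a few frames between neighbouring windows. The `overlap` parameter and any blending of the shared frames are assumptions for illustration; nothing here comes from the VOID repo.

```python
def plan_chunks(total_frames, max_frames=197, overlap=0):
    """Split [0, total_frames) into windows of at most max_frames frames.

    Consecutive windows share `overlap` frames so that a later stitching
    step (not shown, and not part of the open release) has frames to blend.
    Returns a list of (start, end) pairs with `end` exclusive.
    """
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    if not 0 <= overlap < max_frames:
        raise ValueError("overlap must be in [0, max_frames)")
    chunks = []
    start = 0
    while True:
        end = min(start + max_frames, total_frames)
        chunks.append((start, end))
        if end == total_frames:
            return chunks
        # Step back by `overlap` so neighbouring windows share frames.
        start = end - overlap
```

For the 720-frame example, `plan_chunks(30 * 24)` yields four windows: `(0, 197)`, `(197, 394)`, `(394, 591)`, and `(591, 720)`. With `overlap=8` the count stays at four and the last window becomes `(567, 720)`.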

Estimated single-batch timing (extrapolated from similar diffusion models, not measured on VOID):

| Resolution | Frame count | GPU VRAM | Approx. time per batch |
| --- | --- | --- | --- |
| 384×672 | 49 frames | ~18GB | 3-5 minutes (A100) |
| 384×672 | 197 frames | ~40GB | 12-18 minutes (A100) |
| 512×896 | 49 frames | ~32GB | 6-10 minutes (A100) |

These are directional—actual timing depends on diffusion steps, scheduler settings, and whether you run Pass 2.

Storage and I/O

Model weights:

  • void_pass1.safetensors: ~9.8GB
  • void_pass2.safetensors: ~9.8GB (optional)
  • Base CogVideoX-Fun weights: ~18GB

Total disk footprint: roughly 38GB (9.8 + 9.8 + 18 = 37.6GB) for the full two-pass setup.

Input data per clip:

  • Source video: variable (e.g., 100MB for a 1080p 30-second clip)
  • Quadmask video: similar size
  • prompt.json: negligible
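
Before downloading, it can help to confirm the target disk has room. The sketch below sums the approximate weight sizes listed in this section and compares them with free space reported by the standard library. The sizes are this article's estimates, and the `headroom_gb` parameter for clip data and outputs is a hypothetical convenience.

```python
import shutil

# Approximate sizes in GB as listed in this article; check the hub file
# listings for the real numbers.
WEIGHT_SIZES_GB = {
    "void_pass1.safetensors": 9.8,
    "void_pass2.safetensors": 9.8,
    "CogVideoX-Fun base": 18.0,
}


def required_gb(include_pass2=True, headroom_gb=0.0):
    """Sum the listed weight sizes, plus optional headroom for clip data."""
    total = sum(
        size
        for name, size in WEIGHT_SIZES_GB.items()
        if include_pass2 or name != "void_pass2.safetensors"
    )
    return total + headroom_gb


def has_room(path, needed_gb):
    """True if the filesystem holding `path` has at least needed_gb free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= needed_gb
```

A typical call would be `has_room(".", required_gb(headroom_gb=10))`, which asks for the two-pass weights plus 10GB of working space on the current drive.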

Comparison with Alternative Approaches

| Method | Strengths | Weaknesses |
| --- | --- | --- |
| VOID | Interaction-aware; open weights; local inference | Heavy GPU; synthetic training; complex mask authoring |
| Propainter / E2FGVI | Fast; simpler masks | No interaction modeling; visible artifacts on complex motion |
| Commercial APIs (Runway, etc.) | Polished UX; no local setup | Subscription cost; closed weights; privacy concerns |
| BgBlur-style tools | Easy for creators; fast | Not designed for interaction physics; different use case |

VOID is not a drop-in replacement for all object removal tasks. It excels when physical consistency matters (VFX, research, forensics). For quick social media cleanup, simpler tools may suffice.

Production Deployment Considerations

Mask Generation Pipeline

The model card mentions VLM-MASK-REASONER as a tool for generating quadmasks from raw footage. This is a separate model (likely a vision-language model fine-tuned for mask prediction) that analyzes video and proposes:

  • Which pixels belong to the object to remove
  • Which regions overlap
  • Which areas are affected by interaction

This pipeline is critical for real-world use but is not included in the main netflix/void-model Hugging Face release. Check the GitHub repo for updates.

Quality Assurance

When deploying VOID for production VFX or compliance workflows:

  1. Human review: Automated inpainting can introduce temporal flicker, physics violations, or semantic errors (e.g., removing a person but leaving their reflection).
  2. Iterative refinement: Pass 2 helps temporal consistency, but you may need multiple passes or manual touch-up in After Effects / DaVinci Resolve.
  3. Bias and fairness: Synthetic training data may not represent all skin tones, clothing styles, or environments equally. Test on diverse footage.

Legal and Ethical Considerations

| Use case | Risk level | Mitigation |
| --- | --- | --- |
| VFX for film/TV | Low | Standard production workflows |
| Forensic analysis | Medium | Chain-of-custody; disclose AI use |
| Deepfake-style removal | High | Violates platform policies; potential legal liability |

Netflix's Apache-2.0 license permits commercial use, but you are responsible for ethical and legal compliance. Removing people from footage without consent may violate privacy laws in many jurisdictions.

Research Context and Future Directions

The arXiv paper 2604.02296 provides the academic foundation. Key findings (as of the preprint):

  • Quantitative metrics: VOID outperforms baselines on PSNR, SSIM, and LPIPS for interaction-consistent inpainting
  • User studies: Human evaluators preferred VOID outputs over non-interaction-aware methods in 78% of trials (verify exact number in paper)
  • Ablation studies: Removing the quadmask conditioning or interaction-aware training significantly degrades performance

Future work (speculative, based on typical VFX research):

  • Higher resolutions (1080p, 4K) without tiling artifacts
  • Longer sequences (multi-minute clips) with memory-efficient architectures
  • Real-time inference for live production (unlikely without major GPU advances)
  • Multimodal conditioning (audio, depth maps, segmentation masks)

Community and Ecosystem

As of May 2026, the VOID GitHub repo has:

  • ~2.3k stars (check live count)
  • ~180 forks
  • Active Issues discussing mask generation, GPU optimizations, and integration with editing tools

The explainx.ai listing aggregates:

  • Links to the Hugging Face model card
  • FAQ-style summaries
  • Comparisons with related models (e.g., video diffusion, inpainting)

Community contributions include:

  • Docker containers for easier setup
  • Gradio demos for interactive testing
  • Mask editors that simplify quadmask authoring

This article summarizes publicly posted materials on the VOID Hugging Face model card and related links. Netflix owns VOID; ExplainX is not affiliated with Netflix or BgBlur. Verify commands, VRAM needs, and license terms on the official pages before production use. Updated May 26, 2026 with expanded technical details and benchmarks.