[R] VOID: Video Object and Interaction Deletion (physically-consistent video inpainting)
We present VOID, a model for video object removal that aims to handle *physical interactions*, not just appearance.
Most existing video inpainting / object removal methods can fill in the pixels behind an object and clean up its direct appearance effects (e.g., shadows or reflections), but they often fail when the removed object affects the dynamics of the scene.
For example:
- A domino chain is falling → removing the middle blocks should stop the chain
- Two cars are about to crash → removing one car should prevent the collision
Current models typically remove the object but leave its effects unchanged, resulting in physically implausible outputs.
VOID addresses this by modeling counterfactual scene evolution:
“What would the video look like if the object had never been there?”
Key ideas:
- Counterfactual training data: paired videos with and without objects (generated using Kubric and HUMOTO)
- VLM-guided masks: a vision-language model identifies which regions of the scene are affected by the removal
- Two-pass generation: first predict the new motion, then refine with flow-warped noise for temporal consistency
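To make the "flow-warped noise" idea concrete: the rough intuition is that the diffusion noise for each frame is carried along the scene's optical flow, so corresponding pixels across frames see correlated noise and the generation stays temporally coherent. The sketch below is our own minimal illustration of that intuition (nearest-neighbor warping in NumPy), not the paper's implementation; the function name and flow convention are assumptions.

```python
import numpy as np

def warp_noise(noise, flow):
    """Carry a per-frame noise field along optical flow (nearest-neighbor).

    noise: (H, W) Gaussian noise for the previous frame.
    flow:  (H, W, 2) forward flow (dx, dy) in pixels from previous to next frame.
    Returns a noise field for the next frame that follows scene motion.
    """
    H, W = noise.shape
    ys, xs = np.mgrid[0:H, 0:W]
    # For each target pixel, look up where it came from in the previous
    # frame (inverse warp), clamping lookups to the frame border.
    src_x = np.clip(np.round(xs - flow[..., 0]).astype(int), 0, W - 1)
    src_y = np.clip(np.round(ys - flow[..., 1]).astype(int), 0, H - 1)
    return noise[src_y, src_x]

rng = np.random.default_rng(0)
noise = rng.standard_normal((64, 64))
flow = np.zeros((64, 64, 2))
flow[..., 0] = 3.0  # uniform 3-px rightward motion
warped = warp_noise(noise, flow)
# Interior pixels are shifted by 3 columns: warped[y, x] == noise[y, x - 3]
```

In a real sampler this warped field would seed the refinement pass, so regions that move together are denoised from correlated noise instead of independent draws.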
In a human preference study on real-world videos, VOID was selected 64.8% of the time over baselines such as Runway (Aleph), Generative Omnimatte, and ProPainter.
Project page: https://void-model.github.io/
Code: https://github.com/Netflix/void-model
Demo: https://huggingface.co/spaces/sam-motamed/VOID
Paper: https://arxiv.org/abs/2604.02296
Happy to answer questions!