The GO (Generative Omnimatte) algorithm from Google DeepMind has made breakthrough progress in decomposing videos into multiple layers. The method aims to break a video down into semantically meaningful layers, each containing an individual object together with its associated effects, such as shadows and reflections.
Specifically, omnimatte methods perform this decomposition given an input video and a set of masks for the target objects. However, existing omnimatte methods often assume a static background or rely on accurate pose and depth estimation, and they produce poor decompositions when those assumptions do not hold. Moreover, because they lack generative priors for natural video, existing methods cannot plausibly complete dynamically occluded regions.
To address these issues, the authors propose a new generative layered video decomposition framework focused on the omnimatte problem. The method requires neither a static-scene assumption nor camera pose and depth information, and it generates clean, complete layered videos, including plausible completion of dynamically occluded regions. The core idea is to train a video diffusion model that identifies and removes the scene effects caused by a specific object. The authors show that such a model can be fine-tuned from an existing video inpainting model on a small but carefully curated dataset, achieving high-quality decomposition and editing results.
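No code has been released, but the core training idea (fine-tune a video diffusion model to predict the video with a given object and its effects removed, conditioned on the original video and a trimask) can be sketched as a standard conditional denoising objective. Everything below is illustrative: the stub network, the noise schedule, and all names are our assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDenoiser(nn.Module):
    """Stand-in for the fine-tuned video-inpainting diffusion backbone.
    The real model is a large video diffusion model; this stub only
    shows the conditioning interface (timestep embedding omitted)."""
    def __init__(self):
        super().__init__()
        # channels: noisy video (3) + conditioning video (3) + trimask (1)
        self.net = nn.Conv3d(7, 3, kernel_size=3, padding=1)

    def forward(self, noisy, t, cond_video, trimask):
        x = torch.cat([noisy, cond_video, trimask], dim=1)  # (B, 7, T, H, W)
        return self.net(x)  # predicted noise, (B, 3, T, H, W)

def denoising_loss(model, target, cond_video, trimask, T=1000):
    """Epsilon-prediction loss. `target` is the supervision video with the
    object AND its effects (shadows, reflections) removed, taken from the
    small curated training set described in the paper."""
    b = target.shape[0]
    t = torch.randint(0, T, (b,), device=target.device)
    noise = torch.randn_like(target)
    # Illustrative cosine schedule; the paper's schedule may differ.
    a_bar = torch.cos(t.float() / T * torch.pi / 2).pow(2).view(b, 1, 1, 1, 1)
    noisy = a_bar.sqrt() * target + (1 - a_bar).sqrt() * noise
    return F.mse_loss(model(noisy, t, cond_video, trimask), noise)

# Example shapes: 1 clip of 8 frames at 64x64.
v = torch.randn(1, 3, 8, 64, 64)
tm = torch.full((1, 1, 8, 64, 64), 0.5)
loss = denoising_loss(TinyDenoiser(), target=v, cond_video=v, trimask=tm)
```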
Experimental results indicate that the method applies to a wide range of everyday captured videos featuring soft shadows, smooth reflections, splashing water, and more, demonstrating strong decomposition and editing capabilities.
Method
Given an input video and its corresponding binary object masks, the method consists of two stages.
In the first stage, an object-and-effect-removal model is used to generate a clean-plate background and a set of single-object (solo) videos. These solo videos are generated under different trimask conditions, where a trimask defines three regions:
- Preserve: video content that must be kept unchanged.
- Remove: the object, together with its effects, that must be removed.
- Uncertain: regions left for the model to resolve, where the object's effects may be present.
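To make the trimask concrete, here is a minimal sketch of how one could assemble it from per-object binary masks. The numeric encoding (1 = preserve, 0 = remove, 0.5 = uncertain) is our labeling choice for illustration, not necessarily the paper's exact values.

```python
import numpy as np

def build_trimask(object_masks, target_idx):
    """Trimask for the solo video of object `target_idx`, given a list of
    binary (H, W) masks, one per object.
    Encoding (an assumption of this sketch):
      1.0 = preserve (the target object itself)
      0.0 = remove   (all other objects)
      0.5 = uncertain (everywhere else, where effects may live)"""
    h, w = object_masks[0].shape
    tri = np.full((h, w), 0.5, dtype=np.float32)       # default: uncertain
    for i, m in enumerate(object_masks):
        if i != target_idx:
            tri[m.astype(bool)] = 0.0                  # remove other objects
    tri[object_masks[target_idx].astype(bool)] = 1.0   # preserve the target
    return tri

def build_background_trimask(object_masks):
    """Trimask for the clean-plate pass: every object is removed, and the
    rest stays uncertain so the model can also erase their effects."""
    h, w = object_masks[0].shape
    tri = np.full((h, w), 0.5, dtype=np.float32)
    for m in object_masks:
        tri[m.astype(bool)] = 0.0
    return tri
```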
In the second stage, Google combines the single-object videos and the background video through test-time optimization to reconstruct the final omnimatte layers.
This two-stage design effectively separates objects and their related effects, producing semantically clean layered videos.
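As a rough illustration of the second stage, the sketch below recovers a single RGBA layer by direct optimization, assuming the standard alpha-over compositing model solo ≈ α·F + (1−α)·B. The loss terms and weights are illustrative; the paper's actual optimization objective may differ.

```python
import torch

def recover_layer(solo_video, background, steps=500, lr=1e-2):
    """Recover one RGBA omnimatte layer (colors F, matte alpha) such that
    compositing it over the clean background reproduces the solo video:
        solo ≈ alpha * F + (1 - alpha) * background
    Videos are (T, 3, H, W) tensors; alpha is (T, 1, H, W)."""
    n, _, h, w = solo_video.shape
    fg = solo_video.clone().requires_grad_(True)          # foreground colors
    alpha_logit = torch.zeros(n, 1, h, w, requires_grad=True)
    opt = torch.optim.Adam([fg, alpha_logit], lr=lr)
    for _ in range(steps):
        alpha = torch.sigmoid(alpha_logit)
        comp = alpha * fg + (1 - alpha) * background
        loss = ((comp - solo_video) ** 2).mean()
        loss = loss + 1e-3 * alpha.abs().mean()           # keep the matte sparse
        opt.zero_grad()
        loss.backward()
        opt.step()
    return fg.detach(), torch.sigmoid(alpha_logit).detach()
```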
Object and Effect Removal Based on Trimask Conditions
To separate objects and their effects from the input video, Google generates a set of single-object (solo) videos and a clean background video under different trimask conditions. Specifically, the trimask defines the following regions:
- Preserve: video content that is kept fully intact.
- Remove: the objects and their effects to be erased.
- Uncertain: regions requiring further processing by the model.
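Putting the stage together, the generation pass might look like the loop below. `casper_remove` is a hypothetical wrapper around sampling from the fine-tuned diffusion model, and the trimask helpers are the ones sketched earlier; note the fixed random seed, discussed next.

```python
import torch

def stage_one(video, object_masks, casper_remove):
    """Generate all solo videos plus the clean-plate background by calling
    the removal model once per trimask. `casper_remove(video, trimask,
    generator)` is a hypothetical sampling wrapper."""
    def gen():
        # Fixed seed (0) for every sample, mirroring the paper's evaluation.
        return torch.Generator().manual_seed(0)
    solos = [casper_remove(video, build_trimask(object_masks, i), gen())
             for i in range(len(object_masks))]
    background = casper_remove(video, build_background_trimask(object_masks), gen())
    return solos, background
```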
When sampling from the Casper model, the authors did not cherry-pick results by searching over random seeds: the same seed (0) was used for all input videos, demonstrating the generality and stability of the method.
Comparative Analysis: Object and Effect Removal
The authors compare the Casper model with existing object-removal methods. The results show:
Existing video removal baselines cannot effectively remove soft shadows and reflections that fall outside the input mask. Image-based baselines process the video frame by frame, so they cannot exploit global context and lack temporal consistency. For fairness, all methods were compared using the same mask dilation ratio.
Comparative Analysis: Omnimattes
The authors also compare omnimatte generation against existing methods. These methods have the following problems:
- They rely on strict motion assumptions (such as a static background), causing dynamic background elements to become entangled with foreground object layers.
- Methods built on a 3D-aware background representation are sensitive to the quality of camera-pose estimation and can produce blurry background layers (e.g., in the horse scene).
- They lack generative and semantic priors for completing occluded pixels, making it difficult to correctly associate effects with their corresponding objects.
Across both object-and-effect removal and omnimatte generation, the method proposed by Google significantly outperforms existing approaches.
Trial
There is no open-source code available yet, but you can check out the paper first: https://arxiv.org/pdf/2411.16683