Google's Latest Video Generation Model LUMIERE

Part of my job, I feel, is to enthusiastically support my former employer by keeping up with Google's AI advances. I still have deep feelings for my old company and hope that Google's AI keeps getting more powerful.

This WeChat Official Account post covers Lumiere, which Google released yesterday.

See the results first.

Text-to-Video

Image-to-Video

Stylized Generation

Video Stylization

Cinemagraphs

Video Inpainting

Introduction

Lumiere is a text-to-video diffusion model designed to synthesize videos with realistic, diverse, and coherent motion, a key challenge in video synthesis. It introduces a spatiotemporal U-Net architecture that generates the entire temporal duration of a video in a single pass through the model. This contrasts with existing video models, which synthesize distant keyframes and then apply temporal super-resolution, an approach that inherently makes global temporal consistency difficult to achieve. By deploying both spatial and (crucially) temporal downsampling and upsampling, and by building on a pretrained text-to-image diffusion model, Lumiere learns to directly generate a full-frame-rate, low-resolution video by processing it across multiple spatiotemporal scales.

Paper address: https://arxiv.org/pdf/2401.12945.pdf
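The idea of handling the video at multiple spatiotemporal scales can be made concrete with a toy PyTorch sketch: resize a video tensor along the time axis as well as the spatial axes. The tensor layout and sizes here are assumptions for illustration, not Lumiere's actual settings.

```python
# Toy sketch only: what it means to downsample and upsample a video
# in both space and time. Layout assumed to be (B, C, T, H, W).
import torch
import torch.nn.functional as F

video = torch.randn(1, 3, 80, 128, 128)  # 80 frames of a 128x128 RGB clip

# Downsample by 2x along time as well as height and width.
coarse = F.interpolate(video, scale_factor=(0.5, 0.5, 0.5),
                       mode="trilinear", align_corners=False)
print(coarse.shape)    # torch.Size([1, 3, 40, 64, 64])

# Upsample back to the original spatiotemporal resolution.
restored = F.interpolate(coarse, size=video.shape[2:],
                         mode="trilinear", align_corners=False)
print(restored.shape)  # torch.Size([1, 3, 80, 128, 128])
```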

Maintaining temporal consistency in generated videos

Representative examples of periodic motion in videos generated by Lumiere and by ImagenVideo (Ho et al., 2022a). Lumiere's image-to-video generation is applied conditioned on the first frame of a video generated by ImagenVideo, and the corresponding X-T slices are visualized. Due to its cascaded design, ImagenVideo struggles to generate globally consistent repetitive motion, because its temporal super-resolution modules cannot consistently resolve aliasing ambiguities within the temporal window.
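For readers unfamiliar with X-T slices: fix one horizontal scanline and stack it across all frames, producing a (time, width) image in which temporally consistent motion appears as smooth streaks. A minimal NumPy sketch, with the array layout assumed for illustration:

```python
import numpy as np

# Assumed layout: (T, H, W, C) -- 80 frames of a 128x128 RGB video.
video = np.random.rand(80, 128, 128, 3)

y = 64                        # the horizontal scanline to track over time
xt_slice = video[:, y, :, :]  # shape (T, W, C): one row per frame
print(xt_slice.shape)         # (80, 128, 3)
```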

Lumiere process

How the Lumiere pipeline differs from the most common approach in prior work:

  1. Common approaches use a base model that generates distant keyframes, followed by a cascade of temporal super-resolution (TSR) models that fill in the intermediate frames. A spatial super-resolution (SSR) model is then applied on non-overlapping windows to obtain the high-resolution result.

  2. In contrast, the base model in the Lumiere framework processes all frames at once, eliminating the need for a cascade of TSR models and allowing Lumiere to learn globally consistent motion. To obtain a high-resolution video, Lumiere applies an SSR model on overlapping windows and combines the predictions with MultiDiffusion (Bar-Tal et al., 2023) into a coherent result (a simplified blending sketch follows this list).
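As a rough illustration of the overlapping-window idea, the sketch below averages per-window predictions wherever windows overlap. This is a simplified stand-in for MultiDiffusion, which performs the blending at every diffusion step; the window size, stride, and the `predict` callable are assumptions.

```python
import torch

def blend_overlapping_windows(predict, video, window=64, stride=48):
    """Run `predict` on overlapping spatial crops of `video` (B, C, T, H, W)
    and average the outputs where the crops overlap."""
    b, c, t, h, w = video.shape
    out = torch.zeros_like(video)
    weight = torch.zeros(1, 1, 1, h, w)
    for y in range(0, h - window + 1, stride):
        for x in range(0, w - window + 1, stride):
            crop = video[..., y:y + window, x:x + window]
            out[..., y:y + window, x:x + window] += predict(crop)
            weight[..., y:y + window, x:x + window] += 1.0
    return out / weight.clamp(min=1.0)

# Identity "model" stands in for the SSR network, just to show the shapes.
video = torch.randn(1, 3, 16, 112, 112)
blended = blend_overlapping_windows(lambda crop: crop, video)
print(blended.shape)  # torch.Size([1, 3, 16, 112, 112])
```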

STUNet architecture

"Inflating" the pretrained T2I U-Net architecture (Ho et al., 2022a) into a Spatiotemporal UNet (STUNet) that can downsample and upsample videos both spatially and temporally:

  1. Schematic diagram of STUNet activation maps; colors represent features produced by different temporal modules.

  2. Convolution-based modules, consisting of pretrained T2I layers followed by a factorized spatiotemporal convolution.

  3. Attention-based modules at the coarsest U-Net level, where pretrained T2I layers are followed by temporal attention. Because the video representation is compressed at the coarsest level, Lumiere can stack several temporal attention layers with limited computational overhead (a sketch of both block types follows this list).
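A minimal PyTorch sketch of my own reading of these two block types, not the released implementation: a factorized spatiotemporal convolution (a spatial-only conv followed by a temporal-only conv, both expressed as Conv3d) and a temporal attention layer that attends only along the time axis. Channel counts, head counts, and tensor shapes are assumptions.

```python
import torch
import torch.nn as nn

class FactorizedSpaceTimeConv(nn.Module):
    """Spatial-only 3x3 conv followed by a temporal-only 3-tap conv."""
    def __init__(self, channels):
        super().__init__()
        self.spatial = nn.Conv3d(channels, channels, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.temporal = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1), padding=(1, 0, 0))

    def forward(self, x):  # x: (B, C, T, H, W)
        return self.temporal(self.spatial(x))

class TemporalAttention(nn.Module):
    """Self-attention across the time axis only; spatial positions are folded into the batch."""
    def __init__(self, channels, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x):  # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        seq = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, t, c)
        out, _ = self.attn(seq, seq, seq)
        return out.reshape(b, h, w, t, c).permute(0, 4, 3, 1, 2)

# Coarsest-level feature map (sizes assumed for illustration).
x = torch.randn(2, 32, 16, 8, 8)
x = FactorizedSpaceTimeConv(32)(x)
x = TemporalAttention(32)(x)
print(x.shape)  # torch.Size([2, 32, 16, 8, 8])
```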

Comparison with other methods