Google's Latest Video Generation Model LUMIERE

Part of my job, I feel, is to enthusiastically support my former employer by keeping up with Google's AI advances. I still have deep feelings for my old company and hope that Google's AI keeps getting more powerful.

This WeChat Official Account post covers Lumiere, which Google released yesterday.

See the results first.

Text-to-Video

Image-to-Video

Stylized Generation

Video Stylization

Cinemagraphs

Video Inpainting

Introduction

Lumiere is a text-to-video diffusion model designed to synthesize videos with realistic, diverse, and coherent motion, a key challenge in video synthesis. It introduces a spatiotemporal U-Net architecture that generates the entire temporal duration of a video in a single pass through the model. This contrasts with existing video models, which synthesize distant keyframes and then apply temporal super-resolution, an approach that inherently makes global temporal consistency difficult to achieve. By deploying both spatial and (crucially) temporal downsampling and upsampling, and by building on a pretrained text-to-image diffusion model, Lumiere learns to directly generate a full-frame-rate, low-resolution video by processing it across multiple spatiotemporal scales.

Paper address: https://arxiv.org/pdf/2401.12945.pdf
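The idea of handling the video at multiple spatiotemporal scales can be made concrete with a toy PyTorch sketch: resize a video tensor along the time axis as well as the spatial axes. The tensor layout and sizes here are assumptions for illustration, not Lumiere's actual settings.

```python
# Toy sketch only: what it means to downsample and upsample a video
# in both space and time. Layout assumed to be (B, C, T, H, W).
import torch
import torch.nn.functional as F

video = torch.randn(1, 3, 80, 128, 128)  # 80 frames of a 128x128 RGB clip

# Downsample by 2x along time as well as height and width.
coarse = F.interpolate(video, scale_factor=(0.5, 0.5, 0.5),
                       mode="trilinear", align_corners=False)
print(coarse.shape)    # torch.Size([1, 3, 40, 64, 64])

# Upsample back to the original spatiotemporal resolution.
restored = F.interpolate(coarse, size=video.shape[2:],
                         mode="trilinear", align_corners=False)
print(restored.shape)  # torch.Size([1, 3, 80, 128, 128])
```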

Maintaining temporal consistency in generated videos

Representative examples of periodic motion in videos generated by Lumiere and by ImagenVideo (Ho et al., 2022a). Lumiere's image-to-video generation is applied conditioned on the first frame of a video generated by ImagenVideo, and the corresponding X-T slices are visualized. Due to its cascaded design, ImagenVideo struggles to generate globally consistent repetitive motion, because its temporal super-resolution modules cannot consistently resolve aliasing ambiguities within the temporal window.
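For readers unfamiliar with X-T slices: fix one horizontal scanline and stack it across all frames, producing a (time, width) image in which temporally consistent motion appears as smooth streaks. A minimal NumPy sketch, with the array layout assumed for illustration:

```python
import numpy as np

# Assumed layout: (T, H, W, C) -- 80 frames of a 128x128 RGB video.
video = np.random.rand(80, 128, 128, 3)

y = 64                        # the horizontal scanline to track over time
xt_slice = video[:, y, :, :]  # shape (T, W, C): one row per frame
print(xt_slice.shape)         # (80, 128, 3)
```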

Lumiere process

How the Lumiere pipeline differs from the most common approach in prior work:

  1. Common approaches use a base model that generates distant keyframes, followed by a cascade of temporal super-resolution (TSR) models that fill in the intermediate frames. A spatial super-resolution (SSR) model is then applied on non-overlapping windows to obtain the high-resolution result.

  2. In contrast, the base model in the Lumiere framework processes all frames at once, eliminating the need for a cascade of TSR models and allowing Lumiere to learn globally consistent motion. To obtain a high-resolution video, Lumiere applies an SSR model on overlapping windows and combines the predictions with MultiDiffusion (Bar-Tal et al., 2023) into a coherent result (a simplified blending sketch follows this list).
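As a rough illustration of the overlapping-window idea, the sketch below averages per-window predictions wherever windows overlap. This is a simplified stand-in for MultiDiffusion, which performs the blending at every diffusion step; the window size, stride, and the `predict` callable are assumptions.

```python
import torch

def blend_overlapping_windows(predict, video, window=64, stride=48):
    """Run `predict` on overlapping spatial crops of `video` (B, C, T, H, W)
    and average the outputs where the crops overlap."""
    b, c, t, h, w = video.shape
    out = torch.zeros_like(video)
    weight = torch.zeros(1, 1, 1, h, w)
    for y in range(0, h - window + 1, stride):
        for x in range(0, w - window + 1, stride):
            crop = video[..., y:y + window, x:x + window]
            out[..., y:y + window, x:x + window] += predict(crop)
            weight[..., y:y + window, x:x + window] += 1.0
    return out / weight.clamp(min=1.0)

# Identity "model" stands in for the SSR network, just to show the shapes.
video = torch.randn(1, 3, 16, 112, 112)
blended = blend_overlapping_windows(lambda crop: crop, video)
print(blended.shape)  # torch.Size([1, 3, 16, 112, 112])
```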

STUNet architecture

"Inflating" the pretrained T2I U-Net architecture (Ho et al., 2022a) into a Spatiotemporal UNet (STUNet) that can downsample and upsample videos both spatially and temporally:

  1. Schematic diagram of STUNet activation maps; colors represent features produced by different temporal modules.

  2. Convolution-based modules, consisting of pretrained T2I layers followed by a factorized spatiotemporal convolution.

  3. Attention-based modules at the coarsest U-Net level, where pretrained T2I layers are followed by temporal attention. Because the video representation is compressed at the coarsest level, Lumiere can stack several temporal attention layers with limited computational overhead (a sketch of both block types follows this list).
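A minimal PyTorch sketch of my own reading of these two block types, not the released implementation: a factorized spatiotemporal convolution (a spatial-only conv followed by a temporal-only conv, both expressed as Conv3d) and a temporal attention layer that attends only along the time axis. Channel counts, head counts, and tensor shapes are assumptions.

```python
import torch
import torch.nn as nn

class FactorizedSpaceTimeConv(nn.Module):
    """Spatial-only 3x3 conv followed by a temporal-only 3-tap conv."""
    def __init__(self, channels):
        super().__init__()
        self.spatial = nn.Conv3d(channels, channels, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.temporal = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1), padding=(1, 0, 0))

    def forward(self, x):  # x: (B, C, T, H, W)
        return self.temporal(self.spatial(x))

class TemporalAttention(nn.Module):
    """Self-attention across the time axis only; spatial positions are folded into the batch."""
    def __init__(self, channels, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x):  # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        seq = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, t, c)
        out, _ = self.attn(seq, seq, seq)
        return out.reshape(b, h, w, t, c).permute(0, 4, 3, 1, 2)

# Coarsest-level feature map (sizes assumed for illustration).
x = torch.randn(2, 32, 16, 8, 8)
x = FactorizedSpaceTimeConv(32)(x)
x = TemporalAttention(32)(x)
print(x.shape)  # torch.Size([2, 32, 16, 8, 8])
```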

Comparison with other methods