Google VideoPoet: a zero-shot video generation LLM with impressive results

Today, Google officially released VideoPoet, an advanced zero-shot video generation large language model (LLM), and the results look impressive.

VideoPoet has the following seven video generation capabilities:

  1. Text-to-video: converts simple text descriptions into vivid video content.
  2. Image-to-video: creates dynamic videos from static images.
  3. Video stylization: applies different visual styles to videos.
  4. Video editing: performs advanced video editing and modifications.
  5. Video outpainting: adds content beyond the edges of a video.
  6. Video inpainting: fills in content within regions of a video.
  7. Video-to-audio: automatically composes appropriate music for videos.

VideoPoet detailed paper: https://storage.googleapis.com/videopoet/paper.pdf

VideoPoet's working principle is straightforward and effective. It uses the pre-trained MAGVIT-v2 video tokenizer and SoundStream audio tokenizer to convert images, videos, and audio clips into sequences of discrete codes that are compatible with text-based language models. An autoregressive language model then learns across the video, image, audio, and text modalities to predict the next video or audio token in the sequence.
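The tokenize-then-predict pipeline described above can be sketched roughly as follows. Note this is a toy stand-in, not VideoPoet's implementation: MAGVIT-v2 and SoundStream are learned neural tokenizers and the real model is a large transformer, whereas here a simple quantizer and a bigram counter play those roles, and all names and sizes are illustrative.

```python
import numpy as np

CODEBOOK_SIZE = 8  # toy codebook; real tokenizers use far larger vocabularies


def tokenize(signal, codebook_size=CODEBOOK_SIZE):
    """Stand-in for a learned tokenizer: quantize a signal into discrete codes."""
    lo, hi = signal.min(), signal.max()
    edges = np.linspace(lo, hi, codebook_size + 1)[1:-1]  # inner bin edges
    return np.digitize(signal, edges)  # ints in [0, codebook_size)


def predict_next_token(tokens):
    """Stand-in for the autoregressive LM: predict the next token via bigram counts."""
    counts = np.zeros((CODEBOOK_SIZE, CODEBOOK_SIZE))
    for a, b in zip(tokens[:-1], tokens[1:]):
        counts[a, b] += 1
    return int(np.argmax(counts[tokens[-1]]))  # most likely successor of last token


# A toy 1-D "video" signal standing in for pixel data.
video_signal = np.sin(np.linspace(0.0, 6.28, 64))
tokens = tokenize(video_signal)            # discrete codes the LM operates on
next_token = predict_next_token(tokens)    # autoregressive next-token prediction
```

The key idea this illustrates is that once every modality is mapped into one discrete vocabulary, a single next-token predictor can model video, audio, and text uniformly.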

Moreover, VideoPoet introduces multimodal generative learning objectives such as text-to-video, image-to-video, video frame continuation, video inpainting and outpainting, video stylization, and video-to-audio. All these tasks can be combined to achieve additional zero-shot capabilities.

VideoPoet's architecture supports high-resolution video generation through multi-axis attention and video modeling conditioned on low-resolution tokens and text embeddings. This simple approach shows that language models can synthesize and edit videos with high temporal consistency. VideoPoet demonstrates state-of-the-art performance in video generation, particularly excelling at producing large, interesting, and high-fidelity motions.
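The multi-axis attention mentioned above can be illustrated with a minimal sketch: rather than attending over all time-height-width positions of a token grid at once, attention is applied along one axis at a time, which keeps the cost manageable at higher resolutions. This is a conceptual illustration under assumed shapes, not the paper's actual layer; all dimensions are made up for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def attention_along_axis(x, axis):
    """Self-attention restricted to one axis of a (T, H, W, D) token grid."""
    x = np.moveaxis(x, axis, -2)                          # chosen axis next to D
    scores = x @ x.swapaxes(-1, -2) / np.sqrt(x.shape[-1])  # pairwise similarities
    out = softmax(scores) @ x                             # attention-weighted mix
    return np.moveaxis(out, -2, axis)                     # restore original layout


rng = np.random.default_rng(0)
tokens = rng.standard_normal((4, 8, 8, 16))  # toy (time, height, width, dim) grid
y = tokens
for ax in range(3):                          # attend along time, then H, then W
    y = attention_along_axis(y, ax)
```

Each axis-wise pass only compares tokens that share the other two coordinates, so the cost of one pass grows with the length of a single axis rather than the full grid size.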

There is currently no public way to try it, but the demo videos are impressive. We look forward to its release.