
W.A.L.T - AI Video Generation

Yesterday, a joint research team from Stanford, Google, and Georgia Tech released a paper titled "Photorealistic Video Generation with Diffusion Models". Its first author, Agrim Gupta, is a PhD student of Fei-Fei Li.

Paper address: https://arxiv.org/abs/2312.06662. The paper introduces W.A.L.T, a Transformer-based diffusion model for photorealistic video generation that is trained generatively on images and videos in a shared latent space.

W.A.L.T is not publicly available yet, but you can get a sense of its results from the demonstration videos.

For more video effect demonstrations, visit: https://walt-video-diffusion.github.io/samples.html.

W.A.L.T's design has two key points:

  • A causal encoder jointly compresses images and videos into a shared latent space (see the first sketch after this list).
  • To improve memory and training efficiency, a window-attention-based Transformer architecture performs joint spatial and temporal generative modeling in that latent space (see the second sketch after this list).
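To make the first point concrete, here is a minimal sketch (assuming PyTorch, with toy layer sizes; this is not the actual W.A.L.T architecture) of a causal 3D convolution. Because temporal padding is applied only to past frames, the same encoder can process a single image as a one-frame video, which is what allows images and videos to share one latent space:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    """Toy causal 3D convolution: pads only past frames in time."""
    def __init__(self, in_ch, out_ch, kernel_t=3, kernel_s=3):
        super().__init__()
        self.pad_t = kernel_t - 1  # temporal padding goes entirely to the "past" side
        self.conv = nn.Conv3d(
            in_ch, out_ch,
            kernel_size=(kernel_t, kernel_s, kernel_s),
            padding=(0, kernel_s // 2, kernel_s // 2),  # spatial padding only
        )

    def forward(self, x):  # x: (B, C, T, H, W)
        # F.pad order for 5D input: (W_left, W_right, H_left, H_right, T_past, T_future)
        x = F.pad(x, (0, 0, 0, 0, self.pad_t, 0))
        return self.conv(x)

encoder = CausalConv3d(3, 8)
video = torch.randn(1, 3, 17, 64, 64)  # a 17-frame clip
image = torch.randn(1, 3, 1, 64, 64)   # a single image treated as a 1-frame video
print(encoder(video).shape)            # (1, 8, 17, 64, 64)
print(encoder(image).shape)            # (1, 8, 1, 64, 64) -- same encoder, no future frames needed
```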
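The second point can be illustrated with a rough sketch of window-restricted self-attention over latent video tokens (again assuming PyTorch; the window shape and dimensions here are illustrative, not the paper's exact block design). Restricting attention to non-overlapping windows means each attention call scales with the window size rather than with the full time x height x width token count:

```python
import torch
import torch.nn as nn

def window_attention(latents, attn, window=4):
    """Apply self-attention independently within non-overlapping spatial windows."""
    B, T, H, W, C = latents.shape
    x = latents.reshape(B, T, H // window, window, W // window, window, C)
    x = x.permute(0, 2, 4, 1, 3, 5, 6)            # group tokens by window
    x = x.reshape(-1, T * window * window, C)     # one token sequence per window
    x, _ = attn(x, x, x)                          # attention only within each window
    x = x.reshape(B, H // window, W // window, T, window, window, C)
    x = x.permute(0, 3, 1, 4, 2, 5, 6).reshape(B, T, H, W, C)
    return x

attn = nn.MultiheadAttention(embed_dim=32, num_heads=4, batch_first=True)
latents = torch.randn(2, 5, 8, 8, 32)  # (B, T, H, W, C) latent video tokens
out = window_attention(latents, attn)
print(out.shape)                        # (2, 5, 8, 8, 32)
```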

The model currently supports three types of generation:

  • Text-to-video
  • Image-to-video
  • Videos with 3D camera motion