Yesterday, a joint research team from Stanford, Google, and Georgia Tech released a paper titled "Photorealistic Video Generation with Diffusion Models". The first author, Agrim Gupta, is a student of Fei-Fei Li.
Paper address: https://arxiv.org/abs/2312.06662. The paper introduces W.A.L.T, a Transformer-based diffusion model for photorealistic video generation, trained generatively on both images and videos in a shared latent space.
Currently, W.A.L.T is not publicly available, but you can watch some demonstration videos to get a sense of its results.
For more video effect demonstrations, visit: https://walt-video-diffusion.github.io/samples.html.
W.A.L.T's design has two key points. First, a causal encoder jointly compresses images and videos into a shared latent space. Second, to improve memory and training efficiency, a window-attention-based Transformer architecture performs joint spatial and temporal generative modeling in that latent space.
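To make the window-attention idea concrete, here is a minimal, stdlib-only toy sketch of self-attention restricted to non-overlapping windows. This is purely illustrative and not the paper's implementation: the single attention head, the non-overlapping window layout, and the toy token values are all simplifying assumptions, and the real model operates on learned latent tokens with separate query/key/value projections.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def window_attention(tokens, window):
    """Self-attention computed only within non-overlapping windows.

    Each token attends only to tokens in its own window, so the cost
    per token scales with the window size rather than the full
    sequence length -- the memory/efficiency motivation behind
    window-based attention. For simplicity, queries, keys, and values
    are the raw token vectors (no learned projections).
    """
    d = len(tokens[0])
    out = []
    for start in range(0, len(tokens), window):
        block = tokens[start:start + window]
        for q in block:
            # Scaled dot-product scores against keys in the same window.
            scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                      for k in block]
            weights = softmax(scores)
            # Output is the weighted average of the window's value vectors.
            out.append([sum(w * v[j] for w, v in zip(weights, block))
                        for j in range(d)])
    return out

# Six toy 2-dimensional tokens, attended in windows of size 3.
toks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0],
        [0.5, 0.5], [0.2, 0.8], [0.9, 0.1]]
result = window_attention(toks, 3)
```

Each output vector is a convex combination of the vectors in its own window only; tokens in different windows never interact, which is what caps the attention cost.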
The model currently supports the following three types of generation:
Text-to-video
Image-to-video
Generation of videos with consistent 3D camera motion