
Meta's latest video generation model: VideoJAM

Previously, attempts to generate gymnastics videos with Pika and Sora left each gymnast with a bizarre "three heads and six arms" distortion. This reflects a common issue in current video generation models: although the visual results are impressive, motion coherence and realism still fall short.

Recently, Meta launched VideoJAM, a model aimed specifically at this problem. According to Meta's published paper and demonstrations (the model is not yet available for public use), VideoJAM no longer focuses solely on pixel-level quality; instead, it introduces a "joint appearance-motion representation" to ensure that the actions in generated videos are natural and coherent.

Notably, VideoJAM can be applied to any existing video generation model without requiring additional modifications to training data or increasing the model size, demonstrating strong versatility.

Although Meta has so far released only the paper and demonstration videos, without making the model itself available for use, the existing demo results suggest that VideoJAM already surpasses other current models by a clear margin, excelling particularly in motion coherence while also improving overall visual quality.

How the VideoJAM model works

The core idea of VideoJAM is to inject a stronger motion prior into the video generation model, thereby improving the motion consistency of the generated videos. The approach consists of two key stages:

  • Training phase: The model learns to predict not only the pixels of the generated frames but also the motion within them.
    Given an input video x_1 and its corresponding motion representation d_1, both are noised and embedded into a single joint latent representation via a linear layer W_in+. The diffusion model then processes this joint representation, and two linear projection layers W_out+ predict appearance and motion separately (see the first sketch after this list).

  • Inference stage: Effectively improves the consistency of motion in the generated video.
    The model employs a mechanism called "Inner-Guidance": at each step of video generation, the noisy motion information the model itself has just predicted is used to guide the subsequent prediction, significantly enhancing the coherence of motion in the video (see the second sketch after this list).
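To make the training phase concrete, here is a minimal PyTorch-style sketch of the joint appearance-motion prediction. The class and function names, the simplified additive noising, and the plain MSE objective are illustrative assumptions rather than Meta's released implementation; only the overall structure (one shared in-projection W_in+, a diffusion backbone over the joint latent, and two out-projections W_out+) follows the paper's description.

```python
# Sketch only: illustrative names, simplified noising, not Meta's official code.
import torch
import torch.nn as nn

class JointVideoJAMHead(nn.Module):
    def __init__(self, latent_dim: int, hidden_dim: int, backbone: nn.Module):
        super().__init__()
        # W_in+: one linear layer embedding the concatenated noisy video and
        # motion latents into a single joint representation.
        self.w_in = nn.Linear(2 * latent_dim, hidden_dim)
        self.backbone = backbone  # existing diffusion model (e.g. a DiT), unchanged
        # W_out+: two linear projections reading appearance and motion back out.
        self.w_out_appearance = nn.Linear(hidden_dim, latent_dim)
        self.w_out_motion = nn.Linear(hidden_dim, latent_dim)

    def forward(self, noisy_video, noisy_motion, t):
        joint = self.w_in(torch.cat([noisy_video, noisy_motion], dim=-1))
        h = self.backbone(joint, t)  # diffusion backbone processes the joint latent
        return self.w_out_appearance(h), self.w_out_motion(h)

def training_loss(model, video_latent, motion_latent, t, noise):
    # Both the video x_1 and its motion representation d_1 are noised and the
    # model is supervised on both targets. Additive noise here is a placeholder
    # for the backbone's actual diffusion noising schedule.
    noisy_video = video_latent + noise
    noisy_motion = motion_latent + noise
    pred_app, pred_mot = model(noisy_video, noisy_motion, t)
    return (nn.functional.mse_loss(pred_app, video_latent)
            + nn.functional.mse_loss(pred_mot, motion_latent))
```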
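And a hedged sketch of the Inner-Guidance idea at sampling time, reusing the hypothetical joint model above. The classifier-free-guidance-style combination rule and the guidance weight are assumptions; the point illustrated is only that the model's own noisy motion prediction from the current step is fed back in to steer the next denoising step.

```python
# Sketch only: the guidance formula and weight are assumptions, not the paper's exact rule.
import torch

@torch.no_grad()
def inner_guidance_step(model, noisy_video, noisy_motion, t, w_motion: float = 2.0):
    # Prediction conditioned on the model's own (noisy) motion estimate.
    app_with_motion, motion_pred = model(noisy_video, noisy_motion, t)
    # Reference prediction with the motion channel suppressed (zeroed out).
    app_no_motion, _ = model(noisy_video, torch.zeros_like(noisy_motion), t)
    # Push the appearance prediction along the direction implied by the motion signal.
    app_guided = app_no_motion + w_motion * (app_with_motion - app_no_motion)
    # The freshly predicted motion becomes the motion input of the next denoising step.
    return app_guided, motion_pred
```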

Demonstration of the generation effects of the VideoJAM model

Here we showcase results generated by Meta's newly launched VideoJAM-30B model, which produces high-quality videos. The test scenarios all involve complex and highly challenging tasks.

Qualitative comparison with leading models: VideoJAM-bench benchmark evaluation

The paper also reports evaluations on the VideoJAM-bench benchmark, comparing VideoJAM with leading proprietary models in the industry (such as Sora, Kling, and Runway Gen3) and with the base model (DiT-30B). The test content was drawn from representative motion-generation tasks. The results show that VideoJAM surpasses these leading models in both motion coherence and overall video quality, demonstrating a clear advantage.