Mochi 1: Open-source video generation model

Mochi 1 Preview is an open-source video generation model released under the Apache 2.0 license. It produces high-fidelity motion and responds closely to prompts, significantly narrowing the gap between open-source and closed video generation systems.

Team Introduction

The core members of Genmo's team come from projects such as DDPM (denoising diffusion probabilistic models), DreamFusion, and Emu Video. Genmo’s technical advisory team also consists of top industry experts, including Ion Stoica, co-founder and executive chairman of Databricks and Anyscale, Pieter Abbeel, co-founder of Covariant and early member of OpenAI, and Joey Gonzalez, a pioneer in language model systems and co-founder of Turi.

Funding: Raised $28.4 million in a Series A round led by NEA, with Rick Yang of NEA as the lead investor. Participating institutions include The House Fund, Gold House Ventures, WndrCo, Eastlink Capital Partners, and Essence VC, along with angel investors such as Abhay Parasnis (CEO of Typeface), Amjad Masad (CEO of Replit), Sabrina Hahn, Bonita Stewart, and Michele Catasta.

Genmo's mission is to unlock the right-brain potential of artificial general intelligence. Mochi 1 is an important starting point for building world simulators that can emulate everything, whether real or fictional.

Model Evaluation

Currently, there is a significant gap between generated video and reality. Motion quality and prompt responsiveness are two key capabilities that video generation models still largely lack.

Mochi 1 sets a new benchmark for open-source video generation and is highly competitive with leading closed models in terms of performance:

  • Prompt responsiveness: Mochi 1 responds closely to text prompts, generating videos that accurately reflect the instructions and giving users detailed control over characters, scenes, and actions. Prompt responsiveness is evaluated with automated metrics based on vision-language models, following the OpenAI DALL-E 3 protocol, with the Gemini-1.5-Pro-002 model judging the generated videos.

  • Motion quality: Mochi 1 generates smooth 5.4-second videos at 30 frames per second, with high temporal consistency and realistic motion dynamics. The model simulates physical phenomena such as fluid dynamics and the movement of hair and fur, producing consistently fluid human motion that begins to cross the uncanny valley. Evaluators are asked to focus on motion rather than single-frame aesthetics (criteria include how interesting the motion is, physical plausibility, and smoothness), and Elo scores are calculated following the LMSYS Chatbot Arena protocol (a minimal sketch of this update appears after the list).
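
For reference, here is a minimal sketch of an Elo-style rating update for pairwise preferences. This is my own illustration of the general idea, not the LMSYS implementation; the K-factor and starting ratings are arbitrary.

```python
# Illustrative Elo update for pairwise model comparisons (not LMSYS's code).
def update_elo(rating_a, rating_b, score_a, k=32):
    """score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
    rating_a += k * (score_a - expected_a)
    rating_b += k * ((1.0 - score_a) - (1.0 - expected_a))
    return rating_a, rating_b

# Example: both models start at 1000; model A wins one pairwise comparison.
print(update_elo(1000.0, 1000.0, 1.0))  # -> (1016.0, 984.0)
```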

Trial Use

The currently released version is a baseline 480p edition. By the end of this year, the team will release the full version of Mochi 1, including Mochi 1 HD. Mochi 1 HD will support 720p video generation with higher detail fidelity and smoother motion, and will better handle edge cases such as image distortion in complex scenes.

Download the model: https://github.com/genmoai/models
Official website: https://www.genmo.ai/play

I tried it myself, and the results were not great. Perhaps once the open-source community gets involved, optimizations will improve its performance :)

Limitations

As a research preview, Mochi 1 is a dynamically evolving checkpoint with some known limitations. The initial version supports 480p video generation and may exhibit slight image distortion and artifacts in edge cases involving extreme motion. Since Mochi 1 is mainly optimized for realistic styles, it performs poorly when generating animated-style content. Additionally, the team anticipates that the community will fine-tune the model to meet different aesthetic needs.

Model Architecture

Mochi 1 marks significant progress in open-source video generation: it is a 10-billion-parameter diffusion model built on the novel Asymmetric Diffusion Transformer (AsymmDiT) architecture. Trained from scratch, it is currently the largest open-source video generation model. Just as importantly, the architecture is simple and easy to modify.

Efficiency is critical so that the community can run the Mochi 1 model. To this end, the team has also open-sourced the video VAE, which causally compresses video data by a factor of 128: 8x8 spatial and 6x temporal compression into a 12-channel latent space.
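
As a rough illustration, the sketch below (my own, not code from the Mochi 1 repository) shows how these stated compression factors map pixel-space video dimensions to latent dimensions; the example resolution, the frame count, and the simple floor-division rounding on the causal temporal axis are assumptions.

```python
# Hypothetical helper (not from the Mochi 1 repo): maps video dimensions to
# latent dimensions using the compression factors stated above. Rounding on
# the causal temporal axis is simplified to floor division here.
def latent_shape(frames: int, height: int, width: int,
                 t_factor: int = 6, s_factor: int = 8, latent_ch: int = 12):
    """Return (channels, frames, height, width) of the latent tensor."""
    return (latent_ch, frames // t_factor, height // s_factor, width // s_factor)

# Example: a 5.4-second clip at 30 fps (162 frames) at an assumed 848x480 resolution.
print(latent_shape(frames=162, height=480, width=848))
# -> (12, 27, 60, 106)
```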

AsymmDiT streamlines text processing and concentrates computational resources on visual reasoning, efficiently handling user prompts together with compressed video tokens. Like Stable Diffusion 3, it uses multimodal self-attention to jointly attend to text and visual tokens, with an independent MLP for each modality. However, Mochi 1 allocates nearly four times as many parameters to the visual stream as to the text stream, processing visual information with a larger hidden dimension. Non-square QKV and output projection layers unify the two modalities within self-attention, and this asymmetric design reduces memory requirements during inference.
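
To make the asymmetric design concrete, here is a simplified PyTorch sketch, not the released implementation: each modality projects through non-square QKV layers into a shared attention space, the streams attend jointly, and then each returns to its own width and its own MLP. All dimensions are illustrative, not Mochi 1's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AsymmetricJointAttention(nn.Module):
    """Sketch of one asymmetric joint-attention block (illustrative sizes)."""

    def __init__(self, d_vis=1024, d_txt=256, d_attn=1024, n_heads=8):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_attn // n_heads
        # Non-square projections: each modality maps into a shared attention dim.
        self.qkv_vis = nn.Linear(d_vis, 3 * d_attn)
        self.qkv_txt = nn.Linear(d_txt, 3 * d_attn)
        self.out_vis = nn.Linear(d_attn, d_vis)
        self.out_txt = nn.Linear(d_attn, d_txt)
        # Independent feed-forward layers per modality.
        self.mlp_vis = nn.Sequential(nn.Linear(d_vis, 4 * d_vis), nn.GELU(),
                                     nn.Linear(4 * d_vis, d_vis))
        self.mlp_txt = nn.Sequential(nn.Linear(d_txt, 4 * d_txt), nn.GELU(),
                                     nn.Linear(4 * d_txt, d_txt))

    def _split(self, qkv, batch, length):
        # (B, L, 3*d_attn) -> three (B, heads, L, d_head) tensors
        return [t.view(batch, length, self.n_heads, self.d_head).transpose(1, 2)
                for t in qkv.chunk(3, dim=-1)]

    def forward(self, vis, txt):
        B, Lv, _ = vis.shape
        _, Lt, _ = txt.shape
        qv, kv, vv = self._split(self.qkv_vis(vis), B, Lv)
        qt, kt, vt = self._split(self.qkv_txt(txt), B, Lt)
        # Joint self-attention over the concatenated token sequence.
        q = torch.cat([qv, qt], dim=2)
        k = torch.cat([kv, kt], dim=2)
        v = torch.cat([vv, vt], dim=2)
        attn = F.scaled_dot_product_attention(q, k, v)
        attn = attn.transpose(1, 2).reshape(B, Lv + Lt, -1)
        vis_out, txt_out = attn[:, :Lv], attn[:, Lv:]
        # Project back to each modality's own width, then per-modality MLPs.
        vis = vis + self.out_vis(vis_out)
        txt = txt + self.out_txt(txt_out)
        return vis + self.mlp_vis(vis), txt + self.mlp_txt(txt)
```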

Many modern diffusion models use multiple pre-trained language models to process user prompts, but Mochi 1 encodes prompts through a single T5-XXL language model.
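
For context, here is a minimal sketch of obtaining prompt embeddings from a single T5-XXL encoder with the Hugging Face Transformers library. The public google/t5-v1_1-xxl checkpoint and the 256-token padding length are my assumptions for illustration, and how Mochi 1 actually consumes these embeddings is not shown.

```python
# Encode a prompt with a single T5-XXL text encoder (large download).
from transformers import T5EncoderModel, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("google/t5-v1_1-xxl")
encoder = T5EncoderModel.from_pretrained("google/t5-v1_1-xxl")

tokens = tokenizer("A close-up of ocean waves at sunset",
                   return_tensors="pt", padding="max_length",
                   truncation=True, max_length=256)
prompt_embeddings = encoder(**tokens).last_hidden_state  # shape (1, 256, 4096)
```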

Mochi 1 performs 3D attention over a context window of up to 44,520 video tokens. To position each token, Mochi 1 extends learnable rotary position embeddings (RoPE) to three dimensions, and the network learns mixed frequencies across the spatial and temporal axes end-to-end.
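
Below is a minimal sketch of the 3D extension described here, assuming learnable per-axis frequencies and an even split of the head dimension across the temporal and two spatial axes. The dimensions, initialization, and rotation layout are illustrative, not Mochi 1's actual configuration.

```python
import torch
import torch.nn as nn


class Rope3D(nn.Module):
    """Illustrative 3D rotary position embedding with learnable frequencies."""

    def __init__(self, d_head=96):
        super().__init__()
        assert d_head % 6 == 0  # 3 axes, each rotated in (even, odd) pairs
        self.d_axis = d_head // 3
        # One learnable frequency vector per axis (t, h, w).
        self.freqs = nn.Parameter(torch.randn(3, self.d_axis // 2) * 0.02)

    def forward(self, x, positions):
        # x: (B, heads, N, d_head); positions: (N, 3) integer (t, h, w) coordinates.
        outs = []
        for axis in range(3):
            angles = positions[:, axis:axis + 1].float() * self.freqs[axis]  # (N, d_axis/2)
            cos, sin = angles.cos(), angles.sin()
            xa = x[..., axis * self.d_axis:(axis + 1) * self.d_axis]
            x1, x2 = xa[..., 0::2], xa[..., 1::2]
            rotated = torch.stack([x1 * cos - x2 * sin,
                                   x1 * sin + x2 * cos], dim=-1).flatten(-2)
            outs.append(rotated)
        return torch.cat(outs, dim=-1)
```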

Mochi 1 also benefits from some of the latest improvements in language model scaling, including SwiGLU feedforward layers, query-key normalization for enhanced stability, and sandwich normalization for controlling internal activations.
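
As a reference for one of these techniques, here is a minimal SwiGLU feed-forward layer. The hidden-size multiplier is an illustrative choice, and the query-key and sandwich normalization pieces are not shown.

```python
import torch.nn as nn
import torch.nn.functional as F


class SwiGLU(nn.Module):
    """SiLU-gated feed-forward layer: down(silu(gate(x)) * up(x))."""

    def __init__(self, dim, hidden_mult=4):
        super().__init__()
        # 2/3 scaling keeps parameter count comparable to a plain GELU MLP.
        hidden = int(dim * hidden_mult * 2 / 3)
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))
```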