2024 is regarded by many as the inaugural year of AI-generated video, thanks to significant advances in video generation technology from companies such as OpenAI. The standout in this field is OpenAI's Sora, which can generate videos up to one minute long while maintaining high visual quality and closely adhering to user prompts.
OpenAI has demonstrated the tremendous potential of large-scale training ("great effort leads to miracles") by training generative models at scale on video data. At the core of this approach is the joint training of text-conditional diffusion models on videos and images of varying durations, resolutions, and aspect ratios. This means Sora does not just generate static images; it produces dynamic video content, offering users richer and more diverse creative possibilities.
OpenAI adopted a Transformer architecture that operates on spacetime patches of video and image latent codes. By operating on these spacetime patches, Sora can capture both temporal continuity and spatial detail, producing naturally flowing video sequences.
Transforming visual data into patches
Just as LLMs have text tokens, Sora has visual patches. Previous research has shown that patches are an effective representation for visual data models. We found that patches are a highly scalable and efficient representation suitable for training generative models on various types of videos and images. At a high level, we achieve this by first compressing the video into a lower-dimensional latent space, then decomposing the representation into spacetime patches.
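To make this concrete, here is a minimal PyTorch sketch of how a compressed video latent could be cut into spacetime patches and flattened into a token sequence. The patch sizes and latent shape below are illustrative assumptions; Sora's actual values are not public.

```python
import torch

def spacetime_patchify(latent, pt=2, ph=2, pw=2):
    """Split a compressed video latent into spacetime patch tokens.

    latent: (C, T, H, W) tensor, e.g. the output of a video encoder.
    pt, ph, pw: hypothetical patch sizes along time, height, and width.
    Returns a (num_patches, C * pt * ph * pw) token matrix.
    """
    C, T, H, W = latent.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    x = latent.reshape(C, T // pt, pt, H // ph, ph, W // pw, pw)
    # Bring the (t, h, w) patch grid to the front, then flatten each patch.
    x = x.permute(1, 3, 5, 0, 2, 4, 6)       # (T/pt, H/ph, W/pw, C, pt, ph, pw)
    return x.reshape(-1, C * pt * ph * pw)   # (num_patches, patch_dim)

# Example: a 16-frame latent at 32x32 spatial resolution with 8 channels.
tokens = spacetime_patchify(torch.randn(8, 16, 32, 32))
print(tokens.shape)  # torch.Size([2048, 64])
```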
Video compression network
OpenAI trained a network that reduces the dimensionality of visual data. This network takes raw video as input and outputs a compressed latent representation in both time and space. Sora is trained within this compressed latent space and subsequently generates video in the same space. We also train a corresponding decoder model to map the generated latents back to pixel space.
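Below is a minimal sketch of what such a compression network could look like, assuming a simple convolutional autoencoder that downsamples in both time and space. The architecture, channel counts, and compression ratios here are placeholders; OpenAI has not disclosed the real ones.

```python
import torch
import torch.nn as nn

class VideoAutoencoder(nn.Module):
    """Toy spatiotemporal autoencoder (an assumed stand-in, not Sora's network).

    The encoder maps raw video (B, 3, T, H, W) to a latent that is 4x smaller
    along time, height, and width; the decoder maps latents back to pixels.
    """
    def __init__(self, latent_channels=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv3d(64, latent_channels, kernel_size=3, stride=2, padding=1),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(latent_channels, 64, kernel_size=4, stride=2, padding=1),
            nn.SiLU(),
            nn.ConvTranspose3d(64, 3, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, video):
        z = self.encoder(video)            # compressed latent (B, C, T/4, H/4, W/4)
        return self.decoder(z), z

model = VideoAutoencoder()
video = torch.randn(1, 3, 16, 64, 64)      # (batch, RGB, frames, height, width)
recon, latent = model(video)
print(latent.shape, recon.shape)           # latent: (1, 8, 4, 16, 16); recon matches the input
```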
Spatiotemporal latent patches
Given a compressed input video, we extract a series of spatiotemporal patches, which serve as tokens for the transformer. This approach also applies to images, as they are just single-frame videos. Our patch-based representation enables Sora to be trained on videos and images with varying resolutions, durations, and aspect ratios. During inference, we can control the size of the generated video by arranging randomly initialized patches on a grid of appropriate dimensions.
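The sketch below illustrates this inference-time control, assuming hypothetical patch sizes and token dimensions: choosing the grid dimensions fixes the duration and resolution of the video to be generated.

```python
import torch

def init_noise_tokens(frames, height, width, pt=2, ph=2, pw=2, patch_dim=64):
    """Lay out randomly initialized patch tokens on a spacetime grid.

    The grid shape (frames/pt, height/ph, width/pw) determines the duration
    and resolution of the video to be generated; the patch sizes and
    patch_dim used here are hypothetical.
    """
    grid = (frames // pt, height // ph, width // pw)
    tokens = torch.randn(grid[0] * grid[1] * grid[2], patch_dim)
    return tokens, grid

# A short landscape clip and a longer portrait clip, sampled from the same model:
tokens_a, grid_a = init_noise_tokens(frames=16, height=32, width=56)
tokens_b, grid_b = init_noise_tokens(frames=32, height=56, width=32)
print(grid_a, tokens_a.shape)   # (8, 16, 28) -> 3584 noise tokens
print(grid_b, tokens_b.shape)   # (16, 28, 16) -> 7168 noise tokens
```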
Extending transformers for video generation
Sora is a diffusion model: given input noisy patches (and conditioning information such as text prompts), it is trained to predict the original "clean" patches. Importantly, Sora is a diffusion transformer. Transformers have demonstrated remarkable scaling properties across domains, including language modeling, computer vision, and image generation. Diffusion transformers also scale effectively as video models: as training compute increases, sample quality improves markedly.
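The toy training step below illustrates this objective with a deliberately small stand-in transformer and a simplified noising scheme; Sora's actual parameterization, noise schedule, and conditioning mechanism are not public.

```python
import torch
import torch.nn as nn

class TinyDiffusionTransformer(nn.Module):
    """Deliberately small stand-in for a diffusion transformer over patch tokens."""
    def __init__(self, patch_dim=64, text_dim=64, width=128, layers=2):
        super().__init__()
        self.patch_in = nn.Linear(patch_dim, width)
        self.text_in = nn.Linear(text_dim, width)
        block = nn.TransformerEncoderLayer(d_model=width, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, num_layers=layers)
        self.patch_out = nn.Linear(width, patch_dim)

    def forward(self, noisy_patches, text_tokens):
        # Attend jointly over text conditioning and noisy visual tokens.
        n_text = text_tokens.shape[1]
        x = torch.cat([self.text_in(text_tokens), self.patch_in(noisy_patches)], dim=1)
        x = self.backbone(x)
        return self.patch_out(x[:, n_text:])   # predictions for the visual tokens only

model = TinyDiffusionTransformer()
clean = torch.randn(2, 256, 64)                 # (batch, patches, patch_dim)
text = torch.randn(2, 16, 64)                   # pretend text-encoder outputs
t = torch.rand(2, 1, 1)                         # per-sample noise level
noisy = (1 - t) * clean + t * torch.randn_like(clean)   # simplified corruption process
# A real model would also be conditioned on the noise level / timestep.
loss = nn.functional.mse_loss(model(noisy, text), clean)  # learn to predict the "clean" patches
loss.backward()
```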
Variable duration, resolution, aspect ratio
Previous approaches to image and video generation typically resize, crop, or trim videos to a standard size, for example 4-second videos at 256x256 resolution. We find that training on data at its native size instead provides several benefits:
Sora can sample widescreen 1920x1080p video, portrait 1080x1920 video, and everything in between. This enables Sora to create content natively in the aspect ratios of different devices. It also allows us to quickly prototype low-resolution content before generating full-resolution output — all with the same model.
We found through experimentation that training on videos at their native aspect ratios improves composition and framing. We compared Sora to a version of the model whose training videos were all cropped to squares, a common practice when training generative models. The model trained on square crops sometimes generates videos in which the subject is only partially in view; by contrast, Sora's videos show improved composition.
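One plausible way to train on data at its native size, sketched below purely as an assumption (OpenAI has not described its data pipeline), is to batch clips of different shapes by their spacetime token count rather than cropping them to a fixed square resolution.

```python
from dataclasses import dataclass

@dataclass
class Clip:
    frames: int
    height: int
    width: int

def token_count(clip: Clip, pt=2, ph=16, pw=16) -> int:
    """Number of spacetime patches a clip contributes at its native size
    (pixel-space patch sizes here are hypothetical)."""
    return (clip.frames // pt) * (clip.height // ph) * (clip.width // pw)

def pack_by_budget(clips, max_tokens=300_000):
    """Greedily group clips of different shapes into batches under a shared
    token budget, instead of cropping everything to one fixed square size."""
    batches, batch, used = [], [], 0
    for clip in clips:
        n = token_count(clip)
        if batch and used + n > max_tokens:
            batches.append(batch)
            batch, used = [], 0
        batch.append(clip)
        used += n
    if batch:
        batches.append(batch)
    return batches

clips = [Clip(32, 1080, 1920), Clip(64, 1920, 1080), Clip(16, 720, 720)]
for b in pack_by_budget(clips):
    print([(c.height, c.width) for c in b], sum(token_count(c) for c in b))
```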
Language understanding
We applied the re-captioning technique we introduced in DALL·E 3 to video. We first train a highly descriptive caption generation model, then use it to generate text captions for all videos in our training set. We find that training on highly descriptive video captions improves both textual fidelity and overall video quality. Similar to DALL·E 3, we also use GPT to expand short user prompts into longer, detailed captions that are then sent to the video model. This enables Sora to produce high-quality videos that accurately follow user prompts.
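The sketch below shows the shape of this two-stage use of language models; the function names, prompt wording, and placeholder LLM are assumptions for illustration, not OpenAI's actual pipeline.

```python
# Sketch of the two uses of language models described above. The function and
# model names below are placeholders, not OpenAI's actual APIs or prompts.

def caption_training_videos(videos, captioner):
    """Stage 1 (training time): generate a highly descriptive caption for every
    training video and pair it with the video for text-conditional training."""
    return [(video, captioner.describe(video)) for video in videos]

def expand_user_prompt(short_prompt, llm):
    """Stage 2 (inference time): expand a short user prompt into a longer,
    detailed caption in the style the model was trained on."""
    instruction = (
        "Rewrite the following video idea as a detailed, descriptive caption, "
        "specifying subjects, setting, lighting, and camera motion:\n"
    )
    return llm.complete(instruction + short_prompt)

class EchoLLM:
    """Stand-in LLM so the sketch runs end to end."""
    def complete(self, prompt):
        return prompt.splitlines()[-1] + ", golden-hour light, slow dolly shot"

print(expand_user_prompt("a corgi surfing", EchoLLM()))
```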
Using images and videos as prompts
Sora can also generate videos from an image and a prompt provided as input. Below are example videos generated from DALL·E 2 and DALL·E 3 images.
Sora can also extend videos forward or backward in time. The videos below were all extended backward in time, starting from a segment of a generated video; as a result, each one begins differently, but they all arrive at the same ending.
We can use this method to extend a video both forward and backward to create a seamless infinite loop.
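A minimal sketch of how extension could be framed, assuming a hypothetical sampler interface: the known segment's latent frames are held fixed while newly prepended (or appended) frames are denoised from noise; conditioning both ends on the same frames would yield a loop. This illustrates the idea only, not Sora's actual mechanism.

```python
import torch

def extend_backward(known_latent, new_frames, denoise_fn):
    """Sketch of extending a video backward in time (hypothetical interface).

    known_latent: (C, T, H, W) latent of an existing generated segment.
    new_frames:   number of latent frames to prepend.
    denoise_fn:   stands in for a diffusion sampler that denoises the latent
                  while keeping the frames selected by `keep` fixed.
    """
    C, T, H, W = known_latent.shape
    noise = torch.randn(C, new_frames, H, W)
    latent = torch.cat([noise, known_latent], dim=1)   # new frames come first
    keep = torch.zeros(T + new_frames, dtype=torch.bool)
    keep[new_frames:] = True                           # hold the known segment fixed
    return denoise_fn(latent, keep)

# For a seamless loop, both ends could instead be extended while conditioned on
# the same frames, so the last frame flows back into the first.
identity_sampler = lambda latent, keep: latent         # placeholder sampler
out = extend_backward(torch.randn(8, 8, 32, 32), new_frames=4, denoise_fn=identity_sampler)
print(out.shape)                                       # torch.Size([8, 12, 32, 32])
```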
Diffusion models provide numerous methods for editing images and videos from text prompts. Below, we apply one of these methods, SDEdit, to Sora. This technique enables Sora to perform zero-shot style and scene transfer on input videos.
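The core of SDEdit is simple to sketch: partially noise the latent of the input video and then denoise it under the new prompt, so the overall structure is preserved while style and scene change. The interfaces below are placeholders.

```python
import torch

def sdedit_video(input_latent, prompt_embedding, denoise_from, strength=0.6):
    """SDEdit-style editing sketch with placeholder interfaces.

    Rather than starting from pure noise, partially noise the latent of an
    existing video and denoise it under a new prompt, keeping the original
    structure while changing style and scene. Higher `strength` means more
    noise and a larger departure from the input.
    """
    noise = torch.randn_like(input_latent)
    noised = (1.0 - strength) * input_latent + strength * noise
    # `denoise_from` stands in for running the sampler from the intermediate
    # noise level that corresponds to `strength`.
    return denoise_from(noised, prompt_embedding, strength)

sampler = lambda latent, cond, start_level: latent     # placeholder so the sketch runs
edited = sdedit_video(torch.randn(8, 16, 32, 32), torch.randn(1, 64), sampler)
print(edited.shape)                                    # torch.Size([8, 16, 32, 32])
```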
We can also use Sora to interpolate gradually between two input videos, creating a seamless transition between entirely different subjects and scene compositions. In the example below, the middle video interpolates between the corresponding videos on the left and right.
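One way such a transition could be seeded, shown here only as an assumption about the general idea rather than Sora's actual method, is to blend the two source latents with a weight that ramps across the time axis before refinement by the sampler.

```python
import torch

def seed_transition(latent_a, latent_b):
    """Blend two source-video latents with a weight that ramps over time, so
    frames start close to clip A and end close to clip B; the result would
    then be refined by the diffusion sampler. Illustrative only."""
    C, T, H, W = latent_a.shape
    w = torch.linspace(0.0, 1.0, T).view(1, T, 1, 1)   # per-frame blend weight
    return (1.0 - w) * latent_a + w * latent_b

mid = seed_transition(torch.randn(8, 16, 32, 32), torch.randn(8, 16, 32, 32))
print(mid.shape)   # torch.Size([8, 16, 32, 32])
```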
Image generation
Sora can also generate images. This is done by arranging patches of Gaussian noise in a spatial grid with a temporal extent of one frame. The model can generate images of varying sizes, at resolutions up to 2048x2048. This demonstrates Sora's flexibility and efficiency in image generation: it can create images of various sizes and styles without sacrificing detail, giving users broad creative possibilities.
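This reuses the noise-grid layout sketched earlier, with the temporal extent fixed to a single frame; the patch sizes and dimensions below remain hypothetical.

```python
import torch

def init_image_noise_tokens(height, width, ph=16, pw=16, patch_dim=64):
    """Treat an image as a one-frame video: a purely spatial grid of noise
    patches with temporal extent 1 (patch sizes here are hypothetical)."""
    grid = (1, height // ph, width // pw)                  # (frames, rows, cols)
    tokens = torch.randn(grid[0] * grid[1] * grid[2], patch_dim)
    return tokens, grid

tokens, grid = init_image_noise_tokens(2048, 2048)
print(grid, tokens.shape)   # (1, 128, 128) -> 16384 noise patches to denoise into an image
```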
Emergent capabilities
Video models trained at scale exhibit a number of interesting emergent capabilities, which enable Sora to simulate certain aspects of people, animals, and environments in the real world. These properties emerge without any explicit inductive biases for 3D, objects, and so on; they are purely phenomena of scale.
Sora is capable of generating videos with dynamic camera movements. As the camera shifts and rotates, characters and scene elements move consistently within a three-dimensional space.
Maintaining temporal consistency in long videos has always been a significant challenge for video generation systems. Sora is often able to effectively model both short-range and long-range dependencies. For example, the model can keep track of characters, animals, and objects even when they are occluded or leave the frame. Similarly, it can generate multiple shots of the same character within a single sample while maintaining their appearance throughout the video.
Sora can sometimes simulate actions that affect the state of the world in simple ways. For example, a painter can leave new brushstrokes on the canvas that persist over time, or someone eating a burger can leave bite marks.
Sora can also simulate digital worlds such as video games: it can control the player in Minecraft while simultaneously rendering the world and its dynamics at high fidelity. These capabilities can be elicited zero-shot by prompting with words such as "Minecraft."
Limitations
Sora cannot accurately simulate the physics of many basic interactions, such as glass shattering. Other interactions, like eating food, do not always produce the correct changes in object state. Other common failure modes include inconsistencies developing in long samples and objects suddenly popping into existence.
Conclusion
We believe the capabilities Sora has demonstrated so far show that continued scaling of video models is a promising path toward capable simulators of the physical and digital worlds, and of the objects, animals, and people within them. Despite its current limitations, these advances suggest that by increasing the scale and complexity of the model, we can gradually overcome these challenges and move closer to creating AGI that can intricately simulate the world around us.