Genie 2 is a foundational world model that can generate infinitely diverse, controllable, and playable 3D environments. From a single prompt image it can create interactive worlds that human players or AI agents can control through keyboard and mouse input.
capabilities
Genie 2 achieves a significant leap in generality: it can generate extremely rich 3D virtual worlds and simulate the consequences of arbitrary actions (such as jumping or swimming). Trained on a large-scale video dataset, it exhibits many emergent capabilities at scale, such as:
- object interaction
- complex character animation
- physics simulation
- modeling and prediction of other agents' behavior
motion control and intelligent response
Genie 2 responds intelligently to keyboard input, correctly identifying the controllable character and applying the corresponding movement. For example, when the player presses the arrow keys, the model moves the character, not the surrounding trees or clouds.
generation of counterfactual scenarios
Genie 2 can generate different trajectories from the same initial frame, simulating diverse counterfactual experiences and providing more possibilities for agent training. Starting from an identical frame, each rollout presents a completely different scene depending on the player's action inputs.
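The branching idea can be sketched abstractly: the same initial latent frame, rolled forward under different action sequences, yields divergent trajectories. The `step` function below is a toy stand-in for one generative step of the model, purely illustrative:

```python
def branch(z0, action_sequences, step):
    """Roll the same initial latent z0 forward under each action
    sequence; each branch is one counterfactual trajectory."""
    trajectories = []
    for actions in action_sequences:
        z = z0
        traj = [z]
        for a in actions:
            z = step(z, a)  # stand-in for one generative step
            traj.append(z)
        trajectories.append(traj)
    return trajectories

# Toy dynamics: the action simply shifts the latent.
step = lambda z, a: z + a
left, right = branch(0.0, [[-1, -1], [+1, +1]], step)
# Same start (0.0), divergent outcomes: left ends at -2, right at +2.
```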
long-term memory capability
Genie 2 has a powerful memory: it can remember parts of the world that leave the current field of view and render them accurately when they re-enter it.
long-term video generation and dynamically expanding content
Genie 2 can generate consistent virtual worlds up to one minute long, dynamically inventing plausible new content along the way. This makes generated scenes not only rich and varied but also highly coherent.
diverse environments and perspective switching
Genie 2 supports multiple camera perspectives, including:
- first-person perspective
- isometric perspective
- third-person driving mode
These options let researchers flexibly adjust how the environment is presented to suit different task requirements.
capability to generate complex 3D structures
Genie 2 has learned to create complex 3D visual scenes, rendering virtual buildings and terrain with convincing depth and detail.
object functionality and interaction simulation
Genie 2 can also model a variety of object interaction behaviors, such as:
- bursting balloons
- opening doors
- shooting explosive barrels
These interactive features enhance the realism and immersion of the virtual scene.
character animation and behavior simulation
Genie 2 has learned to generate diverse dynamic behaviors for characters, for example:
- jumping
- running
- dancing
These animations make virtual characters more vivid and realistic.
modeling of non-player characters (NPCs)
Genie 2 not only generates other virtual agents but also simulates their complex interactions, such as collaborative or adversarial behavior among multiple characters.
realistic physical effect modeling
Genie 2 can simulate physical phenomena, including:
- water flow effects
- smoke effects
- gravity
These effects add realism to the generated world.
advanced light and shadow rendering capabilities
Genie 2 supports high-quality light and shadow rendering, including:
- point and directional light sources
- reflections
- halos and colored lighting
These characteristics further enhance visual expressiveness.
generating interactive worlds from real images
Genie 2 can be prompted with real-world photos, accurately modeling dynamic effects such as grass blowing in the wind and water flowing. This ability blurs the boundary between the virtual and the real, providing new possibilities for content creation.
rapid prototyping and environment testing
Genie 2 provides a powerful tool for rapidly prototyping diverse interactive experiences, letting researchers quickly spin up new environments for training and testing embodied AI agents.
from images to diverse interactive scenarios
Starting from prompt images, the following scenarios were modeled:
- paper airplane flight
- dragon flight
- eagle flight
- parachute flight
These prototypes verify Genie 2's ability to animate very different characters, providing rich exploration possibilities for agent interaction scenarios.
from concept art to interactive environments
Thanks to Genie 2's out-of-distribution generalization, concept art and hand-drawn sketches can be converted directly into interactive virtual environments. This both accelerates the creative process for artists and designers and provides an innovative environment-prototyping tool for research.
Examples of Genie 2's generative results:

- concept design -> interactive environment
- hand-drawn sketch -> complete virtual world
agent's action capability in the virtual world
With Genie 2, researchers can quickly generate rich, diverse environments and evaluate how agents perform on new tasks that the agents never saw during training.
Scene generated from a text prompt:
Prompt: "A screenshot of a third-person open world exploration game. The player is an adventurer exploring a forest. There is a house with a red door on the left, and a house with a blue door on the right. The camera is placed directly behind the player. #photorealistic #immersive"
Test objective:
- The agent receives an instruction: “Open the blue door” or “Open the red door”
- The task-completion process is simulated, demonstrating the agent’s adaptability
In the virtual environment generated by Genie 2, the SIMA agent completed these interactive instructions via keyboard and mouse input, showcasing strong task execution capabilities.
Commands such as “turn around” or “go to the back of the house” test whether Genie 2 can generate consistent views of the environment, verifying the logic and stability of its generative capabilities.
simulating complex scenarios and environmental decision-making
The following is an example of a complex scenario generated from a prompt:
Prompt: "An image of a computer game showing a scene from inside a rough hewn stone cave or mine. The viewer's position is a 3rd person camera based above a player avatar looking down towards the avatar. The player avatar is a knight with a sword. In front of the knight avatar there are x3 stone arched doorways and the knight chooses to go through any one of these doors. Beyond the first and inside we can see strange green plants with glowing flowers lining that tunnel. Inside and beyond the second doorway there is a corridor of spiked iron plates riveted to the cave walls leading towards an ominous glow further along. Through the third door we can see a set of rough hewn stone steps ascending to a mysterious destination."
With these scene prompts, researchers can direct the agent to choose among:
- “Go upstairs”
- “Enter the plant area”
- “Cross the middle corridor”
Result:
Genie 2 successfully generated diverse decision paths, providing rich exploration and learning scenarios for the agent.
world generation based on diffusion models
Genie 2 is an autoregressive latent diffusion model trained on a large-scale video dataset. It generates worlds through the following key stages:
encoding and dynamic modeling
Video frames are first encoded into latent frames by an autoencoder. These latents are then fed into a large transformer dynamics model, which is trained with a causal mask similar to that used in large language models.
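The causal mask mentioned above is a generic construction from decoder-only language models, not a detail specific to Genie 2; a minimal sketch:

```python
import numpy as np

def causal_attention_mask(num_frames: int) -> np.ndarray:
    """Lower-triangular boolean mask: position t may attend only to
    positions <= t, so each frame is predicted from its past only."""
    return np.tril(np.ones((num_frames, num_frames), dtype=bool))

mask = causal_attention_mask(4)
# Frame 0 attends only to itself; frame 3 attends to frames 0..3.
```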
autoregressive sampling
Genie 2 samples autoregressively, frame by frame: each new frame is generated conditioned on the previous actions and latent frame states.
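The sampling loop can be sketched as follows. The `dynamics` callable is a stand-in for the model (in reality a diffusion denoiser produces each latent frame); only the autoregressive structure is shown:

```python
import numpy as np

def rollout(dynamics, z0, actions):
    """Autoregressive frame-by-frame sampling: each step conditions
    on the full history of latent frames and actions so far."""
    latents = [z0]
    for t in range(len(actions)):
        z_next = dynamics(latents, actions[: t + 1])
        latents.append(z_next)
    return latents

# Toy dynamics for illustration: next latent = last latent + action.
toy = lambda lats, acts: lats[-1] + acts[-1]
traj = rollout(toy, np.zeros(2), [np.ones(2), np.ones(2) * 2])
```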
classifier-free guidance
To improve the precision of action control, the model adopts classifier-free guidance.
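Classifier-free guidance combines a conditional and an unconditional model prediction; the generic formulation is sketched below (the guidance scale is illustrative, and nothing here reflects Genie 2's actual values):

```python
import numpy as np

def cfg(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: move the denoiser output from the
    unconditional prediction toward the conditioned one (here, the
    condition would be the player's action). scale = 1 recovers the
    conditional prediction; scale > 1 amplifies the conditioning."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

u = np.array([0.0, 0.0])   # unconditional prediction
c = np.array([1.0, -1.0])  # action-conditioned prediction
guided = cfg(u, c, scale=2.0)  # [2.0, -2.0]
```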
All examples shown come from the undistilled base model, to showcase Genie 2's potential capabilities. A distilled model can run in real time, but at reduced output quality.
Genie 2 achieves flexible, precise environment generation, making it an important tool for researchers exploring virtual scenes and training agents.
Easter egg
During Genie 2's generative process, some unexpected scenes also demonstrated the model's creativity:
- ghosts appear in the garden without any action input
- a character chooses parkour instead of snowboarding