Genie 2 is a foundational world model that can generate infinitely diverse, controllable, and playable 3D environments. From a single prompt image it can create interactive worlds that human players or AI agents can control through keyboard and mouse input.
capabilities
Genie 2 achieves a significant leap in generality: it can generate extremely rich 3D virtual worlds and simulate the consequences of arbitrary actions (such as jumping or swimming). Trained on a large-scale video dataset, it exhibits many emergent capabilities at scale, such as:
- object interaction
- complex character animation
- physics simulation
- modeling and prediction of other agents' behavior
motion control and intelligent response
Genie 2 responds intelligently to keyboard input, correctly identifying the controllable character and applying the corresponding movement. For example, when the player presses the arrow keys, the model moves the character, not the surrounding trees or clouds.
generation of counterfactual scenarios
Genie 2 can generate different trajectories from the same initial frame, simulating diverse counterfactual experiences and providing more possibilities for agent training. Starting from an identical frame, each rollout presents a completely different scene depending on the player's action inputs.
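The branching idea can be sketched abstractly: the same initial latent frame, rolled forward under different action sequences, yields divergent trajectories. The `step` function below is a toy stand-in for one generative step of the model, purely illustrative:

```python
def branch(z0, action_sequences, step):
    """Roll the same initial latent z0 forward under each action
    sequence; each branch is one counterfactual trajectory."""
    trajectories = []
    for actions in action_sequences:
        z = z0
        traj = [z]
        for a in actions:
            z = step(z, a)  # stand-in for one generative step
            traj.append(z)
        trajectories.append(traj)
    return trajectories

# Toy dynamics: the action simply shifts the latent.
step = lambda z, a: z + a
left, right = branch(0.0, [[-1, -1], [+1, +1]], step)
# Same start (0.0), divergent outcomes: left ends at -2, right at +2.
```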
long-term memory capability
Genie 2 has a powerful memory: it can remember parts of the world that leave the current field of view and render them accurately when they re-enter it.
long-term video generation and dynamically expanding content
Genie 2 can generate consistent virtual worlds up to one minute long, dynamically inventing plausible new content along the way. This makes generated scenes not only rich and varied but also highly coherent.
diverse environments and perspective switching
Genie 2 supports multiple camera perspectives, including:
- first-person perspective
- isometric perspective
- third-person driving mode
These options let researchers flexibly adjust how the environment is presented to suit different task requirements.
capability to generate complex 3D structures
Genie 2 has learned to create complex 3D visual scenes, rendering virtual buildings and terrain with convincing depth and detail.
object functionality and interaction simulation
Genie 2 can also model a variety of object interaction behaviors, such as:
- bursting balloons
- opening doors
- shooting explosive barrels
These interactive features enhance the realism and immersion of the virtual scene.
character animation and behavior simulation
Genie 2 has learned to generate diverse dynamic behaviors for characters, for example:
- jumping
- running
- dancing
These animations make virtual characters more vivid and realistic.
modeling of non-player characters (NPCs)
Genie 2 not only generates other virtual agents but also simulates their complex interactions, such as collaborative or adversarial behavior among multiple characters.
realistic physical effect modeling
Genie 2 can simulate physical phenomena, including:
- water flow effects
- smoke effects
- gravity
These effects add realism to the generated world.
advanced light and shadow rendering capabilities
Genie 2 supports high-quality light and shadow rendering, including:
- point and directional light sources
- reflections
- halos and colored lighting
These characteristics further enhance visual expressiveness.
generating interactive worlds from real images
Genie 2 can be prompted with real-world photos, accurately modeling dynamic effects such as grass blowing in the wind and water flowing. This ability blurs the boundary between the virtual and the real, providing new possibilities for content creation.
rapid prototyping and environment testing
Genie 2 provides a powerful tool for rapidly prototyping diverse interactive experiences, letting researchers quickly spin up new environments for training and testing embodied AI agents.
from images to diverse interactive scenarios
Starting from prompt images, the following scenarios were modeled:
- paper airplane flight
- dragon flight
- eagle flight
- parachute flight
These prototypes verify Genie 2's ability to animate very different characters, providing rich exploration possibilities for agent interaction scenarios.
from concept art to interactive environments
Thanks to Genie 2's out-of-distribution generalization, concept art and hand-drawn sketches can be converted directly into interactive virtual environments. This both accelerates the creative process for artists and designers and provides an innovative environment-prototyping tool for research.
Examples of Genie 2's generative results:

- concept design -> interactive environment
- hand-drawn sketch -> complete virtual world
agent's action capability in the virtual world
With Genie 2, researchers can quickly generate rich, diverse environments and evaluate how agents perform on new tasks that the agents never saw during training.
Scene generated from a text prompt:
Prompt: "A screenshot of a third-person open world exploration game. The player is an adventurer exploring a forest. There is a house with a red door on the left, and a house with a blue door on the right. The camera is placed directly behind the player. #photorealistic #immersive"
Test objective:
- The agent receives an instruction: “Open the blue door” or “Open the red door”
- The task-completion process is simulated, demonstrating the agent’s adaptability
In the virtual environment generated by Genie 2, the SIMA agent completed these interactive instructions via keyboard and mouse input, showcasing strong task execution capabilities.
Commands such as “turn around” or “go to the back of the house” test whether Genie 2 can generate consistent views of the environment, verifying the logic and stability of its generative capabilities.
simulating complex scenarios and environmental decision-making
The following is an example of a complex scenario generated from a prompt:
Prompt: "An image of a computer game showing a scene from inside a rough hewn stone cave or mine. The viewer's position is a 3rd person camera based above a player avatar looking down towards the avatar. The player avatar is a knight with a sword. In front of the knight avatar there are x3 stone arched doorways and the knight chooses to go through any one of these doors. Beyond the first and inside we can see strange green plants with glowing flowers lining that tunnel. Inside and beyond the second doorway there is a corridor of spiked iron plates riveted to the cave walls leading towards an ominous glow further along. Through the third door we can see a set of rough hewn stone steps ascending to a mysterious destination."
With these scene prompts, researchers can direct the agent to choose among:
- “Go upstairs”
- “Enter the plant area”
- “Cross the middle corridor”
Result:
Genie 2 successfully generated diverse decision paths, providing rich exploration and learning scenarios for the agent.
world generation based on diffusion models
Genie 2 is an autoregressive latent diffusion model trained on a large-scale video dataset. It generates worlds through the following key stages:
encoding and dynamic modeling
Video frames are first encoded into latent frames by an autoencoder. These latents are then fed into a large transformer dynamics model, which is trained with a causal mask similar to that used in large language models.
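The causal mask mentioned above is a generic construction from decoder-only language models, not a detail specific to Genie 2; a minimal sketch:

```python
import numpy as np

def causal_attention_mask(num_frames: int) -> np.ndarray:
    """Lower-triangular boolean mask: position t may attend only to
    positions <= t, so each frame is predicted from its past only."""
    return np.tril(np.ones((num_frames, num_frames), dtype=bool))

mask = causal_attention_mask(4)
# Frame 0 attends only to itself; frame 3 attends to frames 0..3.
```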
autoregressive sampling
Genie 2 samples autoregressively, frame by frame: each new frame is generated conditioned on the previous actions and latent frame states.
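The sampling loop can be sketched as follows. The `dynamics` callable is a stand-in for the model (in reality a diffusion denoiser produces each latent frame); only the autoregressive structure is shown:

```python
import numpy as np

def rollout(dynamics, z0, actions):
    """Autoregressive frame-by-frame sampling: each step conditions
    on the full history of latent frames and actions so far."""
    latents = [z0]
    for t in range(len(actions)):
        z_next = dynamics(latents, actions[: t + 1])
        latents.append(z_next)
    return latents

# Toy dynamics for illustration: next latent = last latent + action.
toy = lambda lats, acts: lats[-1] + acts[-1]
traj = rollout(toy, np.zeros(2), [np.ones(2), np.ones(2) * 2])
```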
classifier-free guidance
To improve the precision of action control, the model adopts classifier-free guidance.
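Classifier-free guidance combines a conditional and an unconditional model prediction; the generic formulation is sketched below (the guidance scale is illustrative, and nothing here reflects Genie 2's actual values):

```python
import numpy as np

def cfg(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: move the denoiser output from the
    unconditional prediction toward the conditioned one (here, the
    condition would be the player's action). scale = 1 recovers the
    conditional prediction; scale > 1 amplifies the conditioning."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

u = np.array([0.0, 0.0])   # unconditional prediction
c = np.array([1.0, -1.0])  # action-conditioned prediction
guided = cfg(u, c, scale=2.0)  # [2.0, -2.0]
```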
All examples shown come from the undistilled base model, to showcase Genie 2's potential capabilities. A distilled model can run in real time, but at reduced output quality.
Genie 2 achieves flexible, precise environment generation, making it an important tool for researchers exploring virtual scenes and training agents.
Easter egg
During Genie 2's generative process, some unexpected scenes also demonstrated the model's creativity:
- ghosts appear in the garden without any action input
- a character chooses parkour instead of snowboarding