Today, let's take a look at Snap's Wonderland project, which generates three-dimensional scenes from a single image. Compared with traditional 3D reconstruction methods, it offers efficient generation, broad applicability, and high-quality 3D scene representations.
Introduction
It provides the first demonstration that the latent space of a video diffusion model can be effectively used to build a 3D reconstruction model, enabling efficient 3D scene generation.
Examples
One-shot generation of 3D scenes
Navigating autoregressively generated 3D scenes
Video generation based on camera trajectories
Scene exploration under multiple camera trajectories
Methodology
Given a single image, a camera-guided video diffusion model generates a 3D-aware video latent that follows the specified camera trajectory. This latent is then consumed by a latent-based large reconstruction model (LaLRM), which builds the 3D scene in a feed-forward manner. The video diffusion model employs a dual-branch camera modulation mechanism for precise control of camera poses, while LaLRM operates directly in the latent space, efficiently reconstructing wide-scope, high-fidelity 3D scenes. A sketch of this two-stage data flow follows below.
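To make the two-stage pipeline concrete, here is a minimal PyTorch-style sketch of the data flow: image-conditioned video latent plus camera trajectory in, 3D scene parameters out. The class names, tensor shapes, per-pixel ray-map camera encoding, and the 14-dimensional Gaussian parameterization are all illustrative assumptions for this sketch, not the published Wonderland architecture.

```python
# Hypothetical sketch of the two-stage pipeline described above.
# Module names, shapes, and parameterizations are placeholders; the
# point is the data flow: image + camera trajectory -> 3D-aware video
# latent -> feed-forward 3D scene reconstruction.
import torch
import torch.nn as nn


class CameraGuidedVideoDiffusion(nn.Module):
    """Stand-in for the camera-guided video diffusion model.

    Dual-branch camera modulation (assumed form): camera poses are
    injected both as a dense, pixel-aligned embedding (e.g., per-pixel
    ray maps) and as a global per-frame conditioning vector.
    """

    def __init__(self, latent_dim=16, cam_dim=6):
        super().__init__()
        # Branch 1: dense, pixel-aligned camera embedding.
        self.dense_cam = nn.Conv2d(cam_dim, latent_dim, kernel_size=1)
        # Branch 2: global per-frame camera conditioning vector.
        self.global_cam = nn.Linear(cam_dim, latent_dim)
        # Placeholder denoiser over the (C, T, H, W) video latent.
        self.denoiser = nn.Conv3d(latent_dim, latent_dim, 3, padding=1)

    def forward(self, noisy_latent, cam_rays, cam_pose):
        # noisy_latent: (B, C, T, H, W) video latent being denoised
        # cam_rays:     (B*T, 6, H, W) per-pixel ray maps (dense branch)
        # cam_pose:     (B, T, 6) per-frame pose vectors (global branch)
        B, C, T, H, W = noisy_latent.shape
        dense = self.dense_cam(cam_rays).view(B, T, C, H, W).permute(0, 2, 1, 3, 4)
        glob = self.global_cam(cam_pose).permute(0, 2, 1)[..., None, None]
        # Modulate the latent with both camera branches, then denoise.
        return self.denoiser(noisy_latent + dense + glob)


class LaLRM(nn.Module):
    """Stand-in for the latent-based large reconstruction model.

    Maps the 3D-aware video latent directly to per-pixel 3D Gaussian
    parameters (here assumed: position, scale, rotation, opacity,
    color = 14 dims) in one feed-forward pass, with no per-scene
    optimization loop.
    """

    def __init__(self, latent_dim=16, gaussian_dim=14):
        super().__init__()
        self.head = nn.Conv3d(latent_dim, gaussian_dim, 3, padding=1)

    def forward(self, video_latent):
        # (B, C, T, H, W) latent -> (B, 14, T, H, W) Gaussian maps
        return self.head(video_latent)


# Toy end-to-end pass with random tensors standing in for real data.
B, C, T, H, W = 1, 16, 8, 32, 32
diffusion, lalrm = CameraGuidedVideoDiffusion(C), LaLRM(C)
latent = torch.randn(B, C, T, H, W)            # noisy video latent
rays = torch.randn(B * T, 6, H, W)             # per-pixel ray maps
poses = torch.randn(B, T, 6)                   # per-frame camera poses
video_latent = diffusion(latent, rays, poses)  # stage 1: 3D-aware latent
gaussians = lalrm(video_latent)                # stage 2: 3D scene params
print(gaussians.shape)                         # torch.Size([1, 14, 8, 32, 32])
```

The key design point the sketch illustrates is that reconstruction happens in the compact latent space rather than on decoded video frames, which is what makes the feed-forward stage efficient.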