Microsoft's TRELLIS: A high-quality 3D asset generation model

Microsoft has recently proposed a generative method for creating high-quality 3D assets, based on a unified Structured LATent (SLAT) representation and Rectified Flow Transformers, achieving flexible and efficient 3D generation.

Key points of the paper

  1. Unified Structured LATent Representation (SLAT)

  • SLAT combines a sparse 3D grid with dense multi-view features extracted from vision foundation models.
  • Captures geometric structure and textural information, supporting multiple decoding formats including Radiance Fields, 3D Gaussians, and Meshes.
  • Provides flexible decoding capabilities to output diverse 3D formats according to different needs.
  2. Powerful generative model architecture

  • Uses a Rectified Flow Transformer tailored to SLAT as the core model.
  • Scales up to 2 billion parameters and is trained on a large-scale dataset of about 500,000 diverse 3D objects.

  3. Flexible generation and editing capabilities

  • Generates high-quality 3D assets from text or image inputs; the paper reports that it significantly outperforms existing methods.
  • Offers a choice of output formats and local 3D editing, which the paper says earlier models did not provide.

  4. Innovative application scenarios

  • Generated 3D assets can be used for complex artistic designs, asset variant generation, and precise manipulation of local areas.
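For readers unfamiliar with rectified flow, here is the standard textbook formulation, given as a general sketch rather than the paper's exact notation. The symbols are assumptions introduced for illustration: x_0 is a data sample, \epsilon is Gaussian noise, and v_\theta is the velocity field predicted by the transformer. The model is trained to predict the constant velocity along a straight line between data and noise:

```latex
\begin{aligned}
x(t) &= (1 - t)\,x_0 + t\,\epsilon, \qquad t \in [0, 1],\quad \epsilon \sim \mathcal{N}(0, I) \\
\mathcal{L}(\theta) &= \mathbb{E}_{t,\,x_0,\,\epsilon}\,
  \bigl\lVert v_\theta\bigl(x(t), t\bigr) - (\epsilon - x_0) \bigr\rVert_2^2
\end{aligned}
```

Sampling then starts from pure noise at t = 1 and integrates dx/dt = v_\theta(x, t) back to t = 0.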

    Key features and demonstrations

    • Text-to-3D asset generation
    • Image-to-3D asset generation
    • Asset variant generation
    • Local area manipulation

    Method overview: SLAT and TRELLIS

    Structured LATent Representation (SLAT)

    SLAT combines sparse structures with visual representations:

    • Defines local latent variables on active voxels intersecting the object surface.
    • Fuses in dense image features that a powerful pre-trained vision encoder extracts from multi-view renderings of the asset.
    • Active voxels provide coarse geometry, while visual features capture fine geometry and texture details.
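To make the idea of "latents on active voxels" concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the names `StructuredLatent` and `build_slat` are hypothetical, the feature vectors are stand-ins for real encoder output, and simple per-voxel averaging is used only to show the data layout. How the actual model derives its latents from those features is not covered here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence, Tuple

Voxel = Tuple[int, int, int]


@dataclass
class StructuredLatent:
    """Hypothetical container: one latent vector per active voxel."""
    resolution: int
    latents: Dict[Voxel, List[float]] = field(default_factory=dict)

    @property
    def active_voxels(self) -> List[Voxel]:
        # Only voxels touched by the surface are stored at all.
        return sorted(self.latents)


def build_slat(points: Sequence[Sequence[float]],
               features: Sequence[Sequence[float]],
               resolution: int) -> StructuredLatent:
    """Average per-point features into the voxels hit by surface points.

    points:   surface samples, coordinates assumed to lie in [0, 1]
    features: one feature vector per point (stand-in for encoder output)
    """
    sums: Dict[Voxel, List[float]] = {}
    counts: Dict[Voxel, int] = {}
    for p, f in zip(points, features):
        # Clamp so a coordinate of exactly 1.0 maps to the last voxel.
        v = tuple(min(int(c * resolution), resolution - 1) for c in p)
        if v not in sums:
            sums[v] = [0.0] * len(f)
            counts[v] = 0
        sums[v] = [a + b for a, b in zip(sums[v], f)]
        counts[v] += 1
    latents = {v: [s / counts[v] for s in sums[v]] for v in sums}
    return StructuredLatent(resolution, latents)
```

The active voxel set alone gives a coarse shape, and the vector attached to each voxel carries the finer detail. Storage grows with the number of surface voxels rather than with the full resolution³ grid.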

    TRELLIS model architecture

    1. Two-stage generation pipeline

    • Stage 1 generates the sparse structure of SLAT, i.e. which voxels are active.
    • Stage 2 generates the latent variables for those non-empty voxels.

    2. Rectified Flow Transformer

    • Serves as the backbone model and is adapted to SLAT's sparsity.

    3. Multi-format output and editing

    • Maps SLAT into high-quality 3D representations through different decoders to meet diverse requirements.
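The two-stage flow described above can be sketched in a few lines of Python. This is a toy under stated assumptions, not the TRELLIS implementation: `euler_sample`, `toy_velocity`, and `generate` are hypothetical names, the trained transformers are replaced by a closed-form velocity toward a fixed target, and text or image conditioning is left out. It only shows the control flow: first decide which voxels are active, then sample latents for those voxels alone.

```python
import random
from typing import Callable, Dict, List, Tuple

Voxel = Tuple[int, int, int]
Velocity = Callable[[List[float], float], List[float]]


def euler_sample(velocity: Velocity, noise: List[float],
                 steps: int = 20) -> List[float]:
    """Integrate dx/dt = velocity(x, t) from t = 1 (noise) to t = 0 (data)."""
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt
        v = velocity(x, t)
        x = [xi - dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(target: List[float]) -> Velocity:
    """Stand-in for a trained network: exact velocity toward a fixed target."""
    def velocity(x: List[float], t: float) -> List[float]:
        # On the straight path x(t) = (1 - t) * target + t * noise,
        # the velocity (noise - target) equals (x - target) / t.
        return [(xi - ti) / t for xi, ti in zip(x, target)]
    return velocity


def generate(grid: int, latent_dim: int,
             seed: int = 0) -> Dict[Voxel, List[float]]:
    rng = random.Random(seed)
    cells = [(x, y, z) for x in range(grid)
             for y in range(grid) for z in range(grid)]

    # Stage 1: sample an occupancy value per cell, then threshold it.
    # Toy target: only the main diagonal of the grid is occupied.
    occ_target = [1.0 if x == y == z else 0.0 for (x, y, z) in cells]
    occ_noise = [rng.gauss(0, 1) for _ in cells]
    occ = euler_sample(toy_velocity(occ_target), occ_noise)
    active = [c for c, o in zip(cells, occ) if o > 0.5]

    # Stage 2: sample a latent vector for each active voxel only.
    latents: Dict[Voxel, List[float]] = {}
    for c in active:
        target = [float(sum(c))] * latent_dim  # arbitrary toy target
        noise = [rng.gauss(0, 1) for _ in range(latent_dim)]
        latents[c] = euler_sample(toy_velocity(target), noise)
    return latents
```

Calling `generate(grid=4, latent_dim=3)` returns a dictionary keyed by the active voxels. A format-specific decoder would then turn that dictionary into Gaussians, a radiance field, or a mesh.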

    Applications

    I tried the demo on Hugging Face, and the results are decent. For commercial use, however, the controllability still falls short.