Microsoft's TRELLIS: A high-quality 3D asset generation model

Microsoft recently proposed TRELLIS, a generative method for creating high-quality 3D assets. It is built on a unified Structured LATent (SLAT) representation and Rectified Flow Transformers, which together make 3D generation flexible and efficient.

Core of the paper

Unified Structured LATent Representation (SLAT):

SLAT combines a sparsely populated 3D grid with dense multi-view features extracted from a vision foundation model.
Captures both geometric structure and texture information.
Decodes flexibly into different 3D formats as needed, including Radiance Fields, 3D Gaussians, and meshes.

Powerful generative model architecture:

Uses a Rectified Flow Transformer specifically designed for SLAT as the core model.
Trained on a large-scale dataset of about 500,000 diverse 3D objects, with models of up to 2 billion parameters.

Flexible generation and editing capabilities:

Generates high-quality 3D assets from text or image prompts; the paper reports results that significantly surpass existing methods.
Offers a choice of output formats and local 3D editing, which the paper says previous models did not provide.

Innovative application scenarios:

Generated 3D assets can be used for complex artistic designs, asset variant generation, and precise manipulation of local areas.

Key features and demonstrations

Text-to-3D asset generation

Image-to-3D asset generation

Asset variant generation

Local area manipulation

Method overview: SLAT and TRELLIS

Structured LATent Representation (SLAT)

SLAT combines sparse structures with visual representations:

Defines local latent variables on the active voxels that intersect the object surface.
Combines these with dense image features extracted from multi-view renderings by a powerful pre-trained vision encoder.
The active voxels provide the coarse geometry, while the visual features capture fine geometry and texture detail.
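
To make the idea concrete, here is a minimal Python sketch of what a structured latent could look like as plain data: a set of active voxel coordinates, each carrying its own local latent vector. The class name `StructuredLatent`, the grid resolution of 64, and the latent width of 8 are illustrative assumptions for this post, not values or names taken from the TRELLIS code.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Voxel = Tuple[int, int, int]


@dataclass
class StructuredLatent:
    """Toy SLAT-like container: latents are stored only for active voxels."""

    resolution: int   # grid size per axis (assumed value)
    latent_dim: int   # channels per local latent (assumed value)
    latents: Dict[Voxel, List[float]] = field(default_factory=dict)

    def set_latent(self, voxel: Voxel, z: List[float]) -> None:
        """Mark a voxel as active and attach its local latent vector."""
        if not all(0 <= c < self.resolution for c in voxel):
            raise ValueError(f"voxel {voxel} is outside the grid")
        if len(z) != self.latent_dim:
            raise ValueError("latent has the wrong number of channels")
        self.latents[voxel] = list(z)

    def active_voxels(self) -> List[Voxel]:
        """Coordinates of active voxels: the coarse geometry."""
        return sorted(self.latents)

    def occupancy(self) -> float:
        """Fraction of the full grid that is active."""
        return len(self.latents) / self.resolution ** 3


slat = StructuredLatent(resolution=64, latent_dim=8)
slat.set_latent((10, 20, 30), [0.0] * 8)
slat.set_latent((10, 20, 31), [0.1] * 8)
```

Because only surface voxels are stored, the container stays small even though the full grid has 64^3 cells. That sparsity is the reason the backbone has to handle variable-length sets of voxels rather than a dense volume.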

TRELLIS model architecture

Two-stage generation pipeline:

Stage 1 generates the sparse structure of SLAT, that is, which voxels are active.
Stage 2 generates the latent variables for those non-empty cells.
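
The two stages can be sketched as a simple function composition. Everything below is a stand-in: the function names are hypothetical, stage 1 returns a hard-coded spherical shell instead of a predicted structure, and stage 2 draws random latents instead of running a model. The only point is the data flow, where stage 2 sees just the voxels that stage 1 marked active.

```python
import random
from typing import Dict, List, Tuple

Voxel = Tuple[int, int, int]


def generate_sparse_structure(resolution: int) -> List[Voxel]:
    """Stage 1 stand-in: decide which voxels are active.

    A real model would predict this from the text or image prompt;
    a thin spherical shell is used here purely as placeholder geometry.
    """
    center = (resolution - 1) / 2
    radius = resolution / 3
    active = []
    for x in range(resolution):
        for y in range(resolution):
            for z in range(resolution):
                d = ((x - center) ** 2 + (y - center) ** 2 + (z - center) ** 2) ** 0.5
                if abs(d - radius) < 0.5:
                    active.append((x, y, z))
    return active


def generate_local_latents(
    active: List[Voxel], latent_dim: int, rng: random.Random
) -> Dict[Voxel, List[float]]:
    """Stage 2 stand-in: produce one latent vector per active voxel."""
    return {v: [rng.gauss(0.0, 1.0) for _ in range(latent_dim)] for v in active}


def generate_slat(resolution: int = 16, latent_dim: int = 8, seed: int = 0):
    """Run both stages in order; empty voxels never reach stage 2."""
    rng = random.Random(seed)
    active = generate_sparse_structure(resolution)
    return generate_local_latents(active, latent_dim, rng)


slat = generate_slat()
```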

Rectified Flow Transformer:

Serves as the backbone model, adapted to handle the sparsity of SLAT.
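
For intuition about the rectified flow part, here is a toy Euler sampler. It assumes the common linear-interpolation formulation, x(t) = (1 - t) * x0 + t * eps, where the model learns the velocity eps - x0 and sampling integrates from t = 1 (pure noise) down to t = 0 (data). The transformer is replaced by a closed-form velocity field for a "dataset" of one point, so `make_toy_velocity` and `euler_sample` are illustrative names, not part of TRELLIS.

```python
import random
from typing import Callable, List

Vector = List[float]


def euler_sample(
    velocity: Callable[[Vector, float], Vector], noise: Vector, steps: int = 10
) -> Vector:
    """Integrate dx/dt = v(x, t) from t = 1 (noise) down to t = 0 (data)."""
    x = list(noise)
    dt = 1.0 / steps
    for k in range(steps):
        t = (steps - k) / steps          # current time, from 1 down to dt
        v = velocity(x, t)
        x = [xi - dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target: Vector) -> Callable[[Vector, float], Vector]:
    """Exact velocity field when the data distribution is a single point.

    From x(t) = (1 - t) * x0 + t * eps it follows that eps - x0 = (x - x0) / t.
    A trained network would take the place of this closed form.
    """
    def velocity(x: Vector, t: float) -> Vector:
        return [(xi - x0i) / t for xi, x0i in zip(x, target)]

    return velocity


target = [0.5, -1.0, 2.0]
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in target]
sample = euler_sample(make_toy_velocity(target), noise, steps=10)
```

In this toy case the path from noise to data is a straight line, so a handful of Euler steps lands on the target. Straight paths that are cheap to integrate are the property rectified flow aims for, although a learned velocity field will only approximate it.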

Multi-format output and editing:

Maps SLAT into high-quality 3D representations through different decoders to meet diverse requirements.

Applications

I tried the demo on Hugging Face, and the results are decent. For commercial use, however, controllability still falls short.