
Champ - Generating animations from static human images

Today, I will share a paper from Nanjing University, Fudan University, and Alibaba: "Champ: Controllable and Consistent Human Image Animation with 3D Parametric Guidance".

Project Introduction

Champ is a human image animation method that incorporates a 3D parametric human body model into a latent diffusion framework to improve shape alignment and motion guidance over current human-generation techniques. It adopts the SMPL (Skinned Multi-Person Linear) model as the 3D parametric representation, providing a unified encoding of body shape and pose that accurately captures complex human geometry and motion characteristics from the source video.

Concretely, Champ renders depth images, normal maps, and semantic maps from SMPL sequences and combines them with skeleton-based motion guidance, enriching the conditions of the latent diffusion model with detailed 3D shape and pose attributes. A multi-layer motion fusion module with self-attention then fuses these shape and motion latent representations in the spatial domain. Because the motion guidance is expressed through a 3D parametric human model, Champ can perform parametric shape alignment between the reference image and the source-video motion.

Evaluations on benchmark datasets show that the method generates high-quality human animations that accurately capture pose and shape changes, and it also generalizes well on the authors' proposed in-the-wild dataset.
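To make the conditioning pipeline concrete, here is a minimal PyTorch sketch of how the four SMPL-derived guidance maps (depth, normal, semantic, skeleton) might each be passed through a lightweight encoder and combined before entering the diffusion model. The encoder architecture and channel sizes are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class GuidanceEncoder(nn.Module):
    """Lightweight conv encoder for one guidance condition.
    Hypothetical design: the paper's exact encoder may differ."""
    def __init__(self, in_ch=3, latent_ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(32, latent_ch, 3, stride=2, padding=1),
        )

    def forward(self, x):
        return self.net(x)

# One encoder per SMPL-derived condition.
conditions = ["depth", "normal", "semantic", "skeleton"]
encoders = nn.ModuleDict({c: GuidanceEncoder() for c in conditions})

# Dummy stand-ins for maps rendered from an SMPL sequence frame.
frames = {c: torch.randn(1, 3, 64, 64) for c in conditions}

# Encode each condition, then sum the latents as a simple placeholder
# for the fusion step (the actual fusion uses self-attention).
latents = [encoders[c](frames[c]) for c in conditions]
fused = torch.stack(latents, dim=0).sum(dim=0)
print(fused.shape)  # torch.Size([1, 64, 16, 16])
```

The key idea this illustrates is that every guidance signal is projected into a shared latent space at the same spatial resolution, so the downstream fusion module can treat them uniformly.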

Showcases

The proposed method generates temporally coherent and visually realistic human image animations from a reference image and a predefined motion sequence, using the 3D parametric human body model to improve shape alignment and motion guidance in the generated videos. It can animate a wide range of characters, including portraits with significant domain shifts, such as:

(a) A neoclassical oil painting depicting a woman wearing a white dress and fur coat.

(b) A watercolor portrait of a woman.

(c) An oil painting titled "Queen of Armenia."

In addition, it is capable of animating characters derived from text-to-image diffusion models, including the following prompts:

(d) A portrait of a woman in a yellow dress, heavy metal comic book cover art, space theme.

(e) A woman posing in a silver dress, popular on CG Society, futuristic, bright blue eyes.

(f) A realistic depiction of Aang from The Last Airbender, showcasing his mastery of all bending elements in the powerful Avatar State.

Framework

Multi-level motion conditions and their corresponding cross-attention maps.

Each set of images (top) includes the depth maps, normal maps, semantic maps, and DWpose skeletons rendered from the corresponding SMPL sequences. The images below show the corresponding cross-attention maps produced under this guidance.
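The multi-layer motion fusion module described above can be sketched as a self-attention block in which the condition latents attend to one another before being pooled. This is a hedged illustration of the mechanism, assuming one latent token per guidance map; the dimensions and layer layout are not taken from the paper.

```python
import torch
import torch.nn as nn

class MotionFusion(nn.Module):
    """Sketch of fusing multiple guidance latents via self-attention.
    Assumed design: head count and feature size are illustrative."""
    def __init__(self, dim=64):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, cond_latents):
        # cond_latents: (B, N_conditions, dim), one token per guidance map
        x = self.norm(cond_latents)
        fused, _ = self.attn(x, x, x)          # conditions attend to each other
        return (cond_latents + fused).mean(1)  # residual add, then pool

fusion = MotionFusion()
# Batch of 2, four conditions: depth, normal, semantic, skeleton.
tokens = torch.randn(2, 4, 64)
out = fusion(tokens)
print(out.shape)  # torch.Size([2, 64])
```

Letting the conditions attend to each other, rather than simply summing them, allows the model to weight complementary signals (e.g. depth vs. skeleton) differently per sample, which is consistent with the attention maps shown in the figure.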

Comparison

Qualitative comparison with state-of-the-art methods on benchmark datasets.

Qualitative comparison of animating unseen domain images.

Comparison on shape change data.