DisPose is a controllable human image animation method that improves video generation through motion field guidance and keypoint correspondence. It was developed collaboratively by several universities, including Peking University, the University of Science and Technology of China, Tsinghua University, and the Hong Kong University of Science and Technology.
Introduction
Controllable human image animation aims to generate animated videos from a reference image and a driving video. Because sparse guidance such as skeleton poses provides only limited control signals, recent studies have introduced additional dense conditions (such as depth maps) to ensure motion alignment. However, when the body shape of the reference person differs significantly from that in the driving video, such strict dense guidance can degrade the quality of the generated video. This paper proposes DisPose, which aims to uncover more generalizable and effective control signals without requiring extra dense inputs. Specifically, it decouples the sparse skeleton pose in human image animation into motion field guidance and keypoint correspondence.
Concretely, DisPose generates a dense motion field from the sparse motion field and the reference image, providing region-level guidance while preserving the generalization ability of sparse pose control. In addition, DisPose extracts diffusion features corresponding to pose keypoints from the reference image and transfers these point features to the target pose to provide distinctive identity information. To integrate smoothly into existing models, DisPose proposes a plug-and-play hybrid ControlNet that improves the quality and consistency of generated videos while keeping the parameters of the existing model frozen. Extensive qualitative and quantitative experiments demonstrate that DisPose offers significant advantages over current methods.
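To make the keypoint-correspondence idea concrete, the sketch below samples diffusion features at the reference keypoints and places them at the corresponding target-pose locations. It is a minimal illustration under assumed shapes and a normalized-coordinate convention; the function name `transfer_point_features` and the scatter-to-nearest-pixel step are assumptions, not the paper's implementation.

```python
# Minimal sketch of keypoint-correspondence transfer (illustrative, not the
# official DisPose code). Shapes and the [0, 1] coordinate convention are assumptions.
import torch
import torch.nn.functional as F

def transfer_point_features(ref_feats, ref_kpts, tgt_kpts, out_size):
    """Sample diffusion features at reference keypoints and place them at the
    corresponding target-pose keypoints.

    ref_feats: (1, C, H, W)   diffusion (e.g. UNet) features of the reference image
    ref_kpts:  (K, 2)         keypoint (x, y) coordinates in [0, 1] on the reference image
    tgt_kpts:  (K, 2)         keypoint (x, y) coordinates in [0, 1] on the target pose
    out_size:  (H_out, W_out) spatial size of the returned point-embedding map
    """
    _, C, _, _ = ref_feats.shape
    H_out, W_out = out_size

    # grid_sample expects coordinates in [-1, 1]; sample one feature vector per keypoint.
    grid = (ref_kpts * 2 - 1).view(1, 1, -1, 2)                       # (1, 1, K, 2)
    point_feats = F.grid_sample(ref_feats, grid, align_corners=True)  # (1, C, 1, K)
    point_feats = point_feats.view(C, -1)                             # (C, K)

    # Scatter each sampled feature onto the nearest pixel of its target keypoint.
    point_map = torch.zeros(1, C, H_out, W_out)
    xs = (tgt_kpts[:, 0] * (W_out - 1)).round().long().clamp(0, W_out - 1)
    ys = (tgt_kpts[:, 1] * (H_out - 1)).round().long().clamp(0, H_out - 1)
    for k in range(point_feats.shape[1]):
        point_map[0, :, ys[k], xs[k]] = point_feats[:, k]
    return point_map
```

The resulting point-embedding map carries identity-specific features at the target keypoint locations and can be consumed by a conditioning branch alongside the motion field.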
Examples
Technical Framework
DisPose is a plug-and-play guidance module that extracts robust control signals from only the skeleton pose map and the reference image, without requiring additional dense inputs. Specifically, it decouples pose guidance into motion field estimation and keypoint correspondence; a sparse-motion-field sketch follows below.
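As a rough illustration of the motion-field half, a sparse motion field can be formed by rasterizing the displacement of each driving keypoint relative to its reference keypoint. The sketch below is an illustrative approximation under assumed normalized coordinates and output resolution, not DisPose's exact formulation.

```python
# Minimal sketch of a sparse motion field built from skeleton keypoints
# (illustrative assumption, not the paper's exact formulation).
import torch

def sparse_motion_field(ref_kpts, drv_kpts, size):
    """Rasterize per-keypoint displacements into a sparse 2-channel flow map.

    ref_kpts: (K, 2) keypoints of the reference pose, (x, y) in [0, 1]
    drv_kpts: (K, 2) keypoints of one driving-video frame, (x, y) in [0, 1]
    size:     (H, W) resolution of the output flow map
    """
    H, W = size
    flow = torch.zeros(2, H, W)          # (dx, dy) per pixel; zero where no keypoint lies
    disp = drv_kpts - ref_kpts           # displacement of each keypoint
    xs = (ref_kpts[:, 0] * (W - 1)).round().long().clamp(0, W - 1)
    ys = (ref_kpts[:, 1] * (H - 1)).round().long().clamp(0, H - 1)
    for k in range(ref_kpts.shape[0]):
        flow[:, ys[k], xs[k]] = disp[k]  # only keypoint pixels carry motion
    return flow
```

Such a sparse field covers only the keypoint pixels, which is why the method then propagates it into a dense, region-level field conditioned on the reference image, as described next.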
DisPose first computes a sparse motion field from the skeleton poses and then introduces a reference-based dense motion field, which provides region-level motion signals via conditional motion propagation on the reference image. To enhance appearance consistency, it extracts diffusion features corresponding to the keypoints of the reference image and transfers these point features to the target pose by computing multi-scale point correspondences along the motion trajectories. Architecturally, these decoupled control signals are implemented in a ControlNet-like manner and integrated into existing methods: the motion fields and point embeddings are injected into the latent video diffusion model, yielding accurate human image animations.
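The injection step can be pictured as a ControlNet-style side branch whose multi-scale outputs are added to the features of the frozen denoising UNet. The sketch below is a hedged approximation: the module name `HybridControlNet`, the channel widths, and the zero-initialized output convolutions are assumptions modeled on common ControlNet practice, not the released DisPose architecture.

```python
# Minimal sketch of ControlNet-style injection of motion-field and point-embedding
# conditions (assumed module names, widths, and shapes; not the official code).
import torch
import torch.nn as nn

class HybridControlNet(nn.Module):
    """Encode the dense motion field and point-embedding map into multi-scale
    residuals to be added to the corresponding (frozen) diffusion UNet blocks."""

    def __init__(self, motion_ch=2, point_ch=320, widths=(320, 640, 1280)):
        super().__init__()
        self.stem = nn.Conv2d(motion_ch + point_ch, widths[0], 3, padding=1)
        self.blocks = nn.ModuleList()
        in_ch = widths[0]
        for w in widths:
            self.blocks.append(nn.Sequential(
                nn.Conv2d(in_ch, w, 3, stride=2, padding=1), nn.SiLU(),
                nn.Conv2d(w, w, 3, padding=1), nn.SiLU(),
            ))
            in_ch = w
        # Zero-initialized projections so the side branch starts as a no-op.
        self.zero_convs = nn.ModuleList(nn.Conv2d(w, w, 1) for w in widths)
        for conv in self.zero_convs:
            nn.init.zeros_(conv.weight)
            nn.init.zeros_(conv.bias)

    def forward(self, motion_field, point_map):
        # motion_field: (B, 2, H, W); point_map: (B, point_ch, H, W), same spatial size.
        x = self.stem(torch.cat([motion_field, point_map], dim=1))
        residuals = []
        for block, zero_conv in zip(self.blocks, self.zero_convs):
            x = block(x)
            residuals.append(zero_conv(x))   # added to the matching UNet block's features
        return residuals
```

Zero-initializing the output projections is a standard ControlNet technique: at the start of training the side branch contributes nothing, so the frozen base model's behavior is preserved while the new guidance is learned gradually.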