Today, we will take a deeper look at its upgraded version, EMO2, a method that can simultaneously generate highly expressive facial expressions and gestures.
Two stages
In the first stage, hand motions are generated directly from the input audio. In the second stage, a diffusion model synthesizes the video frames, integrating the hand motions generated in the first stage to produce realistic facial expressions and body movements.
This two-stage approach effectively addresses the issue of weak correlations between audio and full-body motion, significantly enhancing the realism and expressiveness of the generated videos.
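To make the pipeline concrete, here is a minimal sketch of the two-stage flow. The class names (HandMotionGenerator, VideoDiffusionModel), function names, and tensor shapes are illustrative placeholders chosen for this sketch, not EMO2's actual API or architecture.

```python
# Minimal sketch of an audio-to-video two-stage pipeline (illustrative only).
import numpy as np


class HandMotionGenerator:
    """Stage 1 (placeholder): map audio features to a hand-motion sequence."""

    def generate(self, audio_features: np.ndarray) -> np.ndarray:
        # One set of 21 2D hand keypoints per audio frame (dummy values here).
        num_frames = audio_features.shape[0]
        return np.zeros((num_frames, 21, 2))


class VideoDiffusionModel:
    """Stage 2 (placeholder): synthesize frames conditioned on the reference
    image, the audio, and the Stage-1 hand motions."""

    def synthesize(self, reference_image, audio_features, hand_motion):
        num_frames = hand_motion.shape[0]
        h, w, c = reference_image.shape
        # Return blank frames with the expected shape (stand-in for diffusion sampling).
        return np.zeros((num_frames, h, w, c))


def generate_character_video(reference_image, audio_features):
    hand_motion = HandMotionGenerator().generate(audio_features)   # Stage 1
    return VideoDiffusionModel().synthesize(                       # Stage 2
        reference_image, audio_features, hand_motion
    )


if __name__ == "__main__":
    image = np.zeros((512, 512, 3))   # single character image
    audio = np.zeros((250, 128))      # e.g. 250 audio frames x 128-dim features
    video = generate_character_video(image, audio)
    print(video.shape)                # (250, 512, 512, 3)
```

The key design point the sketch mirrors is that only the hand motion is predicted directly from audio, while the full frames are left to the conditioned diffusion stage.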
Diverse video generation
Singing
Given a single character image and an audio clip (such as a singing track), EMO2 can generate a singing video of the virtual character with rich facial expressions and diverse body postures, showcasing high expressiveness.
Speaking
EMO2 supports voice input in multiple languages and can pick up tonal variations in the audio, bringing portraits to life and generating dynamic, expressive videos of the virtual character speaking.
Hand dancing
EMO2 can generate complex and smooth hand motions, infusing vitality into virtual characters and presenting lifelike performance effects.
Role-playing
EMO2 lets a designated character perform scripted lines from movie or game scenarios, delivering performances that accurately reflect the character's established personality traits.
Methodology
The team's method draws inspiration from the similarities between human and robotic movement. As in robotics, human motion is largely driven by end-effectors (EE): hand motions (the EE) are planned according to the target context, while the rest of the body follows and adjusts around them according to inverse-kinematics principles.
This method abstracts the planning and coordination process of human movement, providing a theoretical basis for generating more natural gestures and body movements.
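As a toy illustration of the end-effector idea (not the team's actual formulation), the sketch below solves a planar two-link inverse-kinematics problem: the hand (EE) target is planned first, and the shoulder and elbow angles are then recovered to reach it. The link lengths and target coordinates are made-up values; EMO2 learns this kind of coordination rather than solving it in closed form.

```python
# Analytic inverse kinematics for a planar two-link "arm" (toy example).
import numpy as np


def two_link_ik(target_x, target_y, l1=0.3, l2=0.25):
    """Return (shoulder, elbow) angles in radians that place the hand at the target."""
    d2 = target_x**2 + target_y**2
    cos_elbow = (d2 - l1**2 - l2**2) / (2 * l1 * l2)
    if abs(cos_elbow) > 1:
        raise ValueError("target is out of reach for the given link lengths")
    elbow = np.arccos(cos_elbow)  # elbow-down solution
    shoulder = np.arctan2(target_y, target_x) - np.arctan2(
        l2 * np.sin(elbow), l1 + l2 * np.cos(elbow)
    )
    return shoulder, elbow


def forward_kinematics(shoulder, elbow, l1=0.3, l2=0.25):
    """Hand position implied by the joint angles (used here to verify the IK)."""
    x = l1 * np.cos(shoulder) + l2 * np.cos(shoulder + elbow)
    y = l1 * np.sin(shoulder) + l2 * np.sin(shoulder + elbow)
    return x, y


if __name__ == "__main__":
    # Plan the hand (end-effector) position first, then derive the arm pose.
    hand_target = (0.35, 0.20)
    shoulder, elbow = two_link_ik(*hand_target)
    print("recovered hand position:", forward_kinematics(shoulder, elbow))
```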
Comparison