Today I came across a voice-driven avatar project from ByteDance: INFP.
This INFP is not the MBTI personality type. The name describes a framework with four characteristics: Interactive (strong interactivity), Natural (natural and smooth), Flash (instant response), and Person-generic (high generality).
INFP is a voice-driven interactive avatar generation framework for two-person dialogue scenarios. Given the dual-track (stereo) audio of a conversation and a single portrait image of an agent, the framework dynamically generates videos with lifelike facial expressions and rhythmic head movements, covering both speaking and non-speaking interaction. INFP combines a lightweight design with high performance, making it well suited to real-time communication scenarios such as video conferencing.
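To make that input/output contract concrete, here is a minimal sketch of how dual-track dialogue audio could be aligned with output video frames. The sample rate, frame rate, and function name are my own assumptions for illustration, not part of the INFP release.

```python
# Illustrative only: align two mono dialogue tracks with video frames.
# 16 kHz audio and 25 fps video are assumed values, not INFP's actual config.
import numpy as np

FPS = 25
SAMPLE_RATE = 16_000
SAMPLES_PER_FRAME = SAMPLE_RATE // FPS  # 640 audio samples per video frame

def align_dialogue_audio(agent_track: np.ndarray,
                         partner_track: np.ndarray) -> np.ndarray:
    """Stack the two mono dialogue tracks into frame-aligned windows.

    Returns an array of shape (num_frames, 2, SAMPLES_PER_FRAME): for each
    output video frame, the audio window of the agent and of the partner.
    """
    n = min(len(agent_track), len(partner_track))
    num_frames = n // SAMPLES_PER_FRAME
    usable = num_frames * SAMPLES_PER_FRAME
    stacked = np.stack([agent_track[:usable], partner_track[:usable]])  # (2, usable)
    return stacked.reshape(2, num_frames, SAMPLES_PER_FRAME).transpose(1, 0, 2)

# Example: 2 seconds of dialogue yields 50 frame-aligned audio windows.
agent = np.random.randn(2 * SAMPLE_RATE)
partner = np.random.randn(2 * SAMPLE_RATE)
print(align_dialogue_audio(agent, partner).shape)  # (50, 2, 640)
```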
Method
INFP dynamically switches the agent between "speaking" and "listening" states based on the input two-person dialogue audio. This is achieved in two key stages:
In the first stage, the model learns facial communicative behaviors from real dialogue videos and projects them into a low-dimensional motion latent space, whose motion latent codes can dynamically animate a static image. In the second stage, the model learns to map the input two-person dialogue audio to these motion latent codes through denoising, thereby achieving speech-driven avatar generation in interactive scenarios.
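The following PyTorch sketch mirrors that two-stage structure: a small autoencoder standing in for the motion latent space, and a denoiser that predicts noise on motion latents conditioned on both audio tracks. Every architecture, dimension, and the linear noising schedule here is invented for illustration; INFP's actual models are not public in this form.

```python
# A toy two-stage setup, assumed shapes throughout. Not the INFP architecture.
import torch
import torch.nn as nn

MOTION_DIM = 64  # assumed size of the low-dimensional motion latent space

class MotionAutoencoder(nn.Module):
    """Stage 1 stand-in: compress per-frame facial motion into latent codes
    that can later animate a static portrait."""
    def __init__(self, motion_feat_dim: int = 256):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(motion_feat_dim, 128), nn.ReLU(),
                                    nn.Linear(128, MOTION_DIM))
        self.decode = nn.Sequential(nn.Linear(MOTION_DIM, 128), nn.ReLU(),
                                    nn.Linear(128, motion_feat_dim))

    def forward(self, motion_feats: torch.Tensor) -> torch.Tensor:
        return self.decode(self.encode(motion_feats))

class AudioToMotionDenoiser(nn.Module):
    """Stage 2 stand-in: predict the noise added to motion latents,
    conditioned on both dialogue audio tracks and the diffusion timestep."""
    def __init__(self, audio_feat_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(MOTION_DIM + 2 * audio_feat_dim + 1, 256), nn.ReLU(),
            nn.Linear(256, MOTION_DIM))

    def forward(self, noisy_latent, agent_audio, partner_audio, t):
        x = torch.cat([noisy_latent, agent_audio, partner_audio, t], dim=-1)
        return self.net(x)

# One illustrative denoising training step.
ae = MotionAutoencoder()
denoiser = AudioToMotionDenoiser()
motion_feats = torch.randn(8, 256)                  # per-frame motion features
latent = ae.encode(motion_feats).detach()           # stage 1 assumed frozen here
noise = torch.randn_like(latent)
t = torch.rand(8, 1)                                # diffusion timestep in [0, 1]
noisy = (1 - t) * latent + t * noise                # simple linear noising schedule
pred = denoiser(noisy, torch.randn(8, 128), torch.randn(8, 128), t)
loss = nn.functional.mse_loss(pred, noise)
loss.backward()
```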
Illustration
Existing interactive avatar generation methods (left figure) require manually assigning roles and explicitly switching role states. In contrast, INFP (right figure) is a unified framework that can dynamically and naturally adapt to various dialogue states.
Examples
Motion diversity
Given the same reference image, INFP generates different, well-matched motions for different audio inputs.
Out-of-distribution support
It also generates realistic expressions for non-human avatars and side-profile portraits.
Real-time communication
Leveraging INFP's ultra-fast inference (over 40 fps on an Nvidia Tesla A10), the method enables real-time agent-to-agent communication as well as human-agent interaction.
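For intuition about what "over 40 fps" buys, here is a toy real-time driving loop that pays a fixed per-frame budget; generate_frame is a hypothetical stand-in for the model call, not an INFP API.

```python
# Illustrative pacing loop: emit frames at a steady 40 fps as long as
# inference stays under the ~25 ms budget.
import time

TARGET_FPS = 40
FRAME_BUDGET = 1.0 / TARGET_FPS  # ~25 ms per frame

def generate_frame(audio_window):
    """Placeholder for the per-frame model inference."""
    time.sleep(0.01)  # simulate ~10 ms of model time
    return b"frame"

def stream(audio_windows):
    next_deadline = time.perf_counter()
    for window in audio_windows:
        frame = generate_frame(window)
        next_deadline += FRAME_BUDGET
        # Sleep off any slack so frames come out at a steady 40 fps;
        # if inference overruns the budget, real-time playback breaks.
        slack = next_deadline - time.perf_counter()
        if slack > 0:
            time.sleep(slack)
        yield frame

for _ in stream(range(5)):
    pass
```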
Comparison with SOTA methods
Interactive avatar generation
Unlike existing methods that require manually switching between the "listener" and "speaker" roles, INFP dynamically adapts to different dialogue states, achieving smoother and more natural performances.
Natural adaptation for related tasks
INFP can effortlessly adapt to related tasks such as talking avatar generation or listening avatar generation without any modifications, as sketched below.
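One plausible way a unified dyadic model could cover both single-role tasks is to silence one of the two audio tracks; this is my assumption about how the adaptation might work, not something the project confirms.

```python
# Hypothetical reduction of the two-track interface to single-role tasks.
import numpy as np

def make_dyadic_input(agent_track, partner_track, mode: str = "interactive"):
    """Return (agent, partner) audio tracks for the requested task."""
    silence = np.zeros_like(agent_track)
    if mode == "talking":      # agent speaks, partner is silent
        return agent_track, silence
    if mode == "listening":    # agent only reacts to the partner's speech
        return silence, partner_track
    return agent_track, partner_track  # full two-person dialogue

agent = np.random.randn(16_000)
partner = np.random.randn(16_000)
a, p = make_dyadic_input(agent, partner, mode="listening")
print(a.max() == 0.0)  # agent track muted -> pure listening behavior
```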
Comparison with other lip-sync methods
Talking avatar generation
INFP achieves highly accurate lip sync, and it also supports singing video generation and multilingual generation.
Listening avatar generation
It generates high-fidelity, natural facial behaviors.