ByteDance's INFP - Lip-sync video generation for dyadic dialogue scenarios

Today, I came across a voice-driven avatar project: ByteDance's INFP.

This INFP is not the MBTI personality type. Here the name describes the framework's characteristics: Interactive (strongly interactive), Natural (natural and fluid), Flash (instant response), and Person-generic (works for any identity).

INFP is an audio-driven interactive avatar generation framework for two-person dialogue scenarios. Given the stereo audio of a dyadic conversation and a single portrait image of either agent, it dynamically generates video with lifelike facial expressions and rhythmic head movements, covering both verbal and non-verbal interaction. INFP pairs a lightweight design with strong performance, making it well suited to real-time communication scenarios such as video conferencing.
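The project describes only the inputs and outputs, so here is a minimal sketch of that contract. All names, shapes, the channel split, and the 16 kHz sample rate are illustrative assumptions, not an official API:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class DyadicInput:
    # Field names and the per-speaker channel split are assumptions.
    agent_audio: np.ndarray    # the agent's own channel of the stereo dialogue
    partner_audio: np.ndarray  # the conversation partner's channel
    portrait: np.ndarray       # a single RGB portrait image, shape (H, W, 3)

def generate_video(x: DyadicInput, fps: int = 25,
                   sample_rate: int = 16_000) -> np.ndarray:
    """Placeholder generator: a real model would return (T, H, W, 3) frames
    whose expressions and head motion follow both audio tracks."""
    n_frames = int(len(x.agent_audio) / sample_rate * fps)
    h, w, _ = x.portrait.shape
    return np.zeros((n_frames, h, w, 3), dtype=np.uint8)

# 4 seconds of silence on both channels plus a dummy 256x256 portrait
audio = np.zeros(4 * 16_000, dtype=np.float32)
video = generate_video(DyadicInput(audio, audio, np.zeros((256, 256, 3), np.uint8)))
print(video.shape)  # (100, 256, 256, 3)
```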

Method

INFP dynamically switches the agent between "speaking" and "listening" states based on the two-person dialogue audio. It does this in two key stages:

  • In the first stage, the model learns facial communicative behaviors from real conversation videos, projects them into a low-dimensional motion latent space, and uses the motion latent codes to animate a static image.
  • In the second stage, the model learns to map the dyadic dialogue audio to those motion latent codes through denoising, enabling audio-driven avatar generation in interactive scenarios. A rough sketch of both stages follows this list.

Illustration

Existing interactive avatar generation methods (left) require manually assigning roles and explicitly switching role states. In contrast, INFP (right) is a unified framework that adapts dynamically and naturally to different dialogue states.

Examples

Motion diversity

  • Given the same reference image, INFP generates motion that fits each different audio input.

Out-of-distribution support

  • Supports generating realistic expressions for non-human avatars and side-profile portraits.

Instant messaging

  • Leveraging INFP's fast inference (over 40 fps on an Nvidia Tesla A10), the method enables real-time agent-to-agent communication as well as human-agent interaction; the timing sketch below shows what that frame budget implies.
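At 40 fps, each frame has a budget of 1/40 s = 25 ms. Below is a generic pacing check (plain Python, not INFP code) that verifies whether a frame generator keeps up with real time; the 5 ms sleep merely stands in for model inference:

```python
import time

FPS = 40                  # throughput reported for the Nvidia Tesla A10
FRAME_BUDGET = 1.0 / FPS  # 25 ms per frame for real-time playback

def run_realtime(generate_frame, n_frames=100):
    """Real time holds only if generate_frame() averages under FRAME_BUDGET."""
    start = time.perf_counter()
    for _ in range(n_frames):
        generate_frame()
    achieved = n_frames / (time.perf_counter() - start)
    print(f"achieved {achieved:.1f} fps (need >= {FPS} for real time)")

run_realtime(lambda: time.sleep(0.005))  # stand-in for model inference
```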

Comparison with SOTA methods

Interactive avatar generation

  • Unlike existing methods that require manually switching between the "listener" and "speaker" roles, INFP adapts dynamically to different states, producing smoother and more natural performances.

Natural adaptation for related tasks

  • INFP adapts effortlessly to related tasks such as talking avatar generation or listening avatar generation, without any modification.

Comparison with other Lip-Sync methods

Talking Avatar Generation

  • Achieves highly accurate lip sync.
  • Supports singing video generation.
  • Supports multilingual generation.

Avatar generation with listening capability

  • Generates high-fidelity, natural facial behaviors.