ByteDance's OmniHuman-1: Generating realistic human videos from a single human image

OmniHuman-1 is an end-to-end multimodal conditional human video generation framework proposed by ByteDance. It generates realistic human videos from a single human image plus motion signals (audio, video, or a combination of both). Currently, OmniHuman-1 provides no public API or downloads; only the research paper is available.

Diverse video generation capabilities

The main features are as follows:

  • Generates videos of people in a range of styles and body framings from just a single input image and audio (apart from a few video-driven examples).
  • Covers talking, singing, and gesturing, from portrait to full-body shots, making it applicable to a wide variety of scenarios.

Core innovation points

  1. Multimodal motion conditional hybrid training strategy

  • The hybrid training strategy lets the model learn from data carrying different motion modalities (audio, video, etc.), making much better use of the available training data (a toy sketch of this idea follows this list).
  • This overcomes the scarcity of high-quality training data that limited previous end-to-end methods.
  2. More realistic video generation

  • Compared with existing methods, OmniHuman generates highly realistic human videos from weaker input signals, especially audio alone.
  • It supports input images of any aspect ratio, including portrait, half-body, and full-body shots, adapting to the needs of different scenarios.
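
To make the hybrid training idea concrete, here is a minimal, hypothetical sketch in PyTorch. It is not ByteDance's actual code: the toy model, tensor shapes, and condition names (audio, pose) are illustrative assumptions. The point is only that clips carrying different motion signals can all contribute to one training run by dropping whichever conditions a sample lacks.

    # Hypothetical sketch of mixed-condition training (not OmniHuman's real code).
    import torch
    import torch.nn as nn

    class ToyConditionalGenerator(nn.Module):
        """Stand-in for a video generator that fuses whatever conditions are present."""
        def __init__(self, dim=64):
            super().__init__()
            self.audio_proj = nn.Linear(dim, dim)   # projects audio features (assumed shape)
            self.pose_proj = nn.Linear(dim, dim)    # projects pose/video features (assumed shape)
            self.backbone = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

        def forward(self, noisy_latent, audio=None, pose=None):
            cond = torch.zeros_like(noisy_latent)
            if audio is not None:                   # use audio conditioning when available
                cond = cond + self.audio_proj(audio)
            if pose is not None:                    # use pose conditioning when available
                cond = cond + self.pose_proj(pose)
            return self.backbone(noisy_latent + cond)

    model = ToyConditionalGenerator()
    optim = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # Batches from different data sources carry different subsets of modalities,
    # yet every clip can still be used for training.
    batches = [
        {"latent": torch.randn(8, 64), "audio": torch.randn(8, 64), "pose": None},
        {"latent": torch.randn(8, 64), "audio": None, "pose": torch.randn(8, 64)},
        {"latent": torch.randn(8, 64), "audio": torch.randn(8, 64), "pose": torch.randn(8, 64)},
    ]

    for batch in batches:
        target = batch["latent"]                           # toy target for the demo
        noisy = target + 0.1 * torch.randn_like(target)    # fake "noised" input
        pred = model(noisy, audio=batch["audio"], pose=batch["pose"])
        loss = nn.functional.mse_loss(pred, target)
        optim.zero_grad()
        loss.backward()
        optim.step()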

Specific function demonstrations

Speech-driven (Talking)

  • Supports input images with any aspect ratio.
  • Keeps the gestures of the people in the generated videos naturally synchronized with the audio.
  • Works on reference samples from different sources (e.g., TED, Pexels, AIGC).

Diversity

  • Handles a wide variety of input image styles, ensuring that the movements match the unique characteristics of each style.

Half-body gestures (More Half-body Cases with Hands)

  • Further demonstrates half-body video cases with gestures, emphasizing the fluidity and realism of hand movements.

Portrait video (More Portrait Cases)

  • Shows portrait-format test results, using samples from the CelebV-HQ dataset.

Singing (Singing)

  • Handles different musical styles, and can even adapt to high-pitched songs, adjusting the movement style according to the type of music.
  • The generation quality is closely related to the quality of the reference image.

Compatible with video driving (Video Driving Compatibility)

  • In addition to audio-driven generation, OmniHuman also supports video-driven generation.
  • Audio and video driving signals can be combined to control the movements of specific body parts.

Technical architecture

OmniHuman consists of two core parts:

  1. OmniHuman model

  • Built on a Diffusion Transformer (DiT) video generation backbone.
  • Accepts audio, video, and other multimodal conditional inputs, and can integrate several modalities simultaneously for control.
  2. Omni-conditions training strategy

  • Trains in stages, gradually optimizing the model's capabilities according to the complexity of the motion-related conditions (a toy schedule is sketched after this list).
  • Leveraging large-scale multimodal data improves the model's generalization ability and the realism and stability of the generated videos.
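
The bullet above describes training that ramps up condition complexity over time. Below is a minimal, hypothetical sketch of such a staged schedule; the stage count, condition names (text, audio, pose), and keep-probabilities are made-up assumptions, not values from the paper. It only illustrates the idea that "stronger" signals such as pose are mixed in later and at lower ratios, so weaker text and audio conditions still get learned.

    # Hypothetical staged condition schedule (illustrative values only).
    import random

    STAGES = [
        # Stage 1: weak conditions only.
        {"steps": 10_000, "keep_prob": {"text": 1.0, "audio": 0.0, "pose": 0.0}},
        # Stage 2: introduce audio at a lower ratio than text.
        {"steps": 10_000, "keep_prob": {"text": 1.0, "audio": 0.5, "pose": 0.0}},
        # Stage 3: introduce pose at an even lower ratio than audio.
        {"steps": 10_000, "keep_prob": {"text": 1.0, "audio": 0.5, "pose": 0.25}},
    ]

    def sample_conditions(stage):
        """Randomly pick which condition signals accompany this training sample."""
        return {name: random.random() < p for name, p in stage["keep_prob"].items()}

    for stage in STAGES:
        for _ in range(stage["steps"]):
            active = sample_conditions(stage)
            # train_step(batch, conditions=active)  # placeholder for the real update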

This architecture ensures that OmniHuman can generate high-quality, natural, and smooth human videos under various input conditions.