OmniHuman-1 is an end-to-end multimodal conditional human video generation framework proposed by ByteDance. It can generate realistic human videos from a single human image and motion signals such as audio, video, or a combination of both. At present, OmniHuman-1 offers no public API or download options; only the research paper is available.
Diverse video generation capabilities
The main features are as follows:
Using only a single input image and audio (except for some video-driven examples), it can generate human videos in a wide range of styles and body proportions, covering a variety of application scenarios.
Core innovation points
Hybrid training strategy with multimodal motion conditions:
By mixing motion conditions from different modalities (audio, video, etc.) during training, the model makes much better use of the available data. This overcomes the scarcity of high-quality training data that limited previous end-to-end methods.
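As a rough illustration of the data-utilization point, the sketch below shows how a mixed training corpus can keep samples whose condition signals differ instead of discarding every clip that lacks, say, clean audio. The record format, field names, and file paths are hypothetical, chosen only to make the idea concrete.

```python
# Hypothetical training records: each clip keeps whatever condition signals it
# actually has (field names and file paths are illustrative, not from the paper).
corpus = [
    {"video": "clip_001.mp4", "audio": "clip_001.wav", "pose": None,           "text": "a person giving a talk"},
    {"video": "clip_002.mp4", "audio": None,           "pose": "clip_002.npz", "text": "a dancer on stage"},
    {"video": "clip_003.mp4", "audio": "clip_003.wav", "pose": "clip_003.npz", "text": None},
]

def available_conditions(sample):
    """List the condition modalities this sample can contribute during training."""
    return [k for k in ("audio", "pose", "text") if sample[k] is not None]

for sample in corpus:
    print(sample["video"], "->", available_conditions(sample))
# A strict audio-only pipeline would discard clip_002; mixed-condition training
# still uses it for pose- and text-conditioned supervision.
```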
More realistic video generation:
Compared with existing methods, OmniHuman generates noticeably more realistic human videos from weak input signals, especially audio alone. It supports input images of any aspect ratio, including portrait, half-body, and full-body shots, adapting to the needs of different scenarios.
Feature demonstrations
Speech-driven (Talking)
It supports input images of any aspect ratio and makes the gestures of people in the generated videos synchronize more naturally with the audio; the demo inputs come from sources such as TED, Pexels, and AIGC images.
Diversity
It handles diverse input styles, including cartoons, animals, and artificial objects, ensuring that the movements match the unique characteristics of each style.
Half-body gestures (More Half-body Cases with Hands)
Further demonstrates half-body video cases with gestures, emphasizing the fluidity and realism of hand movements.
Portrait video (More Portrait Cases)
Shows test results for portrait-format inputs, with experiments conducted on samples from the CelebV-HQ dataset.
Singing
OmniHuman accommodates different music styles and body poses, can even adapt to high-pitched songs, and adjusts the movement style to match different types of music. Generation quality is closely tied to the quality of the reference image.
Compatible with video driving (Video Driving Compatibility)
Because it is trained with mixed motion conditions, OmniHuman supports not only audio driving but also video driving to imitate specific actions, and can combine audio and video signals to control the movements of specific body parts.
Technical architecture
OmniHuman consists of two core parts:
OmniHuman model
Built on a DiT (diffusion transformer) architecture, the model accepts text, audio, pose, and other multi-modal conditional inputs, and can integrate several modalities simultaneously for control.
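To make this concrete, here is a minimal, hypothetical sketch of how several condition modalities could be fed to a single transformer block. The module layout, the fusion of condition tokens via cross-attention, and all dimensions are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class OmniConditionBlock(nn.Module):
    """Toy DiT-style block: video tokens attend to fused condition tokens."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, video_tokens, text_tokens, audio_tokens, pose_tokens):
        # Self-attention over the noisy video latent tokens.
        h = self.norm1(video_tokens)
        video_tokens = video_tokens + self.self_attn(h, h, h)[0]
        # Concatenate all condition modalities and attend to them jointly.
        cond = torch.cat([text_tokens, audio_tokens, pose_tokens], dim=1)
        h = self.norm2(video_tokens)
        video_tokens = video_tokens + self.cross_attn(h, cond, cond)[0]
        # Position-wise feed-forward refinement.
        return video_tokens + self.mlp(self.norm3(video_tokens))

# Example: 16 video tokens conditioned on 4 text, 8 audio, and 8 pose tokens.
block = OmniConditionBlock()
out = block(torch.randn(1, 16, 512), torch.randn(1, 4, 512),
            torch.randn(1, 8, 512), torch.randn(1, 8, 512))
print(out.shape)  # torch.Size([1, 16, 512])
```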
Omni-conditions training strategy
Trains the model in stages, introducing motion-related conditions progressively according to how strongly each one constrains motion, and gradually refining the model's capabilities. By leveraging large-scale multimodal data, it strengthens generalization and improves the realism and stability of the generated videos.
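The sketch below shows one plausible reading of this strategy: training proceeds in stages, and conditions that constrain motion more strongly are attached to samples with lower probability, so weaker signals still receive enough supervision. The stage layout, ratio values, and helper names are illustrative assumptions, not the paper's exact recipe.

```python
import random

# Hypothetical per-stage probabilities of attaching each condition to a sample;
# stronger (more motion-constraining) conditions get lower ratios.
STAGES = [
    {"text": 1.0, "audio": 0.0, "pose": 0.0},   # stage 1: text/reference image only
    {"text": 1.0, "audio": 0.5, "pose": 0.0},   # stage 2: add audio at a reduced ratio
    {"text": 1.0, "audio": 0.5, "pose": 0.25},  # stage 3: add pose at an even lower ratio
]

def sample_condition_mask(stage):
    """Randomly decide which condition modalities accompany a training sample."""
    return {name: random.random() < p for name, p in STAGES[stage].items()}

def train(model_step, dataset, steps_per_stage=10_000):
    for stage in range(len(STAGES)):
        for _ in range(steps_per_stage):
            sample = random.choice(dataset)
            mask = sample_condition_mask(stage)
            # Keep only the conditions selected for this step; the rest are dropped
            # so the model keeps learning from weaker signals as well.
            conditions = {k: v for k, v in sample["conditions"].items() if mask.get(k)}
            model_step(sample["video"], conditions)

# Toy usage with a stub training step and a one-clip dataset.
train(lambda video, conds: None,
      [{"video": "clip_000.mp4",
        "conditions": {"text": "a person talking", "audio": "clip_000.wav", "pose": "clip_000.npz"}}],
      steps_per_stage=3)
```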
This architecture ensures that OmniHuman can generate high-quality, natural, and smooth human videos under various input conditions.