FoleyCrafter is a new model released yesterday: a text-based video-to-audio generation framework that produces high-quality audio semantically relevant to and temporally synchronized with the input video. It was jointly developed by teams from the Shanghai Artificial Intelligence Laboratory and the Chinese University of Hong Kong (Shenzhen).
Demo video (the audio in the video was generated by FoleyCrafter)
Research methodology
FoleyCrafter is built on top of a pre-trained text-to-audio (T2A) generator, which ensures high-quality audio synthesis. It comprises two main components: the Semantic Adapter (S.A.) and the Temporal Controller, the latter consisting of a Timestamp Detector (T.D.) and a Temporal Adapter (T.A.). Both the Semantic Adapter and the Temporal Controller are trainable modules that condition audio synthesis on the input video and are optimized with audio supervision, while the T2A model stays frozen to preserve its established capability for high-quality audio synthesis.
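To make the architecture concrete, here is a minimal PyTorch-style sketch of how the pieces could fit together. The module internals, feature dimensions, and the frozen T2A interface are my assumptions for illustration, not the authors' actual code:

```python
import torch
import torch.nn as nn

class SemanticAdapter(nn.Module):
    """Trainable: maps video features to semantic conditions for the T2A model.
    (Hypothetical internals; shown here as a simple projection.)"""
    def __init__(self, video_dim=768, cond_dim=1024):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(video_dim, cond_dim), nn.GELU(),
            nn.Linear(cond_dim, cond_dim),
        )

    def forward(self, video_feats):           # (B, T, video_dim)
        return self.proj(video_feats)         # (B, T, cond_dim)

class TimestampDetector(nn.Module):
    """Trainable: predicts per-frame sound/silence probabilities from video."""
    def __init__(self, video_dim=768):
        super().__init__()
        self.head = nn.Linear(video_dim, 1)

    def forward(self, video_feats):           # (B, T, video_dim)
        return torch.sigmoid(self.head(video_feats)).squeeze(-1)  # (B, T)

class FoleyCrafterSketch(nn.Module):
    def __init__(self, t2a_model):
        super().__init__()
        self.t2a = t2a_model                  # pre-trained text-to-audio generator
        for p in self.t2a.parameters():       # frozen: keeps its audio quality intact
            p.requires_grad = False
        self.semantic_adapter = SemanticAdapter()
        self.timestamp_detector = TimestampDetector()
        # The Temporal Adapter would inject the predicted timestamps into the
        # T2A decoder; its wiring depends on the backbone, so it is elided here.

    def forward(self, video_feats, text_cond=None):
        sem_cond = self.semantic_adapter(video_feats)
        onsets = self.timestamp_detector(video_feats)
        # Hypothetical interface: the frozen T2A consumes both conditions
        # (plus an optional text prompt) to synthesize the audio.
        return self.t2a(semantic_cond=sem_cond, temporal_cond=onsets,
                        text_cond=text_cond)
```

Freezing the T2A backbone is the key design choice: all video-specific learning happens in the adapters, so the generator's audio quality is left untouched.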
Peer comparison
One significant advantage of FoleyCrafter is its compatibility with text prompts, which allows users to achieve controllable and diverse video-to-audio generation using text descriptions. We conducted extensive quantitative and qualitative experiments on standard benchmarks to validate the effectiveness of FoleyCrafter.
Semantic alignment and audio quality:
Temporal synchronization:
Example demonstration
The model can be run on Hugging Face; I tried it myself, and the results were pretty good.
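If you would rather script it than click through the web demo, a Gradio Space can usually be called programmatically via gradio_client. Below is a minimal sketch; the Space path, endpoint name, and argument list are my assumptions, so check the Space's "Use via API" page for the real signature:

```python
from gradio_client import Client, handle_file

# Hypothetical Space path and endpoint -- verify against the actual
# FoleyCrafter Space's API documentation before running.
client = Client("ymzhang319/FoleyCrafter")
result = client.predict(
    handle_file("kart_clip.mp4"),       # input video
    "engine roar, tires screeching",    # optional text prompt for control
    api_name="/predict",
)
print(result)  # path to the generated video with synthesized audio
```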
I created a video of the game "Ragnarok Kart":