Sony's MMAudio model — generating sound effects for videos

MMAudio was developed jointly by UIUC, Sony AI, and Sony Group Corporation.

MMAudio generates synchronized audio from video and/or text input. Its key innovation is multi-modal joint training, which lets it learn from a mix of audio-visual and audio-text datasets. MMAudio also includes a synchronization module that aligns the generated audio with the video frames.
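As a rough sketch of what such a video-and/or-text conditioned interface could look like, consider the stub below. Note that `MMAudioModel`, `generate`, and every parameter name here are hypothetical placeholders, not the released API; see the repository for the real entry points.

```python
"""Illustrative sketch of a video/text-conditioned generation interface.
All names (MMAudioModel, generate, duration_s) are hypothetical."""
from typing import Optional
import numpy as np

class MMAudioModel:
    def generate(
        self,
        video_path: Optional[str] = None,  # optional video condition
        text: Optional[str] = None,        # optional text condition
        duration_s: float = 8.0,           # length of the generated clip
        sample_rate: int = 44100,
    ) -> np.ndarray:
        if video_path is None and text is None:
            raise ValueError("Provide a video, a text prompt, or both.")
        # Placeholder: a real model would run its sampling procedure here.
        return np.zeros(int(duration_s * sample_rate), dtype=np.float32)

model = MMAudioModel()
wave = model.generate(video_path="clip.mp4", text="footsteps on gravel")
```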

Demonstration (the audio in the video was generated by MMAudio).

Training

In addition to audio-visual(-text) datasets, MMAudio is jointly trained on abundant, high-quality audio-text data, effectively expanding the scale of the training set, as illustrated in the sketch below. At inference time, MMAudio generates audio aligned with the given video and/or text conditions.
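One common way to realize this kind of joint training is to substitute a learned placeholder for the missing modality, so that audio-text batches flow through the same network as audio-visual-text batches. The sketch below illustrates that idea under this assumption; the module and its names are illustrative and not taken from MMAudio's code.

```python
# Sketch of multi-modal joint training: when a batch has no video
# (audio-text data), the video condition is replaced by a learned
# placeholder so one network handles both dataset types.
import torch
import torch.nn as nn

class JointConditioner(nn.Module):
    def __init__(self, dim: int = 512, n_video_tokens: int = 64):
        super().__init__()
        # Learned stand-in for "no video available" (audio-text batches).
        self.empty_video = nn.Parameter(torch.zeros(n_video_tokens, dim))

    def forward(self, video_feats, text_feats):
        # video_feats: (B, n_video_tokens, dim) or None
        if video_feats is None:
            b = text_feats.shape[0]
            video_feats = self.empty_video.expand(b, -1, -1)
        # Concatenate conditions for the multi-modal transformer.
        return torch.cat([video_feats, text_feats], dim=1)

cond = JointConditioner()
text = torch.randn(4, 16, 512)
tokens_av = cond(torch.randn(4, 64, 512), text)  # audio-visual-text batch
tokens_at = cond(None, text)                     # audio-text-only batch
```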

Overview of the MMAudio prediction network

Video conditions, text conditions, and audio latents interact within a multi-modal transformer network. The synchronization module injects frame-aligned synchronization features to ensure precise audio-visual alignment.
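The sketch below is one plausible reading of that description, not MMAudio's actual layer code: frame-aligned sync features are stretched to the audio-latent token rate and added to the audio tokens, after which video, text, and audio tokens attend to one another in a single transformer block. `JointBlock` and all dimensions are assumptions made for illustration.

```python
# Minimal sketch of joint attention with sync-feature injection.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointBlock(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, video, text, audio, sync):
        # sync: (B, n_frames, dim) frame-aligned features; stretch them to
        # the audio token rate so each audio latent sees its matching frame.
        sync_up = F.interpolate(sync.transpose(1, 2), size=audio.shape[1],
                                mode="nearest").transpose(1, 2)
        audio = audio + sync_up
        x = torch.cat([video, text, audio], dim=1)   # joint token sequence
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]                # joint self-attention
        x = x + self.mlp(self.norm2(x))
        return x[:, -audio.shape[1]:]                # updated audio latents

block = JointBlock()
out = block(torch.randn(2, 64, 512), torch.randn(2, 16, 512),
            torch.randn(2, 128, 512), torch.randn(2, 32, 512))
```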

Spectrogram comparison of generated audio

Comparing the spectrograms of generated audio against other methods and real recordings shows that the sound effects produced by MMAudio are closest to the real audio, while other methods often produce sounds that do not match the visual input and are absent from the real recording.
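Such a comparison can be reproduced with standard tooling, for example by computing mel spectrograms with librosa and plotting them side by side. The file names below are placeholders.

```python
# Plot mel spectrograms of a generated clip and a real reference clip.
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax, path, title in zip(axes, ["generated.wav", "real.wav"],
                           ["MMAudio (generated)", "Real audio"]):
    y, sr = librosa.load(path, sr=None)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
    db = librosa.power_to_db(mel, ref=np.max)  # convert power to decibels
    librosa.display.specshow(db, sr=sr, x_axis="time", y_axis="mel", ax=ax)
    ax.set_title(title)
plt.tight_layout()
plt.show()
```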

Trial links

  • HuggingFace - https://huggingface.co/spaces/hkchengrex/MMAudio
  • Colab - https://colab.research.google.com/drive/1TAaXCY2-kPk4xE4PwKB3EqFbSnkUuzZ8
  • Replicate - https://replicate.com/zsxkib/mmaudio