V2A technology combines video pixels with natural language text prompts to generate rich soundscapes for the action on screen. It can be paired with video generation models like Veo to create shots with a dramatic score, realistic sound effects, or dialogue that matches the characters and tone of a video; it can also generate soundtracks for a range of traditional footage, including archival material and silent films, opening up a wider range of creative opportunities.
Prompt for audio: Cinematic, thriller, horror film, music, tension, ambience, footsteps on concrete
Enhanced creative control
Importantly, V2A can generate an unlimited number of soundtracks for any video input. Users can define "positive prompts" to guide the output toward desired sounds, or "negative prompts" to steer it away from undesired sounds. This flexibility gives users more control over V2A's audio output, letting them rapidly experiment with different soundtracks and choose the best match.
Prompt for audio: A spaceship hurtles through the vastness of space, stars streaking past it, high speed, Sci-fi
Prompt for audio: Ethereal cello atmosphere
Prompt for audio: A spaceship hurtles through the vastness of space, stars streaking past it, high speed, Sci-fi
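To make the positive/negative prompt idea above concrete, here is a minimal sketch. DeepMind has not published a public V2A API, so `generate_soundtrack`, its parameters, and the file name are hypothetical stand-ins that only illustrate how such guidance could look.

```python
# Hypothetical illustration of positive/negative prompt guidance; not a real V2A API.
def generate_soundtrack(video_path: str, positive_prompt: str,
                        negative_prompt: str = "", num_variants: int = 3):
    """Return `num_variants` candidate soundtracks, steered toward the
    positive prompt and away from the negative prompt.
    In a real system, each variant would come from a fresh diffusion sample."""
    return [f"soundtrack_{i}: +[{positive_prompt}] -[{negative_prompt}]"
            for i in range(num_variants)]

candidates = generate_soundtrack(
    video_path="spaceship.mp4",                        # hypothetical input clip
    positive_prompt="Sci-fi engine roar, high speed",  # sounds we want
    negative_prompt="music",                           # sounds to suppress
)
for c in candidates:
    print(c)
```

Generating several variants per prompt pair mirrors the workflow described above: produce multiple candidate soundtracks, then pick the one that best fits the scene.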
How it works
We experimented with autoregressive and diffusion approaches to find the most scalable AI architecture, and found that the diffusion-based approach gave the most realistic and compelling results for synchronizing video and audio. The V2A system starts by encoding the video input into a compressed representation. The diffusion model then iteratively refines the audio from random noise, guided by the visual input and the natural language prompts, to generate synchronized, realistic audio that closely follows the prompt. Finally, the audio output is decoded into an audio waveform and combined with the video data.
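The toy sketch below traces that data flow (encode video, iteratively denoise an audio latent under video and prompt conditioning, decode to a waveform). It is not DeepMind's implementation: the encoder, denoiser, decoder, and all shapes here are simplified stand-ins chosen only to make the three stages concrete.

```python
# Toy, self-contained sketch of the V2A pipeline described above (NumPy only).
import numpy as np

rng = np.random.default_rng(0)

def encode_video(frames: np.ndarray) -> np.ndarray:
    """Stand-in video encoder: pool pixels into a compact conditioning vector."""
    return frames.reshape(frames.shape[0], -1).mean(axis=1)  # one value per frame

def embed_prompt(prompt: str, dim: int) -> np.ndarray:
    """Stand-in text encoder: hash characters into a fixed-size vector."""
    vec = np.zeros(dim)
    for i, ch in enumerate(prompt):
        vec[i % dim] += ord(ch) / 255.0
    return vec / max(len(prompt), 1)

def denoise_step(latent, video_cond, text_cond, t):
    """Stand-in denoiser: nudge the latent toward the conditioning signals.
    A real diffusion model would predict the noise with a learned network."""
    target = 0.5 * video_cond + 0.5 * text_cond
    return latent + (target - latent) * (1.0 / (t + 1))

def decode_audio(latent: np.ndarray, samples_per_step: int = 160) -> np.ndarray:
    """Stand-in decoder: upsample the compressed audio latent to a waveform."""
    return np.repeat(latent, samples_per_step)

# 1. Encode the video input into a compressed representation.
video = rng.random((48, 16, 16))          # 48 dummy frames of 16x16 "pixels"
video_cond = encode_video(video)          # shape: (48,)

# 2. Iteratively refine audio from random noise, guided by video + prompt.
text_cond = embed_prompt("Cinematic, thriller, tension", dim=video_cond.shape[0])
latent = rng.standard_normal(video_cond.shape[0])
for t in reversed(range(50)):             # 50 denoising steps
    latent = denoise_step(latent, video_cond, text_cond, t)

# 3. Decode the refined latent into an audio waveform aligned with the video.
waveform = decode_audio(latent)
print(waveform.shape)                     # (48 * 160,) samples
```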
The V2A system takes video pixels and audio prompts as input and generates an audio waveform synchronized with the video. First, V2A encodes the video and audio prompt inputs and runs them iteratively through the diffusion model. It then generates compressed audio, which is decoded into an audio waveform.

To generate higher-quality audio and add the ability to guide the model toward specific sounds, we added more information to the training process, including AI-generated annotations with detailed descriptions of sounds and transcripts of spoken dialogue. By training on video, audio, and these additional annotations, our technology learns to associate specific audio events with a variety of visual scenes, while responding to the information provided in the annotations or transcripts.
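As a hedged sketch of what one such training example might contain, based only on the description above (video, audio, AI-generated sound annotations, dialogue transcripts), the field names here are illustrative and not DeepMind's actual data schema.

```python
# Illustrative training-example structure; field names are assumptions, not the real schema.
from dataclasses import dataclass, field

@dataclass
class TrainingExample:
    video_frames: list          # raw video pixels for the clip
    audio_waveform: list        # ground-truth soundtrack for the clip
    sound_annotations: list = field(default_factory=list)  # AI-generated sound descriptions
    transcript: str = ""        # spoken dialogue in the clip, if any

example = TrainingExample(
    video_frames=[],            # placeholder: frames would be loaded here
    audio_waveform=[],          # placeholder: audio samples would be loaded here
    sound_annotations=["footsteps on concrete", "ethereal cello atmosphere"],
    transcript="this turkey looks amazing, I'm so hungry",
)
```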
Remaining challenges
Since the quality of the audio output depends on the quality of the video input, artifacts or distortions in the video that fall outside the model's training distribution can noticeably degrade the audio quality. Lip synchronization for videos involving speech is also still being improved. V2A attempts to generate speech from the input transcript and synchronize it with the characters' lip movements. However, the paired video generation model may not be conditioned on the transcript, so the mouth movements it generates can diverge from the transcript, often resulting in uncanny lip-syncing.
Prompt for audio: Music, Transcript: “this turkey looks amazing, I’m so hungry”