The third Audio model covered in the report is MusicLM, released by Google.
It is capable of generating high-fidelity music from text descriptions such as "a calm violin melody accompanied by distorted guitar improvisation." MusicLM casts conditional music generation as a hierarchical sequence-to-sequence modeling task and can generate music at 24 kHz that remains coherent over several minutes. Experiments show that MusicLM surpasses previous systems in both audio quality and adherence to the text description. MusicLM can also be conditioned on text and melody together: it can transform a whistled or hummed melody into the style described by the text. Alongside the model, Google publicly released MusicCaps, a dataset of 5,500 music-text pairs with rich text descriptions written by human experts.
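As a rough illustration of this hierarchical design, the sketch below mimics the text → semantic tokens → acoustic tokens → 24 kHz waveform flow described above. All function names, vocabulary sizes, and token rates are illustrative assumptions, not MusicLM's actual interface.

```python
import random

SAMPLE_RATE = 24_000       # output fidelity reported for MusicLM
SEMANTIC_RATE = 25         # assumed semantic tokens per second
ACOUSTIC_PER_SEMANTIC = 4  # assumed expansion factor per semantic token

def semantic_stage(text: str, seconds: int) -> list[int]:
    """Hypothetical first stage: map the text description to coarse
    semantic tokens capturing melody and long-term structure."""
    rng = random.Random(hash(text) & 0xFFFFFFFF)
    return [rng.randrange(1024) for _ in range(seconds * SEMANTIC_RATE)]

def acoustic_stage(semantic_tokens: list[int]) -> list[int]:
    """Hypothetical second stage: expand semantic tokens into fine-grained
    acoustic tokens from a neural-codec (SoundStream-style) vocabulary."""
    rng = random.Random(sum(semantic_tokens))
    return [rng.randrange(4096)
            for _ in range(len(semantic_tokens) * ACOUSTIC_PER_SEMANTIC)]

def decode(acoustic_tokens: list[int]) -> list[float]:
    """Hypothetical codec decoder: acoustic tokens -> 24 kHz samples
    (silence here, since this is only a structural sketch)."""
    samples_per_token = SAMPLE_RATE // (SEMANTIC_RATE * ACOUSTIC_PER_SEMANTIC)
    return [0.0] * (len(acoustic_tokens) * samples_per_token)

prompt = "a calm violin melody accompanied by distorted guitar improvisation"
sem = semantic_stage(prompt, seconds=10)
audio = decode(acoustic_stage(sem))
print(f"{len(sem)} semantic tokens -> {len(audio) / SAMPLE_RATE:.1f}s of audio")
```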
Its capabilities include:
- Rich captions: MusicLM can generate audio from detailed text descriptions, which are not limited to simple phrases but can include varied context and emotional nuance. Example prompt:
  "The main soundtrack of an arcade game. It is fast-paced and upbeat, with a catchy electric guitar riff. The music is repetitive and easy to remember, but with unexpected sounds, like cymbal crashes or drum rolls."
- Long generation: The model can generate long musical pieces while maintaining thematic and stylistic consistency.
- Story mode: Given a sequence of timed text prompts, MusicLM generates matching audio; each prompt steers how the model continues the semantic tokens inherited from the previous segment, so the style changes while the music stays coherent (see the story-mode sketch after this list). Example prompt schedule:
  time to meditate (0:00-0:15)
  time to wake up (0:15-0:30)
  time to run (0:30-0:45)
  time to give 100% (0:45-1:00)
- Text and melody conditioning: By adding melody embeddings to the conditioning signal, MusicLM can generate music that follows both the text prompt and a provided melody, transforming a whistled or hummed tune into the style the text describes (see the melody-conditioning sketch after this list).
- Painting caption conditioning: The model can also generate music conditioned on descriptions of paintings, further broadening its use cases.
- Short clips from simple text: Users can specify particular instruments, genres, musician experience levels, places, epochs, or even accordion solos to generate short audio clips.
- Generation diversity: With the conditioning and/or the semantic tokens held fixed, generated samples still vary from one another, keeping the output rich and varied (see the diversity sketch after this list).
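For the story-mode item above, here is a minimal sketch of how timed prompts could drive continued generation: each new prompt extends the existing semantic-token prefix rather than restarting it, which is what keeps the transitions coherent. The function and token rate are assumptions for illustration only.

```python
import random

TOKENS_PER_SECOND = 25  # assumed semantic-token rate

def continue_tokens(prefix: list[int], text: str, n_new: int) -> list[int]:
    """Hypothetical stand-in for autoregressively extending `prefix`
    under a new text condition, preserving musical continuity."""
    rng = random.Random(hash((len(prefix), text)) & 0xFFFFFFFF)
    return prefix + [rng.randrange(1024) for _ in range(n_new)]

schedule = [  # (prompt, start second, end second), as in the example above
    ("time to meditate",   0, 15),
    ("time to wake up",   15, 30),
    ("time to run",       30, 45),
    ("time to give 100%", 45, 60),
]

tokens: list[int] = []
for prompt, start, end in schedule:
    tokens = continue_tokens(tokens, prompt, (end - start) * TOKENS_PER_SECOND)
    print(f"{start:02d}s-{end:02d}s '{prompt}': {len(tokens)} tokens so far")
```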
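For the text-and-melody item, the sketch below shows one plausible shape for joint conditioning: encode the hummed or whistled input into a melody representation and combine it with the text condition. Both encoders are stubs with made-up dimensions; in the actual system, MuLan provides the text/music embedding.

```python
def embed_text(text: str, dim: int = 64) -> list[float]:
    """Hypothetical text encoder (MuLan plays this role in MusicLM)."""
    vec = [float(ord(c) % 7) for c in text[:dim]]
    return vec + [0.0] * (dim - len(vec))

def embed_melody(audio: list[float], dim: int = 64) -> list[float]:
    """Hypothetical melody encoder: a pitch-contour embedding meant to
    ignore timbre, so humming and whistling map to similar vectors."""
    return [0.0] * dim

def joint_condition(text: str, hummed: list[float]) -> list[float]:
    # Concatenate both signals; generation would then follow the hummed
    # melody while adopting the style named in the text.
    return embed_text(text) + embed_melody(hummed)

cond = joint_condition("piano solo in a jazz style", hummed=[0.0] * 24_000)
print(f"conditioning vector of length {len(cond)}")
```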
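Finally, for the generation-diversity item, a sketch of the experiment's logic: hold the semantic tokens fixed and resample only the acoustic stage, so variants share melody and structure but differ in fine-grained sound. The stochastic stage below is a stand-in, not the real sampler.

```python
import random

def sample_acoustic(semantic_tokens: list[int], seed: int) -> list[int]:
    """Hypothetical stochastic acoustic stage: different seeds yield
    different renderings of the same semantic sequence."""
    rng = random.Random(seed)
    return [rng.randrange(4096) for _ in semantic_tokens for _ in range(4)]

semantic = [n % 1024 for n in range(250)]  # pretend: tokens from one fixed prompt
variants = [sample_acoustic(semantic, seed=s) for s in range(3)]
print("all variants distinct:",
      len({tuple(v) for v in variants}) == len(variants))
```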
More 🔊Demo🔊 clips can be heard here: https://google-research.github.io/seanet/musiclm/examples/