
Stable Audio 2.0 audio generation: hum a melody and let AI generate music

In yesterday's post, we discussed the audio section of the Stanford AI report. At the time that report was written, Stable Audio 2.0 had not yet been released; it officially launched on April 3.

Introduction to Stable Audio

Stable Audio 1.0, released in September 2023, was the first commercially usable AI music-generation tool able to produce high-quality 44.1 kHz music using latent diffusion, and Time magazine named it one of the best inventions of 2023.

Version 2.0 builds on version 1.0. Through dual prompting, text-to-audio and audio-to-audio, users can create melodies, backing tracks, stem separations, and sound effects, streamlining the creative process. What sets Stable Audio 2.0 apart from other top models is its ability to generate full songs up to three minutes long, with structured compositions that include an intro, development, and outro, as well as stereo sound effects. The new model is free to use on the official Stable Audio website and will soon be accessible through the Stable Audio API.

Main Functions

Full-length audio track

Stable Audio 2.0 stands out among top-tier models for its ability to generate complete songs up to three minutes long, with well-structured compositions that include an intro, development, and outro, along with stereo sound effects.

Audio-to-audio generation

Stable Audio 2.0 supports uploading audio files and turning those creative ideas into finished samples. Simply put, you can hum a melody 🎶 and the AI will generate matching music 🎵 for you. (Isn't that amazing?)

Sound variation and effect creation

The model improves the generation of sounds and audio effects, from keyboard typing to crowd cheers or the hum of city streets, offering new ways to elevate audio projects.

Style transfer

This new feature seamlessly modifies newly generated or uploaded audio during generation, letting users tailor the output to match the style and tone of a specific project.

Research Methodology

The architecture combines a highly compressed autoencoder with a diffusion transformer (DiT) similar to the one used in Stable Diffusion 3; the DiT replaces the previous U-Net because it processes long sequences of data more efficiently. Together, these two components let the model recognize and reproduce the large-scale structures critical to high-quality musical works.
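
To make the idea concrete, here is a minimal, hypothetical PyTorch sketch of a DiT-style block: self-attention over the whole latent sequence, with a timestep embedding modulating each sub-layer. The class name, dimensions, and adaLN-style conditioning are illustrative assumptions, not Stability AI's actual implementation.

```python
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    """Toy diffusion-transformer block: self-attention plus an MLP over the
    latent sequence, each modulated by a timestep embedding (a simplified
    adaLN-style conditioning; dimensions are made up for illustration)."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.ada = nn.Linear(dim, 4 * dim)  # timestep embedding -> scales/shifts

    def forward(self, x, t_emb):
        scale1, shift1, scale2, shift2 = self.ada(t_emb).unsqueeze(1).chunk(4, dim=-1)
        h = self.norm1(x) * (1 + scale1) + shift1
        x = x + self.attn(h, h, h, need_weights=False)[0]  # global attention over all frames
        h = self.norm2(x) * (1 + scale2) + shift2
        x = x + self.mlp(h)
        return x

# A three-minute track compressed to a couple of thousand latent frames is one
# long sequence; self-attention sees the whole piece at once, which is why a
# transformer is a natural fit for song-level structure.
latents = torch.randn(1, 2000, 256)      # (batch, latent frames, channels)
t_emb = torch.randn(1, 256)              # timestep embedding
print(DiTBlock()(latents, t_emb).shape)  # torch.Size([1, 2000, 256])
```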

Technical Details

The autoencoder compresses raw audio into a compact latent representation and restores it close to the original, capturing and reproducing the essential features while filtering out less important detail, which enables more coherent generation.
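
As a rough illustration of that compression step, the toy autoencoder below squeezes a stereo waveform into a latent sequence about a thousand times shorter along the time axis and then expands it back. The layer sizes and the 1024x compression factor are arbitrary choices for this sketch, not Stable Audio's actual configuration.

```python
import torch
import torch.nn as nn

class AudioAutoencoder(nn.Module):
    """Toy 1D convolutional autoencoder: strided convolutions shrink the time
    axis (8 * 8 * 16 = 1024x here), transposed convolutions expand it back."""

    def __init__(self, channels=2, latent_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(channels, 64, kernel_size=15, stride=8, padding=7), nn.GELU(),
            nn.Conv1d(64, 128, kernel_size=15, stride=8, padding=7), nn.GELU(),
            nn.Conv1d(128, latent_dim, kernel_size=31, stride=16, padding=15),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(latent_dim, 128, kernel_size=32, stride=16, padding=8), nn.GELU(),
            nn.ConvTranspose1d(128, 64, kernel_size=16, stride=8, padding=4), nn.GELU(),
            nn.ConvTranspose1d(64, channels, kernel_size=16, stride=8, padding=4),
        )

    def forward(self, waveform):
        latents = self.encoder(waveform)        # (batch, latent_dim, ~T / 1024)
        reconstruction = self.decoder(latents)  # (batch, channels, ~T)
        return latents, reconstruction

# Ten seconds of 44.1 kHz stereo audio becomes a latent sequence of ~431 frames.
wave = torch.randn(1, 2, 441_000)
latents, recon = AudioAutoencoder()(wave)
print(latents.shape, recon.shape)
```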

The diffusion transformer (DiT) gradually refines random noise into structured data, recognizing complex patterns and relationships along the way. Combined with the autoencoder, it can handle longer sequences and produce deeper, more faithful interpretations of the input.
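
Generation can then be pictured as the small loop below: start from pure noise in the latent space, repeatedly ask the denoiser for a cleaner estimate, and finally decode the latents back to a waveform. This is a generic, simplified diffusion-sampling sketch with stand-in components (the schedule, blending rule, and placeholder denoiser and decoder are assumptions), not the sampler Stable Audio actually uses.

```python
import torch

def sample(denoiser, decoder, frames=2000, dim=64, steps=50):
    """Toy iterative refinement: noise latents are blended toward the model's
    clean estimate as the noise level is lowered step by step."""
    x = torch.randn(1, frames, dim)               # start from random noise
    sigmas = torch.linspace(1.0, 0.0, steps + 1)  # decreasing noise schedule
    for i in range(steps):
        denoised = denoiser(x, sigmas[i])         # model's estimate of the clean latents
        x = denoised + (x - denoised) * (sigmas[i + 1] / sigmas[i])  # keep less and less noise
    return decoder(x)                             # latents -> waveform

# Stand-in components just to make the sketch executable:
denoiser = lambda x, sigma: 0.9 * x                       # pretend denoising model
decoder = lambda z: torch.randn(1, 2, z.shape[1] * 1024)  # pretend latent-to-audio decoder
print(sample(denoiser, decoder).shape)                    # torch.Size([1, 2, 2048000])
```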