Sketch2Sound is a generative audio model developed jointly by Adobe and Northwestern University. A research paper has been published, but the model has not yet been open-sourced or made available for public trial.
Overview
Sketch2Sound is an innovative generative audio model that produces high-quality sound effects by combining:
- **Time-varying control signals** (loudness, spectral centroid, and pitch probabilities): fine-grained temporal control over the generated audio.
- **Text prompts**: semantic-level control of audio generation.
- **Sonic imitations**: customized sound effects created by imitating or referencing vocal sounds and their shapes.
Main features
Lightweight implementation
Sketch2Sound is built on a text-to-audio latent diffusion transformer (DiT). It requires only 40k fine-tuning steps and a single linear layer per control signal, keeping the computational cost low and making it more efficient than approaches such as ControlNet.
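To make this concrete, here is a minimal sketch of how one linear layer per control signal could inject time-varying controls into a DiT's latent sequence. All names and dimensions (`ControlConditioner`, `d_model`, `n_pitch_bins`) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class ControlConditioner(nn.Module):
    """Hypothetical sketch: one linear layer per control signal.

    Each control signal (loudness, spectral centroid, pitch
    probabilities) is projected into the DiT's latent dimension and
    added to the latent sequence frame by frame.
    """

    def __init__(self, d_model: int, n_pitch_bins: int = 128):
        super().__init__()
        # The only new parameters added on top of the pretrained DiT:
        # one linear projection per control signal.
        self.loudness_proj = nn.Linear(1, d_model)
        self.centroid_proj = nn.Linear(1, d_model)
        self.pitch_proj = nn.Linear(n_pitch_bins, d_model)

    def forward(self, latents, loudness, centroid, pitch_probs):
        # latents:     (batch, frames, d_model) DiT latent sequence
        # loudness:    (batch, frames, 1)
        # centroid:    (batch, frames, 1)
        # pitch_probs: (batch, frames, n_pitch_bins)
        return (latents
                + self.loudness_proj(loudness)
                + self.centroid_proj(centroid)
                + self.pitch_proj(pitch_probs))

# Toy usage: 2 clips, 100 latent frames, latent dimension 512
cond = ControlConditioner(d_model=512)
latents = torch.randn(2, 100, 512)
out = cond(latents,
           loudness=torch.randn(2, 100, 1),
           centroid=torch.randn(2, 100, 1),
           pitch_probs=torch.randn(2, 100, 128))
print(out.shape)  # torch.Size([2, 100, 512])
```

Because only these small projection layers are trained on top of the pretrained DiT, the fine-tuning footprint stays far smaller than ControlNet-style adapters, which duplicate entire encoder blocks.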
Random median filtering during training
- During training, random median filtering is applied to the control signals.
- This gives the input signals flexible temporal resolution, so the model can handle control signals of varying time precision.
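As an illustration, a training-time random median filter might look like the following NumPy/SciPy sketch; the kernel-size range is an assumption rather than the paper's exact setting.

```python
import numpy as np
from scipy.signal import medfilt

def random_median_filter(control: np.ndarray, max_kernel: int = 31) -> np.ndarray:
    """Median-filter a control curve with a randomly chosen odd kernel.

    Smoothing the controls by a random amount during training exposes
    the model to many temporal resolutions of the same signal. The
    kernel range here is illustrative, not the paper's setting.
    """
    # medfilt requires an odd kernel size; 1 leaves the signal unchanged.
    kernel = int(np.random.choice(np.arange(1, max_kernel + 1, 2)))
    return medfilt(control, kernel_size=kernel)

# Example: smooth a noisy per-frame loudness curve
loudness = np.abs(np.random.randn(500))
smoothed = random_median_filter(loudness)
print(smoothed.shape)  # (500,)
```

Smoothing the controls by a random amount at training time means that, at inference, a user can provide anything from a precise frame-level curve to a loose gestural sketch, and the model will follow it at the appropriate temporal resolution.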
Consistency in input control
The model can generate sound effects that follow the "intention" of a sonic imitation while still meeting the semantic requirements of the text prompt. The quality of the output sound effects is comparable to baseline methods that generate purely from text.
Application scenarios
For sound artists, Sketch2Sound provides:
- Semantic flexibility through text prompts.
- Expressiveness and accuracy through sonic imitation.
It is especially well suited to generating audio content synchronized with video or in interactive environments.
Workflow
Extract three key control signals from any input sonic imitation (such as a vocal imitation or a reference sound shape):
- **Loudness**: describes the volume variation of the sound.
- **Spectral Centroid**: represents the brightness or clarity of the sound.
- **Pitch Probabilities**: describe the pitch characteristics of the sound.
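A rough sketch of extracting these three signals with librosa follows. The frame parameters and the pYIN pitch range are assumptions, and pYIN yields an f0 track with a voicing probability rather than a full distribution over pitch bins, so the paper's pitch representation may differ.

```python
import librosa

# Load a sonic imitation (librosa's bundled example stands in here)
y, sr = librosa.load(librosa.ex("trumpet"), sr=22050)
hop = 512

# Loudness proxy: per-frame RMS energy
loudness = librosa.feature.rms(y=y, hop_length=hop)[0]

# Brightness: per-frame spectral centroid in Hz
centroid = librosa.feature.spectral_centroid(y=y, sr=sr, hop_length=hop)[0]

# Pitch: per-frame f0 estimate and voicing probability via pYIN
f0, voiced_flag, voiced_prob = librosa.pyin(
    y,
    fmin=librosa.note_to_hz("C2"),
    fmax=librosa.note_to_hz("C7"),
    sr=sr,
    hop_length=hop,
)

print(loudness.shape, centroid.shape, f0.shape)  # aligned per-frame curves
```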
Signal encoding
Encode the aforementioned control signals into latent features so that the model can understand and process them.
Integration with the generation system
Feed the encoded control signals, together with the text prompt, into the text-to-audio generation system.
Output audio
The system generates semantically accurate and high-quality audio based on control signals and text prompts.
Control curve semantics
" (Forest environment), Sketch2Sound exhibits the following behavior:
Interpretation of Control Curves
The model interprets the peaks of the loudness curve in the input control signal as bird calls, associating these peaks on its own with bird-call sound effects that are common in such an environment.
Without Explicit Prompting
Even when bird calls are not explicitly mentioned in the text prompt, the model still generates bird sounds that fit the scene. This indicates that, during audio generation, the model does not rely on the control signals alone; it can also infer scene-appropriate details through semantic understanding.
Result
In the generated sound effects, the forest atmosphere is faithfully reproduced through ambient sounds (such as wind and rustling leaves) and natural sounds (such as bird calls).
Example
Generating synchronized sound effects for video: by combining vocal imitation with text prompts, Sketch2Sound creates high-quality sound effects that match the visual content.