Alibaba's Tongyi large-model speech team released two speech models and one application framework this month:
SenseVoice: for high-precision multilingual speech recognition, emotion recognition, and audio event detection. It offers extremely low latency and supports over 50 languages.
CosyVoice: for natural speech generation, with control over language, timbre, and emotion. It performs well in multilingual speech generation, zero-shot voice generation, cross-lingual voice cloning, and instruction following.
Application framework: by combining the two models, several concrete applications have been built, aimed at natural voice interaction between humans and large language models.
Application scenarios
Speech-to-speech translation
By integrating SenseVoice, LLMs, and CosyVoice, we can easily achieve speech-to-speech translation (S2ST).
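A minimal sketch of such a pipeline, chaining the three pieces: it assumes the FunASR `AutoModel` wrapper for SenseVoice, an OpenAI-compatible chat endpoint as the translation LLM (the `gpt-4o-mini` name and all file paths here are placeholders), and the CosyVoice repository's Python API.

```python
import torchaudio
from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess
from openai import OpenAI
from cosyvoice.cli.cosyvoice import CosyVoice

# Step 1: transcribe the source speech with SenseVoice (language auto-detected).
asr = AutoModel(model="iic/SenseVoiceSmall", trust_remote_code=True)
res = asr.generate(input="source.wav", language="auto", use_itn=True)
source_text = rich_transcription_postprocess(res[0]["text"])

# Step 2: translate the transcript with any chat-style LLM.
llm = OpenAI()  # assumes OPENAI_API_KEY is set; any chat endpoint would do
translated = llm.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user",
               "content": f"Translate the following into English:\n{source_text}"}],
).choices[0].message.content

# Step 3: synthesize the translation with a built-in CosyVoice SFT speaker.
tts = CosyVoice("pretrained_models/CosyVoice-300M-SFT")
for out in tts.inference_sft(translated, "英文女"):
    torchaudio.save("translated.wav", out["tts_speech"], 22050)  # 22.05 kHz output
```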
Case 1:
Original audio: Yes, so regarding your current situation, if you say you can handle this bill, then find a way to deal with it.
Generated translated audio:
Case 2:
Original audio:
Generated translated audio: Subsequently, we will fully acquire that company. Therefore, aligning the interests of the management team with the assets entering the family is one reason why we do not buy everything.
Emotional voice chat
By integrating SenseVoice, LLMs, and CosyVoice, we can develop an emotional voice chat application. In the following example, both user and assistant content are synthesized by CosyVoice.
Case:
User: Good morning.
Assistant: Generated style: warm tone, slow speaking rate, friendly emotion. Broadcast content: Good morning, how are you feeling today?
User: I'm particularly happy today.
Assistant: Generated style: high pitch, fast speaking rate, happy emotion. Broadcast content: [breath] Oh wow, hearing you're so happy [breath] makes me happy too, what good thing happened?
User: We're going to release a new large speech model.
Assistant: Generated style: normal tone, medium speaking rate, with a hint of amusement. Broadcast content: [laughter] That's great, I heard you're releasing a new large speech model, it's exciting news!
User: Say a line from Bald Qiang sadly.
Assistant: Generated style: low pitch, slow speaking rate, sad emotion. Broadcast content: "Sigh... looks like I'll be coming back empty-handed again..."
User: Amazing.
Assistant: Generated style: positive tone, moderate speaking rate, encouraging emotion. Broadcast content: That's really amazing, keep up the good work!
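One plausible way to wire up such a loop, sketched below: the LLM is prompted to emit a style line and a text line, and CosyVoice's instruction-following mode renders both. The prompt format and model paths are assumptions, not the official recipe.

```python
import torchaudio
from openai import OpenAI
from cosyvoice.cli.cosyvoice import CosyVoice

STYLE_PROMPT = (
    "You are a voice assistant. Answer with exactly two lines:\n"
    "STYLE: <tone, speaking rate, emotion>\n"
    "TEXT: <the reply; [breath] and [laughter] tags are allowed>"
)

llm = OpenAI()  # placeholder chat endpoint, as in the S2ST sketch
tts = CosyVoice("pretrained_models/CosyVoice-300M-Instruct")

def chat_turn(user_text: str, out_path: str) -> None:
    reply = llm.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": STYLE_PROMPT},
                  {"role": "user", "content": user_text}],
    ).choices[0].message.content
    style_line, text_line = reply.splitlines()[:2]
    style = style_line.split(":", 1)[1].strip()
    text = text_line.split(":", 1)[1].strip()
    # inference_instruct takes the text, a base speaker ID, and a free-form
    # natural-language style instruction; the LLM's style line is passed through.
    for out in tts.inference_instruct(text, "中文女", style):
        torchaudio.save(out_path, out["tts_speech"], 22050)

chat_turn("早上好", "reply_0.wav")
```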
Interactive podcast
By integrating SenseVoice, a multi-agent system based on LLMs with real-time world knowledge, and CosyVoice, we can create an interactive podcast.
Case:
Assistant_Xiaxia: AI host Xiaxia brings you to the cutting edge of knowledge. Hello everyone, welcome to our podcast; I'm your host Xiaxia.
Assistant_Dadong: Hello everyone, I'm your funny guy Dadong. Today's topic is explosive: the lip-syncing controversy at Mayday's concert.
Assistant_Xiaxia: Yes, this topic has sparked quite a bit of controversy recently. First, let's look at some of the evidence and analysis leaked by netizens. Netizens uploaded video clips from Mayday's concert, and from those videos you can see that lead singer Ashin was clearly lip-syncing; in some parts the audio didn't even match his mouth movements.
Assistant_Dadong: Haha, I watched that video; it was indeed awkward. I remember there was a shot where Ashin put down the microphone.
User: But isn't that unprofessional?
Assistant_Dadong: The question of professionalism is worth discussing. Mayday has always been known for the strength of their live performances. If the band really lip-synced, fans might feel disappointed; sometimes measures are taken for the sake of sound quality or the artists' health, but fans usually prefer a real live performance, don't they? What do you think?
User: Not that interesting, change the topic.
Assistant_Dadong: Next we move on to another very popular topic: the poster for Thailand's match against China in the 36-team qualifying round, truly full of anticipation and tension.
Assistant_Xiaxia: Exactly. When I saw that poster, my heart filled with mixed feelings. The last time our national football team lost badly to Thailand, that 1-5 score, was a nightmare. Now that it's being used as publicity, I think our young players must stay alert and not let history repeat itself.
Assistant_Dadong: True. The poster also lists the occasions when Thailand beat our national team, in the 2013 friendly and the 2019 China Cup; Thailand certainly wasn't being polite.
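A bare-bones sketch of the turn-taking, reusing the same placeholder LLM client and CosyVoice SFT model as the sketches above; real-time news retrieval and user barge-in handling are elided.

```python
import torchaudio
from openai import OpenAI
from cosyvoice.cli.cosyvoice import CosyVoice

llm = OpenAI()  # placeholder chat endpoint
tts = CosyVoice("pretrained_models/CosyVoice-300M-SFT")

# Each "agent" is just a persona prompt plus a built-in CosyVoice speaker.
HOSTS = {
    "Xiaxia": ("You are Xiaxia, a warm, knowledgeable podcast host.", "中文女"),
    "Dadong": ("You are Dadong, a humorous podcast co-host.", "中文男"),
}

def podcast_turn(host: str, history: list, idx: int) -> None:
    persona, speaker = HOSTS[host]
    reply = llm.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": persona}, *history],
    ).choices[0].message.content
    history.append({"role": "assistant", "content": f"{host}: {reply}"})
    for out in tts.inference_sft(reply, speaker):
        torchaudio.save(f"turn_{idx:02d}_{host}.wav", out["tts_speech"], 22050)

# A user's spoken interjection would be transcribed by SenseVoice and
# appended to `history` as a user message, steering the conversation.
history = [{"role": "user",
            "content": "Topic: the lip-syncing controversy at Mayday's concert."}]
for i, host in enumerate(["Xiaxia", "Dadong", "Xiaxia"]):
    podcast_turn(host, history, i)
```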
Expressive audiobook narration
By using an LLM to analyze a book's structure and identify the emotions in its text, combined with CosyVoice's synthesis capabilities, we can produce more expressive audiobooks.
Case:
The Return of the Condor Heroes, Chapter Nine, "A Hundred Ways to Avoid the Enemy."
Lu Wushuang was already at a loss when she heard him suddenly ask such a silly question, and she said angrily, "Idiot! What nonsense are you talking about?" Yang Guo laughed and said, "Let's play the wedding game. You'll play the bride, won't that be beautiful? With a red cloth covering your face, no one will be able to see it no matter what they say." Lu Wushuang was taken aback: "Are you teaching me to dress as a bride to escape my master?" Yang Guo chuckled and said, "I don't know about that; you dress as the bride, and I'll dress as the groom."
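A sketch of one possible per-sentence pipeline: a hypothetical annotation prompt asks the LLM for a speaker and emotion per sentence as JSON lines, which then drive CosyVoice's instruction mode. Everything here (prompt wording, paths, speaker IDs) is illustrative.

```python
import json
import torchaudio
from openai import OpenAI
from cosyvoice.cli.cosyvoice import CosyVoice

ANNOTATE = (
    "Split the passage into sentences. For each sentence, output one JSON "
    'object per line with keys "text", "speaker", and "emotion".'
)

llm = OpenAI()  # placeholder chat endpoint
tts = CosyVoice("pretrained_models/CosyVoice-300M-Instruct")

passage = open("chapter_09.txt", encoding="utf-8").read()
annotated = llm.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": f"{ANNOTATE}\n\n{passage}"}],
).choices[0].message.content

# Render each annotated sentence with a style instruction built from the labels.
for i, line in enumerate(annotated.splitlines()):
    item = json.loads(line)
    instruction = f"Speak as {item['speaker']}, with a {item['emotion']} tone."
    for out in tts.inference_instruct(item["text"], "中文男", instruction):
        torchaudio.save(f"book_{i:04d}.wav", out["tts_speech"], 22050)
```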
Two models:
Speech understanding model: SenseVoice
SenseVoice is a speech foundation model with a range of speech understanding capabilities, including automatic speech recognition (ASR), spoken language identification (LID), speech emotion recognition (SER), and audio event detection (AED). SenseVoice-Small is an encoder-only model built for fast inference; SenseVoice-Large is an encoder-decoder model that supports more languages and delivers more accurate recognition.
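A minimal transcription example, assuming the FunASR `AutoModel` wrapper documented in the SenseVoice repository; the raw output embeds language, emotion, and event tags that `rich_transcription_postprocess` cleans up.

```python
from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

model = AutoModel(model="iic/SenseVoiceSmall", trust_remote_code=True)
res = model.generate(
    input="example.wav",
    language="auto",  # or pin one of "zh", "en", "yue", "ja", "ko"
    use_itn=True,     # inverse text normalization: digits, punctuation
)
# The raw string interleaves LID/SER/AED tags, e.g.
# "<|en|><|HAPPY|><|Speech|>..." (illustrative), before postprocessing.
print(res[0]["text"])
print(rich_transcription_postprocess(res[0]["text"]))
```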
Speech generation model: CosyVoice
In summary, CosyVoice consists of an autoregressive transformer that generates speech tokens for the input text; an ODE-based diffusion model that reconstructs Mel spectrograms from those tokens via flow matching; and a HiFTNet-based vocoder that synthesizes the waveform. In the architecture diagram, the dashed modules are optional and depend on the usage mode, e.g. cross-lingual synthesis or SFT inference.
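The repository's Python entry point exposes one inference method per usage mode. A brief SFT-mode sketch, with paths and speaker IDs taken as assumptions from the repo's published examples:

```python
import torchaudio
from cosyvoice.cli.cosyvoice import CosyVoice

# SFT mode: all modules fixed, pick one of the built-in fine-tuned speakers.
sft = CosyVoice("pretrained_models/CosyVoice-300M-SFT")
print(sft.list_avaliable_spks())  # built-in speaker IDs (method name sic, per the repo)
for out in sft.inference_sft("收到好友从远方寄来的生日礼物。", "中文女"):
    torchaudio.save("sft.wav", out["tts_speech"], 22050)
```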
Try out the CosyVoice model
I found a Google Colab notebook, ran it, and used a clip of Trump's voice as the prompt to generate a text-to-speech sample; a sketch of the core calls follows the clips below.
Original audio:
Generated audio:
The most incredible thing was that I happened to not only turn but to turn at the exact right time and in just the right amount, I’m supposed to be dead, I’m not supposed to be here
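The notebook's core steps reduce to a few lines like the following sketch: load a short 16 kHz reference clip plus its transcript, then ask the base (non-SFT) model to speak new text in that voice. File names and the prompt transcript are placeholders.

```python
import torchaudio
from cosyvoice.cli.cosyvoice import CosyVoice
from cosyvoice.utils.file_utils import load_wav

# Zero-shot voice cloning with the base model (the dashed, optional
# modules in the architecture diagram come into play here).
tts = CosyVoice("pretrained_models/CosyVoice-300M")
prompt_speech = load_wav("trump_reference.wav", 16000)  # short, clean sample
prompt_text = "Transcript of the reference clip."       # must match the audio

new_text = ("The most incredible thing was that I happened to not only turn "
            "but to turn at the exact right time and in just the right amount.")
for out in tts.inference_zero_shot(new_text, prompt_text, prompt_speech):
    torchaudio.save("cloned.wav", out["tts_speech"], 22050)
```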