Sesame CSM - An incredibly powerful speech model, soon to be open-sourced

I recently came across a very impressive voice model that is about to be open-sourced, and I'd like to share its demo with you: "Crossing the uncanny valley of conversational voice," released on February 27, 2025 by Brendan Iribe, Ankit Kumar, and the Sesame team. https://github.com/SesameAILabs/csm

First, listen to a conversation between me and the model (the voice with a slight accent that is not quite as pleasant is mine):

Achieving "Voice Presence"

How do we tell whether someone truly understands us? It shows not only in the words they choose but also in the subtle nuances of their voice: the rising intonation when excited, the well-placed pauses during deep thought, and the warmth and comfort the voice conveys.

Voice is the most intimate medium of human communication, conveying rich and delicate meanings through countless subtle variations in tone, pitch, rhythm, and emotion.

However, existing digital voice assistants lack this crucial quality, which makes it hard for them to truly integrate into our daily lives. Once the initial novelty wears off, their flat, neutral voices gradually become tiring.

The goal of the Sesame team is to achieve true "Voice Presence" — making voice interactions feel real, understanding, and valuable. They are creating not just a tool for processing commands, but a partner capable of engaging in genuine conversations, building trust through continuous interaction, and thus fully unlocking the enormous potential of voice as a human-computer interaction medium.

Key elements for achieving Voice Presence:

  • Emotional Intelligence: Understanding and responding to emotions in the conversation;
  • Dialogue Dynamics: Natural pacing, appropriate pauses, interruptions, and emphasis;
  • Context Awareness: Adjusting tone and style according to specific contexts;
  • Consistent Personality Traits: Maintaining a coherent, reliable, and appropriate expression of character.

CSM New Voice Model

The Sesame team has proposed the Conversational Speech Model (CSM), a new voice model that uses an end-to-end multimodal learning framework based on Transformers. The key innovations include:

  • End-to-end voice interaction learning using a multimodal Transformer architecture;
  • Incorporating multiple dimensions such as language and prosody for voice inference;
  • Going beyond the limitations of traditional public evaluation datasets and adopting more rigorous assessment methods to continuously improve model performance.

Through CSM, the team has taken a significant step towards achieving "Voice Presence," transforming AI voice from a monotonous command responder to a truly interactive, emotionally and contextually aware conversational partner.

Detailed Model Architecture

The Conversational Speech Model (CSM) is a multimodal voice model implemented with two autoregressive Transformers, following the RQ-Transformer approach. Unlike previous methods, the Sesame team splits the Transformer at the zeroth codebook layer.

The first multimodal backbone network processes interleaved text (Text) and audio (Audio) sequences to predict the zeroth codebook. The second audio decoder then generates the audio information for layers 1 to N-1 based on the predicted zeroth codebook.

Specific Model Operation Mechanism

In practice, tokens of text (T) and audio (A) are input alternately into the Backbone network, which predicts the content of the zeroth codebook. Subsequently, the Decoder, based on the predicted zeroth codebook, autoregressively samples and generates the content of codebooks from layer 1 to N-1, ultimately reconstructing the audio.
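
To make this loop more concrete, here is a rough Python sketch of a single generation step. The objects `backbone` and `decoder`, and methods such as `codebook0_head`, `init_state`, and `step`, are my own placeholder interfaces for illustration, not the actual Sesame implementation; greedy argmax also stands in for the real sampling procedure.

```python
import torch

# Illustrative sketch only: `backbone` and `decoder` are assumed interfaces,
# not the real Sesame modules, and argmax replaces proper sampling.

N_CODEBOOKS = 32  # 1 semantic + (N - 1) acoustic codebooks (N is assumed here)

def generate_frame(backbone, decoder, context_tokens):
    """Generate one audio frame: codebook 0 from the Backbone,
    codebooks 1..N-1 autoregressively from the Decoder."""
    # The Backbone consumes the interleaved text/audio history (T and A tokens)
    # and predicts the zeroth codebook for the next frame.
    h = backbone(context_tokens)                # (batch, hidden)
    code0_logits = backbone.codebook0_head(h)   # (batch, vocab)
    code0 = torch.argmax(code0_logits, dim=-1)  # (batch,)

    # The Decoder conditions on the Backbone state and codebook 0, then
    # produces codebooks 1..N-1 one level at a time.
    codes = [code0]
    state = decoder.init_state(h, code0)
    for level in range(1, N_CODEBOOKS):
        logits = decoder.step(state, level, codes[-1])  # (batch, vocab)
        codes.append(torch.argmax(logits, dim=-1))
    return torch.stack(codes, dim=-1)  # (batch, N_CODEBOOKS) RVQ codes for one frame
```

Stacking the per-frame codes over time gives the full RVQ code sequence, which the audio tokenizer's decoder can then turn back into a waveform.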

The team used two Tokenizers:

  • Text Tokenizer: Using the Llama tokenizer[6] to generate text tokens;
  • Audio Tokenizer: Using Mimi (a split-RVQ Tokenizer) to produce one semantic codebook and N-1 acoustic codebooks per frame at a frequency of 12.5Hz.

The structure of the training data samples is an alternating pattern of text and audio.
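
As a rough illustration of this layout, the sketch below assembles one interleaved training sample; `text_tokenizer` and `audio_tokenizer` are placeholders standing in for the Llama and Mimi tokenizers, with interfaces assumed for the sake of the example.

```python
# Illustrative sketch of the alternating text/audio sample layout.
# The tokenizer objects and their `encode` methods are assumed placeholders
# for the Llama text tokenizer and the Mimi audio tokenizer.

def build_training_sample(turns, text_tokenizer, audio_tokenizer):
    """turns: list of (text, waveform) pairs, one per conversational turn.
    Returns a single sequence of alternating text and audio chunks."""
    sequence = []
    for text, waveform in turns:
        # Text tokens for this turn (Llama-style subword tokens).
        sequence.append(("text", text_tokenizer.encode(text)))
        # Mimi codes at 12.5 Hz: one semantic + N-1 acoustic codebooks
        # per frame, i.e. an array of shape (num_frames, N).
        sequence.append(("audio", audio_tokenizer.encode(waveform)))
    return sequence
```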

Compute Amortization Scheme

Training these models is memory-intensive: even at smaller model sizes, the memory requirements slow training considerably, limiting scalability and the pace of experimentation. To address this, the team adopted a compute amortization scheme to alleviate the memory bottleneck:

  • The Backbone Transformer predicts the zeroth codebook over all audio frames;
  • The Decoder only predicts the remaining codebooks from layer 1 to N-1, but only for randomly selected 1/16 of the frames, significantly reducing the memory consumption required for training and alleviating the bottleneck during model scaling.

Specifically:

  • The Backbone models the zeroth codebook across all frames (highlighted in blue).
  • The Decoder only predicts the remaining codebook content and calculates loss for randomly sampled 1/16 of the frames (marked in green).

This method retains the integrity and fidelity of the RVQ codebooks while alleviating memory constraints, improving training speed and scalability, and facilitating faster experimental iterations.
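
A minimal sketch of the idea, assuming ordinary cross-entropy losses and a hypothetical `decoder_fn`; this is not the Sesame training code, just one way the 1/16 frame subsampling could look:

```python
import torch
import torch.nn.functional as F

# Sketch of compute amortization: backbone loss over all frames,
# decoder loss over a random 1/16 subset of frames. `decoder_fn` is a
# hypothetical callable returning logits of shape (B, S, N-1, vocab).

AMORTIZATION_RATIO = 1 / 16

def training_losses(backbone_logits, decoder_fn, frame_features, targets):
    """backbone_logits: (B, T, vocab) codebook-0 logits for every frame.
    frame_features:  (B, T, ...) per-frame inputs for the decoder.
    targets:         (B, T, N) ground-truth RVQ codes (long tensor)."""
    # Backbone: zeroth-codebook loss on all frames.
    backbone_loss = F.cross_entropy(
        backbone_logits.flatten(0, 1), targets[..., 0].flatten()
    )

    # Decoder: train codebooks 1..N-1 only on a random 1/16 of frame positions.
    B, T = targets.shape[:2]
    num_sampled = max(1, int(T * AMORTIZATION_RATIO))
    idx = torch.randperm(T)[:num_sampled]
    decoder_logits = decoder_fn(frame_features[:, idx], targets[:, idx, 0])
    decoder_loss = F.cross_entropy(
        decoder_logits.flatten(0, -2), targets[:, idx, 1:].flatten()
    )
    return backbone_loss, decoder_loss
```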

Model Scale and Training Details

Three model sizes were designed:

  • Tiny: Backbone 1B parameters, Decoder 100M parameters
  • Small: Backbone 3B parameters, Decoder 250M parameters
  • Medium: Backbone 8B parameters, Decoder 300M parameters

All models were trained with a sequence length of 2048 (corresponding to approximately 2 minutes of audio content), and each model was trained for five epochs.

Sample Demonstrations

Paralinguistics

  • Sample Audio 1

  • Sample Audio 2

Foreign Language

  • Sample Audio 1
  • Sample Audio 2

Contextual Expressivity

  • Sample Audio 1
  • Sample Audio 2

Note: The model demonstrates how it precisely adjusts intonation, speech rate, and emotional expression based on context, making the voice content more realistic.

Pronunciation Correction Examples

  • Sample Audio 1
  • Sample Audio 2

Note: The pronunciation correction examples are real recordings, while the rest of the audio is generated by the model.

Multiple Speakers Example

  • Sample Audio

Note: Based on audio prompts from two speakers, the model generates a natural and smooth multi-speaker dialogue in a single pass.

Model Evaluation

To assess the performance of the CSM model, both objective and subjective evaluation methods were employed. Objective evaluations used traditional metrics (such as Word Error Rate) alongside newer semantic and pronunciation tests; subjective evaluations used the Expresso dataset, where listeners provided comparative mean opinion scores (CMOS) to measure the model's emotional expression and contextual appropriateness.

📌 Objective Evaluation

Traditional evaluation metrics such as Word Error Rate (WER) and Speaker Similarity (SIM) have reached saturation, with modern models like CSM approaching human-level performance on these metrics.
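
For reference, Word Error Rate is simply the word-level edit distance between a reference transcript and a hypothesis, normalized by the length of the reference; the snippet below is the textbook formulation, not anything specific to CSM.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance (substitutions + insertions + deletions)
    divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution / match
    return dp[len(ref)][len(hyp)] / max(1, len(ref))

print(word_error_rate("the quick brown fox", "the quick brown box"))  # 0.25
```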

To further demonstrate the model's performance, the following two more challenging tests were introduced:

  • Homograph Disambiguation
    This test checks whether the model can correctly pronounce homographs, words that are spelled the same but pronounced differently in different contexts, such as the English word "lead" which can be pronounced /lɛd/ (metal lead) or /liːd/ (to lead).

  • Pronunciation Consistency Test
    This test evaluates the model's stability in pronouncing the same word in different contexts, such as the words "route," "data," and "caramel," which have different common pronunciation variants in English.
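
As an illustration of what the consistency test measures, here is a hedged sketch of one possible scoring scheme: take the phoneme sequence a recognizer assigns to the target word in each generated utterance and report how often the most common variant is used. This is my own simplification, not the team's exact metric.

```python
from collections import Counter

def consistency_score(pronunciations):
    """pronunciations: phoneme strings for the same word, one per generated
    utterance (e.g. from a phoneme recognizer). Returns the fraction of
    renditions that match the most common variant."""
    counts = Counter(pronunciations)
    most_common = counts.most_common(1)[0][1]
    return most_common / len(pronunciations)

# Example: "route" rendered as /ruːt/ three times and /raʊt/ once -> 0.75
print(consistency_score(["ruːt", "ruːt", "raʊt", "ruːt"]))
```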

The following figure shows the performance comparison of various models in the above tests:

  • The left side shows the results of the Homograph test;
  • The right side shows the results of the Consistency test.

(Comparison with the default configurations of Play.ht, Elevenlabs, and OpenAI models.)

Overall, larger models tend to have higher pronunciation accuracy, which aligns with the team's hypothesis that larger models can produce more realistic speech synthesis.

📌 Subjective Evaluation (Expresso Dataset)

For subjective evaluation, the Expresso dataset was chosen, which includes a wide variety of emotional and prosodic variations, making it ideal for assessing the naturalness and contextual appropriateness of speech. Listeners rated the model-generated audio against real human recordings on a 7-point scale (CMOS). Multiple listeners were invited, each evaluating an average of 15 samples.

  • No Context Condition: Listeners judged which audio sounded more like a real human voice without any specific context;
  • With Context Condition: Listeners judged which audio was more suitable for the given context.

The results are as follows:

  • In the no-context scenario, the naturalness of the model-generated speech is already close to that of real human speech, with little difference between the models.
  • In the context-based scenario, listeners tended to prefer real human recordings. This indicates that there is still a gap between the models and real humans in terms of prosody and contextual matching in conversational speech generation.
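
As an aside, comparative scores like these are straightforward to aggregate once the per-trial ratings are collected. The sketch below assumes the 7-point scale described above (negative values meaning the human recording was preferred) and uses placeholder ratings purely to demonstrate the function, not actual study data.

```python
import statistics

def summarize_cmos(ratings):
    """ratings: per-trial scores on a -3..+3 scale; negative = human recording
    preferred, positive = model output preferred, 0 = no preference."""
    mean_score = statistics.mean(ratings)
    model_win_rate = sum(r > 0 for r in ratings) / len(ratings)
    return mean_score, model_win_rate

# Placeholder ratings for illustration only (not results from the study):
mean_score, win_rate = summarize_cmos([0, -1, 1, 0, -2, 0, 1, -1])
print(f"mean CMOS: {mean_score:+.2f}, model preferred in {win_rate:.0%} of trials")
```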

📌 Model Limitations and Future Plans

The current CSM model is primarily trained on English data. Although it has some cross-lingual ability thanks to a small amount of other-language data in the training set, it has not yet reached the desired level. Additionally, the model currently only captures the text and speech content of a conversation with high quality; it cannot yet effectively model deeper dialogue structure, such as turn-taking, pacing, and pausing in multi-speaker conversations.

To further enhance the capabilities of CSM, the team plans to:

  • Expand Multilingual Support
    Improve the model's performance in multilingual scenarios by training on more diverse datasets to enhance its cross-lingual capabilities.

  • Integrate Deep Text and Speech Interaction
    In the future, the CSM model will be further expanded to capture the full structure of multi-speaker conversations, including turn-taking and dialogue pacing.

  • Combine the Advantages of Speech and Text
    The team is exploring a new AI architecture that allows the model to deeply understand both text and speech information, further narrowing the gap between generated and real human conversations.