Text-to-Speech
As a market leader in text-to-speech (TTS), ElevenLabs achieved unicorn status at the beginning of the year with a valuation of $1.1 billion. While large research labs have entered this field only cautiously, ElevenLabs has captured much of the market.
Beyond its flagship TTS product, ElevenLabs has expanded into foreign-language dubbing and voice isolation, and has even previewed an early text-to-music model. Citing copyright concerns, it has not officially released that model, but it does offer an API for sound effect generation.
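To make the product surface concrete, here is a minimal sketch of calling the TTS and sound-effect endpoints over ElevenLabs' public REST API with `requests`. The endpoint paths, model ID, and voice ID are my assumptions based on the documentation at the time of writing; verify them against the current docs before relying on them.

```python
# Minimal sketch of ElevenLabs' REST API. Endpoint paths, the model ID,
# and the voice ID are assumptions to verify against the current docs.
import requests

API_KEY = "your-elevenlabs-api-key"  # placeholder
HEADERS = {"xi-api-key": API_KEY, "Content-Type": "application/json"}

# Text-to-speech: returns raw audio bytes (MP3 by default).
VOICE_ID = "21m00Tcm4TlvDq8ikWAM"  # example premade voice ID from the docs
tts_resp = requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
    headers=HEADERS,
    json={"text": "Hello from the State of AI Report.",
          "model_id": "eleven_multilingual_v2"},
)
tts_resp.raise_for_status()
with open("speech.mp3", "wb") as f:
    f.write(tts_resp.content)

# Sound-effect generation: a short text prompt describing the sound.
sfx_resp = requests.post(
    "https://api.elevenlabs.io/v1/sound-generation",
    headers=HEADERS,
    json={"text": "rain falling on a tin roof"},
)
sfx_resp.raise_for_status()
with open("rain.mp3", "wb") as f:
    f.write(sfx_resp.content)
```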
Currently, 62% of Fortune 500 companies have at least one employee using ElevenLabs' products.
Meanwhile, frontier labs have approached this area more cautiously, possibly out of concern over backlash from misuse of voice generation technology. For example, GPT-4o's voice output is limited to preset voices, and OpenAI has said it has not yet decided whether to roll out its Voice Engine broadly (it can reportedly clone a voice from a 15-second recording).
Cartesia, on the other hand, is betting on state-space models (SSMs) to make voice synthesis more efficient.
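The appeal of SSMs for real-time synthesis is worth spelling out: they process audio as a linear recurrence, so each new sample updates a fixed-size hidden state in constant time, whereas attention-based models pay a cost that grows with context length. Below is a toy numpy sketch of that recurrence; it illustrates the general SSM idea only, not Cartesia's actual architecture.

```python
import numpy as np

# Toy sketch of the discretized linear recurrence at the core of a
# state-space model: x_t = A @ x_{t-1} + B @ u_t,  y_t = C @ x_t.
# Real SSM speech models learn these matrices and add nonlinear mixing;
# this only illustrates the constant-memory streaming update.
rng = np.random.default_rng(0)
d_state, d_in = 16, 1
A = rng.normal(scale=0.1, size=(d_state, d_state))
B = rng.normal(size=(d_state, d_in))
C = rng.normal(size=(d_in, d_state))

x = np.zeros((d_state, 1))          # hidden state: fixed size, no matter how long the history
for t in range(48_000):             # e.g. one second of 48 kHz audio, sample by sample
    u = rng.normal(size=(d_in, 1))  # stand-in for one input audio sample / feature
    x = A @ x + B @ u               # O(1) state update per step
    y = C @ x                       # output sample
```

That constant per-step cost is what makes SSMs attractive for low-latency streaming audio.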
Speech-to-Text
Unlike the "wow factor" of text-to-speech, speech recognition lends itself to automation at scale, eliminating large volumes of repetitive work, and investors are increasingly recognizing its potential for large-scale applications.
A series of startups dedicated to speech recognition have closed funding rounds in the past year, applying the technology to customer support, call centers, and similar scenarios: Assembly AI ($50 million), Deepgram ($72 million), PolyAI ($50 million), and Parloa ($66 million). Among them, PolyAI's revenue is expected to triple this year.
These startups aim to ease the staffing shortage in call centers and to make customer interactions feel more natural, handling corrections, pauses, interruptions, and topic switches, all areas where traditional automated systems struggle.
Although AI-driven transcription and audio analysis technologies are not new concepts, their accuracy continues to improve thanks to larger datasets and the application of Transformer models.
For example, Assembly AI's Universal-1 multilingual model, trained on 12.5 million hours of speech, is faster than OpenAI's Whisper while demanding less compute, producing lower error rates, and filtering background noise more effectively.
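For a sense of how little code sits between an audio file and a Universal-1 transcript, here is a minimal sketch using Assembly AI's Python SDK; the language-detection option is an assumption to check against the SDK docs.

```python
# Minimal sketch using AssemblyAI's Python SDK (`pip install assemblyai`).
# The config option shown is an assumption to verify against the SDK docs.
import assemblyai as aai

aai.settings.api_key = "your-assemblyai-api-key"  # placeholder

# Universal-1 is multilingual, so automatic language detection pairs well with it.
config = aai.TranscriptionConfig(language_detection=True)

transcriber = aai.Transcriber(config=config)
transcript = transcriber.transcribe("https://example.com/call-recording.mp3")  # URL or local path

if transcript.status == aai.TranscriptStatus.error:
    print(transcript.error)
else:
    print(transcript.text)
```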
Speech-to-Speech
For more than a decade, the consumer voice-assistant experiences offered by Alexa and Siri have been underwhelming. OpenAI's GPT-4o, however, and Moshi, the voice assistant from Paris-based Kyutai, have successfully crossed the "uncanny valley." Both systems can think while speaking, keeping the interaction between user and assistant fluid. OpenAI demonstrated two phones running GPT-4o holding a coherent, engaging conversation with each other; Moshi, for its part, infers so quickly that it sometimes cuts in when the user pauses briefly, which can make its responses feel abrupt.
In addition, Google's NotebookLM can generate conversational podcasts from research material and has won over many users. (Strictly speaking, this is not speech-to-speech: it generates podcast audio automatically and does not yet support voice interaction with the user.) Below is the podcast I generated from the State of AI Report - 2024 deck:
Today, Paula sent me a video introducing an open-source project called Notebook Llama, which can be considered the open-source version of NotebookLM. It chains large language models (LLMs) and text-to-speech (TTS) models in a step-by-step, guided pipeline to automatically turn PDF source files into podcast audio, giving users an efficient path from text to audio content and making knowledge sharing more convenient and varied.
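To make the "step-by-step guided" idea concrete, here is a hedged sketch of the general pipeline shape (PDF text extraction, an LLM rewriting the text as a dialogue, then TTS). The model choices and prompt are illustrative stand-ins, not Notebook Llama's actual code.

```python
# Hedged sketch of a NotebookLlama-style pipeline: PDF -> text -> LLM-written
# two-host script -> TTS audio. Models and prompt are illustrative stand-ins.
import soundfile as sf
from pypdf import PdfReader
from transformers import pipeline

# 1. Extract raw text from the source PDF.
text = " ".join(page.extract_text() or "" for page in PdfReader("report.pdf").pages)

# 2. Ask an instruction-tuned LLM to rewrite the text as a podcast dialogue.
llm = pipeline("text-generation", model="meta-llama/Llama-3.1-8B-Instruct")
prompt = (
    "Rewrite the following document as a lively two-host podcast script:\n\n"
    + text[:4000]  # naive truncation; the real recipe chunks the document
)
script = llm(prompt, max_new_tokens=1024)[0]["generated_text"]

# 3. Synthesize the script with an open TTS model.
tts = pipeline("text-to-speech", model="suno/bark-small")
audio = tts(script[:600])  # short excerpt; long scripts need per-turn synthesis
sf.write("podcast.wav", audio["audio"].squeeze(), audio["sampling_rate"])
```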
More recently, Hugging Face has released a speech-to-speech pipeline that chains voice activity detection, speech-to-text (STT), a large language model (LLM), and text-to-speech (TTS) into a more complete voice-interaction stack. (I suspect they are referring to this: https://github.com/huggingface/speech-to-speech)
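That repository wires the four stages into a streaming loop; below is a hedged, non-streaming sketch of the STT, LLM, and TTS core using transformers pipelines, leaving out voice activity detection and the repo's streaming machinery. The model choices are illustrative stand-ins, not the project's defaults.

```python
# Hedged, non-streaming sketch of the STT -> LLM -> TTS core of a
# speech-to-speech chain. The actual huggingface/speech-to-speech project
# adds voice-activity detection and streams each stage; models here are
# illustrative stand-ins.
import soundfile as sf
from transformers import pipeline

stt = pipeline("automatic-speech-recognition", model="openai/whisper-small")
llm = pipeline("text-generation", model="HuggingFaceTB/SmolLM-1.7B-Instruct")
tts = pipeline("text-to-speech", model="suno/bark-small")

def respond(audio_path: str, out_path: str = "reply.wav") -> str:
    """Transcribe a user utterance, generate a reply, and speak it aloud."""
    user_text = stt(audio_path)["text"]
    reply = llm(f"User said: {user_text}\nAssistant:", max_new_tokens=128)
    reply_text = reply[0]["generated_text"].split("Assistant:")[-1].strip()
    speech = tts(reply_text)
    sf.write(out_path, speech["audio"].squeeze(), speech["sampling_rate"])
    return reply_text
```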