ElevenLabs, a year-old voice cloning and synthesis startup founded by former Google and Palantir employees, today announced the launch of AI Dubbing, a dedicated product that can translate any speech, including long-form content, into more than 20 different languages.
Available to all platform users, the offering provides a new way to dub audio and video content, automating a process that has largely been manual for years.
More importantly, it can break language barriers for smaller content creators who don’t have the resources to hire human translators to translate their content and take it global.
“We have tested and iterated this feature in collaboration with hundreds of content creators to dub their content and make it more accessible to wider audiences,” Mati Staniszewski, CEO and co-founder of ElevenLabs, told TechForgePulse. “We see huge potential for independent creatives – such as those creating video content and podcasts – all the way through to film and TV studios.”
ElevenLabs claims the feature can deliver high-quality translated audio in minutes (depending on the length of the content) while retaining the original voice of the speaker, complete with their emotions and intonation.
However, in this age of AI, when almost every enterprise is looking at language models to drive efficiencies, it is not the only one exploring speech-to-speech translation.
AI Dubbing: How it works
While AI-driven translation involves multiple layers of work, starting from noise removal to speech translation, users at the front end don’t have to go through any of those steps. They just have to select the AI Dubbing tool on ElevenLabs, create a new project, select the source and target languages and upload the file of the content.
Once the content is uploaded, the tool automatically detects the number of speakers and gets to work with a progress bar appearing on the screen. This is just like any other conversion tool on the internet. After completion, the file can be downloaded and used.
Behind the scenes, the tool taps ElevenLabs’ proprietary method to remove background noise, separating music and noise from the speakers’ actual dialogue. It recognizes which speakers speak when, keeping their voices distinct, and transcribes what they say in their original language using a speech-to-text model. This text is then translated, adapted (so lengths match) and voiced in the target language, producing speech that retains the original speaker’s voice characteristics.
Finally, the translated speech is synced back with the music and background noise originally removed from the file, preparing the dubbed output for use. ElevenLabs claims this work is the culmination of its research on voice cloning, text and audio processing, and multilingual speech synthesis.
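The stages described above can be sketched as a simple pipeline. Every function name and data shape below is a hypothetical stand-in, not ElevenLabs’ actual API; each stage is stubbed with dummy logic so the flow of data from raw audio to dubbed output is clear.

```python
# Illustrative sketch of a dubbing pipeline; all functions are stubs.

def separate(audio: str) -> dict:
    # 1. Split the mix into dialogue and background (music + noise).
    return {"dialogue": audio, "background": f"bg({audio})"}

def diarize(dialogue: str) -> list:
    # 2. Detect who speaks when, keeping each voice distinct.
    return [{"speaker": "spk1", "clip": dialogue}]

def transcribe(segment: dict) -> str:
    # 3. Speech-to-text in the original language.
    return f"text({segment['clip']})"

def translate(text: str, src: str, dst: str) -> str:
    # 4. Translate the transcript into the target language.
    return f"{dst}:{text}"

def fit_length(text: str, segment: dict) -> str:
    # 5. Adapt the translation so its length matches the original timing.
    return text

def synthesize(text: str, voice: str) -> str:
    # 6. Voice the text in the target language with the cloned voice.
    return f"speech({voice},{text})"

def remix(segments: list, background: str) -> str:
    # 7. Sync the dubbed speech back with the original background track.
    dubbed = "+".join(s["dubbed"] for s in segments)
    return f"mix({dubbed},{background})"

def dub(audio: str, src: str, dst: str) -> str:
    tracks = separate(audio)
    segments = diarize(tracks["dialogue"])
    for seg in segments:
        text = transcribe(seg)
        adapted = fit_length(translate(text, src, dst), seg)
        seg["dubbed"] = synthesize(adapted, voice=seg["speaker"])
    return remix(segments, tracks["background"])

print(dub("podcast.wav", "en", "es"))
```

The key design point is that dialogue and background are processed on separate tracks: only the dialogue passes through transcription, translation and synthesis, and the untouched background is mixed back in at the end.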
For producing the final speech from translated text, the company taps its latest Multilingual v2 model. It currently supports more than 20 languages, including Hindi, Portuguese, Spanish, Japanese, Ukrainian, Polish and Arabic, giving users a wide range of options to globalize their content.
Prior to this end-to-end interface, ElevenLabs offered separate tools for voice cloning and text-to-speech synthesis. This way, if one wanted to translate their audio content, like a podcast, into a different language, they first had to create a clone of their voice on the platform while transcribing and translating the audio separately. Then, using the translated text file and the cloned voice, they could produce audio from the text-to-speech model. Not to mention, this only worked for speech without any major background music or noise.
Staniszewski confirmed that the new dubbing feature will be available to all users of the platform, but will have some character limits, as has been the case with text-to-speech generation. Around one minute of AI Dubbing would typically equate to 3,000 characters, he said.
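Staniszewski’s rule of thumb (roughly 3,000 characters of quota per minute of dubbed audio) makes it easy to estimate how much of a plan’s character limit a given piece of content would consume. The helper below is illustrative only; actual plan limits and per-minute rates may differ.

```python
# Rough character-budget estimate from the stated rule of thumb:
# about 3,000 characters per minute of dubbed audio (illustrative).

CHARS_PER_MINUTE = 3000

def estimated_characters(duration_minutes: float) -> int:
    """Approximate character quota consumed by a dub of this length."""
    return round(duration_minutes * CHARS_PER_MINUTE)

# A 30-minute podcast episode, by this estimate:
print(estimated_characters(30))
```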
AI-based voices are coming
While ElevenLabs is making headlines with back-to-back developments, it is not the only one exploring AI-based voicing. A few weeks back, Microsoft-backed OpenAI made ChatGPT multimodal with the ability to have conversations in response to voice prompts, like Alexa.
Here, too, the company uses speech-to-text and text-to-speech models to convert audio, but the technology is not available to all.
OpenAI said it is using it with select partners to prevent misuse of the capabilities. One of these is Spotify, which is using the technology to help its podcasters translate their content into different languages while retaining their own voices.
For his part, Staniszewski said ElevenLabs’ AI Dubbing tool differentiates itself by translating video or audio of any length, containing any number of speakers, while preserving their voices and emotions across more than 20 languages and delivering the highest-quality results.
Other players are also active in the AI-powered voice and speech synthesis space, including MURF.AI, Play.ht and WellSaid Labs.
Just recently, Meta also launched SeamlessM4T, an open-source multilingual foundational model that can understand nearly 100 languages from speech or text and generate translations into either or both in real time.
According to Market US, the global market for such tools stood at $1.2 billion in 2022 and is estimated to reach nearly $5 billion by 2032, a CAGR of just over 15.4%.
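As a quick sanity check, the cited figures are internally consistent: $1.2 billion compounding at roughly 15.4% a year over the ten years from 2022 to 2032 lands near the projected $5 billion.

```python
# Verify the cited forecast: $1.2B in 2022 at ~15.4% CAGR for 10 years.

start_billion = 1.2
cagr = 0.154
years = 10

projected = start_billion * (1 + cagr) ** years  # compound growth formula
print(round(projected, 2))  # close to the ~$5B projected for 2032
```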