Open source AI voice cloning arrives with MyShell OpenVoice

Startups including the increasingly well-known ElevenLabs have raised millions of dollars to develop their own proprietary algorithms and AI software for making voice clones — audio programs that mimic the voices of users.

But along comes a new solution, OpenVoice, developed by researchers at the Massachusetts Institute of Technology (MIT), Tsinghua University in Beijing, China, and members of AI startup MyShell, to offer open-source voice cloning that is nearly instantaneous and offers granular controls not found on other voice cloning platforms.

“Clone voices with unparalleled precision, with granular control of tone, from emotion to accent, rhythm, pauses, and intonation, using just a small audio clip,” wrote MyShell on a post today on its official company account on X.

Today, we proudly open source our OpenVoice algorithm, embracing our core ethos – AI for all.

Experience it now: https://t.co/zHJpeVpX3t. Clone voices with unparalleled precision, with granular control of tone, from emotion to accent, rhythm, pauses, and intonation, using just a… pic.twitter.com/RwmYajpxOt
— MyShell (@myshell_ai) January 2, 2024

The company also included a link to its pre-reviewed research paper describing how it developed OpenVoice, and links to several places where users can access and try it out, including the MyShell web app interface (which requires a user account to access) and HuggingFace (which can be accessed publicly without an account).

Reached by TechForgePulse via email, one of the lead researchers, Zengyi Qin of MIT and MyShell, wrote to say: “MyShell wants to benefit the whole research community. OpenVoice is just a start. In the future, we will even provide grants & dataset & computing power to support the open-source research community. The core echo of MyShell is ‘AI for All.'”

As for why MyShell began with an open source voice cloning AI model, Qin wrote: “Language, Vision and Voice are 3 principal modalities of the future Artificial General Intelligence (AGI). In the research field, although the language and vision already have some good open-source models, it still lacks a good model for voice, especially for a power instant voice cloning model that allows everyone to customize the generated voice. So, we decided to do this.””

Using OpenVoice

In my unscientific tests of the new voice cloning model on HuggingFace, I was able to generate a relatively convincing — if somewhat robotic sounding — clone of my own voice rapidly, within seconds, using completely random speech.

Unlike other voice cloning apps, I was not forced to read a specific chunk of text in order for OpenVoice to clone my voice. I simply spoke extemporaneously for a few seconds, and the model generated a voice clone that I could play back nearly immediately, reading the text prompt I provided.

I also was able to adjust the “style,” between several defaults — cheerful, sad, friendly, angry, etc. — using a dropdown menu, and heard the noticeable change in tone to match these different emotions.

Here’s a sample of my voice clone made by OpenVoice through HuggingFace set to the “friendly” style tone.

How OpenVoice was made

In their scientific paper, the four named creators of OpenVoice — Qin, Wenliang Zhao and Xumin Yu of Tsinghua University, and Xin Sun of MyShell — describe their approach to creating the voice cloning AI.

OpenVoice comprises two different AI models: a text-to-speech (TTS) model and a “tone converter.”

The first model controls “the style parameters and languages,” and was trained on 30,000 sentences of “audio samples from two English speakers (American and British accents), one Chinese speaker and one Japanese speaker,” each labeled according to the emotion being expressed in them. It also learned intonation, rhythm, and pauses from these clips.

Meanwhile, the tone converter model was trained on more than 300,000 audio samples from more than 20,000 different speakers.

In both cases, the audio of human speech was converted into phonemes — specific sounds differentiating words from one another — and represented by vector embeddings.

By using a “base speaker,” for the TTS model, and then combining it with the tone derived from a user’s provided recorded audio, the two models together can reproduce the user’s voice, as well as change their “tone color,” or the emotional expression of the text being spoken. Here’s a diagram included in the OpenVoice team’s paper illustrating how these two models work together:

The team notes their approach is conceptually quite simple. Still, it works well and can clone voices using dramatically fewer compute resources than other methods, including Meta’s rival AI voice cloning model Voicebox.

“We wanted to develop the most flexible instant voice cloning model to date,” Qin noted in an email to TechForgePulse. “Flexibility here means flexible control over styles/emotions/accent etc, and can adapt to any language. Nobody could do this before, because it is too difficult. I lead a group of experienced AI scientists and spent several months to figure out the solution. We found that there is a very elegant way to decouple the difficult task into some doable subtasks to achieve what seems to be too difficult as a whole. The decoupled pipeline turns out to be very effective but also very simple.”

Who’s behind OpenVoice?

MyShell, founded in 2023 with a $5.6 million seed round led by INCE Capital with additional investment from Folius Ventures, Hashkey Capital, SevenX Ventures, TSVC, and OP Crypto, already counts over 400,000 users, according to The Saas News. I observed more than 61,000 users on its Discord server when I checked earlier while writing this piece.

The startup describes itself as a “decentralized and comprehensive platform for discovering, creating, and staking AI-native apps.”

In addition to offering OpenVoice, the company’s web app includes a host of different text-based AI characters and bots with different “personalities” — similar to Character.AI — including some NSFW ones. It also includes an animated GIF maker and user-generated text-based RPGs, some featuring copyrighted properties such as the Harry Potter and Marvel franchises.

How does MyShell plan to make any money if it is making OpenVoice open source? The company charges a monthly subscription for users of its web app, as well as for third-party bot creators who wish to promote their products within the app. It also charges for AI training data.

Correction: Thursday, January 4, 2023 – Piece was updated to remove an incorrect report stating MyShell is based in Calgary, AB, Canada.

TechForgePulse's mission is to be a digital town square for technical decision-makers to gain knowledge about transformative enterprise technology and transact. Discover our Briefings.