Language is fundamental to human interaction — but so, too, is the emotion behind it. 

Expressing happiness, sadness, anger, frustration or other feelings helps convey our messages and connect us. 

While generative AI has excelled in many areas, it has struggled to pick up on these nuances and process the intricacies of human emotion. 

Typecast, a startup using AI to create synthetic voices and videos, says it is revolutionizing this area with its new Cross-Speaker Emotion Transfer.

The technology allows users to apply emotions recorded from another’s voice to their own while maintaining their unique style, thus enabling faster, more efficient content creation. It is available today through Typecast’s My Voice Maker feature. 

“AI actors have yet to fully capture the emotional range of humans, which is their biggest limiting factor,” said Taesu Kim, CEO and cofounder of the Seoul, South Korea-based Neosapience and Typecast.

With the new Typecast Cross-Speaker Emotion Transfer, “anyone can use AI actors with real emotional depth based on only a small sample of their voice.” 

Decoding emotion

Emotions are usually grouped into six categories based on universal facial movements: happiness, sadness, anger, fear, surprise and disgust. But these categories are not enough to express the wide variety of emotions in generated speech, Kim noted. 

Speaking is not just a one-to-one mapping between given text and output speech, he pointed out.

“Humans can speak the same sentence in thousands of different ways,” he told TechForgePulse in an exclusive interview, adding that people can also convey different emotions within the same sentence, or even the same word. 

For example, generating the sentence “How can you do this to me?” with the emotion prompt “In a sad voice, as if disappointed” would sound completely different from the same sentence generated with the prompt “Angry, like scolding.” 

Similarly, an emotion described by the prompt “So sad because her father passed away but showing a smile on her face” is complicated and not easily assigned to any single category. 
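To make the contrast concrete, here is a minimal, hypothetical sketch of how such prompt-conditioned requests might be structured. `SynthesisRequest` and its fields are illustrative placeholders, not Typecast's actual API.

```python
# Hypothetical sketch: the same text paired with different free-form emotion
# prompts represents different synthesis targets. Not Typecast's real API.
from dataclasses import dataclass


@dataclass
class SynthesisRequest:
    text: str           # the sentence to speak
    emotion_prompt: str  # natural-language description of the delivery


requests = [
    SynthesisRequest("How can you do this to me?", "In a sad voice, as if disappointed"),
    SynthesisRequest("How can you do this to me?", "Angry, like scolding"),
]

for req in requests:
    # A real system would render each request to audio; here we only show that
    # identical text with different prompts defines different delivery targets.
    print(f"{req.text!r} -> style: {req.emotion_prompt!r}")
```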

“Humans can speak with different emotions and this leads to rich and diverse conversations,” Kim and other researchers write in a paper on their new technology.

Emotional text-to-speech limitations

Text-to-speech technology has seen significant gains in a short period of time, part of a broader generative AI wave led by models such as ChatGPT, LaMDA, Llama, Bard and Claude, along with other incumbents and new entrants. 

Emotional text-to-speech has shown considerable progress, too, but it requires a large amount of labeled data that is not easily accessible, Kim explained. Capturing the subtleties of different emotions through voice recordings has been time-consuming and arduous.

Furthermore, “it is extremely hard to record multiple sentences for a long time while consistently preserving emotion,” Kim and his colleagues write. 

In traditional emotional speech synthesis, all training data must have an emotion label, he explained. These methods often require additional emotion encoding or reference audio. 

But this poses a fundamental challenge, as labeled data must be available for every emotion and every speaker. Furthermore, existing approaches are prone to mislabeling and have difficulty extracting emotion intensity. 
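A toy illustration of that coverage requirement: with even a handful of speakers and emotions, a fully supervised approach needs labeled recordings for every combination, and gaps appear quickly. The speakers, emotions and corpus below are invented for illustration only.

```python
# Toy example of the data-coverage problem in fully supervised emotional TTS:
# every (speaker, emotion) pair needs labeled recordings, and gaps are common.
from itertools import product

speakers = ["speaker_a", "speaker_b", "speaker_c"]
emotions = ["neutral", "happy", "sad", "angry"]

# Hypothetical labeled corpus: pairs we actually have recordings for.
labeled = {
    ("speaker_a", "neutral"), ("speaker_a", "happy"),
    ("speaker_b", "neutral"), ("speaker_c", "sad"),
}

missing = [pair for pair in product(speakers, emotions) if pair not in labeled]
print(f"{len(missing)} of {len(speakers) * len(emotions)} combinations lack labeled data:")
for speaker, emotion in missing:
    print(f"  {speaker} has no labeled '{emotion}' recordings")
```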

Cross-speaker emotion transfer becomes even more difficult when an unseen emotion is assigned to a speaker. Such systems have so far performed poorly: emotional speech rendered in a neutral speaker's voice, rather than the original speaker's, tends to sound unnatural. Emotion intensity control is often not possible, either. 

“Even if it is possible to acquire an emotional speech dataset,” Kim and his fellow researchers write, “there is still a limitation in controlling emotion intensity.”

Leveraging deep neural networks, unsupervised learning

To address this problem, the researchers first input emotion labels into a generative deep neural network — what Kim called a world first. While successful, this method was not enough to express sophisticated emotions and speaking styles. 

The researchers then built an unsupervised learning algorithm that discerned speaking styles and emotions from a large database. During training, the entire model was trained without any emotion label, Kim said. 

This yielded numerical representations of the given speech. While not interpretable to humans, these representations can be used in text-to-speech algorithms to express the emotions present in a database. 

The researchers further trained a perception neural network to translate natural language emotion descriptions into representations. 
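A minimal PyTorch sketch of that general recipe, under stated assumptions: a style encoder summarizes reference audio into a latent style vector without any emotion labels, and a separate "perception" network maps a natural-language emotion description into the same latent space. The layer sizes, mel-spectrogram inputs and simple regression loss are illustrative choices, not the paper's exact architecture.

```python
# Illustrative sketch, not the published model: an unsupervised style encoder
# and a text-description ("perception") network that share a latent style space.
import torch
import torch.nn as nn


class StyleEncoder(nn.Module):
    """Maps a reference mel-spectrogram (batch, frames, n_mels) to a style vector."""
    def __init__(self, n_mels: int = 80, style_dim: int = 128):
        super().__init__()
        self.rnn = nn.GRU(n_mels, 256, batch_first=True)
        self.proj = nn.Linear(256, style_dim)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        _, hidden = self.rnn(mel)      # summarize the whole utterance
        return self.proj(hidden[-1])   # (batch, style_dim)


class PerceptionNet(nn.Module):
    """Maps a pooled text embedding of an emotion description to the same style space."""
    def __init__(self, text_dim: int = 512, style_dim: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(text_dim, 256), nn.ReLU(), nn.Linear(256, style_dim)
        )

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        return self.mlp(text_emb)      # (batch, style_dim)


# Training would pull matching pairs together; a plain regression loss is one
# of several reasonable choices for this sketch.
style_enc, percept = StyleEncoder(), PerceptionNet()
mel = torch.randn(4, 200, 80)    # fake reference-audio features
desc = torch.randn(4, 512)       # fake sentence embeddings of emotion descriptions
loss = nn.functional.mse_loss(percept(desc), style_enc(mel).detach())
print(loss.item())
```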

“With this technology, the user doesn’t need to record hundreds or thousands of different speaking styles/emotions because it learns from a large database of various emotional voices,” said Kim. 

Adapting to voice characteristics from just snippets

The researchers achieved “transferable and controllable emotion speech synthesis” by leveraging latent representation, they write. Domain adversarial training and cycle-consistency loss disentangle the speaker from style. 
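A hedged sketch of the domain adversarial part of that idea: a gradient-reversal layer lets a speaker classifier train normally while pushing the style representation to discard speaker identity. This follows the generic domain-adversarial training recipe rather than Typecast's exact formulation, and the cycle-consistency term mentioned in the paper is omitted here.

```python
# Generic domain-adversarial sketch for speaker/style disentanglement.
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam: float):
        ctx.lam = lam
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        # Flip the gradient flowing back into the style encoder.
        return -ctx.lam * grad_output, None


def grad_reverse(x: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
    return GradReverse.apply(x, lam)


style_dim, n_speakers = 128, 10
speaker_classifier = nn.Linear(style_dim, n_speakers)

# Pretend these style vectors came from a style encoder like the one above.
style = torch.randn(8, style_dim, requires_grad=True)
speaker_ids = torch.randint(0, n_speakers, (8,))

logits = speaker_classifier(grad_reverse(style))
adv_loss = nn.functional.cross_entropy(logits, speaker_ids)
adv_loss.backward()  # classifier learns speakers; the encoder receives reversed gradients
print(adv_loss.item())
```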

The technology learns from vast quantities of recorded human voices — via audiobooks, videos and other mediums — to analyze and understand emotional patterns, tones and inflections. 

The method successfully transfers emotion to a neutral reading-style speaker with just a handful of labeled samples, Kim explained, and emotion intensity can be controlled by an easy and intuitive scalar value.

This helps to achieve emotion transfer in a natural way without changing identity, he said. Users can record a basic snippet of their voice and apply a range of emotions and intensity, and the AI can adapt to specific voice characteristics. 

Users can select different types of emotional speech recorded by someone else and apply that style to their voice while still preserving their own unique voice identity. By recording just five minutes of their voice, they can express happiness, sadness, anger or other emotions, even if the original recording was made in a neutral tone.
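A simplified sketch of that inference-time idea: a speaker embedding derived from a short neutral recording is combined with an emotion style vector taken from someone else's speech, scaled by a single intensity value. The `ToySynthesizer` and its conditioning scheme are stand-ins for illustration, not the actual system.

```python
# Stand-in decoder showing identity-preserving conditioning with a scalar
# intensity knob; the real acoustic model would be far more elaborate.
import torch
import torch.nn as nn


class ToySynthesizer(nn.Module):
    """Conditions on a speaker embedding plus a scaled emotion style vector."""
    def __init__(self, spk_dim: int = 64, style_dim: int = 128, out_dim: int = 80):
        super().__init__()
        self.decoder = nn.Linear(spk_dim + style_dim, out_dim)

    def forward(self, speaker_emb, emotion_style, intensity: float):
        cond = torch.cat([speaker_emb, intensity * emotion_style], dim=-1)
        return self.decoder(cond)  # would feed a real acoustic decoder


speaker_emb = torch.randn(1, 64)     # from ~5 minutes of the user's neutral speech
emotion_style = torch.randn(1, 128)  # e.g. an "angry" style extracted from another speaker
synth = ToySynthesizer()

for intensity in (0.2, 0.6, 1.0):    # same identity, increasingly intense emotion
    frame = synth(speaker_emb, emotion_style, intensity)
    print(intensity, frame.shape)
```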

Typecast’s technology has been used by Samsung Securities (a Samsung Group subsidiary), LG Electronics and others in South Korea, and the company has raised $26.8 million since its founding in 2017. The startup is now working to apply its core speech synthesis technologies to facial expressions, Kim said.

Controllability critical to generative AI

The media environment is changing rapidly, Kim pointed out. 

In the past, text-based blogs were the most popular corporate media format. But now, short-form videos reign supreme, and companies and individuals must produce much more audio and video content, more frequently. 

“To deliver a corporate message, high-quality expressive voice is essential,” Kim said. 

Fast, affordable production is of utmost importance, he added — manual work by human actors is simply inefficient. 

“Controllability in generative AI is crucial to content creation,” said Kim. “We believe these technologies help ordinary people and companies to unleash their creative potential and improve their productivity.”
