As part of its broader effort to remove language barriers and keep people connected, Meta has developed a multilingual foundation model that can understand speech or text in nearly 100 languages and generate translations into either modality, or both, in real time.
Officially dubbed SeamlessM4T (Massively Multilingual and Multimodal Machine Translation), the technology has been publicly released to help researchers build on the work and introduce universal applications capable of delivering speech-to-speech, speech-to-text, text-to-speech and text-to-text translations. It has been made available along with SeamlessAlign, a multimodal translation dataset totaling 265,000 hours of mined speech and text alignments.
The offering marks a significant development in AI’s application in linguistics given that it’s a single system performing multiple tasks across speech and text. Prior to this, the approach largely involved different systems for different tasks, such as a dedicated system for speech-to-speech translations.
What can SeamlessM4T do?
As Meta explains, SeamlessM4T implicitly recognizes the source language without the need for a separate language identification model. It can detect speech and text in nearly 100 languages and produce text in nearly as many and speech in 36 languages. Notably, it can also handle cases where more than one language is mixed in the same sentence, translating the whole utterance into a single target language (for example, a sentence spoken in a mix of Telugu and Hindi translated into English speech).
When tested with BLASER 2.0, which allows for evaluation across speech and text units, the model proved more robust to background noise and speaker variation in speech-to-text tasks (with average improvements of 37% and 48%, respectively) than the current state-of-the-art models.
“SeamlessM4T outperforms previous state-of-the-art competitors,” Meta said in a blog post. “We also significantly improve performance for low and mid-resource languages (with smaller digital footprint) supported, and maintain strong performance on high-resource languages (like English).”
As this work matures, it could lead to large-scale universal translation systems that allow people who speak different languages to communicate more effectively.
Notably, Google is also working in this direction and has announced the Universal Speech Model (USM), which can perform automatic speech recognition (ASR) for both widely spoken and under-resourced languages.
How it all works
To bring the model to life, Meta mined web data (tens of billions of sentences) and speech (4 million hours) from public sources and aligned them to create the SeamlessAlign dataset. In total, the company said it was able to align more than 443,000 hours of speech with texts and create about 29,000 hours of speech-to-speech alignments. Using this data, the company trained the multitask UnitY model to produce the desired multimodal outcomes.
“The multitask UnitY model consists of three main sequential components,” Meta explains. “Text and speech encoders have the task of recognizing inputs in nearly 100 languages. The text decoder then transfers that meaning into nearly 100 languages for text, followed by a text-to-unit model to decode into discrete acoustic units for 36 speech languages…The decoded discrete units are then converted into speech using a multilingual HiFi-GAN unit vocoder.”
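The data flow Meta describes can be made concrete with a structural sketch: encode the input, decode it into target-language text, convert that text into discrete acoustic units, then vocode the units into a waveform. Every component below is a trivial stub standing in for a trained network (the real system uses transformer encoders/decoders and a HiFi-GAN vocoder), so only the pipeline shape, not the behavior, reflects UnitY.

```python
# Structural sketch of the three-stage UnitY pipeline described above.
# All stages are stubs; only the input -> text -> units -> waveform
# data flow mirrors the real model.
from typing import List


class UnitYStub:
    def encode(self, source: str) -> List[float]:
        # Stand-in for the speech/text encoder (source language
        # would be recognized implicitly here).
        return [float(ord(c)) for c in source]

    def decode_text(self, embedding: List[float], tgt_lang: str) -> str:
        # Stand-in for the text decoder producing target-language text.
        return f"[{tgt_lang}] " + "".join(chr(int(x)) for x in embedding)

    def text_to_units(self, text: str) -> List[int]:
        # Stand-in for the text-to-unit model emitting discrete acoustic units.
        return [ord(c) % 1000 for c in text]

    def vocode(self, units: List[int]) -> List[float]:
        # Stand-in for the multilingual HiFi-GAN unit vocoder
        # (discrete units -> waveform samples).
        return [u / 1000.0 for u in units]

    def translate_to_speech(self, source: str, tgt_lang: str) -> List[float]:
        emb = self.encode(source)
        text = self.decode_text(emb, tgt_lang)
        units = self.text_to_units(text)
        return self.vocode(units)


model = UnitYStub()
waveform = model.translate_to_speech("hola mundo", "eng")
```

One consequence of this design is visible even in the stub: speech output is produced from the intermediate text and unit representations, which is why the model can serve speech-to-text and text-to-text tasks by simply stopping the pipeline early.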
Not perfect yet
That said, it is important to note that SeamlessM4T is far from perfect right now. Evaluations found that the model exhibits both added toxicity (although 63% less than state-of-the-art models) and gender bias.
According to a whitepaper detailing the technology, SeamlessM4T overgeneralizes to masculine forms when translating from gender-neutral terms (with an average preference of approximately 10%) and shows a lack of robustness, on the order of 3%, when gender is varied.
“We detect toxicity in both the input and the output for the demo,” Meta said. “If toxicity is only detected in the output, it means that toxicity is added. In this case, we include a warning and do not show the output…Regarding bias, we have started our efforts on evaluating gender bias in languages at scale. We are now able to quantify gender bias in dozens of speech translation directions by extending to speech our previously designed Multilingual HolisticBias dataset.”
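The gating rule Meta describes, checking both sides and suppressing output only when toxicity was added by the model, is simple to express in code. The word-list classifier below is a trivial stand-in (Meta's actual toxicity detector is a learned model), so treat this as a sketch of the control flow only.

```python
# Sketch of the demo's toxicity-gating logic: toxicity is detected on
# both input and output, and the output is suppressed (with a warning)
# only when toxicity appears in the output but not the input,
# i.e. when the model has *added* it.
from typing import Optional

TOXIC_WORDS = {"badword"}  # placeholder lexicon, not Meta's classifier


def is_toxic(text: str) -> bool:
    return any(word in text.lower().split() for word in TOXIC_WORDS)


def gate_output(source: str, translation: str) -> Optional[str]:
    """Return the translation, or None if toxicity was added by the model."""
    if is_toxic(translation) and not is_toxic(source):
        # Added toxicity: warn the user and withhold the output.
        return None
    return translation
```

Note the asymmetry: a toxic input that yields a toxic output passes through, because the model merely preserved what was already there rather than introducing it.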
The company emphasized that this is an ongoing effort, and that it will continue to research and take action in these areas to further improve the robustness and safety of the SeamlessM4T model.