Google DeepMind quietly revealed a significant advance in its artificial intelligence (AI) research on Tuesday, presenting a new autoregressive model aimed at improving the understanding of long video inputs.

The new model, named “Mirasol3B,” demonstrates a groundbreaking approach to multimodal learning, processing audio, video, and text data in a more integrated and efficient manner.

In a lengthy blog post about the research, Isaac Noble, a software engineer at Google Research, and Anelia Angelova, a research scientist at Google DeepMind, write that the challenge of building multimodal models lies in the heterogeneity of the modalities.

“Some of the modalities might be well synchronized in time (e.g., audio, video) but not aligned with text,” they explain. “Furthermore, the large volume of data in video and audio signals is much larger than that in text, so when combining them in multimodal models, video and audio often cannot be fully consumed and need to be disproportionately compressed. This problem is exacerbated for longer video inputs.”
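
To get a feel for the imbalance the researchers describe, consider a rough back-of-the-envelope count. The numbers below are illustrative assumptions, not figures from the paper: even a short clip, tokenized frame by frame, produces vastly more tokens than the text that accompanies it.

```python
# Back-of-the-envelope illustration of the volume gap the researchers describe.
# All numbers here are assumptions for illustration, not figures from the paper.
frames = 5 * 60 * 2              # a 5-minute video sampled at 2 frames per second
patches_per_frame = 196          # e.g., a 14x14 grid of visual patches per frame
video_tokens = frames * patches_per_frame

text_tokens = 100                # a typical title plus a short description

print(f"video tokens: {video_tokens:,}")                 # 117,600
print(f"text tokens:  {text_tokens:,}")                  # 100
print(f"ratio:        ~{video_tokens // text_tokens}x")  # ~1176x
```

At that scale, fitting minutes or hours of video into a fixed context window forces exactly the disproportionate compression the authors describe.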

A new approach to multimodal learning

In response to this complexity, Google’s Mirasol3B model decouples multimodal modeling into separate focused autoregressive models, processing inputs according to the characteristics of the modalities. 

“Our model consists of an autoregressive component for the time-synchronized modalities (audio and video) and a separate autoregressive component for modalities that are not necessarily time-aligned but are still sequential, e.g., text inputs, such as a title or description,” Noble and Angelova explain.
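
In code, that decoupling might look roughly like the sketch below, written here in PyTorch. This is not Google's implementation: the module names, dimensions, and cross-attention wiring are all assumptions used to illustrate the idea of one causal (autoregressive) component over fused audio-video features and a second one over text that consumes those features without requiring frame-level alignment.

```python
import torch
import torch.nn as nn


def causal_mask(t: int) -> torch.Tensor:
    # Upper-triangular -inf mask: position i may attend only to positions <= i.
    return torch.triu(torch.full((t, t), float("-inf")), diagonal=1)


class AudioVideoAutoregressor(nn.Module):
    """Autoregressive component over the time-synchronized audio-video tokens."""

    def __init__(self, d_model: int = 256, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, av_tokens: torch.Tensor) -> torch.Tensor:
        # The causal mask keeps this component autoregressive in time.
        return self.encoder(av_tokens, mask=causal_mask(av_tokens.size(1)))


class TextAutoregressor(nn.Module):
    """Autoregressive component over text, which is sequential but not
    time-aligned with audio and video; it consumes the audio-video
    representations through cross-attention rather than strict alignment."""

    def __init__(self, vocab_size: int = 32000, d_model: int = 256,
                 n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, text_ids: torch.Tensor,
                av_memory: torch.Tensor) -> torch.Tensor:
        x = self.embed(text_ids)
        h = self.decoder(x, av_memory, tgt_mask=causal_mask(x.size(1)))
        return self.lm_head(h)


# Illustrative usage with random stand-in features.
av_features = torch.randn(1, 64, 256)         # 64 fused audio-video tokens
title_ids = torch.randint(0, 32000, (1, 16))  # 16 text tokens, e.g. a title
av_model, text_model = AudioVideoAutoregressor(), TextAutoregressor()
logits = text_model(title_ids, av_model(av_features))
print(logits.shape)  # torch.Size([1, 16, 32000])
```

Keeping the two components separate lets each operate at a sequence length and granularity suited to its modality, rather than forcing audio, video, and text into a single combined token stream.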

The announcement comes at a time when the tech industry is striving to harness the power of AI to analyze and understand vast amounts of data across different formats. Google’s Mirasol3B represents a significant step forward in this endeavor, opening up new possibilities for applications such as video question answering, particularly on longer videos.

Image credit: Google Research

Potential applications for YouTube

One possible application Google might explore is YouTube, the world’s largest online video platform and one of the company’s main sources of revenue.

The model could theoretically enhance user experience and engagement with multimodal features such as generating captions and summaries for videos, answering questions and providing feedback, personalizing recommendations and advertisements, and letting users create and edit their own videos using multimodal inputs and outputs.

For example, the model could generate captions and summaries based on both the visual and audio content, and let users search and filter videos by keyword, topic, or sentiment. That could improve the accessibility and discoverability of videos and help users find the content they are looking for more quickly.

The model could also theoretically answer questions and provide feedback to users based on the video content, such as explaining the meaning of a term, providing additional information or resources, or suggesting related videos or playlists.

A mixed reaction from the AI community

The announcement has generated a lot of interest and excitement in the artificial intelligence community, as well as some skepticism and criticism. Some experts have praised the model for its versatility and scalability, and expressed their hopes for its potential applications in various domains.

For instance, Leo Tronchon, an ML research engineer at Hugging Face, tweeted: “Very interesting to see models like Mirasol incorporating more modalities. There aren’t many strong models in the open using both audio and video yet. It would be really useful to have it on [Hugging Face].”

Gautam Sharda, a computer science student at the University of Iowa, tweeted: “Seems like there’s no code, model weights, training data, or even an API. Why not? I’d love to see them actually release something beyond just a research paper.”

A significant milestone for the future of AI

The announcement marks a significant milestone in the field of artificial intelligence and machine learning, and demonstrates Google’s ambition and leadership in developing cutting-edge technologies that can enhance and transform human lives.

However, it also poses a challenge and an opportunity for researchers, developers, regulators, and users of AI, who need to ensure that the model and its applications align with society’s ethical, social, and environmental values and standards.

As the world becomes more multimodal and interconnected, it is essential to foster a culture of collaboration, innovation, and responsibility among the stakeholders and the public, and to create a more inclusive and diverse AI ecosystem that can benefit everyone.
