Beijing Academy Unveils Emu3: Next-Gen Multimodal AI Unifying Text, Images, and Video

The Beijing Academy of Artificial Intelligence (BAAI) has unveiled Emu3, a groundbreaking multimodal world model that unifies the understanding and generation of text, images, and video through next-token prediction.

“Emu3 successfully validates that next-token prediction can serve as a powerful paradigm for multimodal models, scaling beyond language models and delivering state-of-the-art performance across multimodal tasks,” said Wang Zhongyuan, director of BAAI, in a press release.

By tokenizing images, text, and videos into a discrete space, Emu3 trains a single transformer from scratch on a mixture of multimodal sequences. This approach eliminates the need for diffusion or compositional methods entirely, streamlining the process of multimodal generation and perception.
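To make the idea concrete, here is a minimal sketch, in PyTorch, of next-token prediction over a shared discrete token space. Everything in it is illustrative rather than BAAI's implementation: the TinyMultimodalLM class, the split of the vocabulary between text and vision codes, and all sizes are assumptions; Emu3's actual tokenizers and transformer are far larger and are specified in its own release.

```python
import torch
import torch.nn as nn

# ASSUMPTION: vocabulary sizes are illustrative, not Emu3's real values.
TEXT_VOCAB = 32000    # text token ids occupy [0, TEXT_VOCAB)
VISION_VOCAB = 8192   # discrete image/video codes occupy the ids above that
VOCAB_SIZE = TEXT_VOCAB + VISION_VOCAB

class TinyMultimodalLM(nn.Module):
    """A single decoder-only transformer over one shared token vocabulary."""
    def __init__(self, vocab_size=VOCAB_SIZE, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        # Causal mask: each position may attend only to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.blocks(self.embed(tokens), mask=mask)
        return self.lm_head(h)

# A toy "multimodal sequence": text tokens followed by vision codes,
# concatenated into one flat stream the model treats uniformly.
text = torch.randint(0, TEXT_VOCAB, (2, 16))
vision = torch.randint(TEXT_VOCAB, VOCAB_SIZE, (2, 48))
sequence = torch.cat([text, vision], dim=1)

model = TinyMultimodalLM()
logits = model(sequence[:, :-1])  # predict token t+1 from tokens up to t
loss = nn.functional.cross_entropy(
    logits.reshape(-1, VOCAB_SIZE), sequence[:, 1:].reshape(-1)
)
loss.backward()
print(f"next-token loss: {loss.item():.3f}")
```

The point the sketch captures is the one BAAI emphasizes: once every modality is mapped into discrete ids in a single vocabulary, generation and perception reduce to the same autoregressive objective, with no separate diffusion pipeline.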

According to BAAI, Emu3 outperforms several well-established task-specific models in both generation and perception tasks. The organization has open-sourced the key technologies and models of Emu3 to the international technology community, fostering collaboration and innovation.
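For readers who want to try the released models, a minimal loading sketch using the Hugging Face transformers library follows. The repository id is an assumption based on BAAI's public releases; check https://huggingface.co/BAAI for the exact names before running.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# ASSUMPTION: repository id based on BAAI's public Hugging Face releases;
# verify the exact name at https://huggingface.co/BAAI.
model_id = "BAAI/Emu3-Chat"

# ASSUMPTION: the release ships custom modeling code, hence trust_remote_code.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
```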

Technology practitioners have noted that Emu3 opens a new path for exploring multimodality through a single unified architecture, removing the need to bolt complex diffusion models onto large language models.

“In the future, the multimodal world model will promote scenario applications such as robot brains, autonomous driving, multimodal dialogue, and inference,” Wang said, highlighting the potential impact of Emu3 on various industries.

The launch of Emu3 marks a significant step forward in artificial intelligence, offering a unified approach to understanding and generating diverse data types. This innovation positions BAAI at the forefront of AI research and development.
