Meta has recently introduced a new AI model named Spirit LM, which is designed to generate more expressive voices that can reflect a range of emotions such as anger, surprise, and happiness. This model is part of Meta’s broader initiative to advance machine intelligence through its research lab, Meta FAIR. Spirit LM is a multimodal model that can mix text and speech inputs and outputs, making it versatile for a wide range of applications. The model is open-source, allowing developers and researchers to explore and build on its capabilities in their own projects.
Multimodal AI chatbots are rapidly becoming a major trend, with a steady stream of new models emerging on platforms like GitHub. Meta AI, embracing its commitment to open-source development, has now introduced the Spirit LM model to tackle some of the existing challenges in multimodal technology. Early impressions suggest it’s a significant step forward.
Currently, ChatGPT’s Advanced Voice Mode allows for dynamic, expressive, and human-like responses, as highlighted by viral videos of the chatbot delivering flirtatious replies with remarkable finesse. However, the technology has yet to reach the level many users expect, even though it already surpasses competitors like Gemini Live in some areas.
In the background, Meta has been closely monitoring the landscape, and with the launch of Spirit LM, the company aims to elevate the quality of AI-generated speech, making it sound more natural than ever.
Meta explains that Spirit LM is built on a “7B pretrained text language model,” and highlights a crucial difference in approach. Many current multimodal AI models use Automatic Speech Recognition (ASR) to convert voice inputs into text. While this is effective, Meta points out that the process can strip away much of the expressive nuance in spoken language. To address this, Spirit LM takes a different approach, aiming to preserve the richness of expression in voice interactions.
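To make the distinction concrete, here is a minimal conceptual sketch, not the actual Spirit LM API: the token names, helper functions, and unit labels below are illustrative assumptions. The idea is that an ASR-based pipeline hands the language model plain text, discarding prosody, while an interleaved approach keeps speech-derived units (phonetic, pitch, and style tokens) in the same sequence as text tokens, so expressive cues survive into the model.

```python
# Conceptual sketch only; token names and helpers are illustrative,
# not the actual Spirit LM vocabulary or API.

def asr_style_input(transcript: str) -> list[str]:
    # An ASR pipeline gives the language model plain text: pitch,
    # emphasis, and emotion were discarded during transcription.
    return ["[TEXT]"] + transcript.split()

def interleaved_input(words: list[str], speech_units: list[str]) -> list[str]:
    # A Spirit LM-style sequence interleaves text tokens with
    # speech-derived units, so expressive information reaches the model.
    sequence = []
    for word, unit in zip(words, speech_units):
        sequence += ["[TEXT]", word, "[SPEECH]", unit]
    return sequence

if __name__ == "__main__":
    words = ["I", "can't", "believe", "it"]
    # Hypothetical phonetic/pitch/style unit strings for each word.
    units = ["[Hu12][Pitch3]", "[Hu47][Style2]", "[Hu98][Pitch1]", "[Hu5][Style7]"]
    print(asr_style_input("I can't believe it"))
    print(interleaved_input(words, units))
```

In the text-only pipeline, "I can't believe it" looks identical whether it was whispered flatly or shouted in surprise; in the interleaved sequence, the accompanying speech units carry that difference through to generation.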
For more details, see Meta’s official announcement.