Kyutai Releases Moshi, an Open Source Voice Model Ahead of OpenAI 


Users on X have no chill. When OpenAI released o1, one of them asked whether voice features would be launching soon. “How about a couple of weeks of gratitude for magic intelligence in the sky, and then you can have more toys soon?” replied Sam Altman, with a tinge of sarcasm.

A couple of weeks later, Kyutai, a French non-profit AI research laboratory, came to the rescue with Moshi, a native multimodal foundation model capable of conversing with humans in real time, much like what OpenAI’s Advanced Voice Mode was intended to do.

“Are you slowly losing faith in the objective reality and existence of Advanced Voice Mode? Talk to Moshi instead,” posted OpenAI co-founder Andrej Karpathy on X.

We release two Moshi models, adapted from our demo by replacing Moshi’s voice with artificially generated ones, one male and one female. We are looking forward to hearing what the community will build with it, and we thank everyone that helped for this release: @HuggingFace:… pic.twitter.com/jVfk4rE2p9

— kyutai (@kyutai_labs) September 18, 2024

The standout feature of Moshi is its open-source nature, which allows it to run locally, even on Apple MacBooks. Kyutai Labs has launched three models: Moshi and its fine-tuned variants with synthetic voices, Moshiko (male) and Moshika (female), as well as the speech codec Mimi. These models are available in PyTorch, MLX (for macOS), and Rust implementations.

On an Apple Mac, the model functions as a conversational agent for casual dialogue, basic information and advice (including recipes and trivia), and roleplay. While it facilitates smooth, low-latency interactions, it has limited capabilities for complex tasks and does not support tool integration, according to the company.

The company said it has tested the MLX version on a MacBook Pro M3. Quantisation is currently not supported for the PyTorch version, so users will need a GPU with at least 24 GB of memory. Those using the Rust backend need the latest version of the Rust toolchain and, for GPU support, a working CUDA installation that includes the nvcc compiler.
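For reference, here is a minimal sketch of what launching the PyTorch backend locally could look like. The `moshi` pip package name and the `moshi.server` entry point are assumptions taken from the kyutai-labs/moshi repository; the memory check simply mirrors the 24 GB requirement mentioned above.

```python
# Minimal sketch: check the prerequisites described above, then launch the
# PyTorch backend's local server. The `moshi` package name and the
# `moshi.server` module are assumptions based on the kyutai-labs/moshi
# repository; check its README for the exact, current entry points.
import subprocess
import sys

import torch

MIN_VRAM_GB = 24  # no quantisation for the PyTorch version, hence the large requirement

if not torch.cuda.is_available():
    sys.exit("No CUDA device found; use the MLX (macOS) or Rust backend instead.")

vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
if vram_gb < MIN_VRAM_GB:
    sys.exit(f"Only {vram_gb:.1f} GB of GPU memory available; at least {MIN_VRAM_GB} GB is needed.")

# Serves a local web UI for talking to Moshi (install first with `pip install moshi`).
subprocess.run([sys.executable, "-m", "moshi.server"], check=True)
```

The MLX and Rust backends are launched differently, so treat this as a template rather than the exact commands; the repository’s README covers those paths.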

Overall, it provides a compelling alternative to OpenAI’s Advanced Voice Mode, which is costly and does not offer local deployment. Nonetheless, Moshi still has room for improvement.

“I find the Moshi model personality to be very amusing: it is a bit abrupt, it interrupts, it is a bit rude but somehow in a kind of endearing way, it goes off on tangents, it goes silent for no reason sometimes, so it’s all a bit confusing but also very funny and meme-worthy,” quipped Karpathy.

Sharing a similar experience to Karpathy, Elvis Saravia, co-founder of Dair.AI, said, “Moshi is a bit abrupt, interrupts frequently, and sometimes ignores questions in the conversation. I almost lost my patience during the brief interaction I had with it. There’s a lot of work to be done, but it’s exciting to see the open-source artifacts released.”

Karpathy shared his excitement about using this voice interaction on his MacBook, calling it “cool.” He pointed out that the repository and a detailed paper are accessible on GitHub. “I’m looking forward to engaging with our computers in a seamless, end-to-end way, avoiding intermediate text representations that often strip away important information,” he said.

How Moshi Works

Moshi comprises three key elements: Helium, a 7-billion-parameter language model trained on 2.1 trillion tokens; Mimi, a neural audio codec that captures both semantic and acoustic details; and a novel multi-stream architecture that processes audio from the user and from Moshi on separate channels.

It operates as a full-duplex system, processing two audio streams simultaneously: one coming from the user and one generated by Moshi. It uses Mimi to compress the audio and keep latency low, achieving an overall latency of about 200 milliseconds on an L4 GPU.
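To make the multi-stream idea concrete, here is a conceptual simulation of the full-duplex loop. It is not Kyutai’s API, and the stand-in encoder and model just return random tokens, but the bookkeeping follows the description above: two parallel token streams advance one 80 ms frame at a time, and Moshi’s own output is fed back in as context for the next frame.

```python
# Conceptual simulation of the full-duplex loop, not Kyutai's actual API.
import random

FRAME_SECONDS = 0.080    # Mimi's 12.5 Hz frame rate
NUM_CODEBOOKS = 8        # tokens per frame in each stream
AUDIO_VOCAB = 2048       # assumed codebook size

def mimi_encode(_mic_frame):
    """Stand-in for Mimi's streaming encoder: 8 discrete tokens per 80 ms frame."""
    return [random.randrange(AUDIO_VOCAB) for _ in range(NUM_CODEBOOKS)]

def moshi_step(user_tokens, prev_moshi_tokens):
    """Stand-in for the model: the next frame of Moshi's stream plus a text token."""
    text_token = random.randrange(32_000)
    next_tokens = [random.randrange(AUDIO_VOCAB) for _ in range(NUM_CODEBOOKS)]
    return text_token, next_tokens

moshi_tokens = [0] * NUM_CODEBOOKS
for frame in range(5):                                  # five frames = 400 ms of dialogue
    user_tokens = mimi_encode(None)                     # listen: encode the incoming mic frame
    text_token, moshi_tokens = moshi_step(user_tokens, moshi_tokens)
    # mimi_decode(moshi_tokens) would turn this into the next 80 ms of Moshi's speech
    print(f"t={frame * FRAME_SECONDS:.2f}s  text={text_token}  audio={moshi_tokens[:3]}...")
```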

Mimi handles audio at 24 kHz, compressing it to a 12.5 Hz representation with a bandwidth of just 1.1 kbps. It does this in a streaming fashion with a latency of 80 ms, the length of one frame, while still outperforming existing non-streaming codecs such as SpeechTokenizer (50 Hz, 4 kbps) and SemantiCodec (50 Hz, 1.3 kbps).
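The quoted figures are mutually consistent. A quick back-of-the-envelope check, assuming the 1.1 kbps stream is made of 8 codebooks of 2048 entries per frame (a split the article does not spell out, so treat it as an assumption):

```python
import math

# Sanity-check the quoted Mimi figures. The split into 8 codebooks of 2048
# entries per frame is an assumption that makes the 1.1 kbps figure work out;
# the article only gives the aggregate rates.
SAMPLE_RATE_HZ = 24_000   # input audio sample rate
FRAME_RATE_HZ = 12.5      # token frames per second
CODEBOOKS = 8             # discrete tokens per frame (assumed)
CODEBOOK_SIZE = 2048      # 2**11 entries, i.e. 11 bits per token (assumed)

samples_per_frame = SAMPLE_RATE_HZ / FRAME_RATE_HZ                 # 1920 audio samples
frame_duration_ms = 1000 / FRAME_RATE_HZ                           # 80 ms, the quoted latency
bits_per_token = int(math.log2(CODEBOOK_SIZE))                     # 11 bits
bitrate_kbps = CODEBOOKS * bits_per_token * FRAME_RATE_HZ / 1000   # 1.1 kbps

print(f"{samples_per_frame:.0f} samples -> one {frame_duration_ms:.0f} ms frame")
print(f"{CODEBOOKS} tokens x {bits_per_token} bits x {FRAME_RATE_HZ} Hz = {bitrate_kbps} kbps")
```

Eight tokens of 11 bits each, 12.5 times per second, gives exactly 1.1 kbps, and one frame spans 1,920 samples, or 80 ms, of 24 kHz audio.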

Moshi’s architecture also features a small Depth Transformer to handle inter-codebook dependencies and a large Temporal Transformer to model temporal dependencies. This setup supports text token generation alongside audio streams, improving the overall quality of dialogue generation.

During inference, the user’s audio stream is captured from the input, while Moshi’s audio stream is sampled from the model’s output. By predicting text tokens corresponding to its own speech, an internal monologue of sorts, Moshi greatly improves the accuracy and quality of its responses.
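A schematic sketch of one generation step helps tie the last two paragraphs together. This is not Kyutai’s code, and exactly which network predicts which sub-token is simplified, but the ordering matches the description: the large Temporal Transformer summarises the conversation so far once per 80 ms frame, a text token (the inner monologue) is sampled first, and the small Depth Transformer then fills in the frame’s audio codebook tokens one by one, each conditioned on the tokens already emitted.

```python
# Schematic sketch of one Moshi generation step, not Kyutai's implementation.
import random

NUM_CODEBOOKS = 8       # audio tokens per frame in Moshi's own stream
AUDIO_VOCAB = 2048      # assumed codebook size
TEXT_VOCAB = 32_000     # assumed text vocabulary size

def temporal_transformer(history):
    """Stand-in for the large model: summarise all previous frames of both streams."""
    return hash(tuple(history)) % 100_000

def sample_token(seed, vocab):
    """Stand-in sampler used for both the text head and the Depth Transformer."""
    random.seed(seed)
    return random.randrange(vocab)

def generate_frame(history):
    context = temporal_transformer(history)               # large model, once per 80 ms frame
    text_token = sample_token(context, TEXT_VOCAB)         # inner-monologue text token first
    audio_tokens = []
    for k in range(NUM_CODEBOOKS):                         # small Depth Transformer: one audio
        prefix = text_token + sum(audio_tokens) + k        # codebook token at a time, conditioned
        audio_tokens.append(sample_token(context + prefix, AUDIO_VOCAB))  # on those already emitted
    return text_token, audio_tokens

text, audio = generate_frame(history=[0, 1, 2])
print("text token:", text, "audio tokens:", audio)
```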

Moshi is Not Alone

Hume AI recently introduced EVI 2, a new foundational voice-to-voice AI model that promises to enhance human-like interactions. Available in beta, EVI 2 can engage in rapid, fluent conversations with users, interpreting tone and adapting its responses accordingly. The model supports a variety of personalities, accents, and speaking styles and includes multilingual capabilities.

Meanwhile, Amazon Alexa is partnering with Anthropic to improve its conversational abilities, making interactions more natural and human-like. Earlier this year, Google unveiled Astra, a ‘universal AI agent’ built on the Gemini family of AI models. Astra features multimodal processing, enabling it to understand and respond to text, audio, video, and visual inputs simultaneously.
