Amazing! Voice-to-Voice AI: A Breakthrough Achievement (sub 200ms latency)
Version 1.00: Original document posted. This document had some inaccuracies based on what I understood at the time; these have been updated and corrected in the current version.
Version 1.01: 08/07/2024: Updated this post with edits and corrections to reflect current information.
The team at Kyutai has launched the first real-time voice-to-voice AI model. The model processes voice input and output at blisteringly fast speeds, and I was blown away by their demo. This is a milestone for the AI community and shows how the AI space is still growing; my feeling is that we are going to see many more use cases for AI as this space matures.
The accomplishments of this team are truly remarkable. They have trained the voice-to-voice model directly on audio data to predict subsequent audio segments. By eliminating the conversion from audio to text and back to audio, they have created a model that is exceptionally fast, simply because those conversion steps no longer exist. This, combined with 8-bit quantization of the model, full duplex audio streams between the user and the backend software, and client-side digital signal processing of the audio using WebAssembly, makes for the first voice-to-voice AI that feels natural.
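To get a feel for why the 8-bit quantization matters for speed, here is a rough back-of-the-envelope sketch in TypeScript. The parameter count is an assumption made up for the example, not a figure published by Kyutai; the point is simply that halving the bytes per weight roughly halves the memory that has to be read for every chunk of audio the model generates.

```typescript
// Illustrative arithmetic only: the parameter count is an assumption for the
// example, not a published figure from Kyutai.
const params = 7e9; // assume a ~7B-parameter model for illustration
const fp16Gigabytes = (params * 2) / 1e9; // 16-bit weights: ~14 GB
const int8Gigabytes = (params * 1) / 1e9; // 8-bit quantized weights: ~7 GB
console.log(`fp16 weights: ${fp16Gigabytes.toFixed(1)} GB`);
console.log(`int8 weights: ${int8Gigabytes.toFixed(1)} GB`);
```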
Its speed is astonishing, making it the first instance where I was genuinely convinced I was interacting with a human. This illusion was not solely due to audio quality but also because of the model's ability to seamlessly handle interruptions in conversation, as humans naturally do. Although it is not perfect and remains a demo model at this point, it compellingly demonstrates the potential of this technology. During my interaction with the demo, I was particularly impressed by the model's speed and the fluidity of its responses. Let's look at the technology used:
Audio Quality at the Source: Audio is processed locally on your device (PC, laptop, mobile) using the browser's Web Audio API with AudioWorklets and an AudioWorkletProcessor. The AudioWorklets handle encoding and decoding: the encoder processes audio from your microphone, with JavaScript loading a WebAssembly module that encodes the audio as Ogg Opus for upstream transmission over a WebSocket to the remote server components. Opus is a lossy codec from the Xiph.Org Foundation, designed for efficient, low-latency, real-time audio communication and suitable even for low-end processors. Audio streaming back from the server arrives over the same WebSocket and is decoded from Ogg Opus. The decoder, like the encoder, is implemented in C/C++ and, for speed, compiled to WebAssembly with Emscripten and configured through JavaScript.
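As a minimal sketch of how such a capture path can be wired up in the browser: an AudioWorkletProcessor forwards raw microphone blocks to the main thread, which encodes them and sends them upstream over a WebSocket. The AudioWorklet, getUserMedia, and WebSocket calls are standard Web APIs; `encodeOpusFrame` is a stand-in I invented for the WebAssembly Ogg Opus encoder, so its name and signature are assumptions, not Kyutai's actual code.

```typescript
// capture-processor.ts — runs on the AudioWorklet thread and forwards each
// 128-sample block of microphone audio to the main thread.
class CaptureProcessor extends AudioWorkletProcessor {
  process(inputs: Float32Array[][]): boolean {
    const channel = inputs[0]?.[0];
    if (channel) this.port.postMessage(channel.slice()); // copy: the buffer is reused
    return true; // keep the processor alive
  }
}
registerProcessor("capture-processor", CaptureProcessor);

// main.ts — microphone -> worklet -> (assumed) WASM Opus encoder -> WebSocket.
declare function encodeOpusFrame(pcm: Float32Array): Uint8Array; // hypothetical WASM binding

async function startUpstream(serverUrl: string): Promise<void> {
  const ws = new WebSocket(serverUrl);
  ws.binaryType = "arraybuffer";

  const ctx = new AudioContext({ sampleRate: 48000 });
  await ctx.audioWorklet.addModule("capture-processor.js");

  const mic = await navigator.mediaDevices.getUserMedia({ audio: true });
  const source = ctx.createMediaStreamSource(mic);
  const worklet = new AudioWorkletNode(ctx, "capture-processor");
  source.connect(worklet);

  worklet.port.onmessage = (event: MessageEvent<Float32Array>) => {
    const frame = encodeOpusFrame(event.data); // lossy, low-latency encode in WASM
    if (ws.readyState === WebSocket.OPEN) ws.send(frame);
  };
}
```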
Bidirectional Fluid Conversations: The model supports natural, bidirectional conversations by maintaining two separate audio streams: one for incoming and one for outgoing audio. This setup, potentially combined with echo cancellation to isolate the speaker's voice, enables a seamless conversational flow.
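Continuing the sketch above, the downstream half of that full-duplex setup might look like this: a second worklet plays decoded server audio while the capture worklet keeps sending, and the browser's built-in echo cancellation can be requested through getUserMedia constraints. `decodeOpusFrame` is again an assumed stand-in for the WebAssembly decoder, and the buffering is deliberately oversimplified.

```typescript
// playback-processor.ts — plays audio arriving from the server while the
// capture worklet keeps sending, so the two streams stay independent.
class PlaybackProcessor extends AudioWorkletProcessor {
  private queue: Float32Array[] = [];

  constructor() {
    super();
    this.port.onmessage = (e: MessageEvent<Float32Array>) => this.queue.push(e.data);
  }

  process(_inputs: Float32Array[][], outputs: Float32Array[][]): boolean {
    const out = outputs[0][0];
    const next = this.queue.shift();
    if (next) out.set(next.subarray(0, out.length)); // leaves silence when the queue is empty
    return true;
  }
}
registerProcessor("playback-processor", PlaybackProcessor);

// main.ts — downstream path, reusing the WebSocket and AudioContext from the
// upstream sketch. decodeOpusFrame is a hypothetical WASM decoder binding.
declare function decodeOpusFrame(frame: ArrayBuffer): Float32Array;

async function startDownstream(ws: WebSocket, ctx: AudioContext): Promise<void> {
  await ctx.audioWorklet.addModule("playback-processor.js");
  const playback = new AudioWorkletNode(ctx, "playback-processor");
  playback.connect(ctx.destination);

  ws.onmessage = (event: MessageEvent) => {
    playback.port.postMessage(decodeOpusFrame(event.data as ArrayBuffer));
  };
}

// Requesting echo cancellation keeps the model's own voice out of the
// microphone stream, which is one way the speaker's voice could be isolated.
const micConstraints: MediaStreamConstraints = {
  audio: { echoCancellation: true, noiseSuppression: true },
};
```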
Minimized Latency: To reduce latency, the audio is compressed on its way from the browser's WebAssembly encoder to the server and again on the way back. This compression, combined with the full-duplex handling of the two audio streams, results in a highly responsive and user-friendly experience. From the information I have, they are using Rust on the backend, and this also helps with minimizing latency.
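To put the sub-200 ms figure from the title into perspective, here is a purely illustrative latency budget. Every number below is an assumption picked for the example rather than a measurement of Kyutai's system; it only shows how short Opus frames, fast WASM codecs, and a low-overhead backend can plausibly add up to under 200 ms end to end.

```typescript
// Illustrative latency budget; all values are assumptions, not measurements.
const budgetMs = {
  opusFrame: 20,        // audio buffered into one Opus frame before encoding
  encodeDecode: 5,      // WASM encode on the client plus decode of the reply
  networkRoundTrip: 60, // WebSocket round trip to the server (varies by location)
  modelFirstAudio: 90,  // time for the model to emit its first audio
  playoutBuffer: 20,    // small jitter buffer before the playback worklet
};
const total = Object.values(budgetMs).reduce((sum, ms) => sum + ms, 0);
console.log(`estimated end-to-end latency ≈ ${total} ms`); // ≈ 195 ms
```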
You can visit Kyutai's company website; you will find the demo at the bottom of their home page.