Talk to https://t.co/CpQTspHXbi 🔊, the most modular voice AI around. Empower any text LLM with voice, instantly, by wrapping it with our new speech-to-text and text-to-speech. Any personality, any voice. Interruptible, smart turn-taking. We’ll open-source everything within the
“But what about Moshi?” Last year we unveiled Moshi, the first audio-native model. While Moshi provides unmatched latency and naturalness, it doesn’t yet match the extended abilities of text models such as function-calling, stronger reasoning capabilities, and in-context
Unmute’s speech-to-text is streaming, accurate, and includes a semantic VAD that predicts whether you’ve actually finished speaking or if you’re just pausing mid-sentence, meaning it’s low-latency but doesn’t interrupt you.
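To make the turn-taking idea concrete, here is a minimal sketch, not Unmute's actual implementation. It assumes the speech-to-text emits an end-of-turn probability alongside the transcript; the `TurnDetector` class, its thresholds, and the silence fallback are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class TurnDetector:
    """Toy turn-taking rule (hypothetical, not Unmute's real logic)."""

    end_of_turn_threshold: float = 0.6  # assumed cut-off on the model's probability
    min_silence_s: float = 0.2          # never respond before this much silence
    max_silence_s: float = 2.0          # assumed fallback: respond after a long silence

    def should_respond(self, end_of_turn_prob: float, silence_s: float) -> bool:
        # Too little silence: the user is most likely still talking.
        if silence_s < self.min_silence_s:
            return False
        # Very long silence: respond regardless of the semantic prediction.
        if silence_s >= self.max_silence_s:
            return True
        # In between, defer to the semantic prediction, so a mid-sentence
        # pause (low probability) does not trigger a reply.
        return end_of_turn_prob >= self.end_of_turn_threshold
```

With these assumed values, `TurnDetector().should_respond(0.1, 0.5)` is `False` (a mid-sentence pause), while `TurnDetector().should_respond(0.9, 0.3)` is `True` (a finished sentence followed by a short silence).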
The text LLM’s response is passed to our TTS, conditioned on a 10s voice sample. We’ll provide access to the voice cloning model in a controlled way. The TTS is also streaming *in text*, reducing the latency by starting to speak even before the full text response is generated.
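A rough sketch of what "streaming in text" can look like, under stated assumptions: the LLM yields text fragments one at a time, and a stand-in TTS consumes whole words as soon as they are complete. `fake_llm_tokens`, `words_from_tokens`, and `fake_tts` are hypothetical placeholders, not part of any Kyutai API.

```python
from typing import Iterable, Iterator


def fake_llm_tokens() -> Iterator[str]:
    """Stand-in for a streaming LLM: yields text fragments one at a time."""
    for tok in ["Hel", "lo ", "the", "re, ", "how ", "are ", "you", "?"]:
        yield tok


def words_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Emit each word as soon as it is complete, without waiting for the full reply."""
    buffer = ""
    for tok in tokens:
        buffer += tok
        # A space marks the end of a word; flush every complete word.
        while " " in buffer:
            word, buffer = buffer.split(" ", 1)
            if word:
                yield word
    # Flush whatever remains once the LLM stops generating.
    if buffer:
        yield buffer


def fake_tts(words: Iterable[str]) -> Iterator[bytes]:
    """Stand-in for a streaming TTS: one placeholder audio chunk per word."""
    for word in words:
        yield f"<audio:{word}>".encode()


if __name__ == "__main__":
    for chunk in fake_tts(words_from_tokens(fake_llm_tokens())):
        print(chunk)
```

Because every stage is a generator, the first audio chunk is produced after only two text fragments have been consumed, which is the latency benefit described above.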
What’s next? We strongly believe that the future of human-machine interaction lies in natural, full-duplex speech interactions, coupled with customization and extended abilities. Stay tuned for what’s to come!
@kyutai_labs I asked, "If you were your creator and you wanted to build an agent that could hit boundaries & grow instead of stop at your limit, what would you design, and what unproven ideas would you try to attempt?" Impressed with the conversation flow & depth of their knowledge. A
@kyutai_labs I like this evolution of Moshi. I want it, because all the frontier models' advanced voice modes are lobotomized.
@kyutai_labs So excited for this, you're the only account I have notifications turned on for. But please try to find a way for it to handle silence. No matter what I say, "Alright, take all the time you need" is immediately followed by "Are you there?" and so on, without end.
@kyutai_labs Great start, very useful. But it needs to be able to handle silence. "Don't talk, I'm thinking" never results in more than a few seconds of peace before it interjects.
@kyutai_labs Love this direction — voice is such a natural interface for agents! Open-sourcing is even better. Next step? Owning the models behind them. 🧠
@kyutai_labs What I found most interesting is the VAD: it works well, pausing and responding appropriately. Any plans to open-source it?
@kyutai_labs Very cool, can't wait to try it. What's the preferred hardware to run this for each model?
@kyutai_labs Very cool work, but the AI voice often stops abruptly when speaking. I was testing the "Dev (News)" option. Is this an implementation error in the cascaded system, or a limitation of the TTS?
@kyutai_labs This is absolutely brilliant. I have been trying Gemini Live and GPT-realtime, but they are too costly and the voice is not natural enough for casual talks. How big are these models? Will you also release docs on how to self-host?
@kyutai_labs how does it perform inside a moving car? had a long drive a couple weeks ago and tried chatgpt voice mode to pass the time, but the road noise kept interrupting it and making it re-start its responses. can you filter for just voices?
@kyutai_labs Great job @kyutai_labs team. By far the most natural-feeling conversation I have had with an AI to date. 👏