
Neil Zeghidour
@neilzegh
Followers: 4K · Following: 2K · Media: 47 · Statuses: 472
Founder @kyutai_labs working on audio generation. Co-invented SoundStream, AudioLM, MusicLM, Moshi. Previously @GoogleDeepMind and @metaai.
Paris
Joined January 2020
Today we release Hibiki, real-time speech translation that runs on your phone. Adaptive flow without a fancy policy: simple temperature sampling of a multistream audio-text LM. Very proud of @tom_labiausse's work as an intern.
Meet Hibiki, our simultaneous speech-to-speech translation model, currently supporting 🇫🇷➡️🇬🇧. Hibiki produces spoken and text translations of the input speech in real-time, while preserving the speaker’s voice and optimally adapting its pace based on the semantic content of the…
12 replies · 53 reposts · 402 likes
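Roughly what "simple temperature sampling" means here: each next token is drawn from the softmax of the logits divided by a temperature. A minimal NumPy sketch; the vocabulary size and temperature value are illustrative, not Hibiki's actual settings:

```python
import numpy as np

def sample_with_temperature(logits: np.ndarray, temperature: float = 0.8) -> int:
    """Draw one token id from softmax(logits / temperature)."""
    scaled = logits / temperature
    scaled -= scaled.max()                     # numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))

# Illustrative usage: at each frame the LM emits logits over its vocabulary;
# lower temperatures make the model commit more cautiously.
logits = np.random.randn(1024)                 # hypothetical vocab of 1024
token = sample_with_temperature(logits, temperature=0.8)
```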
🚀 New models: ARC-Encoders. We introduce a lightweight encoder that compresses context into continuous representations for LLMs, reducing inference cost while preserving performance. Our Adaptable text Representations Compressor, named ARC-Encoder, achieves large efficiency gains…
2 replies · 23 reposts · 167 likes
Introducing General Intuition and our $133.7M Seed from Khosla Ventures, General Catalyst, and Raine. We build foundation models and general agents for environments that require deep spatial and temporal reasoning.
106 replies · 97 reposts · 2K likes
Training a VQ-VAE on Fashion MNIST for the explainer of neural audio codecs I'm working on at @kyutai_labs.
0 replies · 4 reposts · 34 likes
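As background for the explainer mentioned above: the core of a VQ-VAE is a nearest-neighbor codebook lookup with a straight-through gradient (van den Oord et al., 2017). A minimal PyTorch sketch with illustrative sizes, not the actual training setup:

```python
import torch
import torch.nn.functional as F

class VectorQuantizer(torch.nn.Module):
    """Nearest-neighbor codebook lookup with a straight-through gradient."""

    def __init__(self, num_codes: int = 512, dim: int = 64, beta: float = 0.25):
        super().__init__()
        self.codebook = torch.nn.Embedding(num_codes, dim)
        self.beta = beta

    def forward(self, z):                      # z: (batch, dim) encoder outputs
        dists = torch.cdist(z, self.codebook.weight)
        idx = dists.argmin(dim=-1)             # closest code per vector
        q = self.codebook(idx)
        # Codebook loss pulls codes toward encodings; the commitment loss
        # (scaled by beta) keeps the encoder close to its chosen codes.
        loss = F.mse_loss(q, z.detach()) + self.beta * F.mse_loss(z, q.detach())
        q = z + (q - z).detach()               # straight-through estimator
        return q, idx, loss

zq, codes, vq_loss = VectorQuantizer()(torch.randn(8, 64))  # toy batch
```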
We released the pre-print describing the TTS and ASR behind our recent Unmute demo: https://t.co/TdfE4yw5lK 👨‍🔬🔬 We leverage the same architecture as in Moshi, with the text tokens either late (ASR) or in advance (TTS) compared to the audio tokens. 👇 @kyutai_labs @neilzegh
1 reply · 2 reposts · 27 likes
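The late/ahead arrangement can be pictured as padding one stream relative to the other before interleaving them frame by frame. A toy sketch with hypothetical names, not the paper's actual tokenization:

```python
PAD = "<pad>"

def align_streams(text, audio, text_delay):
    """Pad one stream so text tokens lag (ASR) or lead (TTS) the audio.

    text_delay > 0: text arrives after the audio it transcribes (ASR).
    text_delay < 0: text arrives before the audio it drives (TTS).
    """
    text, audio = list(text), list(audio)
    if text_delay >= 0:
        text = [PAD] * text_delay + text
    else:
        audio = [PAD] * (-text_delay) + audio
    n = max(len(text), len(audio))
    text += [PAD] * (n - len(text))
    audio += [PAD] * (n - len(audio))
    return list(zip(text, audio))

# Toy example: two text tokens trailing three audio frames by two steps.
print(align_streams(["hi", "!"], ["a0", "a1", "a2"], text_delay=2))
# [('<pad>', 'a0'), ('<pad>', 'a1'), ('hi', 'a2'), ('!', '<pad>')]
```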
We're building the data infrastructure that AI actually needs. Current systems were built for humans reading dashboards. But an H100 can consume 4 million images per second. The future isn't human-scale. It's machine-scale. Introducing Spiral: Data 3.0 🌀 1/8
13 replies · 38 reposts · 386 likes
What if you could not only watch a generated video, but explore it too? 🌐 Genie 3 is our groundbreaking world model that creates interactive, playable environments from a single text prompt. From photorealistic landscapes to fantasy realms, the possibilities are endless. 🧵
835 replies · 3K reposts · 14K likes
I’m happy to share that I’ll be attending ICML 2025 in Vancouver next week to present 𝐇𝐢𝐛𝐢𝐤𝐢 [https://t.co/q50LBOvM1h] 🇫🇷🇬🇧, Kyutai’s real-time and expressive speech translation system. I'll be presenting the poster on Wednesday, July 16 at 4:30PM. Feel free to stop by! 💬
2 replies · 9 reposts · 61 likes
Thrilled to finally share what we've been working on for months at @huggingface 🤝 @pollenrobotics Our first robot: Reachy Mini A dream come true: cute and low priced, hackable yet easy to use, powered by open-source and the infinite community. Tiny price, small size, huge…
239 replies · 528 reposts · 3K likes
We’re releasing the top 3B model out there: SOTA performance. It has dual-mode reasoning (with or without think), extended long context up to 128k, and it’s multilingual with strong support for en, fr, es, de, it, pt. What more do you need? Oh yes, we’re also open-sourcing all…
15 replies · 83 reposts · 478 likes
After releasing the best streaming STT, we’re releasing our state-of-the-art TTS and the code to run Unmute yourself and host your own custom voice AIs. Enjoy!
Kyutai TTS and Unmute are now open source! The text-to-speech is natural, customizable, and fast: it can serve 32 users with a 350ms latency on a single L40S. Try it out and get started on the project page: https://t.co/B4P9FuOrQc
3 replies · 4 reposts · 51 likes
lmao just found out I trained a multilingual ASR without knowing it 😭. This is most likely a side-effect of non-FR/EN examples slipping through our language filtering, and it just makes me want to train a proper multilingual one with @n0mad_0. Thanks @TommyFalkowski!
It works! We can just inject a foreign language audio snippet at the beginning to make the kyutai stt 1B realtime model use languages other than English and French. I got it working for Spanish, German and Japanese! @kyutai_labs
5 replies · 3 reposts · 36 likes
New speech-to-text model from @kyutai_labs. The best streaming model on the HF leaderboard, and it runs on your Mac or iPhone with MLX:
Our latest open-source speech-to-text model just claimed 1st place among streaming models and 5th place overall on the OpenASR leaderboard 🥇🎙️ While all other models need the whole audio, ours delivers top-tier accuracy on streaming content. Open, fast, and ready for production!
1 reply · 10 reposts · 117 likes
A quick second of fame: Kyutai-STT gets into the top 5 of the OpenASR leaderboard, and unlike others, our model works in a streaming regime and starts transcribing before the entire audio is available!
Kyutai Speech-To-Text is now open-source! It’s streaming, supports batched inference, and runs blazingly fast: perfect for interactive applications. Check out the details here: https://t.co/bQMP56XaKC
2 replies · 9 reposts · 24 likes
My latest work in ASR was with waveform-based CNNs back in 2018. Got back to it to ship yet another uniquely capable model based on Moshi's framework and powering https://t.co/QF9VcENGxk! Batched, streaming inference (you typically get just one of the two) is the game changer here.
Kyutai Speech-To-Text is now open-source! It’s streaming, supports batched inference, and runs blazingly fast: perfect for interactive applications. Check out the details here: https://t.co/bQMP56XaKC
2 replies · 6 reposts · 46 likes
🤗 We're open-sourcing Holo1 model weights & the WebClick dataset to accelerate agentic AI research! Find them on our HuggingFace page: huggingface.co
1 reply · 6 reposts · 24 likes
There is so much more to voice interaction than assistants. Get creative with https://t.co/QF9VcEN8HM.
Using Unmute with a custom voice and prompt to create a very intense ice cream seller, inspired by Justin Kuritzkes' sketch 🍦
0 replies · 2 reposts · 13 likes
Unmute is our new cascaded voice assistant: fast, accurate, and flexible. It doesn't have the full-duplex, zero-latency interaction of Moshi, but you can change the voice with a 10s sample and plug in any LLM. A good playground for testing custom voice AIs.
Talk to https://t.co/CpQTspHXbi 🔊, the most modular voice AI around. Empower any text LLM with voice, instantly, by wrapping it with our new speech-to-text and text-to-speech. Any personality, any voice. Interruptible, smart turn-taking. We’ll open-source everything within the…
2 replies · 9 reposts · 66 likes
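The cascaded design above reduces to three swappable stages. A minimal sketch with placeholder callables; none of these names come from the actual Unmute API:

```python
def cascaded_turn(audio_in, stt, llm, tts, system_prompt):
    """One exchange: transcribe, generate a text reply, synthesize it.

    `stt`, `llm`, and `tts` are any callables with these shapes; that is
    the point of a cascade: every stage is swappable.
    """
    user_text = stt(audio_in)                   # speech -> text
    reply_text = llm(system_prompt, user_text)  # text -> text, any LLM
    return tts(reply_text)                      # text -> speech, any voice

# Placeholder stages, just to show the wiring:
fake_stt = lambda audio: "one scoop of vanilla, please"
fake_llm = lambda sys, txt: f"({sys}) Vanilla?! An OUTSTANDING choice."
fake_tts = lambda txt: f"<synthesized audio: {txt!r}>"
print(cascaded_turn(b"\x00" * 16, fake_stt, fake_llm, fake_tts,
                    "intense ice cream seller"))
```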
We are releasing a multilingual 2B model focused on EU languages, but maybe more importantly, we're making our pretraining data pipeline public. Just run `uv run dactory create /path/to/dest` and get Helium's pretraining data.
🚀 Thrilled to announce Helium 1, our new 2B-parameter LLM, now available alongside dactory, an open-source pipeline to reproduce its training dataset covering all 24 EU official languages. Helium sets new standards within its size class on European languages!
0 replies · 12 reposts · 97 likes
Our team is hiring a postdoc in Audio AI! What: speech, music, bioacoustics How: multiresolution neural networks in the raw waveform Where: Nantes, France (https://t.co/tbNcztQDWh) When: negotiable How long: 12 months, renewable Apply before May 10: https://t.co/YuRvlVL4SF
1 reply · 19 reposts · 53 likes
Thanks @GoogleAI 🙏, I'm proud to see concepts introduced in this paper (RVQ-VAE, quantizer dropout) still being as relevant four years later, and in particular how the RVQ turned out to be a perfect fit for audio language models.
Congratulations to Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, and Marco Tagliasacchi for winning the IEEE Best Paper Award for "SoundStream: An End-to-End Natural Audio Codec"! https://t.co/5C9HuwYFjJ
#SPSAwards #IEEEAwards
3 replies · 13 reposts · 184 likes
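For reference, residual vector quantization stacks codebooks so that each one quantizes the residual left by the previous stage, giving a coarse-to-fine code, and quantizer dropout randomly truncates the stack during training so one model serves several bitrates. A minimal PyTorch sketch with illustrative sizes, not the SoundStream implementation:

```python
import torch

def residual_vq(z, codebooks, n_active):
    """Quantize `z` with the first `n_active` codebooks, each coding the
    residual of the previous stage (coarse-to-fine)."""
    residual = z
    quantized = torch.zeros_like(z)
    codes = []
    for cb in codebooks[:n_active]:            # cb: (num_codes, dim)
        idx = torch.cdist(residual, cb).argmin(dim=-1)
        q = cb[idx]
        quantized = quantized + q
        residual = residual - q
        codes.append(idx)
    return quantized, codes

# Quantizer dropout: during training, draw a random depth so the decoder
# learns to work from any prefix of the codebooks, i.e. several bitrates.
torch.manual_seed(0)
books = [torch.randn(16, 8) for _ in range(4)]  # 4 codebooks, 16 codes, dim 8
n_active = int(torch.randint(1, len(books) + 1, (1,)))
zq, codes = residual_vq(torch.randn(2, 8), books, n_active)
```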