Cong Zhou
@CongZhou1
Followers: 144 · Following: 818 · Media: 0 · Statuses: 91
Researcher @AnuttaconGames
Bay Area
Joined January 2020
I’m Shawn, founder of https://t.co/6SYcxgwroZ, former researcher at Meta and CS PhD at University of Cambridge. Today we’re launching https://t.co/6SYcxgwroZ: we built the world’s first Large Visual Memory Model - to give AI human-like visual memories. Why visual memory? AI to
Put another way: we have LLMs with billions of parameters controlled by VAD models with thousands of parameters. There are reasons for this, but we need more sophisticated solutions (and evals for them!)
Smarter voice AI turn detection is a "2025 problem." By which I mean: in 2024 all of us in the realtime, multimodal AI ecosystem spent most of our time working on relatively low-level things ... ➡️ basic turn detection using VAD ➡️ fast, reliable interruption handling ➡️
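To make concrete how small "thousands of parameters" VAD logic is next to a billion-parameter LLM, here is a hypothetical toy sketch of the simplest possible voice activity detector, a per-frame energy threshold (the function name, frame length, and threshold are illustrative assumptions, not any production model):

```python
import numpy as np

def energy_vad(samples, frame_len=320, threshold=0.01):
    """Toy energy-threshold VAD: flag frames whose mean energy
    exceeds a fixed threshold. Real VADs use small neural nets,
    but the parameter budget is similarly tiny."""
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)
    return energy > threshold  # True = speech-like frame

# Synthetic audio: low-energy noise followed by a high-energy segment.
rng = np.random.default_rng(1)
silence = rng.normal(0, 0.01, 3200)
speech = rng.normal(0, 0.3, 3200)
flags = energy_vad(np.concatenate([silence, speech]))
```

Turn detection built on a signal like `flags` can only react to energy, not meaning, which is exactly the gap the tweet is pointing at.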
The first trailer for Whispers from the Star is here! 🌟 Thrilled to have contributed to the voice modeling efforts and excited for you to experience it! Join us in shaping immersive AI-driven experiences at @AnuttaconGames! 🎮🚀 https://t.co/QmEUOAamX7
anuttacon.com
We're hiring primarily in the San Francisco Bay Area, with an office in Mountain View. As a dynamic startup, we value the collaborative spirit of in-person work. We also remain open to remote...
Whispers from the Star⭐️ Announce Trailer Your words seal her fate. When a girl named Stella crash-lands on an alien planet called Gaia, you are the only person she can contact through her communicator. Through texts, voice messages, and video calls that unfold throughout your
Cool!
Announcing Diffusion Forcing Transformer (DFoT), our new video diffusion algorithm that generates ultra-long videos of 800+ frames. DFoT enables History Guidance, a simple add-on to any existing video diffusion models for a quality boost. Website: https://t.co/wdZ19yCgjJ (1/7)
It’s cool!
Excited to share a peek of what I’ve been working on We @sesame believe voice is key to unlocking a future where computers are lifelike Here’s an early preview you can try! 👇 We’ll be open sourcing a model, and yes… we’re building hardware! 🧵
This new ByteDance project works remarkably well. OmniHuman: from a single image plus audio or video, it generates very natural videos of people talking and singing. It supports a variety of input types (e.g., a single portrait photo with audio, video, or other signals) and produces highly realistic human video animation, covering everything from facial expressions to full-body motion, whether talking, singing, or dancing. OmniHuman
@sedielem imo, just because it's more “compressed” doesn't mean it's good for “modeling.” In the audio/speech space, people use semantic tokens, which are not necessarily optimized for compression. What matters more are the characteristics of the representation the encoder has learned.
Congratulations, Jordi! I’ll definitely play with it, any plans to go to 32k?
Weights are out! 🤗 Tokenizing 16kHz speech at very low bitrates. Inference code: https://t.co/eZKbrBzHzw Model code: https://t.co/vLJhpyGa7M Model weights: https://t.co/fFHpte7fey arXiv: https://t.co/ZbslCfppvF Audio demos: https://t.co/J9D46A6prO
You cannot miss this one!
come hang w/ us at neurips! i'm hosting an anime & ai social on dec 11th! will be there along with a bunch of folks who work on @nijijourney then later, we're hosting a diffusion bar event dec 12th with @midjourney! rsvp on the partiful links below!
Tried my best, then realized there are certain performance gaps we can’t close at this point. The 🌞 side is that TTS is still not solved.
Transformer-based TTS models sound great but have all kinds of reliability issues. Our new model, Very Attentive Tacotron (VAT), is a Transformer-based TTS system that doesn't drop or repeat words and can generalize to any practical utterance length. https://t.co/y3kCIYF8M5
LibriTTS has been ranked 6th. Congrats to all authors and collaborators! And thanks to all users.
Congratulations to the SUPERB Team! Our work on the Speech Processing Universal PERformance Benchmark (SUPERB) has been ranked 7th among the most cited papers at INTERSPEECH over the past five years! A big round of applause to everyone involved.
Wow, this would be fun indeed
2.5 months ago @elevenlabsio put up this comparison with our 10 day old Sonic model: https://t.co/U2A5tcZC9b The team took it as a challenge, here's our new scorecard. Higher quality, cheaper & the fastest voice model period. https://t.co/44caSdm6pe Next 3 months will be fun.
tl;dr Adding independent gaussian noise to each pixel is equivalent to adding uniform frequency noise to a full image. Since images have a power law distribution of frequencies, adding pixel noise ~= low pass, so denoising ~= iteratively predicting frequencies from low to high.
Diffusion is the rising tide that eventually submerges all frequencies, high and low 🌊 Diffusion is the gradual decomposition into feature scales, fine and coarse 🗼 Diffusion is just spectral autoregression 🤷🌈
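The tl;dr argument above can be checked numerically. The sketch below is a hypothetical toy setup, not the quoted work: a 1-D signal with a 1/f amplitude spectrum stands in for a natural image, and we compare per-frequency SNR against white (IID Gaussian) noise at low vs. high frequencies:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256

# Synthetic "image": random phases, power-law (1/f) amplitude spectrum.
freqs = np.fft.rfftfreq(n, d=1.0)
amps = np.zeros_like(freqs)
amps[1:] = 1.0 / freqs[1:]
phases = rng.uniform(0, 2 * np.pi, size=freqs.shape)
signal = np.fft.irfft(amps * np.exp(1j * phases), n=n)

# IID Gaussian pixel noise has a flat (white) spectrum.
noise = rng.normal(0, 1.0, size=n)
sig_mag = np.abs(np.fft.rfft(signal))
noise_mag = np.abs(np.fft.rfft(noise))

# Per-frequency SNR in the lowest vs. highest eighth of the band:
lo = sig_mag[1:n // 8].mean() / noise_mag[1:n // 8].mean()
hi = sig_mag[-n // 8:].mean() / noise_mag[-n // 8:].mean()
```

Because the noise floor is flat while signal power falls as a power law, `lo` far exceeds `hi`: high frequencies drown first as noise is added, which is why pixel-noise diffusion behaves like progressive low-pass filtering and denoising recovers frequencies from low to high.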
If you're attending #INTERSPEECH2024 and have an interest in audio tokens, we warmly invite you to join our presentation! #DeepLearning #Speech #LLM #audio #research #SpeechBrain #AI
📢 I'll be presenting our paper "How Should We Extract Discrete Audio Tokens from Self-Supervised Models?" at InterSpeech! 🎙️ Meet us at the Speech Processing Using Discrete Speech Units, Oral Session on Sep 3, 16:20. 🔗 Paper: https://t.co/Z98B0DFjGD
#INTERSPEECH2024