Cong Zhou
@CongZhou1
Followers: 144 · Following: 818 · Media: 0 · Statuses: 91
Researcher @AnuttaconGames
Bay Area
Joined January 2020
I’m Shawn, founder of https://t.co/6SYcxgwroZ, former researcher at Meta and CS PhD at University of Cambridge. Today we’re launching https://t.co/6SYcxgwroZ: we built the world’s first Large Visual Memory Model - to give AI human-like visual memories. Why visual memory? AI to
Put another way: we have LLMs with billions of parameters controlled by VAD models with thousands of parameters. There are reasons for this, but we need more sophisticated solutions (and evals for them!)
Smarter voice AI turn detection is a "2025 problem." By which I mean: in 2024 all of us in the realtime, multimodal AI ecosystem spent most of our time working on relatively low-level things ... ➡️ basic turn detection using VAD ➡️ fast, reliable interruption handling ➡️
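To make concrete how small "thousands of parameters" VAD logic is next to a billion-parameter LLM, here is a hypothetical toy sketch of the simplest possible voice activity detector, a per-frame energy threshold (the function name, frame length, and threshold are illustrative assumptions, not any production model):

```python
import numpy as np

def energy_vad(samples, frame_len=320, threshold=0.01):
    """Toy energy-threshold VAD: flag frames whose mean energy
    exceeds a fixed threshold. Real VADs use small neural nets,
    but the parameter budget is similarly tiny."""
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)
    return energy > threshold  # True = speech-like frame

# Synthetic audio: low-energy noise followed by a high-energy segment.
rng = np.random.default_rng(1)
silence = rng.normal(0, 0.01, 3200)
speech = rng.normal(0, 0.3, 3200)
flags = energy_vad(np.concatenate([silence, speech]))
```

Turn detection built on a signal like `flags` can only react to energy, not meaning, which is exactly the gap the tweet is pointing at.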
The first trailer for Whispers from the Star is here! 🌟 Thrilled to have contributed to the voice modeling efforts and excited for you to experience it! Join us in shaping immersive AI-driven experiences at @AnuttaconGames! 🎮🚀 https://t.co/QmEUOAamX7
anuttacon.com
We're hiring primarily in the San Francisco Bay Area, with an office in Mountain View. As a dynamic startup, we value the collaborative spirit of in-person work. We also remain open to remote...
Whispers from the Star⭐️ Announce Trailer Your words seal her fate. When a girl named Stella crash-lands on an alien planet called Gaia, you are the only person she can contact through her communicator. Through texts, voice messages, and video calls that unfold throughout your
Cool!
Announcing Diffusion Forcing Transformer (DFoT), our new video diffusion algorithm that generates ultra-long videos of 800+ frames. DFoT enables History Guidance, a simple add-on to any existing video diffusion models for a quality boost. Website: https://t.co/wdZ19yCgjJ (1/7)
It’s cool!
Excited to share a peek of what I’ve been working on We @sesame believe voice is key to unlocking a future where computers are lifelike Here’s an early preview you can try! 👇 We’ll be open sourcing a model, and yes… we’re building hardware! 🧵
This new ByteDance project works remarkably well. OmniHuman: from a single image plus audio or video, it generates very natural videos of people talking and singing. It supports a variety of input types (e.g., a single portrait photo with audio, video, or other signals) and produces highly realistic human video animation, covering everything from facial expressions to full-body motion, whether talking, singing, or dancing. OmniHuman
@sedielem imo, just because it's more “compressed” doesn't mean it's good for “modeling.” In the audio/speech space, people use semantic tokens, which are not necessarily optimized for compression. What matters more are the characteristics of the representation the encoder has learned.
Congratulations, Jordi! I’ll definitely play with it, any plans to go to 32k?
Weights are out! 🤗 Tokenizing 16kHz speech at very low bitrates. Inference code: https://t.co/eZKbrBzHzw Model code: https://t.co/vLJhpyGa7M Model weights: https://t.co/fFHpte7fey arXiv: https://t.co/ZbslCfppvF Audio demos: https://t.co/J9D46A6prO
You cannot miss this one!
come hang w/ us at neurips! i'm hosting an anime & ai social on dec 11th! will be there along with a bunch of folks who work on @nijijourney then later, we're hosting a diffusion bar event dec 12th with @midjourney! rsvp on the partiful links below!
Tried my best, then realized there are certain performance gaps we can’t close at this point. The 🌞 side is that TTS is still not solved.
Transformer-based TTS models sound great but have all kinds of reliability issues. Our new model, Very Attentive Tacotron (VAT), is a Transformer-based TTS system that doesn't drop or repeat words and can generalize to any practical utterance length. https://t.co/y3kCIYF8M5
LibriTTS has been ranked 6th. Congrats to all authors and collaborators! And thanks to all users.
Congratulations to the SUPERB Team! Our work on the Speech Processing Universal PERformance Benchmark (SUPERB) has been ranked 7th among the most cited papers at INTERSPEECH over the past five years! A big round of applause to everyone involved.
Wow, this would be fun indeed
2.5 months ago @elevenlabsio put up this comparison with our 10 day old Sonic model: https://t.co/U2A5tcZC9b The team took it as a challenge, here's our new scorecard. Higher quality, cheaper & the fastest voice model period. https://t.co/44caSdm6pe Next 3 months will be fun.
tl;dr Adding independent gaussian noise to each pixel is equivalent to adding uniform frequency noise to a full image. Since images have a power law distribution of frequencies, adding pixel noise ~= low pass, so denoising ~= iteratively predicting frequencies from low to high.
Diffusion is the rising tide that eventually submerges all frequencies, high and low 🌊 Diffusion is the gradual decomposition into feature scales, fine and coarse 🗼 Diffusion is just spectral autoregression 🤷🌈
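The tl;dr argument above can be checked numerically. The sketch below is a hypothetical toy setup, not the quoted work: a 1-D signal with a 1/f amplitude spectrum stands in for a natural image, and we compare per-frequency SNR against white (IID Gaussian) noise at low vs. high frequencies:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256

# Synthetic "image": random phases, power-law (1/f) amplitude spectrum.
freqs = np.fft.rfftfreq(n, d=1.0)
amps = np.zeros_like(freqs)
amps[1:] = 1.0 / freqs[1:]
phases = rng.uniform(0, 2 * np.pi, size=freqs.shape)
signal = np.fft.irfft(amps * np.exp(1j * phases), n=n)

# IID Gaussian pixel noise has a flat (white) spectrum.
noise = rng.normal(0, 1.0, size=n)
sig_mag = np.abs(np.fft.rfft(signal))
noise_mag = np.abs(np.fft.rfft(noise))

# Per-frequency SNR in the lowest vs. highest eighth of the band:
lo = sig_mag[1:n // 8].mean() / noise_mag[1:n // 8].mean()
hi = sig_mag[-n // 8:].mean() / noise_mag[-n // 8:].mean()
```

Because the noise floor is flat while signal power falls as a power law, `lo` far exceeds `hi`: high frequencies drown first as noise is added, which is why pixel-noise diffusion behaves like progressive low-pass filtering and denoising recovers frequencies from low to high.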
If you're attending #INTERSPEECH2024 and have an interest in audio tokens, we warmly invite you to join our presentation! #DeepLearning #Speech #LLM #audio #research #SpeechBrain #AI
📢 I'll be presenting our paper "How Should We Extract Discrete Audio Tokens from Self-Supervised Models?" at InterSpeech! 🎙️ Meet us at the Speech Processing Using Discrete Speech Units, Oral Session on Sep 3, 16:20. 🔗 Paper: https://t.co/Z98B0DFjGD
#INTERSPEECH2024