Desh Raj
@rdesh26
3K followers · 7K following · 146 media · 2K statuses
Speech + LLMs @Meta MSL | Previously: @jhuclsp, @IITGuwahati
New York, NY
Joined September 2009
**Dissertation now available** 📜: https://t.co/wpXl7D3rF7 📽️: https://t.co/tJkC9j7YiY ⏯️: https://t.co/2oDan50NyB It's a 332-page tome, but I have summarized it in this thread 👇 1/n
3 replies · 9 reposts · 104 likes
Get off the hamster wheel if you can... life is too short.
8 replies · 7 reposts · 167 likes
My favorite talk of the day was a masterclass by @fakufakurevenge on "Generative models for universal speech enhancement". Having worked on ASR during my PhD and now on speech LLMs, I tend to equate "audio" with "spoken language", so it was really nice to get out of this bubble.
1 reply · 2 reposts · 10 likes
A few days ago, I conducted the following poll: For the same parameter size, network architecture, and training data, which of the following models do you think would perform best at streaming ASR? With 57% of the votes, RNN-T was the winner. This is not surprising, since most…
1 reply · 0 reposts · 13 likes
While I enjoyed the Thinky post, I would also like to read blogs that are not quite so polished, by RL researchers who are not quite so well-funded. Who are your favorite PhD student bloggers out there?
0 replies · 0 reposts · 2 likes
"Yet, after all, why not? Why shouldn't I keep it?" - Bilbo Baggins (late Third Age, Middle-earth)
0 replies · 0 reposts · 2 likes
Sunday thoughts 🤔 Minimum word error rate (MWER) training is RLVR with reward = negative word error rate (WER), and posterior renormalization for importance sampling. Main difference: MWER uses deterministic decoding for lower variance, while RLVR uses stochastic sampling to select candidates.
0 replies · 0 reposts · 6 likes
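To make the claimed equivalence concrete, here is a minimal sketch in LaTeX; the notation (N-best list \(\mathcal{B}(x)\), reference \(y^{*}\), posterior \(P_\theta\)) is introduced here for illustration and is not taken from the tweet itself.

```latex
% Minimal sketch of the MWER <-> RLVR correspondence (requires amsmath, amssymb).
% Notation assumed here: B(x) = deterministic N-best list, y* = reference transcript.
\[
  \mathcal{L}_{\mathrm{MWER}}(\theta)
    = \sum_{y \in \mathcal{B}(x)} \hat{P}_\theta(y \mid x)\,\mathrm{WER}(y, y^{*}),
  \qquad
  \hat{P}_\theta(y \mid x)
    = \frac{P_\theta(y \mid x)}{\sum_{y' \in \mathcal{B}(x)} P_\theta(y' \mid x)}.
\]
% With verifiable reward R(y) = -WER(y, y*), RLVR maximizes
\[
  J(\theta) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[ R(y) \big],
\]
% so minimizing L_MWER targets the same objective, with the expectation
% estimated on the renormalized N-best set (deterministic, lower variance)
% rather than on stochastic samples from the policy.
```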
For the same parameter size, network architecture, and training data, which of the following models do you think would perform best at streaming ASR?
2 replies · 0 reposts · 2 likes
Made some notes 📝 while reading the RLHF book by @natolambert: > https://t.co/sWfLmqgKX5 The most fun part was deriving several policy gradient methods from the basic principle of expected reward maximization.
0 replies · 0 reposts · 6 likes
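As an illustration of the kind of derivation the tweet describes, here is the standard log-derivative trick sketched from first principles; the notation and the baseline term are mine, not necessarily the book's.

```latex
% Policy gradient from expected reward maximization (log-derivative trick).
% Goal: maximize J(theta) = E_{y ~ pi_theta}[ R(y) ].
\[
  \nabla_\theta J(\theta)
    = \nabla_\theta \sum_{y} \pi_\theta(y)\, R(y)
    = \sum_{y} \pi_\theta(y)\, \nabla_\theta \log \pi_\theta(y)\, R(y)
    = \mathbb{E}_{y \sim \pi_\theta}\big[ R(y)\, \nabla_\theta \log \pi_\theta(y) \big].
\]
% A baseline b does not bias the gradient, since
% E[ b * grad log pi ] = b * grad( sum_y pi(y) ) = b * grad(1) = 0,
% but it reduces variance:
\[
  \nabla_\theta J(\theta)
    = \mathbb{E}_{y \sim \pi_\theta}\big[ (R(y) - b)\, \nabla_\theta \log \pi_\theta(y) \big].
\]
% REINFORCE follows directly; PPO/GRPO-style methods replace R(y) - b with
% clipped, (group-)normalized advantage estimates.
```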
My brilliant friend @esalesk already did this a few years ago in this paper:
@thawani_avijit Haha. I am afraid people interpreted my “delete tokenizer” as “use bytes directly without BPE”; the issue is that you *still* inherit the arbitrariness of byte encoding even for that! Pixels are the only way. Just like humans. It is written. If GPT-10 uses utf8 at the input I will eat a shoe.
0 replies · 0 reposts · 11 likes
I was impacted by the FAIR layoff this time. I'm looking for a new position in speech, multimodal, 3D human motion, and social behavior modeling. Happy to chat in more detail :)
17 replies · 44 reposts · 250 likes
I always thought the decline in fundamental AI research funding would happen because AI didn’t generate enough value to be worth the cost. But it seems like it’s happening because it generated too much value. And the race to capture that value is taking priority. Just…
34 replies · 37 reposts · 374 likes
I was impacted by Meta layoffs today. As a Research Scientist working on LLM post-training (reward models, DPO/GRPO) & automated evaluation pipelines, I’ve focused on understanding why/where models fail & how to make them better. I’m looking for opportunities; please reach out!
123 replies · 235 reposts · 3K likes
I often muse that if I were to do college all over again, I would study physics.
We’ve found a ton of value hiring folks with strong theory backgrounds and little to no production ML experience. One of our members of technical staff got his PhD in pure math/the geometry of black holes and had no prior ML experience. Within days of hiring him we released our…
0 replies · 0 reposts · 2 likes
I was really curious to try the new Atlas browser, but it felt... meh? I asked the integrated ChatGPT to change some browser settings, but it just gave me instructions for how to do it myself. Maybe I am missing something, but why could this not have been a simple Chrome extension?
0 replies · 0 reposts · 4 likes
Remember when papers were "published" and models were "released"? Why is everything just "dropped" nowadays?
1 reply · 0 reposts · 6 likes
New work on reasoning with SpeechLLMs, led by @yijenshih. We explore the paradigm of "thinking while listening" in the context of full-duplex speech models!
Can SpeechLLMs Think while Listening? tl;dr We enhance SpeechLLM reasoning using a text-based CoT guided by an entropy metric. This approach, combined with preference tuning, enables "thinking while listening" with an improved accuracy-latency trade-off and greater control.
0 replies · 1 repost · 12 likes
📢 Introducing VERSA: our new open-source toolkit for speech & audio evaluation!
- 80+ metrics in one unified interface
- Flexible input support
- Distributed evaluation with Slurm
- ESPnet compatible
Check out the details: https://t.co/lKpIaJE6Er
https://t.co/pSzWd5C5YM
2 replies · 32 reposts · 140 likes
📣 You can now have a conversation with Meta AI using voice. It’s super fast, connected to the web, natural and conversational, and it even comes with celebrity voice options from Awkwafina, Kristen Bell, John Cena, and more. What voice speaks to you? (pun intended 😆)
94 replies · 136 reposts · 2K likes