Siddharth Singh
@siddharth_3773
Followers
859
Following
768
Media
1
Statuses
55
Research Scientist @nvidia, CS Ph.D. from UMD. Building massively parallel inference systems.
Santa Clara, CA
Joined June 2024
I'll be joining Nvidia's Applied Deep Learning Research team ( https://t.co/obch93B6N8) from May of next year. Excited to kickstart my industry career with the immensely cracked systems team that works on Megatron-LM!
45
38
1K
There's been a lot of discussion recently about parallel vs sequential reasoning. The recurrent models we trained this year are sequential, which makes them good at math, but slow (see pic). However, if you squint, models with recurrent-depth/loops are like diffusion models ...
3
16
70
Scenes when an LLM invests in VOO and outperforms everyone 😂
0
0
1
Can we break the memory wall for LLM inference via KV cache rematerialization? 🚨 Introducing XQuant, which leverages underutilized compute units to eliminate the memory bottleneck for LLM inference! • 10–12.5x memory savings vs. FP16 • Near-zero accuracy loss • Beats
26
91
668
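A rough sketch of the rematerialization idea described in the tweet above, assuming the straightforward reading: cache a low-bit copy of the layer input X instead of FP16 K/V, then spend otherwise-idle compute to rebuild K = XW_k and V = XW_v at decode time. The class, function names, and int8 scheme below are illustrative assumptions, not XQuant's actual API or quantizer.

```python
import torch

def quantize_int8(x: torch.Tensor):
    """Per-tensor symmetric int8 quantization (illustrative placeholder)."""
    scale = x.abs().max() / 127.0 + 1e-8
    return (x / scale).round().clamp(-128, 127).to(torch.int8), scale

def dequantize(x_q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return x_q.float() * scale

class RematKVCache:
    """Caches quantized hidden states X; K and V are rebuilt on demand
    instead of being stored, trading extra GEMMs for a smaller footprint."""

    def __init__(self, w_k: torch.Tensor, w_v: torch.Tensor):
        self.w_k, self.w_v = w_k, w_v          # (d_model, d_k), (d_model, d_v)
        self.x_q, self.scales = [], []

    def append(self, x: torch.Tensor):
        """Store the new tokens' hidden states in quantized form."""
        x_q, scale = quantize_int8(x)
        self.x_q.append(x_q)
        self.scales.append(scale)

    def rematerialize(self):
        """Recompute K and V from the dequantized hidden states."""
        x = torch.cat([dequantize(q, s) for q, s in zip(self.x_q, self.scales)], dim=0)
        return x @ self.w_k, x @ self.w_v
```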
Glad we could together improve the scientific discourse around reasoning. Was great to see the authors reach out and incorporate all our feedback!
1/ Maximizing confidence indeed improves reasoning. We worked with @ShashwatGoel7, @nikhilchandak29 @AmyPrb for the past 3 weeks (over a zoom call and many emails!) and revised our evaluations to align with their suggested prompts/parsers/sampling params. This includes changing
1
5
24
1/ Maximizing confidence indeed improves reasoning. We worked with @ShashwatGoel7, @nikhilchandak29 @AmyPrb for the past 3 weeks (over a zoom call and many emails!) and revised our evaluations to align with their suggested prompts/parsers/sampling params. This includes changing
Confused about recent LLM RL results where models improve without any ground-truth signal? We were too. Until we looked at the reported numbers of the pre-RL models and realized they were severely underreported across papers. We compiled discrepancies in a blog below🧵👇
1
15
52
Amid all the recent excitement about RL and the many cool results coming out, here is a reminder that RL with a reverse-KL regularizer to the base model cannot learn new skills that were not already present in the base model. It can only amplify skills the base model already has in weak form.
14
57
529
We empirically prove this with surgical experiments: 🐍 Directly rewarding string “python” → +11.8% performance 🚫 Random rewards BUT blocking code → gains disappear The "magic" is just surfacing useful patterns already learned in pre-training.
3
3
129
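For anyone who wants the one-line argument behind the two tweets above, a minimal sketch using standard KL-regularized RL algebra (nothing here is specific to the experiments in the thread): the optimum of a reverse-KL-regularized objective is just a reweighting of the base policy.

```latex
\max_{\pi}\;\mathbb{E}_{y\sim\pi}\big[r(y)\big]\;-\;\beta\,\mathrm{KL}\!\left(\pi \,\|\, \pi_{\mathrm{base}}\right)
\quad\Longrightarrow\quad
\pi^{*}(y)\;\propto\;\pi_{\mathrm{base}}(y)\,\exp\!\big(r(y)/\beta\big)
```

Because \pi^{*}(y) is \pi_{\mathrm{base}}(y) times a strictly positive factor, any output with zero probability under the base model keeps zero probability after RL; the objective can only re-weight behaviors the base model already assigns mass to.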
As expected, that was popular. Here is my attempt at consolidating all the answers into a list. - Prenorm: normalization applied inside the residual blocks, before the attention operation and the FFN respectively - GQA (Grouped-Query Attention): more query heads than (K, V) heads, with each (K, V) head shared across a group of query heads
8
65
702
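A minimal PyTorch-style sketch of the two terms defined in the tweet above, assuming their standard formulations (the module name and hyperparameters are illustrative, not taken from any particular model): pre-norm applies LayerNorm inside the residual branch before attention and before the FFN, and grouped-query attention gives each K/V head a whole group of query heads to serve.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PreNormGQABlock(nn.Module):
    """Toy transformer block: pre-norm residual structure + grouped-query attention."""

    def __init__(self, d_model: int = 512, n_q_heads: int = 8, n_kv_heads: int = 2):
        super().__init__()
        assert n_q_heads % n_kv_heads == 0
        self.h_q, self.h_kv = n_q_heads, n_kv_heads
        self.d_head = d_model // n_q_heads
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.wq = nn.Linear(d_model, n_q_heads * self.d_head, bias=False)
        self.wk = nn.Linear(d_model, n_kv_heads * self.d_head, bias=False)  # fewer K heads
        self.wv = nn.Linear(d_model, n_kv_heads * self.d_head, bias=False)  # fewer V heads
        self.wo = nn.Linear(d_model, d_model, bias=False)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, seq, d_model)
        b, t, _ = x.shape
        h = self.norm1(x)                                  # pre-norm: normalize BEFORE attention
        q = self.wq(h).view(b, t, self.h_q, self.d_head).transpose(1, 2)
        k = self.wk(h).view(b, t, self.h_kv, self.d_head).transpose(1, 2)
        v = self.wv(h).view(b, t, self.h_kv, self.d_head).transpose(1, 2)
        # GQA: each K/V head is shared by a group of (h_q / h_kv) query heads.
        rep = self.h_q // self.h_kv
        k, v = k.repeat_interleave(rep, dim=1), v.repeat_interleave(rep, dim=1)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        x = x + self.wo(attn.transpose(1, 2).reshape(b, t, -1))
        x = x + self.ffn(self.norm2(x))                    # pre-norm before the FFN as well
        return x
```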
There are more ways to improve model quality apart from chucking in more compute (although the latter is what keeps me employed). Great work!
🧠💡 What if your 7B model could beat GPT-4o and Qwen2.5-72B—using just 11k training samples? No distillation. No warm-start. Just smart data and reinforcement learning. Inspired by Moravec’s Paradox, we let the model decide what's actually hard. 🚨 New paper: "SoTA with Less:
0
0
6
Passing by luxury clothing retailers (Gucci, Prada, etc.), I feel lucky that CS researchers don't waste money playing these types of vapid status games. Anyway, this is Dr. Prof. Kamath (rank-n university), thrilled to announce my group's 23 accepted NeurIPS papers!!!!
9
33
886
We are on a roll, second successful dissertation defense in a week (March 28)! Congratulations to @siddharth_3773 on becoming the second PhD graduate from PSSG!! Dissertation title: "Optimizing Communication in Parallel Deep Learning on Exascale-class Machines" #HPC #AI #HPC4AI
0
1
13
🚀 We fixed a major LLM post-training bottleneck! Our new method (TBA) combines trajectory balance with asynchronous training to speed up LLM RL 5-50x while improving results+scalability. For example, using VinePPO's GSM8K setup, we obtain +1.2% accuracy and 50x faster RL.
3
53
257
Announcing Neo-1: the world’s most advanced atomistic foundation model, unifying structure prediction and all-atom de novo generation for the first time - to decode and design the structure of life 🧵(1/10)
39
374
2K
So erm, I caught ChatGPT lying?? - The response says it has "checked" the PyTorch GitHub repo - The thinking trace reveals that it actually didn't.
0
0
4
Adversarial attacks jailbreak models. Existing defenses don’t even safeguard simple models. In our ICML 2024 paper on "Adversarial Robustness Limits”, we show how scaling helps defense, up to the point where attacks start to fool humans (take quiz: https://t.co/mmLDnvTErY). 🧵1/n
1
8
30
A radically new approach to test-time compute!
New open source reasoning model! Huginn-3.5B reasons implicitly in latent space 🧠 Unlike O1 and R1, latent reasoning doesn’t need special chain-of-thought training data, and doesn't produce extra CoT tokens at test time. We trained on 800B tokens 👇
0
0
14
Ok, so I can finally talk about this! We spent the last year (actually a bit longer) training an LLM with recurrent depth at scale. The model has an internal latent space in which it can adaptively spend more compute to think longer. I think the tech report ...🐦⬛
54
199
2K
We have a postdoc opening in our group on the intersection of HPC and AI, specifically on developing and applying Code LLMs to HPC software development. Please help us spread the word: https://t.co/i4Gi5xPsTr Position will remain open until filled. #HPC #AI
0
2
6