Siddharth Singh Profile
Siddharth Singh

@siddharth_3773

Followers: 859
Following: 768
Media: 1
Statuses: 55

Research Scientist @nvidia, CS Ph.D. from UMD. Building massively parallel inference systems.

Santa Clara, CA
Joined June 2024
@siddharth_3773
Siddharth Singh
11 months
I'll be joining Nvidia's Applied Deep Learning Research team ( https://t.co/obch93B6N8) from May of next year. Excited to kickstart my industry career with the immensely cracked systems team that works on Megatron-LM!
45
38
1K
@bhatele
Abhinav Bhatele
11 days
A large number of PhD students in my group have graduated or will be graduating by Spring, so I am recruiting several PhD students for the next admission cycle (Fall 2026). If you want to work with us, apply by Dec 5 and drop me a short email. Please repost/share widely. #HPC #AI
9
63
212
@jonasgeiping
Jonas Geiping @ Neurips
1 month
There's been a lot of discussion recently about parallel vs sequential reasoning. The recurrent models we trained this year are sequential, which makes them good at math, but slow (see pic) However, if you squint, models with recurrent-depth/loops are like diffusion models ...
3
16
70
@siddharth_3773
Siddharth Singh
1 month
Scenes when an LLM invests in VOO and outperforms everyone 😂
0
0
1
@adityastomar_
Aditya Tomar
3 months
Can we break the memory wall for LLM inference via KV cache rematerialization? 🚨 Introducing XQuant, which leverages underutilized compute units to eliminate the memory bottleneck for LLM inference! • 10–12.5x memory savings vs. FP16 • Near-zero accuracy loss • Beats
26
91
668
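The rematerialization idea in the XQuant thread above can be sketched roughly: instead of caching K and V per layer, cache only the (quantized) layer input X and recompute K = XW_K, V = XW_V on demand with otherwise-idle compute. A toy NumPy sketch, with all shapes and names hypothetical (the real method's quantization scheme and savings differ):

```python
import numpy as np

d_model, seq_len = 64, 128
rng = np.random.default_rng(0)

# Hypothetical per-layer projection weights (learned, in a real model).
W_k = rng.standard_normal((d_model, d_model)).astype(np.float32)
W_v = rng.standard_normal((d_model, d_model)).astype(np.float32)
X = rng.standard_normal((seq_len, d_model)).astype(np.float32)

# Standard KV cache: store both K and V in fp16.
K_cached, V_cached = X @ W_k, X @ W_v

# Rematerialized cache: store only X, quantized to int8.
scale = np.abs(X).max() / 127.0
X_q = np.round(X / scale).astype(np.int8)

# At attention time, dequantize X and recompute K, V on the fly.
X_deq = X_q.astype(np.float32) * scale
K_remat, V_remat = X_deq @ W_k, X_deq @ W_v

fp16_bytes = 2 * K_cached.size * 2   # K and V, 2 bytes per element
remat_bytes = X_q.size               # X alone, 1 byte per element
print(f"memory ratio: {fp16_bytes / remat_bytes:.1f}x")  # → 4.0x for this toy setup
```

The trade is explicit: two extra matrix multiplies per attention call in exchange for caching one tensor instead of two, at lower precision.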
@ShashwatGoel7
Shashwat Goel
5 months
Glad we could together improve the scientific discourse around reasoning. Was great to see the authors reach out and incorporate all our feedback!
@mihirp98
Mihir Prabhudesai
5 months
1/ Maximizing confidence indeed improves reasoning. We worked with @ShashwatGoel7, @nikhilchandak29 @AmyPrb for the past 3 weeks (over a zoom call and many emails!) and revised our evaluations to align with their suggested prompts/parsers/sampling params. This includes changing
1
5
24
@mihirp98
Mihir Prabhudesai
5 months
1/ Maximizing confidence indeed improves reasoning. We worked with @ShashwatGoel7, @nikhilchandak29 @AmyPrb for the past 3 weeks (over a zoom call and many emails!) and revised our evaluations to align with their suggested prompts/parsers/sampling params. This includes changing
@ShashwatGoel7
Shashwat Goel
6 months
Confused about recent LLM RL results where models improve without any ground-truth signal? We were too. Until we looked at the reported numbers of the pre-RL models and realized they were severely underreported across papers. We compiled discrepancies in a blog below🧵👇
1
15
52
@abeirami
Ahmad Beirami
6 months
As we go through a lot of excitement about RL recently with lots of cool work/results, here is a reminder that RL with a reverse KL-regularizer to the base model cannot learn new skills that were not already present in the base model. It can only amplify the existing weak skills.
14
57
529
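Beirami's point has a crisp derivation: RL fine-tuning with a reverse-KL penalty to the base (reference) model solves, per prompt $x$, an objective with a well-known closed-form optimum:

```latex
\pi^{\star} \;=\; \arg\max_{\pi}\;
  \mathbb{E}_{y \sim \pi(\cdot\mid x)}\!\left[ r(x,y) \right]
  \;-\; \beta\,\mathrm{KL}\!\left( \pi(\cdot\mid x) \,\middle\|\, \pi_{\mathrm{ref}}(\cdot\mid x) \right),
\qquad
\pi^{\star}(y \mid x) \;\propto\; \pi_{\mathrm{ref}}(y \mid x)\, e^{\,r(x,y)/\beta}.
```

Because the optimum multiplies the base model's probability, any response with $\pi_{\mathrm{ref}}(y \mid x) = 0$ stays at probability zero for every finite $\beta$: the objective can only reweight, i.e. amplify, behaviors the base model already assigns nonzero mass to.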
@StellaLisy
Stella Li @NeurIPS 2025
6 months
We empirically prove this with surgical experiments: 🐍 Directly rewarding string “python” → +11.8% performance 🚫 Random rewards BUT blocking code → gains disappear The "magic" is just surfacing useful patterns already learned in pre-training.
3
3
129
@francoisfleuret
François Fleuret
7 months
As expected, that was popular. Here is my attempt at consolidating all the answers into a list. - Prenorm: normalization in the residual blocks before the attention operation and the FFN respectively - GQA (Group Query Attention): more Q than (K, V)
@francoisfleuret
François Fleuret
7 months
What do you think are the most important improvements of the 2017 transformer architecture?
8
65
702
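The GQA item in Fleuret's list can be illustrated with shapes alone: queries keep all their heads while K and V are shared across groups, shrinking the KV cache. A minimal NumPy sketch with toy sizes and no learned weights:

```python
import numpy as np

n_q_heads, n_kv_heads, seq, d_head = 8, 2, 16, 32
group = n_q_heads // n_kv_heads  # 4 query heads share each KV head
rng = np.random.default_rng(0)

Q = rng.standard_normal((n_q_heads, seq, d_head))
K = rng.standard_normal((n_kv_heads, seq, d_head))
V = rng.standard_normal((n_kv_heads, seq, d_head))

# Broadcast each KV head to its group of query heads.
K_rep = np.repeat(K, group, axis=0)  # (8, seq, d_head)
V_rep = np.repeat(V, group, axis=0)

# Standard scaled dot-product attention over the repeated heads.
scores = Q @ K_rep.transpose(0, 2, 1) / np.sqrt(d_head)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)
out = weights @ V_rep  # (8, seq, d_head)

print(out.shape)  # KV cache is n_q_heads / n_kv_heads = 4x smaller than full MHA
```

With `n_kv_heads = 1` this degenerates to multi-query attention; with `n_kv_heads = n_q_heads` it is ordinary multi-head attention.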
@siddharth_3773
Siddharth Singh
7 months
There are more ways to improve model quality apart from chucking in more compute (although the latter is what keeps me employed). Great work!
@furongh
Furong Huang
7 months
🧠💡 What if your 7B model could beat GPT-4o and Qwen2.5-72B—using just 11k training samples? No distillation. No warm-start. Just smart data and reinforcement learning. Inspired by Moravec’s Paradox, we let the model decide what's actually hard. 🚨 New paper: "SoTA with Less:
0
0
6
@thegautamkamath
Gautam Kamath ✈️ NeurIPS 2025
8 months
Passing by luxury clothing retailers (Gucci, Prada, etc), I feel lucky that CS researchers don't waste money to play these types of vapid status games. Anyway, this is Dr. Prof. Kamath (rank-n university) thrilled to announce my group's 23 accepted NeurIPS papers!!!!
9
33
886
@hpc_group
Parallel Software and Systems Group
8 months
We are on a roll, second successful dissertation defense in a week (March 28)! Congratulations to @siddharth_3773 on becoming the second PhD graduate from PSSG!! Dissertation title: "Optimizing Communication in Parallel Deep Learning on Exascale-class Machines" #HPC #AI #HPC4AI
0
1
13
@bartoldson
Brian Bartoldson
8 months
🚀 We fixed a major LLM post-training bottleneck! Our new method (TBA) combines trajectory balance with asynchronous training to speed up LLM RL 5-50x while improving results+scalability. For example, using VinePPO's GSM8K setup, we obtain +1.2% accuracy and 50x faster RL.
3
53
257
@vant_ai
VantAI
8 months
Announcing Neo-1: the world’s most advanced atomistic foundation model, unifying structure prediction and all-atom de novo generation for the first time - to decode and design the structure of life 🧵(1/10)
39
374
2K
@siddharth_3773
Siddharth Singh
8 months
So erm, I caught ChatGPT lying?? - The response says it has "checked" the PyTorch GitHub repo - The thinking trace reveals that it actually didn't.
0
0
4
@siddharth_3773
Siddharth Singh
9 months
30k pa 🥰
@ThePhDPlace
The PhD Place
9 months
I’m doing a PhD for the money and the prestige.
0
0
3
@bartoldson
Brian Bartoldson
1 year
Adversarial attacks jailbreak models. Existing defenses don’t even safeguard simple models. In our ICML 2024 paper on "Adversarial Robustness Limits”, we show how scaling helps defense, up to the point where attacks start to fool humans (take quiz: https://t.co/mmLDnvTErY). 🧵1/n
1
8
30
@siddharth_3773
Siddharth Singh
10 months
A radically new approach towards test time compute!
@tomgoldsteincs
Tom Goldstein
10 months
New open source reasoning model! Huginn-3.5B reasons implicitly in latent space 🧠 Unlike O1 and R1, latent reasoning doesn’t need special chain-of-thought training data, and doesn't produce extra CoT tokens at test time. We trained on 800B tokens 👇
0
0
14
@jonasgeiping
Jonas Geiping @ Neurips
10 months
Ok, so I can finally talk about this! We spent the last year (actually a bit longer) training an LLM with recurrent depth at scale. The model has an internal latent space in which it can adaptively spend more compute to think longer. I think the tech report ...🐦‍⬛
54
199
2K
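The recurrent-depth idea above: a weight-tied block applied a variable number of times in latent space, so the model "thinks longer" by spending more forward passes through the same parameters rather than emitting more chain-of-thought tokens. A toy sketch (hypothetical shapes and update rule, not the actual Huginn architecture):

```python
import numpy as np

d = 16
rng = np.random.default_rng(0)
W = rng.standard_normal((d, d)) / np.sqrt(d)  # one shared (weight-tied) block

def step(h):
    # A single recurrence step: the same weights are reused every iteration.
    return np.tanh(h @ W)

def think(h0, n_iters):
    # Adaptive compute: iterate the shared block n_iters times in latent space.
    h = h0
    for _ in range(n_iters):
        h = step(h)
    return h

h0 = rng.standard_normal(d)
shallow = think(h0, 4)
deep = think(h0, 32)  # 8x the compute, identical parameter count
print(shallow.shape, deep.shape)
```

The key property the sketch shows: the iteration count is a runtime knob, so per-input compute can scale without growing the model or the output sequence.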
@hpc_group
Parallel Software and Systems Group
10 months
We have a postdoc opening in our group on the intersection of HPC and AI, specifically on developing and applying Code LLMs to HPC software development. Please help us spread the word: https://t.co/i4Gi5xPsTr Position will remain open until filled. #HPC #AI
0
2
6