Siddharth Singh
@siddharth_3773
Followers
859
Following
768
Media
1
Statuses
55
Research Scientist @nvidia, CS Ph.D. from UMD. Building massively parallel inference systems.
Santa Clara, CA
Joined June 2024
I'll be joining Nvidia's Applied Deep Learning Research team ( https://t.co/obch93B6N8) from May of next year. Excited to kickstart my industry career with the immensely cracked systems team that works on Megatron-LM!
45
38
1K
There's been a lot of discussion recently about parallel vs sequential reasoning. The recurrent models we trained this year are sequential, which makes them good at math, but slow (see pic). However, if you squint, models with recurrent-depth/loops are like diffusion models ...
3
16
70
Scenes when an LLM invests in VOO and outperforms everyone 😂
0
0
1
Can we break the memory wall for LLM inference via KV cache rematerialization? 🚨 Introducing XQuant, which leverages underutilized compute units to eliminate the memory bottleneck for LLM inference! • 10–12.5x memory savings vs. FP16 • Near-zero accuracy loss • Beats
26
91
668
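A rough sketch of the rematerialization idea described in the tweet above, assuming the straightforward reading: cache a low-bit copy of the layer input X instead of FP16 K/V, then spend otherwise-idle compute to rebuild K = XW_k and V = XW_v at decode time. The class, function names, and int8 scheme below are illustrative assumptions, not XQuant's actual API or quantizer.

```python
import torch

def quantize_int8(x: torch.Tensor):
    """Per-tensor symmetric int8 quantization (illustrative placeholder)."""
    scale = x.abs().max() / 127.0 + 1e-8
    return (x / scale).round().clamp(-128, 127).to(torch.int8), scale

def dequantize(x_q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return x_q.float() * scale

class RematKVCache:
    """Caches quantized hidden states X; K and V are rebuilt on demand
    instead of being stored, trading extra GEMMs for a smaller footprint."""

    def __init__(self, w_k: torch.Tensor, w_v: torch.Tensor):
        self.w_k, self.w_v = w_k, w_v          # (d_model, d_k), (d_model, d_v)
        self.x_q, self.scales = [], []

    def append(self, x: torch.Tensor):
        """Store the new tokens' hidden states in quantized form."""
        x_q, scale = quantize_int8(x)
        self.x_q.append(x_q)
        self.scales.append(scale)

    def rematerialize(self):
        """Recompute K and V from the dequantized hidden states."""
        x = torch.cat([dequantize(q, s) for q, s in zip(self.x_q, self.scales)], dim=0)
        return x @ self.w_k, x @ self.w_v
```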
Glad we could together improve the scientific discourse around reasoning. Was great to see the authors reach out and incorporate all our feedback!
1/ Maximizing confidence indeed improves reasoning. We worked with @ShashwatGoel7, @nikhilchandak29 @AmyPrb for the past 3 weeks (over a zoom call and many emails!) and revised our evaluations to align with their suggested prompts/parsers/sampling params. This includes changing
1
5
24
1/ Maximizing confidence indeed improves reasoning. We worked with @ShashwatGoel7, @nikhilchandak29 @AmyPrb for the past 3 weeks (over a zoom call and many emails!) and revised our evaluations to align with their suggested prompts/parsers/sampling params. This includes changing
Confused about recent LLM RL results where models improve without any ground-truth signal? We were too. Until we looked at the reported numbers of the pre-RL models and realized they were severely underreported across papers. We compiled discrepancies in a blog below🧵👇
1
15
52
Amid all the recent excitement about RL and the many cool results coming out, here is a reminder that RL with a reverse-KL regularizer to the base model cannot learn new skills that were not already present in the base model. It can only amplify skills the base model already has in weak form.
14
57
529
We empirically prove this with surgical experiments: 🐍 Directly rewarding string “python” → +11.8% performance 🚫 Random rewards BUT blocking code → gains disappear The "magic" is just surfacing useful patterns already learned in pre-training.
3
3
129
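For anyone who wants the one-line argument behind the two tweets above, a minimal sketch using standard KL-regularized RL algebra (nothing here is specific to the experiments in the thread): the optimum of a reverse-KL-regularized objective is just a reweighting of the base policy.

```latex
\max_{\pi}\;\mathbb{E}_{y\sim\pi}\big[r(y)\big]\;-\;\beta\,\mathrm{KL}\!\left(\pi \,\|\, \pi_{\mathrm{base}}\right)
\quad\Longrightarrow\quad
\pi^{*}(y)\;\propto\;\pi_{\mathrm{base}}(y)\,\exp\!\big(r(y)/\beta\big)
```

Because \pi^{*}(y) is \pi_{\mathrm{base}}(y) times a strictly positive factor, any output with zero probability under the base model keeps zero probability after RL; the objective can only re-weight behaviors the base model already assigns mass to.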
As expected, that was popular. Here is my attempt at consolidating all the answers into a list. - Prenorm: normalization applied inside the residual blocks, before the attention operation and the FFN respectively - GQA (Grouped-Query Attention): more query heads than (K, V) heads, with each (K, V) head shared across a group of query heads
8
65
702
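A minimal PyTorch-style sketch of the two terms defined in the tweet above, assuming their standard formulations (the module name and hyperparameters are illustrative, not taken from any particular model): pre-norm applies LayerNorm inside the residual branch before attention and before the FFN, and grouped-query attention gives each K/V head a whole group of query heads to serve.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PreNormGQABlock(nn.Module):
    """Toy transformer block: pre-norm residual structure + grouped-query attention."""

    def __init__(self, d_model: int = 512, n_q_heads: int = 8, n_kv_heads: int = 2):
        super().__init__()
        assert n_q_heads % n_kv_heads == 0
        self.h_q, self.h_kv = n_q_heads, n_kv_heads
        self.d_head = d_model // n_q_heads
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.wq = nn.Linear(d_model, n_q_heads * self.d_head, bias=False)
        self.wk = nn.Linear(d_model, n_kv_heads * self.d_head, bias=False)  # fewer K heads
        self.wv = nn.Linear(d_model, n_kv_heads * self.d_head, bias=False)  # fewer V heads
        self.wo = nn.Linear(d_model, d_model, bias=False)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, seq, d_model)
        b, t, _ = x.shape
        h = self.norm1(x)                                  # pre-norm: normalize BEFORE attention
        q = self.wq(h).view(b, t, self.h_q, self.d_head).transpose(1, 2)
        k = self.wk(h).view(b, t, self.h_kv, self.d_head).transpose(1, 2)
        v = self.wv(h).view(b, t, self.h_kv, self.d_head).transpose(1, 2)
        # GQA: each K/V head is shared by a group of (h_q / h_kv) query heads.
        rep = self.h_q // self.h_kv
        k, v = k.repeat_interleave(rep, dim=1), v.repeat_interleave(rep, dim=1)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        x = x + self.wo(attn.transpose(1, 2).reshape(b, t, -1))
        x = x + self.ffn(self.norm2(x))                    # pre-norm before the FFN as well
        return x
```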
There are more ways to improve model quality apart from chucking in more compute (although the latter is what keeps me employed). Great work!
🧠💡 What if your 7B model could beat GPT-4o and Qwen2.5-72B—using just 11k training samples? No distillation. No warm-start. Just smart data and reinforcement learning. Inspired by Moravec’s Paradox, we let the model decide what's actually hard. 🚨 New paper: "SoTA with Less:
0
0
6
Passing by luxury clothing retailers (Gucci, Prada, etc.), I feel lucky that CS researchers don't waste money playing these types of vapid status games. Anyway, this is Dr. Prof. Kamath (rank-n university), thrilled to announce my group's 23 accepted NeurIPS papers!!!!
9
33
886
We are on a roll, second successful dissertation defense in a week (March 28)! Congratulations to @siddharth_3773 on becoming the second PhD graduate from PSSG!! Dissertation title: "Optimizing Communication in Parallel Deep Learning on Exascale-class Machines" #HPC #AI #HPC4AI
0
1
13
🚀 We fixed a major LLM post-training bottleneck! Our new method (TBA) combines trajectory balance with asynchronous training to speed up LLM RL 5-50x while improving results+scalability. For example, using VinePPO's GSM8K setup, we obtain +1.2% accuracy and 50x faster RL.
3
53
257
Announcing Neo-1: the world’s most advanced atomistic foundation model, unifying structure prediction and all-atom de novo generation for the first time - to decode and design the structure of life 🧵(1/10)
39
374
2K
So erm, I caught ChatGPT lying?? - The response says it has "checked" the PyTorch GitHub repo - The thinking trace reveals that it actually didn't.
0
0
4
Adversarial attacks jailbreak models. Existing defenses don’t even safeguard simple models. In our ICML 2024 paper on "Adversarial Robustness Limits”, we show how scaling helps defense, up to the point where attacks start to fool humans (take quiz: https://t.co/mmLDnvTErY). 🧵1/n
1
8
30
A radically new approach to test-time compute!
New open source reasoning model! Huginn-3.5B reasons implicitly in latent space 🧠 Unlike O1 and R1, latent reasoning doesn’t need special chain-of-thought training data, and doesn't produce extra CoT tokens at test time. We trained on 800B tokens 👇
0
0
14
Ok, so I can finally talk about this! We spent the last year (actually a bit longer) training an LLM with recurrent depth at scale. The model has an internal latent space in which it can adaptively spend more compute to think longer. I think the tech report ...🐦⬛
54
199
2K
We have a postdoc opening in our group on the intersection of HPC and AI, specifically on developing and applying Code LLMs to HPC software development. Please help us spread the word: https://t.co/i4Gi5xPsTr Position will remain open until filled. #HPC #AI
0
2
6