Sai Surya Duvvuri
@dvsaisurya
Followers: 546 · Following: 1K · Media: 10 · Statuses: 196
Visiting Researcher at FAIR, Meta and CS PhD student at UT Austin. Previously, SR at Google | Pre-Doctoral Research Fellow at MSR India | CS UG at IIT KGP
Austin, Texas
Joined November 2011
🚨 New Paper: The Art of Scaling Reinforcement Learning Compute for LLMs 🚨 We burnt a lot of GPU-hours to provide the community with the first open, large-scale systematic study on RL scaling for LLMs. https://t.co/49REQZ4R6G
Wish to build scaling laws for RL but not sure how to scale? Or what scales? Or would RL even scale predictably? We introduce: The Art of Scaling Reinforcement Learning Compute for LLMs
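A minimal sketch of what "scaling laws for RL compute" means in practice: fit a saturating curve of performance against (log) training compute and extrapolate it. The sigmoidal functional form, parameter names, and synthetic numbers below are illustrative assumptions, not the paper's actual recipe.

import numpy as np
from scipy.optimize import curve_fit

def saturating_curve(log_compute, r_max, midpoint, slope):
    # Performance rises toward an asymptote r_max as compute grows (sigmoid in log-compute).
    return r_max / (1.0 + np.exp(-slope * (log_compute - midpoint)))

# Hypothetical (GPU-hours, pass-rate) measurements from a sweep of RL runs.
compute = np.array([1e2, 3e2, 1e3, 3e3, 1e4, 3e4])
reward = np.array([0.08, 0.15, 0.28, 0.41, 0.52, 0.58])

params, _ = curve_fit(saturating_curve, np.log(compute), reward, p0=[0.6, 7.0, 1.0])
print("fitted asymptote / midpoint / slope:", np.round(params, 2))
print("extrapolation to 1e5 GPU-hours:", saturating_curve(np.log(1e5), *params))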
Sneak peek from a paper about scaling RL compute for LLMs: probably the most compute-expensive paper I've worked on, but hoping that others can run experiments cheaply for the science of scaling RL. Coincidentally, this is similar motivation to what we had for the NeurIPS best
New paper 📜: Tiny Recursion Model (TRM) is a recursive reasoning approach with a tiny 7M parameters neural network that obtains 45% on ARC-AGI-1 and 8% on ARC-AGI-2, beating most LLMs. Blog: https://t.co/w5ZDsHDDPE Code: https://t.co/7UgKuD9Yll Paper:
arxiv.org
Hierarchical Reasoning Model (HRM) is a novel approach using two small neural networks recursing at different frequencies. This biologically inspired method beats large language models (LLMs) on...
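A rough sketch of the recursive-refinement pattern behind these tiny recursive reasoners: one small network repeatedly refines a latent state from (input, current answer, latent), then revises the answer. Layer sizes, loop counts, and the update rules below are simplified assumptions, not TRM's exact architecture.

import torch
import torch.nn as nn

class TinyRecursiveReasoner(nn.Module):
    def __init__(self, dim=64, inner_steps=6, outer_steps=3):
        super().__init__()
        self.refine_z = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.update_y = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.inner_steps, self.outer_steps = inner_steps, outer_steps

    def forward(self, x):
        y = torch.zeros_like(x)  # current answer embedding
        z = torch.zeros_like(x)  # latent scratchpad
        for _ in range(self.outer_steps):
            for _ in range(self.inner_steps):
                z = self.refine_z(torch.cat([x, y, z], dim=-1))  # refine the latent
            y = self.update_y(torch.cat([y, z], dim=-1))  # revise the answer
        return y

print(TinyRecursiveReasoner()(torch.randn(2, 64)).shape)  # torch.Size([2, 64])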
One compelling explanation for why Adam beats SGD is that Adam does better at optimizing the losses of rare classes. Somewhat surprisingly, the improvement of Muon appears to be due to Muon optimizing the tail even better than Adam. It's all about the heavy tail.
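A minimal numpy sketch of the measurement behind this claim: split examples into frequent ("head") and rare ("tail") classes of a heavy-tailed label distribution and track their losses separately for models trained with different optimizers. The Zipfian distribution, the 1% threshold, and the random stand-in predictions below are illustrative assumptions, not the cited experimental setup.

import numpy as np

rng = np.random.default_rng(0)
num_classes = 1000
freqs = 1.0 / np.arange(1, num_classes + 1)  # Zipfian class frequencies: long tail of rare classes
freqs /= freqs.sum()
labels = rng.choice(num_classes, size=5_000, p=freqs)

# Per-example negative log-likelihood under some model's predictions
# (random probabilities here as a stand-in for checkpoints trained with Adam/SGD/Muon).
probs = rng.dirichlet(np.ones(num_classes), size=labels.size)
nll = -np.log(probs[np.arange(labels.size), labels] + 1e-12)

head = freqs[labels] > np.quantile(freqs, 0.99)  # examples whose class is in the top 1% by frequency
print("head-class mean loss:", nll[head].mean())
print("tail-class mean loss:", nll[~head].mean())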
There are traditionally two types of research: problem-driven research and method-driven research. As we’ve seen with large language models and now AlphaEvolve, it should be very clear that purely method-driven research is a huge opportunity. Problem-driven research is nice
Pretty cool paper: https://t.co/YNIGvkqZAE Muon seems to organize the weight updates in an isotropic/unbiased fashion, so that performance on tail examples (perhaps tougher?) is better than Adam.
Muon vs Adam. The gain of Muon mainly comes from its effect on the attention value and output projections and the FFN matrices, and it learns the knowledge in the tail of the distribution faster. Optimizer inductive biases? ( https://t.co/50faHIXUoo ) Though it could be a matter of speed.
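For context on what makes Muon's updates "isotropic": a hedged numpy sketch of its core step, which approximately orthogonalizes the momentum-averaged gradient matrix with a quintic Newton-Schulz iteration, so every singular direction of the update gets roughly equal magnitude. Coefficients follow the commonly used implementation; the toy dimensions are arbitrary.

import numpy as np

def orthogonalize(G, steps=5, eps=1e-7):
    a, b, c = 3.4445, -4.7750, 2.0315  # quintic Newton-Schulz coefficients
    X = G / (np.linalg.norm(G) + eps)  # normalize so the iteration converges
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

rng = np.random.default_rng(0)
update = orthogonalize(rng.standard_normal((256, 512)))
print("leading singular values:", np.round(np.linalg.svd(update, compute_uv=False)[:5], 3))  # all near 1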
What happens when one of @GoogleDeepMind's top scientists sits down to unpack AI’s past, present & future? The full episode with @jainprateek_ is here. 🎙 Topics you can’t miss: 🔹 Deep learning → transformers → generative AI 🔹 India’s once-in-a-generation chance to lead in
We just released AlgoPerf v0.6! 🎉 ✅ Rolling leaderboard ✅ Lower compute costs ✅ JAX jit migration ✅ Bug fixes & flexible API Coming soon: More contemporary baselines + an LM workload… https://t.co/QBOqGvqNWG
github.com
MLCommons Algorithmic Efficiency is a benchmark and competition measuring neural network training speedups due to algorithmic improvements in both training algorithms and models. - mlcommons/algori...
FlashAttention in 3D? Our latest blog explores the #kernel design of 2-Simplicial #Attention, modeling the algorithm with a hardware aligned design and rewriting the entire kernel in TLX (Triton Low Level Extensions). 🔗 https://t.co/7Enro9MHTI
#PyTorch #OpenSourceAI
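A naive numpy reference for what "attention in 3D" means here: 2-simplicial attention scores are a trilinear form over one query and two key sets, with softmax taken over key pairs. The value-combination rule (a pairwise elementwise product of two value projections) and all shapes are assumptions for illustration; the linked blog is about the hardware-aligned TLX kernel, not this reference loop.

import numpy as np

def two_simplicial_attention(q, k1, k2, v1, v2):
    n, d = q.shape
    logits = np.einsum("id,jd,kd->ijk", q, k1, k2) / d ** 0.5  # trilinear scores over key pairs (j, k)
    logits = logits.reshape(n, -1)
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over all (j, k) pairs
    pair_values = np.einsum("jd,kd->jkd", v1, v2).reshape(-1, d)  # pairwise value combination (assumed)
    return weights @ pair_values

rng = np.random.default_rng(0)
q, k1, k2, v1, v2 = (rng.standard_normal((8, 16)) for _ in range(5))
print(two_simplicial_attention(q, k1, k2, v1, v2).shape)  # (8, 16)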
so apparently swe-bench doesn’t filter out future repo states (with the answers) and the agents sometimes figure this out… https://t.co/dCxr8EALhq
github.com
We've identified multiple loopholes with SWE Bench Verified where agents may look at future repository state (by querying it directly or through a variety of methods), and cases in which future...
Can we break the memory wall for LLM inference via KV cache rematerialization? 🚨 Introducing XQuant, which leverages underutilized compute units to eliminate the memory bottleneck for LLM inference! • 10–12.5x memory savings vs. FP16 • Near-zero accuracy loss • Beats
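A hedged numpy sketch of the rematerialization idea: instead of caching K and V, store a low-bit quantized copy of the layer input X and recompute K, V from it at decode time, spending otherwise idle compute to save memory. The uniform int4 quantizer and shapes are illustrative assumptions, not XQuant's actual scheme.

import numpy as np

def quantize_int4(x):
    scale = np.abs(x).max(axis=-1, keepdims=True) / 7.0 + 1e-12
    return np.clip(np.round(x / scale), -8, 7).astype(np.int8), scale

rng = np.random.default_rng(0)
d_model, seq = 512, 128
W_k = rng.standard_normal((d_model, d_model)).astype(np.float32) / d_model ** 0.5
W_v = rng.standard_normal((d_model, d_model)).astype(np.float32) / d_model ** 0.5
X = rng.standard_normal((seq, d_model)).astype(np.float32)

# Cache one quantized tensor (X) instead of two full-precision ones (K and V).
X_q, X_scale = quantize_int4(X)

# At decode time, rematerialize K and V with an extra matmul.
X_hat = X_q.astype(np.float32) * X_scale
K, V = X_hat @ W_k, X_hat @ W_v
print("mean |error| in rematerialized K:", np.abs(K - X @ W_k).mean())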
I managed to prompt gpt-5-thinking into proving the tight 1.75/L bound, matching v2 of the arxiv paper. From the arxiv paper, it was clear that this problem is perfect for the PEP framework. I told gpt to do a search in the coefficients for combining cocoercivity at different pairs of
Claim: gpt-5-pro can prove new interesting mathematics. Proof: I took a convex optimization paper with a clean open problem in it and asked gpt-5-pro to work on it. It proved a better bound than what is in the paper, and I checked the proof: it's correct. Details below.
most impressive part of GPT-5 is the jump in long-context. how do you even do this? produce some strange long-range synthetic data? scan lots of books?
GPT OSS is out. It's OpenAI's first open-weights model release since GPT-2, and some of the technical innovations have huge implications. This is a thread about two of them: Learned attention sinks, and MXFP4 weights.
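On the first of those two: a minimal numpy sketch of a learned attention sink, i.e. one learned per-head logit that competes in the softmax but contributes no value, giving the head somewhere harmless to dump probability mass. The exact placement and parameterization in GPT OSS may differ; this is only an illustration of the mechanism.

import numpy as np

def attention_with_sink(q, k, v, sink_logit):
    d = q.shape[-1]
    logits = q @ k.T / d ** 0.5  # (n_q, n_k) logits over real tokens
    sink = np.full((logits.shape[0], 1), sink_logit)  # learned scalar sink logit, broadcast per query
    full = np.concatenate([logits, sink], axis=-1)
    p = np.exp(full - full.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    return p[:, :-1] @ v  # the sink column absorbs probability mass but attends to nothing

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 32)) for _ in range(3))
print(attention_with_sink(q, k, v, sink_logit=2.0).shape)  # (4, 32)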
I think LoRA-RITE (led by Jui-Nan Yen) is one of the most interesting/novel works I've been part of. Consider a function f(AB^T), where A and B are tall and thin parameter matrices. This function is invariant to transformations by an invertible matrix M: A = A'M, B = B'M^{-T}. But is the optimization
🧵 1/ LoRA, a popular parameter-efficient training method, can suffer from imbalanced updates. Observe how standard optimizers (like Adam's blue line below) lead to vastly different updates for LoRA factors A & B. Our LoRA-RITE (purple) optimizer solves this! 👇
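A quick numerical check of the invariance described in these two posts: f(AB^T) is unchanged when the factors are reparameterized with any invertible matrix M (A = A'M, B = B'M^{-T}), even though the individual factors, and hence per-factor optimizer statistics, change completely. Dimensions and the choice of M are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
n, m, r = 64, 48, 8  # tall-and-thin LoRA factors of rank r
A = rng.standard_normal((n, r))
B = rng.standard_normal((m, r))
M = rng.standard_normal((r, r)) + 3 * np.eye(r)  # some invertible r x r transformation

A_prime = A @ np.linalg.inv(M)   # so that A = A_prime @ M
B_prime = B @ M.T                # so that B = B_prime @ M^{-T}

print(np.allclose(A @ B.T, A_prime @ B_prime.T))         # True: the product, hence f(AB^T), is unchanged
print(np.allclose(A, A_prime), np.allclose(B, B_prime))  # False False: the factors themselves differ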
Come see how a careful gradient analysis of standard attention reveals that backpropagated gradients can often become tiny, thereby slowing down learning. Solution: LASER, which performs softmax attention in an exponentiated value space. Work led by the impressive @dvsaisurya
📢 Thrilled to share our new paper, LASER: Attention with Exponential Transformation, accepted at ICML2025, work done at Google. Come by our poster presentation! 🗓️ Thurs, July 17th, 4:30-7pm 📍 West Exhibition Hall B2-B3, # W-915 Read the full paper here:
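A hedged numpy sketch of the idea as described above: run standard softmax attention, but in an exponentiated value space, i.e. output = log(P · exp(V)), computed with a max-shift so the exponentials stay finite. Per-head details and where the log/exp sit in the full layer are simplified assumptions here.

import numpy as np

def laser_attention(q, k, v):
    d = q.shape[-1]
    logits = q @ k.T / d ** 0.5
    p = np.exp(logits - logits.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)            # standard softmax attention weights
    v_max = v.max(axis=0, keepdims=True)          # shift so exp() does not overflow
    return np.log(p @ np.exp(v - v_max)) + v_max  # attention applied in exponentiated value space

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((6, 16)) for _ in range(3))
print(laser_attention(q, k, v).shape)  # (6, 16)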
top-k greedy inference for diffusion models can unlock better accuracies. Wondering if finding the optimal order to unmask the tokens can be automated across prompts/tasks.
Excited about this new work where we dig into the role of token order in masked diffusions! MDMs train on some horribly hard tasks, but careful planning at inference can sidestep the hardest ones, dramatically improving over vanilla MDM sampling (e.g. 7%->90% acc on Sudoku) 1/
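A small numpy sketch of the top-k greedy idea mentioned above: at each step, ask the model for per-position token distributions and commit only the k still-masked positions it is most confident about. The dummy predictor and all hyperparameters are stand-ins; real inference-time planners for masked diffusion models are more sophisticated.

import numpy as np

def topk_greedy_unmask(model, seq_len, mask_id=-1, k=2):
    tokens = np.full(seq_len, mask_id)
    while (tokens == mask_id).any():
        probs = model(tokens)                      # (seq_len, vocab) per-position predictions
        conf = probs.max(axis=-1)
        conf[tokens != mask_id] = -np.inf          # only still-masked positions are candidates
        n_left = int((tokens == mask_id).sum())
        for pos in np.argsort(-conf)[: min(k, n_left)]:
            tokens[pos] = probs[pos].argmax()      # commit the most confident guesses first
    return tokens

def dummy_model(tokens, vocab=10):
    rng = np.random.default_rng(abs(int(tokens.sum())))  # deterministic toy predictor
    p = rng.random((tokens.size, vocab))
    return p / p.sum(axis=-1, keepdims=True)

print(topk_greedy_unmask(dummy_model, seq_len=8))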