Sai Surya Duvvuri
@dvsaisurya
Followers: 546 · Following: 1K · Media: 10 · Statuses: 196
Visiting Researcher at FAIR, Meta and CS PhD student at UT Austin. Previously, SR at Google | Pre-Doctoral Research Fellow at MSR India | CS UG at IIT KGP
Austin, Texas
Joined November 2011
🚨 New Paper: The Art of Scaling Reinforcement Learning Compute for LLMs 🚨 We burnt a lot of GPU-hours to provide the community with the first open, large-scale systematic study on RL scaling for LLMs. https://t.co/49REQZ4R6G
Wish to build scaling laws for RL but not sure how to scale? Or what scales? Or would RL even scale predictably? We introduce: The Art of Scaling Reinforcement Learning Compute for LLMs
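A minimal sketch of what "scaling laws for RL compute" means in practice: fit a saturating curve of performance against (log) training compute and extrapolate it. The sigmoidal functional form, parameter names, and synthetic numbers below are illustrative assumptions, not the paper's actual recipe.

import numpy as np
from scipy.optimize import curve_fit

def saturating_curve(log_compute, r_max, midpoint, slope):
    # Performance rises toward an asymptote r_max as compute grows (sigmoid in log-compute).
    return r_max / (1.0 + np.exp(-slope * (log_compute - midpoint)))

# Hypothetical (GPU-hours, pass-rate) measurements from a sweep of RL runs.
compute = np.array([1e2, 3e2, 1e3, 3e3, 1e4, 3e4])
reward = np.array([0.08, 0.15, 0.28, 0.41, 0.52, 0.58])

params, _ = curve_fit(saturating_curve, np.log(compute), reward, p0=[0.6, 7.0, 1.0])
print("fitted asymptote / midpoint / slope:", np.round(params, 2))
print("extrapolation to 1e5 GPU-hours:", saturating_curve(np.log(1e5), *params))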
Sneak peek from a paper about scaling RL compute for LLMs: probably the most compute-expensive paper I've worked on, but hoping that others can run experiments cheaply for the science of scaling RL. Coincidentally, this is similar motivation to what we had for the NeurIPS best
New paper 📜: Tiny Recursion Model (TRM) is a recursive reasoning approach with a tiny 7M parameters neural network that obtains 45% on ARC-AGI-1 and 8% on ARC-AGI-2, beating most LLMs. Blog: https://t.co/w5ZDsHDDPE Code: https://t.co/7UgKuD9Yll Paper:
arxiv.org
Hierarchical Reasoning Model (HRM) is a novel approach using two small neural networks recursing at different frequencies. This biologically inspired method beats large language models (LLMs) on...
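A rough sketch of the recursive-refinement pattern behind these tiny recursive reasoners: one small network repeatedly refines a latent state from (input, current answer, latent), then revises the answer. Layer sizes, loop counts, and the update rules below are simplified assumptions, not TRM's exact architecture.

import torch
import torch.nn as nn

class TinyRecursiveReasoner(nn.Module):
    def __init__(self, dim=64, inner_steps=6, outer_steps=3):
        super().__init__()
        self.refine_z = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.update_y = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.inner_steps, self.outer_steps = inner_steps, outer_steps

    def forward(self, x):
        y = torch.zeros_like(x)  # current answer embedding
        z = torch.zeros_like(x)  # latent scratchpad
        for _ in range(self.outer_steps):
            for _ in range(self.inner_steps):
                z = self.refine_z(torch.cat([x, y, z], dim=-1))  # refine the latent
            y = self.update_y(torch.cat([y, z], dim=-1))  # revise the answer
        return y

print(TinyRecursiveReasoner()(torch.randn(2, 64)).shape)  # torch.Size([2, 64])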
One compelling explanation for why Adam beats SGD is that Adam does better at optimizing the losses of rare classes. Somewhat surprisingly, the improvement of Muon appears to be due to Muon optimizing the tail even better than Adam. It's all about the heavy tail.
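A minimal numpy sketch of the measurement behind this claim: split examples into frequent ("head") and rare ("tail") classes of a heavy-tailed label distribution and track their losses separately for models trained with different optimizers. The Zipfian distribution, the 1% threshold, and the random stand-in predictions below are illustrative assumptions, not the cited experimental setup.

import numpy as np

rng = np.random.default_rng(0)
num_classes = 1000
freqs = 1.0 / np.arange(1, num_classes + 1)  # Zipfian class frequencies: long tail of rare classes
freqs /= freqs.sum()
labels = rng.choice(num_classes, size=5_000, p=freqs)

# Per-example negative log-likelihood under some model's predictions
# (random probabilities here as a stand-in for checkpoints trained with Adam/SGD/Muon).
probs = rng.dirichlet(np.ones(num_classes), size=labels.size)
nll = -np.log(probs[np.arange(labels.size), labels] + 1e-12)

head = freqs[labels] > np.quantile(freqs, 0.99)  # examples whose class is in the top 1% by frequency
print("head-class mean loss:", nll[head].mean())
print("tail-class mean loss:", nll[~head].mean())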
There are traditionally two types of research: problem-driven research and method-driven research. As we’ve seen with large language models and now AlphaEvolve, it should be very clear that purely method-driven research is a huge opportunity. Problem-driven research is nice
Pretty cool paper: https://t.co/YNIGvkqZAE Muon seems to organize the weight updates in an isotropic/unbiased fashion, so that performance on tail examples (perhaps tougher?) is better than Adam.
Muon vs Adam. The gain of Muon mainly comes from its effect on the attention value and output projections and the FFN matrices, and it learns the knowledge in the tail of the distribution faster. Optimizer inductive biases? ( https://t.co/50faHIXUoo ) Though it could be a matter of speed.
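For context on what makes Muon's updates "isotropic": a hedged numpy sketch of its core step, which approximately orthogonalizes the momentum-averaged gradient matrix with a quintic Newton-Schulz iteration, so every singular direction of the update gets roughly equal magnitude. Coefficients follow the commonly used implementation; the toy dimensions are arbitrary.

import numpy as np

def orthogonalize(G, steps=5, eps=1e-7):
    a, b, c = 3.4445, -4.7750, 2.0315  # quintic Newton-Schulz coefficients
    X = G / (np.linalg.norm(G) + eps)  # normalize so the iteration converges
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

rng = np.random.default_rng(0)
update = orthogonalize(rng.standard_normal((256, 512)))
print("leading singular values:", np.round(np.linalg.svd(update, compute_uv=False)[:5], 3))  # all near 1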
What happens when one of @GoogleDeepMind's top scientists sits down to unpack AI’s past, present & future? The full episode with @jainprateek_ is here. 🎙 Topics you can’t miss: 🔹 Deep learning → transformers → generative AI 🔹 India’s once-in-a-generation chance to lead in
We just released AlgoPerf v0.6! 🎉 ✅ Rolling leaderboard ✅ Lower compute costs ✅ JAX jit migration ✅ Bug fixes & flexible API Coming soon: More contemporary baselines + an LM workload… https://t.co/QBOqGvqNWG
github.com
MLCommons Algorithmic Efficiency is a benchmark and competition measuring neural network training speedups due to algorithmic improvements in both training algorithms and models. - mlcommons/algori...
FlashAttention in 3D? Our latest blog explores the #kernel design of 2-Simplicial #Attention, modeling the algorithm with a hardware aligned design and rewriting the entire kernel in TLX (Triton Low Level Extensions). 🔗 https://t.co/7Enro9MHTI
#PyTorch #OpenSourceAI
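A naive numpy reference for what "attention in 3D" means here: 2-simplicial attention scores are a trilinear form over one query and two key sets, with softmax taken over key pairs. The value-combination rule (a pairwise elementwise product of two value projections) and all shapes are assumptions for illustration; the linked blog is about the hardware-aligned TLX kernel, not this reference loop.

import numpy as np

def two_simplicial_attention(q, k1, k2, v1, v2):
    n, d = q.shape
    logits = np.einsum("id,jd,kd->ijk", q, k1, k2) / d ** 0.5  # trilinear scores over key pairs (j, k)
    logits = logits.reshape(n, -1)
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over all (j, k) pairs
    pair_values = np.einsum("jd,kd->jkd", v1, v2).reshape(-1, d)  # pairwise value combination (assumed)
    return weights @ pair_values

rng = np.random.default_rng(0)
q, k1, k2, v1, v2 = (rng.standard_normal((8, 16)) for _ in range(5))
print(two_simplicial_attention(q, k1, k2, v1, v2).shape)  # (8, 16)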
so apparently swe-bench doesn’t filter out future repo states (with the answers) and the agents sometimes figure this out… https://t.co/dCxr8EALhq
github.com
We've identified multiple loopholes with SWE Bench Verified where agents may look at future repository state (by querying it directly or through a variety of methods), and cases in which future...
Can we break the memory wall for LLM inference via KV cache rematerialization? 🚨 Introducing XQuant, which leverages underutilized compute units to eliminate the memory bottleneck for LLM inference! • 10–12.5x memory savings vs. FP16 • Near-zero accuracy loss • Beats
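A hedged numpy sketch of the rematerialization idea: instead of caching K and V, store a low-bit quantized copy of the layer input X and recompute K, V from it at decode time, spending otherwise idle compute to save memory. The uniform int4 quantizer and shapes are illustrative assumptions, not XQuant's actual scheme.

import numpy as np

def quantize_int4(x):
    scale = np.abs(x).max(axis=-1, keepdims=True) / 7.0 + 1e-12
    return np.clip(np.round(x / scale), -8, 7).astype(np.int8), scale

rng = np.random.default_rng(0)
d_model, seq = 512, 128
W_k = rng.standard_normal((d_model, d_model)).astype(np.float32) / d_model ** 0.5
W_v = rng.standard_normal((d_model, d_model)).astype(np.float32) / d_model ** 0.5
X = rng.standard_normal((seq, d_model)).astype(np.float32)

# Cache one quantized tensor (X) instead of two full-precision ones (K and V).
X_q, X_scale = quantize_int4(X)

# At decode time, rematerialize K and V with an extra matmul.
X_hat = X_q.astype(np.float32) * X_scale
K, V = X_hat @ W_k, X_hat @ W_v
print("mean |error| in rematerialized K:", np.abs(K - X @ W_k).mean())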
I managed to prompt gpt-5-thinking into proving the tight 1.75/L bound, matching v2 of the arxiv paper. From the arxiv paper, it was clear that this problem is perfect for the PEP framework. I told gpt to do a search in the coefficients for combining cocoercivity at different pairs of
Claim: gpt-5-pro can prove new interesting mathematics. Proof: I took a convex optimization paper with a clean open problem in it and asked gpt-5-pro to work on it. It proved a better bound than what is in the paper, and I checked the proof: it's correct. Details below.
most impressive part of GPT-5 is the jump in long-context. how do you even do this? produce some strange long-range synthetic data? scan lots of books?
GPT OSS is out. It's OpenAI's first open-weights model release since GPT-2, and some of the technical innovations have huge implications. This is a thread about two of them: Learned attention sinks, and MXFP4 weights.
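On the first of those two: a minimal numpy sketch of a learned attention sink, i.e. one learned per-head logit that competes in the softmax but contributes no value, giving the head somewhere harmless to dump probability mass. The exact placement and parameterization in GPT OSS may differ; this is only an illustration of the mechanism.

import numpy as np

def attention_with_sink(q, k, v, sink_logit):
    d = q.shape[-1]
    logits = q @ k.T / d ** 0.5  # (n_q, n_k) logits over real tokens
    sink = np.full((logits.shape[0], 1), sink_logit)  # learned scalar sink logit, broadcast per query
    full = np.concatenate([logits, sink], axis=-1)
    p = np.exp(full - full.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    return p[:, :-1] @ v  # the sink column absorbs probability mass but attends to nothing

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 32)) for _ in range(3))
print(attention_with_sink(q, k, v, sink_logit=2.0).shape)  # (4, 32)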
I think LoRA-RITE (led by Jui-Nan Yen) is one of the most interesting/novel works I've been part of. Consider a function f(AB^T), where A and B are tall and thin parameter matrices. This function is invariant to transformations by an invertible matrix M: A = A'M, B = B'M^{-T}. But is the optimization
🧵 1/ LoRA, a popular parameter-efficient training method, can suffer from imbalanced updates. Observe how standard optimizers (like Adam's blue line below) lead to vastly different updates for LoRA factors A & B. Our LoRA-RITE (purple) optimizer solves this! 👇
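A quick numerical check of the invariance described in these two posts: f(AB^T) is unchanged when the factors are reparameterized with any invertible matrix M (A = A'M, B = B'M^{-T}), even though the individual factors, and hence per-factor optimizer statistics, change completely. Dimensions and the choice of M are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
n, m, r = 64, 48, 8  # tall-and-thin LoRA factors of rank r
A = rng.standard_normal((n, r))
B = rng.standard_normal((m, r))
M = rng.standard_normal((r, r)) + 3 * np.eye(r)  # some invertible r x r transformation

A_prime = A @ np.linalg.inv(M)   # so that A = A_prime @ M
B_prime = B @ M.T                # so that B = B_prime @ M^{-T}

print(np.allclose(A @ B.T, A_prime @ B_prime.T))         # True: the product, hence f(AB^T), is unchanged
print(np.allclose(A, A_prime), np.allclose(B, B_prime))  # False False: the factors themselves differ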
Come see how a careful gradient analysis of standard attention reveals that backpropagated gradients can often become tiny, thereby slowing down learning. Solution: LASER, which performs softmax attention in an exponentiated value space. Work led by the impressive @dvsaisurya
📢 Thrilled to share our new paper, LASER: Attention with Exponential Transformation, accepted at ICML2025, work done at Google. Come by our poster presentation! 🗓️ Thurs, July 17th, 4:30-7pm 📍 West Exhibition Hall B2-B3, # W-915 Read the full paper here:
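A hedged numpy sketch of the idea as described above: run standard softmax attention, but in an exponentiated value space, i.e. output = log(P · exp(V)), computed with a max-shift so the exponentials stay finite. Per-head details and where the log/exp sit in the full layer are simplified assumptions here.

import numpy as np

def laser_attention(q, k, v):
    d = q.shape[-1]
    logits = q @ k.T / d ** 0.5
    p = np.exp(logits - logits.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)            # standard softmax attention weights
    v_max = v.max(axis=0, keepdims=True)          # shift so exp() does not overflow
    return np.log(p @ np.exp(v - v_max)) + v_max  # attention applied in exponentiated value space

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((6, 16)) for _ in range(3))
print(laser_attention(q, k, v).shape)  # (6, 16)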
top-k greedy inference for diffusion models can unlock better accuracies. Wondering if finding the optimal order to unmask the tokens can be automated across prompts/tasks.
Excited about this new work where we dig into the role of token order in masked diffusions! MDMs train on some horribly hard tasks, but careful planning at inference can sidestep the hardest ones, dramatically improving over vanilla MDM sampling (e.g. 7%->90% acc on Sudoku) 1/
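A small numpy sketch of the top-k greedy idea mentioned above: at each step, ask the model for per-position token distributions and commit only the k still-masked positions it is most confident about. The dummy predictor and all hyperparameters are stand-ins; real inference-time planners for masked diffusion models are more sophisticated.

import numpy as np

def topk_greedy_unmask(model, seq_len, mask_id=-1, k=2):
    tokens = np.full(seq_len, mask_id)
    while (tokens == mask_id).any():
        probs = model(tokens)                      # (seq_len, vocab) per-position predictions
        conf = probs.max(axis=-1)
        conf[tokens != mask_id] = -np.inf          # only still-masked positions are candidates
        n_left = int((tokens == mask_id).sum())
        for pos in np.argsort(-conf)[: min(k, n_left)]:
            tokens[pos] = probs[pos].argmax()      # commit the most confident guesses first
    return tokens

def dummy_model(tokens, vocab=10):
    rng = np.random.default_rng(abs(int(tokens.sum())))  # deterministic toy predictor
    p = rng.random((tokens.size, vocab))
    return p / p.sum(axis=-1, keepdims=True)

print(topk_greedy_unmask(dummy_model, seq_len=8))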