Sai Surya Duvvuri (@dvsaisurya)
546 Followers · 1K Following · 10 Media · 196 Statuses

Visiting Researcher at FAIR, Meta and CS PhD student at UT Austin. Previously, SR at Google | Pre-Doctoral Research Fellow at MSR India | CS UG at IIT KGP

Austin, Texas · Joined November 2011
@louvishh
lovish
19 days
🚨 New Paper: The Art of Scaling Reinforcement Learning Compute for LLMs 🚨 We burnt a lot of GPU-hours to provide the community with the first open, large-scale systematic study on RL scaling for LLMs. https://t.co/49REQZ4R6G
@Devvrit_Khatri
Devvrit
19 days
Wish to build scaling laws for RL but not sure how to scale? Or what scales? Or would RL even scale predictably? We introduce: The Art of Scaling Reinforcement Learning Compute for LLMs
2 replies · 15 reposts · 68 likes
@Devvrit_Khatri
Devvrit
19 days
Wish to build scaling laws for RL but not sure how to scale? Or what scales? Or would RL even scale predictably? We introduce: The Art of Scaling Reinforcement Learning Compute for LLMs
11 replies · 103 reposts · 552 likes
@agarwl_
Rishabh Agarwal
20 days
Sneak peek from a paper about scaling RL compute for LLMs: probably the most compute-expensive paper I've worked on, but hoping that others can run experiments cheaply for the science of scaling RL. Coincidentally, this is similar motivation to what we had for the NeurIPS best
11 replies · 37 reposts · 416 likes
@karpathy
Andrej Karpathy
27 days
POV: Your LLM agent is dividing a by b
114 replies · 143 reposts · 2K likes
@jm_alexia
Alexia Jolicoeur-Martineau
28 days
New paper 📜: Tiny Recursion Model (TRM) is a recursive reasoning approach with a tiny 7M-parameter neural network that obtains 45% on ARC-AGI-1 and 8% on ARC-AGI-2, beating most LLMs. Blog: https://t.co/w5ZDsHDDPE Code: https://t.co/7UgKuD9Yll Paper:
arxiv.org
Hierarchical Reasoning Model (HRM) is a novel approach using two small neural networks recursing at different frequencies. This biologically inspired method beats Large Language models (LLMs) on...
137 replies · 656 reposts · 4K likes
@konstmish
Konstantin Mishchenko
1 month
One compelling explanation for why Adam beats SGD is that Adam does better at optimizing the losses of rare classes. Somewhat surprisingly, the improvement of Muon appears to be due to Muon optimizing the tail even better than Adam. It's all about the heavy tail.
4 replies · 59 reposts · 502 likes
@_jasonwei
Jason Wei
5 months
There are traditionally two types of research: problem-driven research and method-driven research. As we’ve seen with large language models and now AlphaEvolve, it should be very clear now that total method-driven research is a huge opportunity. Problem-driven research is nice
21 replies · 92 reposts · 721 likes
@dvsaisurya
Sai Surya Duvvuri
1 month
Pretty cool paper: https://t.co/YNIGvkqZAE Muon seems to organize the weight updates in an isotropic/unbiased fashion, so that performance on tail examples (perhaps tougher?) is better than with Adam.
@rosinality
Rosinality
1 month
Muon vs Adam. The gain of Muon mainly comes from its effect on the attention value and output, FFN matrices. And it learns the knowledge in the tail of the distribution faster. Optimizer inductive biases? ( https://t.co/50faHIXUoo) Though it could be a matter of speed.
0 replies · 0 reposts · 1 like
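For readers unsure what "isotropic/unbiased" updates means here: Muon replaces the raw momentum matrix with an approximately orthogonalized version, so no single singular direction dominates the update. Below is a minimal PyTorch sketch of that step; the Newton-Schulz coefficients and the simple momentum rule are commonly cited values used here as assumptions, not the exact published optimizer.

```python
import torch

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately map g to an orthogonal(ish) matrix via an odd polynomial
    iteration (quintic Newton-Schulz, as popularized by Muon). The coefficients
    below are the commonly cited ones; treat them as an assumption."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (g.norm() + 1e-7)              # normalize so the iteration converges
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * A @ A) @ x
    return x.T if transposed else x

def muon_style_step(weight, grad, momentum_buf, lr=0.02, beta=0.95):
    """One illustrative update: momentum, then an orthogonalized direction.
    Equalizing the update's singular values is the 'isotropic' behavior
    discussed in the quoted tweets."""
    momentum_buf.mul_(beta).add_(grad)
    update = newton_schulz_orthogonalize(momentum_buf)
    weight.add_(update, alpha=-lr)
    return weight, momentum_buf
```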
@Primevp_in
PrimeVenturePartners
1 month
What happens when one of @GoogleDeepMind's top scientists sits down to unpack AI’s past, present & future? The full episode with @jainprateek_ is here. 🎙 Topics you can’t miss: 🔹 Deep learning → transformers → generative AI 🔹 India’s once-in-a-generation chance to lead in
0 replies · 4 reposts · 13 likes
@algoperf
AlgoPerf
2 months
We just released AlgoPerf v0.6! 🎉 ✅ Rolling leaderboard ✅ Lower compute costs ✅ JAX jit migration ✅ Bug fixes & flexible API Coming soon: More contemporary baselines + an LM workload… https://t.co/QBOqGvqNWG
github.com
MLCommons Algorithmic Efficiency is a benchmark and competition measuring neural network training speedups due to algorithmic improvements in both training algorithms and models. - mlcommons/algori...
0 replies · 9 reposts · 46 likes
@PyTorch
PyTorch
2 months
FlashAttention in 3D? Our latest blog explores the #kernel design of 2-Simplicial #Attention, modeling the algorithm with a hardware aligned design and rewriting the entire kernel in TLX (Triton Low Level Extensions). 🔗 https://t.co/7Enro9MHTI #PyTorch #OpenSourceAI
4 replies · 35 reposts · 191 likes
@adityastomar_
Aditya Tomar
3 months
Can we break the memory wall for LLM inference via KV cache rematerialization? 🚨 Introducing XQuant, which leverages underutilized compute units to eliminate the memory bottleneck for LLM inference! • 10–12.5x memory savings vs. FP16 • Near-zero accuracy loss • Beats
26 replies · 92 reposts · 668 likes
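The tweet names the core idea, KV cache rematerialization, without the mechanics. A hedged sketch of one way to read it: cache a cheaply quantized copy of the layer input X and recompute K and V by matmul when attention needs them, spending spare compute to save memory. The quantization scheme and class interface below are illustrative assumptions, not XQuant's actual implementation.

```python
import torch

class RematerializedKVCache:
    """Illustrative sketch: instead of storing K and V per layer, store the
    (cheaply quantized) layer input X and recompute K = X @ W_k, V = X @ W_v
    on demand, trading extra matmuls for cache memory."""

    def __init__(self, w_k: torch.Tensor, w_v: torch.Tensor):
        self.w_k, self.w_v = w_k, w_v
        self.x_q, self.scale = [], []      # per-chunk int8 activations + scales

    def append(self, x: torch.Tensor):
        # naive per-token absmax int8 quantization of the layer input
        s = x.abs().amax(dim=-1, keepdim=True) / 127.0 + 1e-8
        self.x_q.append((x / s).round().to(torch.int8))
        self.scale.append(s)

    def rematerialize(self):
        # dequantize X and recompute K, V instead of loading them from memory
        x = torch.cat([q.float() * s for q, s in zip(self.x_q, self.scale)], dim=0)
        return x @ self.w_k, x @ self.w_v
```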
@jasondeanlee
Jason Lee
2 months
I managed to prompt gpt-5-thinking into proving the tight 1.75/L bound, matching v2 of the arXiv paper. From the paper, it was clear that this problem is perfect for the PEP framework. I told gpt to do a search in the coefficients for combining cocoercivity at different pairs of
@SebastienBubeck
Sebastien Bubeck
3 months
Claim: gpt-5-pro can prove new interesting mathematics. Proof: I took a convex optimization paper with a clean open problem in it and asked gpt-5-pro to work on it. It proved a better bound than what is in the paper, and I checked the proof: it's correct. Details below.
11 replies · 26 reposts · 245 likes
@jxmnop
dr. jack morris
3 months
most impressive part of GPT-5 is the jump in long-context. how do you even do this? produce some strange long-range synthetic data? scan lots of books?
33 replies · 28 reposts · 352 likes
@carrigmat
Matthew Carrigan
3 months
GPT OSS is out. It's OpenAI's first open-weights model release since GPT-2, and some of the technical innovations have huge implications. This is a thread about two of them: Learned attention sinks, and MXFP4 weights.
2 replies · 37 reposts · 186 likes
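Of the two innovations mentioned, learned attention sinks are the easier to illustrate. A common formulation (assumed here, not necessarily GPT OSS's exact code) appends one learnable logit per head that soaks up probability mass but contributes no value vector:

```python
import torch
import torch.nn.functional as F

def attention_with_learned_sink(q, k, v, sink_logit):
    """Sketch of a 'learned attention sink': one extra learnable logit per head
    absorbs probability mass but adds no value vector, so a head can opt out of
    attending to real tokens. Causal masking is omitted for brevity.
    q, k, v: (heads, seq, d); sink_logit: (heads,)"""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5            # (h, s, s)
    sink = sink_logit[:, None, None].expand(-1, scores.shape[1], 1)  # (h, s, 1)
    probs = F.softmax(torch.cat([scores, sink], dim=-1), dim=-1)
    return probs[..., :-1] @ v    # drop the sink column: it has no value vector
```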
@inderjit_ml
Inderjit Dhillon
3 months
AI Mode - try it out!
@thefox
Nick Fox
3 months
Just ~2 months in and AI Mode already has over 100 million monthly active users in the U.S. and India, and it's great to hear the awesome response from our users!! Now, we’re excited to ship AI Mode, our most powerful AI search, to the UK! Starting to roll out today. 🇬🇧
0 replies · 1 repost · 3 likes
@dvsaisurya
Sai Surya Duvvuri
3 months
I think LoRA-RITE (led by Jui-Nan Yen) is one of the most interesting/novel works I've been part of. Consider a function f(AB^T), where A and B are tall and thin parameter matrices. This function is invariant to invertible transformations M: with A = A'M and B = B'M^{-T}, we have AB^T = A'B'^T. But is the optimization
@inderjit_ml
Inderjit Dhillon
6 months
🧵 1/ LoRA, a popular parameter-efficient training method, can suffer from imbalanced updates. Observe how standard optimizers (like Adam's blue line below) lead to vastly different updates for LoRA factors A & B. Our LoRA-RITE (purple) optimizer solves this! 👇
0 replies · 3 reposts · 7 likes
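The invariance in the tweet above is easy to verify numerically, and so is the fact that a plain per-factor gradient step breaks it, which is the imbalance LoRA-RITE targets. A small sketch (plain SGD used for simplicity; shapes and variable names are illustrative):

```python
import torch

torch.manual_seed(0)
d, r = 8, 2
A, B = torch.randn(d, r), torch.randn(d, r)
M = torch.randn(r, r) + 3 * torch.eye(r)        # an invertible r x r transform

# Reparametrize: A = A'M, B = B'M^{-T}  =>  same adapter, since A B^T = A' B'^T
A_p = A @ torch.linalg.inv(M)
B_p = B @ M.T
assert torch.allclose(A @ B.T, A_p @ B_p.T, atol=1e-5)

# But a plain gradient step on each factor is NOT invariant to this choice:
# for loss f(A B^T), grad_A = G @ B and grad_B = G^T @ A with G = df/d(AB^T).
G = torch.randn(d, d)                            # stand-in for df/d(A B^T)
lr = 0.1
prod1 = (A - lr * G @ B) @ (B - lr * G.T @ A).T
prod2 = (A_p - lr * G @ B_p) @ (B_p - lr * G.T @ A_p).T
print((prod1 - prod2).abs().max())               # noticeably nonzero
```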
@inderjit_ml
Inderjit Dhillon
4 months
Come see how a careful gradient analysis of standard attention reveals that backpropagated gradients can often become tiny, thereby slowing down learning. Solution: LASER, which conducts softmax attention in exponentiated value space. Work led by the impressive @dvsaisurya
@dvsaisurya
Sai Surya Duvvuri
4 months
📢 Thrilled to share our new paper, LASER: Attention with Exponential Transformation, accepted at ICML2025, work done at Google. Come by our poster presentation! 🗓️ Thurs, July 17th, 4:30-7pm 📍 West Exhibition Hall B2-B3, # W-915 Read the full paper here:
0 replies · 2 reposts · 9 likes
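A hedged reading of "softmax attention in exponentiated value space": apply the usual attention weights to exp(V) and map the result back with log, so the weighted combination behaves more like a soft max over values and the backpropagated signal stays larger. The sketch below follows that reading with a max-shift for numerical stability; scaling and masking details are simplified assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def laser_style_attention(q, k, v):
    """Sketch: standard attention weights applied to exp(V), mapped back with
    log, i.e. a log-weighted-sum-exp over values. The max-shift keeps exp()
    from overflowing. q, k, v: (seq, d)"""
    attn = F.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)   # (s, s) standard weights
    m = v.max(dim=0, keepdim=True).values                    # (1, d) stability shift
    return torch.log(attn @ torch.exp(v - m) + 1e-12) + m    # log-weighted-sum-exp
```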
@dvsaisurya
Sai Surya Duvvuri
4 months
top-k greedy inference for diffusion models can unlock better accuracies. Wondering if finding the optimal order to unmask the tokens can be automated across prompts/tasks.
@sitanch
Sitan Chen
9 months
Excited about this new work where we dig into the role of token order in masked diffusions! MDMs train on some horribly hard tasks, but careful planning at inference can sidestep the hardest ones, dramatically improving over vanilla MDM sampling (e.g. 7%->90% acc on Sudoku) 1/
0 replies · 0 reposts · 2 likes
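A sketch of the top-k greedy inference idea from the tweet above: at each step, score every still-masked position with the denoiser and commit only the k most confident predictions, so the unmasking order adapts to the input rather than being fixed or random. The `model(ids) -> (seq, vocab)` logits interface is an assumption for illustration, not a specific library's API.

```python
import torch

@torch.no_grad()
def topk_greedy_unmask(model, ids, mask_id, k=4):
    """Iteratively unmask a masked-diffusion sequence, k most confident
    positions at a time. ids: (seq,) LongTensor containing mask_id tokens."""
    ids = ids.clone()
    while (ids == mask_id).any():
        masked = (ids == mask_id).nonzero(as_tuple=True)[0]   # masked positions
        logits = model(ids)                                    # (seq, vocab), assumed
        probs = logits[masked].softmax(dim=-1)                 # (n_masked, vocab)
        conf, tok = probs.max(dim=-1)                          # best token + confidence
        pick = conf.topk(min(k, masked.numel())).indices       # k most confident slots
        ids[masked[pick]] = tok[pick]                          # commit only those
    return ids
```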