Rishabh Tiwari
@rish2k1
Followers
804
Following
140
Media
14
Statuses
65
RS Intern @Meta | CS PhD @UCBerkeley | Ex-@GoogleAI | Research area: Efficient and robust AI systems
Berkeley, CA
Joined May 2019
There is so much noise in the LLM RL space, so we sat down and ran everything at scale (so you don't have to) and present to you "The Art of Scaling RL". Give this a read before starting your next RL run. Led by the amazing @Devvrit_Khatri @lovish
Want to build scaling laws for RL but not sure how to scale? Or what scales? Or whether RL would even scale predictably? We introduce: The Art of Scaling Reinforcement Learning Compute for LLMs
3
20
220
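(A minimal sketch of what a "predictable" compute-performance fit could look like, assuming a saturating sigmoid-style curve; the functional form, parameter names, and data points below are illustrative placeholders, not the paper's actual parameterization or results.)

```python
# Hedged sketch: fit a saturating curve of pass rate vs. RL compute.
# The form R(C) = A - (A - R0) / (1 + (C / C_mid)**B) is an illustrative
# assumption, not necessarily the paper's exact parameterization.
import numpy as np
from scipy.optimize import curve_fit

def saturating_fit(C, A, B, C_mid, R0):
    """Performance rises from R0 toward asymptote A as compute C grows."""
    return A - (A - R0) / (1.0 + (C / C_mid) ** B)

# Synthetic (compute in GPU-hours, pass rate) points, purely illustrative.
compute = np.array([1e2, 3e2, 1e3, 3e3, 1e4, 3e4, 1e5])
pass_rate = np.array([0.22, 0.28, 0.37, 0.46, 0.54, 0.59, 0.62])

params, _ = curve_fit(saturating_fit, compute, pass_rate,
                      p0=[0.7, 0.8, 5e3, 0.2], maxfev=10000)
A, B, C_mid, R0 = params
print(f"asymptote A~{A:.2f}, slope B~{B:.2f}, midpoint C_mid~{C_mid:.0f} GPU-hours")

# Extrapolate: predicted pass rate at 4x the largest observed compute budget.
print(f"predicted at 4e5 GPU-hours: {saturating_fit(4e5, *params):.2f}")
```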
Excited to share one of the first projects from my PhD! We find that Adam (often seen as approximate second-order) can actually outperform Gauss-Newton (true second-order) in certain cases! Our 2x2 comparison across basis choice and gradient noise is revealing! Thread by Sham:
(1/9) Diagonal preconditioners such as Adam typically use empirical gradient information rather than true second-order curvature. Is this merely a computational compromise or can it be advantageous? Our work confirms the latter: Adam can outperform Gauss-Newton in certain cases.
2
14
107
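(A hedged toy illustration of the comparison the thread describes: an Adam-style diagonal preconditioner built from empirical squared gradients versus a damped Gauss-Newton step, on a synthetic noisy least-squares problem. This is not the paper's 2x2 study across basis choice and gradient noise; all names and hyperparameters are illustrative.)

```python
# Hedged sketch: Adam-style diagonal preconditioning vs. a Gauss-Newton step
# on a toy noisy linear least-squares problem.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true

def noisy_grad(w, batch=16):
    idx = rng.integers(0, n, size=batch)           # stochastic mini-batch
    r = X[idx] @ w - y[idx]
    return X[idx].T @ r / batch

def run_adam(steps=500, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    w = np.zeros(d); m = np.zeros(d); v = np.zeros(d)
    for t in range(1, steps + 1):
        g = noisy_grad(w)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g              # empirical squared gradients
        mhat, vhat = m / (1 - b1 ** t), v / (1 - b2 ** t)
        w -= lr * mhat / (np.sqrt(vhat) + eps)     # diagonal preconditioning
    return np.linalg.norm(w - w_true)

def run_gauss_newton(steps=500, lr=0.5, damping=1e-3):
    w = np.zeros(d)
    H = X.T @ X / n + damping * np.eye(d)          # damped Gauss-Newton curvature
    H_inv = np.linalg.inv(H)
    for _ in range(steps):
        w -= lr * H_inv @ noisy_grad(w)            # full curvature preconditioning
    return np.linalg.norm(w - w_true)

print("Adam         final error:", run_adam())
print("Gauss-Newton final error:", run_gauss_newton())
```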
Great to see our algorithmic work validated at scale! CISPO started as a stability fix during our MiniMax-01 training, an answer to spiky gradients and train-inference discrepancies. Seeing it become a core component of ScaleRL in The Art of Scaling Reinforcement Learning
Meta just dropped this paper that spills the secret sauce of reinforcement learning (RL) on LLMs. It lays out an RL recipe, uses 400,000 GPU hrs and posits a scaling law for performance with more compute in RL, like the classic pretraining scaling laws. Must read for AI nerds.
3
8
88
(1/9) Diagonal preconditioners such as Adam typically use empirical gradient information rather than true second-order curvature. Is this merely a computational compromise or can it be advantageous? Our work confirms the latter: Adam can outperform Gauss-Newton in certain cases.
2
18
129
Huge thanks to Devvrit Khatri for coming on the Delta Podcast! Check out the podcast episode here: https://t.co/wmsDjqFbPn
2
2
7
This work provides many deep insights into scaling RL for LLMs! Congratulations @Devvrit_Khatri @louvishh and all the coauthors. Also amazing to see so many close friends and collaborators, including four of our former predocs/RFs, write this nice paper.
Want to build scaling laws for RL but not sure how to scale? Or what scales? Or whether RL would even scale predictably? We introduce: The Art of Scaling Reinforcement Learning Compute for LLMs
0
4
56
*checks chatgpt* This paper cost ~4.2 million USD (400K GB200 hours) -- science! Our most expensive run was 100K GPU-hours (the same amount as DeepSeek-R1-Zero, but on GB200s). One finding here was that once we have a scalable RL algorithm, RL compute scaling becomes predictable.
Want to build scaling laws for RL but not sure how to scale? Or what scales? Or whether RL would even scale predictably? We introduce: The Art of Scaling Reinforcement Learning Compute for LLMs
19
73
835
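(For reference, the implied unit price follows directly from the two numbers in the tweet above; neither the hourly rate nor the billing model is stated in the thread.)

$$\frac{\$4{,}200{,}000}{400{,}000\ \text{GB200-hours}} \approx \$10.5\ \text{per GB200-hour}$$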
New Paper: The Art of Scaling Reinforcement Learning Compute for LLMs. We burnt a lot of GPU-hours to provide the community with the first open, large-scale systematic study on RL scaling for LLMs. https://t.co/49REQZ4R6G
Want to build scaling laws for RL but not sure how to scale? Or what scales? Or whether RL would even scale predictably? We introduce: The Art of Scaling Reinforcement Learning Compute for LLMs
2
15
67
Sneak peek from a paper about scaling RL compute for LLMs: probably the most compute-expensive paper I've worked on, but hoping that others can run experiments cheaply for the science of scaling RL. Coincidentally, this is similar motivation to what we had for the NeurIPS best
11
37
417
Feel like I'm taking crazy pills. We are just back at step one. Don't store KV cache, just recompute it.
Can we break the memory wall for LLM inference via KV cache rematerialization? Introducing XQuant, which leverages underutilized compute units to eliminate the memory bottleneck for LLM inference! • 10-12.5x memory savings vs. FP16 • Near-zero accuracy loss • Beats
29
23
540
Can we break the memory wall for LLM inference via KV cache rematerialization? Introducing XQuant, which leverages underutilized compute units to eliminate the memory bottleneck for LLM inference! • 10-12.5x memory savings vs. FP16 • Near-zero accuracy loss • Beats
26
92
667
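(A minimal sketch of the general "recompute instead of store" idea described in the tweets above: cache the quantized layer input activations X and rematerialize K and V through the projection matrices at decode time. The quantizer, shapes, and names are illustrative assumptions, not XQuant's actual implementation.)

```python
# Hedged sketch of KV rematerialization: store quantized layer inputs X
# instead of K and V, then recompute K = X @ W_k and V = X @ W_v on the fly.
import numpy as np

def quantize_int8(x):
    scale = np.abs(x).max() / 127.0 + 1e-12
    return np.round(x / scale).astype(np.int8), scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

d_model, d_head, seq_len = 64, 64, 128
rng = np.random.default_rng(0)
W_k = rng.normal(size=(d_model, d_head)).astype(np.float32) / np.sqrt(d_model)
W_v = rng.normal(size=(d_model, d_head)).astype(np.float32) / np.sqrt(d_model)
X = rng.normal(size=(seq_len, d_model)).astype(np.float32)   # layer inputs

# Standard path: store K and V in full precision (two tensors per layer).
K_ref, V_ref = X @ W_k, X @ W_v

# Rematerialization path: store only quantized X (one tensor), recompute K, V.
X_q, s = quantize_int8(X)
X_hat = dequantize(X_q, s)
K_rem, V_rem = X_hat @ W_k, X_hat @ W_v            # extra FLOPs, less memory traffic

print("memory ratio (int8 X vs fp32 K,V):", X_q.nbytes / (K_ref.nbytes + V_ref.nbytes))
print("max |K error|:", np.abs(K_rem - K_ref).max())
```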
Heading to @COLM_conf in Montreal? So is @WiMLworkshop! We are organizing our first-ever event at #CoLM2025 and we want you to choose the format! What excites you the most? Have a different idea? Let us know in the replies! RT to spread the word!
1
9
37
How does prompt optimization compare to RL algos like GRPO? GRPO needs 1000s of rollouts, but humans can learn from a few trials, by reflecting on what worked & what didn't. Meet GEPA: a reflective prompt optimizer that can outperform GRPO by up to 20% with 35x fewer rollouts!
46
170
1K
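(A hedged sketch of a generic reflective prompt-optimization loop in the spirit the tweet describes: run a few rollouts, have an LLM reflect on the scored failures, and keep a proposed prompt edit only if it helps. `llm` and `score` are hypothetical callables supplied by the caller; this is not GEPA's actual algorithm.)

```python
# Hedged sketch of reflective prompt optimization (not GEPA's algorithm).
from typing import Callable, List, Tuple

def reflective_prompt_search(
    llm: Callable[[str], str],            # prompt -> model output (hypothetical)
    score: Callable[[str, str], float],   # (task, output) -> reward (hypothetical)
    tasks: List[str],
    seed_prompt: str,
    rounds: int = 5,
) -> Tuple[str, float]:
    best_prompt, best_avg = seed_prompt, -1.0
    for _ in range(rounds):
        # 1) A handful of rollouts with the current best prompt.
        traces = []
        for task in tasks:
            out = llm(best_prompt + "\n\nTask: " + task)
            traces.append((task, out, score(task, out)))
        avg = sum(r for _, _, r in traces) / len(traces)
        best_avg = max(best_avg, avg)
        # 2) Reflection: ask the LLM what went wrong and for a revised prompt.
        feedback = "\n".join(f"task: {t}\noutput: {o}\nreward: {r}" for t, o, r in traces)
        candidate = llm(
            "Here is a prompt and a few scored rollouts.\n"
            f"PROMPT:\n{best_prompt}\n\nROLLOUTS:\n{feedback}\n\n"
            "Reflect on the failures and return an improved prompt only."
        )
        # 3) Keep the candidate only if it scores better on the same tasks.
        cand_avg = sum(score(t, llm(candidate + "\n\nTask: " + t)) for t in tasks) / len(tasks)
        if cand_avg > best_avg:
            best_prompt, best_avg = candidate, cand_avg
    return best_prompt, best_avg
```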
Meows, music, murmurs and more! We train a general purpose audio encoder and open source the code, checkpoints and evaluation toolkit.
Shikhar Bharadwaj, Samuele Cornell, Kwanghee Choi, Satoru Fukayama, Hye-jin Shim, Soham Deshmukh, Shinji Watanabe, "OpenBEATs: A Fully Open-Source General-Purpose Audio Encoder,"
0
15
35
If you are at ICML this year, make sure to catch @rish2k1 at the Efficient Systems for Foundation Models Workshop at east exhibition hall A to learn more about our work on accelerating test-time scaling methods to achieve better latency/accuracy tradeoffs!
At es-fomo workshop, talk to @rish2k1 about scaling test-time compute as a function of user-facing latency (instead of FLOPS)
0
3
21
At es-fomo workshop, talk to @rish2k1 about scaling test-time compute as a function of user-facing latency (instead of FLOPS)
[Sat Jul 19] @Nived_Rajaraman & @rish2k1 present work on improving accuracy-latency tradeoffs for test-time scaling. @gh_aminian presents work showing that a smoothed version of best-of-n improves reward-vs-KL tradeoffs when a low-quality proxy reward is used.
1
3
24
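(For context on the reward-vs-KL tradeoff mentioned above: the commonly quoted analytical estimate for plain best-of-n, not the smoothed variant presented in the talk, is shown below; it grows only logarithmically in n.)

$$\mathrm{KL}\big(\pi_{\text{BoN}} \,\|\, \pi_{\text{base}}\big) \approx \log n - \frac{n-1}{n}$$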
Come check out our poster at #ICML2025! QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache. East Exhibition Hall A-B, #E-2608. Poster Session 5 | Thu, Jul 17 | 11:00 AM-1:30 PM. TLDR: Use a quantized version of the same model as its own draft
Fast and accurate Speculative Decoding for Long Context? Problem: • Standard speculative decoding struggles with long-context generation, as current draft models are pretty weak for long context • Finding the right draft model is tricky, as compatibility varies across
0
8
37
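(A hedged sketch of the self-speculative idea in the TLDR above: the "draft" is the same model run with a quantized KV cache, and the full-precision path verifies the drafted tokens. Greedy acceptance only, and the verifier is unrolled one position at a time for clarity; `model` is a hypothetical callable, and this is not QuantSpec's hierarchical scheme.)

```python
# Hedged sketch of self-speculative decoding with a quantized-KV draft path.
from typing import Callable, List
import numpy as np

def self_speculative_decode(
    model: Callable[[List[int], bool], np.ndarray],  # (tokens, use_quantized_kv) -> logits
    prompt: List[int],
    max_new: int = 64,
    k: int = 4,                                      # draft length per round
) -> List[int]:
    tokens = list(prompt)
    while len(tokens) < len(prompt) + max_new:
        # 1) Draft k tokens cheaply with the quantized-KV path.
        draft = []
        for _ in range(k):
            logits = model(tokens + draft, True)
            draft.append(int(np.argmax(logits)))
        # 2) Verify with the full-precision path; accept the matching prefix.
        #    (A real implementation verifies all k positions in one batched
        #    forward pass; it is unrolled here for readability.)
        accepted = 0
        for i in range(k):
            logits = model(tokens + draft[:i], False)
            target = int(np.argmax(logits))
            if target == draft[i]:
                accepted += 1
            else:
                draft[i] = target          # take the verifier's token and stop
                accepted += 1
                break
        tokens.extend(draft[:accepted])
    return tokens
```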
Really interesting work by my friend @harman26singh on making reward models more robust, effectively reducing reliance on spurious attributes
New @GoogleDeepMind paper: Robust Reward Modeling via Causal Rubrics https://t.co/oCk5jGNYlj We tackle reward hacking, when RMs latch onto spurious cues (e.g. length, style) instead of true quality. #RLAIF #CausalInference
0
1
7
If you had 15min to tell thousands of Berkeley CS/Data/Stats grads what to do with their lives, what would you say? Last Thursday I told them to RUN AT FAILURE. Afterwards, while we were shaking hands & taking selfies, hundreds of them told me that they are excited to go fail. I
18
32
270
Excited to share that our paper QuantSpec has been accepted to #ICML2025! Huge thanks to my collaborators! Paper:
arxiv.org
Large Language Models (LLMs) are increasingly being deployed on edge devices for long-context settings, creating a growing need for fast and efficient long-context inference. In these scenarios,...
Fast and accurate Speculative Decoding for Long Context? Problem: • Standard speculative decoding struggles with long-context generation, as current draft models are pretty weak for long context • Finding the right draft model is tricky, as compatibility varies across
0
6
41