Devvrit
@Devvrit_Khatri
Followers: 791 · Following: 2K · Media: 18 · Statuses: 177
Grad Student @UTCompSci, Meta. Large Scale ML - Scalability and Efficiency. Past: DeepMind.
Joined December 2019
Congrats to the Meta team on ScaleRL! Interesting to see it adopt a reasoning-length control mechanism similar to what we introduced in Elastic Reasoning, using forced interruptions to improve RL training stability. Exciting to see this idea validated at scale!
This is a great blog explaining the progress in scaling RL and our work. Pretty clear, intuitive, and captures the key takeaways (and limitations :)). Thanks, @natolambert!
My very positive review of the ScaleRL paper. Excited for more data and base-model work to be built around this (a Pythia-style OLMo suite???). For now, the key things to get off the ground with RL are: importance sampling, in-flight updates, and continuous batching.
Thanks, @deedydas, for sharing our work. By AI nerds, for AI nerds :)
Meta just dropped this paper that spills the secret sauce of reinforcement learning (RL) on LLMs. It lays out an RL recipe, uses 400,000 GPU hrs, and posits a scaling law for performance with more compute in RL, like the classic pretraining scaling laws. Must read for AI nerds.
Had an amazing time on the Delta Podcast about our recent Scaling RL work, future directions, and some fun broader conversation. Thanks for having me on :)
Huge thanks to Devvrit Khatri for coming on the Delta Podcast! Check out the podcast episode here: https://t.co/wmsDjqFbPn
Thanks so much :) Indeed, understanding and the science of doing RL has a long way to go :)
The cleanest RL scaling results I've seen so far🤯. Amazing to see how many valuable insights you can get when the premise is not necessarily to come up with a "new" method, but just to figure out what works (ofc while also being supercharged with 400K GPU hours). Congratssss
Thanks, @omarsar0, for the visibility. Pretty great concise summary and interpretation of the work :)
Banger paper from Meta and collaborators. This paper is one of the best deep dives yet on how reinforcement learning (RL) actually scales for LLMs. The team ran over 400,000 GPU hours of experiments to find a predictable scaling pattern and a stable recipe (ScaleRL) that
Even I am surprised that's how much we spent 😅 RL becoming predictable is an amazing insight. We now know how to compare two methods. And scaling across all these different axes shows that RL is indeed embracing the bitter lesson.
*checks chatgpt* This paper cost ~4.2 million USD (400K GB200 hours) -- science! Our most expensive run was a 100K GPU-hour one (the same amount as DeepSeek-R1-Zero, but on GB200s). One finding here was that once we have a scalable RL algorithm, RL compute scaling becomes predictable
🚨 New Paper: The Art of Scaling Reinforcement Learning Compute for LLMs 🚨 We burnt a lot of GPU-hours to provide the community with the first open, large-scale systematic study on RL scaling for LLMs. https://t.co/49REQZ4R6G
Wish to build scaling laws for RL but not sure how to scale? Or what scales? Or would RL even scale predictably? We introduce: The Art of Scaling Reinforcement Learning Compute for LLMs
Work done at Meta (thanks for the gb200s :p), with awesome collaborators including @louvishh, @rish2k1, @rach_it_, @dvsaisurya, Manzil Zaheer, @inderjit_ml, @brandfonbrener, and @agarwl_ Paper: https://t.co/okiL3xDHuO My blog Link (work in progress):
arxiv.org: Reinforcement learning (RL) has become central to training large language models (LLMs), yet the field lacks predictive scaling methodologies comparable to those established for pre-training…
Would a larger model with fewer steps, or a smaller model with more training steps, reach a given performance faster? Answering such questions (figure in the 1st tweet), we see "early sparks" of RL scaling laws.
Would “scaling” up along generation length/model size/batch-size give expected gains? Absolutely! And now we can analyze how exactly they improve the performance. For example, smaller bsz/gen len may seem better initially, but larger ones overtake eventually.
Common “tricks” mainly shift efficiency: loss aggregation, normalization, curriculum, etc. Large batch size, large generation length, loss type, off-policy setup, and train/inference kernel mismatch fixes are the most consequential.
Not all RL methods scale equally well. Some reach higher asymptotic performance than others. Methods that may look promising early on can be worse when extrapolating to a larger compute regime.
Framework: We fit sigmoidal curves to an iid validation set. Results? (1) We can now predict RL performance at larger scale. (2) We can analyze each algorithmic choice and how it affects the scaling.
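The sigmoidal-fit idea above can be sketched in a few lines. This is a minimal illustration assuming a generic saturating-sigmoid parameterization (A = asymptotic pass rate, B = scaling exponent, C_mid = compute midpoint) and made-up data points; the paper's exact functional form and numbers may differ.

```python
# Minimal sketch of fitting a sigmoidal compute-scaling curve and
# extrapolating to a larger budget. Parameterization and data are
# illustrative assumptions, not taken from the ScaleRL paper.
import numpy as np
from scipy.optimize import curve_fit

def sigmoid_scaling(C, A, B, C_mid):
    # A: asymptotic pass rate, B: scaling exponent,
    # C_mid: compute at which half the asymptote is reached.
    return A / (1.0 + (C_mid / C) ** B)

# Hypothetical measurements: validation pass rate vs. RL GPU-hours.
compute = np.array([1e2, 3e2, 1e3, 3e3, 1e4, 3e4])
pass_rate = np.array([0.22, 0.31, 0.42, 0.51, 0.57, 0.60])

params, _ = curve_fit(
    sigmoid_scaling, compute, pass_rate,
    p0=[0.7, 0.5, 1e3],
    bounds=([0.0, 0.0, 1.0], [1.0, 5.0, 1e6]),
)
A, B, C_mid = params

# Predict performance at a larger compute budget (100k GPU-hours).
predicted = sigmoid_scaling(1e5, A, B, C_mid)
```

Fitting once on small-scale runs and reading off `predicted` at a bigger budget is exactly the kind of extrapolation the thread describes as now being possible.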
We provide (a) a framework to fit such scaling curves. Using this, we analyze several design choices, and combine the best ones to form our recipe (b) ScaleRL. We demonstrate its effectiveness by predictably scaling to 100k GPU-hours.
How do we understand the contribution of several design choices in an RL algorithm? Do they make the algorithm efficient? Or do they elevate the asymptotic performance? To study the scaling behavior of each design choice, we need to fit a predictable scaling curve; this provides a common basis for separating efficiency gains from improvements in asymptotic performance.
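The efficiency-vs-asymptote distinction can be made concrete with a toy comparison. The sigmoid form and all parameter values below are hypothetical illustrations, not fits from the paper.

```python
# Toy illustration: two design choices, each summarized by fitted sigmoid
# parameters. Variant X is compute-efficient but has a lower ceiling;
# variant Y is slower to take off but has a higher asymptote.
# All numbers are made up for illustration.

def pass_rate(C, A, B, C_mid):
    # A: asymptotic pass rate, B: exponent, C_mid: compute midpoint.
    return A / (1.0 + (C_mid / C) ** B)

variant_x = dict(A=0.60, B=0.9, C_mid=500.0)    # efficient, lower ceiling
variant_y = dict(A=0.70, B=0.9, C_mid=3000.0)   # less efficient, higher ceiling

low_budget, high_budget = 300.0, 1e5  # GPU-hours

# X looks better at small scale...
x_low = pass_rate(low_budget, **variant_x)
y_low = pass_rate(low_budget, **variant_y)
# ...but Y overtakes once compute is scaled up.
x_high = pass_rate(high_budget, **variant_x)
y_high = pass_rate(high_budget, **variant_y)
```

This is the failure mode the thread warns about: ranking methods at small compute can invert at large compute, which is why the fitted asymptote and efficiency need to be read separately.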
Sneak peek from a paper about scaling RL compute for LLMs: probably the most compute-expensive paper I've worked on, but hoping that others can run experiments cheaply for the science of scaling RL. Coincidentally, this is similar motivation to what we had for the NeurIPS best
What happens when one of @GoogleDeepMind's top scientists sits down to unpack AI’s past, present & future? The full episode with @jainprateek_ is here. 🎙 Topics you can’t miss: 🔹 Deep learning → transformers → generative AI 🔹 India’s once-in-a-generation chance to lead in