Aviral Kumar Profile
Aviral Kumar

@aviral_kumar2

Followers: 4K
Following: 912
Media: 195
Statuses: 344

Assistant Professor of CS & ML at @CarnegieMellon. Part-time Research Scientist at Google. PhD from UC Berkeley.

Pittsburgh, PA
Joined May 2016
@aviral_kumar2
Aviral Kumar
4 days
Check out these awesome new real-robot online RL fine-tuning results that @andy_peng05 and @zhiyuan_zhou_ got with our WSRL method. WSRL appeared at ICLR earlier this year -- check this out for more details: 👇
@zhiyuan_zhou_
Paul Zhou
4 days
We tested WSRL (Warm-start RL) on a Franka robot, and it leads to really efficient online RL fine-tuning in the real world! WSRL learned the peg insertion task perfectly with only 11 minutes of warmup and *7 minutes* of online RL interactions 👇🧵
0
4
49
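The two-phase recipe in the tweet above (a short warmup, then online RL) can be sketched roughly as below. This is a minimal illustration of warm-starting online RL from a pre-trained agent, not the released WSRL code; the agent/env interfaces, step counts, and update-to-data ratio are all assumed placeholders.

# Minimal sketch of warm-starting online RL from a pre-trained agent (illustration
# only; not the released WSRL implementation). `env`, `agent`, and `buffer` are
# assumed placeholders with a generic off-policy RL interface.

def warm_start_rl_finetune(env, agent, buffer,
                           warmup_steps=5_000,    # assumed: rollouts collected before any updates
                           online_steps=50_000,   # assumed: online fine-tuning budget
                           updates_per_step=4):   # assumed: update-to-data ratio
    obs = env.reset()

    # Phase 1 (warmup): roll out the pre-trained policy to seed the online buffer,
    # rather than retaining the offline dataset.
    for _ in range(warmup_steps):
        action = agent.act(obs)
        next_obs, reward, done, info = env.step(action)
        buffer.add(obs, action, reward, next_obs, done)
        obs = env.reset() if done else next_obs

    # Phase 2 (online RL): keep collecting data and update the agent on the buffer.
    for _ in range(online_steps):
        action = agent.act(obs)
        next_obs, reward, done, info = env.step(action)
        buffer.add(obs, action, reward, next_obs, done)
        obs = env.reset() if done else next_obs
        for _ in range(updates_per_step):
            agent.update(buffer.sample(batch_size=256))

    return agent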
@aviral_kumar2
Aviral Kumar
13 days
RT @setlur_amrith: Since R1 there has been a lot of chatter 💬 on post-training LLMs with RL. Is RL only sharpening the distribution over co….
0
24
0
@aviral_kumar2
Aviral Kumar
13 days
If you are interested in learning more about some of the work discussed in the post, check out:
- e3: curricula for structured exploration & discovery (best <2B model on AIME/HMMT)
- dense rewards for reasoning
- PAVs
0
0
7
@aviral_kumar2
Aviral Kumar
13 days
Given the confusion around what RL does for reasoning in LLMs, @setlur_amrith & I wrote a new blog post on when RL simply sharpens the base model & when it discovers new reasoning strategies. Learn how to measure discovery + methods to enable it ⬇️.
4
37
268
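One common proxy for the sharpening-vs-discovery question (the blog post may use different or additional metrics; this is only an illustrative measurement) is to compare pass@k of the base and RL-trained models at large k: if RL merely sharpens, the base model's pass@k tends to catch up as k grows, whereas discovery shows up as problems the RL model solves that the base model never solves across many samples. A sketch, with the data format and function names assumed:

import numpy as np

def pass_at_k(n, c, k):
    # Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k),
    # where n = samples drawn per problem and c = number of correct samples.
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

def sharpening_vs_discovery(correct_base, correct_rl, n, ks=(1, 4, 16, 64, 256)):
    # correct_base / correct_rl: per-problem counts of correct samples out of n draws
    # for the base model and the RL-trained model. If the RL curve stays above the
    # base curve even at large k, that is evidence of discovery rather than pure
    # sharpening of what the base model already samples.
    return {
        k: (float(np.mean([pass_at_k(n, c, k) for c in correct_base])),
            float(np.mean([pass_at_k(n, c, k) for c in correct_rl])))
        for k in ks
    }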
@aviral_kumar2
Aviral Kumar
24 days
And this work would not have been possible without the new CMU FLAME cluster, with 256 H100 GPUs!
@gneubig
Graham Neubig
1 month
I'd like to announce that the CMU FLAME center has a new cluster! It is 256 H100 GPUs, which we'll use to perform larger experiments, build more useful artifacts, and continue our tradition of open research. Expect to see more like this in the future 👇
0
0
7
@aviral_kumar2
Aviral Kumar
24 days
RT @matthewyryang: 🚨 NEW PAPER: What if LLMs could tackle harder problems - not by explicitly training on longer traces, but by learning ho….
0
2
0
@aviral_kumar2
Aviral Kumar
24 days
RT @setlur_amrith: Introducing e3 🔥 Best <2B model on math 💪.Are LLMs implementing algos ⚒️ OR is thinking an illusion 🎩.? Is RL only sharp….
0
20
0
@aviral_kumar2
Aviral Kumar
24 days
Check out @setlur_amrith's post for more details, including a discussion of how in-context exploration differs from work claiming that RL only "sharpens" around the base model's capabilities (largely a consequence of the data and budgets used for training).
@setlur_amrith
Amrith Setlur
24 days
Introducing e3 🔥 Best <2B model on math 💪. Are LLMs implementing algos ⚒️ OR is thinking an illusion 🎩? Is RL only sharpening the base LLM distrib. 🤔 OR discovering novel strategies outside the base LLM 💡? We answer these ⤵️ 🚨🚨
1
0
4
@aviral_kumar2
Aviral Kumar
24 days
This work builds on prior work & insights from my students at CMU on test-time compute:
1. RL vs SFT
2. Blog
3. Dense rewards
1
0
5
@aviral_kumar2
Aviral Kumar
24 days
This was a very fun collab, led by @setlur_amrith & @matthewyryang w/ @ianwu97 @sea_snell @JeremyGreerOumi @gingsmith and @max_simchowitz. I learned a lot!
Website:
Paper:
Code, training data, and ckpts are all released.
1
0
7
@aviral_kumar2
Aviral Kumar
24 days
With these three ingredients, we amplify in-context exploration and get SOTA results under 2B! The e3-1.7B model does better than 7B (OpenThinker) & even some 32B (s1-32B) models, without any explicit prompting to use more compute! More results in the paper, w/ didactic tasks.
1
0
5
@aviral_kumar2
Aviral Kumar
24 days
Doing so is critical, since training on very long budgets is super hard for RL from an optimization / variance perspective. And training on small budgets curbs useful "in-context exploration" and chaining. We also prescribe a rule of thumb for choosing the "right" budgets given…
1
0
3
@aviral_kumar2
Aviral Kumar
24 days
Ingredient 3: A "coupled" curriculum. It is important to keep RL in the "chaining" regime & doing so requires training on the right data, w/ the right max. token budget. We train on easy problems at a lower budget (8k) and then train on hard problems at a longer budget of 16k.
1
0
7
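A minimal sketch of a coupled data + budget curriculum in the spirit of the tweet above (easy problems at an 8k token budget, then hard problems at 16k). The stage lengths, dataset names, and training helpers are assumptions for illustration, not the released e3 training code.

# Illustrative two-stage "coupled" curriculum: data difficulty and the max token
# budget are raised together. `rl_update`, `sample_batch`, and the dataset names
# are assumed placeholders, not the actual e3 training code.
CURRICULUM = [
    # (problem split,   max_tokens, rl_steps)
    ("easy_problems",   8_192,      2_000),   # stage 1: easy data, shorter budget
    ("hard_problems",   16_384,     2_000),   # stage 2: hard data, longer budget
]

def train_with_coupled_curriculum(policy, datasets, rl_update, sample_batch):
    for split, max_tokens, steps in CURRICULUM:
        problems = datasets[split]
        for _ in range(steps):
            batch = sample_batch(problems)
            # Rollouts are truncated at this stage's token budget, keeping RL in the
            # "chaining" regime before the budget (and difficulty) is increased.
            policy = rl_update(policy, batch, max_new_tokens=max_tokens)
    return policy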
@aviral_kumar2
Aviral Kumar
24 days
Ingredient 2: Negative gradient. There's a debate b/w RL & SFT, but we show that slightly off-policy RL w/ neg. grad learns to chain these skills to use more test compute in a better way. Chaining ➡️ length 📈, verifications 📈 ➡️ perf 📈. Related to stitching in offline RL…
1
0
8
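To make the negative-gradient distinction concrete: SFT on filtered (correct-only) traces only raises the probability of good traces, while a policy-gradient objective with a baseline also pushes probability down on incorrect traces. A toy sketch of the two losses (this is not the exact e3/RL objective; the names and the mean baseline are illustrative):

import torch

def policy_gradient_loss(logprobs, rewards):
    # logprobs: summed token log-probs of each sampled trace under the policy.
    # rewards: 1.0 if the trace reaches a correct final answer, else 0.0.
    # Subtracting the mean gives incorrect traces a negative advantage, so the
    # update actively lowers their probability (the "negative gradient"), instead
    # of only raising the probability of correct traces.
    advantages = rewards - rewards.mean()
    return -(advantages.detach() * logprobs).mean()

def filtered_sft_loss(logprobs, rewards):
    # SFT-on-correct-only baseline for contrast: incorrect traces contribute nothing,
    # so nothing ever pushes their probability down.
    mask = (rewards > 0).float()
    return -(mask * logprobs).sum() / mask.sum().clamp(min=1.0)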
@aviral_kumar2
Aviral Kumar
24 days
To get LLMs to extrapolate, we had 3 ingredients.
Ingredient 1: Asymmetries in the base model. The base model needs to have asymmetric competence in different "skills" (generation/verification/etc.) s.t. chaining more skills ➡️ 📈 perf. If so, RL can discover chaining strategies.
2
0
6
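A toy way to see why asymmetric competence makes chaining pay off (all numbers invented for illustration): if the model generates a correct answer with probability p but verifies candidates with higher reliability q > p, then chaining generate-verify-retry attempts converts extra test-time compute into accuracy. This simplified sketch ignores verifier false-accepts:

# Toy model of chaining asymmetric skills (all probabilities invented; verifier
# false-accepts are ignored for simplicity). Generation is weak (p) but
# verification is stronger (q), so "generate, verify, retry" turns extra
# test-time compute into higher success rates.
def chained_success(p=0.3, q=0.9, attempts=4):
    fail_all = 1.0
    for _ in range(attempts):
        # one attempt succeeds if a correct answer is generated AND recognized
        fail_all *= 1.0 - p * q
    return 1.0 - fail_all

for k in (1, 2, 4, 8):
    print(k, "attempts ->", round(chained_success(attempts=k), 3))
# longer chains -> higher success, i.e. more compute is translated into performance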
@aviral_kumar2
Aviral Kumar
24 days
Now, let's dive into details! Problem: we wanted to build a model that could extrapolate test-time compute, meaning that it could translate more compute into better in-context search to improve performance. This is the real test of how effectively models can learn to reason, I…
1
0
6
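Extrapolation of test-time compute can be checked directly by evaluating at token budgets beyond the maximum budget seen in training and seeing whether accuracy keeps improving. A minimal evaluation sketch; the budgets and helper functions are assumptions, not a specific library API:

# Sketch of an extrapolation check: accuracy as a function of the inference-time
# token budget, including budgets beyond the training cap. `generate` and
# `is_correct` are assumed helpers.
TRAIN_BUDGET = 16_384
EVAL_BUDGETS = [8_192, 16_384, 24_576, 32_768]   # the last two extrapolate past training

def accuracy_vs_budget(model, problems, generate, is_correct):
    results = {}
    for budget in EVAL_BUDGETS:
        correct = sum(
            is_correct(problem, generate(model, problem, max_new_tokens=budget))
            for problem in problems
        )
        results[budget] = correct / len(problems)
    # accuracy that keeps rising for budgets > TRAIN_BUDGET indicates extrapolation
    return results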
@aviral_kumar2
Aviral Kumar
24 days
Full results below. Key ingredients of e3:
🔑 Use a base model with asymmetric competence in generation/verification
🔑 Use off-policy RL so that the negative grad can learn to "chain" these skills, grow length, increase perf
🔑 Use a coupled curriculum over data + budget to control…
1
0
7
@aviral_kumar2
Aviral Kumar
24 days
Our view on test-time scaling has been to train models to discover algos that enable them to solve harder problems. @setlur_amrith & @matthewyryang's new work e3 shows how RL done with this view produces the best <2B LLM on math, one that extrapolates beyond its training budget. 🧵⬇️
2
28
182
@aviral_kumar2
Aviral Kumar
25 days
Also check out our past work doing RL with agents:
DigiRL:
Digi-Q:
TTI extends this line of work along the axis of scaling interaction to enable rich behaviors, building on very similar machinery from these prior works.
0
1
3
@aviral_kumar2
Aviral Kumar
25 days
This was a fun collab led by @JunhongShen1 @jackbai_jkb, w/ @LunjunZhang @YifeiZhou02 @setlur_amrith @atalwalkar and many others!
Paper:
Website:
Code:
Please reach out if you have feedback!
1
0
4