Aviral Kumar Profile
Aviral Kumar

@aviral_kumar2

Followers: 4K
Following: 912
Media: 195
Statuses: 344

Assistant Professor of CS & ML at @CarnegieMellon. Part-time Research Scientist at Google. PhD from UC Berkeley.

Pittsburgh, PA
Joined May 2016
@aviral_kumar2
Aviral Kumar
4 days
Check out these awesome new real-robot online RL fine-tuning results that @andy_peng05 and @zhiyuan_zhou_ got with our WSRL method. WSRL appeared at ICLR earlier this year -- check this out for more details: 👇
@zhiyuan_zhou_
Paul Zhou
4 days
We tested WSRL (Warm-start RL) on a Franka robot, and it leads to really efficient online RL fine-tuning in the real world! WSRL learned the peg insertion task perfectly with only 11 minutes of warmup and *7 minutes* of online RL interactions 👇🧵
0
4
49
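The two-phase recipe in the tweet above (a short warmup, then online RL) can be sketched roughly as below. This is a minimal illustration of warm-starting online RL from a pre-trained agent, not the released WSRL code; the agent/env interfaces, step counts, and update-to-data ratio are all assumed placeholders.

# Minimal sketch of warm-starting online RL from a pre-trained agent (illustration
# only; not the released WSRL implementation). `env`, `agent`, and `buffer` are
# assumed placeholders with a generic off-policy RL interface.

def warm_start_rl_finetune(env, agent, buffer,
                           warmup_steps=5_000,    # assumed: rollouts collected before any updates
                           online_steps=50_000,   # assumed: online fine-tuning budget
                           updates_per_step=4):   # assumed: update-to-data ratio
    obs = env.reset()

    # Phase 1 (warmup): roll out the pre-trained policy to seed the online buffer,
    # rather than retaining the offline dataset.
    for _ in range(warmup_steps):
        action = agent.act(obs)
        next_obs, reward, done, info = env.step(action)
        buffer.add(obs, action, reward, next_obs, done)
        obs = env.reset() if done else next_obs

    # Phase 2 (online RL): keep collecting data and update the agent on the buffer.
    for _ in range(online_steps):
        action = agent.act(obs)
        next_obs, reward, done, info = env.step(action)
        buffer.add(obs, action, reward, next_obs, done)
        obs = env.reset() if done else next_obs
        for _ in range(updates_per_step):
            agent.update(buffer.sample(batch_size=256))

    return agent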
@aviral_kumar2
Aviral Kumar
13 days
RT @setlur_amrith: Since R1 there has been a lot of chatter 💬 on post-training LLMs with RL. Is RL only sharpening the distribution over co….
0
24
0
@aviral_kumar2
Aviral Kumar
13 days
If you are interested in learning more about some of the work discussed in the post, check out:
- e3: curricula for structured exploration & discovery (best <2B model on AIME/HMMT)
- dense rewards for reasoning
- PAVs
0
0
7
@aviral_kumar2
Aviral Kumar
13 days
Given the confusion around what RL does for reasoning in LLMs, @setlur_amrith & I wrote a new blog post on when RL simply sharpens the base model & when it discovers new reasoning strategies. Learn how to measure discovery + methods to enable it ⬇️.
4
37
268
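One common proxy for the sharpening-vs-discovery question (the blog post may use different or additional metrics; this is only an illustrative measurement) is to compare pass@k of the base and RL-trained models at large k: if RL merely sharpens, the base model's pass@k tends to catch up as k grows, whereas discovery shows up as problems the RL model solves that the base model never solves across many samples. A sketch, with the data format and function names assumed:

import numpy as np

def pass_at_k(n, c, k):
    # Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k),
    # where n = samples drawn per problem and c = number of correct samples.
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

def sharpening_vs_discovery(correct_base, correct_rl, n, ks=(1, 4, 16, 64, 256)):
    # correct_base / correct_rl: per-problem counts of correct samples out of n draws
    # for the base model and the RL-trained model. If the RL curve stays above the
    # base curve even at large k, that is evidence of discovery rather than pure
    # sharpening of what the base model already samples.
    return {
        k: (float(np.mean([pass_at_k(n, c, k) for c in correct_base])),
            float(np.mean([pass_at_k(n, c, k) for c in correct_rl])))
        for k in ks
    }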
@aviral_kumar2
Aviral Kumar
24 days
And this work would not have been possible without the new CMU FLAME cluster, with 256 H100 GPUs!
@gneubig
Graham Neubig
1 month
I'd like to announce that the CMU FLAME center has a new cluster! It is 256 H100 GPUs, which we'll use to perform larger experiments, build more useful artifacts, and continue our tradition of open research. Expect to see more like this in the future 👇
0
0
7
@aviral_kumar2
Aviral Kumar
24 days
RT @matthewyryang: 🚨 NEW PAPER: What if LLMs could tackle harder problems - not by explicitly training on longer traces, but by learning ho….
0
2
0
@aviral_kumar2
Aviral Kumar
24 days
RT @setlur_amrith: Introducing e3 🔥 Best <2B model on math 💪.Are LLMs implementing algos ⚒️ OR is thinking an illusion 🎩.? Is RL only sharp….
0
20
0
@aviral_kumar2
Aviral Kumar
24 days
Check out @setlur_amrith's post for more details, including a discussion of how in-context exploration differs from work claiming that RL only "sharpens" around the base model's capabilities (largely a consequence of the data and budgets used for training).
@setlur_amrith
Amrith Setlur
24 days
Introducing e3 🔥 Best <2B model on math 💪. Are LLMs implementing algos ⚒️ OR is thinking an illusion 🎩? Is RL only sharpening the base LLM distrib. 🤔 OR discovering novel strategies outside the base LLM 💡? We answer these ⤵️ 🚨🚨
1
0
4
@aviral_kumar2
Aviral Kumar
24 days
This work builds on prior work & insights from my students at CMU on test-time compute:
1. RL vs SFT
2. Blog
3. Dense rewards
1
0
5
@aviral_kumar2
Aviral Kumar
24 days
This was a very fun collab, led by @setlur_amrith & @matthewyryang w/ @ianwu97 @sea_snell @JeremyGreerOumi @gingsmith and @max_simchowitz. I learned a lot!
Website:
Paper:
Code, training data, and ckpts are all released.
1
0
7
@aviral_kumar2
Aviral Kumar
24 days
With these three ingredients, we amplify in-context exploration and get SOTA results under 2B! The e3-1.7B model does better than 7B (OpenThinker) & even some 32B (s1-32B) models, without any explicit prompting to use more compute! More results in the paper, w/ didactic tasks.
1
0
5
@aviral_kumar2
Aviral Kumar
24 days
Doing so is critical, since training on very long budgets is super hard for RL from an optimization / variance perspective. And training on small budgets curbs useful "in-context exploration" and chaining. We also prescribe a rule of thumb for choosing the "right" budgets given…
1
0
3
@aviral_kumar2
Aviral Kumar
24 days
Ingredient 3: A "coupled" curriculum. It is important to keep RL in the "chaining" regime & doing so requires training on the right data, w/ the right max. token budget. We train on easy problems at a lower budget (8k) and then train on hard problems at a longer budget of 16k.
1
0
7
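A minimal sketch of a coupled data + budget curriculum in the spirit of the tweet above (easy problems at an 8k token budget, then hard problems at 16k). The stage lengths, dataset names, and training helpers are assumptions for illustration, not the released e3 training code.

# Illustrative two-stage "coupled" curriculum: data difficulty and the max token
# budget are raised together. `rl_update`, `sample_batch`, and the dataset names
# are assumed placeholders, not the actual e3 training code.
CURRICULUM = [
    # (problem split,   max_tokens, rl_steps)
    ("easy_problems",   8_192,      2_000),   # stage 1: easy data, shorter budget
    ("hard_problems",   16_384,     2_000),   # stage 2: hard data, longer budget
]

def train_with_coupled_curriculum(policy, datasets, rl_update, sample_batch):
    for split, max_tokens, steps in CURRICULUM:
        problems = datasets[split]
        for _ in range(steps):
            batch = sample_batch(problems)
            # Rollouts are truncated at this stage's token budget, keeping RL in the
            # "chaining" regime before the budget (and difficulty) is increased.
            policy = rl_update(policy, batch, max_new_tokens=max_tokens)
    return policy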
@aviral_kumar2
Aviral Kumar
24 days
Ingredient 2: Negative gradient. There's a debate b/w RL & SFT, but we show that slightly off-policy RL w/ neg. grad learns to chain these skills to use more test compute in a better way. Chaining ➡️ length 📈, verifications 📈 ➡️ perf 📈. Related to stitching in offline RL…
1
0
8
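To make the negative-gradient distinction concrete: SFT on filtered (correct-only) traces only raises the probability of good traces, while a policy-gradient objective with a baseline also pushes probability down on incorrect traces. A toy sketch of the two losses (this is not the exact e3/RL objective; the names and the mean baseline are illustrative):

import torch

def policy_gradient_loss(logprobs, rewards):
    # logprobs: summed token log-probs of each sampled trace under the policy.
    # rewards: 1.0 if the trace reaches a correct final answer, else 0.0.
    # Subtracting the mean gives incorrect traces a negative advantage, so the
    # update actively lowers their probability (the "negative gradient"), instead
    # of only raising the probability of correct traces.
    advantages = rewards - rewards.mean()
    return -(advantages.detach() * logprobs).mean()

def filtered_sft_loss(logprobs, rewards):
    # SFT-on-correct-only baseline for contrast: incorrect traces contribute nothing,
    # so nothing ever pushes their probability down.
    mask = (rewards > 0).float()
    return -(mask * logprobs).sum() / mask.sum().clamp(min=1.0)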
@aviral_kumar2
Aviral Kumar
24 days
To get LLMs to extrapolate, we had 3 ingredients.
Ingredient 1: Asymmetries in the base model. The base model needs to have asymmetric competence in different "skills" (generation/verification/etc.) s.t. chaining more skills ➡️ 📈 perf. If so, RL can discover chaining strategies.
2
0
6
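A toy way to see why asymmetric competence makes chaining pay off (all numbers invented for illustration): if the model generates a correct answer with probability p but verifies candidates with higher reliability q > p, then chaining generate-verify-retry attempts converts extra test-time compute into accuracy. This simplified sketch ignores verifier false-accepts:

# Toy model of chaining asymmetric skills (all probabilities invented; verifier
# false-accepts are ignored for simplicity). Generation is weak (p) but
# verification is stronger (q), so "generate, verify, retry" turns extra
# test-time compute into higher success rates.
def chained_success(p=0.3, q=0.9, attempts=4):
    fail_all = 1.0
    for _ in range(attempts):
        # one attempt succeeds if a correct answer is generated AND recognized
        fail_all *= 1.0 - p * q
    return 1.0 - fail_all

for k in (1, 2, 4, 8):
    print(k, "attempts ->", round(chained_success(attempts=k), 3))
# longer chains -> higher success, i.e. more compute is translated into performance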
@aviral_kumar2
Aviral Kumar
24 days
Now, let's dive into details! Problem: we wanted to build a model that could extrapolate test-time compute, meaning that it could translate more compute into better in-context search to improve performance. This is the real test of how effectively models can learn to reason, I…
1
0
6
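Extrapolation of test-time compute can be checked directly by evaluating at token budgets beyond the maximum budget seen in training and seeing whether accuracy keeps improving. A minimal evaluation sketch; the budgets and helper functions are assumptions, not a specific library API:

# Sketch of an extrapolation check: accuracy as a function of the inference-time
# token budget, including budgets beyond the training cap. `generate` and
# `is_correct` are assumed helpers.
TRAIN_BUDGET = 16_384
EVAL_BUDGETS = [8_192, 16_384, 24_576, 32_768]   # the last two extrapolate past training

def accuracy_vs_budget(model, problems, generate, is_correct):
    results = {}
    for budget in EVAL_BUDGETS:
        correct = sum(
            is_correct(problem, generate(model, problem, max_new_tokens=budget))
            for problem in problems
        )
        results[budget] = correct / len(problems)
    # accuracy that keeps rising for budgets > TRAIN_BUDGET indicates extrapolation
    return results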
@aviral_kumar2
Aviral Kumar
24 days
Full results below. Key ingredients of e3:
🔑 Use a base model with asymmetric competence in generation/verification
🔑 Use off-policy RL so that the negative grad can learn to "chain" these skills, grow length, increase perf
🔑 Use a coupled curriculum over data + budget to control…
1
0
7
@aviral_kumar2
Aviral Kumar
24 days
Our view on test-time scaling has been to train models to discover algos that enable them to solve harder problems. @setlur_amrith & @matthewyryang's new work e3 shows how RL done with this view produces the best <2B LLM on math, one that extrapolates beyond its training budget. 🧵⬇️
2
28
182
@aviral_kumar2
Aviral Kumar
25 days
Also check out our past work doing RL with agents:
DigiRL:
Digi-Q:
TTI extends this line of work along the axis of scaling interaction to enable rich behaviors, building on very similar machinery from these prior works.
0
1
3
@aviral_kumar2
Aviral Kumar
25 days
This was a fun collab led by @JunhongShen1 @jackbai_jkb, w/ @LunjunZhang @YifeiZhou02 @setlur_amrith @atalwalkar and many others!
Paper:
Website:
Code:
Please reach out if you have feedback!
1
0
4