Amrith Setlur

@setlur_amrith

Followers: 808
Following: 401
Media: 41
Statuses: 128

PhD student at CMU.

Pittsburgh, PA
Joined April 2020
@setlur_amrith
Amrith Setlur
13 days
Since R1 there has been a lot of chatter 💬 on post-training LLMs with RL. Is RL only sharpening the distribution over correct responses sampled by the pretrained LLM, OR is it exploring and discovering new strategies 🤔? Find answers in our latest post ⬇️
@setlur_amrith
Amrith Setlur
23 days
This work was an amazing collaboration 🤝 with an awesome set of co-authors ✨: @matthewyryang @ianwu97 @sea_snell @JeremyGreerOumi @gingsmith @max_simchowitz and @aviral_kumar2! Big thanks to @gneubig and @XiongChenyan for generous compute support 🙏.
@setlur_amrith
Amrith Setlur
23 days
Recent works on max-confidence training, sharpening, & spurious rewards argue RL only finds a mode already present in the base LLM, but we argue this holds only when training with poorly chosen token-length budgets for the data. If we optimize for extrapolation via chaining, we see the true promise of RL!
@setlur_amrith
Amrith Setlur
23 days
In summary, we have the power of three 🔌 in e3:
- Asymmetries in the base LLM
- Neg. gradients in RL
- Coupled task & budget curriculum
What do we get?
✅ In-context exploration
✅ Extrapolation of test compute
😉 Also a pun on the popular RL algorithm for exploration, E3 @mkearnsupenn.
@setlur_amrith
Amrith Setlur
23 days
e3 learns in-context search, going beyond the recent wave of work on RL + sharpening. It builds on our past work too:
1. RL (pos + neg grad) >> SFT (pos grad only)
2. Optimizing test-time compute is a meta-RL problem.
@setlur_amrith
Amrith Setlur
23 days
🎯 The e3 recipe enables RL to discover new solutions, not just distill the base model's pass@k policy into pass@1. e3 improves pass@32 over our base LLM and also over other models trained to optimize pass@k 😮 ⤵️
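For context, pass@k here is the standard metric: the probability that at least one of k sampled responses is correct. A minimal sketch of the usual unbiased estimator (generic, assuming the Chen et al. 2021 formulation; not code from the e3 release):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n sampled responses, c of which are correct."""
    if n - c < k:
        return 1.0  # every size-k subset of the n samples contains a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 32 samples per problem, 5 correct.
print(pass_at_k(n=32, c=5, k=1))   # ≈ 0.156 (pass@1)
print(pass_at_k(n=32, c=5, k=32))  # 1.0     (pass@32)
```

"Distilling pass@k into pass@1" then means RL raises pass@1 toward the base model's pass@k without improving pass@k itself; the claim above is that e3 instead raises pass@32 too.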
@setlur_amrith
Amrith Setlur
23 days
Results 🚀: Qwen3-1.7B finetuned with our e3 recipe outperforms many <2B open source reasoning models on AIME/HMMT '25, and also some 7B and 32B models (a,b) 💪. Our model naturally (no test-time prompting needed) extrapolates test-compute better than budget forcing in s1 (c).
@setlur_amrith
Amrith Setlur
23 days
Why is a coupled curriculum needed 🤔?
❌ Fixing a low token budget curtails length increase & discovery during "thinking".
❌ Fixing a high budget is harder to optimize (gradient variance in the policy gradient).
❌ Training on hard problems at a lower budget again kills the chaining of asymmetries.
@setlur_amrith
Amrith Setlur
23 days
#3 Coupled budget & task curriculum, where we jointly increase token budget and task difficulty to incentivize in-context exploration: first train on easy problems at a lower budget (8k), then train on harder problems at a longer budget of 16k tokens!
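A minimal sketch of what such a coupled curriculum could look like in training code; the 8k/16k budgets come from the description above, while `rl_finetune` and `load_problems` are hypothetical placeholders, not the actual e3 implementation:

```python
# Two coupled stages: task difficulty and token budget increase together.
STAGES = [
    {"difficulty": "easy", "max_response_tokens": 8_192},   # stage 1
    {"difficulty": "hard", "max_response_tokens": 16_384},  # stage 2
]

def run_coupled_curriculum(policy, rl_finetune, load_problems):
    """Train with RL stage by stage, raising difficulty and budget in lockstep.
    `rl_finetune` and `load_problems` are illustrative helpers, not e3 code."""
    for stage in STAGES:
        problems = load_problems(difficulty=stage["difficulty"])
        policy = rl_finetune(policy, problems,
                             max_response_tokens=stage["max_response_tokens"])
    return policy
```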
@setlur_amrith
Amrith Setlur
23 days
We explicitly mask neg. grads during GRPO and find that neg. grads actually drive exploration in RL (high entropy in d); when the base LLM presents a VG gap, this enables in-context search, i.e. meta-RL (more verifications in b), improving accuracy (a). Thus, we go from RL ▶️ meta-RL!
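A simplified sketch of what masking negative gradients in a GRPO-style update could look like. GRPO's group-normalized advantages are standard; the `mask_negative` ablation flag is illustrative, not the authors' implementation:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor,
                              mask_negative: bool = False) -> torch.Tensor:
    """GRPO-style advantages: normalize rewards across a group of responses to the
    same prompt. With mask_negative=True, negative advantages are zeroed, removing
    the 'negative gradient' (the ablation described above; illustrative only)."""
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    if mask_negative:
        adv = torch.clamp(adv, min=0.0)  # keep only positive (reinforcing) terms
    return adv

# The policy-gradient loss then weights per-token log-probs by these advantages,
# so zeroed advantages contribute no gradient from wrong responses.
```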
@setlur_amrith
Amrith Setlur
23 days
#2 Negative gradients explain the increasing response length in RL (most recent RL works ignore length trends) 😮 Neg. grads explicitly disincentivize premature termination of wrong answers: p(<eos> | wrong ans) 🔻. During RL, #2 + #1 increase length & the # of chained asymmetries!
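A quick way to see the mechanism, using standard policy-gradient algebra (notation is mine, not the paper's derivation): for a sampled response y with advantage A,

$$\nabla_\theta\,\big[A \log \pi_\theta(y \mid x)\big] \;=\; A \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(y_t \mid x, y_{<t}),$$

so when a wrong answer (A < 0) ends with y_T = <eos>, ascending this objective pushes log π_θ(<eos> | x, y_{<T}) down. That is exactly p(<eos> | wrong ans) 🔻: incorrect traces are terminated later, and responses grow longer.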
@setlur_amrith
Amrith Setlur
23 days
#1 Chaining ⛓️ asymmetric capabilities in the base LLM. When the base LLM has a bias to chain verification (easy) with generation (hard) & exploits the Ver-Gen (VG) gap, RL amplifies the chaining of asymmetries to discover strategies that differ from sharpening, as it composes useful primitives!
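A rough sketch of how a verification-generation (VG) gap could be quantified: compare how often the base LLM solves problems from scratch against how often it correctly judges provided candidate solutions. Every name below (`Problem` fields, `generate_answer`, `verify_candidate`) is a hypothetical placeholder, not the paper's evaluation harness:

```python
from dataclasses import dataclass

@dataclass
class Problem:
    statement: str
    answer: str
    candidate: str              # a proposed solution to be judged
    candidate_is_correct: bool  # ground-truth label for the candidate

def vg_gap(llm, problems, generate_answer, verify_candidate) -> float:
    """Positive gap => verification is easier than generation for this model."""
    gen_acc = sum(generate_answer(llm, p.statement) == p.answer
                  for p in problems) / len(problems)
    ver_acc = sum(verify_candidate(llm, p.statement, p.candidate) == p.candidate_is_correct
                  for p in problems) / len(problems)
    return ver_acc - gen_acc
```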
@setlur_amrith
Amrith Setlur
23 days
RL with our recipe e3 (Exploration Enables Extrapolation) can get LLMs to:
🚀 implement an in-context search algorithm,
📈 extrapolate test compute (training on at most 16k tokens, but improving performance up to 32k test tokens).
The e3 recipe 🍪 has 3 key ingredients, see ⤵️
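A minimal sketch of how the extrapolation claim can be checked: evaluate the same checkpoint at test-time token budgets beyond the 16k used in training (the 16k/32k numbers come from the text above; `evaluate_accuracy` is a hypothetical helper):

```python
TRAIN_BUDGET = 16_384
TEST_BUDGETS = [8_192, 16_384, 24_576, 32_768]

def extrapolation_curve(model, eval_set, evaluate_accuracy):
    """Accuracy at each test budget; extrapolation means the curve keeps rising
    for budgets larger than TRAIN_BUDGET."""
    return {budget: evaluate_accuracy(model, eval_set, max_new_tokens=budget)
            for budget in TEST_BUDGETS}
```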
@setlur_amrith
Amrith Setlur
23 days
Introducing e3 🔥 The best <2B model on math 💪. Are LLMs implementing algorithms ⚒️ OR is thinking an illusion 🎩? Is RL only sharpening the base LLM's distribution 🤔 OR discovering novel strategies outside the base LLM 💡? We answer these ⤵️ 🚨🚨
@setlur_amrith
Amrith Setlur
2 months
Attend our Scaling Self-improvement workshop @iclr_conf (Garnet 214-215) for some amazing talks and a fiery panel discussion (5-6pm)🔥.
@robertarail
Roberta Raileanu
2 months
With a stellar lineup of speakers and panelists, including Yoshua Bengio 🙀, the Scaling Self-Improving Foundation Models workshop at @iclr_conf promises to be 🔥.
⏰ Sunday, April 27
📍 Garnet 214-215
@setlur_amrith
Amrith Setlur
2 months
I couldn't be there @iclr_conf but if you are interested in process verifiers that can boost exploration and get LLMs to solve hard problems, check out our spotlight poster on PAVs at 3pm Hall 3+2B #548. Also chat with the amazing @ianwu97 who will be presenting on our behalf!
@setlur_amrith
Amrith Setlur
3 months
It's easy to (pre-)train LLMs by imitating discrete actions (next tokens). Surprisingly, imitating *continuous* actions (e.g. in robots 🤖) is "exponentially" hard for *any* algorithm 🤯 that only uses expert data, even when the expert is deterministic 🙀! Check out this cool work:
@max_simchowitz
Max Simchowitz
3 months
There's a lot of awesome research about LLM reasoning right now. But how is learning in the physical world 🤖 different than in language 📚? In a new paper, we show that imitation learning in continuous spaces can be exponentially harder than for discrete state spaces, even when
@setlur_amrith
Amrith Setlur
3 months
This was a cool collaboration led by Kevin Kuo, with @AdtRaghunathan and @gingsmith. Questions and feedback always welcome 🙏. For details, check out:
@setlur_amrith
Amrith Setlur
3 months
Overall, this is one of the first works on large-scale *exact* unlearning of finetuning data that exploits model merging & localization! More work is definitely needed to address the computation/utility tradeoffs for unlearning (we also discuss tradeoffs like storage costs).
@setlur_amrith
Amrith Setlur
3 months
SIFT-Masks:
- 250x 🤯 cheaper (than re-training) to unlearn 500 tasks
- Accuracy very close to central training ‼️