Amrith Setlur

@setlur_amrith

Followers: 808
Following: 401
Media: 41
Statuses: 128

PhD student at CMU.

Pittsburgh, PA
Joined April 2020
@setlur_amrith
Amrith Setlur
13 days
Since R1 there has been a lot of chatter 💬 on post-training LLMs with RL. Is RL only sharpening the distribution over correct responses sampled by the pretrained LLM, OR is it exploring and discovering new strategies 🤔? Find answers in our latest post ⬇️
@setlur_amrith
Amrith Setlur
23 days
This work was an amazing collaboration 🤝 with an awesome set of co-authors ✨: @matthewyryang @ianwu97 @sea_snell @JeremyGreerOumi @gingsmith @max_simchowitz and @aviral_kumar2! Big thanks to @gneubig and @XiongChenyan for generous compute support 🙏.
@setlur_amrith
Amrith Setlur
23 days
Recent works on max-confidence training, sharpening, & spurious rewards argue RL only finds a mode already present in the base LLM, but we argue this holds only when training with poorly chosen token-length budgets for the data. If we optimize for extrapolation via chaining, we see the true promise of RL!
@setlur_amrith
Amrith Setlur
23 days
In summary, we have the power of three 🔌 in e3:
- Asymmetries in the base LLM
- Neg. gradients in RL
- Coupled task & budget curriculum
What do we get?
✅ In-context exploration
✅ Extrapolation of test compute
😉 Also a pun on the popular RL algorithm for exploration, E3 @mkearnsupenn.
@setlur_amrith
Amrith Setlur
23 days
e3 learns in-context search, going beyond the recent wave of work on RL + sharpening. It builds on our past work too:
1. RL (pos + neg grad) >> SFT (pos grad only)
2. Optimizing test-time compute is a meta-RL problem.
@setlur_amrith
Amrith Setlur
23 days
🎯 The e3 recipe enables RL to discover new solutions, not just distill the base model's pass@k policy into pass@1. e3 improves pass@32 over our base LLM and also over other models trained to optimize pass@k 😮 ⤵️
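For context, pass@k here is the standard metric: the probability that at least one of k sampled responses is correct. A minimal sketch of the usual unbiased estimator (generic, assuming the Chen et al. 2021 formulation; not code from the e3 release):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n sampled responses, c of which are correct."""
    if n - c < k:
        return 1.0  # every size-k subset of the n samples contains a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 32 samples per problem, 5 correct.
print(pass_at_k(n=32, c=5, k=1))   # ≈ 0.156 (pass@1)
print(pass_at_k(n=32, c=5, k=32))  # 1.0     (pass@32)
```

"Distilling pass@k into pass@1" then means RL raises pass@1 toward the base model's pass@k without improving pass@k itself; the claim above is that e3 instead raises pass@32 too.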
@setlur_amrith
Amrith Setlur
23 days
Results 🚀: Qwen3-1.7B finetuned with our e3 recipe outperforms many <2B open source reasoning models on AIME/HMMT '25, and also some 7B and 32B models (a,b) 💪. Our model naturally (no test-time prompting needed) extrapolates test-compute better than budget forcing in s1 (c).
@setlur_amrith
Amrith Setlur
23 days
Why is a coupled curriculum needed 🤔?
❌ Fixing a low token budget curtails length increase & discovery during "thinking".
❌ Fixing a high budget is harder to optimize (gradient variance in the policy gradient).
❌ Training on hard problems at a lower budget again kills the chaining of asymmetries.
@setlur_amrith
Amrith Setlur
23 days
#3 Coupled budget & task curriculum, where we jointly increase token budget and task difficulty to incentivize in-context exploration: first train on easy problems at a lower budget (8k), then train on harder problems at a longer budget of 16k tokens!
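A minimal sketch of what such a coupled curriculum could look like in training code; the 8k/16k budgets come from the description above, while `rl_finetune` and `load_problems` are hypothetical placeholders, not the actual e3 implementation:

```python
# Two coupled stages: task difficulty and token budget increase together.
STAGES = [
    {"difficulty": "easy", "max_response_tokens": 8_192},   # stage 1
    {"difficulty": "hard", "max_response_tokens": 16_384},  # stage 2
]

def run_coupled_curriculum(policy, rl_finetune, load_problems):
    """Train with RL stage by stage, raising difficulty and budget in lockstep.
    `rl_finetune` and `load_problems` are illustrative helpers, not e3 code."""
    for stage in STAGES:
        problems = load_problems(difficulty=stage["difficulty"])
        policy = rl_finetune(policy, problems,
                             max_response_tokens=stage["max_response_tokens"])
    return policy
```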
@setlur_amrith
Amrith Setlur
23 days
We explicitly mask neg. grads during GRPO and find that neg. grads actually drive exploration in RL (high entropy in d); when the base LLM presents a VG gap, this enables in-context search, i.e. meta-RL (more verifications in b), improving accuracy (a). Thus, we go from RL ▶️ meta-RL!
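A simplified sketch of what masking negative gradients in a GRPO-style update could look like. GRPO's group-normalized advantages are standard; the `mask_negative` ablation flag is illustrative, not the authors' implementation:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor,
                              mask_negative: bool = False) -> torch.Tensor:
    """GRPO-style advantages: normalize rewards across a group of responses to the
    same prompt. With mask_negative=True, negative advantages are zeroed, removing
    the 'negative gradient' (the ablation described above; illustrative only)."""
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    if mask_negative:
        adv = torch.clamp(adv, min=0.0)  # keep only positive (reinforcing) terms
    return adv

# The policy-gradient loss then weights per-token log-probs by these advantages,
# so zeroed advantages contribute no gradient from wrong responses.
```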
@setlur_amrith
Amrith Setlur
23 days
#2 Negative gradients explain the increasing response length in RL (most recent RL works ignore length trends) 😮 Neg. grads explicitly disincentivize premature termination of wrong answers: p(<eos> | wrong ans) 🔻. During RL, #2 + #1 increase length & the # of chained asymmetries!
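A quick way to see the mechanism, using standard policy-gradient algebra (notation is mine, not the paper's derivation): for a sampled response y with advantage A,

$$\nabla_\theta\,\big[A \log \pi_\theta(y \mid x)\big] \;=\; A \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(y_t \mid x, y_{<t}),$$

so when a wrong answer (A < 0) ends with y_T = <eos>, ascending this objective pushes log π_θ(<eos> | x, y_{<T}) down. That is exactly p(<eos> | wrong ans) 🔻: incorrect traces are terminated later, and responses grow longer.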
@setlur_amrith
Amrith Setlur
23 days
#1 Chaining ⛓️ asymmetric capabilities in the base LLM. When the base LLM has a bias to chain verification (easy) with generation (hard) & exploits the Ver-Gen (VG) gap, RL amplifies the chaining of asymmetries to discover strategies that differ from sharpening, as it composes useful primitives!
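A rough sketch of how a verification-generation (VG) gap could be quantified: compare how often the base LLM solves problems from scratch against how often it correctly judges provided candidate solutions. Every name below (`Problem` fields, `generate_answer`, `verify_candidate`) is a hypothetical placeholder, not the paper's evaluation harness:

```python
from dataclasses import dataclass

@dataclass
class Problem:
    statement: str
    answer: str
    candidate: str              # a proposed solution to be judged
    candidate_is_correct: bool  # ground-truth label for the candidate

def vg_gap(llm, problems, generate_answer, verify_candidate) -> float:
    """Positive gap => verification is easier than generation for this model."""
    gen_acc = sum(generate_answer(llm, p.statement) == p.answer
                  for p in problems) / len(problems)
    ver_acc = sum(verify_candidate(llm, p.statement, p.candidate) == p.candidate_is_correct
                  for p in problems) / len(problems)
    return ver_acc - gen_acc
```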
@setlur_amrith
Amrith Setlur
23 days
RL with our recipe e3 (Exploration Enables Extrapolation) can get LLMs to:
🚀 implement an in-context search algorithm,
📈 extrapolate test compute (training on at most 16k tokens, but improving performance up to 32k test tokens).
The e3 recipe 🍪 has 3 key ingredients, see ⤵️
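A minimal sketch of how the extrapolation claim can be checked: evaluate the same checkpoint at test-time token budgets beyond the 16k used in training (the 16k/32k numbers come from the text above; `evaluate_accuracy` is a hypothetical helper):

```python
TRAIN_BUDGET = 16_384
TEST_BUDGETS = [8_192, 16_384, 24_576, 32_768]

def extrapolation_curve(model, eval_set, evaluate_accuracy):
    """Accuracy at each test budget; extrapolation means the curve keeps rising
    for budgets larger than TRAIN_BUDGET."""
    return {budget: evaluate_accuracy(model, eval_set, max_new_tokens=budget)
            for budget in TEST_BUDGETS}
```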
@setlur_amrith
Amrith Setlur
23 days
Introducing e3 🔥 The best <2B model on math 💪. Are LLMs implementing algorithms ⚒️ OR is thinking an illusion 🎩? Is RL only sharpening the base LLM's distribution 🤔 OR discovering novel strategies outside the base LLM 💡? We answer these ⤵️ 🚨🚨
@setlur_amrith
Amrith Setlur
2 months
Attend our Scaling Self-improvement workshop @iclr_conf (Garnet 214-215) for some amazing talks and a fiery panel discussion (5-6pm)🔥.
@robertarail
Roberta Raileanu
2 months
With a stellar lineup of speakers and panelists, including Yoshua Bengio 🙀, the Scaling Self-Improving Foundation Models workshop at @iclr_conf promises to be 🔥.
⏰ Sunday, April 27
📍 Garnet 214-215
@setlur_amrith
Amrith Setlur
2 months
I couldn't be there @iclr_conf but if you are interested in process verifiers that can boost exploration and get LLMs to solve hard problems, check out our spotlight poster on PAVs at 3pm Hall 3+2B #548. Also chat with the amazing @ianwu97 who will be presenting on our behalf!
@setlur_amrith
Amrith Setlur
3 months
It's easy to (pre-)train LLMs by imitating discrete actions (next tokens). Surprisingly, imitating *continuous* actions (e.g. in robots 🤖) is "exponentially" hard for *any* algorithm 🤯 that only uses expert data, even when the expert is deterministic 🙀! Check out this cool work:
@max_simchowitz
Max Simchowitz
3 months
There's a lot of awesome research about LLM reasoning right now. But how is learning in the physical world 🤖 different than in language 📚? In a new paper, we show that imitation learning in continuous spaces can be exponentially harder than for discrete state spaces, even when
@setlur_amrith
Amrith Setlur
3 months
This was a cool collaboration led by Kevin Kuo, with @AdtRaghunathan and @gingsmith. Questions and feedback always welcome 🙏. For details, check out:
@setlur_amrith
Amrith Setlur
3 months
Overall, this is one of the first works on large-scale *exact* unlearning of finetuning data that exploits model merging & localization! More work is definitely needed to address the computation/utility tradeoffs for unlearning (we also discuss tradeoffs like storage costs).
@setlur_amrith
Amrith Setlur
3 months
SIFT-Masks:
- 250x 🤯 cheaper (than re-training) to unlearn 500 tasks
- Accuracy very close to central training ‼️