Mian Wu
@MerlinNoth79247
194 Followers · 94 Following · 6 Media · 24 Statuses
Reinforcement learning && LLM | Research at Robotics AI & Learning Lab @berkeley_ai | https://t.co/Itq6wFMmoL
Joined May 2025
Can we run RL to train LLMs on hard-to-verify or open-ended tasks? Even when tasks are verifiable, it is often impossible to check every design detail or catch all mistakes. We can prompt-tune LLM judges, but is that really the answer? Our new paper introduces RLAC: a
My acceptance speech at the Turing award ceremony: Good evening ladies and gentlemen. The main idea of reinforcement learning is that a machine might discover what to do on its own, without being told, from its own experience, by trial and error. As far as I know, the first
Check out our paper for more results and discussions! Thanks to my amazing advisors @aviral_kumar2 @svlevine @sewon__min for all their guidance and support! Website: https://t.co/RwIkIMtjog Paper:
arxiv.org
Open-ended generation tasks require outputs to satisfy diverse and often implicit task-specific evaluation rubrics. The sheer number of relevant rubrics leads to prohibitively high verification...
RLAC also works well on code generation tasks, achieving better performance and requiring fewer test-case executions than both enumerative and reward-model approaches on most benchmarks.
We evaluate RLAC on factual text generation (concise biographies). It produces more accurate outputs while using far fewer validator calls, up to around 5–6× fewer checks for longer biographies.
Here’s how we train RLAC in practice. The generator produces answers, the critic proposes checks, the validator labels them, and we update both models. This cycle can work with any online or offline RL algorithm.
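A minimal sketch of one such cycle, in Python with hypothetical `generator`, `critic`, `validator`, and `rl_update` interfaces (illustrative names only, not the paper's implementation):

```python
# Minimal sketch of one RLAC training cycle (hypothetical interfaces,
# not the paper's code).

def rlac_step(prompt, generator, critic, validator, rl_update):
    # Generator proposes an answer for the prompt.
    answer = generator.sample(prompt)

    # Critic proposes the single check (rubric) the answer is most likely to fail.
    check = critic.sample(prompt, answer)

    # External validator grounds the game: does the answer pass this check?
    passed = validator(prompt, answer, check)

    # Zero-sum rewards: the generator is rewarded if the check passes,
    # the critic is rewarded if it exposed a real mistake.
    gen_reward = 1.0 if passed else 0.0
    critic_reward = 1.0 - gen_reward

    # Both models can be updated with any online or offline RL algorithm.
    rl_update(generator, context=prompt, action=answer, reward=gen_reward)
    rl_update(critic, context=(prompt, answer), action=check, reward=critic_reward)
```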
Formally, satisfying all rubrics is equivalent to a minimum over rubrics, which yields a min–max objective:
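The equation in the original post was an image; a plausible rendering of the objective it describes, under assumed notation ($\mathcal{C}(x)$ for the rubric set of prompt $x$, $v(y,c)\in\{0,1\}$ for the validator's verdict on check $c$, and a learned critic $c_\phi$ proposing the hardest check):

```latex
% Passing every rubric is the minimum verdict over the rubric set C(x):
\max_{\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_{\theta}(\cdot \mid x)}
\Big[ \min_{c \in \mathcal{C}(x)} v(y, c) \Big]

% Letting a learned critic c_phi propose the hardest check turns this
% into a min--max (adversarial) objective:
\max_{\theta}\,\min_{\phi}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_{\theta}(\cdot \mid x)}
\big[ v\big(y,\, c_{\phi}(x, y)\big) \big]
```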
RLAC takes a third approach. For each answer, a critic tries to pinpoint the most likely mistake as a check (what we call a rubric in the paper), and an external validator tests that check. If the check fails, the critic is rewarded. If it passes, the generator is rewarded.
Why is this hard? In free-form tasks (long outputs, code, math proofs), an answer must satisfy many hidden checks. Checking all of them is expensive, so RL post-training either enumerates checks (accurate but slow) or uses one learned reward score (cheap but easy to game).
Since R1 there has been a lot of chatter 💬 on post-training LLMs with RL. Is RL only sharpening the distribution over correct responses sampled by the pretrained LLM OR is it exploring and discovering new strategies 🤔? Find answers in our latest post ⬇️ https://t.co/WCEq3K4dB0
pinnate-flare-8f3.notion.site
Amrith Setlur and Aviral Kumar, Carnegie Mellon University
NEW PAPER: "CaRT: Teaching LLM Agents to Know When They Know Enough"! LLMs often overthink, ask too many questions, or waste compute. We introduce Counterfactuals and Reasoning for Termination (CaRT) - teaching LLMs when to stop gathering info and make decisions. 🧵[1/9]
Our view on test-time scaling has been to train models to discover algos that enable them to solve harder problems. @setlur_amrith & @matthewyryang's new work e3 shows how RL done with this view produces best <2B LLM on math that extrapolates beyond training budget. 🧵⬇️
Now we can have self-refinement for VLAs in the real world (with the aid of a big VLM)! The VLM critiques VLA rollouts and iteratively refines the commands so the VLA performs better.
LLMs have shown a remarkable ability to “self-refine” and learn from their mistakes via in-context learning. But in robotics, most methods are single-shot. How can we bring inference-time adaptation to robot learning? A 🧵:
Introducing *dual representations*! tl;dr: We represent a state by the "set of similarities" to all other states. This dual perspective has lots of nice properties and practical benefits in RL. Blog post: https://t.co/lw1PortD9E Paper: https://t.co/zYKFjyOy7C ↓
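A toy sketch of the general idea (my own illustration under assumed details, not the paper's construction): represent a state by its vector of similarities to a set of reference states, here using a softmax over negative embedding distances as the similarity.

```python
# Toy illustration only: a state represented by its similarities to reference states.
# The embedding function and the choice of similarity are assumptions, not the paper's.
import numpy as np

def dual_representation(state, reference_states, embed, temperature=1.0):
    """Return a similarity profile of `state` over `reference_states`."""
    z = embed(state)                                        # (d,)
    refs = np.stack([embed(s) for s in reference_states])   # (n, d)
    dists = np.linalg.norm(refs - z[None, :], axis=1)       # distance to each reference
    logits = -dists / temperature
    logits -= logits.max()                                  # numerical stability
    sims = np.exp(logits)
    return sims / sims.sum()                                # (n,) similarity profile
```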
A new VLA for navigation that can take in goal images, positions, and language, and exhibits some pretty neat emergent language following!
We trained OmniVLA, a robotic foundation model for navigation conditioned on language, goal poses, and images. Initialized with OpenVLA, it leverages Internet-scale knowledge for strong OOD performance. Great collaboration with @CatGlossop, @shahdhruv_, and @svlevine.
We've added pi-05 to the openpi repo: pi05-base, pi05-droid, pi05-libero. Also added PyTorch training code!🔥 Instructions and code here: https://t.co/EOhNYfpq9B This is an updated version of the model we showed cleaning kitchens and bedrooms in April:
Language following is a tough problem for VLAs: while these models can follow complex language, in practice getting datasets that enable language following is hard. We developed a method to counterfactually and automatically label data to improve language following! 🧵👇
Thanks for the invite! Excited to be presenting our work on training MoE over distributed data next Monday!
We have a fun collaboration of @GPU_MODE x @scaleml coming up! We’re hosting a week-long online bootcamp that explores the core components of GPT-OSS while also diving into cutting-edge research that pushes beyond what’s currently in GPT-OSS! For example, how can MoE's power
Given the confusion around what RL does for reasoning in LLMs, @setlur_amrith & I wrote a new blog post on when RL simply sharpens the base model & when it discovers new reasoning strategies. Learn how to measure discovery + methods to enable it ⬇️ https://t.co/Ax7zEiy2DT
pinnate-flare-8f3.notion.site
Amrith Setlur and Aviral Kumar, Carnegie Mellon University