Mian Wu Profile
Mian Wu

@MerlinNoth79247

Followers: 194
Following: 94
Media: 6
Statuses: 24

Reinforcement learning && LLM | Research at Robotics AI & Learning Lab @berkeley_ai | https://t.co/Itq6wFMmoL

Joined May 2025
@MerlinNoth79247
Mian Wu
7 days
Can we run RL to train LLMs on hard-to-verify or open-ended tasks? Even when tasks are verifiable, it is often impossible to check every design detail or catch all mistakes. We could prompt-tune LLM judges, but is that really the answer? Our new paper introduces RLAC: a
9
57
338
@RichardSSutton
Richard Sutton
2 months
My acceptance speech at the Turing award ceremony: Good evening ladies and gentlemen. The main idea of reinforcement learning is that a machine might discover what to do on its own, without being told, from its own experience, by trial and error. As far as I know, the first
62
225
2K
@MerlinNoth79247
Mian Wu
6 days
RLAC also works well on code generation tasks, consistently achieving better performance and requiring fewer test-case executions than both enumerative and reward-model approaches on most benchmarks.
1
2
9
@MerlinNoth79247
Mian Wu
6 days
We evaluate RLAC on factual text generation (concise biographies). It produces more accurate outputs while using far fewer validator calls, up to around 5–6× fewer checks for longer biographies.
1
3
10
@MerlinNoth79247
Mian Wu
6 days
Here’s how we train RLAC in practice. The generator produces answers, the critic proposes checks, the validator labels them, and we update both models. This cycle can work with any online or offline RL algorithm.
1
3
14
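A minimal sketch of the training cycle described in the tweet above, assuming placeholder interfaces (generator, critic, validator, rl_update) rather than the paper's actual implementation:

```python
# Hedged sketch of an RLAC-style training cycle (all names are assumptions).
def rlac_training_step(prompts, generator, critic, validator, rl_update):
    gen_experience, critic_experience = [], []
    for x in prompts:
        y = generator.sample(x)               # generator produces an answer
        check = critic.propose(x, y)          # critic proposes the check most likely to fail
        passed = validator.verify(y, check)   # external validator labels that single check
        gen_experience.append((x, y, 1.0 if passed else 0.0))            # generator rewarded if the check passes
        critic_experience.append((x, y, check, 0.0 if passed else 1.0))  # critic rewarded if it exposes a flaw
    rl_update(generator, gen_experience)      # any online or offline RL algorithm can consume these tuples
    rl_update(critic, critic_experience)
```

Because the validator only ever labels the one check the critic proposes per answer, this is where the reduction in validator calls mentioned in the thread would come from.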
@MerlinNoth79247
Mian Wu
6 days
Formally, satisfying all rubrics is equivalent to a minimum over rubrics, which yields a min–max objective:
1
3
11
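The equation attached to the tweet above is not included in this capture; a plausible reconstruction of such a min–max objective, with all notation assumed rather than taken from the paper:

```latex
% Hedged reconstruction, not the paper's exact formula.
% \pi is the generator policy, c a rubric/check chosen by an adversarial critic,
% and v(x, y, c) \in \{0, 1\} the external validator's verdict on that check.
\max_{\pi} \; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi(\cdot \mid x)}
  \left[ \min_{c \in \mathcal{C}(x, y)} v(x, y, c) \right]
```

Read this way, the critic approximates the inner minimization by proposing the check most likely to fail, while the generator maximizes the outer expectation.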
@MerlinNoth79247
Mian Wu
6 days
RLAC takes a third approach. For each answer, a critic tries to pinpoint the most likely mistake as a check (what we call a rubric in the paper), and an external validator tests that check. If the check fails, the critic is rewarded. If it passes, the generator is rewarded.
1
1
12
@MerlinNoth79247
Mian Wu
6 days
Why is this hard? In free-form tasks (long outputs, code, math proofs), an answer may satisfy many hidden checks. Checking all of them is expensive, so RL post-training either enumerates checks (accurate but slow) or uses one learned reward score (cheap but easy to game).
1
2
18
@setlur_amrith
Amrith Setlur
5 months
Since R1 there has been a lot of chatter 💬 on post-training LLMs with RL. Is RL only sharpening the distribution over correct responses sampled by the pretrained LLM OR is it exploring and discovering new strategies 🤔? Find answers in our latest post ⬇️ https://t.co/WCEq3K4dB0
pinnate-flare-8f3.notion.site
Amrith Setlur and Aviral Kumar, Carnegie Mellon University
2
30
154
@GraceLiu78
Grace Liu
24 days
NEW PAPER: "CaRT: Teaching LLM Agents to Know When They Know Enough"! LLMs often overthink, ask too many questions, or waste compute. We introduce Counterfactuals and Reasoning for Termination (CaRT) - teaching LLMs when to stop gathering info and make decisions. 🧵[1/9]
1
12
36
@aviral_kumar2
Aviral Kumar
5 months
Our view on test-time scaling has been to train models to discover algos that enable them to solve harder problems. @setlur_amrith & @matthewyryang's new work e3 shows how RL done with this view produces best <2B LLM on math that extrapolates beyond training budget. 🧵⬇️
2
32
182
@svlevine
Sergey Levine
15 days
Now we can have self-refinement for VLAs in the real world (with the aid of a big VLM)! VLM critiques VLA rollouts and iteratively refines the commands to make it perform better.
@ameeshsh
Ameesh Shah
15 days
LLMs have shown a remarkable ability to “self-refine” and learn from their mistakes via in-context learning. But in robotics, most methods are single-shot. How can we bring inference-time adaptation to robot learning? A 🧵:
5
25
246
@CVPR
#CVPR2026
28 days
The #CVPR2026 submission portal is now open!
2
16
102
@seohong_park
Seohong Park
1 month
Introducing *dual representations*! tl;dr: We represent a state by the "set of similarities" to all other states. This dual perspective has lots of nice properties and practical benefits in RL. Blog post: https://t.co/lw1PortD9E Paper: https://t.co/zYKFjyOy7C
14
98
790
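A toy illustration of the idea in the tweet above, where a state is represented by its vector of similarities to a set of other states; the specific similarity function (an RBF kernel on raw state features) and all names are assumptions for illustration, not the paper's construction:

```python
import numpy as np

def dual_representation(state, reference_states, length_scale=1.0):
    # Represent `state` by its similarity to every reference state.
    diffs = reference_states - state                       # (N, d) differences
    sq_dists = np.sum(diffs ** 2, axis=1)                  # squared distances to each reference state
    return np.exp(-sq_dists / (2.0 * length_scale ** 2))   # (N,) vector of similarities

# Example: each state becomes an N-dimensional similarity vector.
references = np.random.randn(100, 4)   # 100 reference states with 4 features each
s = np.random.randn(4)
phi_dual = dual_representation(s, references)   # shape (100,)
```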
@svlevine
Sergey Levine
2 months
A new VLA for navigation that can take in goal images, positions, and language, and exhibits some pretty neat emergent language following!
@NoriakiHirose
noriaki_hirose
2 months
We trained OmniVLA, a robotic foundation model for navigation conditioned on language, goal poses, and images. Initialized with OpenVLA, it leverages Internet-scale knowledge for strong OOD performance. Great collaboration with @CatGlossop, @shahdhruv_, and @svlevine.
6
47
374
@kvablack
Kevin Black
2 months
@physical_int
Physical Intelligence
2 months
We've added pi-05 to the openpi repo: pi05-base, pi05-droid, pi05-libero. Also added PyTorch training code!🔥 Instructions and code here: https://t.co/EOhNYfpq9B This is an updated version of the model we showed cleaning kitchens and bedrooms in April:
7
5
162
@svlevine
Sergey Levine
3 months
Language following is a tough problem for VLAs: while these models can follow complex language, in practice getting datasets that enable language following is hard. We developed a method to counterfactually and automatically label data to improve language following! 🧵👇
8
69
417
@sewon__min
Sewon Min
3 months
Thanks for the invite! Excited to be presenting our work on training MoE over distributed data next Monday!
@jyo_pari
Jyo Pari
3 months
We have a fun collaboration of @GPU_MODE x @scaleml coming up! We’re hosting a week-long online bootcamp that explores the core components of GPT-OSS while also diving into cutting-edge research that pushes beyond what’s currently in GPT-OSS! For example, how can MoE's power
1
6
117
@aviral_kumar2
Aviral Kumar
5 months
Given the confusion around what RL does for reasoning in LLMs, @setlur_amrith & I wrote a new blog post on when RL simply sharpens the base model & when it discovers new reasoning strategies. Learn how to measure discovery + methods to enable it ⬇️ https://t.co/Ax7zEiy2DT
pinnate-flare-8f3.notion.site
Amrith Setlur and Aviral Kumar, Carnegie Mellon University
4
41
274