Mian Wu
@MerlinNoth79247
194 Followers · 94 Following · 6 Media · 24 Statuses
Reinforcement learning && LLM | Research at Robotics AI & Learning Lab @berkeley_ai | https://t.co/Itq6wFMmoL
Joined May 2025
Can we run RL to train LLMs on hard-to-verify or open-ended tasks? Even when tasks are verifiable, it is often impossible to check every design detail or catch all mistakes. We can prompt-tune LLM judges, but is that really the answer? Our new paper introduces RLAC: a
My acceptance speech at the Turing award ceremony: Good evening ladies and gentlemen. The main idea of reinforcement learning is that a machine might discover what to do on its own, without being told, from its own experience, by trial and error. As far as I know, the first
Check out our paper for more results and discussions! Thanks to my amazing advisors @aviral_kumar2 @svlevine @sewon__min for all their guidance and support! Website: https://t.co/RwIkIMtjog Paper:
arxiv.org
Open-ended generation tasks require outputs to satisfy diverse and often implicit task-specific evaluation rubrics. The sheer number of relevant rubrics leads to prohibitively high verification...
RLAC also works well on code generation tasks, achieving better performance and requiring fewer test-case executions than both enumerative and reward-model approaches on most benchmarks.
We evaluate RLAC on factual text generation (concise biographies). It produces more accurate outputs while using far fewer validator calls, up to around 5–6× fewer checks for longer biographies.
Here’s how we train RLAC in practice. The generator produces answers, the critic proposes checks, the validator labels them, and we update both models. This cycle can work with any online or offline RL algorithm.
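A minimal sketch of one such cycle, in Python with hypothetical `generator`, `critic`, `validator`, and `rl_update` interfaces (illustrative names only, not the paper's implementation):

```python
# Minimal sketch of one RLAC training cycle (hypothetical interfaces,
# not the paper's code).

def rlac_step(prompt, generator, critic, validator, rl_update):
    # Generator proposes an answer for the prompt.
    answer = generator.sample(prompt)

    # Critic proposes the single check (rubric) the answer is most likely to fail.
    check = critic.sample(prompt, answer)

    # External validator grounds the game: does the answer pass this check?
    passed = validator(prompt, answer, check)

    # Zero-sum rewards: the generator is rewarded if the check passes,
    # the critic is rewarded if it exposed a real mistake.
    gen_reward = 1.0 if passed else 0.0
    critic_reward = 1.0 - gen_reward

    # Both models can be updated with any online or offline RL algorithm.
    rl_update(generator, context=prompt, action=answer, reward=gen_reward)
    rl_update(critic, context=(prompt, answer), action=check, reward=critic_reward)
```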
Formally, satisfying all rubrics is equivalent to a minimum over rubrics, which yields a min–max objective:
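The equation in the original post was an image; a plausible rendering of the objective it describes, under assumed notation ($\mathcal{C}(x)$ for the rubric set of prompt $x$, $v(y,c)\in\{0,1\}$ for the validator's verdict on check $c$, and a learned critic $c_\phi$ proposing the hardest check):

```latex
% Passing every rubric is the minimum verdict over the rubric set C(x):
\max_{\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_{\theta}(\cdot \mid x)}
\Big[ \min_{c \in \mathcal{C}(x)} v(y, c) \Big]

% Letting a learned critic c_phi propose the hardest check turns this
% into a min--max (adversarial) objective:
\max_{\theta}\,\min_{\phi}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_{\theta}(\cdot \mid x)}
\big[ v\big(y,\, c_{\phi}(x, y)\big) \big]
```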
RLAC takes a third approach. For each answer, a critic tries to pinpoint the most likely mistake as a check (what we call a rubric in the paper), and an external validator tests that check. If the check fails, the critic is rewarded. If it passes, the generator is rewarded.
Why is this hard? In free-form tasks (long outputs, code, math proofs), an answer must satisfy many hidden checks. Checking all of them is expensive, so RL post-training either enumerates checks (accurate but slow) or uses one learned reward score (cheap but easy to game).
Since R1 there has been a lot of chatter 💬 on post-training LLMs with RL. Is RL only sharpening the distribution over correct responses sampled by the pretrained LLM OR is it exploring and discovering new strategies 🤔? Find answers in our latest post ⬇️ https://t.co/WCEq3K4dB0
pinnate-flare-8f3.notion.site
Amrith Setlur and Aviral Kumar, Carnegie Mellon University
NEW PAPER: "CaRT: Teaching LLM Agents to Know When They Know Enough"! LLMs often overthink, ask too many questions, or waste compute. We introduce Counterfactuals and Reasoning for Termination (CaRT) - teaching LLMs when to stop gathering info and make decisions. 🧵[1/9]
Our view on test-time scaling has been to train models to discover algos that enable them to solve harder problems. @setlur_amrith & @matthewyryang's new work e3 shows how RL done with this view produces best <2B LLM on math that extrapolates beyond training budget. 🧵⬇️
Now we can have self-refinement for VLAs in the real world (with the aid of a big VLM)! The VLM critiques VLA rollouts and iteratively refines the commands so the VLA performs better.
LLMs have shown a remarkable ability to “self-refine” and learn from their mistakes via in-context learning. But in robotics, most methods are single-shot. How can we bring inference-time adaptation to robot learning? A 🧵:
Introducing *dual representations*! tl;dr: We represent a state by the "set of similarities" to all other states. This dual perspective has lots of nice properties and practical benefits in RL. Blog post: https://t.co/lw1PortD9E Paper: https://t.co/zYKFjyOy7C ↓
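A toy sketch of the general idea (my own illustration under assumed details, not the paper's construction): represent a state by its vector of similarities to a set of reference states, here using a softmax over negative embedding distances as the similarity.

```python
# Toy illustration only: a state represented by its similarities to reference states.
# The embedding function and the choice of similarity are assumptions, not the paper's.
import numpy as np

def dual_representation(state, reference_states, embed, temperature=1.0):
    """Return a similarity profile of `state` over `reference_states`."""
    z = embed(state)                                        # (d,)
    refs = np.stack([embed(s) for s in reference_states])   # (n, d)
    dists = np.linalg.norm(refs - z[None, :], axis=1)       # distance to each reference
    logits = -dists / temperature
    logits -= logits.max()                                  # numerical stability
    sims = np.exp(logits)
    return sims / sims.sum()                                # (n,) similarity profile
```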
A new VLA for navigation that can take in goal images, positions, and language, and exhibits some pretty neat emergent language following!
We trained OmniVLA, a robotic foundation model for navigation conditioned on language, goal poses, and images. Initialized with OpenVLA, it leverages Internet-scale knowledge for strong OOD performance. Great collaboration with @CatGlossop, @shahdhruv_, and @svlevine.
We've added pi-05 to the openpi repo: pi05-base, pi05-droid, pi05-libero. Also added PyTorch training code!🔥 Instructions and code here: https://t.co/EOhNYfpq9B This is an updated version of the model we showed cleaning kitchens and bedrooms in April:
Language following is a tough problem for VLAs: while these models can follow complex language, in practice getting datasets that enable language following is hard. We developed a method to counterfactually and automatically label data to improve language following! 🧵👇
Thanks for the invite! Excited to be presenting our work on training MoE over distributed data next Monday!
We have a fun collaboration of @GPU_MODE x @scaleml coming up! We’re hosting a week-long online bootcamp that explores the core components of GPT-OSS while also diving into cutting-edge research that pushes beyond what’s currently in GPT-OSS! For example, how can MoE's power
Given the confusion around what RL does for reasoning in LLMs, @setlur_amrith & I wrote a new blog post on when RL simply sharpens the base model & when it discovers new reasoning strategies. Learn how to measure discovery + methods to enable it ⬇️ https://t.co/Ax7zEiy2DT
pinnate-flare-8f3.notion.site
Amrith Setlur and Aviral Kumar, Carnegie Mellon University