Matthew Yang
@matthewyryang
Followers 47 · Following 51 · Media 7 · Statuses 27
MSML student @ CMU
Joined August 2024
Introducing RaC: a data collection protocol that boosts data efficiency by 10x compared to some of the best imitation learning results. Key idea: scale recovery & correction data systematically => policies can reset+retry while acting (consistently self-correcting) => better performance. 🧵0/N
Replies 11 · Reposts 38 · Likes 210
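Not from the RaC paper, but a minimal sketch of what "scaling recovery & correction data systematically" could look like inside a collection loop; `env`, `teleop`, `is_off_nominal`, and the dataset labels are hypothetical stand-ins, not the RaC API.

```python
# Hypothetical sketch: collect recovery & correction segments alongside nominal demos.
# `env`, `teleop`, and `is_off_nominal` are stand-ins, not the RaC protocol's actual interface.

def collect_episode(env, teleop, is_off_nominal, dataset):
    obs = env.reset()
    done = False
    while not done:
        action = teleop.get_action(obs)            # human demonstration step
        next_obs, done = env.step(action)
        if is_off_nominal(next_obs):
            # The demonstrator deliberately recovers back to a good state,
            # then corrects the task; both segments are kept and labeled so
            # the learned policy sees how to reset+retry at deployment time.
            recovery = teleop.record_until_nominal(env)
            correction = teleop.record_until_success(env)
            dataset.add(segment=recovery, label="recovery")
            dataset.add(segment=correction, label="correction")
        else:
            dataset.add(transition=(obs, action, next_obs), label="nominal")
        obs = next_obs
```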
🚨🚨New paper on core RL: a way to train value-functions via flow-matching for scaling compute! No text/images, but a flow directly on a scalar Q-value. This unlocks benefits of iterative compute, test-time scaling for value prediction & SOTA results on whatever we tried. 🧵⬇️
Replies 11 · Reposts 83 · Likes 708
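A minimal, self-contained sketch (PyTorch, with made-up layer sizes; not the paper's architecture or loss) of flow matching on a scalar Q-value: learn a velocity field that transports Gaussian noise to the target Q conditioned on (s, a), then spend more inference compute by integrating the learned flow for more steps.

```python
import torch
import torch.nn as nn

class QVelocityField(nn.Module):
    """Predicts the flow velocity for a scalar q_t, conditioned on (s, a, t)."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + 2, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a, q_t, t):
        return self.net(torch.cat([s, a, q_t, t], dim=-1))

def flow_matching_loss(model, s, a, q_target):
    """Standard conditional flow-matching loss on a scalar Q target."""
    q0 = torch.randn_like(q_target)          # noise sample
    t = torch.rand(q_target.shape[0], 1)     # time in [0, 1]
    q_t = (1 - t) * q0 + t * q_target        # linear interpolation path
    v_target = q_target - q0                 # constant velocity along that path
    v_pred = model(s, a, q_t, t)
    return ((v_pred - v_target) ** 2).mean()

@torch.no_grad()
def predict_q(model, s, a, steps=8):
    """More integration steps = more inference compute spent on value prediction."""
    q = torch.randn(s.shape[0], 1)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((s.shape[0], 1), i * dt)
        q = q + dt * model(s, a, q, t)       # Euler step along the learned flow
    return q
```

With steps=1 this collapses to a single regression-style prediction; increasing steps is where the iterative-compute / test-time-scaling benefit the post mentions would come from.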
Nice to see ideas from our e3 paper (https://t.co/tUAKAqDO05), chaining asymmetries to learn meta-behaviors, also at work on didactic tasks!
🧩 New blog: From f(x) and g(x) to f(g(x)): LLMs Learn New Skills in RL by Composing Old Ones
Do LLMs learn new skills through RL, or just activate existing patterns? Answer: RL teaches the powerful meta-skill of composition when properly incentivized. 🔗: https://t.co/4Ud8qsYrOT
Replies 0 · Reposts 3 · Likes 23
🚀 Introducing Wan2.2: The World's First Open-Source MoE-Architecture Video Generation Model with Cinematic Control! 🔥 Key Innovations:
• World's First Open-Source MoE Video Model: Our Mixture-of-Experts architecture scales model capacity without increasing computational
Replies 84 · Reposts 311 · Likes 2K
I’m excited to be the Chief AI Officer of @Meta, working alongside @natfriedman, and thrilled to be accompanied by an incredible group of people joining on the same day. Towards superintelligence 🚀
Replies 1K · Reposts 2K · Likes 23K
Since R1 there has been a lot of chatter 💬 on post-training LLMs with RL. Is RL only sharpening the distribution over correct responses sampled by the pretrained LLM OR is it exploring and discovering new strategies 🤔? Find answers in our latest post ⬇️ https://t.co/WCEq3K4dB0
pinnate-flare-8f3.notion.site
Amrith Setlur and Aviral Kumar, Carnegie Mellon University
Replies 2 · Reposts 30 · Likes 154
Our view on test-time scaling has been to train models to discover algorithms that enable them to solve harder problems. @setlur_amrith & @matthewyryang's new work e3 shows how RL done with this view produces the best <2B LLM on math, one that extrapolates beyond the training budget. 🧵⬇️
Replies 2 · Reposts 32 · Likes 182
Introducing e3 🔥 Best <2B model on math 💪 Are LLMs implementing algorithms ⚒️ OR is thinking an illusion 🎩? Is RL only sharpening the base LLM distribution 🤔 OR discovering novel strategies outside the base LLM 💡? We answer these ⤵️ 🚨 https://t.co/xbLULWYTmM 🚨 https://t.co/xuruZtQ6BA
Replies 1 · Reposts 24 · Likes 96
🙏 Work done with an amazing set of collaborators: @setlur_amrith, @sea_snell, @JeremyGreerOumi, @ianwu97, advised by @gingsmith, @max_simchowitz, and @aviral_kumar2! Model: https://t.co/eyd2SfJVF0 Code: https://t.co/lExPApDtRw 🧵[8/8]
github.com · ars22/e3
Replies 0 · Reposts 0 · Likes 3
Without matching prompts to budget:
- Too little budget for hard prompts → kills exploration early
- Too much budget for easy prompts → over-exploratory behavior
Blue: fixed data mixture · Green: fixed training budget · Black: coupled curriculum (e3) 🧵[7/8]
Replies 1 · Reposts 0 · Likes 1
Ingredient #3: Coupled Curriculum
To fully unlock in-context exploration, RL must operate in the right mode: not (i) sharpening known responses, but (ii) chaining new ones. This requires coupling the right prompts with the right budget during training. 🧵[6/8]
Replies 1 · Reposts 0 · Likes 0
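A minimal sketch of coupling prompts to budgets as in the two posts above; the pass-rate thresholds, budget values, and the `estimate_pass_rate` helper are hypothetical stand-ins, not the e3 paper's actual procedure.

```python
import random

def estimate_pass_rate(prompt, policy, budget, n=8):
    """Hypothetical helper: fraction of n sampled solutions that are correct."""
    return sum(policy.solve(prompt, max_tokens=budget) for _ in range(n)) / n

def coupled_curriculum(prompts, policy, budgets=(2048, 4096, 8192)):
    """Pair each training prompt with a token budget matched to its difficulty."""
    batch = []
    for prompt in prompts:
        p = estimate_pass_rate(prompt, policy, budget=max(budgets))
        if p > 0.7:
            budget = budgets[0]   # easy: small budget avoids over-exploration
        elif p > 0.2:
            budget = budgets[1]   # medium
        else:
            budget = budgets[2]   # hard: large budget keeps exploration alive
        batch.append((prompt, budget))
    random.shuffle(batch)
    return batch
```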
Ingredient #2: Negative Gradient
Chaining leads to in-context exploration, but how can we incentivize it? Enter the "negative gradient" in RL, which pushes down the probability of EOS in favor of continuing and trying new things. 🧵[5/8]
Replies 1 · Reposts 0 · Likes 0
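A toy, runnable illustration (PyTorch; not the paper's training code) of the effect described above: a policy-gradient step with a negative advantage on a sampled EOS token lowers the EOS probability, shifting mass toward continuation tokens.

```python
import torch

# Toy vocabulary: index 0 is EOS, indices 1..4 are "continuation" tokens.
logits = torch.zeros(5, requires_grad=True)
eos = 0
advantage = -1.0  # the sampled response that stopped early was penalized

log_probs = torch.log_softmax(logits, dim=-1)
loss = -advantage * log_probs[eos]   # REINFORCE surrogate for the sampled EOS action
loss.backward()

with torch.no_grad():
    updated = logits - 0.5 * logits.grad                       # one gradient-descent step
    p_before = torch.softmax(logits, dim=-1)[eos].item()
    p_after = torch.softmax(updated, dim=-1)[eos].item()
    print(f"P(EOS) before: {p_before:.3f}, after: {p_after:.3f}")  # EOS probability drops
```

Running it shows P(EOS) falling from 0.200 to about 0.13 after a single step, i.e., probability mass moves toward continuing rather than terminating.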
Ingredient #1: Asymmetries
Asymmetries = differences in competence across base-model capabilities (e.g., verification ≠ generation)
✅ Models with asymmetries learn to explore by chaining them, leading to longer responses
❌ Models without asymmetries do not 🧵[4/8]
Replies 1 · Reposts 0 · Likes 0
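A minimal sketch of what chaining a verification/generation asymmetry in-context might look like; `generate_step` and `verify` are hypothetical stand-ins for the base model's weaker generation and stronger verification skills, and the retry text is purely illustrative.

```python
def explore_in_context(prompt, generate_step, verify, max_tokens=8192):
    """Chain generate -> verify -> retry inside a single response.

    `generate_step` and `verify` are hypothetical stand-ins for the base
    model's (weaker) generation and (stronger) verification capabilities.
    """
    response = ""
    while len(response.split()) < max_tokens:     # crude budget check for the sketch
        attempt = generate_step(prompt, response)  # propose a solution attempt
        response += attempt
        ok, critique = verify(prompt, response)    # the stronger verification skill
        if ok:
            break                                  # verified: stop (emit EOS)
        response += f"\nWait, that looks wrong: {critique}. Let me retry.\n"
    return response
```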
Our result? A SOTA < 2B model on AIME and HMMT’25 that extrapolates to 2x the training budget! We teach models to scale their reasoning with test-time compute 📈 using three key ingredients: 🧵[3/8]
Replies 1 · Reposts 0 · Likes 0
The ultimate promise of test-time scaling is extrapolation: the ability of LLMs to improve as they reason for longer than they were trained to. Most open-source models flat-line as test-time compute increases: more tokens, same performance 💔 they just can’t extrapolate 😔
Replies 1 · Reposts 0 · Likes 0
🚨 NEW PAPER: What if LLMs could tackle harder problems - not by explicitly training on longer traces, but by learning how to think longer? Our recipe e3 teaches models to explore in-context, enabling LLMs to unlock longer reasoning chains without ever seeing them in training.
Replies 1 · Reposts 5 · Likes 10
🤔 How do you explain that when we apply RL to math problems, the incorrect answers become longer than the correct ones? We had this discussion this morning, and I'm curious to know what the community thinks about it.
Replies 38 · Reposts 20 · Likes 185