Matthew Yang
@matthewyryang
Followers 47 · Following 51 · Media 7 · Statuses 27
MSML student @ CMU
Joined August 2024
Introducing RaC: a data collection protocol that boosts data efficiency by 10x compared to some of the best imitation learning results. Key idea: scale recovery & correction data systematically => policies can reset+retry while acting (consistently self-correcting) => better performance. 🧵0/N
Replies 11 · Reposts 38 · Likes 210
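Not from the RaC paper, but a minimal sketch of what "scaling recovery & correction data systematically" could look like inside a collection loop; `env`, `teleop`, `is_off_nominal`, and the dataset labels are hypothetical stand-ins, not the RaC API.

```python
# Hypothetical sketch: collect recovery & correction segments alongside nominal demos.
# `env`, `teleop`, and `is_off_nominal` are stand-ins, not the RaC protocol's actual interface.

def collect_episode(env, teleop, is_off_nominal, dataset):
    obs = env.reset()
    done = False
    while not done:
        action = teleop.get_action(obs)            # human demonstration step
        next_obs, done = env.step(action)
        if is_off_nominal(next_obs):
            # The demonstrator deliberately recovers back to a good state,
            # then corrects the task; both segments are kept and labeled so
            # the learned policy sees how to reset+retry at deployment time.
            recovery = teleop.record_until_nominal(env)
            correction = teleop.record_until_success(env)
            dataset.add(segment=recovery, label="recovery")
            dataset.add(segment=correction, label="correction")
        else:
            dataset.add(transition=(obs, action, next_obs), label="nominal")
        obs = next_obs
```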
🚨🚨New paper on core RL: a way to train value-functions via flow-matching for scaling compute! No text/images, but a flow directly on a scalar Q-value. This unlocks benefits of iterative compute, test-time scaling for value prediction & SOTA results on whatever we tried. 🧵⬇️
Replies 11 · Reposts 83 · Likes 708
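A minimal, self-contained sketch (PyTorch, with made-up layer sizes; not the paper's architecture or loss) of flow matching on a scalar Q-value: learn a velocity field that transports Gaussian noise to the target Q conditioned on (s, a), then spend more inference compute by integrating the learned flow for more steps.

```python
import torch
import torch.nn as nn

class QVelocityField(nn.Module):
    """Predicts the flow velocity for a scalar q_t, conditioned on (s, a, t)."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + 2, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a, q_t, t):
        return self.net(torch.cat([s, a, q_t, t], dim=-1))

def flow_matching_loss(model, s, a, q_target):
    """Standard conditional flow-matching loss on a scalar Q target."""
    q0 = torch.randn_like(q_target)          # noise sample
    t = torch.rand(q_target.shape[0], 1)     # time in [0, 1]
    q_t = (1 - t) * q0 + t * q_target        # linear interpolation path
    v_target = q_target - q0                 # constant velocity along that path
    v_pred = model(s, a, q_t, t)
    return ((v_pred - v_target) ** 2).mean()

@torch.no_grad()
def predict_q(model, s, a, steps=8):
    """More integration steps = more inference compute spent on value prediction."""
    q = torch.randn(s.shape[0], 1)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((s.shape[0], 1), i * dt)
        q = q + dt * model(s, a, q, t)       # Euler step along the learned flow
    return q
```

With steps=1 this collapses to a single regression-style prediction; increasing steps is where the iterative-compute / test-time-scaling benefit the post mentions would come from.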
Nice to see ideas from our e3 paper (https://t.co/tUAKAqDO05), chaining asymmetries to learn meta-behaviors, also at work on didactic tasks!
🧩 New blog: From f(x) and g(x) to f(g(x)): LLMs Learn New Skills in RL by Composing Old Ones
Do LLMs learn new skills through RL, or just activate existing patterns? Answer: RL teaches the powerful meta-skill of composition when properly incentivized. 🔗: https://t.co/4Ud8qsYrOT
Replies 0 · Reposts 3 · Likes 23
🚀 Introducing Wan2.2: The World's First Open-Source MoE-Architecture Video Generation Model with Cinematic Control! 🔥 Key Innovations:
• World's First Open-Source MoE Video Model: Our Mixture-of-Experts architecture scales model capacity without increasing computational
Replies 84 · Reposts 311 · Likes 2K
I’m excited to be the Chief AI Officer of @Meta, working alongside @natfriedman, and thrilled to be accompanied by an incredible group of people joining on the same day. Towards superintelligence 🚀
Replies 1K · Reposts 2K · Likes 23K
Since R1 there has been a lot of chatter 💬 on post-training LLMs with RL. Is RL only sharpening the distribution over correct responses sampled by the pretrained LLM OR is it exploring and discovering new strategies 🤔? Find answers in our latest post ⬇️ https://t.co/WCEq3K4dB0
pinnate-flare-8f3.notion.site
Amrith Setlur and Aviral Kumar, Carnegie Mellon University
Replies 2 · Reposts 30 · Likes 154
Our view on test-time scaling has been to train models to discover algorithms that enable them to solve harder problems. @setlur_amrith & @matthewyryang's new work e3 shows how RL done with this view produces the best <2B LLM on math, one that extrapolates beyond the training budget. 🧵⬇️
Replies 2 · Reposts 32 · Likes 182
Introducing e3 🔥 Best <2B model on math 💪 Are LLMs implementing algorithms ⚒️ OR is thinking an illusion 🎩? Is RL only sharpening the base LLM distribution 🤔 OR discovering novel strategies outside the base LLM 💡? We answer these ⤵️ 🚨 https://t.co/xbLULWYTmM 🚨 https://t.co/xuruZtQ6BA
Replies 1 · Reposts 24 · Likes 96
🙏 Work done with an amazing set of collaborators: @setlur_amrith, @sea_snell, @JeremyGreerOumi, @ianwu97, advised by @gingsmith, @max_simchowitz, and @aviral_kumar2! Model: https://t.co/eyd2SfJVF0 Code: https://t.co/lExPApDtRw 🧵[8/8]
github.com · ars22/e3
Replies 0 · Reposts 0 · Likes 3
Without matching prompts to budget:
- Too little budget for hard prompts → kills exploration early
- Too much budget for easy prompts → over-exploratory behavior
Blue: fixed data mixture · Green: fixed training budget · Black: coupled curriculum (e3) 🧵[7/8]
Replies 1 · Reposts 0 · Likes 1
Ingredient #3: Coupled Curriculum
To fully unlock in-context exploration, RL must operate in the right mode: not (i) sharpening known responses, but (ii) chaining new ones. This requires coupling the right prompts with the right budget during training. 🧵[6/8]
Replies 1 · Reposts 0 · Likes 0
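A minimal sketch of coupling prompts to budgets as in the two posts above; the pass-rate thresholds, budget values, and the `estimate_pass_rate` helper are hypothetical stand-ins, not the e3 paper's actual procedure.

```python
import random

def estimate_pass_rate(prompt, policy, budget, n=8):
    """Hypothetical helper: fraction of n sampled solutions that are correct."""
    return sum(policy.solve(prompt, max_tokens=budget) for _ in range(n)) / n

def coupled_curriculum(prompts, policy, budgets=(2048, 4096, 8192)):
    """Pair each training prompt with a token budget matched to its difficulty."""
    batch = []
    for prompt in prompts:
        p = estimate_pass_rate(prompt, policy, budget=max(budgets))
        if p > 0.7:
            budget = budgets[0]   # easy: small budget avoids over-exploration
        elif p > 0.2:
            budget = budgets[1]   # medium
        else:
            budget = budgets[2]   # hard: large budget keeps exploration alive
        batch.append((prompt, budget))
    random.shuffle(batch)
    return batch
```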
Ingredient #2: Negative Gradient
Chaining leads to in-context exploration, but how can we incentivize it? Enter the "negative gradient" in RL, which pushes down the probability of EOS in favor of continuing and trying new things. 🧵[5/8]
Replies 1 · Reposts 0 · Likes 0
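A toy, runnable illustration (PyTorch; not the paper's training code) of the effect described above: a policy-gradient step with a negative advantage on a sampled EOS token lowers the EOS probability, shifting mass toward continuation tokens.

```python
import torch

# Toy vocabulary: index 0 is EOS, indices 1..4 are "continuation" tokens.
logits = torch.zeros(5, requires_grad=True)
eos = 0
advantage = -1.0  # the sampled response that stopped early was penalized

log_probs = torch.log_softmax(logits, dim=-1)
loss = -advantage * log_probs[eos]   # REINFORCE surrogate for the sampled EOS action
loss.backward()

with torch.no_grad():
    updated = logits - 0.5 * logits.grad                       # one gradient-descent step
    p_before = torch.softmax(logits, dim=-1)[eos].item()
    p_after = torch.softmax(updated, dim=-1)[eos].item()
    print(f"P(EOS) before: {p_before:.3f}, after: {p_after:.3f}")  # EOS probability drops
```

Running it shows P(EOS) falling from 0.200 to about 0.13 after a single step, i.e., probability mass moves toward continuing rather than terminating.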
Ingredient #1: Asymmetries
Asymmetries = differences in competence across base-model capabilities (e.g., verification ≠ generation)
✅ Models with asymmetries learn to explore by chaining them, leading to longer responses
❌ Models without asymmetries do not 🧵[4/8]
Replies 1 · Reposts 0 · Likes 0
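A minimal sketch of what chaining a verification/generation asymmetry in-context might look like; `generate_step` and `verify` are hypothetical stand-ins for the base model's weaker generation and stronger verification skills, and the retry text is purely illustrative.

```python
def explore_in_context(prompt, generate_step, verify, max_tokens=8192):
    """Chain generate -> verify -> retry inside a single response.

    `generate_step` and `verify` are hypothetical stand-ins for the base
    model's (weaker) generation and (stronger) verification capabilities.
    """
    response = ""
    while len(response.split()) < max_tokens:     # crude budget check for the sketch
        attempt = generate_step(prompt, response)  # propose a solution attempt
        response += attempt
        ok, critique = verify(prompt, response)    # the stronger verification skill
        if ok:
            break                                  # verified: stop (emit EOS)
        response += f"\nWait, that looks wrong: {critique}. Let me retry.\n"
    return response
```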
Our result? A SOTA < 2B model on AIME and HMMT’25 that extrapolates to 2x the training budget! We teach models to scale their reasoning with test-time compute 📈 using three key ingredients: 🧵[3/8]
Replies 1 · Reposts 0 · Likes 0
The ultimate promise of test-time scaling is extrapolation: the ability of LLMs to improve as they reason for longer than they were trained to. Most open-source models flat-line as test-time compute increases: more tokens, same performance 💔 they just can’t extrapolate 😔
Replies 1 · Reposts 0 · Likes 0
🚨 NEW PAPER: What if LLMs could tackle harder problems - not by explicitly training on longer traces, but by learning how to think longer? Our recipe e3 teaches models to explore in-context, enabling LLMs to unlock longer reasoning chains without ever seeing them in training.
Replies 1 · Reposts 5 · Likes 10
🤔 How do you explain that when we apply RL to math problems, the incorrect answers become longer than the correct ones? We had this discussion this morning, and I'm curious to know what the community thinks about it.
Replies 38 · Reposts 20 · Likes 185