Quanquan Gu
@QuanquanGu
Followers
17K
Following
36K
Media
134
Statuses
2K
Professor @UCLA, Pretraining and Scaling at ByteDance Seed | Recent work: Build AGI | Opinions are my own
Los Angeles, CA
Joined August 2017
The RPG is out. Make KL-regularized Policy Gradient Correct Again! No more GRPO or REINFORCE++ — their objectives and KL regularization are inherently inconsistent.
1/6 We introduce RPG, a principled framework for deriving and analyzing KL-regularized policy gradient methods. It unifies GRPO (with its k3 KL estimator) and REINFORCE++ and uncovers better RL objectives than GRPO. Paper: https://t.co/7xSUj01GIx Code:
2
16
214
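For context on the thread above, here is the generic KL-regularized objective these methods target and the exact score-function gradient it induces. This is a standard identity written in my own notation, not the paper's formulation.

```latex
% Generic KL-regularized RL objective (standard form, not RPG's exact notation):
J(\theta) \;=\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot\mid x)}\!\left[ r(x,y) \right]
  \;-\; \beta\, \mathbb{E}_{x \sim \mathcal{D}}
  \!\left[ \mathrm{KL}\!\left( \pi_\theta(\cdot\mid x) \,\middle\|\, \pi_{\mathrm{ref}}(\cdot\mid x) \right) \right].
% Its exact policy gradient (using \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta] = 0) is
\nabla_\theta J(\theta) \;=\; \mathbb{E}_{x,\; y \sim \pi_\theta}
  \!\left[ \nabla_\theta \log \pi_\theta(y\mid x)
  \left( r(x,y) - \beta \log \frac{\pi_\theta(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)} \right) \right].
% The thread's claim is that popular estimators (e.g. GRPO's k3 term) do not
% correspond to the gradient of the objective they claim to regularize.
```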
This is fantastic. Every university should consider doing the same. Teaching undergraduate students how modern AI systems actually work under the hood and making the fundamentals accessible early on is exactly what the field needs.
I'm teaching a new "Intro to Modern AI" course at CMU this Spring: https://t.co/ptnrNmVPyf. It's an early-undergrad course on how to build a chatbot from scratch (well, from PyTorch). The course name has bothered some people – "AI" usually means something much broader in academic…
1
2
81
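Not from the course materials; just a toy sketch of the starting point that "a chatbot from scratch (well, from PyTorch)" implies: a tiny autoregressive language model trained with next-token prediction (positional encodings omitted for brevity).

```python
# Toy sketch (not from the CMU course): a tiny autoregressive LM in plain PyTorch.
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    def __init__(self, vocab_size=256, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):                      # tokens: (batch, seq)
        seq_len = tokens.size(1)
        causal = nn.Transformer.generate_square_subsequent_mask(seq_len)
        h = self.blocks(self.embed(tokens), mask=causal)
        return self.lm_head(h)                      # logits: (batch, seq, vocab)

# Next-token prediction loss on a random batch, just to show the training signal.
model = TinyLM()
batch = torch.randint(0, 256, (2, 16))
logits = model(batch[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, 256),
                                   batch[:, 1:].reshape(-1))
loss.backward()
```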
🔥 Learning rate transfer under μP is now proven!
🎯 Just released a new preprint that proves LR transfer under μP. -> The Problem: When training large neural networks, one of the trickiest questions is: what learning rate should I use? [1/n]🧵 Link: https://t.co/cnYtpfVHpE
2
8
120
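For readers unfamiliar with the setup: μP prescribes width-dependent per-layer multipliers so that feature updates stay Θ(1) as the width n grows. The commonly quoted Adam-style rule is sketched below; this is my paraphrase of the standard recipe, not the preprint's theorem statement.

```latex
% Commonly quoted \mu P scaling rules at width n (sketch, Adam-style training):
\sigma^2_{\mathrm{init}} \;\propto\; \frac{1}{\mathrm{fan\_in}},
\qquad
\eta_{\mathrm{hidden}}(n) \;=\; \frac{\eta_{\mathrm{base}}}{n}.
% The upshot: the optimal \eta_{\mathrm{base}} is approximately width-independent,
% so it can be tuned on a small proxy model and reused at scale ("LR transfer").
```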
Ever wonder why the Chernoff bound feels like magic? A geometric answer: KL divergence loves exponential families. This post shares some reflections — and sets up a series on how KL geometry connects classical statistics, online learning (OCO), and more. https://t.co/nScAqj7l8s
6
45
303
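The textbook statement behind that "magic" (standard Cramér–Chernoff, not the post's own derivation): the optimized exponential-moment bound is exactly a KL divergence for exponential families, e.g. for Bernoulli samples.

```latex
% Cramer--Chernoff: for i.i.d. X_1,\dots,X_n and a > \mathbb{E}[X],
\Pr\!\Big( \tfrac{1}{n}\textstyle\sum_i X_i \ge a \Big)
  \;\le\; \exp\!\big( -n\, \Lambda^*(a) \big),
\qquad
\Lambda^*(a) \;=\; \sup_{\lambda > 0} \big( \lambda a - \log \mathbb{E}\, e^{\lambda X} \big).
% For an exponential family, \Lambda^*(a) = \mathrm{KL}(P_a \,\|\, P): the divergence
% between the tilted member with mean a and the true distribution. E.g. for
% X_i \sim \mathrm{Bernoulli}(p) and a > p:
\Pr\!\Big( \tfrac{1}{n}\textstyle\sum_i X_i \ge a \Big)
  \;\le\; \exp\!\Big( -n \big[ a \log\tfrac{a}{p} + (1-a)\log\tfrac{1-a}{1-p} \big] \Big).
```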
Curious how frameworks like nanochat actually scale? New blog post: Introduction to Parallelism in PyTorch. Covers async DDP, ZeRO-1/2, FSDP, and TP – with implementations from scratch and practical advice from real runs on different hardware. Even if you are experienced…
11
68
689
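The post covers async DDP, ZeRO-1/2, FSDP, and TP; as a point of reference, here is only the plain DDP baseline, a minimal sketch of my own rather than code from the post.

```python
# Minimal DDP sketch (my own toy example, not code from the blog post).
# Launch with: torchrun --nproc_per_node=NUM_GPUS this_file.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")                    # one process per GPU
    rank = dist.get_rank()
    device = torch.device("cuda", rank % torch.cuda.device_count())

    model = torch.nn.Linear(1024, 1024).to(device)
    model = DDP(model, device_ids=[device.index])      # grads all-reduced during backward
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

    for step in range(10):
        x = torch.randn(32, 1024, device=device)
        loss = model(x).pow(2).mean()
        loss.backward()                                # comm overlapped with compute (bucketed)
        opt.step()
        opt.zero_grad(set_to_none=True)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```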
I always get frustrated when asked what ML theory is good for and people ask for specific examples. I find this question unfair; it's really that having a theory/mathematical perspective is sometimes super helpful. E.g., diffusion models and their relatives: I don't see how…
No joke. Most people haven’t yet realized how powerful machine learning theory actually is. I’m speaking from the perspective of someone directly building AGI: it stabilizes both pretraining and RL, and it provides the blueprint for scaling all the way to AGI.
12
12
338
No joke. Most people haven’t yet realized how powerful machine learning theory actually is. I’m speaking from the perspective of someone directly building AGI: it stabilizes both pretraining and RL, and it provides the blueprint for scaling all the way to AGI.
14
4
160
Here is another compelling case highlighting why KL-regularized RL is indispensable.
Our latest post explores on-policy distillation, a training approach that unites the error-correcting relevance of RL with the reward density of SFT. When we apply it to math reasoning and to training an internal chat assistant, we find that on-policy distillation can outperform other…
7
10
142
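My shorthand for the objective being discussed, as a sketch rather than the post's exact formulation: sample from the student, then minimize the per-token reverse KL to the teacher on those samples.

```latex
% On-policy distillation, schematically: rollouts come from the student \pi_\theta,
% supervision is the per-token reverse KL to a teacher \pi_T on those rollouts.
\min_{\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot\mid x)}
\left[ \sum_{t}
  \mathrm{KL}\!\left( \pi_\theta(\cdot \mid x, y_{<t}) \,\middle\|\, \pi_T(\cdot \mid x, y_{<t}) \right)
\right].
% Same shape as the KL-regularization term in RLHF-style objectives, with the
% frozen reference policy replaced by the teacher.
```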
wow, if only there were RL algorithms with a (self-)distillation term for the reverse KLD that everyone keeps trying to remove. tl;dr: replace pi_ref with pi_teacher and you get on-policy distillation
Our latest post explores on-policy distillation, a training approach that unites the error-correcting relevance of RL with the reward density of SFT. When we apply it to math reasoning and to training an internal chat assistant, we find that on-policy distillation can outperform other…
11
14
262
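A toy sketch of the substitution described in the tweet above (my own code, not any specific library's API): the same per-token reverse-KL term is computed either against a frozen reference (the KL penalty in RL) or against a teacher (on-policy distillation).

```python
# Toy sketch of the pi_ref -> pi_teacher swap (not any specific library's API).
# Given logits on tokens the student itself sampled, the KL-regularization term
# becomes a distillation signal when the "other" model is a teacher.
import torch
import torch.nn.functional as F

def reverse_kl_per_token(student_logits, other_logits):
    """KL(pi_student || pi_other) at each position of a student rollout.

    student_logits, other_logits: (batch, seq, vocab) on the same rollouts.
    """
    logp_student = F.log_softmax(student_logits, dim=-1)
    logp_other = F.log_softmax(other_logits, dim=-1)
    # KL(p || q) = sum_v p_v * (log p_v - log q_v), summed over the vocabulary.
    return (logp_student.exp() * (logp_student - logp_other)).sum(dim=-1)

# RL with a KL penalty: other_logits come from the frozen reference model.
# On-policy distillation: other_logits come from the teacher model instead.
```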
I find it fascinating that momentum in standard convex optimization is just about making convergence faster, but in nonconvex problems, it's sometimes the only way a method can work at all. Just saw a new example of this phenomenon in the case of difference-of-convex functions.
3
15
134
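For concreteness, the heavy-ball form of momentum in question (the generic update, not the new difference-of-convex result):

```latex
% Heavy-ball / Polyak momentum on a differentiable objective f:
x_{k+1} \;=\; x_k \;-\; \eta\, \nabla f(x_k) \;+\; \beta\,\big( x_k - x_{k-1} \big),
\qquad 0 \le \beta < 1.
% In smooth convex problems the \beta-term mainly improves the rate; the tweet's
% point is that in some nonconvex settings it is what makes the method converge at all.
```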
I used ChatGPT to solve an open problem in convex optimization. *Part I* (1/N)
85
356
2K
Very satisfied with some neat results on imitation learning. When distribution matching isn’t possible, what’s even the role of demonstrations? Cloning/log-loss minimization? We propose directly encoding reward structure—motivating new algorithmic ideas.
arxiv.org
We study the problem of learning to generate an answer (or completion) to a question (or prompt), where there could be multiple correct answers, any one of which is acceptable at test time....
4
7
66
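My gloss of the two baselines the tweet contrasts, not the paper's notation: distribution matching versus cloning by log-loss.

```latex
% Distribution matching: drive the policy toward the demonstrator's conditional law,
\min_{\theta}\; \mathbb{E}_{x}\!\left[ D\!\left( p_{\mathrm{demo}}(\cdot\mid x) \,\|\, \pi_\theta(\cdot\mid x) \right) \right],
% versus cloning / log-loss minimization on demonstrated answers y^\star:
\min_{\theta}\; \mathbb{E}_{(x,\,y^\star)}\!\left[ -\log \pi_\theta(y^\star\mid x) \right].
% The tweet asks what demonstrations buy you when the first is unattainable,
% and proposes encoding reward structure directly instead.
```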
In this note w/ @beenwrekt we look at RL problems with 0/1 rewards, showing that popular methods maximize the average (transformed) probability of correctly answering a prompt x: max_θ 𝔼ₓ h(Prob(correct ∣ x; θ)) for certain functions h. Weirdly, h is arcsin(√t) in GRPO.
10
39
360
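Restating the tweet's inline formula in display form (the claim is the note's, not mine):

```latex
% With 0/1 rewards and p_\theta(x) := \Pr(\text{correct} \mid x;\, \theta),
% the note's claim is that popular methods optimize
\max_{\theta}\; \mathbb{E}_{x}\!\left[ h\big( p_\theta(x) \big) \right]
% for some increasing h: h(t) = t would be plain expected accuracy, while the
% note finds h(t) = \arcsin\!\big(\sqrt{t}\,\big) for GRPO.
```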
DLLMs seem promising... but parallel generation is not always possible. Diffusion-based LLMs can generate many tokens at different positions at once, while most autoregressive LLMs generate tokens one by one. This makes diffusion-based LLMs highly attractive when we need fast…
12
52
332
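A schematic contrast between the two decoding regimes mentioned above (a sketch, not the specific paper's notation):

```latex
% Autoregressive decoding: one new token per model call,
p_\theta(y \mid x) \;=\; \prod_{t=1}^{T} p_\theta\big( y_t \mid x,\, y_{<t} \big).
% Masked-diffusion-style decoding: each of K \ll T refinement steps predicts all
% currently masked positions in parallel from the partial sequence \hat{y}^{(k)},
p_\theta\big( y_t \mid x,\, \hat{y}^{(k)} \big) \quad \text{for all masked } t,\qquad k = 1,\dots,K.
% This is where the potential speedup, and the "not always possible" caveat, lives.
```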
1/8 Second-order optimizers like SOAP and Muon have shown impressive performance on LLM optimization. But are we fully utilizing the potential of second-order information? New work: we show that a full second-order optimizer is much better than existing optimizers in terms of…
26
80
595
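For reference, the update a "full second-order optimizer" idealizes, versus a first-order step (a generic statement; the thread's method and its approximations are in the paper):

```latex
% Damped Newton step: precondition the gradient with the (regularized) inverse Hessian,
\theta_{k+1} \;=\; \theta_k \;-\; \eta\, \big( \nabla^2 L(\theta_k) + \lambda I \big)^{-1} \nabla L(\theta_k),
% versus the first-order step \theta_{k+1} = \theta_k - \eta\, \nabla L(\theta_k).
% Methods like SOAP and Muon are often viewed as structured approximations to such
% preconditioning; the thread asks how much is lost relative to full second-order information.
```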
There's been significant recent progress in the NanoGPT speedrun. Highly recommend this post by @classiclarryd
https://t.co/QNoI4wVAJg
lesswrong.com
In early 2024 Andrej Karpathy stood up an llm.c repo to train GPT-2 (124M), which took an equivalent of 45 minutes on 8xH100 GPUs to reach 3.28 cross…
11
57
514
(1/N) 🚀 Excited to share our new work on inference scaling algorithms! For challenging reasoning tasks, single-shot selection often falls short — even strong models can miss the right answer on their first try. That’s why evaluations typically report Pass@k, where an agent…
4
4
12
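For context, the standard unbiased Pass@k estimator usually reported in these evaluations (the common definition, not something specific to this thread):

```latex
% Draw n \ge k samples per problem and count c correct; the unbiased estimator is
\mathrm{Pass@}k \;=\; \mathbb{E}_{\text{problems}}
  \left[ 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \right],
% i.e. the probability that at least one of k drawn samples is correct.
```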
Excited to announce our NeurIPS ’25 tutorial: Foundations of Imitation Learning: From Language Modeling to Continuous Control. With Adam Block & Max Simchowitz (@max_simchowitz)
6
50
359
1/n We introduce MARS-M, which extends our variance-reduction framework, MARS, to the matrix-based optimizer Muon (Moonlight). MARS-M demonstrates consistent performance gains over Muon in LLM pretraining tasks. GitHub repo: https://t.co/vuubxbk3iQ
4
12
67
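The generic shape of the variance-reduced gradient such frameworks build on, as a schematic of a recursive-momentum-style correction rather than MARS-M's exact formula: the current and previous iterates are evaluated on the same minibatch, and their difference corrects the momentum buffer before it is handed to the base optimizer (here, Muon's matrix update).

```latex
% Schematic variance-reduced gradient estimate (not MARS-M's exact formula):
c_t \;=\; \nabla f(x_t;\, \xi_t) \;+\; \gamma\,\big( \nabla f(x_t;\, \xi_t) - \nabla f(x_{t-1};\, \xi_t) \big),
\qquad
m_t \;=\; \beta\, m_{t-1} + (1-\beta)\, c_t.
% Both gradients share the same minibatch \xi_t; the corrected momentum m_t is
% then passed to the base optimizer's update rule (for Muon, its matrix step).
```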