Quanquan Gu
@QuanquanGu
Followers
17K
Following
36K
Media
134
Statuses
2K
Professor @UCLA, Pretraining and Scaling at ByteDance Seed | Recent work: Build AGI | Opinions are my own
Los Angeles, CA
Joined August 2017
The RPG is out. Make KL-regularized Policy Gradient Correct Again! No more GRPO or REINFORCE++ — their objectives and KL regularization are inherently inconsistent.
1/6 We introduce RPG, a principled framework for deriving and analyzing KL-regularized policy gradient methods. It unifies GRPO (with its k3 KL estimator) and REINFORCE++ and uncovers better RL objectives than GRPO. Paper: https://t.co/7xSUj01GIx Code:
2
16
214
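For context on the thread above, here is the generic KL-regularized objective these methods target and the exact score-function gradient it induces. This is a standard identity written in my own notation, not the paper's formulation.

```latex
% Generic KL-regularized RL objective (standard form, not RPG's exact notation):
J(\theta) \;=\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot\mid x)}\!\left[ r(x,y) \right]
  \;-\; \beta\, \mathbb{E}_{x \sim \mathcal{D}}
  \!\left[ \mathrm{KL}\!\left( \pi_\theta(\cdot\mid x) \,\middle\|\, \pi_{\mathrm{ref}}(\cdot\mid x) \right) \right].
% Its exact policy gradient (using \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta] = 0) is
\nabla_\theta J(\theta) \;=\; \mathbb{E}_{x,\; y \sim \pi_\theta}
  \!\left[ \nabla_\theta \log \pi_\theta(y\mid x)
  \left( r(x,y) - \beta \log \frac{\pi_\theta(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)} \right) \right].
% The thread's claim is that popular estimators (e.g. GRPO's k3 term) do not
% correspond to the gradient of the objective they claim to regularize.
```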
This is fantastic. Every university should consider doing the same. Teaching undergraduate students how modern AI systems actually work under the hood and making the fundamentals accessible early on is exactly what the field needs.
I'm teaching a new "Intro to Modern AI" course at CMU this Spring: https://t.co/ptnrNmVPyf. It's an early-undergrad course on how to build a chatbot from scratch (well, from PyTorch). The course name has bothered some people – "AI" usually means something much broader in academic…
1
2
81
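Not from the course materials; just a toy sketch of the starting point that "a chatbot from scratch (well, from PyTorch)" implies: a tiny autoregressive language model trained with next-token prediction (positional encodings omitted for brevity).

```python
# Toy sketch (not from the CMU course): a tiny autoregressive LM in plain PyTorch.
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    def __init__(self, vocab_size=256, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):                      # tokens: (batch, seq)
        seq_len = tokens.size(1)
        causal = nn.Transformer.generate_square_subsequent_mask(seq_len)
        h = self.blocks(self.embed(tokens), mask=causal)
        return self.lm_head(h)                      # logits: (batch, seq, vocab)

# Next-token prediction loss on a random batch, just to show the training signal.
model = TinyLM()
batch = torch.randint(0, 256, (2, 16))
logits = model(batch[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, 256),
                                   batch[:, 1:].reshape(-1))
loss.backward()
```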
🔥 Learning rate transfer under μP is now proven!
🎯 Just released a new preprint that proves LR transfer under μP. -> The Problem: When training large neural networks, one of the trickiest questions is: what learning rate should I use? [1/n]🧵 Link: https://t.co/cnYtpfVHpE
2
8
120
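For readers unfamiliar with the setup: μP prescribes width-dependent per-layer multipliers so that feature updates stay Θ(1) as the width n grows. The commonly quoted Adam-style rule is sketched below; this is my paraphrase of the standard recipe, not the preprint's theorem statement.

```latex
% Commonly quoted \mu P scaling rules at width n (sketch, Adam-style training):
\sigma^2_{\mathrm{init}} \;\propto\; \frac{1}{\mathrm{fan\_in}},
\qquad
\eta_{\mathrm{hidden}}(n) \;=\; \frac{\eta_{\mathrm{base}}}{n}.
% The upshot: the optimal \eta_{\mathrm{base}} is approximately width-independent,
% so it can be tuned on a small proxy model and reused at scale ("LR transfer").
```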
Ever wonder why the Chernoff bound feels like magic? A geometric answer: KL divergence loves exponential families. This post shares some reflections — and sets up a series on how KL geometry connects classical statistics, online learning (OCO), and more. https://t.co/nScAqj7l8s
6
45
303
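The textbook statement behind that "magic" (standard Cramér–Chernoff, not the post's own derivation): the optimized exponential-moment bound is exactly a KL divergence for exponential families, e.g. for Bernoulli samples.

```latex
% Cramer--Chernoff: for i.i.d. X_1,\dots,X_n and a > \mathbb{E}[X],
\Pr\!\Big( \tfrac{1}{n}\textstyle\sum_i X_i \ge a \Big)
  \;\le\; \exp\!\big( -n\, \Lambda^*(a) \big),
\qquad
\Lambda^*(a) \;=\; \sup_{\lambda > 0} \big( \lambda a - \log \mathbb{E}\, e^{\lambda X} \big).
% For an exponential family, \Lambda^*(a) = \mathrm{KL}(P_a \,\|\, P): the divergence
% between the tilted member with mean a and the true distribution. E.g. for
% X_i \sim \mathrm{Bernoulli}(p) and a > p:
\Pr\!\Big( \tfrac{1}{n}\textstyle\sum_i X_i \ge a \Big)
  \;\le\; \exp\!\Big( -n \big[ a \log\tfrac{a}{p} + (1-a)\log\tfrac{1-a}{1-p} \big] \Big).
```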
Curious how frameworks like nanochat actually scale? New blog post: Introduction to Parallelism in PyTorch. Covers async DDP, ZeRO-1/2, FSDP, and TP – with implementations from scratch and practical advice from real runs on different hardware. Even if you are experienced…
11
68
689
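The post covers async DDP, ZeRO-1/2, FSDP, and TP; as a point of reference, here is only the plain DDP baseline, a minimal sketch of my own rather than code from the post.

```python
# Minimal DDP sketch (my own toy example, not code from the blog post).
# Launch with: torchrun --nproc_per_node=NUM_GPUS this_file.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")                    # one process per GPU
    rank = dist.get_rank()
    device = torch.device("cuda", rank % torch.cuda.device_count())

    model = torch.nn.Linear(1024, 1024).to(device)
    model = DDP(model, device_ids=[device.index])      # grads all-reduced during backward
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

    for step in range(10):
        x = torch.randn(32, 1024, device=device)
        loss = model(x).pow(2).mean()
        loss.backward()                                # comm overlapped with compute (bucketed)
        opt.step()
        opt.zero_grad(set_to_none=True)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```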
I always get frustrated when asked what ML theory is good for and people ask for specific examples. I find this question unfair; it's really that having a theory/mathematical perspective is sometimes super helpful. E.g., diffusion models and their relatives: I don't see how…
No joke. Most people haven’t yet realized how powerful machine learning theory actually is. I’m speaking from the perspective of someone directly building AGI: it stabilizes both pretraining and RL, and it provides the blueprint for scaling all the way to AGI.
12
12
338
No joke. Most people haven’t yet realized how powerful machine learning theory actually is. I’m speaking from the perspective of someone directly building AGI: it stabilizes both pretraining and RL, and it provides the blueprint for scaling all the way to AGI.
14
4
160
Here is another compelling case highlighting why KL-regularized RL is indispensable.
Our latest post explores on-policy distillation, a training approach that unites the error-correcting relevance of RL with the reward density of SFT. When we apply it to math reasoning and to training an internal chat assistant, we find that on-policy distillation can outperform other…
7
10
142
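My shorthand for the objective being discussed, as a sketch rather than the post's exact formulation: sample from the student, then minimize the per-token reverse KL to the teacher on those samples.

```latex
% On-policy distillation, schematically: rollouts come from the student \pi_\theta,
% supervision is the per-token reverse KL to a teacher \pi_T on those rollouts.
\min_{\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot\mid x)}
\left[ \sum_{t}
  \mathrm{KL}\!\left( \pi_\theta(\cdot \mid x, y_{<t}) \,\middle\|\, \pi_T(\cdot \mid x, y_{<t}) \right)
\right].
% Same shape as the KL-regularization term in RLHF-style objectives, with the
% frozen reference policy replaced by the teacher.
```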
wow, if only there were RL algorithms with a (self-)distillation term for the reverse KLD that everyone keeps trying to remove. tl;dr: replace pi_ref with pi_teacher and you get on-policy distillation
Our latest post explores on-policy distillation, a training approach that unites the error-correcting relevance of RL with the reward density of SFT. When we apply it to math reasoning and to training an internal chat assistant, we find that on-policy distillation can outperform other…
11
14
262
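A toy sketch of the substitution described in the tweet above (my own code, not any specific library's API): the same per-token reverse-KL term is computed either against a frozen reference (the KL penalty in RL) or against a teacher (on-policy distillation).

```python
# Toy sketch of the pi_ref -> pi_teacher swap (not any specific library's API).
# Given logits on tokens the student itself sampled, the KL-regularization term
# becomes a distillation signal when the "other" model is a teacher.
import torch
import torch.nn.functional as F

def reverse_kl_per_token(student_logits, other_logits):
    """KL(pi_student || pi_other) at each position of a student rollout.

    student_logits, other_logits: (batch, seq, vocab) on the same rollouts.
    """
    logp_student = F.log_softmax(student_logits, dim=-1)
    logp_other = F.log_softmax(other_logits, dim=-1)
    # KL(p || q) = sum_v p_v * (log p_v - log q_v), summed over the vocabulary.
    return (logp_student.exp() * (logp_student - logp_other)).sum(dim=-1)

# RL with a KL penalty: other_logits come from the frozen reference model.
# On-policy distillation: other_logits come from the teacher model instead.
```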
I find it fascinating that momentum in standard convex optimization is just about making convergence faster, but in nonconvex problems, it's sometimes the only way a method can work at all. Just saw a new example of this phenomenon in the case of difference-of-convex functions.
3
15
134
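For concreteness, the heavy-ball form of momentum in question (the generic update, not the new difference-of-convex result):

```latex
% Heavy-ball / Polyak momentum on a differentiable objective f:
x_{k+1} \;=\; x_k \;-\; \eta\, \nabla f(x_k) \;+\; \beta\,\big( x_k - x_{k-1} \big),
\qquad 0 \le \beta < 1.
% In smooth convex problems the \beta-term mainly improves the rate; the tweet's
% point is that in some nonconvex settings it is what makes the method converge at all.
```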
I used ChatGPT to solve an open problem in convex optimization. *Part I* (1/N)
85
356
2K
Very satisfied with some neat results on imitation learning. When distribution matching isn’t possible, what’s even the role of demonstrations? Cloning/log-loss minimization? We propose directly encoding reward structure—motivating new algorithmic ideas.
arxiv.org
We study the problem of learning to generate an answer (or completion) to a question (or prompt), where there could be multiple correct answers, any one of which is acceptable at test time....
4
7
66
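My gloss of the two baselines the tweet contrasts, not the paper's notation: distribution matching versus cloning by log-loss.

```latex
% Distribution matching: drive the policy toward the demonstrator's conditional law,
\min_{\theta}\; \mathbb{E}_{x}\!\left[ D\!\left( p_{\mathrm{demo}}(\cdot\mid x) \,\|\, \pi_\theta(\cdot\mid x) \right) \right],
% versus cloning / log-loss minimization on demonstrated answers y^\star:
\min_{\theta}\; \mathbb{E}_{(x,\,y^\star)}\!\left[ -\log \pi_\theta(y^\star\mid x) \right].
% The tweet asks what demonstrations buy you when the first is unattainable,
% and proposes encoding reward structure directly instead.
```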
In this note w/ @beenwrekt we look at RL problems with 0/1 rewards, showing that popular methods maximize the average (transformed) probability of correctly answering a prompt x: max_θ 𝔼ₓ h(Prob(correct ∣ x; θ)) for certain functions h. Weirdly, h is arcsin(√t) in GRPO.
10
39
360
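Restating the tweet's inline formula in display form (the claim is the note's, not mine):

```latex
% With 0/1 rewards and p_\theta(x) := \Pr(\text{correct} \mid x;\, \theta),
% the note's claim is that popular methods optimize
\max_{\theta}\; \mathbb{E}_{x}\!\left[ h\big( p_\theta(x) \big) \right]
% for some increasing h: h(t) = t would be plain expected accuracy, while the
% note finds h(t) = \arcsin\!\big(\sqrt{t}\,\big) for GRPO.
```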
DLLMs seem promising... but parallel generation is not always possible. Diffusion-based LLMs can generate many tokens at different positions at once, while most autoregressive LLMs generate tokens one by one. This makes diffusion-based LLMs highly attractive when we need fast…
12
52
332
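A schematic contrast between the two decoding regimes mentioned above (a sketch, not the specific paper's notation):

```latex
% Autoregressive decoding: one new token per model call,
p_\theta(y \mid x) \;=\; \prod_{t=1}^{T} p_\theta\big( y_t \mid x,\, y_{<t} \big).
% Masked-diffusion-style decoding: each of K \ll T refinement steps predicts all
% currently masked positions in parallel from the partial sequence \hat{y}^{(k)},
p_\theta\big( y_t \mid x,\, \hat{y}^{(k)} \big) \quad \text{for all masked } t,\qquad k = 1,\dots,K.
% This is where the potential speedup, and the "not always possible" caveat, lives.
```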
1/8 Second-order optimizers like SOAP and Muon have shown impressive performance on LLM optimization. But are we fully utilizing the potential of second-order information? New work: we show that a full second-order optimizer is much better than existing optimizers in terms of…
26
80
595
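For reference, the update a "full second-order optimizer" idealizes, versus a first-order step (a generic statement; the thread's method and its approximations are in the paper):

```latex
% Damped Newton step: precondition the gradient with the (regularized) inverse Hessian,
\theta_{k+1} \;=\; \theta_k \;-\; \eta\, \big( \nabla^2 L(\theta_k) + \lambda I \big)^{-1} \nabla L(\theta_k),
% versus the first-order step \theta_{k+1} = \theta_k - \eta\, \nabla L(\theta_k).
% Methods like SOAP and Muon are often viewed as structured approximations to such
% preconditioning; the thread asks how much is lost relative to full second-order information.
```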
There's been significant recent progress in the NanoGPT speedrun. Highly recommend this post by @classiclarryd
https://t.co/QNoI4wVAJg
lesswrong.com
In early 2024 Andrej Karpathy stood up an llm.c repo to train GPT-2 (124M), which took an equivalent of 45 minutes on 8xH100 GPUs to reach 3.28 cross…
11
57
514
(1/N) 🚀 Excited to share our new work on inference scaling algorithms! For challenging reasoning tasks, single-shot selection often falls short — even strong models can miss the right answer on their first try. That’s why evaluations typically report Pass@k, where an agent…
4
4
12
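For context, the standard unbiased Pass@k estimator usually reported in these evaluations (the common definition, not something specific to this thread):

```latex
% Draw n \ge k samples per problem and count c correct; the unbiased estimator is
\mathrm{Pass@}k \;=\; \mathbb{E}_{\text{problems}}
  \left[ 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \right],
% i.e. the probability that at least one of k drawn samples is correct.
```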
Excited to announce our NeurIPS ’25 tutorial: Foundations of Imitation Learning: From Language Modeling to Continuous Control. With Adam Block & Max Simchowitz (@max_simchowitz)
6
50
359
1/n We introduce MARS-M, which extends our variance-reduction framework, MARS, to the matrix-based optimizer Muon (Moonlight). MARS-M demonstrates consistent performance gains over Muon in LLM pretraining tasks. GitHub repo: https://t.co/vuubxbk3iQ
4
12
67
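The generic shape of the variance-reduced gradient such frameworks build on, as a schematic of a recursive-momentum-style correction rather than MARS-M's exact formula: the current and previous iterates are evaluated on the same minibatch, and their difference corrects the momentum buffer before it is handed to the base optimizer (here, Muon's matrix update).

```latex
% Schematic variance-reduced gradient estimate (not MARS-M's exact formula):
c_t \;=\; \nabla f(x_t;\, \xi_t) \;+\; \gamma\,\big( \nabla f(x_t;\, \xi_t) - \nabla f(x_{t-1};\, \xi_t) \big),
\qquad
m_t \;=\; \beta\, m_{t-1} + (1-\beta)\, c_t.
% Both gradients share the same minibatch \xi_t; the corrected momentum m_t is
% then passed to the base optimizer's update rule (for Muon, its matrix step).
```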