Quanquan Gu

@QuanquanGu

Followers
17K
Following
36K
Media
134
Statuses
2K

Professor @UCLA, Pretraining and Scaling at ByteDance Seed | Recent work: Build AGI | Opinions are my own

Los Angeles, CA
Joined August 2017
@QuanquanGu
Quanquan Gu
6 months
The RPG paper is out. Make KL-regularized Policy Gradient Correct Again! No more GRPO or REINFORCE++: their objectives and KL regularization are inherently inconsistent.
@YIFENGLIU_AI
YIFENG LIU
6 months
1/6 We introduce RPG, a principled framework for deriving and analyzing KL-regularized policy gradient methods. It unifies GRPO (with its k3 estimator) and REINFORCE++ under one framework and discovers better RL objectives than GRPO. Paper: https://t.co/7xSUj01GIx Code:
2
16
214
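For readers outside RL post-training, a minimal statement of the objective this thread is about; the notation below (β, π_ref, the k3 estimator) is the standard one, not taken from the RPG paper itself. The claimed inconsistency is that GRPO-style methods add a sample-based KL estimate such as k3 to the loss and differentiate it, which in general does not recover the gradient of the KL-regularized objective.

```latex
% KL-regularized RL objective (standard form)
\max_\theta \; J(\theta)
  = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\big[ r(x, y) \big]
  - \beta \, \mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)

% GRPO's k3 estimator of the KL term, with t = \pi_{\mathrm{ref}}(y \mid x) / \pi_\theta(y \mid x):
k_3(y) = t - 1 - \log t
```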
@QuanquanGu
Quanquan Gu
4 days
This is fantastic. Every university should consider doing the same. Teaching undergraduate students how modern AI systems actually work under the hood and making the fundamentals accessible early on is exactly what the field needs.
@zicokolter
Zico Kolter
4 days
I'm teaching a new "Intro to Modern AI" course at CMU this Spring: https://t.co/ptnrNmVPyf. It's an early-undergrad course on how to build a chatbot from scratch (well, from PyTorch). The course name has bothered some people – "AI" usually means something much broader in academic …
1
2
81
@QuanquanGu
Quanquan Gu
10 days
🔥 Learning rate transfer under μP is now proven!
@hayou_soufiane
Soufiane Hayou
10 days
🎯 Just released a new preprint that proves LR transfer under μP. -> The Problem: When training large neural networks, one of the trickiest questions is: what learning rate should I use? [1/n]🧵 Link: https://t.co/cnYtpfVHpE
2
8
120
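As background for what "LR transfer under μP" means in practice: you tune the learning rate on a small proxy model and reuse it at larger width, with per-layer scaling. Below is a hypothetical, heavily simplified sketch of that recipe for Adam; the width-scaling rule follows the usual μP prescription for matrix-like weights, but full μP also adjusts init scales, embeddings, and the output head, all omitted here.

```python
# Simplified muP-style per-layer LR scaling for Adam (illustrative only).
# Coarse rule: every 2-D weight gets its LR scaled by base_width / width;
# 1-D params (biases, norm gains) keep the base LR.
import torch

def mup_param_groups(model: torch.nn.Module, base_lr: float,
                     base_width: int, width: int):
    matrix_like = [p for p in model.parameters() if p.ndim >= 2]
    vector_like = [p for p in model.parameters() if p.ndim < 2]
    return [
        {"params": matrix_like, "lr": base_lr * base_width / width},
        {"params": vector_like, "lr": base_lr},
    ]

# Usage: tune base_lr at width 256, then reuse it at width 4096:
# opt = torch.optim.AdamW(mup_param_groups(model, base_lr=3e-4,
#                                          base_width=256, width=4096))
```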
@MadhowSunil
Sunil Madhow
12 days
Ever wonder why the Chernoff bound feels like magic? A geometric answer: KL divergence loves exponential families. This post shares some reflections — and sets up a series on how KL geometry connects classical statistics, online learning (OCO), and more. https://t.co/nScAqj7l8s
6
45
303
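The standard instance of the connection the post describes, stated for Bernoulli variables (a textbook fact, not a quote from the post):

```latex
% Cramer--Chernoff for X_1, \dots, X_n iid Ber(\mu) and a \ge \mu:
\Pr\big( \bar X_n \ge a \big)
  \le \exp\Big( -n \sup_{\lambda \ge 0} \big\{ \lambda a - \log \mathbb{E}\, e^{\lambda X_1} \big\} \Big)
  = \exp\big( -n \, \mathrm{KL}( \mathrm{Ber}(a) \,\|\, \mathrm{Ber}(\mu) ) \big)
```

The optimal λ exponentially tilts Ber(μ) into the exponential-family member with mean a, and the Legendre transform of the cumulant generating function is exactly that KL divergence, which is one sense in which "KL divergence loves exponential families."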
@iamgrigorev
George Grigorev
14 days
Curious how frameworks like nanochat actually scale? New blog post: Introduction to Parallelism in PyTorch. Covers async DDP, ZeRO-1/2, FSDP, and TP – with implementations from scratch and practical advice from real runs on different hardware. Even if you are experienced, …
11
68
689
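For orientation, here is a minimal runnable version of the simplest technique the post covers, plain DDP; the model and training loop are placeholders, not code from the blog post.

```python
# Minimal DistributedDataParallel loop; launch with:
#   torchrun --nproc_per_node=8 ddp_min.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")            # torchrun sets rank/world-size env vars
    device = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(device)

    model = DDP(torch.nn.Linear(1024, 1024).to(device), device_ids=[device])
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

    for _ in range(10):
        x = torch.randn(32, 1024, device=device)  # stand-in for a real batch
        loss = model(x).square().mean()
        opt.zero_grad()
        loss.backward()                            # DDP all-reduces grad buckets here
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```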
@jasondeanlee
Jason Lee
16 days
I always get frustrated when asked what ML theory is good for and people ask for specific examples. I find this question unfair; I think it's really that just having a theory/mathematical perspective is sometimes super helpful. E.g., diffusion models and their relatives: I don't see how …
@QuanquanGu
Quanquan Gu
16 days
No joke. Most people haven’t yet realized how powerful machine learning theory actually is. I’m speaking from the perspective of someone directly building AGI: it stabilizes both pretraining and RL, and it provides the blueprint for scaling all the way to AGI.
12
12
338
@QuanquanGu
Quanquan Gu
16 days
Machine learning theory.
@wzhao_nlp
Wenting Zhao
17 days
The question I got asked most frequently during COLM this year was what research questions can be studied in academia that will also be relevant to frontier labs. So I'm making a talk for this. What topics/areas should I cover? RL, eval, pretraining, …?
2
6
144
@QuanquanGu
Quanquan Gu
18 days
Here is another compelling case highlighting why KL-regularized RL is indispensable.
@thinkymachines
Thinking Machines
18 days
Our latest post explores on-policy distillation, a training approach that unites the error-correcting relevance of RL with the reward density of SFT. When training it for math reasoning and as an internal chat assistant, we find that on-policy distillation can outperform other …
7
10
142
@shxf0072
Joey (e/λ)
18 days
wow, if only there were an RL algorithm with a (self-)distillation term for the reverse KLD... the very term everyone keeps trying to remove. tl;dr: replace pi_ref with pi_teacher and you get on-policy distillation.
@thinkymachines
Thinking Machines
18 days
Our latest post explores on-policy distillation, a training approach that unites the error-correcting relevance of RL with the reward density of SFT. When training it for math reasoning and as an internal chat assistant, we find that on-policy distillation can outperform other …
11
14
262
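The "replace pi_ref with pi_teacher" observation, as a sketch: on-policy distillation scores sequences sampled from the student by their per-token reverse KL to a teacher. Shapes and names below are illustrative assumptions, not Thinking Machines' implementation.

```python
# Per-token reverse KL(pi_student || pi_teacher) on student-sampled tokens.
import torch.nn.functional as F

def reverse_kl_per_token(student_logits, teacher_logits):
    # both: [batch, seq, vocab], computed on the same student-sampled sequence
    logp_s = F.log_softmax(student_logits, dim=-1)
    logp_t = F.log_softmax(teacher_logits, dim=-1)
    return (logp_s.exp() * (logp_s - logp_t)).sum(-1)   # [batch, seq]

# Minimizing the mean of this is the dense per-token training signal;
# structurally it is the KL penalty of KL-regularized RL with the reference
# model swapped for the teacher:
# loss = reverse_kl_per_token(student_logits, teacher_logits).mean()
```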
@konstmish
Konstantin Mishchenko
25 days
I find it fascinating that momentum in standard convex optimization is just about making convergence faster, but in nonconvex problems, it's sometimes the only way a method can work at all. Just saw a new example of this phenomenon in the case of difference-of-convex functions.
3
15
134
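For reference, the heavy-ball update the tweet has in mind, in standard notation (the difference-of-convex result itself is from the paper mentioned and is not reproduced here):

```latex
v_{t+1} = \beta v_t - \eta \nabla f(x_t), \qquad x_{t+1} = x_t + v_{t+1}
```

In smooth convex optimization, choosing β > 0 buys a faster rate; the observation above is that for some nonconvex classes, guarantees may hold only with β > 0.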
@ErnestRyu
Ernest Ryu
24 days
I used ChatGPT to solve an open problem in convex optimization. *Part I* (1/N)
85
356
2K
@nirmitj_
Nirmit Joshi
26 days
Very satisfied with some neat results on imitation learning. When distribution matching isn’t possible, what’s even the role of demonstrations? Cloning/log-loss minimization? We propose directly encoding reward structure—motivating new algorithmic ideas.
arxiv.org
We study the problem of learning to generate an answer (or completion) to a question (or prompt), where there could be multiple correct answers, any one of which is acceptable at test time....
4
7
66
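For context, "cloning / log-loss minimization" in its standard form (textbook notation, not the paper's):

```latex
% Behavior cloning on expert demonstrations (x, y^*):
\min_\theta \; \mathbb{E}_{(x, y^*)}\big[ -\log \pi_\theta( y^* \mid x ) \big]
```

This targets matching the demonstrator's distribution; the thread asks what demonstrations are good for when such matching is impossible, and proposes encoding reward structure instead.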
@CsabaSzepesvari
Csaba Szepesvari
26 days
@karpathy I think it would be good to distinguish RL as a problem from the algorithms that people use to address RL problems. This would allow us to discuss whether the problem is with the algorithms, or with posing a problem as an RL problem. 1/x
9
38
416
@damekdavis
Damek
1 month
In this note w/ @beenwrekt we look at RL problems with 0/1 rewards, showing that popular methods maximize the average (transformed) probability of correctly answering a prompt x: max_θ 𝔼ₓ h(Prob(correct ∣ x; θ)) for certain functions h. Weirdly, h is arcsin(√t) in GRPO.
10
39
360
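The note's objective family, restated in display form:

```latex
% With 0/1 rewards and p_\theta(x) = \Pr(\text{correct} \mid x;\, \theta):
\max_\theta \; \mathbb{E}_{x}\big[ h\big( p_\theta(x) \big) \big],
\qquad h_{\mathrm{GRPO}}(t) = \arcsin\!\big( \sqrt{t} \big)
```

Different choices of h reweight how much probability improvements on hard versus easy prompts count toward the objective.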
@Kangwook_Lee
Kangwook Lee
29 days
DLLMs seem promising... but parallel generation is not always possible. Diffusion-based LLMs can generate many tokens at different positions at once, while most autoregressive LLMs generate tokens one by one. This makes diffusion-based LLMs highly attractive when we need fast …
12
52
332
@ShamKakade6
Sham Kakade
1 month
1/8 Second-order optimizers like SOAP and Muon have shown impressive performance on LLM optimization. But are we fully utilizing the potential of second-order information? New work: we show that a full second-order optimizer is much better than existing optimizers in terms of …
26
80
595
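Background for "second-order optimizers like SOAP and Muon": Muon's core step orthogonalizes each weight matrix's momentum-averaged gradient with a Newton-Schulz iteration. The sketch below follows the coefficients in the public Muon implementation and is illustrative background, not the full second-order optimizer this thread announces.

```python
# Newton-Schulz orthogonalization of a gradient matrix (Muon's inner step).
import torch

@torch.no_grad()
def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    a, b, c = 3.4445, -4.7750, 2.0315        # quintic iteration coefficients
    X = G / (G.norm() + 1e-7)                # scale so singular values are <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                               # iterate on the short side
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X   # drives singular values toward 1
    return X.T if transposed else X
```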
@di_qiwei
QIWEI DI
1 month
(1/N) 🚀 Excited to share our new work on inference scaling algorithms! For challenging reasoning tasks, single-shot selection often falls short — even strong models can miss the right answer on their first try. That’s why evaluations typically report Pass@k, where an agent …
4
4
12
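Pass@k, mentioned in the thread, is usually computed with the unbiased estimator from the Codex paper (Chen et al., 2021):

```latex
% Sample n \ge k completions per prompt; let c be the number that are correct:
\widehat{\mathrm{pass}@k} = 1 - \binom{n-c}{k} \bigg/ \binom{n}{k}
```

This is the probability that at least one of k completions drawn without replacement from the n samples is correct (taken as 1 when n - c < k).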
@canondetortugas
Dylan Foster 🐢
1 month
Excited to announce our NeurIPS ’25 tutorial: Foundations of Imitation Learning: From Language Modeling to Continuous Control With Adam Block & Max Simchowitz (@max_simchowitz)
6
50
359
@YIFENGLIU_AI
YIFENG LIU
1 month
1/n We introduce MARS-M, which extends our variance-reduction framework MARS to the matrix-based optimizer Muon (Moonlight). MARS-M demonstrates consistent performance gains over Muon in LLM pretraining tasks. GitHub repo: https://t.co/vuubxbk3iQ
4
12
67
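For readers new to MARS: as I understand the original MARS paper, its variance reduction replaces the raw stochastic gradient with a corrected estimate before the momentum/preconditioning step; MARS-M reportedly feeds this corrected estimate into Muon's matrix update. Notation below is from MARS and may differ from MARS-M.

```latex
% MARS-style corrected gradient at step t (same minibatch \xi_t at both iterates):
c_t = \nabla f(x_t, \xi_t)
    + \gamma_t \, \frac{\beta_1}{1 - \beta_1}
      \big( \nabla f(x_t, \xi_t) - \nabla f(x_{t-1}, \xi_t) \big)
```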