Fengzhuo Zhang
@FengzhuoZhang
Followers
135
Following
31
Media
18
Statuses
48
ECE PhD student @NUSingapore | Previous: EE undergrad @Tsinghua_Uni
Joined September 2021
Why does Muon outperform Adam, and how? Answer: Muon Outperforms Adam in Tail-End Associative Memory Learning. Three Key Findings: > Associative memory parameters are the main beneficiaries of Muon, compared to Adam. > Muon yields more isotropic weights than Adam. > In
1
32
51
Excited to share our recent work! We provide a mechanistic understanding of long CoT reasoning in state-tracking: when transformers length-generalize strongly, when they stall, and how recursive self-training pushes the boundary. (1/8)
5
42
218
1/3 Ready to smell your GPUs burning? Introducing MegaDLMs, the first production-level library for training diffusion language models, offering 3× faster training speed and up to 47% MFU. Powered by Megatron-LM and Transformer-Engine, it offers near-perfect linear
5
42
149
We just set a new SOTA for LLM inference acceleration with speculative decoding. By corralling a band of specialist drafters, we got 4.99× on Llama-3.1-8B-Instruct and 4.93× on Qwen-32B, beating EAGLE3 by nearly 2×. No gimmicks. Just careful math + solid engineering. 1/
13
52
323
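The tweet above gives no implementation details, so here is a purely illustrative sketch of the general idea of routing each request to one of several specialist drafters before a standard draft-then-verify step. All names (Drafter, speculative_step, verify) are hypothetical and are not the authors' code.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Drafter:
    """A cheap specialist draft model plus a score for how well it fits a prompt."""
    name: str                                   # e.g. "code", "math", "chat"
    score: Callable[[str], float]               # higher = better fit for this prompt
    draft: Callable[[str, int], List[int]]      # proposes k candidate tokens

def speculative_step(prompt: str, drafters: List[Drafter], verify, k: int = 4) -> List[int]:
    """One draft-then-verify step using the best-matching specialist drafter.

    `verify(prompt, proposal)` stands in for one parallel pass of the target
    model that returns the accepted prefix of the proposal (plus one corrected
    token), as in standard speculative decoding.
    """
    drafter = max(drafters, key=lambda d: d.score(prompt))   # pick the specialist
    proposal = drafter.draft(prompt, k)                      # cheap k-token draft
    return verify(prompt, proposal)                          # target model verifies
```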
Imagine a research paradigm where nascent ideas evolve into fully realized papers, complete with empirical data, insightful figures, and robust citations, through an iterative, feedback-driven autonomous system. This vision guides our work. We introduce **freephdlabor**: a
3
11
34
LLMs don't need MCPs, they need a terminal. Not the bash/shell tool that codex/claude already use, but a real TTY emulator, used the same way humans use one, i.e. capable of running any REPL interactively, as we will show in the thread.
7
16
46
Huge thanks to my amazing collaborators: @ShucheW94950, @Jason_JX_Li, @ducx_du, @duchao0726, @TianyuPang1, @zhuoran_yang, @Mingyi552237, @vyftan
0
1
0
5/5: Conclusion In summary, Muon's update rule is beautifully aligned with the outer-product structure of associative memories. This makes it a superior choice for learning the heavy-tailed knowledge stored in LLMs.
1
0
1
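For readers unfamiliar with Muon, here is a minimal sketch of its publicly documented update rule (momentum followed by Newton-Schulz orthogonalization), which is the property the thread refers to; learning-rate scaling details and the paper's exact analysis are omitted.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximate msign(G) = U V^T (from G = U S V^T) with the quintic
    Newton-Schulz iteration used by the public Muon implementation."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)                 # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

@torch.no_grad()
def muon_step(weight, grad, momentum, lr=0.02, beta=0.95):
    """One Muon step on a 2D weight: momentum, orthogonalize, descend.
    Because the orthogonalized update has all singular values close to 1,
    every direction of the matrix moves by a similar amount, which is what
    the thread means by alignment with outer-product associative memories."""
    momentum.mul_(beta).add_(grad)            # heavy-ball momentum buffer
    update = newton_schulz_orthogonalize(momentum)
    weight.add_(update, alpha=-lr)
```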
4/5: The Theory In a one-layer associative memory model, we show: > Muon: Achieves balanced learning with isotropic updates, regardless of feature structure. It's robust. > Adam: Performance is fragile. It can be great or terrible, depending heavily on the underlying embedding
1
0
0
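As a rough illustration of the setting (an assumed form, not the paper's exact construction), a one-layer associative memory with heavy-tailed fact frequencies can be set up as below; the gradient of the cross-entropy loss with respect to W is a sum of (error) x (key) outer products, which is the structure the thread refers to.

```python
import torch
import torch.nn.functional as F

# One matrix W stores key -> value associations; facts follow a Zipf law,
# so "tail" facts are sampled rarely. Sizes and the Zipf exponent are
# arbitrary illustrative choices.
num_facts, dim = 1000, 256
keys = torch.randn(num_facts, dim) / dim ** 0.5     # fixed key embeddings
values = torch.arange(num_facts)                    # each key maps to one class
W = torch.zeros(num_facts, dim, requires_grad=True)

zipf = 1.0 / torch.arange(1, num_facts + 1, dtype=torch.float)
probs = zipf / zipf.sum()                           # heavy-tailed fact frequencies

def batch_loss(batch_size: int = 64) -> torch.Tensor:
    idx = torch.multinomial(probs, batch_size, replacement=True)
    logits = keys[idx] @ W.T                        # retrieval via the stored map
    return F.cross_entropy(logits, values[idx])

loss = batch_loss()
loss.backward()   # W.grad is a sum of (softmax error) x (key) outer products
# An optimizer comparison (Adam vs. Muon) would then track the loss
# restricted to the rare, tail-end facts.
```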
3/5: The Payoff: Mastering the Long Tail This is where it gets practical. Real-world knowledge is heavy-tailed. >Head (Common facts): Both Muon and Adam do great. >Tail (Rare facts): Muon is a game-changer. It learns rare information significantly faster and better.
1
0
0
2/5: The "How": Isotropic Weights Muon's secret weapon is creating more balanced, "democratic" weight matrices. It consistently learns more isotropic weights than Adam, distributing "learning energy" evenly across directions.
1
0
0
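"More isotropic" can be made concrete by looking at the singular-value spectrum of a weight matrix. The sketch below shows two standard diagnostics (stable rank and normalized spectral entropy); these are illustrative choices, not necessarily the exact metric used in the paper.

```python
import torch

def isotropy_diagnostics(W: torch.Tensor):
    """Quantify how evenly a matrix spreads its 'energy' across directions.
    Stable rank ranges from 1 (rank-1, very anisotropic) to min(W.shape);
    normalized spectral entropy ranges from 0 to 1 (1 = perfectly even)."""
    s = torch.linalg.svdvals(W)
    stable_rank = (s ** 2).sum() / (s[0] ** 2)          # ||W||_F^2 / ||W||_2^2
    p = s / s.sum()
    entropy = -(p * (p + 1e-12).log()).sum() / torch.log(torch.tensor(float(len(s))))
    return stable_rank.item(), entropy.item()

# A near-isotropic random matrix vs. a rank-1 (maximally anisotropic) one.
W_iso = torch.randn(512, 512) / 512 ** 0.5
W_rank1 = torch.outer(torch.randn(512), torch.randn(512))
print(isotropy_diagnostics(W_iso))     # high stable rank, entropy near 1
print(isotropy_diagnostics(W_rank1))   # stable rank ~1, entropy near 0
```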
1/5: The Beneficiaries Where does Muon's magic happen? Not everywhere. Muon overwhelmingly benefits the Transformer's "memory banks": the Value/Output (VO) attention weights & Feed-Forward Networks (FFNs). Applying Muon just to these parts recovers most of the full performance
1
0
0
Excited to share our recent research: "Learning to Reason as Action Abstractions with Scalable Mid-Training RL". We theoretically study how mid-training shapes post-training RL. The findings lead to a scalable algorithm for learning action
7
66
400
Announcing OpenMoE 2, the first-ever architectural study of sparse diffusion language models, trained from scratch.
✅ Expert-choice MoE × diffusion
✅ Ultra-wide FLOPs/param range (sparse → super-dense)
✅ Perfect load balance (no aux loss)
✅ +20% throughput
✅ adaptive
6
70
359
Imagine you are the boss of Google DeepMind. To train the best diffusion language model in the world within 1 year, using 800 TPU pods, which model size would you go for? We build Quokka to help you decide: the first-ever large-scale scaling law for DLMs. Interesting facts: 1.
6
58
287
Token crisis: solved. ✅
We pre-trained diffusion language models (DLMs) vs. autoregressive (AR) models from scratch, up to 8B params, 480B tokens, 480 epochs. Findings: > DLMs beat AR when tokens are limited, with >3× data potential. > A 1B DLM trained on just 1B tokens
42
247
2K
Thrilled to introduce BanditSpec: Adaptive Speculative Decoding via Bandit Algorithms #ICML2025 A bandit approach that further boosts the throughput of speculative decoding by adaptively choosing the hyperparameters! Training-free with theoretical guarantees! 13% / +19%
1
4
4
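The tweet does not spell out the algorithm, so here is a hedged sketch of the general idea: treat each candidate hyperparameter setting (here, the draft length) as a bandit arm and pick arms with UCB1 based on observed decoding throughput. `simulate_decode` is a hypothetical stand-in for a real measurement; this is not the paper's exact method.

```python
import math
import random

class UCBDraftLength:
    """UCB1 over candidate speculative-decoding draft lengths."""
    def __init__(self, candidate_lengths=(2, 4, 6, 8)):
        self.arms = list(candidate_lengths)
        self.counts = [0] * len(self.arms)
        self.total_reward = [0.0] * len(self.arms)

    def choose(self) -> int:
        for i, c in enumerate(self.counts):
            if c == 0:
                return i                                  # play every arm once first
        t = sum(self.counts)
        ucb = [self.total_reward[i] / self.counts[i]
               + math.sqrt(2 * math.log(t) / self.counts[i])
               for i in range(len(self.arms))]
        return max(range(len(self.arms)), key=ucb.__getitem__)

    def update(self, arm: int, reward: float) -> None:
        self.counts[arm] += 1
        self.total_reward[arm] += reward

def simulate_decode(k: int) -> float:
    """Hypothetical stand-in: measured throughput (tokens/s) at draft length k."""
    return 4.0 - abs(k - 5) + random.random()

bandit = UCBDraftLength()
for _ in range(200):
    arm = bandit.choose()
    bandit.update(arm, simulate_decode(bandit.arms[arm]))
best = max(range(len(bandit.arms)), key=bandit.counts.__getitem__)
print("most-used draft length:", bandit.arms[best])
```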
Absolutely thrilled to share that our paper "Enhancing Long Video Consistency without Tuning" has been selected as Best Paper at the ICML '25 Workshop on Building Physically Plausible World Models! Training-free TiARA + PromptBlend achieves a major leap in long-video
3
3
4
We're excited to share our paper, "Taming Polysemanticity in LLMs," which introduces Group Bias Adaptation (GBA), the FIRST Sparse Autoencoder (SAE) training method with a provable guarantee for untangling monosemantic concepts! Paper: https://t.co/f3L3VxEnHn Website:
5
26
109
Optimizing Anytime Reasoning via Budget Relative Policy Optimization. Our BRPO leverages verifiable dense rewards, significantly outperforming GRPO in both final and anytime reasoning performance. Paper: https://t.co/Sm5HOB0pnx Code: https://t.co/vaxFvNiDJY
2
24
77