Fengzhuo Zhang
@FengzhuoZhang
Followers
135
Following
31
Media
18
Statuses
48
ECE PhD student @NUSingapore | Previous: EE undergrad @Tsinghua_Uni
Joined September 2021
Why does Muon outperform Adam, and how? Answer: Muon Outperforms Adam in Tail-End Associative Memory Learning. Three Key Findings: > Associative memory parameters are the main beneficiaries of Muon, compared to Adam. > Muon yields more isotropic weights than Adam. > In
1
32
51
Excited to share our recent work! We provide a mechanistic understanding of long CoT reasoning in state-tracking: when transformers length-generalize strongly, when they stall, and how recursive self-training pushes the boundary. (1/8)
5
42
218
1/3 Ready to smell your GPUs burning? Introducing MegaDLMs, the first production-level library for training diffusion language models, offering 3× faster training speed and up to 47% MFU. Powered by Megatron-LM and Transformer-Engine, it offers near-perfect linear
5
42
149
We just set a new SOTA for LLM inference acceleration with speculative decoding. By corralling a band of specialist drafters, we got 4.99× on Llama-3.1-8B-Instruct and 4.93× on Qwen-32B, beating EAGLE3 by nearly 2×. No gimmicks. Just careful math + solid engineering. 1/
13
52
323
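The tweet above gives no implementation details, so here is a purely illustrative sketch of the general idea of routing each request to one of several specialist drafters before a standard draft-then-verify step. All names (Drafter, speculative_step, verify) are hypothetical and are not the authors' code.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Drafter:
    """A cheap specialist draft model plus a score for how well it fits a prompt."""
    name: str                                   # e.g. "code", "math", "chat"
    score: Callable[[str], float]               # higher = better fit for this prompt
    draft: Callable[[str, int], List[int]]      # proposes k candidate tokens

def speculative_step(prompt: str, drafters: List[Drafter], verify, k: int = 4) -> List[int]:
    """One draft-then-verify step using the best-matching specialist drafter.

    `verify(prompt, proposal)` stands in for one parallel pass of the target
    model that returns the accepted prefix of the proposal (plus one corrected
    token), as in standard speculative decoding.
    """
    drafter = max(drafters, key=lambda d: d.score(prompt))   # pick the specialist
    proposal = drafter.draft(prompt, k)                      # cheap k-token draft
    return verify(prompt, proposal)                          # target model verifies
```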
Imagine a research paradigm where nascent ideas evolve into fully realized papers, complete with empirical data, insightful figures, and robust citations, through an iterative, feedback-driven autonomous system. This vision guides our work. We introduce **freephdlabor**: a
3
11
34
LLMs don't need MCPs, they need a terminal. Not the bash/shell tool that codex/claude already use, but a real TTY emulator, used the same way humans use one, i.e. capable of running any REPL interactively, as we will show in the thread.
7
16
46
Huge thanks to my amazing collaborators: @ShucheW94950, @Jason_JX_Li, @ducx_du, @duchao0726, @TianyuPang1, @zhuoran_yang, @Mingyi552237, @vyftan
0
1
0
5/5: Conclusion In summary, Muon's update rule is beautifully aligned with the outer-product structure of associative memories. This makes it a superior choice for learning the heavy-tailed knowledge stored in LLMs.
1
0
1
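For readers unfamiliar with Muon, here is a minimal sketch of its publicly documented update rule (momentum followed by Newton-Schulz orthogonalization), which is the property the thread refers to; learning-rate scaling details and the paper's exact analysis are omitted.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximate msign(G) = U V^T (from G = U S V^T) with the quintic
    Newton-Schulz iteration used by the public Muon implementation."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)                 # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

@torch.no_grad()
def muon_step(weight, grad, momentum, lr=0.02, beta=0.95):
    """One Muon step on a 2D weight: momentum, orthogonalize, descend.
    Because the orthogonalized update has all singular values close to 1,
    every direction of the matrix moves by a similar amount, which is what
    the thread means by alignment with outer-product associative memories."""
    momentum.mul_(beta).add_(grad)            # heavy-ball momentum buffer
    update = newton_schulz_orthogonalize(momentum)
    weight.add_(update, alpha=-lr)
```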
4/5: The Theory In a one-layer associative memory model, we show: > Muon: Achieves balanced learning with isotropic updates, regardless of feature structure. It's robust. > Adam: Performance is fragile. It can be great or terrible, depending heavily on the underlying embedding
1
0
0
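As a rough illustration of the setting (an assumed form, not the paper's exact construction), a one-layer associative memory with heavy-tailed fact frequencies can be set up as below; the gradient of the cross-entropy loss with respect to W is a sum of (error) x (key) outer products, which is the structure the thread refers to.

```python
import torch
import torch.nn.functional as F

# One matrix W stores key -> value associations; facts follow a Zipf law,
# so "tail" facts are sampled rarely. Sizes and the Zipf exponent are
# arbitrary illustrative choices.
num_facts, dim = 1000, 256
keys = torch.randn(num_facts, dim) / dim ** 0.5     # fixed key embeddings
values = torch.arange(num_facts)                    # each key maps to one class
W = torch.zeros(num_facts, dim, requires_grad=True)

zipf = 1.0 / torch.arange(1, num_facts + 1, dtype=torch.float)
probs = zipf / zipf.sum()                           # heavy-tailed fact frequencies

def batch_loss(batch_size: int = 64) -> torch.Tensor:
    idx = torch.multinomial(probs, batch_size, replacement=True)
    logits = keys[idx] @ W.T                        # retrieval via the stored map
    return F.cross_entropy(logits, values[idx])

loss = batch_loss()
loss.backward()   # W.grad is a sum of (softmax error) x (key) outer products
# An optimizer comparison (Adam vs. Muon) would then track the loss
# restricted to the rare, tail-end facts.
```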
3/5: The Payoff: Mastering the Long Tail This is where it gets practical. Real-world knowledge is heavy-tailed. >Head (Common facts): Both Muon and Adam do great. >Tail (Rare facts): Muon is a game-changer. It learns rare information significantly faster and better.
1
0
0
2/5: The "How": Isotropic Weights Muon's secret weapon is creating more balanced, "democratic" weight matrices. It consistently learns more isotropic weights than Adam, distributing "learning energy" evenly across directions.
1
0
0
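"More isotropic" can be made concrete by looking at the singular-value spectrum of a weight matrix. The sketch below shows two standard diagnostics (stable rank and normalized spectral entropy); these are illustrative choices, not necessarily the exact metric used in the paper.

```python
import torch

def isotropy_diagnostics(W: torch.Tensor):
    """Quantify how evenly a matrix spreads its 'energy' across directions.
    Stable rank ranges from 1 (rank-1, very anisotropic) to min(W.shape);
    normalized spectral entropy ranges from 0 to 1 (1 = perfectly even)."""
    s = torch.linalg.svdvals(W)
    stable_rank = (s ** 2).sum() / (s[0] ** 2)          # ||W||_F^2 / ||W||_2^2
    p = s / s.sum()
    entropy = -(p * (p + 1e-12).log()).sum() / torch.log(torch.tensor(float(len(s))))
    return stable_rank.item(), entropy.item()

# A near-isotropic random matrix vs. a rank-1 (maximally anisotropic) one.
W_iso = torch.randn(512, 512) / 512 ** 0.5
W_rank1 = torch.outer(torch.randn(512), torch.randn(512))
print(isotropy_diagnostics(W_iso))     # high stable rank, entropy near 1
print(isotropy_diagnostics(W_rank1))   # stable rank ~1, entropy near 0
```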
1/5: The Beneficiaries Where does Muon's magic happen? Not everywhere. Muon overwhelmingly benefits the Transformer's "memory banks": the Value/Output (VO) attention weights & Feed-Forward Networks (FFNs). Applying Muon just to these parts recovers most of the full performance
1
0
0
Excited to share our recent research: "Learning to Reason as Action Abstractions with Scalable Mid-Training RL". We theoretically study how mid-training shapes post-training RL. The findings lead to a scalable algorithm for learning action
7
66
400
Announcing OpenMoE 2, the first-ever architectural study of sparse diffusion language models, trained from scratch.
✅ Expert-choice MoE × diffusion
✅ Ultra-wide FLOPs/param range (sparse → super-dense)
✅ Perfect load balance (no aux loss)
✅ +20% throughput
✅ adaptive
6
70
359
Imagine you are the boss of Google DeepMind. To train the best diffusion language model in the world within 1 year, using 800 TPU pods, which model size would you go for? We build Quokka to help you decide: the first-ever large-scale scaling law for DLMs. Interesting facts: 1.
6
58
287
Token crisis: solved. ✅
We pre-trained diffusion language models (DLMs) vs. autoregressive (AR) models from scratch, up to 8B params, 480B tokens, 480 epochs. Findings: > DLMs beat AR when tokens are limited, with >3× data potential. > A 1B DLM trained on just 1B tokens
42
247
2K
Thrilled to introduce BanditSpec: Adaptive Speculative Decoding via Bandit Algorithms #ICML2025 A bandit approach that further boosts the throughput of speculative decoding by adaptively choosing the hyperparameters! Training-free with theoretical guarantees! 13% / +19%
1
4
4
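The tweet does not spell out the algorithm, so here is a hedged sketch of the general idea: treat each candidate hyperparameter setting (here, the draft length) as a bandit arm and pick arms with UCB1 based on observed decoding throughput. `simulate_decode` is a hypothetical stand-in for a real measurement; this is not the paper's exact method.

```python
import math
import random

class UCBDraftLength:
    """UCB1 over candidate speculative-decoding draft lengths."""
    def __init__(self, candidate_lengths=(2, 4, 6, 8)):
        self.arms = list(candidate_lengths)
        self.counts = [0] * len(self.arms)
        self.total_reward = [0.0] * len(self.arms)

    def choose(self) -> int:
        for i, c in enumerate(self.counts):
            if c == 0:
                return i                                  # play every arm once first
        t = sum(self.counts)
        ucb = [self.total_reward[i] / self.counts[i]
               + math.sqrt(2 * math.log(t) / self.counts[i])
               for i in range(len(self.arms))]
        return max(range(len(self.arms)), key=ucb.__getitem__)

    def update(self, arm: int, reward: float) -> None:
        self.counts[arm] += 1
        self.total_reward[arm] += reward

def simulate_decode(k: int) -> float:
    """Hypothetical stand-in: measured throughput (tokens/s) at draft length k."""
    return 4.0 - abs(k - 5) + random.random()

bandit = UCBDraftLength()
for _ in range(200):
    arm = bandit.choose()
    bandit.update(arm, simulate_decode(bandit.arms[arm]))
best = max(range(len(bandit.arms)), key=bandit.counts.__getitem__)
print("most-used draft length:", bandit.arms[best])
```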
Absolutely thrilled to share that our paper "Enhancing Long Video Consistency without Tuning" has been selected as Best Paper at the ICML '25 Workshop on Building Physically Plausible World Models! Training-free TiARA + PromptBlend achieves a major leap in long-video
3
3
4
We're excited to share our paper, "Taming Polysemanticity in LLMs," which introduces Group Bias Adaptation (GBA), the FIRST Sparse Autoencoder (SAE) training method with a provable guarantee for untangling monosemantic concepts! Paper: https://t.co/f3L3VxEnHn Website:
5
26
109
Optimizing Anytime Reasoning via Budget Relative Policy Optimization. Our BRPO leverages verifiable dense rewards, significantly outperforming GRPO in both final and anytime reasoning performance. Paper: https://t.co/Sm5HOB0pnx Code: https://t.co/vaxFvNiDJY
2
24
77