Fengzhuo Zhang Profile
Fengzhuo Zhang

@FengzhuoZhang

Followers
135
Following
31
Media
18
Statuses
48

ECE PhD student @NUSingapore | Previous: EE undergrad @Tsinghua_Uni

Joined September 2021
@FengzhuoZhang
Fengzhuo Zhang
1 month
Why does Muon outperform Adam, and how? 🚀 Answer: Muon Outperforms Adam in Tail-End Associative Memory Learning. Three Key Findings:
> Associative memory parameters are the main beneficiaries of Muon, compared to Adam.
> Muon yields more isotropic weights than Adam.
> In
1
32
51
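For context, here is a minimal numpy sketch of the publicly documented Muon update (momentum followed by Newton-Schulz orthogonalization of each 2-D weight's update). This illustrates the optimizer the thread analyzes; it is not code from the paper, and the hyperparameters are illustrative.

```python
import numpy as np

def newton_schulz_msign(G, steps=5, eps=1e-7):
    """Approximate msign(G) = U V^T (for G = U S V^T) via a quintic
    Newton-Schulz iteration; coefficients follow the public Muon reference."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)      # normalize so the iteration converges
    transpose = X.shape[0] > X.shape[1]
    if transpose:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transpose else X

def muon_step(W, grad, buf, lr=0.02, beta=0.95):
    """One Muon step on a single 2-D weight matrix (heavy-ball momentum here;
    reference implementations also use a Nesterov variant and a
    shape-dependent scale factor)."""
    buf = beta * buf + grad
    W = W - lr * newton_schulz_msign(buf)
    return W, buf

# Toy usage: one step on a random 64x128 "weight" with a random "gradient".
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 128)) * 0.02
buf = np.zeros_like(W)
W, buf = muon_step(W, rng.standard_normal((64, 128)), buf)
```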
@yuhuang42
Yu Huang
6 days
Excited to share our recent work! We provide a mechanistic understanding of long CoT reasoning in state-tracking: when transformers length-generalize strongly, when they stall, and how recursive self-training pushes the boundary. 🧵 (1/8)
5
42
218
@NiJinjie
Jinjie Ni
12 days
1/3 🚬 Ready to smell your GPUs burning? Introducing MegaDLMs, the first production-level library for training diffusion language models, delivering 3× faster training speed and up to 47% MFU. Powered by Megatron-LM and Transformer-Engine, it offers near-perfect linear
5
42
149
@yuxiangw_cs
Yu-Xiang Wang
24 days
🚀 We just set a new SOTA for LLM inference acceleration with speculative decoding. By corralling a band of specialist drafters, we got 4.99× on Llama-3.1-8B-Instruct, 4.93× on Qwen-32B, beating EAGLE3 by nearly 2×. No gimmicks. Just careful math + solid engineering. 🧵 1/
13
52
323
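As background on where the speedup comes from, below is a self-contained toy of the standard speculative-sampling accept/resample rule (Leviathan et al.-style). It is generic background only, not the multi-drafter method announced above; the distributions are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def speculative_accept(p, q, x):
    """Accept draft token x with prob min(1, p[x]/q[x]); otherwise resample
    from the residual distribution max(p - q, 0), renormalized."""
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()
    return rng.choice(len(p), p=residual), False

# Toy vocabulary of 5 tokens: a "target" and a cheaper "draft" distribution.
p = np.array([0.50, 0.20, 0.15, 0.10, 0.05])   # target model probs
q = np.array([0.40, 0.30, 0.10, 0.10, 0.10])   # draft model probs

accepted = 0
for _ in range(10_000):
    draft_token = rng.choice(5, p=q)
    _, ok = speculative_accept(p, q, draft_token)
    accepted += ok
print(f"acceptance rate ~ {accepted / 10_000:.2f}")  # equals sum_x min(p, q) in expectation
```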
@zhuoran_yang
Zhuoran Yang
29 days
Imagine a research paradigm where nascent ideas evolve into fully realized papers, complete with empirical data, insightful figures, and robust citations, through an iterative, feedback-driven autonomous system. This vision guides our work. We introduce **freephdlabor**: a
3
11
34
@mavenlin
Min Lin
1 month
LLMs don't need MCPs, they need a terminal. Not the bash/shell tool that Codex/Claude already use, but a real tty emulator, to be used the same way humans use one, i.e. capable of running any REPL interactively, as we will show in the thread.
7
16
46
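A minimal, POSIX-only sketch of the idea (an assumed illustration using Python's standard pty module, not Min Lin's tool): drive a real interactive REPL through a pseudo-terminal instead of issuing one-shot shell commands.

```python
import os
import pty
import subprocess
import time

# Open a pseudo-terminal and attach an interactive Python REPL to it.
master_fd, slave_fd = pty.openpty()
proc = subprocess.Popen(
    ["python3", "-i", "-q"],           # any interactive REPL works here
    stdin=slave_fd, stdout=slave_fd, stderr=slave_fd,
)
os.close(slave_fd)                      # the child keeps its own copy

os.write(master_fd, b"1 + 1\n")         # "type" into the REPL like a human
time.sleep(0.5)                         # crude wait; a real agent would poll the fd
print(os.read(master_fd, 4096).decode(errors="replace"))

os.write(master_fd, b"exit()\n")
proc.wait()
os.close(master_fd)
```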
@FengzhuoZhang
Fengzhuo Zhang
1 month
🎆 Huge thanks to my amazing collaborators: @ShucheW94950, @Jason_JX_Li, @ducx_du, @duchao0726, @TianyuPang1, @zhuoran_yang, @Mingyi552237, @vyftan
0
1
0
@FengzhuoZhang
Fengzhuo Zhang
1 month
5/5: Conclusion
In summary, Muon's update rule is beautifully aligned with the outer-product structure of associative memories. This makes it a superior choice for learning the heavy-tailed knowledge stored in LLMs.
1
0
1
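In symbols (a schematic with assumed notation, not the paper's exact statement): an associative memory stores key-value pairs as a sum of outer products, and Muon's orthogonalized update flattens the singular-value spectrum of the gradient, so rare associations receive updates of the same magnitude as common ones.

```latex
W \;=\; \sum_{k=1}^{K} \alpha_k\, v_k u_k^{\top}
\qquad \text{(associative memory; the weights } \alpha_k \text{ are heavy-tailed)}

G \;=\; U \Sigma V^{\top}
\;\;\Longrightarrow\;\;
\mathrm{msign}(G) \;=\; U V^{\top}
\qquad \text{(Muon's update direction: every singular value set to } 1\text{)}
```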
@FengzhuoZhang
Fengzhuo Zhang
1 month
4/5: The Theory
In a one-layer associative memory model, we show:
> Muon: Achieves balanced learning with isotropic updates, regardless of feature structure. It's robust.
> Adam: Performance is fragile. It can be great or terrible, depending heavily on the underlying embedding
1
0
0
@FengzhuoZhang
Fengzhuo Zhang
1 month
3/5: The Payoff: Mastering the Long Tail
This is where it gets practical. Real-world knowledge is heavy-tailed.
> Head (Common facts): Both Muon and Adam do great.
> Tail (Rare facts): Muon is a game-changer. It learns rare information significantly faster and better.
1
0
0
@FengzhuoZhang
Fengzhuo Zhang
1 month
2/5: The "How": Isotropic Weights
Muon's secret weapon is creating more balanced, "democratic" weight matrices. It consistently learns more isotropic weights than Adam, distributing the "learning energy" evenly across directions.
1
0
0
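One common way to quantify how "isotropic" a weight matrix is looks at its singular-value spectrum, e.g. via an entropy-based effective rank. The small check below is illustrative only; the paper's metric may differ.

```python
import numpy as np

def effective_rank(W, eps=1e-12):
    """Entropy-based effective rank: maximal when the singular-value spectrum
    is flat (isotropic), small when a few directions dominate."""
    s = np.linalg.svd(W, compute_uv=False)
    p = s / (s.sum() + eps)
    entropy = -(p * np.log(p + eps)).sum()
    return float(np.exp(entropy))

# Toy comparison: a well-spread random matrix vs. one with a single dominant direction.
rng = np.random.default_rng(0)
W_isotropic = rng.standard_normal((256, 256)) / np.sqrt(256)
spike = np.outer(rng.standard_normal(256), rng.standard_normal(256)) / 256
W_spiked = W_isotropic + 10.0 * spike
print(effective_rank(W_isotropic))  # large: flat spectrum
print(effective_rank(W_spiked))     # noticeably smaller: one dominant direction
```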
@FengzhuoZhang
Fengzhuo Zhang
1 month
1/5: The Beneficiaries
Where does Muon's magic happen? Not everywhere. Muon overwhelmingly benefits the Transformer's "memory banks": the Value/Output (VO) attention weights & Feed-Forward Networks (FFNs). Applying Muon just to these parts recovers most of the full performance
1
0
0
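A sketch of what "Muon only on the memory parameters" could look like in PyTorch. The parameter-name patterns (`v_proj`, `o_proj`, `mlp`, `ffn`) are assumptions about the model's naming, and `Muon` stands for whatever implementation is available; this is not the paper's code.

```python
import torch

def split_param_groups(model: torch.nn.Module):
    """Route 2-D value/output-projection and FFN weight matrices to Muon and
    everything else (embeddings, norms, Q/K projections, biases) to AdamW."""
    muon_params, adamw_params = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        is_matrix = p.ndim == 2
        is_memory = any(tag in name for tag in ("v_proj", "o_proj", "mlp", "ffn"))
        (muon_params if (is_matrix and is_memory) else adamw_params).append(p)
    return muon_params, adamw_params

# Usage sketch, assuming some `model` and a Muon implementation `Muon`:
#   muon_params, adamw_params = split_param_groups(model)
#   opt_muon  = Muon(muon_params, lr=2e-2, momentum=0.95)
#   opt_adamw = torch.optim.AdamW(adamw_params, lr=3e-4)
```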
@ShenaoZhang
Shenao Zhang
2 months
🚀 Excited to share our recent research: 🚀 "Learning to Reason as Action Abstractions with Scalable Mid-Training RL" We theoretically study how mid-training shapes post-training RL. The findings lead to a scalable algorithm for learning action
7
66
400
@NiJinjie
Jinjie Ni
2 months
Announcing OpenMoE 2, the first-ever architectural study of sparse diffusion language models, trained from scratch.
✅ Expert-choice MoE × diffusion
✅ Ultra-wide FLOPs/param range (sparse → super-dense)
✅ Perfect load-balance (no aux loss)
✅ +20% throughput
✅ adaptive
6
70
359
@NiJinjie
Jinjie Ni
2 months
๐ŸทImagine you are the boss of Google DeepMind. To train the best diffusion language model in world within 1 year, using 800 TPU pods, which model size will you go for? ๐Ÿฟ๏ธย We build Quokka to help you decideโ€“the first-ever large-scale scaling law for DLMs. Interesting facts: 1.
6
58
287
@NiJinjie
Jinjie Ni
3 months
Token crisis: solved. ✅ We pre-trained diffusion language models (DLMs) vs. autoregressive (AR) models from scratch, up to 8B params, 480B tokens, 480 epochs. Findings:
> DLMs beat AR when tokens are limited, with >3× data potential.
> A 1B DLM trained on just 1B tokens
42
247
2K
@Yunlong_Hou_
Yunlong
4 months
🚀 Thrilled to introduce BanditSpec: Adaptive Speculative Decoding via Bandit Algorithms #ICML2025 A bandit approach that further boosts the throughput of speculative decoding by adaptively choosing the hyperparameters! Training-free with Theoretical Guarantees! 13% / +19%
1
4
4
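The bandit framing can be illustrated with a toy UCB1 loop over one hyperparameter (the draft length). This is only a generic sketch of the idea; the arm set, reward model, and algorithm are made up and are not BanditSpec itself.

```python
import numpy as np

rng = np.random.default_rng(0)

arms = [2, 4, 6, 8]                 # candidate draft lengths (assumed)
counts = np.zeros(len(arms))
rewards = np.zeros(len(arms))

def observed_throughput(draft_len):
    """Stand-in for a real measurement of decoding throughput (tokens/sec)."""
    true_mean = {2: 1.8, 4: 2.6, 6: 2.4, 8: 2.0}[draft_len]
    return true_mean + 0.3 * rng.standard_normal()

for t in range(1, 501):
    if t <= len(arms):
        i = t - 1                                        # play each arm once
    else:
        ucb = rewards / counts + np.sqrt(2 * np.log(t) / counts)
        i = int(np.argmax(ucb))                          # optimism in the face of uncertainty
    counts[i] += 1
    rewards[i] += observed_throughput(arms[i])

print("chosen draft length:", arms[int(np.argmax(rewards / counts))])
```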
@xingyaoL777
Xingyao Li
4 months
🚀 Absolutely thrilled to share that our paper "Enhancing Long Video Consistency without Tuning" has been selected as Best Paper at the ICML '25 Workshop on Building Physically Plausible World Models! Training-free TiARA + PromptBlend achieves a major leap in long-video
3
3
4
@zhuoran_yang
Zhuoran Yang
5 months
🚀 We're excited to share our paper, "Taming Polysemanticity in LLMs," which introduces Group Bias Adaptation (GBA), the FIRST Sparse Autoencoder (SAE) training method with a provable guarantee for untangling monosemantic concepts! 📄 Paper: https://t.co/f3L3VxEnHn 🌐 Website:
5
26
109
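For readers new to SAEs, here is a vanilla sparse autoencoder objective (reconstruction plus L1 sparsity) on toy activations. This is the standard baseline for context only, not the GBA method the paper introduces; the dimensions and penalty weight are arbitrary.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """A vanilla sparse autoencoder over residual-stream activations."""
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x):
        z = torch.relu(self.encoder(x))   # sparse feature activations
        x_hat = self.decoder(z)           # reconstruction of the activations
        return x_hat, z

# Reconstruction + L1 sparsity objective on toy activations.
sae = SparseAutoencoder(d_model=64, d_dict=512)
x = torch.randn(32, 64)
x_hat, z = sae(x)
loss = ((x_hat - x) ** 2).mean() + 1e-3 * z.abs().mean()
loss.backward()
```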
@QPHutu
Penghui Qi
6 months
👀 Optimizing Anytime Reasoning via Budget Relative Policy Optimization 👀
🚀 Our BRPO leverages verifiable dense rewards, significantly outperforming GRPO in both final and anytime reasoning performance. 🚀
📰 Paper: https://t.co/Sm5HOB0pnx
🛠️ Code: https://t.co/vaxFvNiDJY
2
24
77