
Tuo Zhao
@tourzhao
Followers
2K
Following
578
Media
32
Statuses
375
Associate Professor at Georgia Tech, Ph.D. in Computer Science. Research Interests: Machine Learning
Atlanta, Georgia
Joined August 2019
🚀 New release for the Phi family! **SlimMOE** trims bulky Phi-3.5-MoE experts into agile models (4-6× smaller) with minimal accuracy loss. If you ❤️ Phi-3 mini/small, you'll love these lighter siblings. 👇
arxiv.org
The Mixture of Experts (MoE) architecture has emerged as a powerful paradigm for scaling large language models (LLMs) while maintaining inference efficiency. However, their enormous memory...
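For readers less familiar with why MoE keeps inference cheap while scaling parameters, here is a minimal, generic sketch of top-2 expert routing in PyTorch. It is an illustration of the general architecture only, not SlimMOE's or Phi-3.5-MoE's actual implementation, and all layer sizes are made up.

```python
# Minimal sketch of top-2 expert routing (generic MoE layer, not SlimMOE itself):
# each token activates only k of E expert MLPs, so compute stays far below dense scaling.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, d_model=64, d_ff=128, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                        # x: (tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)
        weight, idx = gate.topk(self.k, dim=-1)  # keep the k highest-scoring experts per token
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e         # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weight[mask, slot, None] * expert(x[mask])
        return out

print(TinyMoE()(torch.randn(5, 64)).shape)       # torch.Size([5, 64])
```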
@zichong_li @chenliang1_ @Zixuan_Zzz @HongIlgee @WeizhuChen @mlatgt @GeorgiaTechISyE @GTCSE @Microsoft 🔧 With memory-efficient optimizers like 8-bit Adam or Muon, you can fine-tune phi-moe-mini-instruct on a single A100 and phi-moe-tiny-instruct on a single A6000. Perfect testbeds for MoE research when resources are tight! #PhiSeries #SlimMOE
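As a rough illustration of the single-GPU setup above, here is a minimal fine-tuning sketch using the 8-bit Adam option the tweet mentions (via 🤗 Transformers with bitsandbytes installed). The Hugging Face repo id and the dataset are placeholders I made up, not confirmed paths; swap in the actual checkpoint and your own data.

```python
# Minimal sketch of memory-lean fine-tuning with 8-bit Adam (bitsandbytes backend).
# Assumptions: "microsoft/phi-moe-mini-instruct" and the dataset below are placeholders.
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "microsoft/phi-moe-mini-instruct"  # hypothetical repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

ds = load_dataset("yahma/alpaca-cleaned", split="train[:1000]")  # example data only
ds = ds.map(
    lambda b: tokenizer(
        [i + "\n" + o for i, o in zip(b["instruction"], b["output"])],
        truncation=True, max_length=1024),
    batched=True, remove_columns=ds.column_names)

args = TrainingArguments(
    output_dir="slimmoe-ft",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    bf16=True,
    optim="adamw_bnb_8bit",   # 8-bit optimizer states cut memory vs. full-precision Adam
    num_train_epochs=1,
    logging_steps=10,
)

Trainer(model=model, args=args, train_dataset=ds,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)).train()
```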
Joint work with @zichong_li @chenliang1_ @Zixuan_Zzz @HongIlgee Young Jin Kim and @WeizhuChen (@mlatgt @GeorgiaTechISyE @GTCSE @Microsoft).
🥇 Meet the smallest Phi MoE yet: **phi-moe-tiny-instruct** (3.8B total / 1.1B active). Same SlimMOE magic, instruction-tuned for real tasks, now light enough for laptops & mobile GPUs. Grab the weights 👉 #TinyPhi #MoE #OpenSource
huggingface.co
🚀 New to the Phi family: **phi-moe-mini-instruct** (7.6B total / 2.4B active)! SlimMOE trims Phi-3.5-MoE 6× while preserving almost all accuracy, ideal for edge inference. Try it here 👉 #SlimMOE #Phi #LLM
huggingface.co
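For anyone who wants to try the checkpoint quickly, a minimal inference sketch with the 🤗 Transformers text-generation pipeline is below. The repo id is a placeholder guess; replace it with the actual path behind the link above.

```python
# Minimal sketch for trying a released SlimMOE checkpoint.
# Assumption: the repo id below is hypothetical; use the real Hugging Face path.
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="microsoft/phi-moe-mini-instruct",  # hypothetical repo id
    torch_dtype="auto",
    device_map="auto",
)

messages = [{"role": "user", "content": "Explain mixture-of-experts in two sentences."}]
print(pipe(messages, max_new_tokens=128)[0]["generated_text"])
```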
Joint work with @li_zichong, Xinyu Feng, Yuheng Cai, @Zixuan_Zzz, @chenliang1_, @WeizhuChen, Tianyi Liu, and Haoyu Wang.
Specifically, we prove 1) the existence of progressive sharpening and self-stabilization under large learning rates, 2) a sharpness upper bound along the entire GD trajectory, and 3) that the non-monotonic loss is essentially monotonic when projected onto the relevant dimension. (3/3) #EoS
🔍 We introduce a nontrivial two-layer linear network with 2D input, where one dimension is relevant to the response and the other is irrelevant. This input structure reveals new insights about the EoS phenomenon. (2/3) #MachineLearningTheory
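To make the setup concrete, here is a small numerical sketch in the same spirit: a two-layer linear network on 2D inputs where only the first coordinate drives the response, trained by full-batch GD while tracking sharpness (the top Hessian eigenvalue) against 2/η. This is my own toy illustration, not the paper's exact construction; the width, initialization scale, and learning rate are arbitrary and may need tuning to land in the EoS regime.

```python
# Toy sketch (not the paper's exact example): two-layer linear net, 2D input,
# only dimension 1 is relevant; track sharpness along the GD trajectory.
import torch

torch.manual_seed(0)
n, m, lr, steps = 64, 4, 0.08, 300   # samples, width, step size, GD steps

X = torch.randn(n, 2)
y = X[:, 0].clone()                  # response depends only on the relevant dimension

theta = 0.5 * torch.randn(2 * m + m) # flattened parameters: W1 (m x 2) and w2 (m)

def loss_fn(p):
    W1 = p[: 2 * m].reshape(m, 2)
    w2 = p[2 * m:]
    pred = (X @ W1.T) @ w2           # two-layer linear prediction
    return 0.5 * ((pred - y) ** 2).mean()

for t in range(steps):
    g = torch.autograd.functional.jacobian(loss_fn, theta)  # gradient of scalar loss
    theta = theta - lr * g
    if t % 50 == 0:
        H = torch.autograd.functional.hessian(loss_fn, theta)
        sharpness = torch.linalg.eigvalsh(H).max().item()    # top Hessian eigenvalue
        print(f"step {t:3d}  loss {loss_fn(theta).item():.4f}  "
              f"sharpness {sharpness:.2f}  2/lr {2 / lr:.2f}")
```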
📢 Check out our arXiv preprint: "A Minimalist Example of Edge-of-Stability and Progressive Sharpening". We prove progressive sharpening and self-stabilization of gradient descent under large learning rates for training linear networks. #DeepLearning (1/3)
Grateful for my awesome collaborators: @Liming_Liu6, @ZhenghaoXu0, @Zixuan_Zzz, @li_zichong, @GT_HaoKang, @chenliang1_, @WeizhuChen (3/3).