Ruoyu Sun Profile
Ruoyu Sun

@RuoyuSun_UI

Followers: 1K · Following: 280 · Media: 34 · Statuses: 101

Associate Prof at CUHK-Shenzhen. Prev: assistant prof @UofIllinois; postdoc @Stanford; visitor @AIatMeta. Works on optimization for machine learning, DL, and LLMs.

Shenzhen, China
Joined December 2010
@RuoyuSun_UI
Ruoyu Sun
2 months
Neural nets' Hessians are often **nearly block diagonal**! There is little understanding of when and why this happens. We provide one of the first theoretical analyses, using random matrix theory. Somewhat unexpectedly, we find that one primary driver is a large number of classes C.
@yushun_zzz
Yushun Zhang
2 months
New paper alert! We report that the Hessian of NNs has a very special structure: 1) it appears to be a "block-diagonal-block-circulant" matrix at initialization; 2) it then quickly evolves into a "near-block-diagonal" matrix during training. We then theoretically reveal two …
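To make the "near-block-diagonal" claim concrete, here is a minimal sketch (not the paper's analysis): it computes the full Hessian of the output layer of a tiny made-up classifier and measures how much of its squared mass sits in the C diagonal blocks, one per output class. All sizes and names are illustrative.

```python
# Illustrative sketch only: Hessian of a tiny classifier's output layer,
# viewed as a C x C grid of H x H blocks (one block per pair of output neurons).
import torch
import torch.nn.functional as F

torch.manual_seed(0)
D, H, C, N = 8, 6, 5, 32            # input dim, hidden dim, #classes, #samples
X = torch.randn(N, D)
y = torch.randint(0, C, (N,))

W1 = torch.randn(H, D) * 0.3        # fixed hidden-layer weights
W2 = torch.randn(C, H) * 0.3        # output-layer weights (one row per class)

def loss_fn(w2_flat):
    # Loss as a function of the output-layer weights only, so the Hessian has
    # a natural C x C block structure.
    w2 = w2_flat.view(C, H)
    logits = torch.relu(X @ W1.t()) @ w2.t()
    return F.cross_entropy(logits, y)

Hss = torch.autograd.functional.hessian(loss_fn, W2.reshape(-1))  # (C*H, C*H)

# Fraction of squared Hessian mass inside the C diagonal blocks.
blocks = Hss.reshape(C, H, C, H)
diag_mass = sum(blocks[c, :, c, :].pow(2).sum() for c in range(C))
total_mass = Hss.pow(2).sum()
print(f"diagonal-block mass fraction: {(diag_mass / total_mass).item():.3f}")
```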
@RuoyuSun_UI
Ruoyu Sun
3 months
Two bits to share after watching the ICLR Test of Time Award speech: 1️⃣ The Adam paper was initially rejected at ICLR 2015!! The authors successfully appealed (how many would have given up?). This reminded me of the DFP algorithm, whose first version was also initially rejected. 2️⃣ Honored to …
@RuoyuSun_UI
Ruoyu Sun
3 months
Come to our #ICLR25 posters today on GEM and Adam-mini if you are interested in LLM training! I will unfortunately miss the first few days and join for the workshops. 1) GEM, in the paper "Preserving Diversity in Supervised Fine-Tuning of Large Language Models". It shows how to improve …
@RuoyuSun_UI
Ruoyu Sun
4 months
Glad to see the community discussing the effect of beta1 and beta2 in Adam! Similar to this post, we ran a sweep over beta1 and beta2 in our NeurIPS'22 paper. Main findings: 1) Our paper uncovered a striking phase transition in the (β₁, β₂) plane –
@cloneofsimo
Simo Ryu
4 months
I swept batch size vs beta1 vs beta2 vs lr and plotted the optimal lr and batch size. And what the actual fuck, man.
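A toy sketch of the kind of (β₁, β₂) sweep discussed above, using `torch.optim.Adam` on a made-up quadratic objective. The grid, learning rate, and step budget are illustrative only, not the setup of the NeurIPS'22 paper or of the quoted sweep.

```python
# Toy (beta1, beta2) sweep for Adam on an arbitrary quadratic objective.
import itertools
import torch

torch.manual_seed(0)
A = torch.randn(20, 20)
A = A @ A.t() + 0.1 * torch.eye(20)   # a PSD matrix with mixed curvature scales
b = torch.randn(20)

def final_loss(beta1, beta2, lr=1e-2, steps=500):
    x = torch.zeros(20, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr, betas=(beta1, beta2))
    for _ in range(steps):
        opt.zero_grad()
        loss = 0.5 * x @ A @ x - b @ x
        loss.backward()
        opt.step()
    with torch.no_grad():
        return (0.5 * x @ A @ x - b @ x).item()

for b1, b2 in itertools.product([0.0, 0.5, 0.9], [0.9, 0.99, 0.999]):
    print(f"beta1={b1:.1f} beta2={b2:.3f} -> final loss {final_loss(b1, b2):.4f}")
```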
@RuoyuSun_UI
Ruoyu Sun
5 months
If you are interested in GRPO by DeepSeek, you might want to try ReMax, which is quite related to GRPO but performs better than GRPO. (ReMax was released in Oct 2023, and GRPO was released in Feb 2024.) Similarity (see 1st image): both are variants of REINFORCE; both add …
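Both methods are described above as REINFORCE variants. The sketch below is a generic REINFORCE-with-baseline update on a toy bandit-style policy, not ReMax or GRPO themselves; roughly, the group-mean baseline used here is GRPO-flavored, and swapping in the reward of a greedy rollout would be closer to ReMax's baseline. The policy and reward are made up for illustration.

```python
# Generic REINFORCE with a baseline on a toy 4-action policy.
import torch

torch.manual_seed(0)
logits = torch.zeros(4, requires_grad=True)         # policy parameters
true_reward = torch.tensor([0.1, 0.2, 0.9, 0.3])    # hidden reward per action
opt = torch.optim.SGD([logits], lr=0.5)

for step in range(200):
    probs = torch.softmax(logits, dim=0)
    actions = torch.multinomial(probs, num_samples=8, replacement=True)
    rewards = true_reward[actions]
    baseline = rewards.mean()                        # group-mean baseline
    advantages = rewards - baseline
    logp = torch.log(probs[actions])
    loss = -(advantages.detach() * logp).mean()      # REINFORCE objective
    opt.zero_grad()
    loss.backward()
    opt.step()

print(torch.softmax(logits, dim=0))                  # should concentrate on action 2
```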
@RuoyuSun_UI
Ruoyu Sun
7 months
Will be at the fine-tuning workshop, with an oral presentation of GEM, a diversity-preserving SFT method. It is a game formulation, with one player using an RpGAN loss and the other player using an entropy-regularized loss. It improves code pass rate by 7 points on Llama3.1-8B. @ZiniuLi @chcoli
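The sketch below shows only the entropy-regularization ingredient mentioned above, added on top of a standard cross-entropy SFT loss; it is not the actual GEM game objective. The helper name, `tau`, and the toy shapes are made up.

```python
# Minimal sketch: cross-entropy SFT loss with an entropy bonus (illustrative only).
import torch
import torch.nn.functional as F

def entropy_regularized_sft_loss(logits, labels, tau=0.1, ignore_index=-100):
    # logits: (batch, seq_len, vocab); labels: (batch, seq_len)
    vocab = logits.size(-1)
    ce = F.cross_entropy(logits.view(-1, vocab), labels.view(-1),
                         ignore_index=ignore_index)
    logp = F.log_softmax(logits, dim=-1)
    entropy = -(logp.exp() * logp).sum(dim=-1)        # per-token entropy
    mask = (labels != ignore_index).float()
    mean_entropy = (entropy * mask).sum() / mask.sum().clamp(min=1.0)
    return ce - tau * mean_entropy                    # higher entropy is rewarded

# Toy usage with random tensors standing in for LM logits and target tokens.
logits = torch.randn(2, 5, 11, requires_grad=True)
labels = torch.randint(0, 11, (2, 5))
loss = entropy_regularized_sft_loss(logits, labels)
loss.backward()
```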
@RuoyuSun_UI
Ruoyu Sun
7 months
#NeurIPS2024 I will present "Why Transformers Need Adam: A Hessian Perspective". East Exhibit Hall A-C #4803, Thursday 11am-2pm. In short: the Transformer's Hessian has heterogeneous blocks, which makes SGD perform poorly. Will also be at the FTML (fine-tuning) workshop on Saturday.
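As a rough illustration of "heterogeneous Hessian blocks" (not the paper's measurement), the sketch below estimates the largest-magnitude Hessian eigenvalue restricted to each parameter block of a tiny made-up model, via power iteration on Hessian-vector products.

```python
# Per-parameter-block Hessian eigenvalue estimates on a toy classifier.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 5))
X, y = torch.randn(64, 10), torch.randint(0, 5, (64,))

def loss_fn():
    return F.cross_entropy(model(X), y)

def top_block_eigenvalue(param, iters=30):
    # Power iteration on the diagonal Hessian block of `param` using HVPs.
    v = torch.randn_like(param)
    v /= v.norm()
    for _ in range(iters):
        (g,) = torch.autograd.grad(loss_fn(), param, create_graph=True)
        (hv,) = torch.autograd.grad((g * v).sum(), param)
        v = hv / (hv.norm() + 1e-12)
    (g,) = torch.autograd.grad(loss_fn(), param, create_graph=True)
    (hv,) = torch.autograd.grad((g * v).sum(), param)
    return (v * hv).sum().item()      # Rayleigh quotient at the converged direction

for name, p in model.named_parameters():
    print(f"{name:12s} top block-Hessian eigenvalue ≈ {top_block_eigenvalue(p):.3f}")
```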
@RuoyuSun_UI
Ruoyu Sun
7 months
Heading to Vancouver for #NeurIPS2024 and staying all week. Will be happy to chat about LLM optimization, LLM theory, LLM applications, etc. Don't hesitate to reach out (DM or Whova)!
@RuoyuSun_UI
Ruoyu Sun
10 months
Excited to share our paper "Why Transformers Need Adam: A Hessian Perspective", accepted at @NeurIPSConf. Intriguing question: Adam significantly outperforms SGD on Transformers, including LLM training (Fig 1). Why? Our explanation: 1) Transformer's block-Hessians are …
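An illustrative, made-up comparison in the spirit of the Adam-vs-SGD gap mentioned above (not the paper's experiments): train the same tiny transformer to memorize one fixed random batch with SGD and with Adam and compare the final losses. Architecture, data, learning rates, and step counts are all arbitrary.

```python
# Toy Adam-vs-SGD comparison on a tiny transformer memorizing a fixed random batch.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
V, D, T, B = 100, 32, 16, 64          # vocab, model dim, seq len, batch size
x = torch.randint(0, V, (B, T))       # one fixed random batch to memorize

class TinyTransformerLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(V, D)
        self.block = nn.TransformerEncoderLayer(
            d_model=D, nhead=4, dim_feedforward=64, batch_first=True)
        self.head = nn.Linear(D, V)

    def forward(self, tokens):
        return self.head(self.block(self.emb(tokens)))

def train(model, opt, steps=300):
    # Next-token loss on the fixed batch; both runs start from identical weights.
    for _ in range(steps):
        logits = model(x[:, :-1])
        loss = F.cross_entropy(logits.reshape(-1, V), x[:, 1:].reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()

base = TinyTransformerLM()
m_sgd, m_adam = copy.deepcopy(base), copy.deepcopy(base)
print("SGD  final loss:", train(m_sgd, torch.optim.SGD(m_sgd.parameters(), lr=0.1)))
print("Adam final loss:", train(m_adam, torch.optim.Adam(m_adam.parameters(), lr=1e-3)))
```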