Ruoyu Sun Profile
Ruoyu Sun

@RuoyuSun_UI

Followers: 1K · Following: 280 · Media: 34 · Statuses: 101

Associate Prof at CUHK-Shenzhen. Prev: assistant prof @UofIllinois; postdoc @Stanford; visitor @AIatMeta. Works on optimization for machine learning, DL, and LLMs.

Shenzhen, China
Joined December 2010
@RuoyuSun_UI
Ruoyu Sun
2 months
Neural nets' Hessians are often **nearly block diagonal**! There is little understanding of when and why this happens. We provide one of the first theoretical analyses, using random matrix theory. Somewhat unexpectedly, we find that one primary driver is a large number of classes C.
@yushun_zzz
Yushun Zhang
2 months
New paper alert! We report that the Hessian of NNs has a very special structure: 1) it appears to be a "block-diagonal-block-circulant" matrix at initialization; 2) it then quickly evolves into a "near-block-diagonal" matrix during training. We then theoretically reveal two …
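To make the "near-block-diagonal" claim concrete, here is a minimal sketch (not the paper's analysis): it computes the full Hessian of the output layer of a tiny made-up classifier and measures how much of its squared mass sits in the C diagonal blocks, one per output class. All sizes and names are illustrative.

```python
# Illustrative sketch only: Hessian of a tiny classifier's output layer,
# viewed as a C x C grid of H x H blocks (one block per pair of output neurons).
import torch
import torch.nn.functional as F

torch.manual_seed(0)
D, H, C, N = 8, 6, 5, 32            # input dim, hidden dim, #classes, #samples
X = torch.randn(N, D)
y = torch.randint(0, C, (N,))

W1 = torch.randn(H, D) * 0.3        # fixed hidden-layer weights
W2 = torch.randn(C, H) * 0.3        # output-layer weights (one row per class)

def loss_fn(w2_flat):
    # Loss as a function of the output-layer weights only, so the Hessian has
    # a natural C x C block structure.
    w2 = w2_flat.view(C, H)
    logits = torch.relu(X @ W1.t()) @ w2.t()
    return F.cross_entropy(logits, y)

Hss = torch.autograd.functional.hessian(loss_fn, W2.reshape(-1))  # (C*H, C*H)

# Fraction of squared Hessian mass inside the C diagonal blocks.
blocks = Hss.reshape(C, H, C, H)
diag_mass = sum(blocks[c, :, c, :].pow(2).sum() for c in range(C))
total_mass = Hss.pow(2).sum()
print(f"diagonal-block mass fraction: {(diag_mass / total_mass).item():.3f}")
```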
@RuoyuSun_UI
Ruoyu Sun
3 months
Two bits to share after watching the ICLR Test of Time Award speech: 1️⃣ The Adam paper was initially rejected at ICLR 2015!! The authors successfully appealed (how many would have given up?). This reminded me of the DFP algorithm, whose first version was also initially rejected. 2️⃣ Honored to …
@RuoyuSun_UI
Ruoyu Sun
3 months
Come to our #ICLR25 posters today on GEM and Adam-mini if you are interested in LLM training! I will unfortunately miss the first few days and join for the workshops. 1) GEM, in the paper "Preserving Diversity in Supervised Fine-Tuning of Large Language Models". It shows how to improve …
@RuoyuSun_UI
Ruoyu Sun
4 months
Glad to see the community discussing the effect of beta1 and beta2 in Adam! Similar to this post, we ran a sweep over beta1 and beta2 in our NeurIPS'22 paper. Main findings: 1) Our paper uncovered a striking phase transition in the (β₁, β₂) plane –
@cloneofsimo
Simo Ryu
4 months
I swept batch size vs beta1 vs beta2 vs lr and plotted the optimal lr and batch size. And what the actual fuck, man.
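A toy sketch of the kind of (β₁, β₂) sweep discussed above, using `torch.optim.Adam` on a made-up quadratic objective. The grid, learning rate, and step budget are illustrative only, not the setup of the NeurIPS'22 paper or of the quoted sweep.

```python
# Toy (beta1, beta2) sweep for Adam on an arbitrary quadratic objective.
import itertools
import torch

torch.manual_seed(0)
A = torch.randn(20, 20)
A = A @ A.t() + 0.1 * torch.eye(20)   # a PSD matrix with mixed curvature scales
b = torch.randn(20)

def final_loss(beta1, beta2, lr=1e-2, steps=500):
    x = torch.zeros(20, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr, betas=(beta1, beta2))
    for _ in range(steps):
        opt.zero_grad()
        loss = 0.5 * x @ A @ x - b @ x
        loss.backward()
        opt.step()
    with torch.no_grad():
        return (0.5 * x @ A @ x - b @ x).item()

for b1, b2 in itertools.product([0.0, 0.5, 0.9], [0.9, 0.99, 0.999]):
    print(f"beta1={b1:.1f} beta2={b2:.3f} -> final loss {final_loss(b1, b2):.4f}")
```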
@RuoyuSun_UI
Ruoyu Sun
5 months
If you are interested in GRPO by DeepSeek, you might want to try ReMax, which is quite related to GRPO but performs better than GRPO. (ReMax was released in Oct 2023, and GRPO was released in Feb 2024.) Similarity (see 1st image): both are variants of REINFORCE; both add …
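Both methods are described above as REINFORCE variants. The sketch below is a generic REINFORCE-with-baseline update on a toy bandit-style policy, not ReMax or GRPO themselves; roughly, the group-mean baseline used here is GRPO-flavored, and swapping in the reward of a greedy rollout would be closer to ReMax's baseline. The policy and reward are made up for illustration.

```python
# Generic REINFORCE with a baseline on a toy 4-action policy.
import torch

torch.manual_seed(0)
logits = torch.zeros(4, requires_grad=True)         # policy parameters
true_reward = torch.tensor([0.1, 0.2, 0.9, 0.3])    # hidden reward per action
opt = torch.optim.SGD([logits], lr=0.5)

for step in range(200):
    probs = torch.softmax(logits, dim=0)
    actions = torch.multinomial(probs, num_samples=8, replacement=True)
    rewards = true_reward[actions]
    baseline = rewards.mean()                        # group-mean baseline
    advantages = rewards - baseline
    logp = torch.log(probs[actions])
    loss = -(advantages.detach() * logp).mean()      # REINFORCE objective
    opt.zero_grad()
    loss.backward()
    opt.step()

print(torch.softmax(logits, dim=0))                  # should concentrate on action 2
```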
@RuoyuSun_UI
Ruoyu Sun
7 months
Will be at the fine-tuning workshop, with an oral presentation of GEM, a diversity-preserving SFT method. It is a game formulation, with one player using an RpGAN loss and the other player using an entropy-regularized loss. It improves code pass rate by 7 points on Llama3.1-8B. @ZiniuLi @chcoli
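The sketch below shows only the entropy-regularization ingredient mentioned above, added on top of a standard cross-entropy SFT loss; it is not the actual GEM game objective. The helper name, `tau`, and the toy shapes are made up.

```python
# Minimal sketch: cross-entropy SFT loss with an entropy bonus (illustrative only).
import torch
import torch.nn.functional as F

def entropy_regularized_sft_loss(logits, labels, tau=0.1, ignore_index=-100):
    # logits: (batch, seq_len, vocab); labels: (batch, seq_len)
    vocab = logits.size(-1)
    ce = F.cross_entropy(logits.view(-1, vocab), labels.view(-1),
                         ignore_index=ignore_index)
    logp = F.log_softmax(logits, dim=-1)
    entropy = -(logp.exp() * logp).sum(dim=-1)        # per-token entropy
    mask = (labels != ignore_index).float()
    mean_entropy = (entropy * mask).sum() / mask.sum().clamp(min=1.0)
    return ce - tau * mean_entropy                    # higher entropy is rewarded

# Toy usage with random tensors standing in for LM logits and target tokens.
logits = torch.randn(2, 5, 11, requires_grad=True)
labels = torch.randint(0, 11, (2, 5))
loss = entropy_regularized_sft_loss(logits, labels)
loss.backward()
```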
@RuoyuSun_UI
Ruoyu Sun
7 months
#NeurIPS2024 I will present "Why Transformers Need Adam: A Hessian Perspective". East Exhibit Hall A-C #4803, Thursday 11am-2pm. In short: the Transformer's Hessian has heterogeneous blocks, which makes SGD perform poorly. Will also be at the FTML (fine-tuning) workshop on Saturday.
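As a rough illustration of "heterogeneous Hessian blocks" (not the paper's measurement), the sketch below estimates the largest-magnitude Hessian eigenvalue restricted to each parameter block of a tiny made-up model, via power iteration on Hessian-vector products.

```python
# Per-parameter-block Hessian eigenvalue estimates on a toy classifier.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 5))
X, y = torch.randn(64, 10), torch.randint(0, 5, (64,))

def loss_fn():
    return F.cross_entropy(model(X), y)

def top_block_eigenvalue(param, iters=30):
    # Power iteration on the diagonal Hessian block of `param` using HVPs.
    v = torch.randn_like(param)
    v /= v.norm()
    for _ in range(iters):
        (g,) = torch.autograd.grad(loss_fn(), param, create_graph=True)
        (hv,) = torch.autograd.grad((g * v).sum(), param)
        v = hv / (hv.norm() + 1e-12)
    (g,) = torch.autograd.grad(loss_fn(), param, create_graph=True)
    (hv,) = torch.autograd.grad((g * v).sum(), param)
    return (v * hv).sum().item()      # Rayleigh quotient at the converged direction

for name, p in model.named_parameters():
    print(f"{name:12s} top block-Hessian eigenvalue ≈ {top_block_eigenvalue(p):.3f}")
```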
@RuoyuSun_UI
Ruoyu Sun
7 months
Heading to Vancouver for #NeurIPS2024 and staying all week. Will be happy to chat about LLM optimization, LLM theory, LLM applications, etc. Don't hesitate to reach out (DM or Whova)!
@RuoyuSun_UI
Ruoyu Sun
10 months
Excited to share our paper "Why Transformers Need Adam: A Hessian Perspective", accepted at @NeurIPSConf. Intriguing question: Adam significantly outperforms SGD on Transformers, including LLM training (Fig 1). Why? Our explanation: 1) Transformer's block-Hessians are …
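An illustrative, made-up comparison in the spirit of the Adam-vs-SGD gap mentioned above (not the paper's experiments): train the same tiny transformer to memorize one fixed random batch with SGD and with Adam and compare the final losses. Architecture, data, learning rates, and step counts are all arbitrary.

```python
# Toy Adam-vs-SGD comparison on a tiny transformer memorizing a fixed random batch.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
V, D, T, B = 100, 32, 16, 64          # vocab, model dim, seq len, batch size
x = torch.randint(0, V, (B, T))       # one fixed random batch to memorize

class TinyTransformerLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(V, D)
        self.block = nn.TransformerEncoderLayer(
            d_model=D, nhead=4, dim_feedforward=64, batch_first=True)
        self.head = nn.Linear(D, V)

    def forward(self, tokens):
        return self.head(self.block(self.emb(tokens)))

def train(model, opt, steps=300):
    # Next-token loss on the fixed batch; both runs start from identical weights.
    for _ in range(steps):
        logits = model(x[:, :-1])
        loss = F.cross_entropy(logits.reshape(-1, V), x[:, 1:].reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()

base = TinyTransformerLM()
m_sgd, m_adam = copy.deepcopy(base), copy.deepcopy(base)
print("SGD  final loss:", train(m_sgd, torch.optim.SGD(m_sgd.parameters(), lr=0.1)))
print("Adam final loss:", train(m_adam, torch.optim.Adam(m_adam.parameters(), lr=1e-3)))
```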