
Ruoyu Sun
@RuoyuSun_UI
Followers: 1K · Following: 280 · Media: 34 · Statuses: 101
Associate Prof at CUHK-Shenzhen. Prev: Assistant Prof @UofIllinois; postdoc @Stanford; visitor @AIatMeta. Works on optimization for machine learning, DL, and LLMs.
Shenzhen, China
Joined December 2010
Neural nets' Hessians are often **nearly block diagonal**! There is little understanding of when and why this happens. We provide one of the first theoretical analyses, using random matrix theory. Somewhat unexpectedly, we find that one primary driver is a large number of classes C.
New paper alert! We report that the Hessian of NNs has a very special structure: 1) it appears to be a "block-diagonal-block-circulant" matrix at initialization; 2) it then quickly evolves into a "near-block-diagonal" matrix during training. We then theoretically reveal two
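The near-block-diagonal claim can be probed directly on a tiny model. Below is a minimal PyTorch sketch (not from the paper; the model size, data, and per-class grouping of parameters are illustrative assumptions) that computes the full Hessian of a small linear classifier and measures how much of its mass sits in the diagonal blocks.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny C-class linear classifier; sizes are arbitrary illustrative choices.
d, C, n = 5, 4, 64
X = torch.randn(n, d)
y = torch.randint(0, C, (n,))

def loss_from_flat(w):
    # Rebuild weight (C x d) and bias (C) from one flat parameter vector.
    W = w[: d * C].reshape(C, d)
    b = w[d * C:]
    return nn.functional.cross_entropy(X @ W.T + b, y)

w0 = torch.randn(d * C + C)
H = torch.autograd.functional.hessian(loss_from_flat, w0)  # (24, 24) here

# Group parameters by output class: weight row c plus bias c form one block.
groups = [list(range(c * d, (c + 1) * d)) + [d * C + c] for c in range(C)]
diag_mass = sum(H[g][:, g].abs().sum() for g in groups)
ratio = (diag_mass / H.abs().sum()).item()
print(f"fraction of |H| mass in diagonal blocks: {ratio:.2f}")
```

For a single linear layer the blocks here correspond to output classes; how small the off-diagonal mass actually is depends on the data and on the number of classes C, which is exactly the kind of question the paper's random-matrix analysis addresses.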
Come to our #ICLR25 posters today on GEM and Adam-mini if you are interested in LLM training! I will unfortunately miss the first few days and join the workshops. 1) GEM, in the paper "Preserving Diversity in Supervised Fine-Tuning of Large Language Models". It shows how to improve
Glad to see the community discussing the effect of beta1 and beta2 in Adam! Similar to this post, we ran a sweep over beta1 and beta2 in our NeurIPS'22 paper. Main findings: 1) Our paper uncovered a striking phase transition in the (β₁, β₂) plane –
I swept batch size vs beta1 vs beta2 vs lr and plotted the optimal lr and batch size: And what the actual fuck, man.
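For readers who want to run this kind of sweep themselves, here is a minimal sketch of a (β₁, β₂) grid search with Adam on a toy regression problem. The problem, grid, learning rate, and step count are placeholder assumptions, not the setup from either the NeurIPS'22 paper or the quoted post.

```python
import itertools
import torch

torch.manual_seed(0)

# Toy linear regression standing in for the real training runs (assumption).
X = torch.randn(256, 10)
w_true = torch.randn(10, 1)
y = X @ w_true + 0.1 * torch.randn(256, 1)

def final_loss(beta1, beta2, lr=1e-2, steps=500):
    # Train a fresh parameter vector with Adam at the given (beta1, beta2).
    w = torch.zeros(10, 1, requires_grad=True)
    opt = torch.optim.Adam([w], lr=lr, betas=(beta1, beta2))
    for _ in range(steps):
        opt.zero_grad()
        loss = ((X @ w - y) ** 2).mean()
        loss.backward()
        opt.step()
    return loss.item()

# Coarse (beta1, beta2) grid; a real sweep would be denser, cover lr and
# batch size too, and average over several seeds.
for b1, b2 in itertools.product([0.0, 0.5, 0.9], [0.9, 0.99, 0.999]):
    print(f"beta1={b1:.1f} beta2={b2:.3f} -> final loss {final_loss(b1, b2):.4f}")
```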
#NeurIPS2024 I will present "Why Transformers Need Adam: A Hessian Perspective". East Exhibit Hall A-C #4803, Thursday 11am-2pm. In short: the Transformer's Hessian has heterogeneous blocks, which makes SGD perform poorly. Will also be at the FTML (fine-tuning) workshop on Saturday.
Heading to Vancouver for #NeurIPS2024 and staying all week. Happy to chat about LLM optimization, LLM theory, LLM applications, etc. Don't hesitate to reach out (DM or Whova)!
Excited to share our paper "Why Transformers Need Adam: A Hessian Perspective", accepted at @NeurIPSConf. Intriguing question: Adam significantly outperforms SGD on Transformers, including LLM training (Fig 1). Why? Our explanation: 1) Transformer's block-Hessians are
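As a rough illustration of the Adam-vs-SGD comparison, here is a self-contained PyTorch sketch that trains the same tiny Transformer from the same initialization with SGD and with Adam on random token data. The model, data, step counts, and learning rates are all illustrative assumptions; whether the gap shows up on a toy this small can vary, and the paper's evidence comes from real Transformer/LLM training.

```python
import torch
import torch.nn as nn

vocab, d_model, n_cls, seq = 50, 32, 4, 16

class TinyTransformer(nn.Module):
    # Toy stand-in for a real Transformer (illustrative assumption).
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4,
                                           dim_feedforward=64, batch_first=True)
        self.enc = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_cls)

    def forward(self, x):
        h = self.enc(self.emb(x))          # (batch, seq, d_model)
        return self.head(h.mean(dim=1))    # mean-pool tokens, then classify

torch.manual_seed(0)
X = torch.randint(0, vocab, (128, seq))
y = torch.randint(0, n_cls, (128,))

def train(opt_name, lr, steps=200):
    torch.manual_seed(1)                   # identical init for both optimizers
    model = TinyTransformer()
    opt = (torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
           if opt_name == "sgd"
           else torch.optim.Adam(model.parameters(), lr=lr))
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(model(X), y)
        loss.backward()
        opt.step()
    return loss.item()

print("SGD  final loss:", train("sgd", 1e-2))
print("Adam final loss:", train("adam", 1e-3))
```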