Atli Kosson
@AtliKosson
Followers: 406 · Following: 558 · Media: 19 · Statuses: 67
PhD student at @EPFL 🇨🇭 working on improved understanding of deep neural networks and their optimization.
Lausanne, Switzerland
Joined July 2022
Why does AdamW outperform Adam with L2-regularization? Its effectiveness seems to stem from how it affects the angular update size of weight vectors! This may also be the case for Weight Standardization, LR warmup, and weight decay in general! 🧵 for https://t.co/D8i8u3fSsd 1/10
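For reference, here is a simplified sketch of the two optimizers being compared (standard textbook updates, no schedules or amsgrad); the key difference is where the decay term enters, which is what changes the relative, i.e. angular, update size of each weight vector:

```python
import torch

def adam_l2_step(w, grad, m, v, lr, wd, t, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam + L2 regularization: the decay term is added to the gradient,
    so it passes through Adam's per-coordinate normalization."""
    g = grad + wd * w                       # L2 term enters the moments
    m.mul_(beta1).add_(g, alpha=1 - beta1)
    v.mul_(beta2).addcmul_(g, g, value=1 - beta2)
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w -= lr * m_hat / (v_hat.sqrt() + eps)
    return w

def adamw_step(w, grad, m, v, lr, wd, t, beta1=0.9, beta2=0.999, eps=1e-8):
    """AdamW: decoupled weight decay shrinks the weights directly,
    bypassing the normalization, which gives direct control over the
    relative size of each step."""
    w *= 1 - lr * wd                        # decoupled decay
    m.mul_(beta1).add_(grad, alpha=1 - beta1)
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w -= lr * m_hat / (v_hat.sqrt() + eps)
    return w

# Example usage (shapes are arbitrary):
# w = torch.randn(256, 256); m = torch.zeros_like(w); v = torch.zeros_like(w)
# w = adamw_step(w, grad=torch.randn_like(w), m=m, v=v, lr=1e-3, wd=0.1, t=1)
```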
Tagging a couple of people who worked on µP before and might find this interesting: @TheGregYang @jxbz @xidulu @laurence_ai @_katieeverett @Mitchnw @thecharlieblake @ShaneBergsma @DeyNolan @QuanquanGu @cloneofsimo @SeunghyunSEO7
See the full paper for additional details and insights about LR transfer in practice! Paper link: https://t.co/49fUImVsaX Grateful for the opportunity to work on this project during my internship at Amazon FAR with @jerwelborn @largelymfs @peterxichen! 🧵8/8
So what's µP really doing? It's creating an implicit learning rate warmup! When µP is combined with independent WD, updates start proportionally smaller before reaching the same long-term behavior. We can get similar benefits without µP by adding an exponential LR warmup! 🧵7/8
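Purely as an illustration of what such a warmup could look like (the exact shape, time constant, and interaction with the rest of the schedule in the paper may differ; the constants below are placeholders):

```python
import math

def exp_warmup_lr(step, base_lr=3e-4, tau=1000.0):
    """Exponential warmup: the LR rises from ~0 toward base_lr with time
    constant tau, mimicking the proportionally smaller early updates that
    muP + independent weight decay produce implicitly, while leaving the
    long-run LR unchanged."""
    return base_lr * (1.0 - math.exp(-step / tau))
```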
Why override µP? Because its core assumptions only hold very early in training! In practice wide models quickly stop being more sensitive to weight updates than smaller models! This is caused by changes in the geometric alignment of updates and layer inputs over training. 🧵6/8
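A rough way to see why alignment matters (a back-of-the-envelope sketch, not the paper's exact analysis): take one layer with input x of width n and a rank-one weight update ΔW = u vᵀ, the shape produced by a single-example gradient step.

```latex
\Delta(Wx) = \Delta W\, x = u\,(v^\top x),
\qquad
\|\Delta W\, x\| = \|u\|\,\|v\|\,\|x\|\,\lvert\cos\angle(v, x)\rvert .
```

If v is strongly aligned with the layer input (as tends to happen right after initialization, when the update was computed on very similar inputs), the cosine factor is Θ(1); if v is essentially uncorrelated with x, it is about 1/√n. The induced feature change differs by roughly a factor of √n between the two regimes, so how sensitive a wide layer is to a weight update, and hence how the LR should scale with width, depends on an alignment that can drift over training.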
Surprisingly, independent WD works because it overrides µP's scaling! µP makes the updates proportionally smaller for wider models, but independent WD eventually makes them equally large across widths. This turns out to be exactly what's needed for stable feature learning! 🧵5/8
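A heuristic way to see this, in the spirit of the angular-update analysis from the AdamW thread above (a back-of-the-envelope equilibrium argument assuming Adam-normalized updates of per-coordinate size ≈ 1 that are roughly orthogonal to a d-dimensional weight vector, not a result quoted from the paper):

```latex
\|w_{t+1}\|^2 \approx (1-\delta)^2\,\|w_t\|^2 + \eta^2 d
\;\Longrightarrow\;
\|w_\ast\|^2 \approx \frac{\eta^2 d}{2\delta},
\qquad
\frac{\eta\sqrt{d}}{\|w_\ast\|} \approx \sqrt{2\delta},
```

where η is the (µP-scaled) LR and δ is the fraction of the weights removed by decay each step. With the standard scaled decay, δ = η×WD, so the equilibrium relative update size ≈ √(2·η·WD) and shrinks as µP shrinks η with width. With independent decay, δ = WD, giving ≈ √(2·WD): the same relative update size at every width.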
AdamW has two common formulations of weight decay. They are usually equivalent but behave differently when µP scales the LR:
Standard WD: multiply weights by (1 - LR×WD)
Independent WD: multiply weights by (1 - WD)
Independent WD is essential for good transfer—but why? 🧵4/8
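A minimal sketch of the two formulations (Adam's moment updates omitted; names are illustrative, and lr is deliberately unused in the independent variant):

```python
def decay_standard(w, lr, wd):
    """Standard (scaled) weight decay, as in torch.optim.AdamW:
    the per-step shrinkage is lr * wd, so when muP rescales the LR
    with width it rescales the decay too."""
    return w * (1 - lr * wd)

def decay_independent(w, lr, wd):
    """Independent weight decay: the per-step shrinkage is wd itself,
    unaffected by how muP rescales the LR across widths."""
    return w * (1 - wd)
```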
µP takes in a base LR and scales it to keep the size of internal feature updates stable across model widths. It assumes wider models are more sensitive, requiring proportionally smaller weight updates to achieve a given feature change, so µP scales down the LR with width. 🧵3/8
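As a simplified illustration for hidden weight matrices under Adam-style optimizers (µP's full recipe also adjusts initialization scales and output multipliers; this shows just the LR part, with `base_width` being the width at which the LR was tuned):

```python
def mup_hidden_lr(base_lr, base_width, width):
    """Scale the tuned base LR down as the model gets wider, so the size
    of the induced feature updates stays roughly constant across widths."""
    return base_lr * base_width / width
```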
Getting the LR right is crucial for efficient training, but tuning it directly on large models is often prohibitively expensive! LR transfer lets you tune on small models and apply to large ones. The Maximal Update Parameterization (µP) is the go-to method for this. 🧵2/8
The Maximal Update Parameterization (µP) allows LR transfer from small to large models, saving costly tuning. But why is independent weight decay (IWD) essential for it to work? We find µP stabilizes early training (like an LR warmup), but IWD takes over in the long term! 🧵
I would like to issue a citation request for Muon to the following newly released paper from Microsoft Research: Ma et al. (2024). SWAN: Preprocessing SGD Enables Adam-Level Performance On LLM Training With Significant Memory Reduction. https://t.co/fJOFaMTLES 1/5
arxiv.org
Adaptive optimizers such as Adam (Kingma & Ba, 2015) have been central to the success of large language models. However, they often require to maintain optimizer states throughout training, which...
I'm also at ICML -- excited to present our paper on training + LR schedules as a spotlight (!) at the workshop on the next generation of sequence models, as well as at ES-FOMO on Friday 🤙 Reach out to discuss methods for training open models, scaling, efficiency, or the future of architectures :)
Why exactly do we train LLMs with the cosine schedule, still?🤔 Maybe we do not actually have to -- and that would come with a lot of benefits :) 🧵Our paper on LR schedules, compute-optimality and more affordable scaling laws
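For reference, the schedule being questioned here, in a minimal form (linear warmup followed by a single cosine decay); this is just the standard recipe, not anything proposed in the quoted paper:

```python
import math

def cosine_lr(step, total_steps, base_lr, warmup_steps=0, min_lr=0.0):
    """Typical LLM pretraining schedule: linear warmup, then one cosine
    decay from base_lr to min_lr over the remaining steps."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

One constraint visible here is that `total_steps` must be fixed up front, tying each run to a single training duration, which is one reason schedule choice interacts with compute-optimality and scaling-law experiments.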
Workshop paper link https://t.co/5pGSDC7M20. Joint work with @BeMePhD (also at ICML) and Prof Martin Jaggi.
openreview.net
Learning Rate Warmup is a popular heuristic for training neural networks, which downscales early updates relative to later ones. This aids training, suggesting that the initial updates are too...
Do we really need learning rate warmup for GPT pre-training? We find that better normalization of the updates can be a sufficient alternative! We are presenting our preliminary results at the #ICML2024 HiLD workshop poster sessions on Friday (10-11 am, 3:30-4:45 pm in Straus 2).
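The workshop paper has the actual scheme; purely to illustrate the general idea of update normalization, one simple variant is to cap each weight matrix's relative update size, which suppresses the oversized early steps that warmup otherwise guards against (the cap value below is an arbitrary placeholder):

```python
import torch

def normalize_update(delta_w, w, max_rel=0.01, eps=1e-8):
    """Illustrative update normalization (not necessarily the paper's method):
    rescale the proposed update so its norm is at most max_rel times the
    current weight norm, bounding the relative change made in a single step."""
    rel = delta_w.norm() / (w.norm() + eps)
    if rel > max_rel:
        delta_w = delta_w * (max_rel / rel)
    return delta_w
```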
Here is the ICML event link https://t.co/3cd0IMEFMJ and a previous thread with more details https://t.co/gwCMXl5Dwm. Joint work with @BeMePhD who is also here at ICML and Prof Martin Jaggi.
We’re at #ICML2024 presenting our study of weight decay, perhaps the most widely misunderstood method in deep learning. Drop by poster 304 on Thu morning to discuss how weight decay benefits training, why it outperforms L2 regularization for Adam, and how it can be replaced!
New paper and pip package: modula: "Scalable Optimization in the Modular Norm" 📦 https://t.co/ztWVPShp1p 📝 https://t.co/UnVL9iY8kB We re-wrote the @pytorch module tree so that training automatically scales across width and depth.
Introducing DBRX: A New Standard for Open LLMs 🔔 https://t.co/0HpI6Sdv6J 💻 DBRX is a 16x 12B MoE LLM trained on 📜 12T tokens 🧠 DBRX sets a new standard for open LLMs, outperforming established models on various benchmarks. Is this thread mostly written by DBRX? Yes! 🧵
A tweak in the architecture of #Transformers can significantly boost accuracy! With direct access to all previous blocks' outputs, a 48-block #DenseFormer outperforms a 72-block Transformer, with faster inference! A work with @akmohtashami_a, @francoisfleuret, Martin Jaggi. 1/🧵
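A rough sketch of the mechanism as described in the tweet, with every block fed a learned weighted combination of all earlier representations (module and parameter names are illustrative, not the paper's code):

```python
import torch
import torch.nn as nn

class DepthWeightedModel(nn.Module):
    """Wraps a stack of blocks so each one has direct access to the outputs
    of all previous blocks (and the embeddings) via learned mixing weights."""

    def __init__(self, blocks: nn.ModuleList):
        super().__init__()
        self.blocks = blocks
        self.mix_weights = nn.ParameterList()
        for i in range(len(blocks)):
            w = torch.zeros(i + 2)   # embeddings + the i+1 block outputs so far
            w[-1] = 1.0              # initialize as a plain residual stream
            self.mix_weights.append(nn.Parameter(w))

    def forward(self, x):
        reps = [x]  # depth 0: the embedded input
        for block, w in zip(self.blocks, self.mix_weights):
            reps.append(block(reps[-1]))
            # Replace the newest representation with a weighted mix of all of them.
            stacked = torch.stack(reps, dim=0)
            mix = w.view(-1, *([1] * (stacked.dim() - 1)))
            reps[-1] = (mix * stacked).sum(dim=0)
        return reps[-1]
```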
We hope this shows there is a lot more to weight decay than most realize! Joint work with Bettina Messmer and Martin Jaggi; full paper link:
arxiv.org
This study investigates how weight decay affects the update behavior of individual neurons in deep neural networks through a combination of applied analysis and experimentation. Weight decay can...