Atli Kosson Profile
Atli Kosson

@AtliKosson

Followers
406
Following
558
Media
19
Statuses
67

PhD student at @EPFL 🇨🇭 working on improved understanding of deep neural networks and their optimization.

Lausanne, Switzerland
Joined July 2022
@AtliKosson
Atli Kosson
2 years
Why does AdamW outperform Adam with L2-regularization? Its effectiveness seems to stem from how it affects the angular update size of weight vectors! This may also be the case for Weight Standardization, LR warmup and weight decay in general! 🧵 for https://t.co/D8i8u3fSsd 1/10
4
44
206
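For readers new to the distinction this thread builds on, here is a minimal sketch (in NumPy, not the author's code) of Adam with L2 regularization versus AdamW's decoupled weight decay; the hyperparameter names and defaults are illustrative.

```python
import numpy as np

def adam_l2_step(w, g, m, v, lr=1e-3, beta1=0.9, beta2=0.999,
                 eps=1e-8, wd=1e-2, t=1):
    # Adam + L2: the decay term is folded into the gradient,
    # so it also passes through the adaptive 1/sqrt(v) scaling.
    g = g + wd * w
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

def adamw_step(w, g, m, v, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, wd=1e-2, t=1):
    # AdamW: weight decay is applied directly to the weights,
    # bypassing the adaptive scaling entirely.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * wd * w
    return w, m, v
```

The only difference is where the decay term enters: folded into the gradient (and hence rescaled by the adaptive denominator) versus applied directly to the weights, which is what the thread ties to angular update sizes.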
@AtliKosson
Atli Kosson
3 days
Tagging a couple of people who worked on µP before and might find this interesting: @TheGregYang @jxbz @xidulu @laurence_ai @_katieeverett @Mitchnw @thecharlieblake @ShaneBergsma @DeyNolan @QuanquanGu @cloneofsimo @SeunghyunSEO7
0
1
18
@AtliKosson
Atli Kosson
3 days
See the full paper for additional details and insights about LR transfer in practice! Paper link: https://t.co/49fUImVsaX Grateful for the opportunity to work on this project during my internship at Amazon FAR with @jerwelborn @largelymfs @peterxichen! 🧵8/8
2
7
34
@AtliKosson
Atli Kosson
3 days
So what's µP really doing? It's creating an implicit learning rate warmup! When µP is combined with independent WD, updates start proportionally smaller before reaching the same long-term behavior. We can get similar benefits without µP by adding an exponential LR warmup! 🧵7/8
2
3
27
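As a rough illustration of the exponential LR warmup mentioned here (one plausible parameterization, not necessarily the paper's exact schedule):

```python
import math

def exponential_warmup_lr(step, peak_lr, warmup_timescale):
    """Illustrative exponential warmup: the LR rises from 0 toward
    peak_lr with timescale `warmup_timescale` (in steps)."""
    return peak_lr * (1.0 - math.exp(-step / warmup_timescale))
```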
@AtliKosson
Atli Kosson
3 days
Why override µP? Because its core assumptions only hold very early in training! In practice wide models quickly stop being more sensitive to weight updates than smaller models! This is caused by changes in the geometric alignment of updates and layer inputs over training. 🧵6/8
2
10
61
@AtliKosson
Atli Kosson
3 days
Surprisingly, independent WD works because it overrides µP's scaling! µP makes the updates proportionally smaller for wider models, but independent WD eventually makes them equally large across widths. This turns out to be exactly what's needed for stable feature learning! 🧵5/8
1
3
25
@AtliKosson
Atli Kosson
3 days
AdamW has two common formulations of weight decay. They are usually equivalent but behave differently when µP scales the LR:
Standard WD: multiply weights by (1 - LR×WD)
Independent WD: multiply weights by (1 - WD)
Independent WD is essential for good transfer, but why? 🧵4/8
1
2
19
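The two decay rules stated above, written out as a small helper (a sketch; the function name and the `independent` flag are illustrative):

```python
def decay_weights(w, lr, wd, independent=False):
    # Standard weight decay: the shrink factor couples to the
    # (possibly µP-scaled) learning rate.
    # Independent weight decay: the shrink factor ignores the LR,
    # so its strength stays fixed even when µP rescales the LR.
    factor = (1 - wd) if independent else (1 - lr * wd)
    return factor * w
```

When µP shrinks `lr` with width, the standard rule decays wider models less per step, while the independent rule keeps the decay strength the same across widths.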
@AtliKosson
Atli Kosson
3 days
µP takes in a base LR and scales it to keep the size of internal feature updates stable across model widths. It assumes wider models are more sensitive, requiring proportionally smaller weight updates to achieve a given feature change, so µP scales down the LR with width. 🧵3/8
1
2
19
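A simplified sketch of the width scaling described here, assuming Adam-style updates where µP shrinks the hidden-layer LR inversely with width (embedding and output layers follow different rules, which this sketch omits):

```python
def mup_hidden_lr(base_lr, base_width, width):
    """Illustrative µP-style LR for hidden (matrix) parameters:
    scale the base LR tuned at base_width down in proportion to
    the actual width."""
    return base_lr * base_width / width
```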
@AtliKosson
Atli Kosson
3 days
Getting the LR right is crucial for efficient training, but tuning it directly on large models is often prohibitively expensive! LR transfer lets you tune on small models and apply to large ones. The Maximal Update Parameterization (µP) is the go-to method for this. 🧵2/8
1
2
22
@AtliKosson
Atli Kosson
3 days
The Maximal Update Parameterization (µP) allows LR transfer from small to large models, saving costly tuning. But why is independent weight decay (IWD) essential for it to work? We find µP stabilizes early training (like an LR warmup), but IWD takes over in the long term! 🧵
11
44
311
@Jianlin_S
jianlin.su
26 days
Asymptotic Estimate of Weight RMS for AdamW https://t.co/9Lxm4c2occ
0
16
126
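For readers curious what such an estimate can look like, here is a back-of-the-envelope steady-state argument (a sketch under simplifying assumptions, not necessarily the linked derivation): assume each AdamW update has roughly unit RMS per coordinate and is roughly orthogonal to the current weight vector, and balance the per-step norm growth from the update against the shrinkage from weight decay.

```latex
% Assumptions: update u_t has RMS ~ 1 per coordinate (||u_t||^2 ~ d)
% and is approximately orthogonal to w_t; LR eta, weight decay lambda.
\|w_{t+1}\|^2 \approx (1-\eta\lambda)^2\|w_t\|^2 + \eta^2\|u_t\|^2
             \approx (1-2\eta\lambda)\|w_t\|^2 + \eta^2 d
% At equilibrium, ||w_{t+1}|| = ||w_t|| = ||w||_eq:
2\eta\lambda\,\|w\|_{\mathrm{eq}}^2 \approx \eta^2 d
\;\Longrightarrow\;
\mathrm{RMS}(w) = \frac{\|w\|_{\mathrm{eq}}}{\sqrt{d}}
\approx \sqrt{\frac{\eta}{2\lambda}}
```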
@kellerjordan0
Keller Jordan
10 months
I would like to issue a citation request for Muon to the following newly appearing paper from Microsoft Research: Ma et al. (2024). SWAN: Preprocessing SGD Enables Adam-Level Performance On LLM Training With Significant Memory Reduction. https://t.co/fJOFaMTLES 1/5
arxiv.org
Adaptive optimizers such as Adam (Kingma & Ba, 2015) have been central to the success of large language models. However, they often require to maintain optimizer states throughout training, which...
7
26
228
@haeggee
Alex Hägele
1 year
I'm also at ICML -- excited to present our paper on training + LR schedules as a spotlight (!) at the workshop on the next gen of seq. models as well as ES-FOMO on Fri🤙 Reach out to discuss methods for training open models, scaling, efficiency, or the future of architectures :)
@haeggee
Alex Hägele
1 year
Why exactly do we train LLMs with the cosine schedule, still?🤔 Maybe we do not actually have to -- and that would come with a lot of benefits :) 🧵Our paper on LR schedules, compute-optimality and more affordable scaling laws
1
17
110
@AtliKosson
Atli Kosson
1 year
Do we really need learning rate warmup for GPT pre-training? We find that better normalization of the updates can be a sufficient alternative! We are presenting our preliminary results at the #ICML2024 HiLD workshop poster sessions on Friday (10-11 am, 3:30-4:45 pm in Straus 2).
3
6
23
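As a hedged illustration of what normalizing the updates can mean in this setting (the rule and threshold below are illustrative, not taken from the paper): cap each layer's update relative to its weight norm so that early steps cannot move the weights too far, which plays a role similar to an LR warmup.

```python
import numpy as np

def normalize_update(update, weight, max_relative_size=1e-3, eps=1e-12):
    """Illustrative sketch: cap the update norm at a fixed fraction of
    the weight norm, limiting how far any single step can move a layer."""
    rel = np.linalg.norm(update) / (np.linalg.norm(weight) + eps)
    if rel > max_relative_size:
        update = update * (max_relative_size / rel)
    return update
```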
@AtliKosson
Atli Kosson
1 year
Here is the ICML event link https://t.co/3cd0IMEFMJ and a previous thread with more details https://t.co/gwCMXl5Dwm. Joint work with @BeMePhD who is also here at ICML and Prof Martin Jaggi.
@AtliKosson
Atli Kosson
2 years
Why does AdamW outperform Adam with L2-regularization? Its effectiveness seems to stem from how it affects the angular update size of weight vectors! This may also be the case for Weight Standardization, LR warmup and weight decay in general! 🧵 for https://t.co/D8i8u3fSsd 1/10
0
2
3
@AtliKosson
Atli Kosson
1 year
We’re at #ICML2024 presenting our study of weight decay, perhaps the most widely misunderstood method in deep learning. Drop by poster 304 on Thu morning to discuss how weight decay benefits training, why it outperforms L2 regularization for Adam, and how it can be replaced!
1
4
20
@jxbz
Jeremy Bernstein
1 year
New paper and pip package: modula: "Scalable Optimization in the Modular Norm" 📦 https://t.co/ztWVPShp1p 📝 https://t.co/UnVL9iY8kB We re-wrote the @pytorch module tree so that training automatically scales across width and depth.
8
37
177
@vitaliychiley
Vitaliy Chiley
2 years
Introducing DBRX: A New Standard for Open LLMs 🔔 https://t.co/0HpI6Sdv6J 💻 DBRX is a 16x 12B MoE LLM trained on 📜 12T tokens 🧠DBRX sets a new standard for open LLMs, outperforming established models on various benchmarks. Is this thread mostly written by DBRX? Yes! 🧵
22
83
472
@MatPagliardini
Matteo Pagliardini
2 years
A tweak in the architecture of #Transformers can significantly boost accuracy! With direct access to all previous blocks’ outputs, a 48-block #DenseFormer outperforms a 72-block Transformer, with faster inference! A work with @akmohtashami_a,@francoisfleuret, Martin Jaggi. 1/🧵
26
166
1K
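Since the tweet describes giving each block direct access to all previous blocks' outputs, here is a minimal PyTorch sketch of one way such a learned weighted combination could look (module name, shapes, and initialization are illustrative, not the authors' code):

```python
import torch
import torch.nn as nn

class DepthWeightedAverage(nn.Module):
    """Sketch: after block i, mix the current output with all earlier
    block outputs (and the embedding) using learned scalar weights."""
    def __init__(self, block_index):
        super().__init__()
        # One weight per earlier representation: embedding + blocks 0..i.
        init = torch.zeros(block_index + 2)
        init[-1] = 1.0  # start as identity: use only the newest output
        self.alphas = nn.Parameter(init)

    def forward(self, history):
        # history: list of tensors [embedding_out, block_0_out, ..., block_i_out],
        # each of shape (batch, seq, dim).
        stacked = torch.stack(history, dim=0)      # (i+2, batch, seq, dim)
        w = self.alphas.view(-1, 1, 1, 1)
        return (w * stacked).sum(dim=0)
```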
@AtliKosson
Atli Kosson
2 years
We hope this shows there is a lot more to weight decay than most realize! Joint work with Bettina Messmer and Martin Jaggi, full paper link:
arxiv.org
This study investigates how weight decay affects the update behavior of individual neurons in deep neural networks through a combination of applied analysis and experimentation. Weight decay can...
0
0
7