Atli Kosson
@AtliKosson
Followers: 406 · Following: 558 · Media: 19 · Statuses: 67
PhD student at @EPFL 🇨🇭 working on improved understanding of deep neural networks and their optimization.
Lausanne, Switzerland
Joined July 2022
Why does AdamW outperform Adam with L2-regularization? Its effectiveness seems to stem from how it affects the angular update size of weight vectors! This may also be the case for Weight Standardization, LR warmup, and weight decay in general! 🧵 for https://t.co/D8i8u3fSsd 1/10
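For reference, here is a simplified sketch of the two optimizers being compared (standard textbook updates, no schedules or amsgrad); the key difference is where the decay term enters, which is what changes the relative, i.e. angular, update size of each weight vector:

```python
import torch

def adam_l2_step(w, grad, m, v, lr, wd, t, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam + L2 regularization: the decay term is added to the gradient,
    so it passes through Adam's per-coordinate normalization."""
    g = grad + wd * w                       # L2 term enters the moments
    m.mul_(beta1).add_(g, alpha=1 - beta1)
    v.mul_(beta2).addcmul_(g, g, value=1 - beta2)
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w -= lr * m_hat / (v_hat.sqrt() + eps)
    return w

def adamw_step(w, grad, m, v, lr, wd, t, beta1=0.9, beta2=0.999, eps=1e-8):
    """AdamW: decoupled weight decay shrinks the weights directly,
    bypassing the normalization, which gives direct control over the
    relative size of each step."""
    w *= 1 - lr * wd                        # decoupled decay
    m.mul_(beta1).add_(grad, alpha=1 - beta1)
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w -= lr * m_hat / (v_hat.sqrt() + eps)
    return w

# Example usage (shapes are arbitrary):
# w = torch.randn(256, 256); m = torch.zeros_like(w); v = torch.zeros_like(w)
# w = adamw_step(w, grad=torch.randn_like(w), m=m, v=v, lr=1e-3, wd=0.1, t=1)
```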
Tagging a couple of people who worked on µP before and might find this interesting: @TheGregYang @jxbz @xidulu @laurence_ai @_katieeverett @Mitchnw @thecharlieblake @ShaneBergsma @DeyNolan @QuanquanGu @cloneofsimo @SeunghyunSEO7
See the full paper for additional details and insights about LR transfer in practice! Paper link: https://t.co/49fUImVsaX Grateful for the opportunity to work on this project during my internship at Amazon FAR with @jerwelborn @largelymfs @peterxichen! 🧵8/8
So what's µP really doing? It's creating an implicit learning rate warmup! When µP is combined with independent WD, updates start proportionally smaller before reaching the same long-term behavior. We can get similar benefits without µP by adding an exponential LR warmup! 🧵7/8
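Purely as an illustration of what such a warmup could look like (the exact shape, time constant, and interaction with the rest of the schedule in the paper may differ; the constants below are placeholders):

```python
import math

def exp_warmup_lr(step, base_lr=3e-4, tau=1000.0):
    """Exponential warmup: the LR rises from ~0 toward base_lr with time
    constant tau, mimicking the proportionally smaller early updates that
    muP + independent weight decay produce implicitly, while leaving the
    long-run LR unchanged."""
    return base_lr * (1.0 - math.exp(-step / tau))
```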
Why override µP? Because its core assumptions only hold very early in training! In practice wide models quickly stop being more sensitive to weight updates than smaller models! This is caused by changes in the geometric alignment of updates and layer inputs over training. 🧵6/8
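A rough way to see why alignment matters (a back-of-the-envelope sketch, not the paper's exact analysis): take one layer with input x of width n and a rank-one weight update ΔW = u vᵀ, the shape produced by a single-example gradient step.

```latex
\Delta(Wx) = \Delta W\, x = u\,(v^\top x),
\qquad
\|\Delta W\, x\| = \|u\|\,\|v\|\,\|x\|\,\lvert\cos\angle(v, x)\rvert .
```

If v is strongly aligned with the layer input (as tends to happen right after initialization, when the update was computed on very similar inputs), the cosine factor is Θ(1); if v is essentially uncorrelated with x, it is about 1/√n. The induced feature change differs by roughly a factor of √n between the two regimes, so how sensitive a wide layer is to a weight update, and hence how the LR should scale with width, depends on an alignment that can drift over training.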
Surprisingly, independent WD works because it overrides µP's scaling! µP makes the updates proportionally smaller for wider models, but independent WD eventually makes them equally large across widths. This turns out to be exactly what's needed for stable feature learning! 🧵5/8
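A heuristic way to see this, in the spirit of the angular-update analysis from the AdamW thread above (a back-of-the-envelope equilibrium argument assuming Adam-normalized updates of per-coordinate size ≈ 1 that are roughly orthogonal to a d-dimensional weight vector, not a result quoted from the paper):

```latex
\|w_{t+1}\|^2 \approx (1-\delta)^2\,\|w_t\|^2 + \eta^2 d
\;\Longrightarrow\;
\|w_\ast\|^2 \approx \frac{\eta^2 d}{2\delta},
\qquad
\frac{\eta\sqrt{d}}{\|w_\ast\|} \approx \sqrt{2\delta},
```

where η is the (µP-scaled) LR and δ is the fraction of the weights removed by decay each step. With the standard scaled decay, δ = η×WD, so the equilibrium relative update size ≈ √(2·η·WD) and shrinks as µP shrinks η with width. With independent decay, δ = WD, giving ≈ √(2·WD): the same relative update size at every width.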
AdamW has two common formulations of weight decay. They are usually equivalent but behave differently when µP scales the LR:
Standard WD: multiply weights by (1 - LR×WD)
Independent WD: multiply weights by (1 - WD)
Independent WD is essential for good transfer—but why? 🧵4/8
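A minimal sketch of the two formulations (Adam's moment updates omitted; names are illustrative, and lr is deliberately unused in the independent variant):

```python
def decay_standard(w, lr, wd):
    """Standard (scaled) weight decay, as in torch.optim.AdamW:
    the per-step shrinkage is lr * wd, so when muP rescales the LR
    with width it rescales the decay too."""
    return w * (1 - lr * wd)

def decay_independent(w, lr, wd):
    """Independent weight decay: the per-step shrinkage is wd itself,
    unaffected by how muP rescales the LR across widths."""
    return w * (1 - wd)
```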
µP takes in a base LR and scales it to keep the size of internal feature updates stable across model widths. It assumes wider models are more sensitive, requiring proportionally smaller weight updates to achieve a given feature change, so µP scales down the LR with width. 🧵3/8
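As a simplified illustration for hidden weight matrices under Adam-style optimizers (µP's full recipe also adjusts initialization scales and output multipliers; this shows just the LR part, with `base_width` being the width at which the LR was tuned):

```python
def mup_hidden_lr(base_lr, base_width, width):
    """Scale the tuned base LR down as the model gets wider, so the size
    of the induced feature updates stays roughly constant across widths."""
    return base_lr * base_width / width
```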
Getting the LR right is crucial for efficient training, but tuning it directly on large models is often prohibitively expensive! LR transfer lets you tune on small models and apply to large ones. The Maximal Update Parameterization (µP) is the go-to method for this. 🧵2/8
The Maximal Update Parameterization (µP) allows LR transfer from small to large models, saving costly tuning. But why is independent weight decay (IWD) essential for it to work? We find µP stabilizes early training (like an LR warmup), but IWD takes over in the long term! 🧵
I would like to issue a citation request for Muon to the following newly released paper from Microsoft Research: Ma et al. (2024). SWAN: Preprocessing SGD Enables Adam-Level Performance On LLM Training With Significant Memory Reduction. https://t.co/fJOFaMTLES 1/5
arxiv.org
Adaptive optimizers such as Adam (Kingma & Ba, 2015) have been central to the success of large language models. However, they often require to maintain optimizer states throughout training, which...
I'm also at ICML -- excited to present our paper on training + LR schedules as a spotlight (!) at the workshop on the next generation of sequence models, as well as at ES-FOMO on Friday 🤙 Reach out to discuss methods for training open models, scaling, efficiency, or the future of architectures :)
Why exactly do we train LLMs with the cosine schedule, still?🤔 Maybe we do not actually have to -- and that would come with a lot of benefits :) 🧵Our paper on LR schedules, compute-optimality and more affordable scaling laws
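For reference, the schedule being questioned here, in a minimal form (linear warmup followed by a single cosine decay); this is just the standard recipe, not anything proposed in the quoted paper:

```python
import math

def cosine_lr(step, total_steps, base_lr, warmup_steps=0, min_lr=0.0):
    """Typical LLM pretraining schedule: linear warmup, then one cosine
    decay from base_lr to min_lr over the remaining steps."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

One constraint visible here is that `total_steps` must be fixed up front, tying each run to a single training duration, which is one reason schedule choice interacts with compute-optimality and scaling-law experiments.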
Workshop paper link https://t.co/5pGSDC7M20. Joint work with @BeMePhD (also at ICML) and Prof Martin Jaggi.
openreview.net
Learning Rate Warmup is a popular heuristic for training neural networks, which downscales early updates relative to later ones. This aids training, suggesting that the initial updates are too...
Do we really need learning rate warmup for GPT pre-training? We find that better normalization of the updates can be a sufficient alternative! We are presenting our preliminary results at the #ICML2024 HiLD workshop poster sessions on Friday (10-11 am, 3:30-4:45 pm in Straus 2).
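The workshop paper has the actual scheme; purely to illustrate the general idea of update normalization, one simple variant is to cap each weight matrix's relative update size, which suppresses the oversized early steps that warmup otherwise guards against (the cap value below is an arbitrary placeholder):

```python
import torch

def normalize_update(delta_w, w, max_rel=0.01, eps=1e-8):
    """Illustrative update normalization (not necessarily the paper's method):
    rescale the proposed update so its norm is at most max_rel times the
    current weight norm, bounding the relative change made in a single step."""
    rel = delta_w.norm() / (w.norm() + eps)
    if rel > max_rel:
        delta_w = delta_w * (max_rel / rel)
    return delta_w
```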
Here is the ICML event link https://t.co/3cd0IMEFMJ and a previous thread with more details https://t.co/gwCMXl5Dwm. Joint work with @BeMePhD who is also here at ICML and Prof Martin Jaggi.
We’re at #ICML2024 presenting our study of weight decay, perhaps the most widely misunderstood method in deep learning. Drop by poster 304 on Thu morning to discuss how weight decay benefits training, why it outperforms L2 regularization for Adam, and how it can be replaced!
New paper and pip package: modula: "Scalable Optimization in the Modular Norm" 📦 https://t.co/ztWVPShp1p 📝 https://t.co/UnVL9iY8kB We re-wrote the @pytorch module tree so that training automatically scales across width and depth.
Introducing DBRX: A New Standard for Open LLMs 🔔 https://t.co/0HpI6Sdv6J 💻 DBRX is a 16x 12B MoE LLM trained on 📜 12T tokens 🧠 DBRX sets a new standard for open LLMs, outperforming established models on various benchmarks. Is this thread mostly written by DBRX? Yes! 🧵
A tweak in the architecture of #Transformers can significantly boost accuracy! With direct access to all previous blocks' outputs, a 48-block #DenseFormer outperforms a 72-block Transformer, with faster inference! A work with @akmohtashami_a, @francoisfleuret, Martin Jaggi. 1/🧵
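A rough sketch of the mechanism as described in the tweet, with every block fed a learned weighted combination of all earlier representations (module and parameter names are illustrative, not the paper's code):

```python
import torch
import torch.nn as nn

class DepthWeightedModel(nn.Module):
    """Wraps a stack of blocks so each one has direct access to the outputs
    of all previous blocks (and the embeddings) via learned mixing weights."""

    def __init__(self, blocks: nn.ModuleList):
        super().__init__()
        self.blocks = blocks
        self.mix_weights = nn.ParameterList()
        for i in range(len(blocks)):
            w = torch.zeros(i + 2)   # embeddings + the i+1 block outputs so far
            w[-1] = 1.0              # initialize as a plain residual stream
            self.mix_weights.append(nn.Parameter(w))

    def forward(self, x):
        reps = [x]  # depth 0: the embedded input
        for block, w in zip(self.blocks, self.mix_weights):
            reps.append(block(reps[-1]))
            # Replace the newest representation with a weighted mix of all of them.
            stacked = torch.stack(reps, dim=0)
            mix = w.view(-1, *([1] * (stacked.dim() - 1)))
            reps[-1] = (mix * stacked).sum(dim=0)
        return reps[-1]
```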
We hope this shows there is a lot more to weight decay than most realize! Joint work with Bettina Messmer and Martin Jaggi; full paper link:
arxiv.org
This study investigates how weight decay affects the update behavior of individual neurons in deep neural networks through a combination of applied analysis and experimentation. Weight decay can...