Fabian Schaipp
@FSchaipp
Followers 1K · Following 2K · Media 78 · Statuses 472
working on optimization for machine learning. currently postdoc @inria_paris.
Paris, France
Joined July 2020
Learning rate schedules seem mysterious? Turns out that their behaviour can be described with a bound from *convex, nonsmooth* optimization. Short thread on our latest paper 🚇 https://t.co/DGHoG1FS3f
arxiv.org
We show that learning-rate schedules for large model training behave surprisingly similarly to a performance bound from non-smooth convex optimization theory. We provide a bound for the constant...
The sudden loss drop when annealing the learning rate at the end of a WSD (warmup-stable-decay) schedule can be explained without relying on non-convexity or even smoothness: a new paper shows that it can be precisely predicted by theory in the convex, non-smooth setting! 1/2
5
28
135
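For context, this is the flavor of bound the thread refers to (not the paper's exact theorem; the paper works with a refined bound of this type for the last iterate under an arbitrary schedule). The textbook guarantee for the subgradient method on a convex, G-Lipschitz objective with step sizes η_t reads:

```latex
% Subgradient method x_{t+1} = x_t - \eta_t g_t, \; g_t \in \partial f(x_t),
% with f convex and G-Lipschitz and \|x_1 - x^\star\| \le D:
f(\bar{x}_T) - f^\star \;\le\; \frac{D^2 + G^2 \sum_{t=1}^{T} \eta_t^2}{2 \sum_{t=1}^{T} \eta_t},
\qquad
\bar{x}_T \;=\; \frac{\sum_{t=1}^{T} \eta_t\, x_t}{\sum_{t=1}^{T} \eta_t}.
```

Bounds of this type depend on the schedule only through simple sums of η_t and η_t², which makes them cheap to evaluate for any concrete schedule (constant, cosine, WSD) and to compare against actual training-loss curves.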
love to see how well MoMo works combined with Muon
We've just finished some work on improving the sensitivity of Muon to the learning rate, and exploring a lot of design choices. If you want to see how we did this, follow me ... 1/x (Work led by the amazing @CrichaelMawshaw)
0
0
12
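For readers who haven't met MoMo (mentioned above): its core ingredient is a Polyak-type adaptive step size, combined with momentum in the actual method. The toy sketch below shows only the plain Polyak step for SGD, not MoMo's real update; lr_max and f_star (a lower bound on the loss, often 0) are the only knobs.

```python
import torch

def polyak_sgd_step(params, loss, lr_max=1.0, f_star=0.0):
    """One SGD step with a Polyak-type adaptive step size:
    eta = min(lr_max, (loss - f_star) / ||grad||^2).
    Toy sketch of the idea behind MoMo; the real method also uses
    momentum and an online estimate of f_star."""
    grads = [p.grad for p in params if p.grad is not None]
    sq_norm = sum(g.pow(2).sum() for g in grads).item()
    eta = min(lr_max, (loss.item() - f_star) / (sq_norm + 1e-12))
    with torch.no_grad():
        for p in params:
            if p.grad is not None:
                p.add_(p.grad, alpha=-eta)

# usage: loss = criterion(model(x), y); loss.backward()
#        polyak_sgd_step(list(model.parameters()), loss)
```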
3) Use Prodigy if you want to avoid LR tuning. 4) Adam >> SGD, and previous explanations for this gap don't apply here, so something else must be causing it.
0
0
5
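On point 3: Prodigy estimates the step size on the fly, so you typically leave lr=1.0 and skip the LR sweep. A minimal usage sketch, assuming the prodigyopt package (pip install prodigyopt); check its README for the exact constructor arguments.

```python
import torch
from prodigyopt import Prodigy  # assumes the prodigyopt package is installed

model = torch.nn.Linear(128, 10)
# lr=1.0 is the usual default: Prodigy adapts the effective step size itself,
# so no learning-rate sweep is needed.
optimizer = Prodigy(model.parameters(), lr=1.0, weight_decay=0.0)

for x, y in [(torch.randn(32, 128), torch.randint(0, 10, (32,)))]:  # dummy batch
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()
```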
2) There sometimes seems to be a mismatch between loss and generative quality. This happens for ScheduleFree and also, though less pronounced, for the WSD schedule. Studying what causes this in detail should be highly relevant.
1
0
4
1) Muon and SOAP are very efficient and reach lower losses than AdamW within the same time budget.
1
0
5
Our benchmark problem is training a diffusion model targeted at scientific applications 🌪️ (denoising Navier-Stokes flows, which can be used for data assimilation). ⏩ A different context from LLM training wrt. model architecture, data domain, and general training regime.
1
0
5
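For readers outside the diffusion world: the training objective being benchmarked is standard DDPM-style noise prediction; the optimizers only differ in how they take the step. A generic sketch with a stand-in denoiser (not the model or data from the paper):

```python
import torch

# Generic DDPM-style training step: the network learns to predict the noise
# added to a clean sample x0 at a randomly drawn diffusion time t.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)            # \bar{alpha}_t

denoiser = torch.nn.Sequential(                          # stand-in for a U-Net
    torch.nn.Linear(64, 256), torch.nn.SiLU(), torch.nn.Linear(256, 64)
)
opt = torch.optim.AdamW(denoiser.parameters(), lr=1e-3)  # swap in Muon/SOAP/Prodigy here

x0 = torch.randn(32, 64)                                 # dummy "clean" fields
t = torch.randint(0, T, (32,))
eps = torch.randn_like(x0)
x_t = alpha_bar[t].sqrt().unsqueeze(-1) * x0 + (1 - alpha_bar[t]).sqrt().unsqueeze(-1) * eps

loss = torch.nn.functional.mse_loss(denoiser(x_t), eps)  # time conditioning omitted
opt.zero_grad(); loss.backward(); opt.step()
```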
What are good optimizers for diffusion models? 🍂 TLDR: Muon and SOAP are very good. Paper: https://t.co/TYqRpfcu5t
7
45
332
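For anyone who hasn't looked at Muon yet: the core update is momentum followed by an approximate orthogonalization of the 2D update matrix via a few Newton-Schulz iterations. The sketch below follows the publicly available reference implementation in spirit only; precision tricks (bfloat16), Nesterov momentum, and per-layer scaling details differ in the real optimizer.

```python
import torch

def newton_schulz_orth(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize G (push its singular values toward 1)
    with a quintic Newton-Schulz iteration, as used in Muon."""
    a, b, c = 3.4445, -4.7750, 2.0315       # coefficients from the reference impl
    X = G / (G.norm() + 1e-7)               # spectral norm <= Frobenius norm <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                             # work in the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_like_step(W, grad, momentum_buf, lr=0.02, beta=0.95):
    """Simplified Muon-style update for a single 2D weight matrix W."""
    with torch.no_grad():
        momentum_buf.mul_(beta).add_(grad)            # heavy-ball momentum
        update = newton_schulz_orth(momentum_buf)     # orthogonalized direction
        scale = max(1.0, W.shape[0] / W.shape[1]) ** 0.5
        W.add_(update, alpha=-lr * scale)
```

In practice Muon is applied per 2D weight matrix, with embeddings, norms, and other non-matrix parameters typically handled by AdamW.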
Good to see SOAP and Muon being quite performant in another setting: training of diffusion models. As in our benchmark, the authors find Prodigy a decent "proxy-optimizer" for tuning hyperparams of Adam-like methods https://t.co/PixDaLAMJb
0
12
86
TIL: Even the Sophia authors couldn't reproduce the Sophia paper's results. source: https://t.co/ianMUcq5cR
1
3
57
when the paper title is a question, you can usually guess the "answer"
0
0
2
weight decay seems to be a hot topic of this year's ICLR submissions 👀
1
0
19
are models getting nervous when they are set from .train() to .eval()?
0
0
5
If you’re scrambling a last-minute submission with an uncertain result, remember: putting it off is hard in the moment. It will sting for 10 minutes (because you care so deeply), but in 10 months you’ll be incredibly proud you made the scientifically rigorous call.
3
4
120
So, what's the least frightening resource for μP and "proving" hyperparameter transfer?
2
0
3
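Not an answer to the resource question, but the rule that does most of the heavy lifting in practice: under μP, hidden weight matrices get their Adam learning rate scaled by base_width / width (alongside 1/fan_in init variance and a rescaled readout), which is what lets an LR tuned on a narrow proxy model transfer to the wide one. A rough sketch of the LR part only; the parameter classification here is deliberately crude, and the full parameterization has more rules.

```python
import torch

def mup_like_param_groups(model, base_width, width, lr_base=3e-4):
    """Per-layer Adam learning rates in the spirit of muP:
    2D weight matrices get lr_base * base_width / width, everything else
    (biases, norms) keeps lr_base. Only the LR rule is shown; muP also
    rescales initializations and the readout layer."""
    hidden, other = [], []
    for p in model.parameters():
        (hidden if p.ndim == 2 else other).append(p)
    return [
        {"params": hidden, "lr": lr_base * base_width / width},
        {"params": other, "lr": lr_base},
    ]

width, base_width = 1024, 256
model = torch.nn.Sequential(
    torch.nn.Linear(512, width), torch.nn.ReLU(), torch.nn.Linear(width, 10)
)
opt = torch.optim.Adam(mup_like_param_groups(model, base_width, width))
```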
Berlin's transport senator Uta Bonde (CDU), now speaking with the Tagesspiegel about the safety of school routes: "We cannot introduce Tempo 30 (30 km/h zones) at our own discretion." She is skeptical of car-free school streets like those in Paris. Her advice to all children and teenagers instead: "Helmet
94
414
2K
Long in the making, finally released: Apertus-8B and Apertus-70B, trained on 15T tokens of open data from over 1800 languages. Unique opportunity in academia to work on and train LLMs across the full stack. We managed to pull off a pretraining run with some fun innovations, ...
@EPFL, @ETH_en and #CSCS today released Apertus, Switzerland's first large-scale, multilingual language model (LLM). As a fully open LLM, it serves as a building block for developers and organizations to create their own applications: https://t.co/7bJlINiIdn
#Apertus #AI
10
31
233