Fabian Schaipp

@FSchaipp

Followers: 1K · Following: 2K · Media: 78 · Statuses: 472

working on optimization for machine learning. currently postdoc @inria_paris.

Paris, France
Joined July 2020
@FSchaipp
Fabian Schaipp
9 months
Learning rate schedules seem mysterious? Turns out that their behaviour can be described with a bound from *convex, nonsmooth* optimization. Short thread on our latest paper 🚇 https://t.co/DGHoG1FS3f
arxiv.org
We show that learning-rate schedules for large model training behave surprisingly similar to a performance bound from non-smooth convex optimization theory. We provide a bound for the constant...
@aaron_defazio
Aaron Defazio
9 months
The sudden loss drop when annealing the learning rate at the end of a WSD (warmup-stable-decay) schedule can be explained without relying on non-convexity or even smoothness, a new paper shows that it can be precisely predicted by theory in the convex, non-smooth setting! 1/2
5
28
135
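For reference, a minimal sketch of a warmup-stable-decay (WSD) schedule of the kind discussed above; the phase fractions and the linear decay shape are illustrative assumptions, not the paper's exact settings.

```python
def wsd_lr(step, total_steps, base_lr=3e-4, warmup_frac=0.05, decay_frac=0.2):
    """Warmup-stable-decay (WSD): linear warmup, constant plateau, final decay.
    Fractions and the linear decay shape are illustrative choices."""
    warmup_steps = int(warmup_frac * total_steps)
    decay_steps = int(decay_frac * total_steps)
    stable_end = total_steps - decay_steps
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    if step < stable_end:
        return base_lr
    # the final anneal is where the sudden loss drop is observed
    return base_lr * (total_steps - step) / max(1, decay_steps)
```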
@FSchaipp
Fabian Schaipp
4 days
love to see how well MoMo works combined with Muon
@gowerrobert
Robert M. Gower 🇺🇦
8 days
We've just finished some work on improving the sensitivity of Muon to the learning rate, and exploring a lot of design choices. If you want to see how we did this, follow me ... 1/x (Work led by the amazing @CrichaelMawshaw)
0
0
12
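For context, a minimal sketch of the orthogonalized update at the core of Muon, using the commonly circulated Newton-Schulz quintic iteration; the coefficients and step count below are assumptions taken from public implementations, not from the work announced above.

```python
import torch

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximately orthogonalize a 2D update matrix, as Muon does for its
    momentum buffer. Coefficients follow commonly used public implementations."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.float()
    X = X / (X.norm() + eps)              # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]  # work with the smaller Gram matrix
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    if transposed:
        X = X.T
    return X.to(G.dtype)
```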
@FSchaipp
Fabian Schaipp
18 days
3) use Prodigy if you want to avoid LR tuning. 4) Adam >> SGD, and previous explanations for this gap don't apply here, so something else must be causing it.
0
0
5
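On point 3), a hedged sketch of how Prodigy is typically dropped in to sidestep LR tuning; the package and class names (`prodigyopt`, `Prodigy`) are taken from the public repo and should be checked there.

```python
import torch
from prodigyopt import Prodigy  # pip install prodigyopt (package name assumed from the public repo)

model = torch.nn.Linear(128, 10)
# Prodigy estimates the step size online; lr=1.0 acts as a multiplier,
# so no learning-rate grid search is needed.
optimizer = Prodigy(model.parameters(), lr=1.0, weight_decay=0.01)

x, y = torch.randn(32, 128), torch.randint(0, 10, (32,))
optimizer.zero_grad()
loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()
```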
@FSchaipp
Fabian Schaipp
18 days
2) there sometimes seems to be a mismatch between loss and generative quality. this happens for ScheduleFree and, less pronounced, also for the WSD schedule. Studying what causes this in detail should be highly relevant.
1
0
4
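One place such a loss/quality mismatch can hide with ScheduleFree is the train/eval switch, since evaluation happens at averaged weights. A minimal usage sketch, assuming the `schedulefree` package and its `AdamWScheduleFree` class as in the public repo:

```python
import torch
import schedulefree  # pip install schedulefree (package name assumed from the public repo)

model = torch.nn.Linear(128, 10)
optimizer = schedulefree.AdamWScheduleFree(model.parameters(), lr=1e-3)

optimizer.train()  # ScheduleFree must know whether it is training or evaluating
x, y = torch.randn(32, 128), torch.randint(0, 10, (32,))
optimizer.zero_grad()
loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()

optimizer.eval()   # switch to the averaged weights before measuring loss or sampling
model.eval()
```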
@FSchaipp
Fabian Schaipp
18 days
1) Muon and Soap are very efficient and reach lower losses than AdamW within same time budget.
1
0
5
@FSchaipp
Fabian Schaipp
18 days
Our benchmark problem is training a diffusion model targeted at scientific applications. 🌪️ (denoising Navier-Stokes flows, which can be used for data assimilation) ⏩ A different context from LLM training w.r.t. model architecture, data domain and general training regime.
1
0
5
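For readers unfamiliar with the setup, a generic DDPM-style noise-prediction training step looks roughly like the sketch below; this is an illustrative caricature (including the `model(x_t, t)` signature and the noise schedule), not the paper's actual pipeline for Navier-Stokes denoising.

```python
import torch

def diffusion_training_step(model, x0, optimizer, num_timesteps=1000):
    """One generic DDPM-style step: corrupt x0 at a random timestep,
    predict the added noise, regress with MSE. Illustrative only."""
    device = x0.device
    betas = torch.linspace(1e-4, 2e-2, num_timesteps, device=device)  # assumed schedule
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)

    t = torch.randint(0, num_timesteps, (x0.shape[0],), device=device)
    noise = torch.randn_like(x0)
    a_bar = alphas_bar[t].view(-1, *([1] * (x0.dim() - 1)))
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x_t, t), noise)  # model(x_t, t) is an assumed interface
    loss.backward()
    optimizer.step()
    return loss.item()
```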
@FSchaipp
Fabian Schaipp
18 days
What are good optimizers for diffusion models? 🍂 TLDR: Muon and SOAP are very good. Paper: https://t.co/TYqRpfcu5t
7
45
332
@AndreiSemenov17
Andrei Semenov
22 days
Good to see SOAP and Muon being quite performant in another setting — training of Diffusion Models. Similarly to our benchmark, the authors find Prodigy a decent “proxy-optimizer” for tuning hyperparams of Adam-like methods https://t.co/PixDaLAMJb
0
12
86
@FSchaipp
Fabian Schaipp
1 month
TIL: Even the Sophia authors couldn't reproduce the Sophia paper's results. source: https://t.co/ianMUcq5cR
1
3
57
@FSchaipp
Fabian Schaipp
1 month
most stylish theatre i've been to. don't miss the coffee bar in the break.
@luusssso
lusso
1 year
The amazing lobby of the Teatro Regio in Turin, Italy by architect Carlo Mollino
0
0
2
@FSchaipp
Fabian Schaipp
1 month
when the paper title is a question, you can usually guess the "answer"
@YouJiacheng
You Jiacheng
2 months
It's only Monday
0
0
2
@FSchaipp
Fabian Schaipp
2 months
1
0
1
@FSchaipp
Fabian Schaipp
2 months
is there a #PyTorch implementation for Muon that handles convolutional layers properly (not just reshape to 2D)?
2
0
3
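The workaround the question alludes to is the common baseline: flatten the 4D conv weight to 2D, orthogonalize, reshape back. A small sketch of exactly that baseline, so it is clear what "not just reshape to 2D" would improve on; `ns_fn` stands for any 2D orthogonalization routine, e.g. the Newton-Schulz sketch above.

```python
import torch

def orthogonalize_conv_update(grad, ns_fn):
    """Baseline treatment of conv layers in Muon-style optimizers:
    flatten (out_ch, in_ch, kH, kW) to (out_ch, in_ch*kH*kW),
    orthogonalize in 2D, reshape back."""
    if grad.dim() == 4:
        out_ch = grad.shape[0]
        flat = grad.reshape(out_ch, -1)
        return ns_fn(flat).reshape(grad.shape)
    return ns_fn(grad)
```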
@FSchaipp
Fabian Schaipp
2 months
weight decay seems to be a hot topic of this year's ICLR submissions 👀
1
0
19
@FSchaipp
Fabian Schaipp
2 months
are models getting nervous when they are set from .train() to .eval()?
0
0
5
@pratyushmaini
Pratyush Maini
2 months
If you’re scrambling a last-minute submission with an uncertain result, remember: putting it off is hard in the moment. It will sting for 10 minutes (because you care so deeply), but in 10 months you’ll be incredibly proud you made the scientifically rigorous call.
3
4
120
@FSchaipp
Fabian Schaipp
2 months
1
0
0
@FSchaipp
Fabian Schaipp
2 months
So, what's the least frightening resource for μP and "proving" hyperparameter transfer?
2
0
3
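As a very rough caricature of what hyperparameter transfer buys in practice: tune the learning rate at a small base width, then reuse it at larger widths with width-dependent per-layer scaling. The 1/width scaling of matrix-like parameters below is only the headline idea under Adam-style updates, not the full μP parametrization.

```python
import torch

def mup_style_param_groups(model, base_lr, base_width, width):
    """Caricature of muP-style LR scaling: keep base_lr for vector-like params
    (biases, norms), scale matrix-like hidden weights by base_width / width.
    Rough idea only, not the full muP rules."""
    matrix_like, vector_like = [], []
    for p in model.parameters():
        (matrix_like if p.dim() >= 2 else vector_like).append(p)
    return [
        {"params": vector_like, "lr": base_lr},
        {"params": matrix_like, "lr": base_lr * base_width / width},
    ]

# tune base_lr at width 256, reuse at width 1024
model = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU(), torch.nn.Linear(1024, 1024))
opt = torch.optim.AdamW(mup_style_param_groups(model, base_lr=3e-4, base_width=256, width=1024))
```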
@FSchaipp
Fabian Schaipp
2 months
what a beauty
0
1
10
@Perowinger94
Ingwar Perowanowitsch
2 months
Berlin's transport senator Uta Bonde (CDU), now speaking to the Tagesspiegel about school-route safety: "We cannot introduce 30 km/h zones at our own discretion." She is skeptical of car-free school streets like those in Paris. Her advice to all children and teenagers instead: "Helmet
94
414
2K
@haeggee
Alex Hägele
2 months
Long in the making, finally released: Apertus-8B and Apertus-70B, trained on 15T tokens of open data from over 1800 languages. Unique opportunity in academia to work on and train LLMs across the full-stack. We managed to pull off a pretraining run with some fun innovations, ...
@cscsch
CSCS Lugano
2 months
@EPFL, @ETH_en and #CSCS today released Apertus, Switzerland's first large-scale, multilingual language model (LLM). As a fully open LLM, it serves as a building block for developers and organizations to create their own applications: https://t.co/7bJlINiIdn #Apertus #AI
10
31
233