
Fabian Schaipp
@FSchaipp
Followers
1K
Following
2K
Media
72
Statuses
439
working on optimization for machine learning. currently postdoc @inria_paris. sbatch and apero.
Paris, France
Joined July 2020
Learning rate schedules seem mysterious? Turns out that their behaviour can be described with a bound from *convex, nonsmooth* optimization. Short thread on our latest paper 🚇 https://t.co/DGHoG1FS3f
arxiv.org
We show that learning-rate schedules for large model training behave surprisingly similarly to a performance bound from non-smooth convex optimization theory. We provide a bound for the constant...
The sudden loss drop when annealing the learning rate at the end of a WSD (warmup-stable-decay) schedule can be explained without relying on non-convexity or even smoothness: a new paper shows that it can be precisely predicted by theory in the convex, non-smooth setting! 1/2
5
26
132
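For context, a warmup-stable-decay (WSD) schedule of the kind discussed in the thread above can be sketched as follows. This is a minimal illustration: the phase fractions, base learning rate, and the linear shape of the decay are assumptions for the sketch, not the exact setup from the paper.

```python
import math

def wsd_schedule(step, total_steps, base_lr=1e-3,
                 warmup_frac=0.05, decay_frac=0.2):
    """Sketch of a warmup-stable-decay (WSD) schedule:
    linear warmup, constant plateau, then a linear cooldown to zero.
    Phase fractions and the linear decay shape are illustrative choices."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    decay_start = int((1 - decay_frac) * total_steps)
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps   # warmup phase
    if step < decay_start:
        return base_lr                               # stable phase
    # decay ("annealing") phase, where the sudden loss drop is observed
    remaining = total_steps - step
    return base_lr * remaining / (total_steps - decay_start)
```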
So, what's the least frightening resource for μP and "proving" hyperparameter transfer?
2
0
2
Berlin's transport senator Uta Bonde (CDU), now in conversation with the Tagesspiegel on the topic of school-route safety: "We cannot introduce Tempo 30 at our own discretion." She is skeptical of car-free school streets like those in Paris. Instead, her advice to all children and young people: "Helmet
89
403
2K
Long in the making, finally released: Apertus-8B and Apertus-70B, trained on 15T tokens of open data from over 1800 languages. Unique opportunity in academia to work on and train LLMs across the full stack. We managed to pull off a pretraining run with some fun innovations, ...
@EPFL, @ETH_en and #CSCS today released Apertus, Switzerland's first large-scale, multilingual language model (LLM). As a fully open LLM, it serves as a building block for developers and organizations to create their own applications: https://t.co/7bJlINiIdn
#Apertus #AI
10
28
226
🚟 New blog post: On "infinite" learning-rate schedules and how to construct them from one checkpoint to the next https://t.co/xa1DS9OTTW
fabian-sp.github.io
TL;DR: Knowing the next training checkpoint in advance (“lookahead”) helps to set the learning rate. In the limit, the classical square-root schedule appears on the horizon.
1
13
81
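A minimal sketch of the lookahead idea from the blog post, under stated assumptions: a 1/sqrt(t) base schedule combined with a linear cooldown towards the next known checkpoint. Both choices are illustrative, not the exact construction from the post.

```python
import math

def sqrt_schedule_with_cooldown(step, next_checkpoint, base_lr=1e-3,
                                cooldown_frac=0.2):
    """Illustrative sketch (assumptions, not the blog's exact construction):
    a 1/sqrt(t) base schedule, plus a linear cooldown towards the next
    known checkpoint ("lookahead"). After that checkpoint, training would
    continue from the base schedule towards the following one."""
    base = base_lr / math.sqrt(step + 1)
    cooldown_start = int((1 - cooldown_frac) * next_checkpoint)
    if step >= cooldown_start:
        # anneal linearly to zero at the upcoming checkpoint
        remaining = max(next_checkpoint - step, 0)
        return base * remaining / (next_checkpoint - cooldown_start)
    return base
```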
FYI Adam will be on holiday the entire August
0
0
8
🚡 Come check out our poster on understanding LR schedules at ICML. Thursday 11am.
4
10
118
A paper that contains both the words "sigma-algebra" and "SwiGLU activations" ☑️ Also interesting results on embedding layer LRs.
0
8
79
We uploaded V3 of our draft book "The Elements of Differentiable Programming". Lots of typo fixes, clarity improvements, new figures and a new section on Transformers!
arxiv.org
Artificial intelligence has recently experienced remarkable advances, fueled by large models, vast datasets, accelerated hardware, and, last but not least, the transformative power of...
8
78
460
is it only allowed to write papers on μP if one uses the most unintuitive notation?
3
0
12
on a more serious note: thanks to @fpedregosa and colleagues for this benchmark. happy to see MoMo works reasonably well out of the box on problems we never tested it on
0
0
5
✒️ Cycle length of one is also optimal for the suboptimality bound we consider. The empirical loss curves and the bound for different cycle lengths match again.
0
0
2
Short thread on an (imo) neat finding of our LR-schedules paper: The Chinchilla paper showed that a *cosine cycle of 1* works best for pretraining. That is, the cosine schedule should do exactly one half-cosine stretched over training. Why is this?
1
2
16
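For illustration, a cosine schedule parameterized by cycle length could be sketched as below. The base and minimum learning rates are placeholder values, and the parameterization is an assumption for the sketch rather than the exact Chinchilla setup.

```python
import math

def cosine_schedule(step, total_steps, base_lr=1e-3, min_lr=0.0, cycles=1.0):
    """Cosine schedule sketch: with cycles=1, a single half-cosine is
    stretched over the whole run (the "cosine cycle of 1").
    cycles > 1 stretches the cycle past the end of training, so the
    learning rate is not fully decayed at the final step."""
    progress = step / total_steps  # fraction of training completed, in [0, 1]
    cos_term = 0.5 * (1 + math.cos(math.pi * progress / cycles))
    return min_lr + (base_lr - min_lr) * cos_term
```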
what are the best resources for training and inference setup in diffusion models? ideally with (pseudo-)code
2
0
4
Optimization is the natural language of applied mathematics.
1
0
11