Fabian Schaipp

@FSchaipp

Followers: 1K · Following: 2K · Media: 72 · Statuses: 439

working on optimization for machine learning. currently postdoc @inria_paris. sbatch and apero.

Paris, France
Joined July 2020
@FSchaipp
Fabian Schaipp
7 months
Learning rate schedules seem mysterious? Turns out that their behaviour can be described with a bound from *convex, nonsmooth* optimization. Short thread on our latest paper 🚇 https://t.co/DGHoG1FS3f
arxiv.org
We show that learning-rate schedules for large model training behave surprisingly similarly to a performance bound from non-smooth convex optimization theory. We provide a bound for the constant...
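For context, here is the textbook flavor of such a bound; a minimal sketch, not the paper's exact statement: for a convex, G-Lipschitz objective, subgradient steps x_{t+1} = x_t − η_t g_t started with ‖x_1 − x*‖ ≤ D satisfy

```latex
% Classical schedule-dependent bound for subgradient descent on a convex,
% G-Lipschitz objective (textbook result; the paper works with a refined,
% last-iterate version of this style of bound).
\min_{1 \le t \le T} f(x_t) - f^* \;\le\;
  \frac{D^2 + G^2 \sum_{t=1}^{T} \eta_t^2}{2 \sum_{t=1}^{T} \eta_t}
```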
@aaron_defazio
Aaron Defazio
7 months
The sudden loss drop when annealing the learning rate at the end of a WSD (warmup-stable-decay) schedule can be explained without relying on non-convexity or even smoothness: a new paper shows that it can be precisely predicted by theory in the convex, non-smooth setting! 1/2
5 replies · 26 reposts · 132 likes
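To make the schedule-dependence concrete, a toy sketch in Python, with the caveat that it evaluates the classical bound above on schedule prefixes as a stand-in for the paper's last-iterate bound, and that D, G, and all schedule constants are made up:

```python
# Toy sketch: evaluate the classical bound (D^2 + G^2 * sum eta^2) / (2 * sum eta)
# on prefixes of a WSD schedule. D, G, and the schedule constants are invented;
# the paper's last-iterate refinement makes the drop at decay time much sharper.
D, G, T = 1.0, 1.0, 1000

def wsd(t, T, eta_max=0.05, warmup_frac=0.05, decay_frac=0.2):
    """Warmup-stable-decay: linear warmup, constant phase, linear cooldown."""
    warmup, decay_start = int(warmup_frac * T), int((1 - decay_frac) * T)
    if t < warmup:
        return eta_max * (t + 1) / warmup
    if t < decay_start:
        return eta_max
    return eta_max * (T - t) / (T - decay_start)

def bound_prefixes(schedule, T):
    s1 = s2 = 0.0
    bounds = []
    for t in range(T):
        eta = schedule(t, T)
        s1, s2 = s1 + eta, s2 + eta ** 2
        bounds.append((D ** 2 + G ** 2 * s2) / (2 * s1))
    return bounds

b = bound_prefixes(wsd, T)
# The bound plateaus during the stable phase and dips once the decay starts.
print(f"late stable phase: {b[int(0.79 * T)]:.4f}  end of decay: {b[-1]:.4f}")
```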
@FSchaipp
Fabian Schaipp
1 day
So, what's the least frightening resource for μP and "proving" hyperparameter transfer?
2 replies · 0 reposts · 2 likes
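For readers who haven't seen μP spelled out, a rough sketch of the prescriptions for Adam, with the strong caveat that this is a paraphrase of the usual μP table from Yang & Hu's Tensor Programs line of work, not something from the tweet; every scaling below is an assumption to verify against the papers:

```python
import math

# Rough μP sketch for Adam at width n, transferring hyperparameters tuned at a
# small proxy width n_base. All scalings are assumptions paraphrasing the
# common μP table; check against the Tensor Programs papers before relying on them.
def mup_adam_settings(n, n_base, base_lr, base_std):
    m = n / n_base  # width multiplier
    return {
        # hidden (matrix-like) weights: init std ~ 1/sqrt(fan_in), Adam LR ~ 1/width
        "hidden_init_std": base_std / math.sqrt(m),
        "hidden_lr": base_lr / m,
        # input/embedding weights: width-independent init and LR
        "input_init_std": base_std,
        "input_lr": base_lr,
        # readout: scale the output logits by 1/width in the forward pass
        "output_logit_mult": 1.0 / m,
    }

print(mup_adam_settings(n=4096, n_base=256, base_lr=3e-4, base_std=0.02))
```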
@FSchaipp
Fabian Schaipp
3 days
what a beauty
0 replies · 1 repost · 10 likes
@Perowinger94
Ingwar Perowanowitsch
5 days
Berlin's transport senator Uta Bonde (CDU), now speaking to the Tagesspiegel about school-route safety: "We cannot introduce Tempo 30 at our own discretion." She is skeptical of car-free school streets like those in Paris. Her advice to all children and teenagers instead: "Helmet
89 replies · 403 reposts · 2K likes
@haeggee
Alex Hägele
11 days
Long in the making, finally released: Apertus-8B and Apertus-70B, trained on 15T tokens of open data from over 1800 languages. Unique opportunity in academia to work on and train LLMs across the full stack. We managed to pull off a pretraining run with some fun innovations, ...
@cscsch
CSCS Lugano
11 days
@EPFL, @ETH_en and #CSCS today released Apertus, Switzerland's first large-scale, multilingual language model (LLM). As a fully open LLM, it serves as a building block for developers and organizations to create their own applications: https://t.co/7bJlINiIdn #Apertus #AI
10 replies · 28 reposts · 226 likes
@FSchaipp
Fabian Schaipp
14 days
from time to time, read the classics
1 reply · 1 repost · 7 likes
@FSchaipp
Fabian Schaipp
2 months
FYI Adam will be on holiday for all of August
@2prime_PKU
Yiping Lu
2 months
Anyone know adam?
0 replies · 0 reposts · 8 likes
@FSchaipp
Fabian Schaipp
2 months
The best tutorials are those that don't just promote the speakers' own work. #ICML2025
0 replies · 3 reposts · 31 likes
@FSchaipp
Fabian Schaipp
2 months
🚡 Come check out our poster on understanding LR schedules at ICML. Thursday 11am.
4 replies · 10 reposts · 118 likes
@FSchaipp
Fabian Schaipp
2 months
Pogačar hasn't had a bad day since TdF 2023, stage 17. Quite astonishing #TDF2025
1 reply · 0 reposts · 0 likes
@FSchaipp
Fabian Schaipp
3 months
A paper that contains both the words "sigma-algebra" and "SwiGLU activations" ☑️ Also interesting results on embedding layer LRs.
0 replies · 8 reposts · 79 likes
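For anyone who hasn't met the second of those two words: a minimal sketch of a SwiGLU feed-forward block, assuming the common variant from Shazeer's "GLU Variants Improve Transformer" rather than whatever the paper in the screenshot uses:

```python
import numpy as np

# Minimal SwiGLU sketch (assumption: the standard GLU-variant FFN, not the
# exact formulation of the paper referenced in the tweet).
def swish(x):
    return x / (1.0 + np.exp(-x))  # x * sigmoid(x), a.k.a. SiLU

def swiglu_ffn(x, W, V, W2):
    """FFN(x) = (swish(x @ W) * (x @ V)) @ W2 -- gated branch times linear branch."""
    return (swish(x @ W) * (x @ V)) @ W2

rng = np.random.default_rng(0)
d, d_ff = 16, 64
x = rng.normal(size=(2, d))
W, V, W2 = (rng.normal(size=s) / np.sqrt(s[0])
            for s in [(d, d_ff), (d, d_ff), (d_ff, d)])
print(swiglu_ffn(x, W, V, W2).shape)  # (2, 16)
```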
@mblondel_ml
Mathieu Blondel
3 months
We uploaded V3 of our draft book "The Elements of Differentiable Programming". Lots of typo fixes, clarity improvements, new figures and a new section on Transformers!
arxiv.org
Artificial intelligence has recently experienced remarkable advances, fueled by large models, vast datasets, accelerated hardware, and, last but not least, the transformative power of...
8 replies · 78 reposts · 460 likes
@FSchaipp
Fabian Schaipp
3 months
is writing papers on μP only allowed subject to using the most unintuitive notation?
3 replies · 0 reposts · 12 likes
@FSchaipp
Fabian Schaipp
3 months
on a more serious note: thanks to @fpedregosa and colleagues for this benchmark. happy to see MoMo works reasonably well out of the box on problems we never tested it on
0 replies · 0 reposts · 5 likes
@FSchaipp
Fabian Schaipp
3 months
6 months arxiv upload pause, please. i can't catch up
@konstmish
Konstantin Mishchenko
3 months
Anyone working on adaptive optimization methods and replacements for Adam should check this paper.
4 replies · 5 reposts · 84 likes
@FSchaipp
Fabian Schaipp
4 months
✒️ Cycle length of one is also optimal for the suboptimality bound we consider. The empirical loss curves and the bound for different cycle lengths match again.
0 replies · 0 reposts · 2 likes
@FSchaipp
Fabian Schaipp
4 months
Short thread on an (imo) neat finding of our LR-schedules paper: The Chinchilla paper showed that a *cosine cycle of 1* works best for pretraining. That is, the cosine schedule should do exactly one half-cosine stretched over training. Why is this?
1 reply · 2 reposts · 16 likes
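To pin down what "cosine cycle of 1" means, a small sketch with a hypothetical helper (the parameterization is a paraphrase of the Chinchilla knob, not code from any paper): the half-cosine is stretched over cycle_mult × T steps, so cycle_mult = 1 finishes the decay exactly at the end of training, while cycle_mult > 1 leaves it truncated.

```python
import math

# Hypothetical helper sketching the Chinchilla-style "cosine cycle" knob:
# the half-cosine spans cycle_mult * T steps. With cycle_mult = 1 the LR
# reaches eta_min exactly at step T; with cycle_mult > 1 the decay is cut off.
def cosine_lr(t, T, eta_max, cycle_mult=1.0, eta_min=0.0):
    frac = min(t / (cycle_mult * T), 1.0)  # clip once the cycle is over
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * frac))

# cycle_mult = 1 fully decays by t = T; cycle_mult = 2 ends at ~half decay.
print(cosine_lr(1000, 1000, 1e-3), cosine_lr(1000, 1000, 1e-3, cycle_mult=2.0))
```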
@FSchaipp
Fabian Schaipp
4 months
what are the best resources for training and inference setup in diffusion models? ideally with (pseudo-)code
2 replies · 0 reposts · 4 likes
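Not an answer on resources, but as a starting point, a minimal sketch of the standard DDPM noise-prediction training step (Ho et al., 2020), where `model` is a placeholder for any ε-prediction network and the linear beta schedule is an assumption:

```python
import numpy as np

# Standard DDPM noise-prediction objective (Ho et al., 2020), as a sketch.
# The linear beta schedule and constants below are the usual defaults, assumed.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)

def ddpm_training_step(model, x0, rng):
    """One training example: predict the noise added at a random timestep."""
    t = rng.integers(T)
    eps = rng.normal(size=x0.shape)
    # forward (noising) process: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps
    x_t = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps
    eps_hat = model(x_t, t)
    return np.mean((eps - eps_hat) ** 2)  # simple MSE loss on the noise

rng = np.random.default_rng(0)
toy_model = lambda x_t, t: np.zeros_like(x_t)  # stand-in "network"
print(ddpm_training_step(toy_model, rng.normal(size=(8,)), rng))
```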
@FSchaipp
Fabian Schaipp
4 months
Optimization is the natural language of applied mathematics.
1 reply · 0 reposts · 11 likes