Fabian Schaipp

@FSchaipp

Followers
1K
Following
1K
Media
70
Statuses
419

working on optimization for machine learning. currently postdoc @inria_paris. sbatch and apero.

Paris, France
Joined July 2020
@FSchaipp
Fabian Schaipp
5 months
Learning rate schedules seem mysterious? Turns out that their behaviour can be described with a bound from *convex, nonsmooth* optimization. Short thread on our latest paper 🚇.
@aaron_defazio
Aaron Defazio
5 months
The sudden loss drop when annealing the learning rate at the end of a WSD (warmup-stable-decay) schedule can be explained without relying on non-convexity or even smoothness. A new paper shows that it can be precisely predicted by theory in the convex, non-smooth setting! 1/2
5
26
131
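For context, the flavour of bound at play here, as a rough sketch only (this is the classical averaged-iterate guarantee, not the paper's exact last-iterate result): for a convex, G-Lipschitz objective and (sub)gradient steps with learning rates η_t,

\[
\mathbb{E}\big[f(\bar{x}_T) - f_\star\big] \;\le\; \frac{\lVert x_1 - x_\star\rVert^2 + G^2 \sum_{t=1}^{T} \eta_t^2}{2\sum_{t=1}^{T} \eta_t},
\qquad
\bar{x}_T = \frac{\sum_{t=1}^{T} \eta_t\, x_t}{\sum_{t=1}^{T} \eta_t}.
\]

The schedule enters only through sums of η_t and η_t², which is what makes bounds of this type usable as predictors of training-curve shape; the paper works with a sharper statement for the final iterate (see the thread for the exact form).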
@FSchaipp
Fabian Schaipp
12 days
A paper that contains both the words "sigma-algebra" and "SwiGLU activations" ☑️. Also interesting results on embedding layer LRs.
0
8
76
@FSchaipp
Fabian Schaipp
12 days
RT @mblondel_ml: We uploaded V3 of our draft book "The Elements of Differentiable Programming". Lots of typo fixes, clarity improvements, n…
0
75
0
@FSchaipp
Fabian Schaipp
20 days
is it allowed to write papers on μP only subject to using the most un-intuitive notation?
3
0
12
@FSchaipp
Fabian Schaipp
1 month
on a more serious note: thanks to @fpedregosa and colleagues for this benchmark. happy to see MoMo works reasonably well out of the box on problems we never tested it on.
0
0
5
@FSchaipp
Fabian Schaipp
1 month
6-month arxiv upload pause, please. i can't catch up.
@konstmish
Konstantin Mishchenko
1 month
Anyone working on adaptive optimization methods and replacements for Adam should check this paper.
4
5
85
@FSchaipp
Fabian Schaipp
1 month
โœ’๏ธ Cycle length of one is also optimal for the suboptimality bound we consider. The empirical loss curves and the bound for different cycle lengths match again.
0
0
2
@FSchaipp
Fabian Schaipp
1 month
Short thread on an (imo) neat finding of our LR-schedules paper: The Chinchilla paper showed that a *cosine cycle of 1* works best for pretraining. That is, the cosine schedule should do exactly one half-cosine stretched over training. Why is this?
1
2
16
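For anyone who wants to play with this, here is a minimal sketch (not the authors' code; the function name, arguments, and defaults are illustrative) of a Chinchilla-style cosine schedule where a cycle multiplier stretches the cosine period beyond the training horizon. A multiplier of 1 gives exactly one half-cosine over training, so the LR reaches its minimum right at the last step; larger multipliers cut the decay off early.

```python
import math

def cosine_lr(step, total_steps, lr_max, lr_min=0.0, cycle_mult=1.0):
    """Chinchilla-style cosine schedule (illustrative sketch).

    cycle_mult = 1.0: one half-cosine stretched over the full run,
    so the LR hits lr_min exactly at total_steps.
    cycle_mult > 1.0: the cosine period is longer than training,
    so the decay is cut off before reaching lr_min.
    """
    progress = min(step / (cycle_mult * total_steps), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

# Example: final LR for cycle length 1 vs. 1.5
T = 10_000
print(cosine_lr(T, T, lr_max=3e-4, cycle_mult=1.0))  # ~0.0: fully decayed
print(cosine_lr(T, T, lr_max=3e-4, cycle_mult=1.5))  # ~7.5e-5: decay cut off early
```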
@FSchaipp
Fabian Schaipp
1 month
what are the best resources for training and inference setup in diffusion models? ideally with (pseudo-)code.
2
0
4
@FSchaipp
Fabian Schaipp
1 month
Optimization is the natural language of applied mathematics.
1
0
11
@FSchaipp
Fabian Schaipp
2 months
Link:
0
0
0
@FSchaipp
Fabian Schaipp
2 months
stand up for a clean references.bib! if you want all papers from @NeurIPSConf, @icmlconf and @iclr_conf in one single bib file, this is for you. Just updated with ICLR 2025 proceedings 📚.
@FSchaipp
Fabian Schaipp
7 months
Want all NeurIPS/ICML/ICLR papers in one single .bib file? Here you go! 🗞️ short blog: 📇 bib files:
1
1
9
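If you only need the merging step locally, a minimal sketch of the idea (this is not the linked tool; the per-venue file names below are hypothetical placeholders):

```python
# Concatenate per-venue .bib files into a single references file.
# File names are hypothetical placeholders for the downloaded proceedings files.
from pathlib import Path

venue_files = ["neurips.bib", "icml.bib", "iclr.bib"]

with open("all_proceedings.bib", "w", encoding="utf-8") as out:
    for name in venue_files:
        out.write(Path(name).read_text(encoding="utf-8"))
        out.write("\n")
```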
@FSchaipp
Fabian Schaipp
2 months
now accepted at #ICML 2025! ☄️
@FSchaipp
Fabian Schaipp
5 months
Learning rate schedules seem mysterious? Turns out that their behaviour can be described with a bound from *convex, nonsmooth* optimization. Short thread on our latest paper 🚇.
1
3
50
@FSchaipp
Fabian Schaipp
2 months
biggest tech improvement in a while: my (android) phone can now open arxiv pdfs in the browser without downloading them 📗.
0
0
9
@FSchaipp
Fabian Schaipp
2 months
related and interesting:
0
0
4
@FSchaipp
Fabian Schaipp
2 months
what are the best empirical papers on the minimal Hessian eigenvalue in deep learning during training (and similar loss-landscape stuff)?
2
0
4
@FSchaipp
Fabian Schaipp
3 months
RT @S_Conradi: IFS fractal. Made with #python #numpy #matplotlib
0
40
0
@FSchaipp
Fabian Schaipp
3 months
this, but as an (applied) maths research center
1
0
7
@FSchaipp
Fabian Schaipp
3 months
Side note: you could do the same for any other assumed gradient-norm shape. But this becomes a catch-22, as the schedule also changes the gradient norms.
1
0
1
@FSchaipp
Fabian Schaipp
3 months
Figure explainer: yellow to purple are the iterates of the optimized schedule over time. The fun part: I used (projected) gradient descent to minimize over the schedule. All you need is to write the bound in PyTorch, then use autodiff, and voilà.
1
0
1
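To make that recipe concrete, here is a minimal sketch (not the paper's code, and not its exact bound): treat the schedule as a tensor, write a suboptimality bound as a differentiable function of it, and run projected gradient descent via autodiff. As a stand-in I use the classical averaged-iterate bound (D² + G² Σ η_t²) / (2 Σ η_t); the horizon, constants, and step size are illustrative.

```python
import torch

T, D, G = 100, 1.0, 1.0                             # horizon and constants (illustrative)
eta = torch.full((T,), 0.01, requires_grad=True)    # the schedule we optimize over

opt = torch.optim.SGD([eta], lr=1e-2)
for _ in range(10_000):
    opt.zero_grad()
    # classical convex/nonsmooth averaged-iterate bound, used as a stand-in:
    bound = (D**2 + G**2 * (eta**2).sum()) / (2 * eta.sum())
    bound.backward()
    opt.step()
    with torch.no_grad():
        eta.clamp_(min=1e-8)                        # projection: keep the schedule nonnegative

# for this stand-in bound the optimum is a constant schedule, eta_t = D / (G * sqrt(T)) = 0.1
print(bound.item(), eta[:3])
```

Swapping in the bound from the paper in place of the stand-in above is what produces the optimized schedules shown in the figure.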