
Fabian Schaipp
@FSchaipp
Followers
1K
Following
2K
Media
72
Statuses
439
working on optimization for machine learning. currently postdoc @inria_paris. sbatch and apero.
Paris, France
Joined July 2020
Learning rate schedules seem mysterious? Turns out that their behaviour can be described with a bound from *convex, nonsmooth* optimization. Short thread on our latest paper 🚇 https://t.co/DGHoG1FS3f
arxiv.org
We show that learning-rate schedules for large model training behave surprisingly similarly to a performance bound from non-smooth convex optimization theory. We provide a bound for the constant...
The sudden loss drop when annealing the learning rate at the end of a WSD (warmup-stable-decay) schedule can be explained without relying on non-convexity or even smoothness: a new paper shows that it can be precisely predicted by theory in the convex, non-smooth setting! 1/2
5
26
132
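For context, a warmup-stable-decay (WSD) schedule of the kind discussed in the thread above can be sketched as follows. This is a minimal illustration: the phase fractions, base learning rate, and the linear shape of the decay are assumptions for the sketch, not the exact setup from the paper.

```python
import math

def wsd_schedule(step, total_steps, base_lr=1e-3,
                 warmup_frac=0.05, decay_frac=0.2):
    """Sketch of a warmup-stable-decay (WSD) schedule:
    linear warmup, constant plateau, then a linear cooldown to zero.
    Phase fractions and the linear decay shape are illustrative choices."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    decay_start = int((1 - decay_frac) * total_steps)
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps   # warmup phase
    if step < decay_start:
        return base_lr                               # stable phase
    # decay ("annealing") phase, where the sudden loss drop is observed
    remaining = total_steps - step
    return base_lr * remaining / (total_steps - decay_start)
```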
So, what's the least frightening resource for μP and "proving" hyperparameter transfer?
2
0
2
Berlin's transport senator Uta Bonde (CDU), now in conversation with the Tagesspiegel on the topic of school-route safety: "We cannot introduce Tempo 30 at our own discretion." She is skeptical of car-free school streets like those in Paris. Instead, her advice to all children and young people: "Helmet
89
403
2K
Long in the making, finally released: Apertus-8B and Apertus-70B, trained on 15T tokens of open data from over 1800 languages. Unique opportunity in academia to work on and train LLMs across the full stack. We managed to pull off a pretraining run with some fun innovations, ...
@EPFL, @ETH_en and #CSCS today released Apertus, Switzerland's first large-scale, multilingual language model (LLM). As a fully open LLM, it serves as a building block for developers and organizations to create their own applications: https://t.co/7bJlINiIdn
#Apertus #AI
10
28
226
🚟 New blog post: On "infinite" learning-rate schedules and how to construct them from one checkpoint to the next https://t.co/xa1DS9OTTW
fabian-sp.github.io
TL;DR: Knowing the next training checkpoint in advance (“lookahead”) helps to set the learning rate. In the limit, the classical square-root schedule appears on the horizon.
1
13
81
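A minimal sketch of the lookahead idea from the blog post, under stated assumptions: a 1/sqrt(t) base schedule combined with a linear cooldown towards the next known checkpoint. Both choices are illustrative, not the exact construction from the post.

```python
import math

def sqrt_schedule_with_cooldown(step, next_checkpoint, base_lr=1e-3,
                                cooldown_frac=0.2):
    """Illustrative sketch (assumptions, not the blog's exact construction):
    a 1/sqrt(t) base schedule, plus a linear cooldown towards the next
    known checkpoint ("lookahead"). After that checkpoint, training would
    continue from the base schedule towards the following one."""
    base = base_lr / math.sqrt(step + 1)
    cooldown_start = int((1 - cooldown_frac) * next_checkpoint)
    if step >= cooldown_start:
        # anneal linearly to zero at the upcoming checkpoint
        remaining = max(next_checkpoint - step, 0)
        return base * remaining / (next_checkpoint - cooldown_start)
    return base
```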
FYI Adam will be on holiday the entire August
0
0
8
🚡 Come check out our poster on understanding LR schedules at ICML. Thursday 11am.
4
10
118
A paper that contains both the words "sigma-algebra" and "SwiGLU activations" ☑️ Also interesting results on embedding layer LRs.
0
8
79
We uploaded V3 of our draft book "The Elements of Differentiable Programming". Lots of typo fixes, clarity improvements, new figures and a new section on Transformers!
arxiv.org
Artificial intelligence has recently experienced remarkable advances, fueled by large models, vast datasets, accelerated hardware, and, last but not least, the transformative power of...
8
78
460
is it only allowed to write papers on μP if one uses the most unintuitive notation?
3
0
12
on a more serious note: thanks to @fpedregosa and colleagues for this benchmark. happy to see MoMo works reasonably well out of the box on problems we never tested it on
0
0
5
✒️ Cycle length of one is also optimal for the suboptimality bound we consider. The empirical loss curves and the bound for different cycle lengths match again.
0
0
2
Short thread on an (imo) neat finding of our LR-schedules paper: The Chinchilla paper showed that a *cosine cycle of 1* works best for pretraining. That is, the cosine schedule should do exactly one half-cosine stretched over training. Why is this?
1
2
16
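For illustration, a cosine schedule parameterized by cycle length could be sketched as below. The base and minimum learning rates are placeholder values, and the parameterization is an assumption for the sketch rather than the exact Chinchilla setup.

```python
import math

def cosine_schedule(step, total_steps, base_lr=1e-3, min_lr=0.0, cycles=1.0):
    """Cosine schedule sketch: with cycles=1, a single half-cosine is
    stretched over the whole run (the "cosine cycle of 1").
    cycles > 1 stretches the cycle past the end of training, so the
    learning rate is not fully decayed at the final step."""
    progress = step / total_steps  # fraction of training completed, in [0, 1]
    cos_term = 0.5 * (1 + math.cos(math.pi * progress / cycles))
    return min_lr + (base_lr - min_lr) * cos_term
```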
what are the best resources for training and inference setup in diffusion models? ideally with (pseudo-)code
2
0
4
Optimization is the natural language of applied mathematics.
1
0
11