Robert M. Gower 🇺🇦
@gowerrobert
Followers: 2K
Following: 4K
Media: 97
Statuses: 528
Often found scribbling down math with intermittent bursts of bashing out code.
New York City, USA
Joined June 2011
Do you want to do a postdoc developing new methods/theory in optimization for deep learning/ML? Do you enjoy blue-sky open research and discussions at blackboards? Then apply to the Flatiron Fellowship in the Center for Computational Mathematics https://t.co/ydXX28xmAd 1/3
Fisher meets Feynman! We use score matching and a trick from quantum field theory to make a product-of-experts family both expressive and efficient for variational inference. To appear as a spotlight @ NeurIPS 2025. #NeurIPS2025 (link below)
Call for participation: KAUST Workshop on Distributed Training in the Era of Large Models https://t.co/RkSsUkP5Mx Location: KAUST, Saudi Arabia. Dates: Nov 24-26, 2025. There will be a chance for some participants to present a poster and/or give a lightning talk.
Come to HilD tomorrow @ICML2025! We have 4 posters on optimization:
- In Search of Adam's Secret Sauce
- Is your batch size the problem? Revisiting the Adam-SGD gap in language modeling
- On the Interaction of Noise, Compression Role, and Adaptivity under (L0,L1)-Smoothness
https://t.co/hMeYylJzOE The latest progress in finding better Newton-Schulz iterations for the msign operator. It directly derives the theoretically optimal solution through the equioscillation theorem and a greedy transformation. Original paper:
And now I've made some slides on the PolarExpress: https://t.co/Ftfe87EMK6 Though I omit our "cushioning" trick to make the method numerically stable in low precision.
Today I will talk about a collaboration that really came about because of the mix of numerical analysis and machine learning expertise at the Flatiron Institute.
How can classical numerical analysis help train deep nets faster? Climb aboard the Polar Express to find out... https://t.co/mSxkXHvizG joint with @davpersson @gowerrobert + Chris Musco
For more details check out our paper: https://t.co/LKDl4aOFB5 Led by the brilliant @orvieto_antonio.
Understanding the remarkable efficacy of Adam when training transformer-based language models has become a central research topic within the optimization community. To gain deeper insights,...
This leaves an open question which I couldn't answer: why do we use mean/\sqrt{mean^2 + var^2} as the update direction? Using the mean as a denoiser makes sense. Down-weighting by the variance also makes sense. But why exactly this combination? I'd be interested to hear thoughts and ideas on this.
the two momentum buffers of Adam. Said in my own words: the most elegant method I can think of for tracking the mean and variance online gives exactly Adam's momentum buffers.
Suppose you are trying your best to track the mean and variance of the gradients as they change across iterations. Given a new gradient, you update the mean/variance by maximizing the Gaussian likelihood of that gradient while regularizing toward your current estimate of the mean/variance. The solution to this gives exactly ...
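To make the mean part of this concrete, here is a small worked equation. This is my own simplified illustration of the flavor of the argument (mean only, with a hypothetical weighting; the actual derivation handles the mean and variance jointly):

```latex
% Simplified illustration (mine, mean only): fit a Gaussian's mean to the new
% gradient g_t while regularizing toward the current estimate m_{t-1}.
\[
  m_t \;=\; \arg\min_{\mu}\; (1-\beta)\,(g_t-\mu)^2 \;+\; \beta\,(\mu - m_{t-1})^2
      \;=\; \beta\, m_{t-1} + (1-\beta)\, g_t ,
\]
% i.e. exactly Adam's first momentum buffer, an exponential moving average.
```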
When β_1 = β_2, we can first rewrite Adam as below, where instead of the standard uncentered 2nd moment we have something that looks like a weird variance estimator. Fun fact: it is an online estimate of the variance! Let me explain ...
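Here is a minimal numerical sketch of that claim (my own check, not code from the paper; elementwise, ignoring bias correction and epsilon): with β_1 = β_2 = β, the uncentered second moment v_t decomposes exactly as m_t^2 plus an online variance-style estimate s_t, so Adam's direction m_t/\sqrt{v_t} equals mean/\sqrt{mean^2 + variance estimate}.

```python
# Sketch: verify that Adam's buffers with beta1 == beta2 give a mean/variance view.
import numpy as np

rng = np.random.default_rng(0)
beta = 0.95
grads = rng.normal(loc=1.0, scale=2.0, size=(1000, 4))  # toy gradient stream

m = np.zeros(4)  # first buffer: EMA of gradients (the mean estimate)
v = np.zeros(4)  # second buffer: EMA of squared gradients
s = np.zeros(4)  # online variance estimate, tracked directly

for g in grads:
    s = beta * s + beta * (1 - beta) * (g - m) ** 2  # uses the *previous* mean
    m = beta * m + (1 - beta) * g
    v = beta * v + (1 - beta) * g ** 2

# The two views coincide: v == m^2 + s, so m/sqrt(v) == mean/sqrt(mean^2 + var-est).
assert np.allclose(v, m ** 2 + s)
print(m / np.sqrt(v))
```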
First, it turns out that when you set β_1 = β_2 in Adam(W), it's generally a better default setting for training LLMs (see the link and @orvieto_antonio's extensive ablations). Since we only care about training LLMs now, let's fix β_1 = β_2. https://t.co/pOHMdyI1xj
Adam is similar to many algorithms, but cannot be effectively replaced by any simpler variant in LMs. The community is starting to get the recipe right, but what is the secret sauce? @gowerrobert and I found that it has to do with the beta parameters and variational inference.
Adam(W) is great at optimizing LLMs and most neural networks, but it's still not well understood. Optimization researchers try to explain Adam from the perspective of 1st- and 2nd-order methods. But maybe Adam has a more statistical motivation? Let me show you a mean/variance view ... 1/x
For full details see our new arXiv paper: https://t.co/y0Vhk4FggT All the heavy lifting and method development was done/led by @noahamsel and @davpersson, in collaboration with Prof. Chris Musco. Big thanks to @jxbz, @kellerjordan0 and @YouJiacheng for leading the way here!
We got an improved validation loss compared to using @kellerjordan0's and @YouJiacheng's variants within the Muon method. In fact, we got a better loss for all choices of the (max) learning rate. (Training the GPT-2 Large model on 1B tokens of FineWeb.)
The final PolarExpress method for T=5 is simple and has the same structure as the variants of zeropower_via_newtonschulz5 developed by @kellerjordan0 and @YouJiacheng. So it's easy to drop in our version if you want to try it. We tried it out and ...
To be fast, we run these methods in half precision (bfloat16). And polynomials that are optimal in infinite precision are not always optimal in finite precision. To make these methods stable in half precision, we had to squish the polynomials from above and below by 1%.
We compute the *optimal* sequence of polynomials for approximating the matrix sign, which is equivalent to computing the optimal polynomials for approximating the constant function f(x) = 1. This part is computed just once with the Remez algorithm and then stored.
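As a concrete stand-in (my own sketch, not the paper's code), here is a discretized minimax fit of a single odd quintic p(x) = a1*x + a3*x^3 + a5*x^5 to the constant 1 on [l, 1], posed as a linear program. The real pipeline instead runs a Remez exchange and chains T such polynomials greedily; the interval edge l and the grid size below are placeholder choices.

```python
# Discretized minimax (crude stand-in for Remez): minimize max_x |1 - p(x)| on a grid.
import numpy as np
from scipy.optimize import linprog

l = 0.05                         # hypothetical lower edge of the singular-value range
xs = np.linspace(l, 1.0, 2001)   # dense grid standing in for the whole interval
P = np.stack([xs, xs**3, xs**5], axis=1)

# Variables: [a1, a3, a5, t]; minimize t subject to |1 - p(x)| <= t at every grid point.
c = np.array([0.0, 0.0, 0.0, 1.0])
A_ub = np.vstack([
    np.hstack([P, -np.ones((len(xs), 1))]),   #  p(x) - t <= 1
    np.hstack([-P, -np.ones((len(xs), 1))]),  # -p(x) - t <= -1
])
b_ub = np.concatenate([np.ones(len(xs)), -np.ones(len(xs))])
res = linprog(c, A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * 3 + [(0, None)], method="highs")

a1, a3, a5, err = res.x
print(f"a1={a1:.4f}, a3={a3:.4f}, a5={a5:.4f}, max error ~ {err:.4f}")
```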
Following @kellerjordan0 and @YouJiacheng, we use a sequence of T=5 odd polynomials to approximate the matrix sign. Polynomials of matrices are very GPU-friendly, because they only use matrix multiplies and matrix sums. After applying these polynomials we have X_T ≈ polar(M).
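To illustrate just the structure (not our coefficients): each step is one odd quintic in the matrix, X <- a*X + b*(X X^T)X + c*(X X^T)^2 X. The sketch below uses the classical quintic Newton-Schulz coefficients (15/8, -10/8, 3/8) purely as placeholders, not the tuned PolarExpress or zeropower_via_newtonschulz5 ones.

```python
# Structure-only sketch: T odd quintic matrix polynomials, built from matmuls and sums.
import torch

def apply_odd_quintics(M: torch.Tensor, coeffs) -> torch.Tensor:
    X = M / (M.norm() + 1e-7)      # scale so all singular values are <= 1
    for a, b, c in coeffs:         # one quintic per iteration, T = len(coeffs)
        A = X @ X.T                # X X^T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X                       # for good coefficients, X_T ~= polar(M) = U V^T

# Placeholder (classical Newton-Schulz) coefficients, repeated T=5 times.
coeffs = [(15 / 8, -10 / 8, 3 / 8)] * 5
M = torch.randn(64, 32)
X_T = apply_odd_quintics(M, coeffs)
```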