Robert M. Gower 🇺🇦

@gowerrobert

Followers: 2K · Following: 4K · Media: 97 · Statuses: 528

Often found scribbling down math with intermittent bursts of bashing out code.

New York City, USA
Joined June 2011
@gowerrobert
Robert M. Gower 🇺🇦
1 year
Do you want to do a postdoc developing new methods/theory in optimization for deep learning/ML? Do you enjoy blue-sky open research and discussions at blackboards? Then apply to the Flatiron Fellowship in the Center for Computational Mathematics https://t.co/ydXX28xmAd 1/3
1
7
21
@dianarycai
Diana Cai
9 days
Fisher meets Feynman! 🤝 We use score matching and a trick from quantum field theory to make a product-of-experts family both expressive and efficient for variational inference. To appear as a spotlight @ NeurIPS 2025. #NeurIPS2025 (link below)
4
46
406
@peter_richtarik
Peter Richtarik
21 days
Call for participation: KAUST Workshop on Distributed Training in the Era of Large Models https://t.co/RkSsUkP5Mx Location: KAUST, Saudi Arabia. Dates: Nov 24-26, 2025. There will be a chance for some participants to present a poster and/or give a lightning talk.
2
9
22
@orvieto_antonio
Antonio Orvieto
4 months
Come to HiLD tomorrow @ICML2025! We have 4 posters on optimization:
- In Search of Adam's Secret Sauce
- Is your batch size the problem? Revisiting the Adam-SGD gap in language modeling
- On the Interaction of Noise, Compression Role, and Adaptivity under (L0,L1)-Smoothness
0
5
49
@Jianlin_S
jianlin.su
5 months
https://t.co/hMeYylJzOE The latest progress in finding better Newton-Schulz iterations for the msign operator. It directly derives the theoretically optimal solution via the equioscillation theorem and a greedy transformation. Original paper:
0
5
32
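For context on what the linked work improves, a minimal numpy sketch of the classical (unoptimized) Newton-Schulz iteration for msign / the polar factor; the cubic coefficients 3/2 and -1/2 are the textbook ones, and the Frobenius normalization is my assumption to keep the iteration inside its convergence region.

    import numpy as np

    def msign_newton_schulz(M, steps=30):
        # Classical cubic Newton-Schulz: X <- 1.5*X - 0.5*(X X^T) X.
        # Converges to the orthogonal polar factor when the singular
        # values of X_0 lie in (0, sqrt(3)).
        X = M / np.linalg.norm(M)   # Frobenius norm => singular values in (0, 1]
        for _ in range(steps):
            X = 1.5 * X - 0.5 * (X @ X.T @ X)
        return X

    M = np.random.randn(64, 64)
    U = msign_newton_schulz(M)
    # Error shrinks as steps grow; convergence is slow for tiny singular
    # values, which is exactly what the optimized iterations address.
    print(np.linalg.norm(U.T @ U - np.eye(64)))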
@gowerrobert
Robert M. Gower 🇺🇦
4 months
And now I've made some slides on the PolarExpress: https://t.co/Ftfe87EMK6 Though I omit our "cushioning" trick to make the method numerically stable in low precision.
docs.google.com
Today I will talk about a collaboration that really came about because of the mix of numerical analysis and Machine learning expertise at the Flatiron
0
0
2
@noahamsel
noahamsel
5 months
How can classical numerical analysis help train deep nets faster? Climb aboard the Polar Express to find out... https://t.co/mSxkXHvizG joint with @davpersson @gowerrobert + Chris Musco
0
2
9
@gowerrobert
Robert M. Gower 🇺🇦
5 months
This leaves an open question which I couldn't answer: 1. Why do we use mean/\sqrt{mean^2 + var^2} as the update direction? Using the mean as a denoiser makes sense. Down-weighting by the variance also makes sense. But why exactly this combination? I'd be interested to hear thoughts and ideas on this.
3
0
7
@gowerrobert
Robert M. Gower 🇺🇦
5 months
the two momentum buffers of Adam. Said in my own words: the most elegant method I can think of for tracking mean/variance online gives the Adam momentum buffers.
1
0
8
@gowerrobert
Robert M. Gower 🇺🇦
5 months
Suppose you are trying your best to track the mean and variance of the gradients as they change across iterations. Given a new gradient, you update mean/var by maximizing the pdf of a Gaussian + regularizing using your current estimate of mean/var. The solution to this gives exactly ...
1
0
5
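A worked one-liner for the mean part of that update (my sketch; the exact objective in the paper may differ): maximizing the Gaussian likelihood of the new gradient while quadratically regularizing toward the previous estimate gives exactly the exponential moving average,

    m_t = \arg\min_{\mu} \; (1-\beta)\,(g_t - \mu)^2 + \beta\,(\mu - m_{t-1})^2
    \;\;\Longrightarrow\;\; m_t = \beta\, m_{t-1} + (1-\beta)\, g_t .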
@gowerrobert
Robert M. Gower 🇺🇦
5 months
When β_1=β_2, we can first re-write Adam as below, where instead of the standard uncentered 2nd moment, we have something that looks like a weird variance estimator. Fun fact: it is an online estimate of the variance! Let me explain ...
1
0
12
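Since the image with the rewrite isn't reproduced here, a small numpy check of the identity I believe is being referenced (my reconstruction, bias correction omitted): with β_1 = β_2 = β, Adam's second-moment buffer splits as v_t = m_t^2 + s_t, where s_t is an exponentially weighted variance-style estimator, so the update direction m_t/sqrt(v_t) equals m_t/sqrt(m_t^2 + s_t).

    import numpy as np

    rng = np.random.default_rng(0)
    beta = 0.95                # beta_1 = beta_2 = beta
    m = v = s = 0.0            # Adam buffers m, v and the variance-style buffer s

    for t in range(200):
        g = rng.normal(loc=1.0, scale=0.5)    # a stream of noisy scalar "gradients"
        m_prev = m
        m = beta * m + (1 - beta) * g         # Adam's 1st-moment buffer
        v = beta * v + (1 - beta) * g**2      # Adam's (uncentered) 2nd-moment buffer
        s = beta * (s + (1 - beta) * (g - m_prev)**2)   # online variance-style estimator
        assert np.isclose(v, m**2 + s)        # the rewrite: v = m^2 + s

    print(m / np.sqrt(v), "=", m / np.sqrt(m**2 + s))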
@gowerrobert
Robert M. Gower 🇺🇦
5 months
First, it turns out that setting β_1=β_2 in Adam(W) is generally a better default for training LLMs (see the link and @orvieto_antonio's extensive ablation). Since we only care about training LLMs now (😛), let's fix β_1=β_2. https://t.co/pOHMdyI1xj
@orvieto_antonio
Antonio Orvieto
5 months
Adam is similar to many algorithms, but cannot be effectively replaced by any simpler variant in LMs. The community is starting to get the recipe right, but what is the secret sauce? @gowerrobert and I found that it has to do with the beta parameters and variational inference.
1
0
14
@gowerrobert
Robert M. Gower 🇺🇦
5 months
Adam(W) is great at optimizing LLMs and most neural networks, but it's still not well understood. Optimization researchers try to explain Adam from the perspective of 1st and 2nd order methods. But maybe Adam has a more statistical motivation? Let me show you a mean/variance view ... 1/x
5
18
216
@damekdavis
Damek
5 months
nice paper by @gowerrobert and collaborators
2
5
75
@gowerrobert
Robert M. Gower 🇺🇦
5 months
For full details see our new arXiv paper: https://t.co/y0Vhk4FggT All the heavy lifting and method development was done/led by @noahamsel and @davpersson. Collab with Prof Chris Musco as well. Big thanks to @jxbz, @kellerjordan0 and @YouJiacheng for leading the way here!
1
0
9
@gowerrobert
Robert M. Gower 🇺🇦
5 months
We got an improved validation loss compared to using @kellerjordan0's and @YouJiacheng's variants within the Muon method. In fact, we got a better loss for all choices of the (max) learning rate (training the GPT2-Large model on 1B tokens of FineWeb).
1
0
1
@gowerrobert
Robert M. Gower 🇺🇦
5 months
The final PolarExpress method for T=5 is simple, and has the same structure as the variants of zeropower_via_newtonschulz5 developed by @kellerjordan0 and @YouJiacheng. So it's easy to drop in our version if you want to try. We tried it out and ....
1
0
5
@gowerrobert
Robert M. Gower 🇺🇦
5 months
To be fast, we run these methods in half precision (bfloat16). And optimal polynomials in infinite precision are not always optimal in finite precision. To make these methods stable in half precision, we had to squish the polynomials from above and below by 1%.
1
0
1
@gowerrobert
Robert M. Gower 🇺🇦
5 months
We compute the *optimal* sequence of polynomials for approximating the matrix sign, which is equivalent to computing the optimal polynomials for approximating the constant function *f(x) = 1*. This part is computed just once with the Remez algorithm, and then stored.
1
0
1
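For intuition, a sketch of the single-polynomial version of that computation: a minimax (equioscillating) odd-polynomial fit to f(x) = 1 on [l, 1], done here with a dense-grid linear program in scipy rather than a true Remez iteration. The interval bound l and the degree are made-up illustration values, and the actual method composes a whole sequence of such polynomials.

    import numpy as np
    from scipy.optimize import linprog

    l, degree = 0.05, 7                   # illustrative lower bound on singular values, odd degree
    xs = np.linspace(l, 1.0, 2000)        # dense grid standing in for the interval [l, 1]
    powers = np.arange(1, degree + 1, 2)  # odd powers x, x^3, x^5, x^7
    A = xs[:, None] ** powers[None, :]

    # minimax fit: minimize t subject to |A c - 1| <= t, with variables (c, t)
    n = len(powers)
    obj = np.append(np.zeros(n), 1.0)
    A_ub = np.block([[ A, -np.ones((len(xs), 1))],
                     [-A, -np.ones((len(xs), 1))]])
    b_ub = np.concatenate([np.ones(len(xs)), -np.ones(len(xs))])
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * n + [(0, None)], method="highs")

    coeffs, err = res.x[:n], res.x[-1]
    print("odd coefficients:", coeffs)
    print("max deviation from 1 on the grid:", err)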
@gowerrobert
Robert M. Gower 🇺🇦
5 months
Following @kellerjordan0 and @YouJiacheng, we use a sequence of T=5 odd polynomials to approximate the matrix sign. Polynomials of matrices are very GPU friendly, because they just use matrix multiplies and matrix sums. After applying these polynomials we have X_T ≈ polar(M)
1
0
2
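To make that structure concrete, here is a numpy sketch of the shared skeleton: each of the T steps applies an odd quintic p(X) = a*X + b*(X X^T)X + c*(X X^T)^2 X. The coefficients below are, to my knowledge, the ones circulated in @kellerjordan0's zeropower_via_newtonschulz5; the PolarExpress swaps in its own per-step optimal (and "cushioned") coefficients, which are not reproduced here, and runs in bfloat16 rather than float64.

    import numpy as np

    def odd_quintic_polar(M, coeffs=((3.4445, -4.7750, 2.0315),) * 5):
        # Apply T = len(coeffs) odd quintic steps: X <- a*X + (b*A + c*A@A) @ X, with A = X X^T.
        X = M / (np.linalg.norm(M) + 1e-7)   # normalize so singular values are <= 1
        transpose = M.shape[0] > M.shape[1]
        if transpose:                        # keep the Gram matrix A small
            X = X.T
        for a, b, c in coeffs:
            A = X @ X.T
            X = a * X + (b * A + c * (A @ A)) @ X
        return X.T if transpose else X

    M = np.random.randn(128, 256)
    X_T = odd_quintic_polar(M)               # X_T approximately polar(M)
    # largest and smallest singular values of X_T, pushed toward 1
    print(np.linalg.svd(X_T, compute_uv=False)[[0, -1]])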