Robert M. Gower 🇺🇦

@gowerrobert

Followers 1K · Following 4K · Media 97 · Statuses 524

Often found scribbling down math with intermittent bursts of bashing out code.

New York City, USA
Joined June 2011
@gowerrobert
Robert M. Gower 🇺🇦
8 months
Do you want to do a postdoc developing new methods/theory in optimization for deep learning/ML? Do you enjoy blue-sky open research and discussions on blackboards? Then apply to the Flatiron Fellowship in the Center for Computational Mathematics. 1/3
@gowerrobert
Robert M. Gower 🇺🇦
11 days
And now I've made some slides on the PolarExpress, though I omit our "cushioning" trick that makes the method numerically stable in low precision.
@gowerrobert
Robert M. Gower 🇺🇦
1 month
RT @noahamsel: How can classical numerical analysis help train deep nets faster? Climb aboard the Polar Express to find out. https://t.c….
@gowerrobert
Robert M. Gower 🇺🇦
1 month
For more details, check out our paper. Led by the brilliant @orvieto_antonio.
@gowerrobert
Robert M. Gower 🇺🇦
1 month
This leaves an open question which I couldn't answer: why do we use mean/√(mean² + var²) as the update direction? Using the mean as a denoiser makes sense. Downweighting by the variance also makes sense. But why exactly this combination? I'd be interested to hear thoughts and ideas on this.
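For concreteness, here is my rendering of the direction in question (elementwise), using m_t for the mean estimate and s_t for the variance estimate as in the rewrite further down this thread, where v_t = m_t^2 + s_t:

  d_t = \frac{m_t}{\sqrt{m_t^2 + s_t}} = \frac{m_t}{\sqrt{v_t}}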
@gowerrobert
Robert M. Gower 🇺🇦
1 month
…the two momentum buffers of Adam. Said in my own words: the most elegant method I can think of for tracking the mean/variance online gives exactly the Adam momentum buffers.
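A minimal sketch of that claim in code (my own illustration, not the paper's code), assuming a single decay beta = β_1 = β_2 and plain exponential moving averages:

import numpy as np

def adam_buffers_as_online_stats(grads, beta=0.95):
    # Track the gradient mean and 2nd moment online with exponential moving
    # averages; these two recursions are exactly Adam's momentum buffers.
    m = np.zeros_like(grads[0])  # running mean estimate (Adam's 1st buffer)
    v = np.zeros_like(grads[0])  # running 2nd-moment estimate (Adam's 2nd buffer)
    for g in grads:
        m = beta * m + (1 - beta) * g
        v = beta * v + (1 - beta) * g**2
    var = v - m**2               # the induced online variance estimate
    direction = m / np.sqrt(v + 1e-12)  # Adam's direction: mean/sqrt(mean^2 + var)
    return m, v, var, direction

# Hypothetical usage on a stream of noisy gradients:
# m, v, var, d = adam_buffers_as_online_stats([np.random.randn(10) for _ in range(100)])

Since v = m^2 + var by construction, Adam's direction m/√v can be read as mean/√(mean² + variance).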
@gowerrobert
Robert M. Gower 🇺🇦
1 month
Suppose you are trying your best to track the mean and variance of the gradients as they change across iterations. Given a new gradient, you update the mean/variance by maximizing the pdf of a Gaussian plus a regularizer built from your current estimate of the mean/variance. The solution to this gives exactly…
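One plausible way to write that estimation problem down (my sketch; the paper's exact objective and regularizer may differ):

  (\mu_t, \sigma_t^2) = \arg\max_{\mu,\,\sigma^2} \; \log \mathcal{N}(g_t \mid \mu, \sigma^2) \;-\; \lambda\, D\big((\mu, \sigma^2),\, (\mu_{t-1}, \sigma_{t-1}^2)\big)

Here g_t is the new gradient, D is a proximity term that regularizes towards the current estimates, and λ trades off the new evidence against them. The point of this thread is that the closed-form solution of such a problem recovers Adam's two momentum buffers.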
@gowerrobert
Robert M. Gower 🇺🇦
1 month
When β_1 = β_2, we can first re-write Adam as below, where instead of the standard uncentered 2nd moment we have something that looks like a weird variance estimator. Fun fact: it is an online estimate of the variance! Let me explain…
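Since the image with the rewrite is not preserved here, a hedged reconstruction (my sketch, with bias corrections and the ε omitted): with β := β_1 = β_2 and the usual buffers

  m_t = \beta m_{t-1} + (1-\beta) g_t, \qquad v_t = \beta v_{t-1} + (1-\beta) g_t^2,

substituting v_t = m_t^2 + s_t with s_t := v_t - m_t^2 turns the update into

  x_{t+1} = x_t - \gamma \frac{m_t}{\sqrt{m_t^2 + s_t}},

where s_t is the "weird variance estimator" in question.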
@gowerrobert
Robert M. Gower 🇺🇦
1 month
First, it turns out that setting β_1 = β_2 in Adam(W) is generally a better default for training LLMs (see link and @orvieto_antonio's extensive ablations). Since we only care about training LLMs now (😛), let's fix β_1 = β_2.
@orvieto_antonio
Antonio Orvieto
1 month
Adam is similar to many algorithms, but cannot be effectively replaced by any simpler variant in LMs. The community is starting to get the recipe right, but what is the secret sauce? @gowerrobert and I found that it has to do with the beta parameters and variational inference.
@gowerrobert
Robert M. Gower 🇺🇦
1 month
Adam(W) is great at optimizing LLMs and most neural networks, but it's still not well understood. Optimization folks try to explain Adam from their perspective of 1st- and 2nd-order methods. But maybe Adam has a more statistical motivation? Let me show you a mean/variance view. 1/x
@gowerrobert
Robert M. Gower 🇺🇦
1 month
RT @damekdavis: nice paper by @gowerrobert and collaborators
@gowerrobert
Robert M. Gower 🇺🇦
1 month
For full details see our new arXiv paper. All the heavy lifting and method development was done/led by @noahamsel and @davpersson, in collaboration with Prof Chris Musco. Big thanks to @jxbz, @kellerjordan0 and @YouJiacheng for leading the way here!
@gowerrobert
Robert M. Gower 🇺🇦
1 month
We got an improved validation loss compared to using @kellerjordan0's and @YouJiacheng's variants within the Muon method. In fact, we got a better loss for all choices of the (max) learning rate. (Training the GPT2-Large model on 1B tokens of FineWeb.)
@gowerrobert
Robert M. Gower 🇺🇦
1 month
The final PolarExpress method for T=5 is simple, and has the same structure as the variants of zeropower_via_newtonschulz5 developed by @kellerjordan0 and @YouJiacheng. So it's easy to drop in our version if you want to try it. We tried it out and…
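To illustrate the shared structure, a minimal PyTorch-style sketch (patterned on the publicly known zeropower_via_newtonschulz5; the (a, b, c) coefficients below are placeholders, not the optimal PolarExpress coefficients from the paper):

import torch

def polar_via_odd_polynomials(M, coeffs):
    # coeffs: T triples (a, b, c); each step applies the odd quintic
    #   X <- a*X + b*(X X^T) X + c*(X X^T)^2 X
    X = M.bfloat16()
    X = X / (X.norm() + 1e-7)       # scale so all singular values are <= 1
    transposed = M.size(0) > M.size(1)
    if transposed:
        X = X.T                     # cheaper to multiply on the short side
    for a, b, c in coeffs:
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    if transposed:
        X = X.T
    return X                        # X ≈ polar(M)

# Hypothetical usage with T=5 placeholder triples:
# out = polar_via_odd_polynomials(G, coeffs=[(3.4445, -4.7750, 2.0315)] * 5)

(The triple (3.4445, -4.7750, 2.0315) is the widely circulated Newton-Schulz quintic from Muon, shown here only as a stand-in.)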
@gowerrobert
Robert M. Gower 🇺🇦
1 month
To be fast, we run these methods in half precision (bfloat16). And polynomials that are optimal in infinite precision are not always optimal in finite precision. To make these methods stable in half precision, we had to squish the polynomials from above and below by 1%.
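One plausible reading of the 1% squish (my guess at the mechanics; the paper spells out the actual construction): contract each optimal polynomial slightly, e.g.

  \tilde{p}_k(x) = (1 - \delta)\, p_k(x), \qquad \delta \approx 0.01,

so that bfloat16 roundoff cannot push the iterates outside the region where the next polynomial still pulls values towards 1.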
@gowerrobert
Robert M. Gower 🇺🇦
1 month
We compute the *optimal* sequence of polynomials for approximating the matrix sign, which is equivalent to computing the optimal polynomials for approximating the constant *f(x) = 1* function. This part is computed just once with the Remez algorithm and then stored.
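A sketch of why the two problems coincide: the iterates are odd polynomials, and for odd p we have |p(x) - sign(x)| = |p(|x|) - 1|, so

  \min_{p \text{ odd}, \deg p \le d} \; \max_{\ell \le |x| \le 1} |p(x) - \mathrm{sign}(x)| \;=\; \min_{p \text{ odd}, \deg p \le d} \; \max_{x \in [\ell, 1]} |p(x) - 1|,

i.e. it suffices to best-approximate the constant function f(x) = 1 on [ℓ, 1] by odd polynomials.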
@gowerrobert
Robert M. Gower 🇺🇦
1 month
Following @kellerjordan0 and @YouJiacheng, we use a sequence of T=5 odd polynomials to approximate the matrix sign. Polynomials of matrices are very GPU friendly, because they only use matrix multiplies and matrix sums. After applying these polynomials we have X_T ≈ polar(M).
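Written out, the iteration has the shape (assuming the same degree-5 odd form as zeropower_via_newtonschulz5; a_k, b_k, c_k are the per-step coefficients):

  X_0 = M / \|M\|, \qquad X_{k+1} = a_k X_k + b_k (X_k X_k^\top) X_k + c_k (X_k X_k^\top)^2 X_k, \quad k = 0, \dots, T-1,

so each step is just a handful of matrix multiplies and additions, and X_T ≈ polar(M).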
@gowerrobert
Robert M. Gower 🇺🇦
1 month
Each iteration of Muon has to (approximately) compute the polar factor of the momentum matrix. At face value, it looks like we have to compute the SVD and throw away the singular values. This is expensive and typically doesn't work well on GPUs, so instead…
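For reference, the polar factor in terms of the SVD: if M = U \Sigma V^\top is the (thin) SVD, then

  \mathrm{polar}(M) = U V^\top,

i.e. exactly what you get by computing the SVD and setting every singular value to 1.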
@gowerrobert
Robert M. Gower 🇺🇦
1 month
Are you interested in the new Muon/Scion/Gluon methods for training LLMs? To run Muon, you need to approximate the matrix sign (or polar factor) of the momentum matrix. We've developed an optimal method, *The PolarExpress*, just for this! If you're interested, climb aboard. 1/x
@gowerrobert
Robert M. Gower 🇺🇦
2 months
RT @bremen79: I have an opening for a post-doc position: I am looking for smart people with a strong CV in optimization and/or online learn….
@gowerrobert
Robert M. Gower 🇺🇦
2 months
To add a little balance here: I have had several good reviewers in the past who also rejected my papers, but really engaged with the content, found an issue I had not considered, and gave valuable feedback. I'm grateful to those anonymous experts 🙏.