Robert M. Gower 🇺🇦
@gowerrobert
Followers: 2K
Following: 4K
Media: 97
Statuses: 528
Often found scribbling down math with intermittent bursts of bashing out code.
New York City, USA
Joined June 2011
Do you want to do a postdoc developing new methods/theory in optimization for deep learning/ML? Do you enjoy blue-sky open research and discussions at blackboards? Then apply to the Flatiron Fellowship in the Center for Computational Mathematics https://t.co/ydXX28xmAd 1/3
Fisher meets Feynman! We use score matching and a trick from quantum field theory to make a product-of-experts family both expressive and efficient for variational inference. To appear as a spotlight @ NeurIPS 2025. #NeurIPS2025 (link below)
Call for participation: KAUST Workshop on Distributed Training in the Era of Large Models https://t.co/RkSsUkP5Mx Location: KAUST, Saudi Arabia. Dates: Nov 24-26, 2025. There will be a chance for some participants to present a poster and/or give a lightning talk.
Come to HilD tomorrow @ICML2025! We have 4 posters on optimization:
- In Search of Adam's Secret Sauce
- Is your batch size the problem? Revisiting the Adam-SGD gap in language modeling
- On the Interaction of Noise, Compression Role, and Adaptivity under (L0,L1)-Smoothness
https://t.co/hMeYylJzOE The latest progress in finding better Newton-Schulz iterations for the msign operator. It directly derives the theoretically optimal solution through the equioscillation theorem and a greedy transformation. Original paper:
And now I've made some slides on the PolarExpress: https://t.co/Ftfe87EMK6 Though I omit our "cushioning" trick to make the method numerically stable in low precision.
Today I will talk about a collaboration that really came about because of the mix of numerical analysis and machine learning expertise at the Flatiron Institute.
How can classical numerical analysis help train deep nets faster? Climb aboard the Polar Express to find out... https://t.co/mSxkXHvizG joint with @davpersson @gowerrobert + Chris Musco
For more details check out our paper: https://t.co/LKDl4aOFB5 Led by the brilliant @orvieto_antonio.
Understanding the remarkable efficacy of Adam when training transformer-based language models has become a central research topic within the optimization community. To gain deeper insights,...
This leaves an open question which I couldn't answer: why do we use mean/\sqrt{mean^2 + var^2} as the update direction? Using the mean as a denoiser makes sense. Down-weighting by the variance also makes sense. But why exactly this combination? I'd be interested to hear thoughts and ideas on this.
the two momentum buffers of Adam. Said in my own words: the most elegant method I can think of for tracking the mean and variance online gives exactly Adam's momentum buffers.
Suppose you are trying your best to track the mean and variance of the gradients as they change across iterations. Given a new gradient, you update the mean/variance by maximizing the Gaussian likelihood of that gradient while regularizing toward your current estimate of the mean/variance. The solution to this gives exactly ...
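To make the mean part of this concrete, here is a small worked equation. This is my own simplified illustration of the flavor of the argument (mean only, with a hypothetical weighting; the actual derivation handles the mean and variance jointly):

```latex
% Simplified illustration (mine, mean only): fit a Gaussian's mean to the new
% gradient g_t while regularizing toward the current estimate m_{t-1}.
\[
  m_t \;=\; \arg\min_{\mu}\; (1-\beta)\,(g_t-\mu)^2 \;+\; \beta\,(\mu - m_{t-1})^2
      \;=\; \beta\, m_{t-1} + (1-\beta)\, g_t ,
\]
% i.e. exactly Adam's first momentum buffer, an exponential moving average.
```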
When β_1 = β_2, we can first rewrite Adam as below, where instead of the standard uncentered 2nd moment we have something that looks like a weird variance estimator. Fun fact: it is an online estimate of the variance! Let me explain ...
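Here is a minimal numerical sketch of that claim (my own check, not code from the paper; elementwise, ignoring bias correction and epsilon): with β_1 = β_2 = β, the uncentered second moment v_t decomposes exactly as m_t^2 plus an online variance-style estimate s_t, so Adam's direction m_t/\sqrt{v_t} equals mean/\sqrt{mean^2 + variance estimate}.

```python
# Sketch: verify that Adam's buffers with beta1 == beta2 give a mean/variance view.
import numpy as np

rng = np.random.default_rng(0)
beta = 0.95
grads = rng.normal(loc=1.0, scale=2.0, size=(1000, 4))  # toy gradient stream

m = np.zeros(4)  # first buffer: EMA of gradients (the mean estimate)
v = np.zeros(4)  # second buffer: EMA of squared gradients
s = np.zeros(4)  # online variance estimate, tracked directly

for g in grads:
    s = beta * s + beta * (1 - beta) * (g - m) ** 2  # uses the *previous* mean
    m = beta * m + (1 - beta) * g
    v = beta * v + (1 - beta) * g ** 2

# The two views coincide: v == m^2 + s, so m/sqrt(v) == mean/sqrt(mean^2 + var-est).
assert np.allclose(v, m ** 2 + s)
print(m / np.sqrt(v))
```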
First, it turns out that when you set β_1 = β_2 in Adam(W), it's generally a better default setting for training LLMs (see the link and @orvieto_antonio's extensive ablations). Since we only care about training LLMs now, let's fix β_1 = β_2. https://t.co/pOHMdyI1xj
Adam is similar to many algorithms, but cannot be effectively replaced by any simpler variant in LMs. The community is starting to get the recipe right, but what is the secret sauce? @gowerrobert and I found that it has to do with the beta parameters and variational inference.
Adam(W) is great at optimizing LLMs and most neural networks, but it's still not well understood. Optimization researchers try to explain Adam from the perspective of 1st- and 2nd-order methods. But maybe Adam has a more statistical motivation? Let me show you a mean/variance view ... 1/x
For full details see our new arXiv paper: https://t.co/y0Vhk4FggT All the heavy lifting and method development was done/led by @noahamsel and @davpersson, in collaboration with Prof. Chris Musco. Big thanks to @jxbz, @kellerjordan0 and @YouJiacheng for leading the way here!
We got an improved validation loss compared to using @kellerjordan0's and @YouJiacheng's variants within the Muon method. In fact, we got a better loss for all choices of the (max) learning rate. (Training the GPT-2 Large model on 1B tokens of FineWeb.)
The final PolarExpress method for T=5 is simple and has the same structure as the variants of zeropower_via_newtonschulz5 developed by @kellerjordan0 and @YouJiacheng. So it's easy to drop in our version if you want to try it. We tried it out and ...
To be fast, we run these methods in half precision (bfloat16). And polynomials that are optimal in infinite precision are not always optimal in finite precision. To make these methods stable in half precision, we had to squish the polynomials from above and below by 1%.
We compute the *optimal* sequence of polynomials for approximating the matrix sign, which is equivalent to computing the optimal polynomials for approximating the constant function f(x) = 1. This part is computed just once with the Remez algorithm and then stored.
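As a concrete stand-in (my own sketch, not the paper's code), here is a discretized minimax fit of a single odd quintic p(x) = a1*x + a3*x^3 + a5*x^5 to the constant 1 on [l, 1], posed as a linear program. The real pipeline instead runs a Remez exchange and chains T such polynomials greedily; the interval edge l and the grid size below are placeholder choices.

```python
# Discretized minimax (crude stand-in for Remez): minimize max_x |1 - p(x)| on a grid.
import numpy as np
from scipy.optimize import linprog

l = 0.05                         # hypothetical lower edge of the singular-value range
xs = np.linspace(l, 1.0, 2001)   # dense grid standing in for the whole interval
P = np.stack([xs, xs**3, xs**5], axis=1)

# Variables: [a1, a3, a5, t]; minimize t subject to |1 - p(x)| <= t at every grid point.
c = np.array([0.0, 0.0, 0.0, 1.0])
A_ub = np.vstack([
    np.hstack([P, -np.ones((len(xs), 1))]),   #  p(x) - t <= 1
    np.hstack([-P, -np.ones((len(xs), 1))]),  # -p(x) - t <= -1
])
b_ub = np.concatenate([np.ones(len(xs)), -np.ones(len(xs))])
res = linprog(c, A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * 3 + [(0, None)], method="highs")

a1, a3, a5, err = res.x
print(f"a1={a1:.4f}, a3={a3:.4f}, a5={a5:.4f}, max error ~ {err:.4f}")
```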
Following @kellerjordan0 and @YouJiacheng, we use a sequence of T=5 odd polynomials to approximate the matrix sign. Polynomials of matrices are very GPU-friendly, because they only use matrix multiplies and matrix sums. After applying these polynomials we have X_T ≈ polar(M).
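To illustrate just the structure (not our coefficients): each step is one odd quintic in the matrix, X <- a*X + b*(X X^T)X + c*(X X^T)^2 X. The sketch below uses the classical quintic Newton-Schulz coefficients (15/8, -10/8, 3/8) purely as placeholders, not the tuned PolarExpress or zeropower_via_newtonschulz5 ones.

```python
# Structure-only sketch: T odd quintic matrix polynomials, built from matmuls and sums.
import torch

def apply_odd_quintics(M: torch.Tensor, coeffs) -> torch.Tensor:
    X = M / (M.norm() + 1e-7)      # scale so all singular values are <= 1
    for a, b, c in coeffs:         # one quintic per iteration, T = len(coeffs)
        A = X @ X.T                # X X^T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X                       # for good coefficients, X_T ~= polar(M) = U V^T

# Placeholder (classical Newton-Schulz) coefficients, repeated T=5 times.
coeffs = [(15 / 8, -10 / 8, 3 / 8)] * 5
M = torch.randn(64, 32)
X_T = apply_odd_quintics(M, coeffs)
```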