
Robert M. Gower 🇺🇦
@gowerrobert
1K Followers · 4K Following · 97 Media · 524 Statuses
Often found scribbling down math with intermittent bursts of bashing out code.
New York City, USA
Joined June 2011
RT @noahamsel: How can classical numerical analysis help train deep nets faster? Climb aboard the Polar Express to find out. https://t.c…
First, it turns out that when you set β_1=β_2 in Adam(W), it's generally a better default for training LLMs (see link and @orvieto_antonio's extensive ablation). Since we only care about training LLMs now, let's fix β_1=β_2.
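A minimal NumPy sketch of one AdamW-style step with the two moment coefficients tied, β_1 = β_2 = β, as the tweet advocates. The function name, the value β = 0.95, and the other hyperparameters are illustrative assumptions, not recommendations from the paper.

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, beta=0.95, eps=1e-8, wd=0.0):
    """One AdamW-style update with tied moments (beta1 = beta2 = beta).

    w: parameters, g: gradient, m/v: first/second moment buffers,
    t: step count (starting at 1) for bias correction.
    """
    m = beta * m + (1 - beta) * g          # first moment (momentum)
    v = beta * v + (1 - beta) * g * g      # second moment
    m_hat = m / (1 - beta ** t)            # bias corrections
    v_hat = v / (1 - beta ** t)
    # Decoupled weight decay, as in AdamW
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * w)
    return w, m, v
```

With a constant-sign gradient the tied-beta update behaves like sign descent scaled by lr, which is one intuition for why a single β can suffice.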
Adam resembles many algorithms, but no simpler variant has effectively replaced it for language models. The community is starting to get the recipe right, but what is the secret sauce? @gowerrobert and I found that it has to do with the beta parameters and variational inference.
For full details see our new arXiv paper. All the heavy lifting and method development was done/led by @noahamsel and @davpersson, in collaboration with Prof. Chris Musco. Big thanks to @jxbz, @kellerjordan0 and @YouJiacheng for leading the way here!
We got an improved validation loss compared to using @kellerjordan0's and @YouJiacheng's variants within the Muon method. In fact, we got a better loss for every choice of the (max) learning rate (training GPT-2 Large on 1B tokens of FineWeb).
The final PolarExpress method for T=5 is simple, and has the same structure as the variants of zeropower_via_newtonschulz5 developed by @kellerjordan0 and @YouJiacheng. So it's easy to drop in our version if you want to try. We tried it out and…
Following @kellerjordan0 and @YouJiacheng, we use a sequence of T=5 odd polynomials to approximate the matrix sign function. Polynomials of matrices are very GPU-friendly, because they use only matrix multiplies and matrix sums. After applying these polynomials we have X_T ≈ polar(M).
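A minimal NumPy sketch of the kind of odd-polynomial iteration described above. It uses the classic Newton-Schulz quintic coefficients from @kellerjordan0's zeropower_via_newtonschulz5, not the tuned PolarExpress coefficients (those are in the paper); the function name, the Frobenius-norm scaling, and the square-matrix assumption are illustrative.

```python
import numpy as np

def zeropower_via_odd_polynomials(M, T=5, coeffs=(3.4445, -4.7750, 2.0315)):
    """Approximate the polar factor of a square matrix M by iterating the
    odd quintic  X <- a*X + b*(X X^T) X + c*(X X^T)^2 X,  which only needs
    matrix multiplies and sums (hence GPU-friendly).

    The default (a, b, c) is the Newton-Schulz quintic used in Muon; it
    drives each singular value toward 1, so X_T approximates polar(M).
    """
    a, b, c = coeffs
    # Scale so all singular values lie in (0, 1] (Frobenius norm bound).
    X = M / (np.linalg.norm(M) + 1e-7)
    for _ in range(T):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X
```

On a diagonal matrix the iteration reduces to the scalar map p(x) = a·x + b·x³ + c·x⁵ applied to each singular value, which is a handy way to sanity-check any coefficient choice.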
RT @bremen79: I have an opening for a post-doc position: I am looking for smart people with a strong CV in optimization and/or online learn…