jianlin.su (@Jianlin_S)
Followers: 4K · Following: 43 · Media: 5 · Statuses: 100
Bio: Grad&Clip is all you need @Kimi_Moonshot · Blog: https://t.co/YVxsWykMw2 · Cool Papers: https://t.co/scS1n1o0lg
Joined February 2025
Low-precision attention may suffer from biased rounding errors https://t.co/0hxHG3tPu2
Beyond MuP: 1. The Self-Cultivation of a Good Model https://t.co/kR1XkNOPgd This series will share some top-down attempts at model optimization -- an extension and expansion of the ideas of MuP and Muon.
Fast (but non-rigorous) estimation of the spectral norm of random matrices https://t.co/kF0fPC5XLu
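The post itself isn't reproduced here, but one common back-of-the-envelope estimate of this kind: for an m×n matrix with i.i.d. zero-mean entries of standard deviation sigma, the spectral norm concentrates around sigma·(sqrt(m) + sqrt(n)). A minimal numpy check of that heuristic (the helper name is mine, not from the post):

```python
import numpy as np

def spectral_norm_estimate(m, n, sigma=1.0):
    """Heuristic: ||A||_2 ≈ sigma * (sqrt(m) + sqrt(n)) for i.i.d. zero-mean entries."""
    return sigma * (np.sqrt(m) + np.sqrt(n))

rng = np.random.default_rng(0)
m, n, sigma = 1024, 4096, 0.02
A = rng.normal(0.0, sigma, size=(m, n))

exact = np.linalg.norm(A, 2)                  # exact spectral norm (largest singular value)
approx = spectral_norm_estimate(m, n, sigma)  # heuristic, no SVD needed
print(f"exact     = {exact:.4f}")
print(f"heuristic = {approx:.4f}")            # typically agrees to within a few percent
```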
DiVeQ: A Very Concise Training Method for VQ https://t.co/ty29AvOGCw
Why does linear attention need Short Conv? https://t.co/luUybG3RXj
Rethinking the Relationship Between Learning Rate and Batch Size (Part IV): EMA https://t.co/3xe0mWOUsz
Rethinking the Relationship Between Learning Rate and Batch Size (Part III): Muon https://t.co/DhHKMogazJ
Rethinking the Relationship Between Learning Rate and Batch Size (Part II): Mean Field https://t.co/9LTYdmJBbg
Why is Adam's Update RMS 0.2? https://t.co/nauRUZVjqt TLDR: Adam_Update_RMS ≈ sqrt((1 - beta1) / (1 + beta1))
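The claim is easy to check numerically: feed Adam i.i.d. Gaussian "gradients" and measure the RMS of its update; with beta1 = 0.9 the prediction is sqrt(0.1/1.9) ≈ 0.23. A minimal simulation sketch (variable names mine, not from the linked post):

```python
import numpy as np

# Run Adam's moment updates on i.i.d. N(0, 1) gradients and measure the RMS of
# the update u = m_hat / (sqrt(v_hat) + eps); compare to sqrt((1-beta1)/(1+beta1)).
rng = np.random.default_rng(0)
beta1, beta2, eps = 0.9, 0.999, 1e-8
dim, steps = 10_000, 5_000

m = np.zeros(dim)
v = np.zeros(dim)
for t in range(1, steps + 1):
    g = rng.normal(size=dim)              # stand-in gradient
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2

m_hat = m / (1 - beta1**steps)            # bias correction at the final step
v_hat = v / (1 - beta2**steps)
u = m_hat / (np.sqrt(v_hat) + eps)

print(f"measured RMS  = {np.sqrt(np.mean(u**2)):.3f}")
print(f"predicted RMS = {np.sqrt((1 - beta1) / (1 + beta1)):.3f}")   # ≈ 0.229 for beta1 = 0.9
```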
Rethinking the Relationship Between Learning Rate and Batch Size (Part I): Current Status https://t.co/AGUw2aEULB
Cool Papers + Zotero https://t.co/XNAzKRBTQU
https://t.co/xPpakWvPr2
A fun fact: Adam remains the dominant optimizer today, yet even it has had only scant opportunities to be verified on trillion-parameter models; Muon, proposed less than a year ago, has already been used to train at that scale.
Muon + Stiefel https://t.co/w0gmk91hXd solves the open problem posed by @jxbz at https://t.co/yGbJDTtJlK. cc @leloykun
Link preview: docs.modula.systems — "On this page, we will work out an algorithm for performing gradient descent on the manifold of orthogonal..."
https://t.co/ZbauVxDyQF Muon on the orthogonal manifold, first from @jxbz
https://t.co/6aqqQXec0v This series opener explores the steepest-descent direction for equality-constrained optimization, starting with an SGD variant tailored to the hypersphere constraint.
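The post's actual derivation isn't reproduced here; as a generic sketch of the hypersphere-constrained idea (the helper and toy objective are mine), one natural SGD variant projects the gradient onto the tangent space at w, takes a plain step, then retracts back onto the sphere by rescaling:

```python
import numpy as np

def sphere_sgd_step(w, grad, lr):
    """One SGD step constrained to the hypersphere ||w|| = const:
    project the gradient onto the tangent space at w, step, then
    retract (renormalize) back onto the sphere."""
    radius = np.linalg.norm(w)
    g_tan = grad - (np.dot(grad, w) / np.dot(w, w)) * w   # drop the radial component
    w_new = w - lr * g_tan
    return w_new * (radius / np.linalg.norm(w_new))       # retraction back to the sphere

# Toy usage: minimize w @ A @ w on the unit sphere, which drives w toward
# the eigenvector of A with the smallest eigenvalue.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 50))
A = A + A.T
A = A / np.linalg.norm(A, 2)              # rescale so eigenvalues lie in [-1, 1]
w = rng.normal(size=50)
w = w / np.linalg.norm(w)
for _ in range(5000):
    w = sphere_sgd_step(w, grad=2 * A @ w, lr=0.1)
print("final value   :", w @ A @ w)
print("min eigenvalue:", np.linalg.eigvalsh(A).min())     # the two should be close
```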
https://t.co/PQVwrJ8ity Extends the previous article to compute general G P^{-s/r}
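Whatever iterative scheme the article uses, the target quantity has a simple dense reference: for symmetric positive-definite P, G P^{-s/r} is G times a fractional power of P, computable from the eigendecomposition of P. A sketch useful for checking faster methods against (names mine, not from the article):

```python
import numpy as np

def fractional_power(P, exponent):
    """P**exponent for symmetric positive-definite P, via eigendecomposition."""
    eigvals, eigvecs = np.linalg.eigh(P)
    return (eigvecs * eigvals**exponent) @ eigvecs.T

def apply_G_P_power(G, P, s, r):
    """Dense O(n^3) reference for G @ P^{-s/r}; iterative schemes avoid the eigendecomposition."""
    return G @ fractional_power(P, -s / r)

# Quick sanity check on a random well-conditioned SPD matrix.
rng = np.random.default_rng(0)
n = 64
X = rng.normal(size=(n, n))
P = X @ X.T + n * np.eye(n)
G = rng.normal(size=(16, n))

out = apply_G_P_power(G, P, s=1, r=2)          # G P^{-1/2}
recon = out @ fractional_power(P, 0.5)         # multiplying back by P^{1/2} should recover G
print("max abs error:", np.max(np.abs(recon - G)))
```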
https://t.co/XB02GaAB2c An elegant method for computing P^{1/2}, P^{-1/2} and G P^{-1/2}, reusing the coefficients of msign.
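The specific coefficients reused from msign aren't reproduced here; as a generic illustration of the same matmul-only family, a standard coupled Newton–Schulz iteration produces both P^{1/2} and P^{-1/2} for SPD P, after which G P^{-1/2} is one extra multiply (this is a textbook iteration, not necessarily the post's method):

```python
import numpy as np

def newton_schulz_sqrt(P, iters=30):
    """Coupled Newton–Schulz iteration for SPD P.

    Returns (P^{1/2}, P^{-1/2}) using only matrix multiplications, after
    normalizing P so that its eigenvalues lie in (0, 1] and the iteration converges."""
    n = P.shape[0]
    I = np.eye(n)
    c = np.linalg.norm(P)            # Frobenius norm >= spectral norm, so P/c has eigenvalues in (0, 1]
    Y, Z = P / c, I.copy()           # Y -> (P/c)^{1/2}, Z -> (P/c)^{-1/2}
    for _ in range(iters):
        T = 0.5 * (3.0 * I - Z @ Y)
        Y, Z = Y @ T, T @ Z
    return np.sqrt(c) * Y, Z / np.sqrt(c)

# Sanity check, including G P^{-1/2} as a single extra matmul.
rng = np.random.default_rng(0)
n = 64
X = rng.normal(size=(n, n))
P = X @ X.T + n * np.eye(n)          # well-conditioned SPD matrix
G = rng.normal(size=(16, n))

P_half, P_inv_half = newton_schulz_sqrt(P)
print("||P_half @ P_half - P||        =", np.linalg.norm(P_half @ P_half - P))
print("||P_inv_half @ P @ P_inv_half - I|| =",
      np.linalg.norm(P_inv_half @ P @ P_inv_half - np.eye(n)))
GP_inv_half = G @ P_inv_half         # the G P^{-1/2} form mentioned in the post
```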