jianlin.su Profile
jianlin.su

@Jianlin_S

Followers: 4K
Following: 43
Media: 5
Statuses: 100

Grad&Clip is all you need @Kimi_Moonshot Blog: https://t.co/YVxsWykMw2, Cool Papers: https://t.co/scS1n1o0lg

Joined February 2025
@Jianlin_S
jianlin.su
6 days
Low-precision attention may suffer from biased rounding errors https://t.co/0hxHG3tPu2
1
14
142
@Jianlin_S
jianlin.su
11 days
Beyond MuP: 1. The Self-Cultivation of a Good Model https://t.co/kR1XkNOPgd This series will share some top-down attempts at model optimization -- an extension and expansion of the ideas of MuP and Muon.
1
13
119
@Jianlin_S
jianlin.su
20 days
Fast (but non-rigorous) estimation of the spectral norm of random matrices https://t.co/kF0fPC5XLu
2
13
96
@Jianlin_S
jianlin.su
24 days
DiVeQ: A Very Concise Training Method for VQ https://t.co/ty29AvOGCw
1
5
46
@Jianlin_S
jianlin.su
27 days
Why does linear attention need Short Conv? https://t.co/luUybG3RXj
2
29
203
@Jianlin_S
jianlin.su
1 month
Asymptotic Estimate of Weight RMS for AdamW https://t.co/9Lxm4c2occ
0
17
126
@Jianlin_S
jianlin.su
1 month
Rethinking the Relationship Between Learning Rate and Batch Size (Part IV): EMA https://t.co/3xe0mWOUsz
2
11
138
@Jianlin_S
jianlin.su
2 months
Rethinking the Relationship Between Learning Rate and Batch Size (Part III): Muon https://t.co/DhHKMogazJ
3
26
161
@Jianlin_S
jianlin.su
2 months
Rethinking the Relationship Between Learning Rate and Batch Size (Part II): Mean Field https://t.co/9LTYdmJBbg
0
7
43
@Jianlin_S
jianlin.su
2 months
Why is Adam's Update RMS 0.2? https://t.co/nauRUZVjqt TLDR: Adam_Update_RMS ≈ sqrt((1 - beta1) / (1 + beta1))
4
17
154
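A quick numerical sanity check of the TLDR formula above, offered as a minimal sketch rather than the linked post's derivation. It assumes pure-noise gradients drawn i.i.d. from N(0, 1) and the common defaults beta1 = 0.9, beta2 = 0.999; the function names, parameter counts, and step counts are hypothetical choices made for illustration. Under those assumptions the closed form gives about 0.229, which is where the "0.2" in the title comes from.

```python
import math
import random


def predicted_update_rms(beta1: float) -> float:
    """Closed-form estimate quoted in the post: sqrt((1 - beta1) / (1 + beta1))."""
    return math.sqrt((1.0 - beta1) / (1.0 + beta1))


def simulated_update_rms(beta1=0.9, beta2=0.999, n_params=5000,
                         n_steps=300, eps=1e-8, seed=0):
    """RMS of Adam's bias-corrected update m_hat / (sqrt(v_hat) + eps).

    Assumption (not from the post): every gradient entry is i.i.d. N(0, 1),
    i.e. the gradient is pure noise with no signal.
    """
    rng = random.Random(seed)
    m = [0.0] * n_params  # first-moment (momentum) estimates
    v = [0.0] * n_params  # second-moment estimates
    for _ in range(n_steps):
        for i in range(n_params):
            g = rng.gauss(0.0, 1.0)
            m[i] = beta1 * m[i] + (1.0 - beta1) * g
            v[i] = beta2 * v[i] + (1.0 - beta2) * g * g
    # Standard Adam bias-correction factors at step n_steps.
    c1 = 1.0 - beta1 ** n_steps
    c2 = 1.0 - beta2 ** n_steps
    total = 0.0
    for i in range(n_params):
        u = (m[i] / c1) / (math.sqrt(v[i] / c2) + eps)
        total += u * u
    return math.sqrt(total / n_params)


if __name__ == "__main__":
    print("predicted:", predicted_update_rms(0.9))
    print("simulated:", simulated_update_rms())
```

The simulated value should land close to the prediction but not match it exactly, since the second-moment estimate is itself noisy after a finite number of steps. Gradients with a strong consistent signal would push the update RMS higher than this noise-only estimate.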
@Jianlin_S
jianlin.su
2 months
Rethinking the Relationship Between Learning Rate and Batch Size (Part I): Current Status https://t.co/AGUw2aEULB
2
15
172
@Jianlin_S
jianlin.su
2 months
0
5
34
@Jianlin_S
jianlin.su
2 months
Muon + Spectral Sphere:
2
18
186
@Jianlin_S
jianlin.su
3 months
A fun fact: Adam remains the dominant optimizer today, yet even it has had only scant opportunities to be verified on trillion-parameter models; Muon, proposed less than a year ago, has already trained at that scale.
4
6
130
@Jianlin_S
jianlin.su
3 months
https://t.co/ZbauVxDyQF Muon on the orthogonal manifold, an idea first from @jxbz
0
10
96
@Jianlin_S
jianlin.su
3 months
https://t.co/6aqqQXec0v This series opener explores the steepest-descent direction for equality-constrained optimization, starting with an SGD variant tailored to the hypersphere constraint.
0
0
19
@Jianlin_S
jianlin.su
3 months
https://t.co/PQVwrJ8ity Extends the previous article to compute any G P^{-s/r}
0
2
21
@Jianlin_S
jianlin.su
4 months
https://t.co/XB02GaAB2c A neat method for computing P^{1/2}, P^{-1/2}, and GP^{-1/2}, reusing the coefficients of msign.
0
5
23