jianlin.su @Jianlin_S X Profile

Followers: 4K · Following: 43 · Media: 5 · Statuses: 100

Grad&Clip is all you need @Kimi_Moonshot Blog: https://t.co/YVxsWykMw2, Cool Papers: https://t.co/scS1n1o0lg

Joined February 2025

jianlin.su

@Jianlin_S

6 days

Low-precision attention may suffer from biased rounding errors https://t.co/0hxHG3tPu2

1

14

142

jianlin.su

@Jianlin_S

11 days

Beyond MuP: 1. The Self-Cultivation of a Good Model https://t.co/kR1XkNOPgd This series will share some top-down attempts at model optimization -- an extension and expansion of the ideas of MuP and Muon.

1

13

119

jianlin.su

@Jianlin_S

20 days

Fast (but non-rigorous) estimation of the spectral norm of random matrices https://t.co/kF0fPC5XLu

2

13

96

jianlin.su

@Jianlin_S

24 days

DiVeQ: A Very Concise Training Method for VQ https://t.co/ty29AvOGCw

1

5

46

jianlin.su

@Jianlin_S

27 days

Why does linear attention need Short Conv? https://t.co/luUybG3RXj

2

29

203

jianlin.su

@Jianlin_S

1 month

Asymptotic Estimate of Weight RMS for AdamW https://t.co/9Lxm4c2occ

0

17

126

jianlin.su

@Jianlin_S

1 month

Rethinking the Relationship Between Learning Rate and Batch Size (Part IV): EMA https://t.co/3xe0mWOUsz

2

11

138

jianlin.su

@Jianlin_S

2 months

Rethinking the Relationship Between Learning Rate and Batch Size (Part III): Muon https://t.co/DhHKMogazJ

3

26

161

jianlin.su

@Jianlin_S

2 months

Rethinking the Relationship Between Learning Rate and Batch Size (Part II): Mean Field https://t.co/9LTYdmJBbg

0

7

43

jianlin.su

@Jianlin_S

2 months

Why is Adam's Update RMS 0.2? https://t.co/nauRUZVjqt TLDR: Adam_Update_RMS ≈ sqrt((1 - beta1) / (1 + beta1))

4

17

154
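
The TL;DR in the post above, Adam_Update_RMS ≈ sqrt((1 - beta1) / (1 + beta1)), can be checked with a small simulation. This is only a sketch under an assumption that is not stated in the post: every gradient entry is i.i.d. standard normal noise, so the second-moment estimate stays near 1 and the update RMS is set by the first-moment average. The function names, step count, and parameter count are illustrative choices, not taken from the linked article.

```python
import math
import random


def predicted_rms(beta1):
    """Closed-form estimate quoted in the post: sqrt((1 - beta1) / (1 + beta1))."""
    return math.sqrt((1.0 - beta1) / (1.0 + beta1))


def adam_update_rms(beta1=0.9, beta2=0.999, steps=300, n_params=4000,
                    eps=1e-8, seed=0):
    """Empirical RMS of Adam's update u = m_hat / (sqrt(v_hat) + eps).

    Assumption (ours, for illustration): each gradient entry at each step
    is an independent draw from N(0, 1), i.e. pure noise with no signal.
    """
    rng = random.Random(seed)
    m = [0.0] * n_params  # first-moment (momentum) estimates
    v = [0.0] * n_params  # second-moment estimates
    for _ in range(steps):
        for i in range(n_params):
            g = rng.gauss(0.0, 1.0)
            m[i] = beta1 * m[i] + (1.0 - beta1) * g
            v[i] = beta2 * v[i] + (1.0 - beta2) * g * g

    # Standard Adam bias corrections, evaluated at the final step.
    c1 = 1.0 - beta1 ** steps
    c2 = 1.0 - beta2 ** steps
    total = 0.0
    for i in range(n_params):
        u = (m[i] / c1) / (math.sqrt(v[i] / c2) + eps)
        total += u * u
    return math.sqrt(total / n_params)


if __name__ == "__main__":
    print("predicted:", predicted_rms(0.9))
    print("simulated:", adam_update_rms())
```

With beta1 = 0.9 the closed form gives roughly 0.23, which is the "0.2" in the post title, and under the noise-only assumption the simulated value should land close to it. Real training gradients carry signal and correlations across steps, so the match there may be looser.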

jianlin.su

@Jianlin_S

2 months

Rethinking the Relationship Between Learning Rate and Batch Size (Part I): Current Status https://t.co/AGUw2aEULB

2

15

172

jianlin.su

@Jianlin_S

2 months

Cool Papers + Zotero https://t.co/XNAzKRBTQU https://t.co/xPpakWvPr2

0

5

34

jianlin.su

@Jianlin_S

2 months

Muon + Spectral Sphere:

2

18

186

jianlin.su

@Jianlin_S

3 months

A fun fact: Adam remains the dominant optimizer today, yet even it has had only scant opportunities to be verified on trillion-parameter models; Muon, proposed less than a year ago, has already trained at that scale.

4

6

130

jianlin.su

@Jianlin_S

3 months

Muon + Stiefel https://t.co/w0gmk91hXd solved the open problem at https://t.co/yGbJDTtJlK from @jxbz. cc @leloykun

0

18

200

jianlin.su

@Jianlin_S

3 months

https://t.co/ZbauVxDyQF Muon on orthogonal manifold, first from @jxbz

0

10

96

jianlin.su

@Jianlin_S

3 months

https://t.co/6aqqQXec0v This series opener explores the steepest-descent direction for equality-constrained optimization, starting with an SGD variant tailored to the hypersphere constraint.

0

19

jianlin.su

@Jianlin_S

3 months

https://t.co/PQVwrJ8ity Extend the last article to calculate any G P^{-s/r}

0

2

21

jianlin.su

@Jianlin_S

4 months

https://t.co/XB02GaAB2c A neat method for computing P^{1/2}, P^{-1/2}, and GP^{-1/2}, reusing the coefficients of msign.

0

5

23