Seunghyun Seo @SeunghyunSEO7 X Profile

Seunghyun Seo

@SeunghyunSEO7

Followers

3K

Following

5K

Media

234

Statuses

1K

deep learning enjoyer. from speech to llm @ naver, now exploring image space @midjourney

South Korea

Joined July 2021

Don't wanna be here? Send us removal request.

Seunghyun Seo

@SeunghyunSEO7

6 months

btw, i wrote a post about "how to scale" based on what i've learned over the past few months. it covers muP, HP scaling laws, and some stuffs. would be happy to get any feedback or discussion. (it's pretty verbose and no TL;DR, sorry lol) https://t.co/Tfr2x8e4fl

12

79

704

JingyuanLiu

@JingyuanLiu123

6 hours

Some thoughts/experiences on this for people to onboard to use Muon (unfortunately, it is not as easy as element-wise AdamW to use ...). In the open-source community, I helped onboard Muon into PyTorch and Megatron-LM, and I usually follow these steps to develop Muon on a

arxiv.org

Recently, the Muon optimizer based on matrix orthogonalization has demonstrated strong results in training small-scale language models, but the scalability to larger models has not been proven. We...

Yuchen Jin

@Yuchenj_UW

18 hours

Whenever people ask me, “Is Muon optimizer just hype?” I need to show them this. Muon isn’t just verified and used in Kimi; other frontier labs like OpenAI are using it and its variants. It’s also in PyTorch stable now!

3

10

92

Seunghyun Seo

@SeunghyunSEO7

5 days

Who cares about the truth... they just cite these to spread their thoughts.

0

1

Seunghyun Seo

@SeunghyunSEO7

5 days

???: South Korea push their people to death for 100 hours a week but their gdp per capita is still a third of america. ppl: "wow, i didnt know that" ???: It's past. they cant overwork now, it's illegal. there is also salary cap. everybody goes abroad. ppl: "wow i didnt know that"

Seunghyun Seo

@SeunghyunSEO7

5 days

@cloneofsimo Korea is a country that went through the Korean War and literally started from nothing. In the past, people might have actually worked 80 to 100 hours a week, but now the labor laws have changed so much that even those who want to overwork are legally not allowed to work more.

1

0

1

Seunghyun Seo

@SeunghyunSEO7

8 days

anybody can give me references for muon with fused layer? like qkv proj or parallel layer. i remember someone claims that separated qkv is better, but idk why considering more params is worse (treat muon as 2nd order methods). maybe due to poor lr scaling of fused layer’s shape?

8

2

38

Seunghyun Seo

@SeunghyunSEO7

10 days

i don't know their internal, but guess they put some efforts on precision for rl training. maybe that's one of their moat??

1

0

9

Seunghyun Seo

@SeunghyunSEO7

10 days

when i join the scene, fp16 mixed-precision training was dominant, and i personally love how oai folks worked hard to stabilize this in dall-e 1 paper. per block grad scaling, fp32 for sensitive ops, different format for m, v, clip v vy value to 5, but used powerSGD, lol

3

6

58

Seunghyun Seo

@SeunghyunSEO7

10 days

lmfao, this week is like - “who use full attn till now? -> linear attn sucks at agentic task (minimax) -> KDA is sota (kimi) -> …” - “who care fp16 in 2025? -> narrow dynamic ranges but higher precision matters for rl -> …”

liuyo

@iamdiegopy

11 days

the book is like 1 day old and things are already changing @eliebakouch @Thom_Wolf

5

7

134

Seunghyun Seo

@SeunghyunSEO7

11 days

just noticed modded-nanogpt adopt 'NorMuon' as default (?). it looks like `AdaMuon`. i personally didnt buy this idea because i thought Muon is enough and dont want to introduce optim state for 2nd moment again like adam... hmm https://t.co/pMAIOuNGW5 https://t.co/YG1hjC3BEY

6

7

65

Seunghyun Seo

@SeunghyunSEO7

11 days

some figures from my old slide for the information.

0

11

Seunghyun Seo

@SeunghyunSEO7

11 days

fwiw, when casting rollout actor as bf16 with vllm in 2023, most of samples are clipped out when using ppo even at first step (should log prob diff = 0) i suspected our early llm had wider activation and param range, so temporarily fixed this with fp16. https://t.co/zZzI5TgmIi

Rosinality

@rosinality

11 days

FP16 can have a smaller training-inference gap compared to BFloat16, thus fits better for RL. Even the difference between RL algorithms vanishes once FP16 is adopted. Surprising!

2

1

32

Seunghyun Seo

@SeunghyunSEO7

13 days

putting everything else aside, don’t like the leetcode style interviews either, but isn’t the whole point of coding interviews to see how a candidate approaches and solves a problem? tbh, I never really understood why people thought this kind of thing was ‘cool’.

Simo Ryu

@cloneofsimo

14 days

"Buy our product for 899$ to cheat your way in 200k$ job" In South Korea and Japan, you can leave your wallet and laptop in the public and nobody would steal. People will return your wallet if you lost in in the street. We have a collective 'social capital', where all implictly

2

0

8

Seunghyun Seo

@SeunghyunSEO7

14 days

feel the agi

0

1

45

Seunghyun Seo

@SeunghyunSEO7

14 days

4. and some publically available frontier models like dsv3 still use SP with low enough std like 0.006 (4/n) lately, I’ve been reflecting on how important data processing and large-scale engineering are for good production, imho.

2

0

19

Seunghyun Seo

@SeunghyunSEO7

14 days

3. muon's gain is actually 1.2~1.4x compared to well tuned adamw, and it shows diminishing returns in over-trained regime or larger model size (but i agree with that large critical batch size matters in large scale and 1.2x gain is still huge!) https://t.co/ocIPJqnw6K (3/n)

2

0

23

Seunghyun Seo

@SeunghyunSEO7

14 days

2. hidden matrices' curvature barely changes during training, except final layer https://t.co/gNyq5JyzhC (2/n)

1

0

14

Seunghyun Seo

@SeunghyunSEO7

14 days

my takes on recent studies about optimization is like how hard to dethrone adam(w) and SP combination in large scale... 1. muP plays a crucial role on early training but it enters a steady state after sufficient steps https://t.co/zShOookCEG https://t.co/prQ1e5wjAw (1/n)

arxiv.org

Empirical scaling laws prescribe how to allocate parameters, data, and compute, while maximal-update parameterization ($μ$P) enables learning-rate transfer across widths by equalizing early-time...

Atli Kosson

@AtliKosson

19 days

Why override µP? Because its core assumptions only hold very early in training! In practice wide models quickly stop being more sensitive to weight updates than smaller models! This is caused by changes in the geometric alignment of updates and layer inputs over training. 🧵6/8

2

7

93

Seunghyun Seo

@SeunghyunSEO7

16 days

Which side, Qwen?

Yuchen Jin

@Yuchenj_UW

17 days

Ok, I know why. Qwen 3: all-in on BTC, 20x leverage. GPT-5: diversified portfolio... but lost a lot. classic overthinking.

1

18

Seunghyun Seo

@SeunghyunSEO7

18 days

@AtliKosson oh sorry, they used adafactor where it's equivalent to adamw with param scaling lr, so it is more complicated. in addition, google's framework like tnesorflow gonna perform independent WD unlike torch when using adamw.

2

0

1

Seunghyun Seo

@SeunghyunSEO7

18 days

but, did you guys know that google used lr^2 WD? assume they scaled lr properly like 1/d rule. this simple muP transfer lr well even though it's not maximal. but if it's right, it is opposite to above works, it further scale down WD by 1/d^2 for all weights. wdyt @AtliKosson ?

1

0

4

Seunghyun Seo

@SeunghyunSEO7

18 days

interesting https://t.co/D7QaOwjktX

Atli Kosson

@AtliKosson

18 days

@cloneofsimo It’s not necessarily opposite, we discuss this a bit in the manuscript. For short runs WD doesn’t really matter, but it also turns out without WD all LRs kind of work the same in the long run. The size of the relative feature updates loses its dependence on the peak LR.

0

5