Seunghyun Seo
@SeunghyunSEO7
Followers
3K
Following
5K
Media
234
Statuses
1K
deep learning enjoyer. from speech to llm @ naver, now exploring image space @midjourney
South Korea
Joined July 2021
btw, i wrote a post about "how to scale" based on what i've learned over the past few months. it covers muP, HP scaling laws, and some other stuff. i'd be happy to get any feedback or discussion. (it's pretty verbose with no TL;DR, sorry lol) https://t.co/Tfr2x8e4fl
12
79
704
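To make the "HP scaling laws" part concrete, here is a minimal sketch (not taken from the post itself) of fitting a power law for the tuned learning rate against model size; the sweep numbers and the `predict_lr` helper are hypothetical.

```python
import numpy as np

# hypothetical sweep results: (non-embedding params, best LR found by a grid search)
params = np.array([1.2e8, 3.5e8, 1.1e9, 3.0e9])
best_lr = np.array([6.0e-4, 4.2e-4, 2.8e-4, 1.9e-4])

# fit log(lr) = a * log(N) + b, i.e. lr ≈ exp(b) * N**a
a, b = np.polyfit(np.log(params), np.log(best_lr), deg=1)

def predict_lr(n_params: float) -> float:
    """Extrapolate the tuned LR to a larger model under the fitted power law."""
    return float(np.exp(b) * n_params ** a)

print(predict_lr(7e9))  # e.g. a predicted peak LR for a 7B model
```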
Some thoughts/experiences on this for people onboarding to Muon (unfortunately, it is not as easy to use as element-wise AdamW ...). In the open-source community, I helped onboard Muon into PyTorch and Megatron-LM, and I usually follow these steps to develop Muon on a
arxiv.org
Recently, the Muon optimizer based on matrix orthogonalization has demonstrated strong results in training small-scale language models, but the scalability to larger models has not been proven. We...
Whenever people ask me, “Is the Muon optimizer just hype?”, I need to show them this. Muon isn’t just verified and used at Kimi; other frontier labs like OpenAI are using it and its variants. It’s also in PyTorch stable now!
3
10
92
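For readers who only know Muon by name, a minimal sketch of the update rule for a single 2D weight, assuming plain heavy-ball momentum and the Newton-Schulz coefficients popularized by the modded-nanogpt implementation; this illustrates the math, not the PyTorch or Megatron-LM API.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximate UV^T of the momentum matrix via a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16()
    if G.size(0) > G.size(1):
        X = X.T                       # work in the wide orientation
    X = X / (X.norm() + 1e-7)         # spectral norm must be <= 1 for the iteration to converge
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    if G.size(0) > G.size(1):
        X = X.T
    return X.to(G.dtype)

@torch.no_grad()
def muon_step(weight, grad, momentum_buf, lr=0.02, beta=0.95):
    """One sketch Muon step for a single 2D weight (no WD, no distributed logic)."""
    momentum_buf.mul_(beta).add_(grad)                   # heavy-ball momentum
    update = newton_schulz_orthogonalize(momentum_buf)   # push singular values toward 1
    scale = max(1.0, weight.size(0) / weight.size(1)) ** 0.5
    weight.add_(update, alpha=-lr * scale)               # shape-dependent RMS matching
```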
Who cares about the truth... they just cite these to spread their thoughts.
0
0
1
???: south korea pushes its people to death working 100 hours a week, but their gdp per capita is still a third of america's. ppl: "wow, i didn't know that" ???: that's the past. they can't overwork now, it's illegal. there's also a salary cap. everybody goes abroad. ppl: "wow, i didn't know that"
@cloneofsimo Korea is a country that went through the Korean War and literally started from nothing. In the past, people might have actually worked 80 to 100 hours a week, but now the labor laws have changed so much that even those who want to overwork are legally not allowed to work more.
1
0
1
can anybody give me references for muon with fused layers? like qkv proj or parallel layers. i remember someone claimed that separated qkv is better, but idk why more params in one block would be worse (treating muon as a 2nd-order method). maybe due to poor lr scaling for the fused layer's shape?
8
2
38
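To spell out the two options being compared, a toy sketch (hypothetical shapes, with exact SVD standing in for Muon's Newton-Schulz step): orthogonalize the fused [3d, d] qkv weight as one matrix vs. split it into q/k/v blocks and orthogonalize each separately, which also changes the sqrt(rows/cols) lr scale.

```python
import torch

def orthogonalize(G: torch.Tensor) -> torch.Tensor:
    """Exact UV^T via SVD; a stand-in for Muon's Newton-Schulz approximation."""
    U, S, Vh = torch.linalg.svd(G, full_matrices=False)
    return U @ Vh

d_model = 1024
fused_qkv_grad = torch.randn(3 * d_model, d_model)   # hypothetical fused qkv_proj gradient

# option A: treat the fused [3d, d] matrix as a single Muon "parameter";
# the usual shape factor sqrt(max(1, rows/cols)) is sqrt(3) here
fused_update = orthogonalize(fused_qkv_grad) * (3.0 ** 0.5)

# option B: split into q/k/v and orthogonalize each square [d, d] block on its own;
# the shape factor becomes 1 and the three spectra no longer mix
split_update = torch.cat(
    [orthogonalize(chunk) for chunk in fused_qkv_grad.chunk(3, dim=0)], dim=0
)
```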
i don't know their internals, but i guess they put some effort into precision for rl training. maybe that's one of their moats??
1
0
9
when i joined the scene, fp16 mixed-precision training was dominant, and i personally love how the oai folks worked hard to stabilize it in the dall-e 1 paper: per-block grad scaling, fp32 for sensitive ops, different formats for m and v, clipping v by value to 5, but they used powerSGD, lol
3
6
58
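For anyone who didn't live through the fp16 era, the baseline recipe (not the dall-e 1 per-resblock variant) looks roughly like this in today's torch API, assuming a CUDA GPU; the model and loss here are toy placeholders.

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
scaler = torch.cuda.amp.GradScaler()      # dynamic loss scaling keeps fp16 grads above underflow

for _ in range(10):
    x = torch.randn(8, 1024, device="cuda")
    with torch.autocast("cuda", dtype=torch.float16):   # fp16 matmuls, fp32 for sensitive ops
        loss = model(x).float().pow(2).mean()           # keep the reduction in fp32
    scaler.scale(loss).backward()   # scale the loss up so small grads don't flush to zero
    scaler.step(opt)                # unscales grads and skips the step on inf/nan
    scaler.update()                 # grow/shrink the scale based on overflow history
    opt.zero_grad(set_to_none=True)
```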
lmfao, this week is like
- “who still uses full attn? -> linear attn sucks at agentic tasks (minimax) -> KDA is sota (kimi) -> …”
- “who cares about fp16 in 2025? -> narrow dynamic range but higher precision matters for rl -> …”
5
7
134
just noticed modded-nanogpt adopted 'NorMuon' as the default (?). it looks like `AdaMuon`. i personally didn't buy this idea because i thought Muon was enough and didn't want to reintroduce optim state for a 2nd moment like adam... hmm https://t.co/pMAIOuNGW5
https://t.co/YG1hjC3BEY
6
7
65
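For reference, the extra state i'm skeptical about is roughly this (my rough reading of the AdaMuon-style idea, not the actual modded-nanogpt NorMuon code; names and constants are made up):

```python
import torch

def orthogonalize(G: torch.Tensor) -> torch.Tensor:
    """Exact UV^T via SVD; a stand-in for the Newton-Schulz step used by Muon."""
    U, S, Vh = torch.linalg.svd(G, full_matrices=False)
    return U @ Vh

@torch.no_grad()
def adamuon_like_step(weight, grad, momentum, second_moment,
                      lr=0.02, beta1=0.95, beta2=0.99, eps=1e-8):
    """Muon-style orthogonalized update followed by an Adam-like element-wise rescale.
    `second_moment` is exactly the additional optimizer state the tweet complains about."""
    momentum.mul_(beta1).add_(grad)
    ortho = orthogonalize(momentum)
    second_moment.mul_(beta2).addcmul_(ortho, ortho, value=1 - beta2)
    update = ortho / (second_moment.sqrt() + eps)
    update = update / (update.norm() / update.numel() ** 0.5 + eps)  # re-normalize to RMS ≈ 1
    weight.add_(update, alpha=-lr)
```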
fwiw, when casting the rollout actor to bf16 with vllm in 2023, most samples were clipped out when using ppo even at the first step (where the log-prob diff should be 0). i suspected our early llm had wider activation and param ranges, so we temporarily fixed this with fp16. https://t.co/zZzI5TgmIi
FP16 can have a smaller training-inference gap compared to BFloat16, thus fits better for RL. Even the difference between RL algorithms vanishes once FP16 is adopted. Surprising!
2
1
32
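A minimal sketch of the check i mean, with made-up numbers: at the very first ppo step the rollout policy and the trainer policy share the same weights, so the per-token importance ratio should be exactly 1 and nothing should touch the clip range; a precision mismatch between the two forward passes breaks that.

```python
import torch

clip_eps = 0.2

# hypothetical per-token log-probs from the SAME weights
logp_train = torch.randn(4096) * 2 - 5                  # trainer forward pass
logp_rollout = logp_train + torch.randn(4096) * 0.15    # rollout engine pass with a precision mismatch

ratio = torch.exp(logp_train - logp_rollout)            # should be exactly 1.0 at step 0
clipped = ((ratio < 1 - clip_eps) | (ratio > 1 + clip_eps)).float().mean()
print(f"fraction of tokens already hitting the ppo clip at step 0: {clipped:.2%}")
```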
putting everything else aside, i don't like leetcode-style interviews either, but isn't the whole point of coding interviews to see how a candidate approaches and solves a problem? tbh, I never really understood why people thought this kind of thing was ‘cool’.
"Buy our product for 899$ to cheat your way in 200k$ job" In South Korea and Japan, you can leave your wallet and laptop in the public and nobody would steal. People will return your wallet if you lost in in the street. We have a collective 'social capital', where all implictly
2
0
8
4. and some publicly available frontier models like dsv3 still use SP with a low enough init std like 0.006 (4/n)

lately, I’ve been reflecting on how important data processing and large-scale engineering are for good production, imho.
2
0
19
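To be concrete about "SP with a low enough std", a toy sketch (not dsv3's actual init code):

```python
import torch.nn as nn

d_model = 4096
linear = nn.Linear(d_model, d_model, bias=False)

# standard parameterization: a fixed, width-independent init std for the weight matrices,
# e.g. the 0.006 value mentioned above
nn.init.normal_(linear.weight, mean=0.0, std=0.006)

# contrast: a width-dependent choice would look like std ∝ 1/sqrt(fan_in)
width_scaled_std = d_model ** -0.5   # ≈ 0.0156 at this width
```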
3. muon's gain is actually 1.2~1.4x compared to well-tuned adamw, and it shows diminishing returns in the over-trained regime or at larger model sizes (but i agree that a large critical batch size matters at large scale and a 1.2x gain is still huge!) https://t.co/ocIPJqnw6K (3/n)
2
0
23
2. hidden matrices' curvature barely changes during training, except for the final layer https://t.co/gNyq5JyzhC (2/n)
1
0
14
my take on recent optimization studies is basically how hard it is to dethrone the adam(w) + SP combination at large scale... 1. muP plays a crucial role in early training but enters a steady state after sufficient steps https://t.co/zShOookCEG
https://t.co/prQ1e5wjAw (1/n)
arxiv.org
Empirical scaling laws prescribe how to allocate parameters, data, and compute, while maximal-update parameterization ($μ$P) enables learning-rate transfer across widths by equalizing early-time...
Why override µP? Because its core assumptions only hold very early in training! In practice wide models quickly stop being more sensitive to weight updates than smaller models! This is caused by changes in the geometric alignment of updates and layer inputs over training. 🧵6/8
2
7
93
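For readers who haven't seen muP in code, the early-training knob being discussed is basically this width-dependent LR scaling; a toy sketch for Adam-style optimizers with hypothetical widths and module choices, not a full muP implementation (init and output multipliers are omitted).

```python
import torch
import torch.nn as nn

base_width, width = 256, 4096     # tune HPs at base_width, then transfer them to width
base_lr = 3e-3

model = nn.Sequential(
    nn.Embedding(32000, width),   # input-like weights: lr left unscaled under muP
    nn.Linear(width, width),      # hidden matrix-like weights: lr multiplied by base_width / width
    nn.Linear(width, 32000),      # output head: also gets the 1/width-style treatment
)

param_groups = [
    {"params": model[0].parameters(), "lr": base_lr},
    {"params": model[1].parameters(), "lr": base_lr * base_width / width},
    {"params": model[2].parameters(), "lr": base_lr * base_width / width},
]
opt = torch.optim.AdamW(param_groups, weight_decay=0.1)
```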
@AtliKosson oh sorry, they used adafactor, which is equivalent to adamw with param-scaled lr, so it is more complicated. in addition, google frameworks like tensorflow perform independent WD, unlike torch, when using adamw.
2
0
1
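What i mean by "independent WD" vs torch, as plain arithmetic: torch's AdamW multiplies the decay by the learning rate each step, while the fully decoupled convention applies it on its own; the numbers below are just illustrative.

```python
# torch.optim.AdamW: p <- p - lr * wd * p   (plus the Adam update)
# "independent" / fully decoupled WD:  p <- p - wd * p
lr, wd = 1e-4, 0.1
p = 1.0

coupled_decay_per_step = lr * wd * p   # 1e-5: shrinks further whenever lr is scaled down
independent_decay_per_step = wd * p    # 0.1: unchanged when lr changes, e.g. with width
print(coupled_decay_per_step, independent_decay_per_step)
```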
but did you guys know that google used lr^2 WD? assume they scaled lr properly, like the 1/d rule. this simple muP transfers lr well even though it's not maximal. but if that's right, it is the opposite of the works above: it further scales WD down by 1/d^2 for all weights. wdyt @AtliKosson ?
1
0
4
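Spelling out the arithmetic of that claim with toy numbers: if lr follows a 1/d rule and wd = c * lr^2, then the applied (independent) decay itself shrinks like 1/d^2 as width grows, which is the extra 1/d^2 scaling i mean; the constant c is made up.

```python
c = 1.0e3   # hypothetical constant chosen so wd is O(0.1) at the smallest width
for d in (1024, 4096, 16384):
    lr = 1.0 / d          # "1/d rule" for the hidden-layer learning rate
    wd = c * lr ** 2      # the lr^2 weight-decay rule: wd itself scales like 1/d^2
    print(f"d={d:6d}  lr={lr:.2e}  wd={wd:.2e}")
# with independent WD the per-step decay is wd * p, so it weakens by 1/d^2 with width,
# the opposite direction from keeping wd fixed as in the works above
```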
interesting https://t.co/D7QaOwjktX
@cloneofsimo It’s not necessarily opposite, we discuss this a bit in the manuscript. For short runs WD doesn’t really matter, but it also turns out without WD all LRs kind of work the same in the long run. The size of the relative feature updates loses its dependence on the peak LR.
0
0
5