
Simo Ryu
@cloneofsimo
Followers: 14K
Following: 2K
Media: 1K
Statuses: 4K
MuonClip. So many tricks to keep the maximum logits bounded during training. Makes me wonder why people don't try LASER (and maybe z-loss?).
Very interesting: standard attention causes vanishing gradients because most attention probabilities become very small after some training. LASER tackles this by moving the attention operation into exponential space, i.e., exp_output = softmax(QK^T) exp(V). They don't seem to exaggerate the performance.
7
21
303
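A minimal sketch of the relation quoted above, exp_output = softmax(QK^T) exp(V), so output = log(softmax(QK^T) exp(V)). This is an illustration and not the LASER authors' code: the function names, the 1/sqrt(d) score scaling, and the per-column max subtraction for numerical stability are all assumptions added here.

```python
import math

def softmax(row):
    # Numerically stable softmax over one list of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def laser_attention(Q, K, V):
    # Hypothetical helper: attention applied in exponential space,
    #   output = log(softmax(Q K^T / sqrt(d)) @ exp(V)).
    # The sqrt(d) scaling is an assumption, not stated in the post.
    d = len(Q[0])
    n_keys = len(K)
    d_v = len(V[0])
    # Subtract each value column's max before exp() and add it back after
    # log(); this leaves the result unchanged but avoids overflow.
    v_max = [max(V[j][c] for j in range(n_keys)) for c in range(d_v)]
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        probs = softmax(scores)
        row = []
        for c in range(d_v):
            acc = sum(p * math.exp(V[j][c] - v_max[c]) for j, p in enumerate(probs))
            row.append(math.log(acc) + v_max[c])
        out.append(row)
    return out

if __name__ == "__main__":
    Q = [[1.0, 0.0], [0.0, 1.0]]
    K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    V = [[1.0, 2.0], [3.0, -1.0], [0.5, 0.0]]
    print(laser_attention(Q, K, V))
```

Each output entry is a softmax-weighted log-sum-exp of one value column, so it lies between the ordinary attention average and that column's maximum.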
Huge respect to Kimi for naming this optimizer after Muon instead of rebranding it under a completely different name, the bullshit all the other companies / academics do.
🚀 Hello, Kimi K2! Open-Source Agentic Model! 🔹 1T total / 32B active MoE model. 🔹 SOTA on SWE Bench Verified, Tau2 & AceBench among open models. 🔹 Strong in coding and agentic tasks. 🐤 Multimodal & thought-mode not supported for now. With Kimi K2, advanced agentic intelligence
9
40
760
Unironically, the best thing I've ever done in my life was poaching @hanchchch. He's barely on Twitter, but if there is a 10x engineer, he is the one. @FAL just keeps winning.
3
0
49
This is incredible! muP allows you to predict transfer across width beyond typical noise level.
While scaling laws typically predict the final loss, we show in our ICML oral paper that good scaling rules enable accurate predictions of entire loss curves of larger models from smaller ones! w/ @Locchiu, @andrewgwils, J. Pennington, A. Agarwala: 1/10
2
0
45
RT @ezyang: Want to play around with code that uses PyTorch DTensors but annoyed at having to go multiprocess? Make a fake process group! T….
0
26
0
The very fact that this post went viral is a hint that the bro's X-Bible is very powerful indeed.
I broke down Twitter's entire algorithm repo w/ Claude Code. And built "The X Bible" tool that scrapes ur profile+feed and identifies profile misalignment and how to go viral in ur niche. comment "X Bible" and I'll give u access to the tool. only sharing with first 200 for rn
1
0
14