Kwangjun Ahn
@KwangjunA
Followers: 698 · Following: 92 · Media: 12 · Statuses: 72
Senior Researcher at Microsoft Research // PhD from MIT EECS
Cambridge, MA
Joined February 2020
scaled up to 12.7B dense, 5.5T tokens.
- polynorm (optimized kernel)
- grouped diff attn (their work)
- parallel muonclip (adopts alltoall like mainhorse, essential, dion)
- 80M batch
it's still non-reasoning, also not moe though... keep pushing guys! https://t.co/XheiAS8TBO
they also dropped an fsdp2-optimized muon. though they don't use muon for the 2.6b dense model, i think it's just the beginning and they are preparing a larger one. they pipeline muon's communication with its compute flops and the code is neat. not sure if it's an existing method. https://t.co/SS7HPQvXjS
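A minimal sketch of the comm-compute pipelining idea mentioned above, assuming PyTorch and torch.distributed; the function name, the row-sharding scheme, and the orthogonalize callback are illustrative stand-ins, not the actual FSDP2 Muon code:

```python
import torch
import torch.distributed as dist

def pipelined_orthogonalize(shards, orthogonalize):
    """Overlap communication with compute: launch the all-gather for every
    row-shard up front (async), then run the orthogonalization (e.g. Newton-
    Schulz) on each full matrix while the remaining gathers are still in flight.
    This is only a sketch of the overlap pattern, not the real pipeline."""
    world = dist.get_world_size()
    handles, gathered = [], []
    for shard in shards:
        full = shard.new_empty(world * shard.shape[0], shard.shape[1])
        handles.append(dist.all_gather_into_tensor(full, shard, async_op=True))
        gathered.append(full)
    updates = []
    for work, full in zip(handles, gathered):
        work.wait()                           # only blocks on this shard's gather
        updates.append(orthogonalize(full))   # compute overlaps later gathers
    return updates
```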
New improvement in Dion leads to a speedup that makes orthonormal updates (e.g., Muon) more scalable for larger matrices. The trick: carefully using Newton-Schulz (on smaller matrices) as Dion's backend. Updates to our microsoft/dion codebase are coming soon; stay tuned!
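For a sense of what that backend does: a minimal sketch of the quintic Newton-Schulz iteration commonly used to orthonormalize Muon-style updates, assuming PyTorch; the coefficients are the widely used Muon ones and the function name is illustrative, so treat this as a sketch rather than the exact Dion backend:

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately replace G's singular values with 1 (i.e. map G to a nearby
    semi-orthogonal matrix) using a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315         # common Muon coefficients (illustrative)
    X = G.float()
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                               # work with the wide orientation
    X = X / (X.norm() + 1e-7)                 # scale so singular values are <= 1
    for _ in range(steps):
        A = X @ X.T                           # the smaller Gram matrix
        X = a * X + (b * A + c * (A @ A)) @ X
    return (X.T if transposed else X).to(G.dtype)
```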
Join us on Sept 24 at 8 AM PT for Microsoft Research Forum Season 2 – a virtual series highlighting purposeful research and its real-world impact, from fundamental exploration to advancing AI responsibly, scaling innovation through products and open source, and driving positive …
Looks like extremely exciting and useful work by @KwangjunA, Byron Xu, Natalie Abreu, @JohnCLangford and @GagMagakyan
https://t.co/8WzCbdljDS (2/2)
github.com · microsoft/dion: Dion optimizer algorithm
I had wondered why there was no official Dion implementation by the authors... I guess now we know. This repository looks dope: FSDP Muon and Dion implementations, Triton kernels for Newton-Schulz, and lots of practical advice (1/2)
[1/6] Curious about Muon, but not sure where to start? I wrote a 3-part blog series called “Understanding Muon” designed to get you up to speed—with The Matrix references, annotated source code, and thoughts on where Muon might be going.
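To give a flavor of what the series walks through: a minimal sketch of one Muon-style update on a 2-D weight, assuming PyTorch and the newton_schulz_orthogonalize sketch earlier in this feed; the hyperparameters and the shape-dependent scaling are illustrative choices, not the canonical implementation:

```python
import torch

def muon_style_step(param, grad, momentum_buf, lr=0.02, beta=0.95):
    # Heavy-ball momentum (some variants blend in a Nesterov-style correction).
    momentum_buf.mul_(beta).add_(grad)
    # Orthonormalize the update direction (see the Newton-Schulz sketch above).
    update = newton_schulz_orthogonalize(momentum_buf)
    # One common shape-dependent scaling; the exact choice varies across implementations.
    scale = max(1.0, param.shape[0] / param.shape[1]) ** 0.5
    param.add_(update, alpha=-lr * scale)
    return param, momentum_buf
```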
Apparently Dion is now being worked on for Torch Titan: https://t.co/QeuRFDTyan :-)
github.com
This PR: 1 - Implements the new DION optimizer based on the paper: "Dion: Distributed Orthonormalized Updates" by Ahn et al. https://arxiv.org/abs/2504.05295 DION follows Muon reg...
Since nobody asked :-), here is my list of papers not to be missed from ICML: 1) Dion: distributed orthonormalized updates (well, technically not at ICML, but everyone's talking about it). 2) MARS: Unleashing the Power of Variance Reduction for Training Large Models 3) ...
But actually this is the OG way of doing it, and you should stop by E-2103 to see @jxbz and Laker Newhouse whiteboard the whole paper.
Laker and I are presenting this work in an hour at ICML poster E-2103. It’s on a theoretical framework and language (modula) for optimizers that are fast (like Shampoo) and scalable (like muP). You can think of modula as Muon extended to general layer types and network topologies.
Schedule-Free methods, which forgo cosine/linear schedulers by averaging iterates and computing gradients at interpolated points, yield smoother training curves. It's still unclear why they work well, and this paper explains the phenomenon through the river-valley loss landscape.
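A minimal sketch of that mechanism, assuming PyTorch tensors; the x/y/z naming follows the Schedule-Free papers, but the loop and hyperparameters are illustrative rather than the reference implementation:

```python
import torch

def schedule_free_sgd(grad_fn, w0: torch.Tensor, lr=0.1, beta=0.9, steps=1000):
    # z: base SGD iterate; x: running average of the z's (what you evaluate/deploy).
    z, x = w0.clone(), w0.clone()
    for t in range(1, steps + 1):
        y = (1 - beta) * z + beta * x   # gradient is computed at the interpolated point y
        g = grad_fn(y)
        z = z - lr * g                  # plain SGD step, no learning-rate schedule
        x = x + (z - x) / t             # uniform iterate averaging replaces the schedule
    return x
```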
If you are at ICML 2025, come check out our oral presentation about the non-convex theory of Schedule Free SGD in the Optimization session tomorrow! This work was done with amazing collaborators @KwangjunA and @AshokCutkosky.
ICML: come check out our Oral Presentation on Schedule-Free training theory, based on elegant online learning!
ICLR: @edward_s_hu and I will be presenting our work "The Belief State Transformer" at the 1st poster session. (#269) Please come check it out! (github: https://t.co/6t0sldrZDX)
github.com · microsoft/BST
The Belief State Transformer https://t.co/1xuRIU0PYT is at ICLR this week. The BST objective efficiently creates compact belief states: summaries of the past sufficient for all future predictions. See the short talk: https://t.co/nkYp7KxMZc and @mgostIH for further discussion.
microsoft.com
John Langford talks about a new transformer architecture that generates compact belief states for goal-conditioned planning, enhancing planning algorithms' efficiency and effectiveness.
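A minimal sketch of that objective, assuming PyTorch; the module and head names here are illustrative, not the microsoft/BST API, but the idea is to combine a forward encoding of a prefix with a backward encoding of a suffix and predict both the token just after the prefix and the token just before the suffix:

```python
import torch
import torch.nn as nn

class BSTObjectiveSketch(nn.Module):
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.next_head = nn.Linear(2 * d_model, vocab_size)  # token right after the prefix
        self.prev_head = nn.Linear(2 * d_model, vocab_size)  # token right before the suffix
        self.loss = nn.CrossEntropyLoss()

    def forward(self, prefix_state, suffix_state, next_token, prev_token):
        # prefix_state: (B, d) forward summary of tokens[:i]
        # suffix_state: (B, d) backward summary of tokens[j:]
        belief = torch.cat([prefix_state, suffix_state], dim=-1)  # the "belief state"
        return (self.loss(self.next_head(belief), next_token) +
                self.loss(self.prev_head(belief), prev_token))
```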
New reqs for low- to high-level researcher positions: https://t.co/gba2GrmdSY , https://t.co/gMXUc8jgqH , https://t.co/XFnLBfTgwk , https://t.co/yXTOoiWoMK , with postdocs from Akshay and @MiroDudik: https://t.co/4xbdiiZn6b . Please apply or pass along to those who might be interested :-)
Last year, we had offers accepted from @KwangjunA, @riashatislam, @Tea_Pearce , @pratyusha_PS while Akshay and @MiroDudik hired 7(!) postdocs.