Kwangjun Ahn Profile
Kwangjun Ahn

@KwangjunA

Followers: 698
Following: 92
Media: 12
Statuses: 72

Senior Researcher at Microsoft Research // PhD from MIT EECS

Cambridge, MA
Joined February 2020
@SeunghyunSEO7
Seunghyun Seo
2 days
scaled up to 12.7B dense, 5.5T tokens.
- polynorm (optimized kernel)
- grouped diff attn (their work)
- parallel muonclip (adopts alltoall like mainhorse, essential, dion)
- 80M batch
it's still non-reasoning, also not moe though... keep pushing guys! https://t.co/XheiAS8TBO
@SeunghyunSEO7
Seunghyun Seo
3 months
they also dropped an fsdp2-optimized muon. though they don't use muon for the 2.6b dense model, i think it's just the beginning and they're preparing a larger one. they pipeline muon's communication with its compute flops and the code is neat. not sure if it's an existing method. https://t.co/SS7HPQvXjS
3
20
122
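For readers curious what "pipelining Muon's communication with its compute" can look like, here is a minimal, hypothetical sketch of one common pattern (round-robin ownership plus async broadcasts). It is not the code from the linked repository; the Newton-Schulz coefficients are the ones circulating in public Muon implementations, and all names are illustrative.

```python
import torch
import torch.distributed as dist

@torch.no_grad()
def newton_schulz(G, steps=5):
    """Quintic Newton-Schulz iteration that approximately orthogonalizes G
    (coefficients as used in public Muon implementations)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    if G.size(0) > G.size(1):                  # iterate on the wide orientation
        return newton_schulz(G.T, steps).T
    X = G / (G.norm() + 1e-7)
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X

@torch.no_grad()
def sharded_orthonormal_step(momenta):
    """Round-robin comm-compute overlap: rank r orthogonalizes every
    world_size-th momentum matrix and broadcasts the result with
    async_op=True, so the broadcast of matrix i can proceed while this
    rank runs Newton-Schulz on its next owned matrix."""
    world, rank = dist.get_world_size(), dist.get_rank()
    handles = []
    for i, m in enumerate(momenta):
        owner = i % world
        if owner == rank:                      # this rank owns matrix i
            m.copy_(newton_schulz(m))
        handles.append(dist.broadcast(m, src=owner, async_op=True))
    for h in handles:                          # drain outstanding comms
        h.wait()
```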
@KwangjunA
Kwangjun Ahn
2 months
New improvement in Dion leads to a speedup that makes orthonormal updates (e.g., Muon) more scalable for larger matrices. The trick: carefully using Newton-Schulz (on smaller matrices) as Dion's backend. Updates to our microsoft/dion codebase are coming soon. Stay tuned!
1
4
26
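One plausible reading of "Newton-Schulz on smaller matrices" (a guess, not the actual microsoft/dion change): for a tall m x r factor, the orthonormalization P (P^T P)^{-1/2} can be computed by running a coupled Newton-Schulz inverse-square-root iteration on the small r x r Gram matrix, so the cubic-cost work scales with r rather than m.

```python
import torch

def invsqrt_newton_schulz(A, steps=10, eps=1e-7):
    """Coupled Newton-Schulz iteration for the inverse square root of an
    SPD r x r matrix: after normalizing by c = ||A||_F, Y_k -> (A/c)^{1/2}
    and Z_k -> (A/c)^{-1/2}, so A^{-1/2} ~= Z_k / sqrt(c)."""
    r = A.size(0)
    I = torch.eye(r, dtype=A.dtype, device=A.device)
    c = A.norm() + eps
    Y, Z = A / c, I.clone()
    for _ in range(steps):
        T = 0.5 * (3.0 * I - Z @ Y)
        Y, Z = Y @ T, T @ Z
    return Z / c.sqrt()

def orthonormalize_tall(P, steps=10):
    """Orthonormalize the columns of a tall m x r matrix as P (P^T P)^{-1/2};
    all the cubic-cost work happens on the small r x r Gram matrix."""
    G = P.T @ P
    return P @ invsqrt_newton_schulz(G, steps)
```

Under this reading the per-factor cost is O(m r^2 + r^3) rather than cubic in the full matrix dimension, which would explain why the trick helps most on large matrices with r much smaller than m.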
@MSFTResearch
Microsoft Research
2 months
Join us on Sept 24 at 8 AM PT for Microsoft Research Forum Season 2 – a virtual series highlighting purposeful research and its real-world impact, from fundamental exploration to advancing AI responsibly, scaling innovation through products and open source, and driving positive...
1
3
25
@karpathy
Andrej Karpathy
3 months
@jxbz love the repo! clean code, good practices but still not over-engineered, triton kernels, well documented, simple reference implementations alongside optimized code. nice
2
5
209
@eliebakouch
elie
3 months
Lot, lot of alpha here
@jxbz
Jeremy Bernstein
3 months
I had wondered why there was no official Dion implementation by the authors... I guess now we know. This repository looks dope: FSDP Muon and Dion implementations, triton kernels for Newton-Schulz, and lots of practical advice (1/2)
3
11
171
@jxbz
Jeremy Bernstein
3 months
Looks like extremely exciting and useful work by @KwangjunA, Byron Xu, Natalie Abreu, @JohnCLangford and @GagMagakyan https://t.co/8WzCbdljDS (2/2)
github.com
Dion optimizer algorithm. Contribute to microsoft/dion development by creating an account on GitHub.
4
16
145
@jxbz
Jeremy Bernstein
3 months
I had wondered why there was no official Dion implementation by the authors... I guess now we know. This repository looks dope: FSDP Muon and Dion implementations, triton kernels for Newton-Schulz, and lots of practical advice (1/2)
7
19
341
@LakerNewhouse
Laker Newhouse
4 months
[1/6] Curious about Muon, but not sure where to start? I wrote a 3-part blog series called “Understanding Muon” designed to get you up to speed—with The Matrix references, annotated source code, and thoughts on where Muon might be going.
7
43
340
@JohnCLangford
John Langford
4 months
Apparently Dion is now being worked on for Torch Titan: https://t.co/QeuRFDTyan :-)
github.com
This PR: 1 - Implements the new DION optimizer based on the paper: "Dion: Distributed Orthonormalized Updates" by Ahn et al. https://arxiv.org/abs/2504.05295 DION follows Muon reg...
@MParakhin
Mikhail Parakhin
4 months
Since nobody asked :-), here is my list of papers not to be missed from ICML: 1) Dion: distributed orthonormalized updates (well, technically not at ICML, but everyone's talking about it). 2) MARS: Unleashing the Power of Variance Reduction for Training Large Models 3) ...
0
8
104
@MParakhin
Mikhail Parakhin
4 months
Since nobody asked :-), here is my list of papers not to be missed from ICML: 1) Dion: distributed orthonormalized updates (well, technically not at ICML, but everyone's talking about it). 2) MARS: Unleashing the Power of Variance Reduction for Training Large Models 3) ...
6
32
432
@seungwookh
Seungwook Han
4 months
But actually this is the OG way of doing it; you should stop by E-2103 to see @jxbz and Laker Newhouse whiteboard the whole paper.
@jxbz
Jeremy Bernstein
4 months
Laker and I are presenting this work in an hour at ICML poster E-2103. It’s on a theoretical framework and language (modula) for optimizers that are fast (like Shampoo) and scalable (like muP). You can think of modula as Muon extended to general layer types and network topologies
1
6
75
@konstmish
Konstantin Mishchenko
4 months
Schedule-Free methods, which forgo cosine/linear schedulers by averaging iterates and computing gradients at interpolated points, yield smoother training curves. It's still unclear why they work well, and this paper explains the phenomenon through the river-valley loss landscape.
4
20
144
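For reference, the Schedule-Free update the tweet summarizes keeps a raw iterate z and an equal-weight average x, and evaluates gradients at an interpolation of the two. A minimal single-tensor sketch (the learning rate, beta, and averaging weights are illustrative defaults):

```python
import torch

def schedule_free_sgd(loss_fn, params_init, lr=0.1, beta=0.9, steps=1000):
    """Schedule-Free SGD sketch: no decay schedule. Keeps two sequences,
    z (the raw SGD iterate) and x (a running equal-weight average of z),
    and computes gradients at the interpolation y = (1-beta)*z + beta*x."""
    z = params_init.detach().clone()
    x = params_init.detach().clone()
    for t in range(1, steps + 1):
        y = ((1 - beta) * z + beta * x).detach().requires_grad_(True)
        loss_fn(y).backward()                  # gradient at y, not at z or x
        with torch.no_grad():
            z -= lr * y.grad                   # plain SGD step on z
            x += (z - x) / t                   # running mean: x_t = mean(z_1..z_t)
    return x                                   # the averaged iterate is what you evaluate
```

At evaluation time you use the averaged iterate x (e.g. `schedule_free_sgd(lambda w: ((w - 3) ** 2).sum(), torch.zeros(10))`), which is what produces the smooth training curves the tweet mentions.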
@GagMagakyan
Gagik Magakyan
4 months
If you are at ICML 2025, come check out our oral presentation about the non-convex theory of Schedule-Free SGD in the Optimization session tomorrow! This work was done with amazing collaborators @KwangjunA and @AshokCutkosky.
1
1
3
@KwangjunA
Kwangjun Ahn
4 months
ICML: come check out our Oral Presentation on Schedule-Free training theory, based on elegant online learning!
0
7
49
@YouJiacheng
You Jiacheng
5 months
@kvfrans @depen_morwani @KwangjunA @vyasnikhil96 Oh I found them: linear warmup and then constant
1
1
11
@KwangjunA
Kwangjun Ahn
7 months
ICLR: @edward_s_hu and I will be presenting our work "The Belief State Transformer" at the 1st poster session. (#269) Please come check it out! (github: https://t.co/6t0sldrZDX)
github.com
Contribute to microsoft/BST development by creating an account on GitHub.
0
0
15
@JohnCLangford
John Langford
7 months
The Belief State Transformer https://t.co/1xuRIU0PYT is at ICLR this week. The BST objective efficiently creates compact belief states: summaries of the past sufficient for all future predictions. See the short talk: https://t.co/nkYp7KxMZc and @mgostIH for further discussion.
microsoft.com
John Langford talks about a new transformer architecture that generates compact belief states for goal-conditioned planning, enhancing planning algorithms' efficiency and effectiveness.
5
19
105
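A rough sketch of the BST objective as described above: a forward encoder reads the prefix, a backward encoder reads the reversed suffix, and from the concatenated pair the model predicts both the next token after the prefix and the previous token before the suffix. Module sizes and layer choices here are illustrative and causal masking is omitted for brevity; see the linked repo for the real implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BSTSketch(nn.Module):
    """Illustrative Belief State Transformer objective (not the official code)."""
    def __init__(self, vocab, d=256, nhead=4, nlayers=2):
        super().__init__()
        self.emb = nn.Embedding(vocab, d)
        make_enc = lambda: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, nhead, batch_first=True), nlayers)
        self.fwd, self.bwd = make_enc(), make_enc()
        self.next_head = nn.Linear(2 * d, vocab)  # predicts the token after the prefix
        self.prev_head = nn.Linear(2 * d, vocab)  # predicts the token before the suffix

    def forward(self, x, t, s):
        # x: (B, T) token ids; prefix = x[:, :t], suffix = x[:, s:], with t < s
        f = self.fwd(self.emb(x[:, :t]))[:, -1]           # forward belief state of the past
        b = self.bwd(self.emb(x[:, s:].flip(1)))[:, -1]   # backward belief state of the future
        h = torch.cat([f, b], dim=-1)
        return (F.cross_entropy(self.next_head(h), x[:, t]) +
                F.cross_entropy(self.prev_head(h), x[:, s - 1]))
```

The pairing of forward and backward states is what forces the forward state to be a compact summary of the past sufficient for all future predictions, as the tweet describes.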
@JohnCLangford
John Langford
1 year
New reqs for low to high level researcher positions: https://t.co/gba2GrmdSY , https://t.co/gMXUc8jgqH, https://t.co/XFnLBfTgwk, https://t.co/yXTOoiWoMK, with postdocs from Akshay and @MiroDudik https://t.co/4xbdiiZn6b . Please apply or pass to those who may :-)
0
33
108
@JohnCLangford
John Langford
1 year
Last year, we had offers accepted from @KwangjunA, @riashatislam, @Tea_Pearce , @pratyusha_PS while Akshay and @MiroDudik hired 7(!) postdocs.
@JohnCLangford
John Langford
1 year
New reqs for low to high level researcher positions: https://t.co/gba2GrmdSY , https://t.co/gMXUc8jgqH, https://t.co/XFnLBfTgwk, https://t.co/yXTOoiWoMK, with postdocs from Akshay and @MiroDudik https://t.co/4xbdiiZn6b . Please apply or pass to those who may :-)
0
2
12