Kwangjun Ahn
@KwangjunA
Followers: 698 · Following: 92 · Media: 12 · Statuses: 72
Senior Researcher at Microsoft Research // PhD from MIT EECS
Cambridge, MA
Joined February 2020
scaled up to 12.7B dense, 5.5T tokens.
- polynorm (optimized kernel)
- grouped diff attn (their work)
- parallel muonclip (adopts alltoall like mainhorse, essential, dion)
- 80M batch
it's still non-reasoning, also not moe though... keep pushing guys! https://t.co/XheiAS8TBO
they also dropped an fsdp2-optimized muon. though they don't use muon for the 2.6b dense model, i think it's just the beginning and they are preparing a larger one. they pipeline muon's communication with its compute flops and the code is neat. not sure if it's an existing method. https://t.co/SS7HPQvXjS
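A minimal sketch of the comm-compute pipelining idea mentioned above, assuming PyTorch and torch.distributed; the function name, the row-sharding scheme, and the orthogonalize callback are illustrative stand-ins, not the actual FSDP2 Muon code:

```python
import torch
import torch.distributed as dist

def pipelined_orthogonalize(shards, orthogonalize):
    """Overlap communication with compute: launch the all-gather for every
    row-shard up front (async), then run the orthogonalization (e.g. Newton-
    Schulz) on each full matrix while the remaining gathers are still in flight.
    This is only a sketch of the overlap pattern, not the real pipeline."""
    world = dist.get_world_size()
    handles, gathered = [], []
    for shard in shards:
        full = shard.new_empty(world * shard.shape[0], shard.shape[1])
        handles.append(dist.all_gather_into_tensor(full, shard, async_op=True))
        gathered.append(full)
    updates = []
    for work, full in zip(handles, gathered):
        work.wait()                           # only blocks on this shard's gather
        updates.append(orthogonalize(full))   # compute overlaps later gathers
    return updates
```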
New improvement in Dion leads to a speedup that makes orthonormal updates (e.g., Muon) more scalable for larger matrices. The trick: carefully using Newton-Schulz (on smaller matrices) as Dion's backend. Updates to our microsoft/dion codebase are coming soon; stay tuned!
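For a sense of what that backend does: a minimal sketch of the quintic Newton-Schulz iteration commonly used to orthonormalize Muon-style updates, assuming PyTorch; the coefficients are the widely used Muon ones and the function name is illustrative, so treat this as a sketch rather than the exact Dion backend:

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately replace G's singular values with 1 (i.e. map G to a nearby
    semi-orthogonal matrix) using a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315         # common Muon coefficients (illustrative)
    X = G.float()
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                               # work with the wide orientation
    X = X / (X.norm() + 1e-7)                 # scale so singular values are <= 1
    for _ in range(steps):
        A = X @ X.T                           # the smaller Gram matrix
        X = a * X + (b * A + c * (A @ A)) @ X
    return (X.T if transposed else X).to(G.dtype)
```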
Join us on Sept 24 at 8 AM PT for Microsoft Research Forum Season 2 – a virtual series highlighting purposeful research and its real-world impact, from fundamental exploration to advancing AI responsibly, scaling innovation through products and open source, and driving positive …
Looks like extremely exciting and useful work by @KwangjunA, Byron Xu, Natalie Abreu, @JohnCLangford and @GagMagakyan
https://t.co/8WzCbdljDS (2/2)
github.com · microsoft/dion: Dion optimizer algorithm
I had wondered why there was no official Dion implementation by the authors... I guess now we know. This repository looks dope: FSDP Muon and Dion implementations, Triton kernels for Newton-Schulz, and lots of practical advice (1/2)
[1/6] Curious about Muon, but not sure where to start? I wrote a 3-part blog series called “Understanding Muon” designed to get you up to speed—with The Matrix references, annotated source code, and thoughts on where Muon might be going.
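To give a flavor of what the series walks through: a minimal sketch of one Muon-style update on a 2-D weight, assuming PyTorch and the newton_schulz_orthogonalize sketch earlier in this feed; the hyperparameters and the shape-dependent scaling are illustrative choices, not the canonical implementation:

```python
import torch

def muon_style_step(param, grad, momentum_buf, lr=0.02, beta=0.95):
    # Heavy-ball momentum (some variants blend in a Nesterov-style correction).
    momentum_buf.mul_(beta).add_(grad)
    # Orthonormalize the update direction (see the Newton-Schulz sketch above).
    update = newton_schulz_orthogonalize(momentum_buf)
    # One common shape-dependent scaling; the exact choice varies across implementations.
    scale = max(1.0, param.shape[0] / param.shape[1]) ** 0.5
    param.add_(update, alpha=-lr * scale)
    return param, momentum_buf
```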
Apparently Dion is now being worked on for Torch Titan: https://t.co/QeuRFDTyan :-)
github.com
This PR: 1 - Implements the new DION optimizer based on the paper: "Dion: Distributed Orthonormalized Updates" by Ahn et al. https://arxiv.org/abs/2504.05295 DION follows Muon reg...
Since nobody asked :-), here is my list of papers not to be missed from ICML: 1) Dion: distributed orthonormalized updates (well, technically not at ICML, but everyone's talking about it). 2) MARS: Unleashing the Power of Variance Reduction for Training Large Models 3) ...
But actually this is the OG way of doing it, and you should stop by E-2103 to see @jxbz and Laker Newhouse whiteboard the whole paper.
Laker and I are presenting this work in an hour at ICML poster E-2103. It’s on a theoretical framework and language (modula) for optimizers that are fast (like Shampoo) and scalable (like muP). You can think of modula as Muon extended to general layer types and network topologies.
Schedule-Free methods, which forgo cosine/linear schedulers by averaging iterates and computing gradients at interpolated points, yield smoother training curves. It's still unclear why they work well, and this paper explains the phenomenon through the river-valley loss landscape.
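A minimal sketch of that mechanism, assuming PyTorch tensors; the x/y/z naming follows the Schedule-Free papers, but the loop and hyperparameters are illustrative rather than the reference implementation:

```python
import torch

def schedule_free_sgd(grad_fn, w0: torch.Tensor, lr=0.1, beta=0.9, steps=1000):
    # z: base SGD iterate; x: running average of the z's (what you evaluate/deploy).
    z, x = w0.clone(), w0.clone()
    for t in range(1, steps + 1):
        y = (1 - beta) * z + beta * x   # gradient is computed at the interpolated point y
        g = grad_fn(y)
        z = z - lr * g                  # plain SGD step, no learning-rate schedule
        x = x + (z - x) / t             # uniform iterate averaging replaces the schedule
    return x
```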
If you are at ICML 2025, come check out our oral presentation about the non-convex theory of Schedule Free SGD in the Optimization session tomorrow! This work was done with amazing collaborators @KwangjunA and @AshokCutkosky.
ICML: come check out our Oral Presentation on Schedule-Free training theory, based on elegant online learning!
ICLR: @edward_s_hu and I will be presenting our work "The Belief State Transformer" at the 1st poster session. (#269) Please come check it out! (github: https://t.co/6t0sldrZDX)
github.com · microsoft/BST
The Belief State Transformer https://t.co/1xuRIU0PYT is at ICLR this week. The BST objective efficiently creates compact belief states: summaries of the past sufficient for all future predictions. See the short talk: https://t.co/nkYp7KxMZc and @mgostIH for further discussion.
microsoft.com
John Langford talks about a new transformer architecture that generates compact belief states for goal-conditioned planning, enhancing planning algorithms' efficiency and effectiveness.
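A minimal sketch of that objective, assuming PyTorch; the module and head names here are illustrative, not the microsoft/BST API, but the idea is to combine a forward encoding of a prefix with a backward encoding of a suffix and predict both the token just after the prefix and the token just before the suffix:

```python
import torch
import torch.nn as nn

class BSTObjectiveSketch(nn.Module):
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.next_head = nn.Linear(2 * d_model, vocab_size)  # token right after the prefix
        self.prev_head = nn.Linear(2 * d_model, vocab_size)  # token right before the suffix
        self.loss = nn.CrossEntropyLoss()

    def forward(self, prefix_state, suffix_state, next_token, prev_token):
        # prefix_state: (B, d) forward summary of tokens[:i]
        # suffix_state: (B, d) backward summary of tokens[j:]
        belief = torch.cat([prefix_state, suffix_state], dim=-1)  # the "belief state"
        return (self.loss(self.next_head(belief), next_token) +
                self.loss(self.prev_head(belief), prev_token))
```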
New reqs for low- to high-level researcher positions: https://t.co/gba2GrmdSY , https://t.co/gMXUc8jgqH , https://t.co/XFnLBfTgwk , https://t.co/yXTOoiWoMK , with postdocs from Akshay and @MiroDudik: https://t.co/4xbdiiZn6b . Please apply or pass along to those who might be interested :-)
Last year, we had offers accepted from @KwangjunA, @riashatislam, @Tea_Pearce , @pratyusha_PS while Akshay and @MiroDudik hired 7(!) postdocs.