
Keller Jordan (@kellerjordan0)
Followers: 9K · Following: 3K · Media: 162 · Statuses: 1K
Unfortunately, it is hard to trust *claims* in 2025. What’s easier to trust is *incentives*. So here’s an incentive: I’ll pay a $3,000 bounty to the first person who uses this method to improve either the NanoGPT or CIFAR-10 speedruns.
1/9 🚨 New Paper Alert: Cross-Entropy Loss is NOT What You Need! 🚨 We introduce harmonic loss as an alternative to the standard CE loss for training neural networks and LLMs! Harmonic loss achieves 🛠️ significantly better interpretability, ⚡ faster convergence, and ⏳ less grokking!
27
66
1K
New training speed record for @karpathy’s 124M-parameter NanoGPT setup: 3.28 Fineweb validation loss in 3.7B training tokens. Previous record: 5B tokens. Changelog: new optimizer. 1/8
27
81
1K
Here's a variant of @karpathy's NanoGPT which trains twice as fast, reaching GPT-2 level quality in 5B tokens instead of the original 10B. It uses rotary embeddings and an improved lr schedule.
14
75
1K
Btw, I joined the OpenAI OpCo, LLC. Excited to do some science and contribute to some big training runs. Yep, and shout-out to my amazing mentors @adamlerer and @mobav0!
62
12
634
New NanoGPT training speed record: 3.28 Fineweb validation loss in 15.2 minutes. Previous record: 22.3 minutes.
Changelog:
- pad embedding to nearest 64
- switch from GELU to ReLU²
- zero-init projection layers
- QKNorm
All four changes driven by @Grad62304977. 1/8
30
48
552
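Two of the changelog items above are easy to show in isolation. A minimal sketch, assuming the usual formulations of ReLU² (the square of ReLU, used in place of GELU) and QK-norm (normalizing queries and keys before attention); the function names are illustrative, not the repo's:

```python
import torch
import torch.nn.functional as F

def relu_squared(x: torch.Tensor) -> torch.Tensor:
    # ReLU² activation: same sparsity pattern as ReLU, squared growth on the positive side.
    return F.relu(x).square()

def qk_norm(q: torch.Tensor, k: torch.Tensor, eps: float = 1e-6):
    # One common QK-norm formulation: RMS-normalize queries and keys along the
    # head dimension before computing attention scores, stabilizing logit scale.
    q = q * torch.rsqrt(q.pow(2).mean(-1, keepdim=True) + eps)
    k = k * torch.rsqrt(k.pow(2).mean(-1, keepdim=True) + eps)
    return q, k
```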
New training speed record for @karpathy's NanoGPT setup: 3.28 Fineweb val loss in 22.3 minutes. Previous record: 24.9 minutes.
Changelog:
- Removed learning rate warmup, since the optimizer (Muon) doesn't need it
- Rescaled Muon's weight updates to have unit variance per param
1/5
12
41
437
It's a new day, and here's a new NanoGPT speedrun record: 3.28 FineWeb val loss in 8.2 minutes on 8xH100. Previous record: 10.8 minutes.
Changelog:
- architectural shortcuts
- momentum warmup
- tanh logit capping
By @Grad62304977 and myself. 1/6
15
39
411
This is an exciting moment: The world's first report on successful large-scale training with a super-Adamic optimizer. Congratulations to the @Kimi_Moonshot team and to every Muon contributor: @Yuchenj_UW @bozavlado @YouJiacheng @leloykun L. Newhouse @jxbz.
🚀 Introducing our new tech report: Muon is Scalable for LLM Training. We found that the Muon optimizer can be scaled up using the following techniques:
• Adding weight decay
• Carefully adjusting the per-parameter update scale
✨ Highlights:
• ~2x computational efficiency vs AdamW
7
29
374
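Concretely, the two ingredients in the report above amount to a small change to the weight update. The sketch below is my paraphrase of the bullet points, not Moonshot's code: decoupled (AdamW-style) weight decay plus a per-matrix rescaling of the orthogonalized update; the 0.2·sqrt(max(fan_out, fan_in)) factor is an assumption about how the per-parameter scale is chosen.

```python
import torch

def scaled_muon_step(param: torch.Tensor, ortho_update: torch.Tensor,
                     lr: float = 0.02, weight_decay: float = 0.1):
    # `ortho_update` is assumed to be Muon's orthogonalized momentum for this weight matrix.
    n, m = param.shape
    scale = 0.2 * max(n, m) ** 0.5          # assumed per-parameter scale, to keep update RMS comparable across shapes
    param.data.mul_(1 - lr * weight_decay)  # decoupled weight decay
    param.data.add_(ortho_update, alpha=-lr * scale)
```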
Along with many others, I find the results of Git Re-Basin by @SamuelAinsworth, J. Hayase & @siddhss5 highly interesting. But I believe there is a crucial detail which deserves attention: The authors replace BatchNorm with LayerNorm in their ResNet and VGG implementations. 1/14
6
51
320
New NanoGPT training speed record: 3.28 FineWeb val loss in 5.03 minutes. Previous record: 7.2 minutes.Changelog: FlexAttention with large sequence length. This record is by @KoszarskyB
11
22
308
NanoGPT speedrunning update: Using the SOAP optimizer, @vyasnikhil96 has achieved a new sample efficiency record of 3.28 Fineweb validation loss in 3.25B training tokens. The previous record was 3.67B tokens by my proposed optimizer.
20
34
296
Here's a new result in NanoGPT speedrunning: Straightforwardly scaling up the speedrun yields a training that reaches GPT-2 (1.5B)'s level of performance in 7.3 hours on 8xH100. The previous record for this target was 24 8xH100-hours by @karpathy using llm.c. 1/10
6
24
257
New NanoGPT training speed record: 3.28 FineWeb val loss in 10.8 minutes on 8xH100. Previous record: 12.0 minutes.
Changelog:
- untie embed and head weights
- add RMSNorm after embed
- initialize head to zero
Driven by @Grad62304977
12
17
248
There's been community interest in having a larger NanoGPT category to speedrun. So here's a record to kick things off:
New NanoGPT-medium speedrun record: 2.92 FineWeb val loss in 29.3 8xH100-minutes. Prev record: 5.8 hours by @karpathy's llm.c-350M. Method: scaled speedrun
12
20
240
New NanoGPT training speed record: 3.28 FineWeb val loss in 7.23 minutes on 8xH100. Previous record: 7.8 minutes.
Changelog:
- Added U-net-like connectivity pattern
- Doubled learning rate
This record is by @brendanh0gan
9
11
238
New NanoGPT training speed record: 3.28 FineWeb val loss in 3.4 minutes on 8xH100. Previous record: 3.58 minutes.Change: Lowered logit softcap from 30 to 15. This record was discovered by @KoszarskyB, congratulations!
11
15
232
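For reference, the logit softcap tuned in the record above is the tanh-based logit capping listed in an earlier changelog. A minimal sketch, assuming the usual cap·tanh(logits/cap) form (the function name is mine, not the repo's):

```python
import torch

def softcap(logits: torch.Tensor, cap: float = 15.0) -> torch.Tensor:
    # Smoothly squashes logits into (-cap, cap); this record lowered cap from 30 to 15,
    # i.e. a tighter squash on the output logits.
    return cap * torch.tanh(logits / cap)
```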
World record #20 of the NanoGPT speedrun has broken the 3-minute barrier.
Sub 3-minute NanoGPT Speedrun Record. We're proud to share that we've just breached the 3 min mark! This means that with an ephemeral pod of 8xH100s that costs $8/hour, training a GPT-2-ish level model now only costs $0.40!
What's in the latest record? A 🧵
3
11
189
Nice, looks like X posts are now citable artifacts.
Over the past month, methods developed by myself and my collaborators were used to set new speed records for training LLMs up to 1.5B scale. I also want to help the science go faster, so now get ready for: ~The General Theory of Modular Duality~ (1/9)
5
4
178
Here, I'll prove it: "The reason the Sophia paper (from a Stanford lab / >100 citations) didn't lead to an optimization revolution is that their Adam baseline used a suboptimal learning rate".
The reason papers alone can't provide strong evidence is because if they contain a mistake (like an untuned hyperparameter), ***nothing happens***. Whereas, if a speedrun contains an untuned hyperparameter, we find out *automatically* in the next record.
6
3
179
This is officially the new record! Congrats @hi_tysam (who is also an OG of CIFAR-10 speedrunning).
New NanoGPT training speed record: 3.28 FineWeb val loss in 4.66 minutes. Previous record: 5.03 minutes.
Changelog:
- FlexAttention blocksize warmup
- hyperparameter tweaks
3
11
144
I've decided to name the optimizer described in this thread `Muon`, because it takes each update matrix produced by standard sgd-MomentUm and replaces it with the nearest Orthogonal matrix using a Newton-schulz iteration. 1/5.
New training speed record for @karpathy’s 124M-parameter NanoGPT setup: 3.28 Fineweb validation loss in 3.7B training tokens. Previous record: 5B tokens. Changelog: new optimizer. 1/8
1
3
141
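For readers who want the mechanics behind the thread above, here is a minimal sketch of the update rule, not the reference implementation. The quintic Newton-Schulz coefficients (3.4445, -4.7750, 2.0315), the iteration count, and the momentum constant are assumptions based on commonly circulated Muon code.

```python
import torch

def newtonschulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately replace a matrix with the nearest (semi-)orthogonal matrix."""
    assert G.ndim == 2
    a, b, c = 3.4445, -4.7750, 2.0315   # assumed quintic coefficients
    X = G / (G.norm() + 1e-7)           # normalize so the spectrum lies in a convergent range
    transposed = X.size(0) > X.size(1)
    if transposed:                       # iterate on the "wide" orientation for efficiency
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X.T if transposed else X

def muon_style_step(param, grad, momentum_buf, lr=0.02, beta=0.95):
    # Momentum SGD whose update matrix is replaced by its orthogonalized counterpart
    # (hypothetical variable names, sketch only).
    momentum_buf.mul_(beta).add_(grad)
    update = newtonschulz_orthogonalize(momentum_buf)
    param.data.add_(update, alpha=-lr)
```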
The reason papers alone can't provide strong evidence is because if they contain a mistake (like an untuned hyperparameter), ***nothing happens***. Whereas, if a speedrun contains an untuned hyperparameter, we find out *automatically* in the next record.
The reason I didn't write a proper arxiv paper for Muon is because I simply don't think there's any relationship between the ability to publish a paper with lots of good-looking results about a new optimizer, and whether that optimizer actually works. I only trust speedruns.
6
5
127
NanoGPT speedrunning update: @bozavlado discovered that the new optimizer performs ~3% better if we orthogonalize the QKV updates separately rather than together. I replicated this and found that it also holds for SOAP; it was used in yesterday’s record.
@kellerjordan0 Yes, but with your older code (with warmup and w/o scaling by number of elements). Also this could be seed dependent, etc. Take it with a big grain of salt.
8
9
116
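A sketch of what "separately rather than together" means in code, assuming the fused QKV weight stacks Q, K, V along the output dimension and using an orthogonalization helper like the Newton-Schulz sketch earlier (names are illustrative):

```python
import torch

def orthogonalize_qkv_separately(qkv_momentum: torch.Tensor, orthogonalize) -> torch.Tensor:
    # Fused QKV momentum of shape [3 * d_model, d_model]: orthogonalize each of the
    # Q, K, V blocks on its own instead of orthogonalizing the fused matrix as a whole.
    q, k, v = qkv_momentum.chunk(3, dim=0)
    return torch.cat([orthogonalize(q), orthogonalize(k), orthogonalize(v)], dim=0)
```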
Why don’t current model merging results generalize to standard ConvNets? And how can this be fixed? We answer these Qs and present a method that improves merged NN performance for any choice of norm layer. W/ @HanieSedghi @osaukh @rahiment @bneyshabur
2
23
112
@khandelia1000 My crazy answer is "all of it." There are many *techniques* which don't transfer, but in my experience it's always comprehensible and informative *why* they don't transfer, so you end up learning anyway.
2
0
108
My thought immediately after seeing the figure: the gap between the proposed method and Adam is too large, so the Adam run must have been improperly tuned.
What if you could use all the computing power in the world to train a shared, open source AI model? Preliminary report: Nous Research is proud to release a preliminary report on DisTrO (Distributed Training Over-the-Internet), a family of
3
2
106
@nearcyan The Minecraft redstone community has higher standards of academic rigor than ML research. Might be hard to convince them to switch to the lesser field.
0
3
91
Interesting paper. What confuses me is that the theory doesn’t seem to predict the need for a roughly constant-fraction decay duration. Why do we need to decay for ~1000 steps in a 10K step training, and ~10K steps in a 100K step training?
WSD learning rate is taking off—lower loss, no pre-set compute budget, & easier continual training. Yet, its loss curve is puzzling—high in stable phase but jumps in decay phase. Our paper explains it with a 'River Valley' structure of the loss! 🧵🧵
5
5
99
I shall be using GPUs very graciously provided by @Yuchenj_UW @hyperbolic_labs to search for optimal hyperparameters for both this optimizer and DistributedShampoo, for the purpose of NanoGPT speedrunning, on the recommendation of @_arohan_.
New training speed record for @karpathy’s 124M-parameter NanoGPT setup: 3.28 Fineweb validation loss in 3.7B training tokens. Previous record: 5B tokens. Changelog: new optimizer. 1/8
7
3
93
Happy to say this project has been accepted to ICLR. One of its central results: for stable binary classification trainings, it is possible to bound variance between runs *a priori* via the formula: var(error rate) <= (avg err rate) / (2 · num test examples).
Neural network trainings are nondeterministic. Repeated runs each produce a unique network, often with significantly _varying_ test-set performance. 🆕📜 I demonstrate that this variation has a simple statistical structure, and is harmless & inevitable.
4
8
90
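To make the bound above concrete, a quick back-of-the-envelope with made-up numbers (an assumed 5% average error rate and a 10,000-example test set):

```python
# Plugging made-up numbers into the bound var(error rate) <= (avg err rate) / (2 * num test examples)
avg_err = 0.05        # assumed 5% average test error
n_test = 10_000       # assumed test-set size
var_bound = avg_err / (2 * n_test)
print(f"var <= {var_bound:.2e}, std <= {var_bound ** 0.5:.4f}")  # var <= 2.50e-06, std <= 0.0016
```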
A tiny research contribution: here's an explanation for a fact about neural networks that was considered surprising in 2022. @yidingjiang et al. (2022) in their paper "Assessing Generalization via Disagreement" observe with surprise that the
6
6
86
I performed this experiment to obtain a response to the following two critiques:
- The methods only work for small models and won't scale. (Well, they scale to 1.5B at least.)
- The methods only help val loss and not downstream perf. (Nope, they do help.)
Here's a new result in NanoGPT speedrunning: Straightforwardly scaling up the speedrun yields a training that reaches GPT-2 (1.5B)'s level of performance in 7.3 hours on 8xH100. The previous record for this target was 24 8xH100-hours by @karpathy using llm.c. 1/10
4
2
83
I would like to thank & acknowledge @jxbz for sending me his recent paper, which is where I learned about the crucial Newton-Schulz iteration method. He also had the insight that my initial quintic coefficients could be improved. 7/8.
1
5
74
.@PrimeIntellect has donated a number of H100-hours to support the continuation of this research. Thank you @vincentweisser!
New training speed record for @karpathy’s 124M-parameter NanoGPT setup: 3.28 Fineweb validation loss in 3.7B training tokens. Previous record: 5B tokens. Changelog: new optimizer. 1/8
1
4
77
There are some simple ways that all optimizer research can go wrong: e.g., my AdamW baseline could be poorly tuned. So I hereby invite anyone to try to get a better AdamW baseline than I did in this setup; I'll happily boost/RT your result if you can.
New training speed record for @karpathy’s 124M-parameter NanoGPT setup: 3.28 Fineweb validation loss in 3.7B training tokens. Previous record: 5B tokens. Changelog: new optimizer. 1/8
4
2
68
Thanks @francoisfleuret for these great questions. The answer is that there's no single key idea: I'm using six different techniques which each contribute to the final speed. Each one has different generalization properties. The techniques are (1).
@kellerjordan0 Can you TL;DR? Is there a key idea or just piling up things? And does it provide general insight applicable elsewhere, or is this just performance, like a speedrun in a video game?
1
7
63
Great work. This is the new record. Congrats @leloykun and @YouJiacheng!
New NanoGPT training speed record: 3.28 FineWeb val loss in 3.95 minutes. Previous record: 4.41 minutes.
Changelog:
- @leloykun arch optimization: ~17s
- remove "dead" code: ~1.5s
- re-implement dataloader: ~2.5s
- re-implement Muon: ~1s
- manual block_mask creation: ~5s
0
8
64
.@BlinkDL_AI has entered the NanoGPT speedrunning game with a new sample-efficiency (but not wallclock) record, based on RWKV-7 and Muon. You love to see it.
RWKV-7: attention-free and surpassing modded-GPT. Training code & log: Larger head size can reach 3.26xx. My current implementation is slow 🤣 It might reach 85% of GPT speed @ ctx1k (or faster than GPT @ ctx4k) after optimization. Any help is welcome 🙏 #RWKV
2
3
61
Real. Congrats @YouJiacheng! The time needed to reach the performance of Karpathy’s NanoGPT/llm.c baseline has gone from 45 to <3.6 minutes on 8xH100. I wouldn’t have predicted it.
New NanoGPT training speed record: 3.28 FineWeb val loss in 3.582 minutes.
Changelog:
- Truncate RoPE: 1460 steps, 224.5s
- ValueEmbed [0, 1, 2, 3, 4, 5, 5, 4, 3, 2, 1, 0] → [0, 1, 2, None, …, None, 0, 1, 2]: 1470 steps, 222s
- Remove the 8th Attention: 1490 steps, 214.9s
0
4
57
Apropos this rather highly viewed post/method, I'd like to note that the Shampoo optimizer is:
- The only reason I ever even tried this. Shampoo gives a very nearly equivalent update w/o accumulation.
- Actually used for massive pretrainings. This is not.
New training speed record for @karpathy’s 124M-parameter NanoGPT setup: 3.28 Fineweb validation loss in 3.7B training tokens. Previous record: 5B tokens. Changelog: new optimizer. 1/8
5
2
55
Yesterday I posted a small result about GPT-2 training: Warming up the learning rate for X steps too long effectively delays training by X/2 steps. Here’s a new experiment which provides a bit more evidence for this "warmup/2 law". 🧵/5
A small result about GPT-2 training: Warming up for too long has a simple and predictable effect on the loss curve. 🧵
4
8
55
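One way to read the "warmup/2 law" above numerically: a linear warmup over X steps delivers roughly half the learning-rate area of X full-LR steps, so the run behaves as if it started about X/2 steps late. A toy sketch under that reading (my interpretation, not the experiment's code):

```python
def effective_full_lr_steps(total_steps: int, warmup_steps: int) -> float:
    # Under the warmup/2 law, warming up for `warmup_steps` costs about warmup_steps / 2
    # steps of effective training compared to starting at full learning rate.
    return total_steps - warmup_steps / 2

print(effective_full_lr_steps(10_000, 1_000))  # 9500.0
```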
Congratulations @YouJiacheng on this new speedrun record! It is an interesting one.
GPT-2 Medium speedrun new record candidate: 6710 steps (estimated time: ~26.1 minutes).
Previous record: 6950 steps (27.2 minutes).
Reproducible log: it was timed to be 25.95 minutes when tuning enabled
0
4
54
Hi @stanislavfort, here is a PyTorch notebook which reproduces the basic interpolation result for ResNets on CIFAR-10. It should be runnable without modification. I hope it is helpful in your replication study.
More of my attempt to reproduce the Git Re-Basin paper by @SamuelAinsworth, J. Hayase & @siddhss5 => I don't see the key effect for ResNet on CIFAR-10 🤔. 📊 Plots in the thread. 🖥️ Colabs to reproduce them:
1
4
53
94% accuracy on CIFAR-10 is now possible in 5.48 seconds on a single A100. Using (a modification of) @hi_tysam's hyperlightspeedbench -- check out the code!
As of yesterday, @kellerjordan0 is the new CIFAR10 world record holder, with an unbelievable 5.48 second runtime. 🎉🎊🎉🎊 Another digit barrier falls!!! 🎊🎉🎊🎉 His code is available. Insane stuff. Brief summary and future integration deets in thread!
1
4
51