Keller Jordan

@kellerjordan0

Followers: 9K · Following: 3K · Media: 162 · Statuses: 1K

CIFAR-10 fanatic @OpenAI

Joined March 2016
@kellerjordan0
Keller Jordan
2 years
Neural network trainings are nondeterministic. Repeated runs each produce a unique network, often with significantly _varying_ test-set performance. 🆕📜 I demonstrate that this variation has a simple statistical structure, and is harmless & inevitable.
Tweet media one
25
152
1K
@kellerjordan0
Keller Jordan
3 months
Some trivia: In November I interviewed at both OpenAI & xAI. I thought both labs seemed strong, even tho ppl said xAI was a noncontender back then. But in the end, which to join was an easy choice, because-- the xAI guys told me all my ideas must be wrong & rejected me ¯\_(ツ)_/¯.
30
33
1K
@kellerjordan0
Keller Jordan
4 months
Unfortunately, it is hard to trust *claims* in 2025. What’s easier to trust is *incentives*. So here’s an incentive: I’ll pay a $3,000 bounty to the first person who uses this method to improve either the NanoGPT or CIFAR-10 speedruns.
@dbaek__
David D. Baek
4 months
1/9 🚨 New Paper Alert: Cross-Entropy Loss is NOT What You Need! 🚨 We introduce harmonic loss as an alternative to the standard CE loss for training neural networks and LLMs! Harmonic loss achieves 🛠️significantly better interpretability, ⚡faster convergence, and ⏳less grokking!
27
66
1K
@kellerjordan0
Keller Jordan
8 months
New training speed record for @karpathy’s 124M-parameter NanoGPT setup: 3.28 Fineweb validation loss in 3.7B training tokens. Previous record: 5B tokens. Changelog: new optimizer. 1/8
Tweet media one
27
81
1K
@kellerjordan0
Keller Jordan
1 year
Here's a variant of @karpathy's NanoGPT which trains twice as fast, reaching GPT-2 level quality in 5B tokens instead of the original 10B. It uses rotary embeddings and an improved lr schedule.
14
75
1K
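The rotary embeddings mentioned in this record encode position by rotating each pair of query/key channels through an angle proportional to the token position. A minimal sketch, with the base frequency and channel-pairing convention chosen for illustration rather than taken from the speedrun repo:

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embeddings to x of shape (..., seq, dim), dim even.

    Illustrative sketch: channels are paired as (even, odd) and each pair is
    rotated by pos * theta_i; real implementations differ in pairing and caching.
    """
    seq, dim = x.shape[-2], x.shape[-1]
    half = dim // 2
    theta = base ** (-torch.arange(half, dtype=torch.float32) / half)   # per-pair frequencies
    angles = torch.arange(seq, dtype=torch.float32)[:, None] * theta    # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1).flatten(-2)

q = torch.randn(1, 8, 128, 64)   # (batch, heads, seq, head_dim)
q_rotated = rope(q)              # same shape; relative positions are now encoded in the phases
```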
@kellerjordan0
Keller Jordan
4 months
Big GPU doesn't want you to know this but you can actually learn even more about the nature of neural network training from speedrunning CIFAR-10 than NanoGPT, since the experiments are so fast. I've personally trained 15 million CIFAR-10 models.
18
46
935
@kellerjordan0
Keller Jordan
4 months
Btw, I joined the OpenAI OpCo, LLC. Excited to do some science and contribute stuff into some big training runs. Yep, and shout-out to my amazing mentors @adamlerer and @mobav0!
Tweet media one
62
12
634
@kellerjordan0
Keller Jordan
8 months
New NanoGPT training speed record: 3.28 Fineweb validation loss in 15.2 minutes. Previous record: 22.3 minutes.
Changelog:
- pad embedding to nearest 64
- switch from GELU to ReLU²
- zero-init projection layers
- QKNorm
All four changes driven by @Grad62304977. 1/8
Tweet media one
Tweet media two
30
48
552
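Three of the four changes above (ReLU², zero-init projections, QKNorm) are easy to show in isolation. A minimal sketch, with an L2-style QK normalization assumed; the record's exact formulations (per-head scales, norm variant) may differ:

```python
import torch
import torch.nn.functional as F

def relu_squared(x: torch.Tensor) -> torch.Tensor:
    # "switch from GELU to ReLU²": square the ReLU output elementwise
    return F.relu(x).square()

def qk_norm(q: torch.Tensor, k: torch.Tensor):
    # "QKNorm": normalize queries and keys along the head dimension before the
    # dot product, so attention logits stay bounded regardless of q/k scale.
    return F.normalize(q, dim=-1), F.normalize(k, dim=-1)

# Illustrative shapes: (batch, heads, seq, head_dim)
q, k = torch.randn(2, 4, 16, 32), torch.randn(2, 4, 16, 32)
q, k = qk_norm(q, k)
logits = q @ k.transpose(-2, -1)          # every entry now lies in [-1, 1]

# "zero-init projection layers": each residual block initially contributes nothing
out_proj = torch.nn.Linear(128, 128, bias=False)
torch.nn.init.zeros_(out_proj.weight)
```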
@kellerjordan0
Keller Jordan
4 months
The reason I didn't write a proper arxiv paper for Muon is because I simply don't think there's any relationship between the ability to publish a paper with lots of good-looking results about a new optimizer, and whether that optimizer actually works. I only trust speedruns.
13
11
453
@kellerjordan0
Keller Jordan
8 months
New training speed record for @karpathy's NanoGPT setup: 3.28 Fineweb val loss in 22.3 minutes. Previous record: 24.9 minutes.
Changelog:
- Removed learning rate warmup, since the optimizer (Muon) doesn't need it
- Rescaled Muon's weight updates to have unit variance per param
1/5
Tweet media one
Tweet media two
12
41
437
@kellerjordan0
Keller Jordan
7 months
It's a new day, and here's a new NanoGPT speedrun record: 3.28 FineWeb val loss in 8.2 minutes on 8xH100. Previous record: 10.8 minutes.
Changelog:
- architectural shortcuts
- momentum warmup
- tanh logit capping
By @Grad62304977 and myself. 1/6
Tweet media one
15
39
411
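Of the three changes, tanh logit capping is the most self-contained: logits are squashed smoothly into a fixed range instead of being clipped. A minimal sketch, with the cap value chosen for illustration (a later record in this timeline lowers the output-logit softcap from 30 to 15):

```python
import torch

def soft_cap(logits: torch.Tensor, cap: float) -> torch.Tensor:
    # Tanh soft-capping: values are squashed smoothly into (-cap, cap),
    # so extreme logits saturate without the hard cutoff of clipping.
    return cap * torch.tanh(logits / cap)

x = torch.tensor([-100.0, -10.0, 0.0, 10.0, 100.0])
print(soft_cap(x, cap=15.0))   # large values saturate near ±15, small ones pass through almost unchanged
```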
@kellerjordan0
Keller Jordan
7 months
New result: For a 124M model, apparently:
Repeating 2B tokens 5 times: ❌ much worse than 10B tokens
Repeating 10B tokens 5 times: ✅ just as good as 50B
⇒ Conjecture: Repeating data is only really bad when you have less than the Chinchilla optimal amount.
Tweet media one
14
39
392
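For context on the conjecture: the commonly quoted Chinchilla rule of thumb is roughly 20 training tokens per parameter, which for this 124M-parameter model lands near 2.5B tokens, above the 2B-token case and below the 10B-token case. A tiny check of that arithmetic, taking the ~20 tokens/param figure as an assumption:

```python
params = 124e6
tokens_per_param = 20                                # rough Chinchilla rule of thumb (assumed here)
chinchilla_optimal = params * tokens_per_param
print(f"~{chinchilla_optimal / 1e9:.1f}B tokens")    # ~2.5B: 2B sits below it, 10B well above it
```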
@kellerjordan0
Keller Jordan
1 year
New training speed record for CIFAR-10: 94% accuracy in 3.29 seconds on a single GPU. Paper: Code:
11
56
366
@kellerjordan0
Keller Jordan
3 months
This is an exciting moment: The world's first report on successful large-scale training with a super-Adamic optimizer. Congratulations to the @Kimi_Moonshot team and to every Muon contributor: @Yuchenj_UW @bozavlado @YouJiacheng @leloykun L. Newhouse @jxbz.
@Kimi_Moonshot
Kimi.ai
3 months
🚀 Introducing our new tech report: Muon is Scalable for LLM Training. We found that the Muon optimizer can be scaled up using the following techniques:
• Adding weight decay
• Carefully adjusting the per-parameter update scale
✨ Highlights:
• ~2x computational efficiency vs AdamW
Tweet media one
Tweet media two
7
29
374
@kellerjordan0
Keller Jordan
7 months
I think the old mechanisms of academic trust don't really work anymore.
- Formal peer review
- Professional track records
- Affiliation to elite institutions
Maybe what works is:
- Publishing code that can easily reproduce a result, & cheerfully inviting attempts to disprove it.
17
20
361
@kellerjordan0
Keller Jordan
7 months
New NanoGPT training speed record: 12.03 minutes. Previous record: 13.05 minutes. Changelog: Updated PyTorch to version 2.5
Tweet media one
11
12
336
@kellerjordan0
Keller Jordan
8 months
For low-precision neural network training, I'm finding that ternary weights ({-1, 0, 1}) are consistently outperformed by septernary ({-2, -1, -0.5, 0, 0.5, 1, 2}). The latter format needs 40% fewer parameters to match fp16 performance, while consuming only 5% more bits.
Tweet media one
Tweet media two
13
30
330
@kellerjordan0
Keller Jordan
6 months
All recent NanoGPT training speed records have used Muon as their optimizer. Here's a writeup describing everything we know about it:.
14
41
333
@kellerjordan0
Keller Jordan
3 years
Along with many others, I find the results of Git Re-Basin by @SamuelAinsworth, J. Hayase & @siddhss5 highly interesting. But I believe there is a crucial detail which deserves attention: The authors replace BatchNorm with LayerNorm in their ResNet and VGG implementations. 1/14
6
51
320
@kellerjordan0
Keller Jordan
2 years
Something amusing in neural network optimization: There exists an algorithm which fits training data >5x faster than SGD/Adam. But it’s useless in practice. Let me explain. (1/4). Code:
Tweet media one
15
38
318
@kellerjordan0
Keller Jordan
8 months
I enjoy getting NanoGPT training speed records. I’m also interested in making my formulation of NanoGPT speedrunning an accessible benchmark on which other people find it easy to try new ideas. To that end, I have tried to keep the code of the current record short, and minimize.
14
28
323
@kellerjordan0
Keller Jordan
6 months
New NanoGPT training speed record: 3.28 FineWeb val loss in 5.03 minutes. Previous record: 7.2 minutes. Changelog: FlexAttention with large sequence length. This record is by @KoszarskyB
Tweet media one
11
22
308
@kellerjordan0
Keller Jordan
8 months
NanoGPT speedrunning update: Using the SOAP optimizer, @vyasnikhil96 has achieved a new sample efficiency record of 3.28 Fineweb validation loss in 3.25B training tokens. The previous record was 3.67B tokens by my proposed optimizer.
Tweet media one
20
34
296
@kellerjordan0
Keller Jordan
7 months
Here's an implementation of the Muon optimizer which can be used as a drop-in replacement for AdamW.
6
34
289
@kellerjordan0
Keller Jordan
7 months
New NanoGPT speedrunning result: Shortcut connections scale with training duration. In the 11/06 NanoGPT speedrunning record, we added two shortcut connections to the transformer, giving all blocks access to certain states from the first block. This reduced the number of tokens
Tweet media one
9
19
287
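The shortcut idea here (and the later record's "U-net-like connectivity pattern") can be sketched as: early blocks push their activations onto a stack, and later blocks mix them back in through learned, zero-initialized scalars. A minimal sketch with made-up module names; the record's actual wiring and scaling differ:

```python
import torch
import torch.nn as nn

class UNetStyleStack(nn.Module):
    """Block stack with U-net-like skips (illustrative only).

    The first half of the blocks save their outputs; the second half add them
    back in mirror order, weighted by learned scalars initialized to zero.
    """
    def __init__(self, block_factory, n_blocks: int, dim: int):
        super().__init__()
        assert n_blocks % 2 == 0
        self.blocks = nn.ModuleList(block_factory(dim) for _ in range(n_blocks))
        self.skip_weights = nn.Parameter(torch.zeros(n_blocks // 2))

    def forward(self, x):
        half = len(self.blocks) // 2
        skips = []
        for block in self.blocks[:half]:
            x = block(x)
            skips.append(x)
        for i, block in enumerate(self.blocks[half:]):
            x = x + self.skip_weights[i] * skips.pop()   # mirror-image skip connection
            x = block(x)
        return x

# Toy block for demonstration; the speedrun uses attention+MLP transformer blocks.
stack = UNetStyleStack(lambda d: nn.Sequential(nn.Linear(d, d), nn.GELU()), n_blocks=4, dim=64)
out = stack(torch.randn(2, 16, 64))
```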
@kellerjordan0
Keller Jordan
10 months
How does the learning rate used to train a neural network affect its predictions? For certain toy models, it can be chaotic. But here’s a demonstration that for standard convnet training its effect is simple - even locally linear - once we average over repeated runs. 🧵1/7
3
29
280
@kellerjordan0
Keller Jordan
1 year
I'm interested in this recent ICLR 2024 spotlight paper from Google research, which found a power-law alignment between bias and variance in softmax probability space. In this thread I'll replicate its central empirical result, but then argue that it
Tweet media one
Tweet media two
Tweet media three
Tweet media four
9
44
257
@kellerjordan0
Keller Jordan
7 months
Here's a new result in NanoGPT speedrunning: Straightforwardly scaling up the speedrun yields a training that reaches GPT-2 (1.5B)'s level of performance in 7.3 hours on 8xH100. The previous record for this target was 24 8xH100-hours by @karpathy using llm.c. 1/10
Tweet media one
Tweet media two
6
24
257
@kellerjordan0
Keller Jordan
7 months
New NanoGPT training speed record: 3.28 FineWeb val loss in 7.8 minutes on 8xH100. Previous record: 8.2 minutes. Changelog: Put hidden states in Bfloat16
Tweet media one
8
10
248
@kellerjordan0
Keller Jordan
7 months
New NanoGPT training speed record: 3.28 FineWeb val loss in 10.8 minutes on 8xH100. Previous record: 12.0 minutes.
Changelog:
- untie embed and head weights
- add RMSNorm after embed
- initialize head to zero
Driven by @Grad62304977
Tweet media one
12
17
248
@kellerjordan0
Keller Jordan
4 months
There's been community interest in having a larger NanoGPT category to speedrun. So here's a record to kick things off:
New NanoGPT-medium speedrun record: 2.92 FineWeb val loss in 29.3 8xH100-minutes. Prev record: 5.8 hours by @karpathy's llm.c-350M. Method: scaled speedrun
Tweet media one
12
20
240
@kellerjordan0
Keller Jordan
7 months
New NanoGPT training speed record: 3.28 Fineweb val loss in 13.1 minutes. Previous record: 15.2 minutes. Changelog: distributed the overhead of Muon
Tweet media one
14
12
217
@kellerjordan0
Keller Jordan
7 months
New NanoGPT training speed record: 3.28 FineWeb val loss in 7.23 minutes on 8xH100. Previous record: 7.8 minutes.
Changelog:
- Added U-net-like connectivity pattern
- Doubled learning rate
This record is by @brendanh0gan
Tweet media one
9
11
238
@kellerjordan0
Keller Jordan
4 months
New NanoGPT-Medium speedrun record: 2.92 FineWeb val loss in 28.1 minutes on 8xH100. Previous record: 29.3 minutes. Changelog: Added standard weight decay to the Muon optimizer
Tweet media one
7
17
242
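"Standard weight decay" in this changelog most plausibly means decoupled, AdamW-style decay applied directly to the weights alongside the optimizer's update. A minimal sketch of that one step, where `update` stands in for whatever the optimizer (e.g. Muon's orthogonalized momentum) produced for the parameter:

```python
import torch

def step_with_decoupled_weight_decay(p: torch.Tensor, update: torch.Tensor,
                                     lr: float, weight_decay: float) -> None:
    # Decoupled (AdamW-style) decay: shrink the weight toward zero independently
    # of the gradient-based update, then apply the update itself.
    p.data.mul_(1 - lr * weight_decay)
    p.data.add_(update, alpha=-lr)

p = torch.nn.Parameter(torch.randn(768, 768))
step_with_decoupled_weight_decay(p, torch.randn_like(p), lr=0.02, weight_decay=0.01)
```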
@kellerjordan0
Keller Jordan
8 months
New CIFAR-10 speed record: 94% in 2.73 seconds on a single A100. Previous record: 3.09 seconds. Changelog: Implemented spectral gradient descent.
8
17
229
@kellerjordan0
Keller Jordan
5 months
New NanoGPT training speed record: 3.28 FineWeb val loss in 3.4 minutes on 8xH100. Previous record: 3.58 minutes. Change: Lowered logit softcap from 30 to 15. This record was discovered by @KoszarskyB, congratulations!
Tweet media one
11
15
232
@kellerjordan0
Keller Jordan
5 months
I would like to issue a citation request for Muon to the following newly appearing paper from Microsoft Research: Ma et al. (2024). SWAN: Preprocessing SGD Enables Adam-Level Performance On LLM Training With Significant Memory Reduction. 1/5
7
25
227
@kellerjordan0
Keller Jordan
6 months
Here's an interpretability method for neural net trainings that is strong but expensive: Run the training a few thousand times, and then for pairs of inputs, measure the correlation between their final predicted outputs across runs of training. This yields a highly… 1/8🧵
Tweet media one
Tweet media two
5
16
227
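The procedure described above reduces to a correlation matrix over test inputs, computed across repeated training runs. A minimal sketch, assuming each run is summarized by one scalar prediction per example (which scalar to correlate, e.g. the true-class probability, is an assumption here):

```python
import numpy as np

def cross_run_correlation(predictions: np.ndarray) -> np.ndarray:
    """Pairwise correlation between inputs' predictions across training runs.

    predictions: shape (n_runs, n_examples); entry [r, i] is a scalar summary of
    run r's output on example i. Returns an (n_examples, n_examples) matrix.
    """
    return np.corrcoef(predictions, rowvar=False)

rng = np.random.default_rng(0)
preds = rng.normal(size=(2000, 50))   # stand-in for a few thousand runs on 50 test inputs
corr = cross_run_correlation(preds)
print(corr.shape)                      # (50, 50): how similarly each pair of inputs is treated across runs
```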
@kellerjordan0
Keller Jordan
4 months
There's a new Microsoft paper and repo that came out today, and it's using the Muon optimizer! That's fun to see
Tweet media one
Tweet media two
Tweet media three
5
17
203
@kellerjordan0
Keller Jordan
4 months
World record #20 of the NanoGPT speedrun has broken the 3-minute barrier.
Tweet media one
@leloykun
leloy!
4 months
Sub 3-minute NanoGPT Speedrun Record. We're proud to share that we've just breached the 3 min mark! This means that with an ephemeral pod of 8xH100s that costs $8/hour, training a GPT-2-ish level model now only costs $0.40!
---
What's in the latest record? A 🧵
Tweet media one
Tweet media two
Tweet media three
Tweet media four
3
11
189
@kellerjordan0
Keller Jordan
1 year
Here's a new training speed record for CIFAR-10: 96% accuracy in 35 seconds on a single A100. Code:
1
18
177
@kellerjordan0
Keller Jordan
7 months
Nice, looks like X posts are now citable artifacts.
Tweet media one
Tweet media two
@jxbz
Jeremy Bernstein
7 months
Over the past month, methods developed by myself and my collaborators were used to set new speed records for training LLMs up to 1.5B scale. I also want to help the science go faster, so now get ready for: ~The General Theory of Modular Duality~ (1/9)
5
4
178
@kellerjordan0
Keller Jordan
7 months
Woke up to this. Sorry haters 😎.
Tweet media one
@kellerjordan0
Keller Jordan
7 months
Muon haters be like "red will intersect green". Let's see. (in a few hours)
Tweet media one
9
6
177
@kellerjordan0
Keller Jordan
4 months
Here, I'll prove it: "The reason the Sophia paper (from a Stanford lab / has >100 citations) didn't lead to an optimization revolution is because their Adam baseline used a suboptimal learning rate".
Tweet media one
@kellerjordan0
Keller Jordan
4 months
The reason papers alone can't provide strong evidence is because if they contain a mistake (like an untuned hyperparameter), ***nothing happens***. Whereas, if a speedrun contains an untuned hyperparameter, we find out *automatically* in the next record.
6
3
179
@kellerjordan0
Keller Jordan
1 year
Horizontal flipping is the most common data augmentation in machine learning. 🆕📜 I show that it can be improved for free: instead of flipping randomly, flip half the training images on even epochs and the other half on odd epochs. 1/4
Tweet media one
Tweet media two
3
22
164
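The proposed scheme replaces independent random flips with a deterministic alternation: a fixed half of the training set is flipped on even epochs and the other half on odd epochs. A minimal sketch, partitioning by image index parity (the paper's partition may differ):

```python
import torch

def alternating_flip(images: torch.Tensor, epoch: int) -> torch.Tensor:
    """Deterministic alternating horizontal flip for images of shape (N, C, H, W).

    On even epochs the even-indexed images are flipped; on odd epochs the
    odd-indexed ones are. Index parity is just one way to fix the 50/50 split.
    """
    out = images.clone()
    flip_mask = (torch.arange(len(images)) % 2) == (epoch % 2)
    out[flip_mask] = torch.flip(out[flip_mask], dims=[-1])   # flip along the width axis
    return out

batch = torch.randn(8, 3, 32, 32)
epoch0 = alternating_flip(batch, epoch=0)
epoch1 = alternating_flip(batch, epoch=1)   # every image is flipped in exactly one of the two epochs
```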
@kellerjordan0
Keller Jordan
5 months
the Dark Forest effect in AI social media:
if you have any status to lose, then posting negative comments about popular works has low upside and high potential downside
=> knowledgeable people remove themselves from the information ecosystem as predators for 💩 research.
16
2
155
@kellerjordan0
Keller Jordan
6 months
Hey lab: Here is a paper I’d be interested in seeing tried in the NanoGPT speedrun (might require GPU programming).
7
14
152
@kellerjordan0
Keller Jordan
7 months
When I'm reading academic papers, my question is rarely "Does this scale to more datasets and architectures?". Rather I usually just want to know "Can I even trust the one single experiment at the core of the paper?". The extra datasets and architectures just add a little extra.
5
5
153
@kellerjordan0
Keller Jordan
6 months
A blog about Muon by Jianlin Su, the creator of RoPE.
6
10
149
@kellerjordan0
Keller Jordan
6 months
This is officially the new record! Congrats @hi_tysam (who is also an OG of CIFAR-10 speedrunning).
@hi_tysam
Fern
6 months
New NanoGPT training speed record: 3.28 FineWeb val loss in 4.66 minutes. Previous record: 5.03 minutes.
Changelog:
- FlexAttention blocksize warmup
- hyperparameter tweaks
Tweet media one
3
11
144
@kellerjordan0
Keller Jordan
8 months
I've decided to name the optimizer described in this thread `Muon`, because it takes each update matrix produced by standard sgd-MomentUm and replaces it with the nearest Orthogonal matrix using a Newton-schulz iteration. 1/5.
@kellerjordan0
Keller Jordan
8 months
New training speed record for @karpathy’s 124M-parameter NanoGPT setup: 3.28 Fineweb validation loss in 3.7B training tokens. Previous record: 5B tokens. Changelog: new optimizer. 1/8
Tweet media one
1
3
141
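The orthogonalization step that gives Muon its name can be sketched compactly: normalize the momentum matrix, then run a few quintic Newton-Schulz iterations to push its singular values toward 1. A minimal sketch; the coefficients below are the commonly quoted ones and the record code runs this in bfloat16, so treat the details as illustrative rather than canonical:

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately replace G with the nearest semi-orthogonal matrix."""
    a, b, c = 3.4445, -4.7750, 2.0315        # quintic coefficients (illustrative)
    X = G.float()
    X = X / (X.norm() + 1e-7)                # Frobenius norm bounds the spectral norm, so singular values <= 1
    tall = X.size(0) > X.size(1)
    if tall:                                 # iterate on the wide orientation for cheaper matmuls
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return (X.T if tall else X).to(G.dtype)

# Hypothetical single-parameter usage: orthogonalize the momentum buffer, then apply it.
momentum = torch.randn(768, 3072)
update = newton_schulz_orthogonalize(momentum)   # singular values pushed toward 1
# p.data.add_(update, alpha=-lr)
```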
@kellerjordan0
Keller Jordan
3 months
A piece of important media literacy that might be non-obvious to newcomers: The plurality of competent talent in our field is incentivized (by both the employers and the public) not to publicly comment on new papers/research.
10
4
134
@kellerjordan0
Keller Jordan
7 months
New CIFAR-10 training speed record: 94% in 2.59 seconds on a single A100. Previous record: 2.73 seconds. Changelog: Upgraded the proto-Muon that I used to set the previous record to the full Muon optimizer.
2
6
128
@kellerjordan0
Keller Jordan
10 months
New CIFAR-10 speed record: 96% in 27.3 seconds on a single A100. Previous record: 34.7 seconds. Changelog: Introduced a small proxy run, the losses of which are used to filter data during the main run.
6
8
125
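The changelog's proxy-run idea can be read as: train a tiny, cheap model first, record a per-example loss, and use it to decide which examples the main run sees. The selection rule below (drop a fraction of the lowest-loss, i.e. easiest, examples) is an assumption for illustration, not necessarily the record's criterion:

```python
import torch

def filter_by_proxy_loss(per_example_loss: torch.Tensor, drop_frac: float = 0.2) -> torch.Tensor:
    """Return indices of training examples to keep for the main run.

    Assumed rule: drop the `drop_frac` lowest-loss (easiest) examples as judged
    by the proxy run; the actual record's selection criterion may differ.
    """
    n_drop = int(drop_frac * per_example_loss.numel())
    order = per_example_loss.argsort()     # ascending: easiest examples first
    return order[n_drop:]

proxy_losses = torch.rand(50_000)          # e.g. one proxy-run loss per CIFAR-10 training image
keep_idx = filter_by_proxy_loss(proxy_losses)
print(keep_idx.numel())                    # 40000 examples survive the filter
```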
@kellerjordan0
Keller Jordan
4 months
The reason papers alone can't provide strong evidence is because if they contain a mistake (like an untuned hyperparameter), ***nothing happens***. Whereas, if a speedrun contains an untuned hyperparameter, we find out *automatically* in the next record.
@kellerjordan0
Keller Jordan
4 months
The reason I didn't write a proper arxiv paper for Muon is because I simply don't think there's any relationship between the ability to publish a paper with lots of good-looking results about a new optimizer, and whether that optimizer actually works. I only trust speedruns.
6
5
127
@kellerjordan0
Keller Jordan
7 months
Here is a comparison of the best optimizers I know about for NanoGPT speedrunning. Reproducible logs:
Tweet media one
Tweet media two
4
8
129
@kellerjordan0
Keller Jordan
5 months
Monthly reminder that Muon is still the strongest known optimizer for the most highly tuned public language model training benchmark.
@kellerjordan0
Keller Jordan
6 months
All recent NanoGPT training speed records have used Muon as their optimizer. Here's a writeup describing everything we know about it:.
8
9
126
@kellerjordan0
Keller Jordan
5 months
Thank you to the authors of SWAN, for honoring my citation request for Muon! Muon has no arxiv paper, but it has open code and *frictionlessly reproducible success on a competitive benchmark* (NanoGPT speedrunning). I am glad to see that standard of evidence honored by citation!
Tweet media one
Tweet media two
Tweet media three
Tweet media four
8
7
120
@kellerjordan0
Keller Jordan
8 months
The new optimizer is defined as follows. It is based on orthogonalizing the update given by SGD-Nesterov-momentum in an efficient way
Tweet media one
Tweet media two
6
3
121
@kellerjordan0
Keller Jordan
8 months
NanoGPT speedrunning update: @bozavlado discovered that the new optimizer performs ~3% better if we orthogonalize the QKV updates separately rather than together. I replicated this and found that it also holds for SOAP; it was used in yesterday’s record.
@bozavlado
Vlado Boza
8 months
@kellerjordan0 Yes, but with your older code (with warmup and w/o scaling by number of elements). Also this could be seed dependent, etc. Take it with very huge grain of salt.
Tweet media one
8
9
116
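The reported tweak is purely structural: when Q, K, and V share one fused weight matrix, orthogonalize the three blocks of its update independently instead of the fused matrix as a whole. A minimal sketch, assuming the fused weight stacks Q, K, V along the output dimension, with the Newton-Schulz helper restated so the snippet stands alone:

```python
import torch

def newton_schulz_orthogonalize(G, steps=5, a=3.4445, b=-4.7750, c=2.0315):
    # Compact quintic Newton-Schulz iteration (coefficients illustrative).
    X = G / (G.norm() + 1e-7)
    tall = X.size(0) > X.size(1)
    if tall:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if tall else X

def orthogonalize_qkv_separately(update: torch.Tensor) -> torch.Tensor:
    # Assumes a fused QKV weight of shape (3*d, d): split along dim 0,
    # orthogonalize each block on its own, then re-concatenate.
    q, k, v = update.chunk(3, dim=0)
    return torch.cat([newton_schulz_orthogonalize(m) for m in (q, k, v)], dim=0)

d = 64
fused_update = torch.randn(3 * d, d)
separate = orthogonalize_qkv_separately(fused_update)   # the variant reported ~3% better
together = newton_schulz_orthogonalize(fused_update)    # orthogonalizing the fused matrix at once
```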
@kellerjordan0
Keller Jordan
3 years
Why don’t current model merging results generalize to standard ConvNets? And how can this be fixed? We answer these Qs and present a method that improves merged NN performance for any choice of norm layer. W/ @HanieSedghi @osaukh @rahiment @bneyshabur
Tweet media one
2
23
112
@kellerjordan0
Keller Jordan
9 months
There should be a monthly GPT-2 training speedrunning competition. Fixed dataset. $50K prize to the team who gets the best validation loss after pretraining for 100 H100-hours. Imagine how real and how open things would get. Btw, I would lose bc I don’t know CUDA.
10
3
114
@kellerjordan0
Keller Jordan
11 months
A small result about GPT-2 training: Warming up for too long has a simple and predictable effect on the loss curve. 🧵
Tweet media one
9
11
111
@kellerjordan0
Keller Jordan
8 months
It uses half the memory of AdamW and takes 3% extra wallclock time per step for this setup. Here's code to reproduce the result:
1
4
108
@kellerjordan0
Keller Jordan
4 months
@khandelia1000 My crazy answer is "all of it.". There are many *techniques* which don't transfer, but in my experience it's always comprehensible and informative *why* they don't transfer, so you end up learning anyway.
2
0
108
@kellerjordan0
Keller Jordan
6 months
In the 2020 era I heard that BatchNorm networks are uniquely bad at adapting to distribution shifts. But I never actually saw it be tested. So I ran the experiment and found that it’s not true: BatchNorm networks adapt about as well as Norm-Free networks. (Below is CIFAR-10C.) 1/8
Tweet media one
2
5
107
@kellerjordan0
Keller Jordan
9 months
My thought immediately after seeing the figure: the gap between the proposed method and Adam is too large, so the Adam run must have been improperly tuned.
@NousResearch
Nous Research
9 months
What if you could use all the computing power in the world to train a shared, open source AI model? Preliminary report: Nous Research is proud to release a preliminary report on DisTrO (Distributed Training Over-the-Internet), a family of
Tweet media one
3
2
106
@kellerjordan0
Keller Jordan
2 years
@nearcyan The Minecraft redstone community has higher standards of academic rigor than ML research. Might be hard to convince them to switch to the lesser field.
0
3
91
@kellerjordan0
Keller Jordan
7 months
Interesting paper. What confuses me is that the theory doesn’t seem to predict the need for a roughly constant-fraction decay duration. Why do we need to decay for ~1000 steps in a 10K step training, and ~10K steps in a 100K step training?
@tengyuma
Tengyu Ma
7 months
WSD learning rate is taking off—lower loss, no pre-set compute budget, & easier continual training. Yet, its loss curve is puzzling—high in stable phase but jumps in decay phase. Our paper explains it with a 'River Valley' structure of the loss! 🧵🧵
Tweet media one
5
5
99
@kellerjordan0
Keller Jordan
8 months
Here are three reasons to be skeptical regarding the claim that ternary weights (1.58 bits) are just as good as full-precision:
1. [Loss curves from `Era of 1 bit LLMs`]: In the loss curves for the project which were released on GitHub (but are absent from the arXiv), there is.
5
14
95
@kellerjordan0
Keller Jordan
2 years
So in conclusion: it’s actually very easy/fast to find zero-loss solutions to typical supervised learning problems in DL. Whereas SGD/Adam are inefficient - but make up for it by having an “implicit bias” towards solutions with much better generalization. (4/4).
3
0
93
@kellerjordan0
Keller Jordan
8 months
I shall be using GPUs very graciously provided by @Yuchenj_UW @hyperbolic_labs to search for optimal hyperparameters for both this optimizer as well as DistributedShampoo, for the purpose of NanoGPT speedrunning, on the recommendation of @_arohan_ .
@kellerjordan0
Keller Jordan
8 months
New training speed record for @karpathy’s 124M-parameter NanoGPT setup: 3.28 Fineweb validation loss in 3.7B training tokens. Previous record: 5B tokens. Changelog: new optimizer. 1/8
Tweet media one
7
3
93
@kellerjordan0
Keller Jordan
1 year
Happy to say this project has been accepted to ICLR. One of its central results: for stable binary classification trainings, it is possible to bound variance between runs *a priori* via the formula: var(error rate) <= (avg err rate) / (2 × num test examples).
Tweet media one
@kellerjordan0
Keller Jordan
2 years
Neural network trainings are nondeterministic. Repeated runs each produce a unique network, often with significantly _varying_ test-set performance. 🆕📜 I demonstrate that this variation has a simple statistical structure, and is harmless & inevitable.
Tweet media one
4
8
90
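The bound above can be plugged in directly to see the scale it predicts. A tiny worked example with illustrative numbers (CIFAR-10's 10,000 test images and a ~6% average error rate):

```python
import math

# From the tweet: var(error rate) <= (avg err rate) / (2 * num test examples)
avg_err = 0.06          # e.g. a ~94%-accuracy CIFAR-10 model (illustrative)
n_test = 10_000         # CIFAR-10 test set size

std_bound = math.sqrt(avg_err / (2 * n_test))
print(f"run-to-run std of test error <= {std_bound:.4%}")   # about 0.17%
```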
@kellerjordan0
Keller Jordan
10 months
New CIFAR-10 speed record: 94% in 3.09 seconds on a single NVIDIA A100. Previous record: 3.29 seconds. Changelog: Upgraded PyTorch to version 2.4.
2
1
88
@kellerjordan0
Keller Jordan
10 months
A tiny research contribution: here's an explanation for a fact about neural networks that was considered surprising in 2022. @yidingjiang et al. (2022) in their paper "Assessing Generalization via Disagreement" observe with surprise that the
Tweet media one
6
6
86
@kellerjordan0
Keller Jordan
7 months
I performed this experiment to obtain a response to the following two critiques:
- The methods only work for small models and won't scale. (well, they scale to 1.5B at least)
- The methods only help val loss and not downstream perf. (nope, they do help)
@kellerjordan0
Keller Jordan
7 months
Here's a new result in NanoGPT speedrunning: Straightforwardly scaling up the speedrun yields a training that reaches GPT-2 (1.5B)'s level of performance in 7.3 hours on 8xH100. The previous record for this target was 24 8xH100-hours by @karpathy using llm.c. 1/10
Tweet media one
Tweet media two
4
2
83
@kellerjordan0
Keller Jordan
1 year
Today I learned that Neyshabur et al. had a figure showing double descent way back in 2014
Tweet media one
5
7
82
@kellerjordan0
Keller Jordan
1 year
Jumping onto this train, here's an SGD-Nesterov that outperforms both.
@aaron_defazio
Aaron Defazio
1 year
Schedule-Free (dotted black line) outperforming highly tuned SGD!
1
9
81
@kellerjordan0
Keller Jordan
6 months
@karpathy 🙏 latest record (7.2min -> 5min) is the work of @KoszarskyB.
0
1
75
@kellerjordan0
Keller Jordan
8 months
I would like to thank & acknowledge @jxbz for sending me his recent paper, which is where I learned about the crucial Newton-Schulz iteration method. He also had the insight that my initial quintic coefficients could be improved. 7/8.
1
5
74
@kellerjordan0
Keller Jordan
4 months
Tweet media one
4
5
74
@kellerjordan0
Keller Jordan
8 months
.@PrimeIntellect has donated a number of H100-hours to support the continuation of this research. Thank you @vincentweisser!.
@kellerjordan0
Keller Jordan
8 months
New training speed record for @karpathy’s 124M-parameter NanoGPT setup: 3.28 Fineweb validation loss in 3.7B training tokens. Previous record: 5B tokens. Changelog: new optimizer. 1/8
Tweet media one
1
4
77
@kellerjordan0
Keller Jordan
7 months
Muon haters be like "red will intersect green". Let's see. (in a few hours)
Tweet media one
2
1
73
@kellerjordan0
Keller Jordan
8 months
There are some simple ways that all optimizer research can go wrong: e.g., my AdamW baseline could be poorly tuned. So I hereby invite anyone to try to get a better AdamW baseline than I did in this setup; I'll happily boost/RT your result if you can.
@kellerjordan0
Keller Jordan
8 months
New training speed record for @karpathy’s 124M-parameter NanoGPT setup: 3.28 Fineweb validation loss in 3.7B training tokens. Previous record: 5B tokens. Changelog: new optimizer. 1/8
Tweet media one
4
2
68
@kellerjordan0
Keller Jordan
1 year
Thanks @francoisfleuret for these great questions. The answer is that there's no single key idea: I'm using six different techniques which each contribute to the final speed. Each one has different generalization properties. The techniques are (1).
@francoisfleuret
François Fleuret
1 year
@kellerjordan0 Can you TL;DR ? Is there a key idea or just piling up things? And does it provide general insight applicable elsewhere or this is just performance like speed run in a video game?.
1
7
63
@kellerjordan0
Keller Jordan
6 months
Great work. This is the new record. Congrats @leloykun and @YouJiacheng!.
@YouJiacheng
You Jiacheng
6 months
New NanoGPT training speed record: 3.28 FineWeb val loss in 3.95 minutes. Previous record: 4.41 minutes.
Changelog:
- @leloykun arch optimization: ~17s
- remove "dead" code: ~1.5s
- re-implement dataloader: ~2.5s
- re-implement Muon: ~1s
- manual block_mask creation: ~5s
Tweet media one
0
8
64
@kellerjordan0
Keller Jordan
2 years
4/ But instead, it turns out that by the end of training, there’s almost no correlation between performance on the two splits. For example, out of ~10^5 repeated trainings, the best network on the first split isn’t even above average on the second.
Tweet media one
2
2
60
@kellerjordan0
Keller Jordan
7 months
.@BlinkDL_AI has entered the NanoGPT speedrunning game with a new sample-efficiency but not wallclock record, based on RWKV-7 and Muon. You love to see it.
@BlinkDL_AI
BlinkDL
7 months
RWKV-7: attention-free and surpassing modded-GPT. Training code & log: Larger headsz can reach 3.26xx. My current implementation is slow🤣Might can reach 85% GPT speed @ ctx1k (or faster than GPT @ ctx4k) after optimization. Any helps are welcome🙏#RWKV
Tweet media one
2
3
61
@kellerjordan0
Keller Jordan
10 months
Here's a mini-contribution: If "Self-distillation is performing implicit ensemble + knowledge distillation" (Allen-Zhu & Li 2020; ICLR 2023 outstanding paper honorable mention), then why do ensembles of self-distilled models consistently underperform regular ensembles? 🤔🕵️
Tweet media one
6
3
60
@kellerjordan0
Keller Jordan
5 months
Real. Congrats @YouJiacheng! The time needed to reach the performance of Karpathy’s NanoGPT/llm.c baseline has gone from 45 to <3.6 minutes on 8xH100. I wouldn’t have predicted it.
@YouJiacheng
You Jiacheng
5 months
New NanoGPT training speed record: 3.28 FineWeb val loss in 3.582 minutes.
Changelog:
- Truncate RoPE: 1460 steps, 224.5s
- ValueEmbed [0, 1, 2, 3, 4, 5, 5, 4, 3, 2, 1, 0] → [0, 1, 2, None, . , None, 0, 1, 2]: 1470 steps, 222s
- Remove the 8th Attention: 1490 steps, 214.9s
Tweet media one
0
4
57
@kellerjordan0
Keller Jordan
7 months
So I was looking into the weird rhythmic spikes in these curves. And it turns out that it's because I actually ran them with 5 epochs of 10B tokens. Here's what they look like when using 50B unique tokens instead. Pretty similar, surprisingly. 1/3.
Tweet media one
@kellerjordan0
Keller Jordan
7 months
Woke up to this. Sorry haters 😎.
Tweet media one
2
0
57
@kellerjordan0
Keller Jordan
8 months
Apropos this rather highly viewed post/method, I'd like to note that the Shampoo optimizer is:
- The only reason I ever even tried this. Shampoo gives a very nearly equivalent update w/o accumulation.
- Actually used for massive pretrainings. This is not.
@kellerjordan0
Keller Jordan
8 months
New training speed record for @karpathy’s 124M-parameter NanoGPT setup: 3.28 Fineweb validation loss in 3.7B training tokens. Previous record: 5B tokens. Changelog: new optimizer. 1/8
Tweet media one
5
2
55
@kellerjordan0
Keller Jordan
11 months
Yesterday I posted a small result about GPT-2 training: Warming up the learning rate for X steps too long effectively delays training by X/2 steps. Here’s a new experiment which provides a bit more evidence for this "warmup/2 law". 🧵/5
Tweet media one
@kellerjordan0
Keller Jordan
11 months
A small result about GPT-2 training: Warming up for too long has a simple and predictable effect on the loss curve. 🧵
Tweet media one
4
8
55
@kellerjordan0
Keller Jordan
2 months
Congratulations @YouJiacheng on this new speedrun record! It is an interesting one.
@YouJiacheng
You Jiacheng
2 months
GPT-2 Medium speedrun new record candidate: 6710 steps (estimated time: ~26.1 minutes).
Previous record: 6950 steps (27.2 minutes).
Reproducible log: it was timed to be 25.95 minutes when tuning enabled
Tweet media one
0
4
54
@kellerjordan0
Keller Jordan
3 years
Hi @stanislavfort, here is a PyTorch notebook which reproduces the basic interpolation result for ResNets on CIFAR-10. It should be runnable without modification. I hope it is helpful in your replication study.
@stanislavfort
Stanislav Fort
3 years
More of my attempt to reproduce the Git Re-Basin paper by @SamuelAinsworth, J. Hayase & @siddhss5 => I don't see the key effect for ResNet on CIFAR-10 🤔. 📊 Plots in the thread. 🖥️ Colabs to reproduce them:
Tweet media one
Tweet media two
1
4
53
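The "basic interpolation result" being reproduced here is a linear-mode-connectivity check: walk a straight line between the weights of two independently trained (and, for Git Re-Basin, permutation-aligned) networks and evaluate the metric along the path. A minimal sketch, assuming two state dicts with identical keys and an `evaluate(model)` callable supplied by the caller:

```python
import copy
import torch

def interpolate_state_dicts(sd_a: dict, sd_b: dict, alpha: float) -> dict:
    # Elementwise linear interpolation between two compatible state dicts.
    return {k: (1 - alpha) * sd_a[k] + alpha * sd_b[k] for k in sd_a}

def interpolation_curve(model, sd_a, sd_b, evaluate, n_points: int = 11):
    """Metric along the straight line between two trained networks.

    `evaluate` is assumed to map a model to a scalar (e.g. test accuracy);
    for BatchNorm networks the running statistics should be recomputed at
    each interpolation point before evaluating.
    """
    curve = []
    for i in range(n_points):
        alpha = i / (n_points - 1)
        m = copy.deepcopy(model)
        m.load_state_dict(interpolate_state_dicts(sd_a, sd_b, alpha))
        curve.append((alpha, evaluate(m)))
    return curve

# Hypothetical usage: interpolation_curve(net, net_a.state_dict(), net_b.state_dict(), evaluate=test_accuracy)
```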
@kellerjordan0
Keller Jordan
8 months
0
3
54
@kellerjordan0
Keller Jordan
2 years
2/ Background: for standard CIFAR-10 trainings there exist rare “lucky seeds/runs” attaining over +0.5% higher test-set accuracy than the average (10% fewer errors). ImageNet trainings are similar with +0.4%. These differences are considered significant in computer vision.
1
0
52
@kellerjordan0
Keller Jordan
1 year
94% accuracy on CIFAR-10 is now possible in 5.48 seconds on a single A100. Using (a modification of) @hi_tysam's hyperlightspeedbench -- check out the code!.
@hi_tysam
Fern
1 year
As of yesterday, @kellerjordan0 is the new CIFAR10 world record holder, with an unbelievable 5.48 second runtime. 🎉🎊🎉🎊 Another digit barrier falls!!! 🎊🎉🎊🎉 His code is available at Insane stuff. Brief summary and future integration deets in thread!
1
4
51