Keller Jordan
@kellerjordan0
Followers
13K
Following
3K
Media
174
Statuses
1K
CIFAR-10 fanatic @OpenAI
San Francisco
Joined March 2016
Neural network trainings are nondeterministic. Repeated runs each produce a unique network, often with significantly _varying_ test-set performance. 🆕📜 I demonstrate that this variation has a simple statistical structure, and is harmless & inevitable https://t.co/1zzpNHi0Vy
26
154
1K
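A minimal sketch of the kind of check the linked paper motivates, assuming a placeholder train_and_eval(seed) function (not the paper's code): train the same configuration with several seeds and compare the observed spread in test accuracy to the binomial noise floor of a finite test set.

```python
# Hypothetical sketch: quantify run-to-run variation in test accuracy and
# compare it to the binomial sampling noise of a finite test set.
# `train_and_eval(seed)` is a placeholder for any training script that
# returns test-set accuracy; it is not the linked paper's code.
import numpy as np

def run_variation_study(train_and_eval, n_runs=30, test_set_size=10_000):
    accs = np.array([train_and_eval(seed) for seed in range(n_runs)])
    observed_std = accs.std(ddof=1)
    p = accs.mean()
    # Std dev expected purely from sampling a finite test set (binomial noise)
    binomial_std = np.sqrt(p * (1 - p) / test_set_size)
    print(f"mean acc: {p:.4f}")
    print(f"observed run-to-run std: {observed_std:.4f}")
    print(f"binomial (test-set sampling) std: {binomial_std:.4f}")
    return accs
```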
How did an intern with no AI experience break a world record? Read our new blog post:
hiverge.ai
Hiverge
We challenged our intern @ramadan_al76760 (zero prior AI experience) to beat the CIFAR-10 training speed record using @hivergeai's algorithmic discovery engine. Result: Sub-2-second (!!) training for the first time ever.
0
4
20
@yonashav Oops, I meant to say observable universe, not known universe.
0
0
6
Update: @yonashav pointed out that if we allow FTL travel, then this bound fails because the mass (and therefore computational capacity) of the entire universe is potentially much larger than the known universe. So to get a bound, we do have to assume the impossibility of FTL.
5
2
113
Theorem: The maximum possible duration of the computational singularity is 470 years. Proof: The FLOPs capacity of all computers which existed in the year 1986 is estimated to be at most 4.5e14 (Hilbert et al. 2011). Based on public Nvidia revenue and GPU specs, this capacity
64
51
623
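The tweet is cut off before stating the assumed growth rate and the physical ceiling, so the sketch below only illustrates the shape of the bound; the growth factor and the Bremermann-limit ceiling are assumptions, not the tweet's numbers.

```python
# Hedged back-of-the-envelope sketch of the bound's form. Only the 4.5e14
# figure comes from the tweet; the growth factor and the ceiling are ASSUMED.
import math

c_1986 = 4.5e14          # estimated FLOPs capacity of all computers in 1986 (from the tweet)
growth_per_year = 1.5    # ASSUMED sustained annual growth factor during the "singularity"
# ASSUMED ceiling: Bremermann's limit (~1.36e50 ops/s per kg) times a rough
# estimate of the ordinary-matter mass of the observable universe (~1.5e53 kg).
ceiling = 1.36e50 * 1.5e53

max_years = math.log(ceiling / c_1986) / math.log(growth_per_year)
print(f"upper bound on duration: {max_years:.0f} years")  # ~500 years under these assumptions
```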
We challenged our intern @ramadan_al76760 (zero prior AI experience) to beat the CIFAR-10 training speed record using @hivergeai's algorithmic discovery engine. Result: Sub-2-second (!!) training for the first time ever.
New CIFAR-10 training speed record: 94% in 1.99 seconds on one A100 Previous record: 2.59 seconds (Nov. 10th 2024) New record-holder: Algorithmic discovery engine developed by @hivergeai Changelog: - Muon: Vectorize NS iter and reduce frequency of 'normalize weights' step 1/3
2
7
119
- TTA: Skip for easy examples - Thermal throttling: Sleep for 8s between runs (only affects average not record time) Note: The authors reported a time of 2.02 seconds. My reproduction (torch 2.7.0; hardware seen below) had a min time of 1.99s. Code: https://t.co/qPN6oebF5T 3/3
1
4
66
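A minimal sketch of the "skip TTA for easy examples" idea: run extra test-time augmentation only when the plain forward pass is below a confidence threshold. The threshold and the single-flip augmentation are illustrative, not the record submission's values.

```python
# Illustrative selective TTA: confident ("easy") examples keep their plain
# prediction; only low-confidence examples get an extra augmented pass.
import torch

@torch.no_grad()
def predict_with_selective_tta(model, x, threshold=0.99):
    logits = model(x)
    probs = logits.softmax(dim=-1)
    conf, pred = probs.max(dim=-1)
    hard = conf < threshold                       # only these examples get TTA
    if hard.any():
        flipped = torch.flip(x[hard], dims=[-1])  # e.g. horizontal flip
        logits_tta = model(flipped)
        logits[hard] = (logits[hard] + logits_tta) / 2
        pred[hard] = logits[hard].argmax(dim=-1)
    return pred
```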
- Data aug: Add color jitter and vectorize random crop - Compilation: Compile xent fwd/bwd - Architecture: Replace GELU with SiLU, use SVD for first layer init, and use channels_last format with fp16 for all convs - Hparams: tweaks including bsz 2000 -> 1536 & epochs 8 -> 7.6 2/3
1
3
42
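An illustrative sketch of two of the listed changes, the GELU-to-SiLU swap and channels_last fp16 convolutions; the layer shapes are placeholders, not the record network's architecture.

```python
# Placeholder block showing SiLU activations plus channels_last + fp16 convs.
# Requires a CUDA GPU; shapes are illustrative only.
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1, bias=False),
    nn.SiLU(),                                   # GELU -> SiLU swap
)
block = block.to(memory_format=torch.channels_last).half().cuda()

x = torch.randn(512, 3, 32, 32, device="cuda", dtype=torch.float16)
x = x.to(memory_format=torch.channels_last)
out = block(x)
```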
New CIFAR-10 training speed record: 94% in 1.99 seconds on one A100 Previous record: 2.59 seconds (Nov. 10th 2024) New record-holder: Algorithmic discovery engine developed by @hivergeai Changelog: - Muon: Vectorize NS iter and reduce frequency of 'normalize weights' step 1/3
5
40
386
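For context, a sketch of the Newton-Schulz orthogonalization step that Muon applies to gradient matrices, written to run over a stacked batch of matrices at once ("vectorized"). The quintic coefficients follow the public Muon implementation; the batching shown here is illustrative rather than the record submission's exact code.

```python
# Batched Newton-Schulz iteration (Muon-style orthogonalization sketch).
import torch

def newton_schulz_batched(G, steps=5, eps=1e-7):
    # G: (num_matrices, m, n) stack of gradient matrices
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16()
    X = X / (X.norm(dim=(-2, -1), keepdim=True) + eps)
    transpose = X.size(-2) > X.size(-1)
    if transpose:
        X = X.mT
    for _ in range(steps):
        A = X @ X.mT
        B = b * A + c * A @ A
        X = a * X + B @ X
    if transpose:
        X = X.mT
    return X.to(G.dtype)
```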
Excited to release new repo: nanochat! (it's among the most unhinged I've written). Unlike my earlier similar repo nanoGPT which only covered pretraining, nanochat is a minimal, from scratch, full-stack training/inference pipeline of a simple ChatGPT clone in a single,
664
3K
24K
There's been significant recent progress in the NanoGPT speedrun. Highly recommend this post by @classiclarryd
https://t.co/QNoI4wVAJg
lesswrong.com
In early 2024 Andrej Karpathy stood up an llm.c repo to train GPT-2 (124M), which took an equivalent of 45 minutes on 8xH100 GPUs to reach 3.28 cross…
11
57
513
Down to 146.8s on modded-nanogpt! https://t.co/OV0TaesL4I Surprising result: Different parameter groups have different sensitivity to batch size. Instead of picking a single batch size, gradient accumulation can be managed per parameter group to simulate a different batch size for each.
github.com
This submission reflects all recent WR changes up to PR#134. The main contribution of this PR is to introduce the concept of variable batch size by parameter group, by having different gradient acc...
1
4
40
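A minimal sketch of per-parameter-group gradient accumulation under an assumed grouping: a "fast" optimizer steps every micro-batch, while a "slow" optimizer accumulates k micro-batches before stepping, simulating a k-times-larger batch size for its parameters only. The split and k are hypothetical, not the PR's settings.

```python
# Illustrative per-group accumulation: opt_fast and opt_slow hold disjoint
# parameter groups of the same model; data_iter yields (x, y) micro-batches.
import torch

def train_steps(model, opt_fast, opt_slow, data_iter, loss_fn, k=4, n_micro=32):
    opt_fast.zero_grad(set_to_none=True)
    opt_slow.zero_grad(set_to_none=True)
    for i in range(n_micro):
        x, y = next(data_iter)
        loss_fn(model(x), y).backward()
        opt_fast.step()                           # effective batch = 1 micro-batch
        opt_fast.zero_grad(set_to_none=True)
        if (i + 1) % k == 0:
            # Slow-group grads have summed over k micro-batches; scale its lr
            # (or divide grads by k) to match a true k-times-larger batch.
            opt_slow.step()
            opt_slow.zero_grad(set_to_none=True)
```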
Even with full-batch gradients, DL optimizers defy classical optimization theory, as they operate at the *edge of stability.* With @alex_damian_, we introduce "central flows": a theoretical tool to analyze these dynamics that makes accurate quantitative predictions on real NNs.
19
210
1K
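As background, "edge of stability" refers to the observation that for gradient descent with step size η, the sharpness (top Hessian eigenvalue) rises and then hovers near 2/η. Below is a sketch for estimating that sharpness via power iteration on Hessian-vector products; it is illustrative and not the paper's code.

```python
# Estimate the top Hessian eigenvalue ("sharpness") of a loss w.r.t. params
# using power iteration on Hessian-vector products.
import torch

def sharpness(loss, params, iters=20):
    params = [p for p in params if p.requires_grad]
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    for _ in range(iters):
        hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
        norm = torch.sqrt(sum((h * h).sum() for h in hv))
        v = [h / (norm + 1e-12) for h in hv]
    hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
    return sum((h * u).sum() for h, u in zip(hv, v)).item()  # Rayleigh quotient
```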
Like, I think there’s a strong case to ask for better citation. But it should be recognized that using residuals to get ImageNet SOTA has value in and of itself. Otherwise it’s as if Fermat walked on stage with Andrew Wiles and said “guys what’s the big deal, I told you this already”
9
4
76
The problem with this line is that any history of neural network science that focuses only on the ideas, while ignoring the quality of evidence given for those ideas, is as incomplete as a history of math that focuses only on the theorems w/o the proofs https://t.co/G9p3V8m8an
The most cited paper of the 21st century is on deep residual learning with residual connections. Who invented this? Timeline: ★ 1991: @HochreiterSepp solves vanishing gradient problem through recurrent residual connections (weight 1.0) ★ 1997 LSTM: plain recurrent residual
5
11
237
…when it’s much more likely that only the explosion is characterized that way, not the future. Another easy inference: Neither global population-explosion nor population-collapse can be “tensions of the far-future as such.” Since they both stop themselves in the near-future.
0
0
2
For the purposes of Figuring Out Wtf Is Going On, it seems crucial to keep the explosion in mind. For example, someone looking only at the distinctive qualities of the present might say “the future is characterized by parents being perplexed by the lives of their children.” …
1
0
5
The singular thing that we can say about the far-future is that it will almost-always *not* be undergoing an explosion. Since in a finite universe, exponentials run out. So in that sense the far-future will look more like the far-past than the present. https://t.co/pXMDnZWNZf
good reminder that we're living in an explosion, funny that things still seem slow & normal day to day and recorded history is still only around one millionth of earth's full history hard to imagine what future will look like +250k years from here
1
0
14
Random idea: It should be possible to make any dataset autoregressively unlearnable by prefixing each of its contexts with (encrypted ciphertext of that context, randomized decryption key, execution trace of decryption) => every next-token is either deterministic or noise-like.
9
2
169
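A toy sketch of the construction using a one-time-pad XOR cipher as a stand-in encryption scheme: each context is prefixed with its ciphertext, a fresh random key, and a byte-by-byte decryption trace, so every later token is either a deterministic function of earlier tokens or noise-like.

```python
# Illustrative only: XOR one-time pad stands in for "encryption", and the
# decryption trace is simply the bytes recovered step by step.
import os

def make_unlearnable(context: bytes) -> bytes:
    key = os.urandom(len(context))                           # fresh random key per context
    ciphertext = bytes(c ^ k for c, k in zip(context, key))  # noise-like tokens
    # "Execution trace" of decryption: each step reveals one decrypted byte,
    # fully determined by the ciphertext and key already in the prefix.
    trace = bytes(ct ^ k for ct, k in zip(ciphertext, key))
    return ciphertext + key + trace + context
```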
Great to see this effort towards rigorous hyperparameter tuning. Two areas for improvement: 1. IIUC, the scaled up run here isn't actually tuned at all - its hparams are set via extrapolation 2. Sensitive hparams need a more granular sweep than power-of-2 https://t.co/O5vG58q3Wx
We did a very careful study of 10 optimizers with no horse in the race. Despite all the excitement about Muon, Mars, Kron, Soap, etc., at the end of the day, if you tune the hyperparameters rigorously and scale up, the speedup over AdamW diminishes to only 10% :-( Experiments
4
10
172
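On point 2, a sketch of what a finer-than-power-of-2 sweep looks like: a geometric grid with ratio 2**0.5 locates the best value of a sensitive hyperparameter within roughly ±20% instead of ±40%. The center value and grid size below are illustrative.

```python
# Geometric sweep grid with a finer ratio than the usual factor-of-2 spacing.
import numpy as np

def geometric_grid(center, n_points=9, ratio=2**0.5):
    exponents = np.arange(n_points) - n_points // 2
    return center * ratio ** exponents

print(geometric_grid(3e-4))   # e.g. candidate learning rates around an assumed center of 3e-4
```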