Keller Jordan

@kellerjordan0

Followers: 9K · Following: 3K · Media: 162 · Statuses: 1K

CIFAR-10 fanatic @OpenAI

Joined March 2016
@kellerjordan0
Keller Jordan
2 years
Neural network trainings are nondeterministic. Repeated runs each produce a unique network, often with significantly _varying_ test-set performance. 🆕📜 I demonstrate that this variation has a simple statistical structure, and is harmless & inevitable.
Tweet media one
25
152
1K
@kellerjordan0
Keller Jordan
3 months
Some trivia: In November I interviewed at both OpenAI & xAI. I thought both labs seemed strong, even tho ppl said xAI was a noncontender back then. But in the end, which to join was an easy choice, because-- the xAI guys told me all my ideas must be wrong & rejected me ¯\_(ツ)_/¯.
30
33
1K
@kellerjordan0
Keller Jordan
4 months
Unfortunately, it is hard to trust *claims* in 2025. What’s easier to trust is *incentives*. So here’s an incentive: I’ll pay a $3,000 bounty to the first person who uses this method to improve either the NanoGPT or CIFAR-10 speedruns.
@dbaek__
David D. Baek
4 months
1/9 🚨 New Paper Alert: Cross-Entropy Loss is NOT What You Need! 🚨 We introduce harmonic loss as an alternative to the standard CE loss for training neural networks and LLMs! Harmonic loss achieves 🛠️significantly better interpretability, ⚡faster convergence, and ⏳less grokking!
27
66
1K
@kellerjordan0
Keller Jordan
8 months
New training speed record for @karpathy’s 124M-parameter NanoGPT setup: 3.28 Fineweb validation loss in 3.7B training tokens. Previous record: 5B tokens. Changelog: new optimizer. 1/8
Tweet media one
27
81
1K
@kellerjordan0
Keller Jordan
1 year
Here's a variant of @karpathy's NanoGPT which trains twice as fast, reaching GPT-2 level quality in 5B tokens instead of the original 10B. It uses rotary embeddings and an improved lr schedule.
14
75
1K
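The rotary embeddings mentioned in this record encode position by rotating each pair of query/key channels through an angle proportional to the token position. A minimal sketch, with the base frequency and channel-pairing convention chosen for illustration rather than taken from the speedrun repo:

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embeddings to x of shape (..., seq, dim), dim even.

    Illustrative sketch: channels are paired as (even, odd) and each pair is
    rotated by pos * theta_i; real implementations differ in pairing and caching.
    """
    seq, dim = x.shape[-2], x.shape[-1]
    half = dim // 2
    theta = base ** (-torch.arange(half, dtype=torch.float32) / half)   # per-pair frequencies
    angles = torch.arange(seq, dtype=torch.float32)[:, None] * theta    # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1).flatten(-2)

q = torch.randn(1, 8, 128, 64)   # (batch, heads, seq, head_dim)
q_rotated = rope(q)              # same shape; relative positions are now encoded in the phases
```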
@kellerjordan0
Keller Jordan
4 months
Big GPU doesn't want you to know this but you can actually learn even more about the nature of neural network training from speedrunning CIFAR-10 than NanoGPT, since the experiments are so fast. I've personally trained 15 million CIFAR-10 models.
18
46
935
@kellerjordan0
Keller Jordan
4 months
Btw, I joined the OpenAI OpCo, LLC. Excited to do some science and contribute stuff into some big training runs. Yep, and shout-out to my amazing mentors @adamlerer and @mobav0!
Tweet media one
62
12
634
@kellerjordan0
Keller Jordan
8 months
New NanoGPT training speed record: 3.28 Fineweb validation loss in 15.2 minutes. Previous record: 22.3 minutes.
Changelog:
- pad embedding to nearest 64
- switch from GELU to ReLU²
- zero-init projection layers
- QKNorm
All four changes driven by @Grad62304977. 1/8
Tweet media one
Tweet media two
30
48
552
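Three of the four changes above (ReLU², zero-init projections, QKNorm) are easy to show in isolation. A minimal sketch, with an L2-style QK normalization assumed; the record's exact formulations (per-head scales, norm variant) may differ:

```python
import torch
import torch.nn.functional as F

def relu_squared(x: torch.Tensor) -> torch.Tensor:
    # "switch from GELU to ReLU²": square the ReLU output elementwise
    return F.relu(x).square()

def qk_norm(q: torch.Tensor, k: torch.Tensor):
    # "QKNorm": normalize queries and keys along the head dimension before the
    # dot product, so attention logits stay bounded regardless of q/k scale.
    return F.normalize(q, dim=-1), F.normalize(k, dim=-1)

# Illustrative shapes: (batch, heads, seq, head_dim)
q, k = torch.randn(2, 4, 16, 32), torch.randn(2, 4, 16, 32)
q, k = qk_norm(q, k)
logits = q @ k.transpose(-2, -1)          # every entry now lies in [-1, 1]

# "zero-init projection layers": each residual block initially contributes nothing
out_proj = torch.nn.Linear(128, 128, bias=False)
torch.nn.init.zeros_(out_proj.weight)
```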
@kellerjordan0
Keller Jordan
4 months
The reason I didn't write a proper arxiv paper for Muon is because I simply don't think there's any relationship between the ability to publish a paper with lots of good-looking results about a new optimizer, and whether that optimizer actually works. I only trust speedruns.
13
11
453
@kellerjordan0
Keller Jordan
8 months
New training speed record for @karpathy's NanoGPT setup: 3.28 Fineweb val loss in 22.3 minutes. Previous record: 24.9 minutes.
Changelog:
- Removed learning rate warmup, since the optimizer (Muon) doesn't need it
- Rescaled Muon's weight updates to have unit variance per param
1/5
Tweet media one
Tweet media two
12
41
437
@kellerjordan0
Keller Jordan
7 months
It's a new day, and here's a new NanoGPT speedrun record: 3.28 FineWeb val loss in 8.2 minutes on 8xH100. Previous record: 10.8 minutes.
Changelog:
- architectural shortcuts
- momentum warmup
- tanh logit capping
By @Grad62304977 and myself. 1/6
Tweet media one
15
39
411
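Of the three changes, tanh logit capping is the most self-contained: logits are squashed smoothly into a fixed range instead of being clipped. A minimal sketch, with the cap value chosen for illustration (a later record in this timeline lowers the output-logit softcap from 30 to 15):

```python
import torch

def soft_cap(logits: torch.Tensor, cap: float) -> torch.Tensor:
    # Tanh soft-capping: values are squashed smoothly into (-cap, cap),
    # so extreme logits saturate without the hard cutoff of clipping.
    return cap * torch.tanh(logits / cap)

x = torch.tensor([-100.0, -10.0, 0.0, 10.0, 100.0])
print(soft_cap(x, cap=15.0))   # large values saturate near ±15, small ones pass through almost unchanged
```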
@kellerjordan0
Keller Jordan
7 months
New result: For a 124M model, apparently:
Repeating 2B tokens 5 times: ❌ much worse than 10B tokens
Repeating 10B tokens 5 times: ✅ just as good as 50B
⇒ Conjecture: Repeating data is only really bad when you have less than the Chinchilla optimal amount.
Tweet media one
14
39
392
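For context on the conjecture: the commonly quoted Chinchilla rule of thumb is roughly 20 training tokens per parameter, which for this 124M-parameter model lands near 2.5B tokens, above the 2B-token case and below the 10B-token case. A tiny check of that arithmetic, taking the ~20 tokens/param figure as an assumption:

```python
params = 124e6
tokens_per_param = 20                                # rough Chinchilla rule of thumb (assumed here)
chinchilla_optimal = params * tokens_per_param
print(f"~{chinchilla_optimal / 1e9:.1f}B tokens")    # ~2.5B: 2B sits below it, 10B well above it
```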
@kellerjordan0
Keller Jordan
1 year
New training speed record for CIFAR-10: 94% accuracy in 3.29 seconds on a single GPU. Paper: Code:
11
56
366
@kellerjordan0
Keller Jordan
3 months
This is an exciting moment: The world's first report on successful large-scale training with a super-Adamic optimizer. Congratulations to the @Kimi_Moonshot team and to every Muon contributor: @Yuchenj_UW @bozavlado @YouJiacheng @leloykun L. Newhouse @jxbz.
@Kimi_Moonshot
Kimi.ai
3 months
🚀 Introducing our new tech report: Muon is Scalable for LLM Training. We found that the Muon optimizer can be scaled up using the following techniques:
• Adding weight decay
• Carefully adjusting the per-parameter update scale
✨ Highlights:
• ~2x computational efficiency vs AdamW
Tweet media one
Tweet media two
7
29
374
@kellerjordan0
Keller Jordan
7 months
I think the old mechanisms of academic trust don't really work anymore.
- Formal peer review
- Professional track records
- Affiliation to elite institutions
Maybe what works is:
- Publishing code that can easily reproduce a result, & cheerfully inviting attempts to disprove it.
17
20
361
@kellerjordan0
Keller Jordan
7 months
New NanoGPT training speed record: 12.03 minutes. Previous record: 13.05 minutes. Changelog: Updated PyTorch to version 2.5
Tweet media one
11
12
336
@kellerjordan0
Keller Jordan
8 months
For low-precision neural network training, I'm finding that ternary weights ({-1, 0, 1}) are consistently outperformed by septernary ({-2, -1, -0.5, 0, 0.5, 1, 2}). The latter format needs 40% fewer parameters to match fp16 performance, while consuming only 5% more bits.
Tweet media one
Tweet media two
13
30
330
@kellerjordan0
Keller Jordan
6 months
All recent NanoGPT training speed records have used Muon as their optimizer. Here's a writeup describing everything we know about it:.
14
41
333
@kellerjordan0
Keller Jordan
3 years
Along with many others, I find the results of Git Re-Basin by @SamuelAinsworth, J. Hayase & @siddhss5 highly interesting. But I believe there is a crucial detail which deserves attention: The authors replace BatchNorm with LayerNorm in their ResNet and VGG implementations. 1/14
6
51
320
@kellerjordan0
Keller Jordan
2 years
Something amusing in neural network optimization: There exists an algorithm which fits training data >5x faster than SGD/Adam. But it’s useless in practice. Let me explain. (1/4). Code:
Tweet media one
15
38
318
@kellerjordan0
Keller Jordan
8 months
I enjoy getting NanoGPT training speed records. I’m also interested in making my formulation of NanoGPT speedrunning an accessible benchmark on which other people find it easy to try new ideas. To that end, I have tried to keep the code of the current record short, and minimize.
14
28
323
@kellerjordan0
Keller Jordan
6 months
New NanoGPT training speed record: 3.28 FineWeb val loss in 5.03 minutes. Previous record: 7.2 minutes. Changelog: FlexAttention with large sequence length. This record is by @KoszarskyB
Tweet media one
11
22
308
@kellerjordan0
Keller Jordan
8 months
NanoGPT speedrunning update: Using the SOAP optimizer, @vyasnikhil96 has achieved a new sample efficiency record of 3.28 Fineweb validation loss in 3.25B training tokens. The previous record was 3.67B tokens by my proposed optimizer.
Tweet media one
20
34
296
@kellerjordan0
Keller Jordan
7 months
Here's an implementation of the Muon optimizer which can be used as a drop-in replacement for AdamW.
6
34
289
@kellerjordan0
Keller Jordan
7 months
New NanoGPT speedrunning result: Shortcut connections scale with training duration. In the 11/06 NanoGPT speedrunning record, we added two shortcut connections to the transformer, giving all blocks access to certain states from the first block. This reduced the number of tokens
Tweet media one
9
19
287
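The shortcut idea here (and the later record's "U-net-like connectivity pattern") can be sketched as: early blocks push their activations onto a stack, and later blocks mix them back in through learned, zero-initialized scalars. A minimal sketch with made-up module names; the record's actual wiring and scaling differ:

```python
import torch
import torch.nn as nn

class UNetStyleStack(nn.Module):
    """Block stack with U-net-like skips (illustrative only).

    The first half of the blocks save their outputs; the second half add them
    back in mirror order, weighted by learned scalars initialized to zero.
    """
    def __init__(self, block_factory, n_blocks: int, dim: int):
        super().__init__()
        assert n_blocks % 2 == 0
        self.blocks = nn.ModuleList(block_factory(dim) for _ in range(n_blocks))
        self.skip_weights = nn.Parameter(torch.zeros(n_blocks // 2))

    def forward(self, x):
        half = len(self.blocks) // 2
        skips = []
        for block in self.blocks[:half]:
            x = block(x)
            skips.append(x)
        for i, block in enumerate(self.blocks[half:]):
            x = x + self.skip_weights[i] * skips.pop()   # mirror-image skip connection
            x = block(x)
        return x

# Toy block for demonstration; the speedrun uses attention+MLP transformer blocks.
stack = UNetStyleStack(lambda d: nn.Sequential(nn.Linear(d, d), nn.GELU()), n_blocks=4, dim=64)
out = stack(torch.randn(2, 16, 64))
```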
@kellerjordan0
Keller Jordan
10 months
How does the learning rate used to train a neural network affect its predictions? For certain toy models, it can be chaotic. But here’s a demonstration that for standard convnet training its effect is simple - even locally linear - once we average over repeated runs. 🧵1/7
3
29
280
@kellerjordan0
Keller Jordan
1 year
I'm interested in this recent ICLR 2024 spotlight paper from Google research, which found a power-law alignment between bias and variance in softmax probability space. In this thread I'll replicate its central empirical result, but then argue that it
Tweet media one
Tweet media two
Tweet media three
Tweet media four
9
44
257
@kellerjordan0
Keller Jordan
7 months
Here's a new result in NanoGPT speedrunning: Straightforwardly scaling up the speedrun yields a training that reaches GPT-2 (1.5B)'s level of performance in 7.3 hours on 8xH100. The previous record for this target was 24 8xH100-hours by @karpathy using llm.c. 1/10
Tweet media one
Tweet media two
6
24
257
@kellerjordan0
Keller Jordan
7 months
New NanoGPT training speed record: 3.28 FineWeb val loss in 7.8 minutes on 8xH100. Previous record: 8.2 minutes. Changelog: Put hidden states in Bfloat16
Tweet media one
8
10
248
@kellerjordan0
Keller Jordan
7 months
New NanoGPT training speed record: 3.28 FineWeb val loss in 10.8 minutes on 8xH100. Previous record: 12.0 minutes.
Changelog:
- untie embed and head weights
- add RMSNorm after embed
- initialize head to zero
Driven by @Grad62304977
Tweet media one
12
17
248
@kellerjordan0
Keller Jordan
4 months
There's been community interest in having a larger NanoGPT category to speedrun. So here's a record to kick things off:
New NanoGPT-medium speedrun record: 2.92 FineWeb val loss in 29.3 8xH100-minutes. Prev record: 5.8 hours by @karpathy's llm.c-350M. Method: scaled speedrun
Tweet media one
12
20
240
@kellerjordan0
Keller Jordan
7 months
New NanoGPT training speed record: 3.28 Fineweb val loss in 13.1 minutes. Previous record: 15.2 minutes. Changelog: distributed the overhead of Muon
Tweet media one
14
12
217
@kellerjordan0
Keller Jordan
7 months
New NanoGPT training speed record: 3.28 FineWeb val loss in 7.23 minutes on 8xH100. Previous record: 7.8 minutes.
Changelog:
- Added U-net-like connectivity pattern
- Doubled learning rate
This record is by @brendanh0gan
Tweet media one
9
11
238
@kellerjordan0
Keller Jordan
4 months
New NanoGPT-Medium speedrun record: 2.92 FineWeb val loss in 28.1 minutes on 8xH100. Previous record: 29.3 minutes. Changelog: Added standard weight decay to the Muon optimizer
Tweet media one
7
17
242
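"Standard weight decay" in this changelog most plausibly means decoupled, AdamW-style decay applied directly to the weights alongside the optimizer's update. A minimal sketch of that one step, where `update` stands in for whatever the optimizer (e.g. Muon's orthogonalized momentum) produced for the parameter:

```python
import torch

def step_with_decoupled_weight_decay(p: torch.Tensor, update: torch.Tensor,
                                     lr: float, weight_decay: float) -> None:
    # Decoupled (AdamW-style) decay: shrink the weight toward zero independently
    # of the gradient-based update, then apply the update itself.
    p.data.mul_(1 - lr * weight_decay)
    p.data.add_(update, alpha=-lr)

p = torch.nn.Parameter(torch.randn(768, 768))
step_with_decoupled_weight_decay(p, torch.randn_like(p), lr=0.02, weight_decay=0.01)
```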
@kellerjordan0
Keller Jordan
8 months
New CIFAR-10 speed record: 94% in 2.73 seconds on a single A100. Previous record: 3.09 seconds. Changelog: Implemented spectral gradient descent.
8
17
229
@kellerjordan0
Keller Jordan
5 months
New NanoGPT training speed record: 3.28 FineWeb val loss in 3.4 minutes on 8xH100. Previous record: 3.58 minutes. Change: Lowered logit softcap from 30 to 15. This record was discovered by @KoszarskyB, congratulations!
Tweet media one
11
15
232
@kellerjordan0
Keller Jordan
5 months
I would like to issue a citation request for Muon to the following newly appearing paper from Microsoft Research: Ma et al. (2024). SWAN: Preprocessing SGD Enables Adam-Level Performance On LLM Training With Significant Memory Reduction. 1/5
7
25
227
@kellerjordan0
Keller Jordan
6 months
Here's an interpretability method for neural net trainings that is strong but expensive: Run the training a few thousand times, and then for pairs of inputs, measure the correlation between their final predicted outputs across runs of training. This yields a highly… 1/8🧵
Tweet media one
Tweet media two
5
16
227
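The procedure described above reduces to a correlation matrix over test inputs, computed across repeated training runs. A minimal sketch, assuming each run is summarized by one scalar prediction per example (which scalar to correlate, e.g. the true-class probability, is an assumption here):

```python
import numpy as np

def cross_run_correlation(predictions: np.ndarray) -> np.ndarray:
    """Pairwise correlation between inputs' predictions across training runs.

    predictions: shape (n_runs, n_examples); entry [r, i] is a scalar summary of
    run r's output on example i. Returns an (n_examples, n_examples) matrix.
    """
    return np.corrcoef(predictions, rowvar=False)

rng = np.random.default_rng(0)
preds = rng.normal(size=(2000, 50))   # stand-in for a few thousand runs on 50 test inputs
corr = cross_run_correlation(preds)
print(corr.shape)                      # (50, 50): how similarly each pair of inputs is treated across runs
```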
@kellerjordan0
Keller Jordan
4 months
There's a new Microsoft paper and repo that came out today, and it's using the Muon optimizer! That's fun to see
Tweet media one
Tweet media two
Tweet media three
5
17
203
@kellerjordan0
Keller Jordan
4 months
World record #20 of the NanoGPT speedrun has broken the 3-minute barrier.
Tweet media one
@leloykun
leloy!
4 months
Sub 3-minute NanoGPT Speedrun Record. We're proud to share that we've just breached the 3 min mark! This means that with an ephemeral pod of 8xH100s that costs $8/hour, training a GPT-2-ish level model now only costs $0.40!
---
What's in the latest record? A 🧵
Tweet media one
Tweet media two
Tweet media three
Tweet media four
3
11
189
@kellerjordan0
Keller Jordan
1 year
Here's a new training speed record for CIFAR-10: 96% accuracy in 35 seconds on a single A100. Code:
1
18
177
@kellerjordan0
Keller Jordan
7 months
Nice, looks like X posts are now citable artifacts.
Tweet media one
Tweet media two
@jxbz
Jeremy Bernstein
7 months
Over the past month, methods developed by myself and my collaborators were used to set new speed records for training LLMs up to 1.5B scale. I also want to help the science go faster, so now get ready for: ~The General Theory of Modular Duality~ (1/9)
5
4
178
@kellerjordan0
Keller Jordan
7 months
Woke up to this. Sorry haters 😎.
Tweet media one
@kellerjordan0
Keller Jordan
7 months
Muon haters be like "red will intersect green". Let's see. (in a few hours)
Tweet media one
9
6
177
@kellerjordan0
Keller Jordan
4 months
Here, I'll prove it: "The reason the Sophia paper (from a Stanford lab / has >100 citations) didn't lead to an optimization revolution is because their Adam baseline used a suboptimal learning rate".
Tweet media one
@kellerjordan0
Keller Jordan
4 months
The reason papers alone can't provide strong evidence is because if they contain a mistake (like an untuned hyperparameter), ***nothing happens***. Whereas, if a speedrun contains an untuned hyperparameter, we find out *automatically* in the next record.
6
3
179
@kellerjordan0
Keller Jordan
1 year
Horizontal flipping is the most common data augmentation in machine learning. 🆕📜 I show that it can be improved for free: instead of flipping randomly, flip half the training images on even epochs and the other half on odd epochs. 1/4
Tweet media one
Tweet media two
3
22
164
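The proposed scheme replaces independent random flips with a deterministic alternation: a fixed half of the training set is flipped on even epochs and the other half on odd epochs. A minimal sketch, partitioning by image index parity (the paper's partition may differ):

```python
import torch

def alternating_flip(images: torch.Tensor, epoch: int) -> torch.Tensor:
    """Deterministic alternating horizontal flip for images of shape (N, C, H, W).

    On even epochs the even-indexed images are flipped; on odd epochs the
    odd-indexed ones are. Index parity is just one way to fix the 50/50 split.
    """
    out = images.clone()
    flip_mask = (torch.arange(len(images)) % 2) == (epoch % 2)
    out[flip_mask] = torch.flip(out[flip_mask], dims=[-1])   # flip along the width axis
    return out

batch = torch.randn(8, 3, 32, 32)
epoch0 = alternating_flip(batch, epoch=0)
epoch1 = alternating_flip(batch, epoch=1)   # every image is flipped in exactly one of the two epochs
```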
@kellerjordan0
Keller Jordan
5 months
the Dark Forest effect in AI social media:
if you have any status to lose, then posting negative comments about popular works has low upside and high potential downside
=> knowledgeable people remove themselves from the information ecosystem as predators for 💩 research.
16
2
155
@kellerjordan0
Keller Jordan
6 months
Hey lab: Here is a paper I’d be interested in seeing tried in the NanoGPT speedrun (might require GPU programming).
7
14
152
@kellerjordan0
Keller Jordan
7 months
When I'm reading academic papers, my question is rarely "Does this scale to more datasets and architectures?". Rather I usually just want to know "Can I even trust the one single experiment at the core of the paper?". The extra datasets and architectures just add a little extra.
5
5
153
@kellerjordan0
Keller Jordan
6 months
A blog about Muon by Jianlin Su, the creator of RoPE.
6
10
149
@kellerjordan0
Keller Jordan
6 months
This is officially the new record! Congrats @hi_tysam (who is also an OG of CIFAR-10 speedrunning).
@hi_tysam
Fern
6 months
New NanoGPT training speed record: 3.28 FineWeb val loss in 4.66 minutes. Previous record: 5.03 minutes.
Changelog:
- FlexAttention blocksize warmup
- hyperparameter tweaks
Tweet media one
3
11
144
@kellerjordan0
Keller Jordan
8 months
I've decided to name the optimizer described in this thread `Muon`, because it takes each update matrix produced by standard sgd-MomentUm and replaces it with the nearest Orthogonal matrix using a Newton-schulz iteration. 1/5.
@kellerjordan0
Keller Jordan
8 months
New training speed record for @karpathy’s 124M-parameter NanoGPT setup: 3.28 Fineweb validation loss in 3.7B training tokens. Previous record: 5B tokens. Changelog: new optimizer. 1/8
Tweet media one
1
3
141
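The orthogonalization step that gives Muon its name can be sketched compactly: normalize the momentum matrix, then run a few quintic Newton-Schulz iterations to push its singular values toward 1. A minimal sketch; the coefficients below are the commonly quoted ones and the record code runs this in bfloat16, so treat the details as illustrative rather than canonical:

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately replace G with the nearest semi-orthogonal matrix."""
    a, b, c = 3.4445, -4.7750, 2.0315        # quintic coefficients (illustrative)
    X = G.float()
    X = X / (X.norm() + 1e-7)                # Frobenius norm bounds the spectral norm, so singular values <= 1
    tall = X.size(0) > X.size(1)
    if tall:                                 # iterate on the wide orientation for cheaper matmuls
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return (X.T if tall else X).to(G.dtype)

# Hypothetical single-parameter usage: orthogonalize the momentum buffer, then apply it.
momentum = torch.randn(768, 3072)
update = newton_schulz_orthogonalize(momentum)   # singular values pushed toward 1
# p.data.add_(update, alpha=-lr)
```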
@kellerjordan0
Keller Jordan
3 months
A piece of important media literacy that might be non-obvious to newcomers: The plurality of competent talent in our field is incentivized (by both the employers and the public) not to publicly comment on new papers/research.
10
4
134
@kellerjordan0
Keller Jordan
7 months
New CIFAR-10 training speed record: 94% in 2.59 seconds on a single A100. Previous record: 2.73 seconds. Changelog: Upgraded the proto-Muon that I used to set the previous record to the full Muon optimizer.
2
6
128
@kellerjordan0
Keller Jordan
10 months
New CIFAR-10 speed record: 96% in 27.3 seconds on a single A100. Previous record: 34.7 seconds. Changelog: Introduced a small proxy run, the losses of which are used to filter data during the main run.
6
8
125
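The changelog's proxy-run idea can be read as: train a tiny, cheap model first, record a per-example loss, and use it to decide which examples the main run sees. The selection rule below (drop a fraction of the lowest-loss, i.e. easiest, examples) is an assumption for illustration, not necessarily the record's criterion:

```python
import torch

def filter_by_proxy_loss(per_example_loss: torch.Tensor, drop_frac: float = 0.2) -> torch.Tensor:
    """Return indices of training examples to keep for the main run.

    Assumed rule: drop the `drop_frac` lowest-loss (easiest) examples as judged
    by the proxy run; the actual record's selection criterion may differ.
    """
    n_drop = int(drop_frac * per_example_loss.numel())
    order = per_example_loss.argsort()     # ascending: easiest examples first
    return order[n_drop:]

proxy_losses = torch.rand(50_000)          # e.g. one proxy-run loss per CIFAR-10 training image
keep_idx = filter_by_proxy_loss(proxy_losses)
print(keep_idx.numel())                    # 40000 examples survive the filter
```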
@kellerjordan0
Keller Jordan
4 months
The reason papers alone can't provide strong evidence is because if they contain a mistake (like an untuned hyperparameter), ***nothing happens***. Whereas, if a speedrun contains an untuned hyperparameter, we find out *automatically* in the next record.
@kellerjordan0
Keller Jordan
4 months
The reason I didn't write a proper arxiv paper for Muon is because I simply don't think there's any relationship between the ability to publish a paper with lots of good-looking results about a new optimizer, and whether that optimizer actually works. I only trust speedruns.
6
5
127
@kellerjordan0
Keller Jordan
7 months
Here is a comparison of the best optimizers I know about for NanoGPT speedrunning. Reproducible logs:
Tweet media one
Tweet media two
4
8
129
@kellerjordan0
Keller Jordan
5 months
Monthly reminder that Muon is still the strongest known optimizer for the most highly tuned public language model training benchmark.
@kellerjordan0
Keller Jordan
6 months
All recent NanoGPT training speed records have used Muon as their optimizer. Here's a writeup describing everything we know about it:.
8
9
126
@kellerjordan0
Keller Jordan
5 months
Thank you to the authors of SWAN, for honoring my citation request for Muon! Muon has no arxiv paper, but it has open code and *frictionlessly reproducible success on a competitive benchmark* (NanoGPT speedrunning). I am glad to see that standard of evidence honored by citation!
Tweet media one
Tweet media two
Tweet media three
Tweet media four
8
7
120
@kellerjordan0
Keller Jordan
8 months
The new optimizer is defined as follows. It is based on orthogonalizing the update given by SGD-Nesterov-momentum in an efficient way
Tweet media one
Tweet media two
6
3
121
@kellerjordan0
Keller Jordan
8 months
NanoGPT speedrunning update: @bozavlado discovered that the new optimizer performs ~3% better if we orthogonalize the QKV updates separately rather than together. I replicated this and found that it also holds for SOAP; it was used in yesterday’s record.
@bozavlado
Vlado Boza
8 months
@kellerjordan0 Yes, but with your older code (with warmup and w/o scaling by number of elements). Also this could be seed dependent, etc. Take it with very huge grain of salt.
Tweet media one
8
9
116
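The reported tweak is purely structural: when Q, K, and V share one fused weight matrix, orthogonalize the three blocks of its update independently instead of the fused matrix as a whole. A minimal sketch, assuming the fused weight stacks Q, K, V along the output dimension, with the Newton-Schulz helper restated so the snippet stands alone:

```python
import torch

def newton_schulz_orthogonalize(G, steps=5, a=3.4445, b=-4.7750, c=2.0315):
    # Compact quintic Newton-Schulz iteration (coefficients illustrative).
    X = G / (G.norm() + 1e-7)
    tall = X.size(0) > X.size(1)
    if tall:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if tall else X

def orthogonalize_qkv_separately(update: torch.Tensor) -> torch.Tensor:
    # Assumes a fused QKV weight of shape (3*d, d): split along dim 0,
    # orthogonalize each block on its own, then re-concatenate.
    q, k, v = update.chunk(3, dim=0)
    return torch.cat([newton_schulz_orthogonalize(m) for m in (q, k, v)], dim=0)

d = 64
fused_update = torch.randn(3 * d, d)
separate = orthogonalize_qkv_separately(fused_update)   # the variant reported ~3% better
together = newton_schulz_orthogonalize(fused_update)    # orthogonalizing the fused matrix at once
```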
@kellerjordan0
Keller Jordan
3 years
Why don’t current model merging results generalize to standard ConvNets? And how can this be fixed? We answer these Qs and present a method that improves merged NN performance for any choice of norm layer. W/ @HanieSedghi @osaukh @rahiment @bneyshabur
Tweet media one
2
23
112
@kellerjordan0
Keller Jordan
9 months
There should be a monthly GPT-2 training speedrunning competition. Fixed dataset. $50K prize to the team who gets the best validation loss after pretraining for 100 H100-hours. Imagine how real and how open things would get. Btw, I would lose bc I don’t know CUDA.
10
3
114
@kellerjordan0
Keller Jordan
11 months
A small result about GPT-2 training: Warming up for too long has a simple and predictable effect on the loss curve. 🧵
Tweet media one
9
11
111
@kellerjordan0
Keller Jordan
8 months
It uses half the memory of AdamW and takes 3% extra wallclock time per step for this setup. Here's code to reproduce the result:
1
4
108
@kellerjordan0
Keller Jordan
4 months
@khandelia1000 My crazy answer is "all of it.". There are many *techniques* which don't transfer, but in my experience it's always comprehensible and informative *why* they don't transfer, so you end up learning anyway.
2
0
108
@kellerjordan0
Keller Jordan
6 months
In the 2020 era I heard that BatchNorm networks are uniquely bad at adapting to distribution shifts. But I never actually saw it be tested. So I ran the experiment and found that it’s not true: BatchNorm networks adapt about as well as Norm-Free networks. (Below is CIFAR-10C.) 1/8
Tweet media one
2
5
107
@kellerjordan0
Keller Jordan
9 months
My thought immediately after seeing the figure: the gap between the proposed method and Adam is too large, so the Adam run must have been improperly tuned.
@NousResearch
Nous Research
9 months
What if you could use all the computing power in the world to train a shared, open source AI model? Preliminary report: Nous Research is proud to release a preliminary report on DisTrO (Distributed Training Over-the-Internet), a family of
Tweet media one
3
2
106
@kellerjordan0
Keller Jordan
2 years
@nearcyan The Minecraft redstone community has higher standards of academic rigor than ML research. Might be hard to convince them to switch to the lesser field.
0
3
91
@kellerjordan0
Keller Jordan
7 months
Interesting paper. What confuses me is that the theory doesn’t seem to predict the need for a roughly constant-fraction decay duration. Why do we need to decay for ~1000 steps in a 10K step training, and ~10K steps in a 100K step training?
@tengyuma
Tengyu Ma
7 months
WSD learning rate is taking off—lower loss, no pre-set compute budget, & easier continual training. Yet, its loss curve is puzzling—high in stable phase but jumps in decay phase. Our paper explains it with a 'River Valley' structure of the loss! 🧵🧵
Tweet media one
5
5
99
@kellerjordan0
Keller Jordan
8 months
Here are three reasons to be skeptical regarding the claim that ternary weights (1.58 bits) are just as good as full-precision:
1. [Loss curves from `Era of 1 bit LLMs`]: In the loss curves for the project which were released on GitHub (but are absent from the arXiv), there is.
5
14
95
@kellerjordan0
Keller Jordan
2 years
So in conclusion: it’s actually very easy/fast to find zero-loss solutions to typical supervised learning problems in DL. Whereas SGD/Adam are inefficient - but make up for it by having an “implicit bias” towards solutions with much better generalization. (4/4).
3
0
93
@kellerjordan0
Keller Jordan
8 months
I shall be using GPUs very graciously provided by @Yuchenj_UW @hyperbolic_labs to search for optimal hyperparameters for both this optimizer as well as DistributedShampoo, for the purpose of NanoGPT speedrunning, on the recommendation of @_arohan_ .
@kellerjordan0
Keller Jordan
8 months
New training speed record for @karpathy’s 124M-parameter NanoGPT setup: 3.28 Fineweb validation loss in 3.7B training tokens. Previous record: 5B tokens. Changelog: new optimizer. 1/8
Tweet media one
7
3
93
@kellerjordan0
Keller Jordan
1 year
Happy to say this project has been accepted to ICLR. One of its central results: for stable binary classification trainings, it is possible to bound variance between runs *a priori* via the formula: var(error rate) <= (avg err rate) / (2 × num test examples).
Tweet media one
@kellerjordan0
Keller Jordan
2 years
Neural network trainings are nondeterministic. Repeated runs each produce a unique network, often with significantly _varying_ test-set performance. 🆕📜 I demonstrate that this variation has a simple statistical structure, and is harmless & inevitable.
Tweet media one
4
8
90
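The bound above can be plugged in directly to see the scale it predicts. A tiny worked example with illustrative numbers (CIFAR-10's 10,000 test images and a ~6% average error rate):

```python
import math

# From the tweet: var(error rate) <= (avg err rate) / (2 * num test examples)
avg_err = 0.06          # e.g. a ~94%-accuracy CIFAR-10 model (illustrative)
n_test = 10_000         # CIFAR-10 test set size

std_bound = math.sqrt(avg_err / (2 * n_test))
print(f"run-to-run std of test error <= {std_bound:.4%}")   # about 0.17%
```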
@kellerjordan0
Keller Jordan
10 months
New CIFAR-10 speed record: 94% in 3.09 seconds on a single NVIDIA A100. Previous record: 3.29 seconds. Changelog: Upgraded PyTorch to version 2.4.
2
1
88
@kellerjordan0
Keller Jordan
10 months
A tiny research contribution: here's an explanation for a fact about neural networks that was considered surprising in 2022. @yidingjiang et al. (2022) in their paper "Assessing Generalization via Disagreement" observe with surprise that the
Tweet media one
6
6
86
@kellerjordan0
Keller Jordan
7 months
I performed this experiment to obtain a response to the following two critiques:
- The methods only work for small models and won't scale. (well, they scale to 1.5B at least)
- The methods only help val loss and not downstream perf. (nope, they do help)
@kellerjordan0
Keller Jordan
7 months
Here's a new result in NanoGPT speedrunning: Straightforwardly scaling up the speedrun yields a training that reaches GPT-2 (1.5B)'s level of performance in 7.3 hours on 8xH100. The previous record for this target was 24 8xH100-hours by @karpathy using llm.c. 1/10
Tweet media one
Tweet media two
4
2
83
@kellerjordan0
Keller Jordan
1 year
Today I learned that Neyshabur et al. had a figure showing double descent way back in 2014
Tweet media one
5
7
82
@kellerjordan0
Keller Jordan
1 year
Jumping onto this train, here's an SGD-Nesterov that outperforms both.
@aaron_defazio
Aaron Defazio
1 year
Schedule-Free (dotted black line) outperforming highly tuned SGD!
1
9
81
@kellerjordan0
Keller Jordan
6 months
@karpathy 🙏 latest record (7.2min -> 5min) is the work of @KoszarskyB.
0
1
75
@kellerjordan0
Keller Jordan
8 months
I would like to thank & acknowledge @jxbz for sending me his recent paper, which is where I learned about the crucial Newton-Schulz iteration method. He also had the insight that my initial quintic coefficients could be improved. 7/8.
1
5
74
@kellerjordan0
Keller Jordan
4 months
Tweet media one
4
5
74
@kellerjordan0
Keller Jordan
8 months
.@PrimeIntellect has donated a number of H100-hours to support the continuation of this research. Thank you @vincentweisser!.
@kellerjordan0
Keller Jordan
8 months
New training speed record for @karpathy’s 124M-parameter NanoGPT setup: 3.28 Fineweb validation loss in 3.7B training tokens. Previous record: 5B tokens. Changelog: new optimizer. 1/8
Tweet media one
1
4
77
@kellerjordan0
Keller Jordan
7 months
Muon haters be like "red will intersect green". Let's see. (in a few hours)
Tweet media one
2
1
73
@kellerjordan0
Keller Jordan
8 months
There are some simple ways that all optimizer research can go wrong: e.g., my AdamW baseline could be poorly tuned. So I hereby invite anyone to try to get a better AdamW baseline than I did in this setup; I'll happily boost/RT your result if you can.
@kellerjordan0
Keller Jordan
8 months
New training speed record for @karpathy’s 124M-parameter NanoGPT setup: 3.28 Fineweb validation loss in 3.7B training tokens. Previous record: 5B tokens. Changelog: new optimizer. 1/8
Tweet media one
4
2
68
@kellerjordan0
Keller Jordan
1 year
Thanks @francoisfleuret for these great questions. The answer is that there's no single key idea: I'm using six different techniques which each contribute to the final speed. Each one has different generalization properties. The techniques are (1).
@francoisfleuret
François Fleuret
1 year
@kellerjordan0 Can you TL;DR ? Is there a key idea or just piling up things? And does it provide general insight applicable elsewhere or this is just performance like speed run in a video game?.
1
7
63
@kellerjordan0
Keller Jordan
6 months
Great work. This is the new record. Congrats @leloykun and @YouJiacheng!.
@YouJiacheng
You Jiacheng
6 months
New NanoGPT training speed record: 3.28 FineWeb val loss in 3.95 minutes. Previous record: 4.41 minutes.
Changelog:
- @leloykun arch optimization: ~17s
- remove "dead" code: ~1.5s
- re-implement dataloader: ~2.5s
- re-implement Muon: ~1s
- manual block_mask creation: ~5s
Tweet media one
0
8
64
@kellerjordan0
Keller Jordan
2 years
4/ But instead, it turns out that by the end of training, there’s almost no correlation between performance on the two splits. For example, out of ~10^5 repeated trainings, the best network on the first split isn’t even above average on the second.
Tweet media one
2
2
60
@kellerjordan0
Keller Jordan
7 months
.@BlinkDL_AI has entered the NanoGPT speedrunning game with a new sample-efficiency but not wallclock record, based on RWKV-7 and Muon. You love to see it.
@BlinkDL_AI
BlinkDL
7 months
RWKV-7: attention-free and surpassing modded-GPT. Training code & log: Larger headsz can reach 3.26xx. My current implementation is slow🤣Might can reach 85% GPT speed @ ctx1k (or faster than GPT @ ctx4k) after optimization. Any helps are welcome🙏#RWKV
Tweet media one
2
3
61
@kellerjordan0
Keller Jordan
10 months
Here's a mini-contribution: If "Self-distillation is performing implicit ensemble + knowledge distillation" (Allen-Zhu & Li 2020; ICLR 2023 outstanding paper honorable mention), then why do ensembles of self-distilled models consistently underperform regular ensembles? 🤔🕵️
Tweet media one
6
3
60
@kellerjordan0
Keller Jordan
5 months
Real. Congrats @YouJiacheng! The time needed to reach the performance of Karpathy’s NanoGPT/llm.c baseline has gone from 45 to <3.6 minutes on 8xH100. I wouldn’t have predicted it.
@YouJiacheng
You Jiacheng
5 months
New NanoGPT training speed record: 3.28 FineWeb val loss in 3.582 minutes.
Changelog:
- Truncate RoPE: 1460 steps, 224.5s
- ValueEmbed [0, 1, 2, 3, 4, 5, 5, 4, 3, 2, 1, 0] → [0, 1, 2, None, . , None, 0, 1, 2]: 1470 steps, 222s
- Remove the 8th Attention: 1490 steps, 214.9s
Tweet media one
0
4
57
@kellerjordan0
Keller Jordan
7 months
So I was looking into the weird rhythmic spikes in these curves. And it turns out that it's because I actually ran them with 5 epochs of 10B tokens. Here's what they look like when using 50B unique tokens instead. Pretty similar, surprisingly. 1/3.
Tweet media one
@kellerjordan0
Keller Jordan
7 months
Woke up to this. Sorry haters 😎.
Tweet media one
2
0
57
@kellerjordan0
Keller Jordan
8 months
Apropos this rather highly viewed post/method, I'd like to note that the Shampoo optimizer is:
- The only reason I ever even tried this. Shampoo gives a very nearly equivalent update w/o accumulation.
- Actually used for massive pretrainings. This is not.
@kellerjordan0
Keller Jordan
8 months
New training speed record for @karpathy’s 124M-parameter NanoGPT setup: 3.28 Fineweb validation loss in 3.7B training tokens. Previous record: 5B tokens. Changelog: new optimizer. 1/8
Tweet media one
5
2
55
@kellerjordan0
Keller Jordan
11 months
Yesterday I posted a small result about GPT-2 training: Warming up the learning rate for X steps too long effectively delays training by X/2 steps. Here’s a new experiment which provides a bit more evidence for this "warmup/2 law". 🧵/5
Tweet media one
@kellerjordan0
Keller Jordan
11 months
A small result about GPT-2 training: Warming up for too long has a simple and predictable effect on the loss curve. 🧵
Tweet media one
4
8
55
@kellerjordan0
Keller Jordan
2 months
Congratulations @YouJiacheng on this new speedrun record! It is an interesting one.
@YouJiacheng
You Jiacheng
2 months
GPT-2 Medium speedrun new record candidate: 6710 steps (estimated time: ~26.1 minutes).
Previous record: 6950 steps (27.2 minutes).
Reproducible log: it was timed to be 25.95 minutes when tuning enabled
Tweet media one
0
4
54
@kellerjordan0
Keller Jordan
3 years
Hi @stanislavfort, here is a PyTorch notebook which reproduces the basic interpolation result for ResNets on CIFAR-10. It should be runnable without modification. I hope it is helpful in your replication study.
@stanislavfort
Stanislav Fort
3 years
More of my attempt to reproduce the Git Re-Basin paper by @SamuelAinsworth, J. Hayase & @siddhss5 => I don't see the key effect for ResNet on CIFAR-10 🤔. 📊 Plots in the thread. 🖥️ Colabs to reproduce them:
Tweet media one
Tweet media two
1
4
53
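The "basic interpolation result" being reproduced here is a linear-mode-connectivity check: walk a straight line between the weights of two independently trained (and, for Git Re-Basin, permutation-aligned) networks and evaluate the metric along the path. A minimal sketch, assuming two state dicts with identical keys and an `evaluate(model)` callable supplied by the caller:

```python
import copy
import torch

def interpolate_state_dicts(sd_a: dict, sd_b: dict, alpha: float) -> dict:
    # Elementwise linear interpolation between two compatible state dicts.
    return {k: (1 - alpha) * sd_a[k] + alpha * sd_b[k] for k in sd_a}

def interpolation_curve(model, sd_a, sd_b, evaluate, n_points: int = 11):
    """Metric along the straight line between two trained networks.

    `evaluate` is assumed to map a model to a scalar (e.g. test accuracy);
    for BatchNorm networks the running statistics should be recomputed at
    each interpolation point before evaluating.
    """
    curve = []
    for i in range(n_points):
        alpha = i / (n_points - 1)
        m = copy.deepcopy(model)
        m.load_state_dict(interpolate_state_dicts(sd_a, sd_b, alpha))
        curve.append((alpha, evaluate(m)))
    return curve

# Hypothetical usage: interpolation_curve(net, net_a.state_dict(), net_b.state_dict(), evaluate=test_accuracy)
```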
@kellerjordan0
Keller Jordan
8 months
0
3
54
@kellerjordan0
Keller Jordan
2 years
2/ Background: for standard CIFAR-10 trainings there exist rare “lucky seeds/runs” attaining over +0.5% higher test-set accuracy than the average (10% fewer errors). ImageNet trainings are similar with +0.4%. These differences are considered significant in computer vision.
1
0
52
@kellerjordan0
Keller Jordan
1 year
94% accuracy on CIFAR-10 is now possible in 5.48 seconds on a single A100. Using (a modification of) @hi_tysam's hyperlightspeedbench -- check out the code!.
@hi_tysam
Fern
1 year
As of yesterday, @kellerjordan0 is the new CIFAR10 world record holder, with an unbelievable 5.48 second runtime. 🎉🎊🎉🎊 Another digit barrier falls!!! 🎊🎉🎊🎉 His code is available at Insane stuff. Brief summary and future integration deets in thread!
1
4
51