Keller Jordan

@kellerjordan0

13K Followers · 3K Following · 174 Media · 1K Statuses

CIFAR-10 fanatic @OpenAI

San Francisco
Joined March 2016
@kellerjordan0
Keller Jordan
3 years
Neural network trainings are nondeterministic. Repeated runs each produce a unique network, often with significantly _varying_ test-set performance. 🆕📜 I demonstrate that this variation has a simple statistical structure, and is harmless & inevitable https://t.co/1zzpNHi0Vy
26
154
1K
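A toy numpy sketch of the flavor of that claim (not the paper's actual analysis): if each run's per-example correctness behaves like independent coin flips around a true accuracy p, then the run-to-run spread of measured test accuracy is roughly the binomial value sqrt(p(1-p)/N), i.e. small, predictable, and unavoidable. The accuracy, test-set size, and run count below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n_test, n_runs = 0.94, 10_000, 200          # illustrative values
# Each "run" gets a test accuracy drawn as if per-example correctness were
# independent coin flips around the true accuracy p.
accs = rng.binomial(n_test, p, size=n_runs) / n_test
print(f"empirical run-to-run std: {accs.std():.4f}")
print(f"binomial prediction:      {np.sqrt(p * (1 - p) / n_test):.4f}")   # ~0.0024
```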
@kellerjordan0
Keller Jordan
7 days
TIL that Muon is in PyTorch stable now. Pretty cool.
15
58
862
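For context, a minimal sketch of what a Muon-style update does, following the public reference implementation rather than the torch.optim API (whose exact signature is not asserted here): heavy-ball momentum on the gradient, then approximate orthogonalization of the 2-D update via a quintic Newton-Schulz iteration. The learning rate and momentum are illustrative, and details such as Nesterov momentum and per-shape scaling are omitted.

```python
import torch

def newton_schulz(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Quintic Newton-Schulz iteration that drives a matrix toward an
    # orthogonal one; coefficients follow the public Muon reference code.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)
    transpose = G.size(0) > G.size(1)
    if transpose:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transpose else X

def muon_step(param, grad, buf, lr=0.02, momentum=0.95):
    buf.mul_(momentum).add_(grad)                     # heavy-ball momentum buffer
    param.data.add_(newton_schulz(buf), alpha=-lr)    # orthogonalized update

W = torch.nn.Parameter(torch.randn(256, 512) * 0.02)
buf = torch.zeros_like(W)
loss = (W @ torch.randn(512)).square().mean()
loss.backward()
muon_step(W, W.grad, buf)
```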
@hivergeai
Hiverge
20 days
How did an intern with no AI experience break a world record? Read our new blog post:
Tweet card summary image
hiverge.ai
Hiverge
@AlhusseinFawzi
Alhussein Fawzi
22 days
We challenged our intern @ramadan_al76760 (zero prior AI experience) to beat the CIFAR-10 training speed record using @hivergeai's algorithmic discovery engine. Result: Sub-2-second (!!) training for the first time ever.
0
4
20
@kellerjordan0
Keller Jordan
20 days
@yonashav Oops, I meant to say observable universe, not known universe.
0
0
6
@kellerjordan0
Keller Jordan
21 days
Update: @yonashav pointed out that if we allow FTL travel, then this bound fails because the mass (and therefore computational capacity) of the entire universe is potentially much larger than the known universe. So to get a bound, we do have to assume the impossibility of FTL.
5
2
113
@kellerjordan0
Keller Jordan
21 days
Theorem: The maximum possible duration of the computational singularity is 470 years. Proof: The FLOPs capacity of all computers which existed in the year 1986 is estimated to be at most 4.5e14 (Hilbert et al. 2011). Based on public Nvidia revenue and GPU specs, this capacity
64
51
623
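A back-of-envelope sketch of the shape of that bound. The growth-rate and universe-capacity figures behind the 470-year number are not visible in the truncated tweet, so the values below (a Lloyd-style ~1e120-operation ceiling and a sustained yearly doubling) are placeholders, not the author's.

```python
import math

start_flops = 4.5e14        # 1986 estimate cited in the tweet (Hilbert et al. 2011)
universe_limit = 1e120      # assumed physical ceiling on total capacity (placeholder)
growth_per_year = 2.0       # assumed sustained doubling each year (placeholder)

# Sustained exponential growth from the start value cannot exceed the ceiling,
# so its duration is bounded by log(limit/start) / log(rate).
max_years = math.log(universe_limit / start_flops) / math.log(growth_per_year)
print(f"max duration under these assumptions: {max_years:.0f} years")   # ~350
```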
@kellerjordan0
Keller Jordan
22 days
- TTA: Skip for easy examples - Thermal throttling: Sleep for 8s between runs (only affects average not record time) Note: The authors reported a time of 2.02 seconds. My reproduction (torch 2.7.0; hardware seen below) had a min time of 1.99s. Code: https://t.co/qPN6oebF5T 3/3
1
4
66
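A hedged sketch of the "skip TTA for easy examples" item: run the plain forward pass, and only apply test-time augmentation (a horizontal flip here; the record code's augmentation set and threshold may differ) to examples whose top-2 logit margin is small.

```python
import torch

@torch.no_grad()
def predict_with_selective_tta(model, images, margin_thresh=2.0):
    logits = model(images)                       # plain forward pass
    top2 = logits.topk(2, dim=1).values
    margin = top2[:, 0] - top2[:, 1]             # top-2 logit margin per example
    hard = margin < margin_thresh                # low-margin examples get TTA;
    if hard.any():                               # high-margin ("easy") ones skip it
        flipped = torch.flip(images[hard], dims=[3])     # horizontal flip
        logits[hard] = (logits[hard] + model(flipped)) / 2
    return logits.argmax(dim=1)

# Dummy usage with a stand-in model
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
preds = predict_with_selective_tta(model, torch.randn(8, 3, 32, 32))
```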
@kellerjordan0
Keller Jordan
22 days
- Data aug: Add color jitter and vectorize random crop - Compilation: Compile xent fwd/bwd - Architecture: Replace GELU with SiLU, use SVD for first layer init, and use channels_last format with fp16 for all convs - Hparams: tweaks including bsz 2000 -> 1536 & epochs 8 -> 7.6 2/3
1
3
42
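A sketch of two of those items, channels_last/fp16 convolutions and a compiled cross-entropy forward/backward, assuming a CUDA device; the record code's actual architecture and settings differ.

```python
import torch
import torch.nn.functional as F

conv = torch.nn.Conv2d(3, 64, 3, padding=1, bias=False).cuda().half()
conv = conv.to(memory_format=torch.channels_last)   # NHWC layout for fast fp16 convs
head = torch.nn.Linear(64, 10).cuda().half()

@torch.compile
def xent(logits, labels):
    # Compiling the loss lets the softmax/cross-entropy forward and backward
    # fuse into fewer kernels.
    return F.cross_entropy(logits, labels)

x = torch.randn(512, 3, 32, 32, device="cuda").half().to(memory_format=torch.channels_last)
labels = torch.randint(0, 10, (512,), device="cuda")
feats = conv(x)                                      # fp16, channels_last conv
loss = xent(head(feats.mean(dim=(2, 3))).float(), labels)
loss.backward()
```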
@kellerjordan0
Keller Jordan
22 days
New CIFAR-10 training speed record: 94% in 1.99 seconds on one A100 Previous record: 2.59 seconds (Nov. 10th 2024) New record-holder: Algorithmic discovery engine developed by @hivergeai Changelog: - Muon: Vectorize NS iter and reduce frequency of 'normalize weights' step 1/3
5
40
386
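The "vectorize NS iter" item amounts to batching the Newton-Schulz iteration sketched earlier over a stack of same-shape matrices instead of looping in Python. For the other Muon item, one possible reading of "reduce frequency of the 'normalize weights' step" is sketched below: if the training loop periodically rescales conv filters to a fixed norm, doing so every k steps instead of every step saves work. The normalization and the frequency here are illustrative, not the record code's.

```python
import torch

@torch.no_grad()
def normalize_conv_filters(conv: torch.nn.Conv2d):
    # Rescale each output filter to unit Frobenius norm (illustrative).
    w = conv.weight
    w.div_(w.flatten(1).norm(dim=1).clamp_min(1e-6).view(-1, 1, 1, 1))

conv = torch.nn.Conv2d(3, 64, 3)
opt = torch.optim.SGD(conv.parameters(), lr=0.1)
NORMALIZE_EVERY = 4                    # placeholder frequency
for step in range(20):
    loss = conv(torch.randn(32, 3, 32, 32)).square().mean()
    opt.zero_grad(); loss.backward(); opt.step()
    if step % NORMALIZE_EVERY == 0:    # previously: every step
        normalize_conv_filters(conv)
```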
@karpathy
Andrej Karpathy
24 days
Excited to release new repo: nanochat! (it's among the most unhinged I've written). Unlike my earlier similar repo nanoGPT which only covered pretraining, nanochat is a minimal, from scratch, full-stack training/inference pipeline of a simple ChatGPT clone in a single,
664
3K
24K
@classiclarryd
Larry Dial
1 month
Down to 146.8s on modded-nanogpt! https://t.co/OV0TaesL4I Surprising result: Different parameter groups have different sensitivity to batch size. Instead of picking a single batch size, grad accumulation can be managed on a param level to simulate different batch sizes.
github.com
This submission reflects all recent WR changes up to PR#134. The main contribution of this PR is to introduce the concept of variable batch size by parameter group, by having different gradient acc...
1
4
40
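A minimal sketch of the idea (group choice, sizes, and the accumulation factor are illustrative, not the PR's settings): one parameter group steps every micro-batch while another accumulates k micro-batches before stepping, which gives it a k-times-larger effective batch.

```python
import torch

model = torch.nn.Sequential(torch.nn.Linear(32, 32), torch.nn.ReLU(), torch.nn.Linear(32, 10))
group_small = list(model[0].parameters())     # steps every micro-batch
group_large = list(model[2].parameters())     # simulated k-times-larger batch
opt_small = torch.optim.SGD(group_small, lr=0.1)
opt_large = torch.optim.SGD(group_large, lr=0.1)
k = 4                                         # accumulation factor for the second group

for step in range(16):
    x, y = torch.randn(64, 32), torch.randint(0, 10, (64,))
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()                           # grads accumulate for both groups
    opt_small.step(); opt_small.zero_grad()   # group 1: effective batch = 64
    if (step + 1) % k == 0:
        for p in group_large:
            p.grad /= k                       # average the k accumulated grads
        opt_large.step(); opt_large.zero_grad()   # group 2: effective batch = 256
```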
@deepcohen
Jeremy Cohen
1 month
Even with full-batch gradients, DL optimizers defy classical optimization theory, as they operate at the *edge of stability.* With @alex_damian_, we introduce "central flows": a theoretical tool to analyze these dynamics that makes accurate quantitative predictions on real NNs.
19
210
1K
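A toy way to see the "edge of stability" quantity in code (not the central-flows machinery from the paper): estimate the largest-magnitude Hessian eigenvalue, i.e. the sharpness, with power iteration on Hessian-vector products; at the edge of stability this value hovers near 2/lr for gradient descent.

```python
import torch

def top_hessian_eigenvalue(loss_fn, params, iters=20):
    # Power iteration with Hessian-vector products; returns an estimate of the
    # largest-magnitude eigenvalue of the loss Hessian w.r.t. `params`.
    v = [torch.randn_like(p) for p in params]
    for _ in range(iters):
        grads = torch.autograd.grad(loss_fn(), params, create_graph=True)
        hv = torch.autograd.grad(grads, params, grad_outputs=v)
        norm = torch.sqrt(sum((h * h).sum() for h in hv))
        v = [h / (norm + 1e-12) for h in hv]
    grads = torch.autograd.grad(loss_fn(), params, create_graph=True)
    hv = torch.autograd.grad(grads, params, grad_outputs=v)
    return sum((h * u).sum() for h, u in zip(hv, v)).item()

# Toy full-batch quadratic problem: the Hessian is constant (2 * X^T X / n).
w = torch.nn.Parameter(torch.randn(10))
X, y = torch.randn(256, 10), torch.randn(256)
sharpness = top_hessian_eigenvalue(lambda: ((X @ w - y) ** 2).mean(), [w])
print(f"estimated sharpness: {sharpness:.3f}")
```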
@kellerjordan0
Keller Jordan
1 month
Like, I think there’s a strong case to ask for better citation. But it should be recognized that using residuals to get ImageNet SOTA has value in and of itself. Otherwise it’s as if Fermat walked on stage with Andrew Wiles and said “guys what’s the big deal, I told you this already”
9
4
76
@kellerjordan0
Keller Jordan
1 month
The problem with this line is that any history of neural network science that focuses only on the ideas, while ignoring the quality of evidence given for those ideas, is as incomplete as a history of math that focuses only on the theorems w/o the proofs https://t.co/G9p3V8m8an
@SchmidhuberAI
Jürgen Schmidhuber
1 month
The most cited paper of the 21st century is on deep residual learning with residual connections. Who invented this? Timeline: ★ 1991: @HochreiterSepp solves vanishing gradient problem through recurrent residual connections (weight 1.0) ★ 1997 LSTM: plain recurrent residual
5
11
237
@kellerjordan0
Keller Jordan
1 month
…when it’s much more likely that only the explosion is characterized that way, not the future. Another easy inference: Neither global population-explosion nor population-collapse can be “tensions of the far-future as such,” since they both stop themselves in the near-future.
0
0
2
@kellerjordan0
Keller Jordan
1 month
For the purposes of Figuring Out Wtf Is Going On, it seems crucial to keep the explosion in mind. For example, someone looking only at the distinctive qualities of the present might say “the future is characterized by parents being perplexed by the lives of their children.” …
1
0
5
@kellerjordan0
Keller Jordan
1 month
The singular thing that we can say about the far-future is that it will almost-always *not* be undergoing an explosion. Since in a finite universe, exponentials run out. So in that sense the far-future will look more like the far-past than the present. https://t.co/pXMDnZWNZf
@rpoo
Ross
1 month
good reminder that we're living in an explosion; funny that things still seem slow & normal day to day, and recorded history is still only around one millionth of earth's full history. hard to imagine what the future will look like +250k years from here
1
0
14
@kellerjordan0
Keller Jordan
2 months
Random idea: It should be possible to make any dataset autoregressively unlearnable by prefixing each of its contexts with (encrypted ciphertext of that context, randomized decryption key, execution trace of decryption) => every next-token is either deterministic or noise-like.
9
2
169
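A toy instantiation of that construction, with a XOR one-time pad standing in for the unspecified cipher: each training context becomes ciphertext || key || decryption trace, so every next token is either noise-like (the ciphertext and key) or a deterministic function of earlier tokens (the trace).

```python
import os

def make_unlearnable(context: bytes) -> bytes:
    # Replace a training context with: ciphertext || key || decryption trace.
    key = os.urandom(len(context))                              # random one-time pad
    ciphertext = bytes(c ^ k for c, k in zip(context, key))
    trace = bytes(ct ^ k for ct, k in zip(ciphertext, key))     # == context, byte by byte
    return ciphertext + key + trace

example = make_unlearnable(b"the quick brown fox")
print(len(example))   # 3x the original length; the first two thirds are noise-like,
                      # the last third is fully determined by the first two
```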
@kellerjordan0
Keller Jordan
2 months
Great to see this effort towards rigorous hyperparameter tuning. Two areas for improvement: 1. IIUC, the scaled up run here isn't actually tuned at all - its hparams are set via extrapolation 2. Sensitive hparams need a more granular sweep than power-of-2 https://t.co/O5vG58q3Wx
@percyliang
Percy Liang
2 months
We did a very careful study of 10 optimizers with no horse in the race. Despite all the excitement about Muon, Mars, Kron, Soap, etc., at the end of the day, if you tune the hyperparameters rigorously and scale up, the speedup over AdamW diminishes to only 10% :-( Experiments
4
10
172
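On point 2, a finer-than-powers-of-2 sweep is easy to generate, e.g. quarter-octave (about 1.19x) spacing around a center value; the center learning rate and span below are illustrative.

```python
import numpy as np

center_lr = 3e-4                               # illustrative center value
multipliers = 2.0 ** (np.arange(-4, 5) / 4)    # quarter-octave spacing, 0.5x .. 2x
sweep = center_lr * multipliers
print([f"{lr:.2e}" for lr in sweep])
```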