Keller Jordan

@kellerjordan0

Followers
1,029
Following
200
Media
42
Statuses
262

Independent research Prev MLE @ Hive AI, math @ UCSD

San Francisco
Joined March 2016
Pinned Tweet
@kellerjordan0
Keller Jordan
1 year
Neural network trainings are nondeterministic. Repeated runs each produce a unique network, often with significantly _varying_ test-set performance. 🆕📜 I demonstrate that this variation has a simple statistical structure, and is harmless & inevitable
Tweet media one
23
138
946
@kellerjordan0
Keller Jordan
2 years
Along with many others, I find the results of Git Re-Basin by @SamuelAinsworth , J. Hayase & @siddhss5 highly interesting. But I believe there is a crucial detail which deserves attention: The authors replace BatchNorm with LayerNorm in their ResNet and VGG implementations. 1/14
6
55
331
@kellerjordan0
Keller Jordan
7 months
Something amusing in neural network optimization: There exists an algorithm which fits training data >5x faster than SGD/Adam. But it’s useless in practice. Let me explain... (1/4) Code:
Tweet media one
15
38
327
@kellerjordan0
Keller Jordan
2 months
I'm interested in this recent ICLR 2024 spotlight paper from Google Research, which found a power-law alignment between bias and variance in softmax probability space. In this thread I'll replicate its central empirical result, but then argue that it
Tweet media one
Tweet media two
Tweet media three
Tweet media four
9
45
264
@kellerjordan0
Keller Jordan
1 month
Horizontal flipping is the most common data augmentation in machine learning 🆕📜 I show that it can be improved for free: instead of flipping randomly, flip half the training images on even epochs and the other half on odd epochs. 1/4
Tweet media one
Tweet media two
3
22
167
@kellerjordan0
Keller Jordan
2 years
Why don’t current model merging results generalize to standard ConvNets? And how can this be fixed? We answer these Qs and present a method that improves merged NN performance for any choice of norm layer. W/ @HanieSedghi @osaukh @rahiment @bneyshabur
Tweet media one
2
24
114
@kellerjordan0
Keller Jordan
1 year
@nearcyan The Minecraft redstone community has higher standards of academic rigor than ML research. Might be hard to convince them to switch to the lesser field
0
3
96
@kellerjordan0
Keller Jordan
7 months
So in conclusion: it’s actually very easy/fast to find zero-loss solutions to typical supervised learning problems in DL. SGD/Adam, by contrast, are inefficient - but make up for it by having an “implicit bias” towards solutions with much better generalization. (4/4)
3
0
94
@kellerjordan0
Keller Jordan
4 months
Happy to say this project has been accepted to ICLR. One of its central results: for stable binary classification trainings, it is possible to bound the variance between runs *a priori* via the formula var(error rate) <= (avg error rate) / (2 * num test examples)
Tweet media one
@kellerjordan0
Keller Jordan
1 year
Neural network trainings are nondeterministic. Repeated runs each produce a unique network, often with significantly _varying_ test-set performance. 🆕📜 I demonstrate that this variation has a simple statistical structure, and is harmless & inevitable
Tweet media one
23
138
946
4
8
90
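A minimal sketch of that a priori bound in code (the helper name is mine, not the paper's): plug in the average test-set error rate and the test-set size to get the cap on between-run variance.

```python
import math

def error_rate_variance_bound(avg_error_rate: float, num_test_examples: int):
    """Bound from the tweet above: var(error rate) <= avg_error_rate / (2 * n)."""
    var_bound = avg_error_rate / (2 * num_test_examples)
    return var_bound, math.sqrt(var_bound)  # (variance cap, std-dev cap)

# Example: ~9% average error on the 10,000-example CIFAR-10 test set.
var_cap, std_cap = error_rate_variance_bound(0.09, 10_000)
print(f"std(error rate) <= {std_cap:.4%}")  # about 0.21% between-run standard deviation
```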
@kellerjordan0
Keller Jordan
1 month
Jumping onto this train, here's an SGD-Nesterov that outperforms both
@aaron_defazio
Aaron Defazio
1 month
Schedule-Free (dotted black line) outperforming highly tuned SGD!
6
12
112
1
9
80
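For reference, a plain SGD-Nesterov baseline of the kind mentioned above can be set up in PyTorch as below; the hyperparameters and the scheduler choice are illustrative, not the ones behind the plot.

```python
import torch

model = torch.nn.Linear(32, 10)  # stand-in for the real network
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1,            # illustrative; careful tuning is the whole point of the comparison
    momentum=0.9,
    nesterov=True,     # Nesterov momentum
    weight_decay=5e-4,
)
# Unlike Schedule-Free, this baseline still needs a learning-rate schedule, e.g.:
scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=0.1, total_steps=10_000)
```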
@kellerjordan0
Keller Jordan
2 months
Thanks @francoisfleuret for these great questions The answer is that there's no single key idea: I'm using six different techniques which each contribute to the final speed. Each one has different generalization properties. The techniques are (1)
@francoisfleuret
François Fleuret
2 months
@kellerjordan0 Can you TL;DR ? Is there a key idea or just piling up things? And does it provide general insight applicable elsewhere or this is just performance like speed run in a video game?
4
0
23
1
7
66
@kellerjordan0
Keller Jordan
1 year
4/ But instead, it turns out that by the end of training, there’s almost no correlation between performance on the two splits. For example, out of ~10^5 repeated trainings, the best network on the first split isn’t even above average on the second.
Tweet media one
1
2
57
@kellerjordan0
Keller Jordan
2 years
Hi @stanislavfort , here is a PyTorch notebook which reproduces the basic interpolation result for ResNets on CIFAR-10. It should be runnable without modification. I hope it is helpful in your replication study.
@stanislavfort
Stanislav Fort ✨🧠🤖📈✨
2 years
More of my attempt to reproduce the Git Re-Basin paper by @SamuelAinsworth , J. Hayase & @siddhss5 => I don't see the key effect for ResNet on CIFAR-10 🤔 📊 Plots in the thread 🖥️ Colabs to reproduce them:
Tweet media one
Tweet media two
3
11
96
1
4
54
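The linked notebook isn't reproduced here; below is a minimal, illustrative sketch of the basic weight-space interpolation step it refers to (applied after whatever neuron alignment has been done).

```python
import copy
import torch

def interpolate_models(model_a, model_b, alpha=0.5):
    """Return a model whose floating-point weights are (1-alpha)*A + alpha*B."""
    merged = copy.deepcopy(model_a)
    sd_a, sd_b = model_a.state_dict(), model_b.state_dict()
    merged_sd = {}
    for k in sd_a:
        if sd_a[k].is_floating_point():
            merged_sd[k] = (1 - alpha) * sd_a[k] + alpha * sd_b[k]
        else:
            merged_sd[k] = sd_a[k]  # e.g. BatchNorm's num_batches_tracked counter
    merged.load_state_dict(merged_sd)
    return merged
```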
@kellerjordan0
Keller Jordan
5 months
94% accuracy on CIFAR-10 is now possible in 5.48 seconds on a single A100, using (a modification of) @hi_tysam 's hyperlightspeedbench -- check out the code!
@hi_tysam
Fern
5 months
As of yesterday, @kellerjordan0 is the new CIFAR10 world record holder, with an unbelievable 5.48 second runtime. 🎉🎊🎉🎊 Another digit barrier falls!!!🎊🎉🎊🎉 His code is available at Insane stuff. Brief summary and future integration deets in thread!
1
4
59
1
4
53
@kellerjordan0
Keller Jordan
2 years
The solution: After interpolating between BatchNorm networks, we need to reset the normalization statistics (running_mean and running_var). Once this is done, the barrier is immediately even smaller than that of the LayerNorm networks used in the paper. 7/14
Tweet media one
2
3
46
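A sketch of that BatchNorm-statistics reset, assuming a merged PyTorch model and a training dataloader (the function name and batch count are mine):

```python
import torch

@torch.no_grad()
def reset_bn_stats(model, train_loader, num_batches=100, device="cuda"):
    """Zero out running_mean/running_var in every BatchNorm layer, then re-estimate
    them by forwarding training batches with the model in train() mode."""
    for m in model.modules():
        if isinstance(m, torch.nn.modules.batchnorm._BatchNorm):
            m.reset_running_stats()
            m.momentum = None  # None => cumulative moving average over the passes below
    model.train()
    for i, (x, _) in enumerate(train_loader):
        if i >= num_batches:
            break
        model(x.to(device))
    return model
```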
@kellerjordan0
Keller Jordan
1 month
Currently trying this out on cifar-10 speedrunning (only the most important application in the world 😤)
@aaron_defazio
Aaron Defazio
1 month
Schedule-Free Learning We have now open sourced the algorithm behind my series of mysterious plots. Each plot was either Schedule-free SGD or Adam, no other tricks!
Tweet media one
39
215
1K
2
2
44
@kellerjordan0
Keller Jordan
1 year
2/ Background: for standard CIFAR-10 trainings there exist rare “lucky seeds/runs” attaining over +0.5% higher test-set accuracy than the average (10% fewer errors). ImageNet trainings are similar with +0.4%. These differences are considered significant in computer vision.
1
1
43
@kellerjordan0
Keller Jordan
1 year
6/ This would imply that the test-set accuracy distribution is generated as the sum of a series of independent coin flips, one for each test-set example. And…this simple statistical model actually turns out to be a very good approximation.
Tweet media one
2
1
42
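A tiny simulation of that coin-flip model (purely illustrative, not the paper's code): each run's test accuracy is produced by flipping one independent, calibrated coin per test example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_test, n_runs = 10_000, 1_000            # CIFAR-10-sized test set, 1000 simulated runs

# Per-example probability of being classified correctly (illustrative distribution).
p_correct = rng.beta(8, 1, size=n_test)   # mean accuracy around 0.89

# Each simulated "training run" flips one coin per test example.
accs = (rng.random((n_runs, n_test)) < p_correct).mean(axis=1)
print(f"mean acc {accs.mean():.4f}, between-run std {accs.std():.4f}")
```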
@kellerjordan0
Keller Jordan
1 year
13/ Shoutouts: This project was largely inspired by @david_picard ‘s empirical study “Torch.manual_seed(3407) is all you need”.
@david_picard
David Picard
3 years
"torch.manual_seed(3407) is all you need"! draft 📜: Sorry for the title. I promise it's not (entirely) just for trolling. It's my little spare time project of this summer to investigate unaccounted randomness in #ComputerVision and #DeepLearning . 🧵👇 1/n
7
83
408
2
0
39
@kellerjordan0
Keller Jordan
7 months
The downside is terrible generalization: 55% test error for “TopSGD” solutions, vs. only 15% for the average SGD/Adam solution of ResNet18 on unaugmented CIFAR-10 (3/4)
Tweet media one
1
3
40
@kellerjordan0
Keller Jordan
2 months
Very cool, main skepticism of this would be that it violates the efficient Shazeer hypothesis
@MatPagliardini
Matteo Pagliardini
2 months
A tweak in the architecture of #Transformers can significantly boost accuracy! With direct access to all previous blocks’ outputs, a 48-block #DenseFormer outperforms a 72-block Transformer, with faster inference! A work with @akmohtashami_a , @francoisfleuret , Martin Jaggi. 1/🧵
Tweet media one
28
172
1K
1
1
39
@kellerjordan0
Keller Jordan
1 year
If Google decides to prevent competitors from doing next-frame-prediction trainings on YouTube, they can do so by adding an imperceptible encoding of the next frame to each current frame => the model only learns this spurious correlation, rather than any actual structure of video
3
1
37
@kellerjordan0
Keller Jordan
1 year
10/ My conclusions: variance is both _harmless_ (does not imply almost any differences in model quality) and _inevitable_ (cannot be gotten rid of without sacrificing other beneficial properties of training).
1
2
36
@kellerjordan0
Keller Jordan
2 years
Without LayerNorm, the interpolation between ResNets breaks: if we use BatchNorm, the barrier increases to the point that interpolated networks perform not much better than random chance; measured in terms of test accuracy, the barrier is more than 70%. 2/14
Tweet media one
2
1
34
@kellerjordan0
Keller Jordan
7 months
The algorithm works by optimizing only the last two layers of the network (for ResNet18, the final residual block and fc layer). This allows us to precompute/cache the activations coming in from earlier layers, while still being able to perfectly fit the training data (2/4)
Tweet media one
1
1
31
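A rough sketch of that caching setup for a torchvision ResNet-18; the split point, names, and optimizer are my illustration, not the linked code.

```python
import torch
import torchvision

model = torchvision.models.resnet18(num_classes=10).cuda()

# Frozen feature extractor: everything before the final residual stage.
backbone = torch.nn.Sequential(
    model.conv1, model.bn1, model.relu, model.maxpool,
    model.layer1, model.layer2, model.layer3,
).eval()
for p in backbone.parameters():
    p.requires_grad_(False)

# Trainable head: final residual stage + classifier.
head = torch.nn.Sequential(model.layer4, model.avgpool, torch.nn.Flatten(1), model.fc)

@torch.no_grad()
def cache_features(loader):
    feats, labels = [], []
    for x, y in loader:
        feats.append(backbone(x.cuda()).cpu())
        labels.append(y)
    return torch.cat(feats), torch.cat(labels)

# Training then only touches `head` on the cached features,
# so every epoch skips the frozen backbone entirely.
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
```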
@kellerjordan0
Keller Jordan
1 year
3/ However, it’s not yet clear if these lucky runs of training are actually better than unlucky ones. To find out, I split the CIFAR-10 test-set into two halves, and evaluated thousands of trained networks against both. Intuitively, lucky networks should do well on both splits.
1
0
31
@kellerjordan0
Keller Jordan
1 year
14/ It also wouldn’t have been possible to efficiently train the ~350K networks needed for it, without the FFCV library from @aleks_madry ‘s lab. Thanks to @bneyshabur and @esiamid for their advisements during the project.
4
1
30
@kellerjordan0
Keller Jordan
1 year
@RokoMijic Good question...! It turns out to be entirely caused by instability to initial conditions, i.e. "chaos" in the training dynamics. Check out my section 3 and !
Tweet media one
2
2
29
@kellerjordan0
Keller Jordan
2 years
Resetting the BatchNorm statistics of the interpolated network fixes this, and allows us to interpolate between standard BatchNorm-based ResNets with an evidently lower barrier than we get with LayerNorm-based networks. 12/14
Tweet media one
2
0
27
@kellerjordan0
Keller Jordan
1 year
5/ Given this lack of correlation between performance on splits of test-set data, I next test the hypothesis that there also aren’t even any correlations between individual examples.
Tweet media one
1
0
26
@kellerjordan0
Keller Jordan
4 months
@nearcyan Given this, why do you think YouTube offers a $14/mo subscription? I’d imagine that their revenue per heavy user is even higher
4
0
24
@kellerjordan0
Keller Jordan
1 year
8/ Turning to the origin of variance, prior works (especially ) have observed that ensembles of independently trained networks make roughly calibrated predictions. I prove that this calibration property alone implies variation in test-set accuracy.
Tweet media one
1
0
24
@kellerjordan0
Keller Jordan
1 year
11/ These findings were obtained from studying standard, well-tuned trainings on CIFAR-10 and ImageNet. As a caveat, for unstable trainings (e.g. too high learning rate) variance exceeds the hypothesis and is certainly not harmless.
Tweet media one
1
1
24
@kellerjordan0
Keller Jordan
2 years
My collaborators @bneyshabur @HanieSedghi @osaukh @rahiment & I will release a paper expanding on these results in a few weeks. Given the current level of interest in this work, we decided to also post this thread today. 13/14
2
0
24
@kellerjordan0
Keller Jordan
1 year
9/ And for binary classification I prove (under certain assumptions) a simple formula which accurately predicts the empirical variance. So, any intervention which gets rid of variance would also have to get rid of ensemble-calibration.
Tweet media one
1
0
22
@kellerjordan0
Keller Jordan
1 year
12/ When trained networks are evaluated on distribution-shifted test-sets, there is also significant excess variance.
Tweet media one
1
0
22
@kellerjordan0
Keller Jordan
2 years
Why is the barrier of BatchNorm networks worse? Well, first I have to come clean: I’ve been developing similar results with my collaborators for several months, and can instead tell you about a method which allows us to interpolate BatchNorm networks too. 6/14
1
0
22
@kellerjordan0
Keller Jordan
1 year
7/ Also usefully, the excess observed variance over this statistical model forms an unbiased estimator for the variance in accuracy on the test _distribution_, which turns out to be very small. (Full definitions & proof in the paper)
Tweet media one
Tweet media two
2
0
21
@kellerjordan0
Keller Jordan
2 years
Code: 8/14
1
0
21
@kellerjordan0
Keller Jordan
2 years
We can also note that BatchNorm is the standard and performant normalization layer for these architectures. The paper’s LayerNorm-based ResNet20 gets 85.8% accuracy on CIFAR-10, lower than the standard 91.7% when using BatchNorm. 3/14
1
0
20
@kellerjordan0
Keller Jordan
2 years
In this thread I will investigate the importance of normalization layers for interpolation, and contribute my own code for interpolating between ResNets, which I hope will add to the ongoing discussions and help to replicate the main results. 5/14
@stanislavfort
Stanislav Fort ✨🧠🤖📈✨
2 years
I found the Git Re-Basin paper () by @SamuelAinsworth , J. Hayase & @siddhss5 *really* intriguing. So I made a replication in Colab reusing bits of their code but unfortunately couldn't reproduce the key conclusion 🚨😱 🖥️Colab 1/5
Tweet media one
8
37
387
1
0
20
@kellerjordan0
Keller Jordan
2 years
As a result, interpolated non-LayerNorm networks experience a cascade of shrinking preactivation variance as the signal progresses: by the last block, our directly interpolated ResNet20 has a preactivation variance more than 300x smaller than the two parent networks. 11/14
Tweet media one
1
0
19
@kellerjordan0
Keller Jordan
2 years
This creates a problem: if two signals have correlation<1, then their interpolated average will have standard deviation less than the average of the two original stddevs. 10/14
Tweet media one
1
0
18
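A quick numerical illustration of that point: averaging two unit-variance signals with correlation rho gives standard deviation sqrt((1 + rho) / 2), which is below 1 whenever rho < 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
a, noise = rng.standard_normal(n), rng.standard_normal(n)

for rho in [1.0, 0.8, 0.5, 0.0]:
    b = rho * a + np.sqrt(1 - rho**2) * noise   # unit variance, corr(a, b) ~= rho
    mid = 0.5 * (a + b)                         # the "interpolated average"
    print(f"rho={rho:.1f}  std(mid)={mid.std():.3f}  predicted={np.sqrt((1 + rho) / 2):.3f}")
```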
@kellerjordan0
Keller Jordan
8 months
I was today years old when I found out that UC Berkeley’s undergrad CS program has collapsed
Tweet media one
3
0
14
@kellerjordan0
Keller Jordan
9 months
For anyone doing experiments with CIFAR-10, here's a dataloader that's >50x faster than the PyTorch default. It’s very easy to use, and could be helpful if you're training smallish networks and not already using something similar
1
2
16
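The linked code isn't reproduced here, but the usual trick behind such dataloaders is to hold the whole dataset in GPU memory as one tensor and slice batches from it; a hedged sketch:

```python
import torch
import torchvision

# Load CIFAR-10 once into a single GPU tensor (the train set is only ~150 MB as uint8).
ds = torchvision.datasets.CIFAR10(root="./data", train=True, download=True)
images = torch.tensor(ds.data).permute(0, 3, 1, 2).contiguous().cuda()   # uint8, NCHW
labels = torch.tensor(ds.targets).cuda()

def gpu_batches(batch_size=512):
    perm = torch.randperm(len(images), device=images.device)
    for i in range(0, len(perm), batch_size):
        idx = perm[i:i + batch_size]
        # Normalize on the fly: no per-item Python overhead, no host<->device copies.
        yield images[idx].float().div_(255), labels[idx]
```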
@kellerjordan0
Keller Jordan
2 years
We hope that our findings can help solidify the generality and reproducibility of interpolation results like those found in Git Re-Basin. Git Re-Basin authors: @SamuelAinsworth , J. Hayase & @siddhss5 14/14
1
0
16
@kellerjordan0
Keller Jordan
17 days
Just finished training 1.5e7 CIFAR-10 models
1
0
16
@kellerjordan0
Keller Jordan
1 year
I really enjoyed working with @bneyshabur on this project. He was very motivating, a great mentor for me. Glad I got the opportunity through @ml_collective to work with him.
@bneyshabur
Behnam Neyshabur
1 year
Acceptance of this paper to #ICLR2023 is particularly rewarding to me because it is a very successful example of what I was envisioning when I created the collaboration request form that is open to everyone as part of @ml_collective : 1/3
2
16
126
0
1
15
@kellerjordan0
Keller Jordan
2 years
Why does this work? Consider a pair of neurons that have been matched from the two networks that we want to interpolate. Unless the matching is perfect, the two neurons’ preactivations will have a correlation of less than one. 9/14
1
0
14
@kellerjordan0
Keller Jordan
2 years
Update: @SamuelAinsworth notes that Git Re-Basin's LayerNorm-based ResNets achieve better test accuracy on CIFAR-10 than what I had posted.
@SamuelAinsworth
Samuel "curry-howard fanboi" Ainsworth
2 years
@kellerjordan0 @siddhss5 @stanislavfort Yeah, I'm not familiar with @stanislavfort 's code, but our skinniest ResNet gets 93% ()
2
0
3
0
0
13
@kellerjordan0
Keller Jordan
1 month
Its effectiveness stems from eliminating redundantly repeated flip choices. This is similar to how the standard method of sampling training data without replacement gets rid of the data repetitions which might occur from naively sampling with replacement. 4/4
0
0
12
@kellerjordan0
Keller Jordan
2 months
@dcpage3 @karpathy @jeremyphoward I think it's nice to develop intuition for training neural nets with this thing It can be used to test classic things like "when you double the batch size, is it really ideal to double the learning rate?" in < 1 minute Maybe has some educational value
2
0
11
@kellerjordan0
Keller Jordan
2 years
For a description of the REPAIR algorithm, plus an analysis of why neuronal statistics collapse in merged networks, check out the paper: the code: Shoutout to @ml_collective for facilitating this collaboration!
1
0
12
@kellerjordan0
Keller Jordan
2 years
First, how do you deal with the BatchNorm case? Turns out that it’s not too hard: we found that by resetting BatchNorm statistics after merging (a trick that goes back to SWA ), the merged networks perform well again.
Tweet media one
1
0
10
@kellerjordan0
Keller Jordan
3 months
Everyone’s been training on the whole internet but I still just want to understand CIFAR-10
0
1
10
@kellerjordan0
Keller Jordan
2 years
In our work we study why these results don’t transfer to BatchNorm and norm-free networks. We end up with a correction that resolves both of these cases, plus improving the original LayerNorm-based results.
1
1
9
@kellerjordan0
Keller Jordan
1 month
I'm wondering: does this beat baselines which use EMA? (Even if not, it would be cool to get the effects of EMA - smoothed loss curves and a perf boost - without the extra memory)
@Yampeleg
Yam Peleg
1 month
Check out the new AdamW OVERPOWERED EDITION! No scheduler and nearly a drop in replacement. Code: Some sweet loss plots were released all week with it:
Tweet media one
Tweet media two
Tweet media three
Tweet media four
9
54
372
0
0
9
@kellerjordan0
Keller Jordan
1 month
I'm focusing on time-to-96%. For SGDScheduleFree, my best setting now reaches 95.85% in the same amount of time as a tuned SGD-Nesterov. They're both behind Lookahead, which reaches 96.01% in the same time. Interesting stuff
0
0
9
@kellerjordan0
Keller Jordan
6 months
@katherine1ee This also used to be possible via direct <|endoftext|> token injection (using the string "<|end<|endoftext|>oftext|>", which wasn't properly sanitized until a few months ago)
0
0
6
@kellerjordan0
Keller Jordan
1 year
@BlancheMinerva Anecdotally, I’ve been seeing interesting results with ~100 runs for an industrial model I work on
1
0
8
@kellerjordan0
Keller Jordan
1 month
On both CIFAR-10 and ImageNet, switching to this can* yield a significant training speedup. *It doesn’t help for trainings which don’t much benefit from flipping in the first place, e.g. ImageNet with strong inception cropping augmentation. 3/4
Tweet media one
Tweet media two
1
0
8
@kellerjordan0
Keller Jordan
7 months
@lukemcdermotttt Hm, just tried this (disabled training the middle 4 blocks) and it reduced test accuracy from 85% to 70% Is there any code where something like this still gets close to full accuracy?
1
0
6
@kellerjordan0
Keller Jordan
1 year
@RokoMijic Yes, with the exact same seed it is deterministic. But if even a single weight is slightly perturbed, at initialization, then the final outcome will be totally different.
2
1
8
@kellerjordan0
Keller Jordan
2 years
Although we have interpolated in weight-space, in statistical space the merged neurons look nothing like those of the original networks. To resolve this we developed REPAIR, an algorithm that rescales merged neurons such that their statistics are also interpolated.
Tweet media one
1
1
7
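Heavily hedged sketch of the correction REPAIR aims for: per merged channel, rescale/shift the pre-activations so their mean and std equal the interpolation of the two parents' statistics. The real algorithm's procedure is in the paper; only the target affine map is shown here, and the function name is mine.

```python
def repair_affine(mean_a, std_a, mean_b, std_b, mean_m, std_m, alpha=0.5, eps=1e-5):
    """Per-channel scale/shift that maps the merged net's pre-activation statistics
    (mean_m, std_m) onto the alpha-interpolation of the parents' statistics."""
    target_mean = (1 - alpha) * mean_a + alpha * mean_b
    target_std = (1 - alpha) * std_a + alpha * std_b
    scale = target_std / (std_m + eps)     # multiply merged pre-activations by this...
    shift = target_mean - scale * mean_m   # ...then add this
    return scale, shift

# The (scale, shift) pairs can be folded into the preceding conv's weights and bias,
# restoring healthy neuron-level statistics in the merged network.
```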
@kellerjordan0
Keller Jordan
5 months
@olivierhenaff @gabriel_ilharco We very briefly discussed this effect at ICLR, as a possible approach to DataComp -- empirically the best examples are difficult for the current model while easy for another one. But I didn't know it would work with the other model being smaller than the current one!
0
0
6
@kellerjordan0
Keller Jordan
2 years
In Git Re-Basin (), the authors showed that by aligning the neurons of two separately trained wide ResNets, they could be merged in weight-space without harming performance.
@SamuelAinsworth
Samuel "curry-howard fanboi" Ainsworth
2 years
📜🚨📜🚨 NN loss landscapes are full of permutation symmetries, ie. swap any 2 units in a hidden layer. What does this mean for SGD? Is this practically useful? For the past 5 yrs these Qs have fascinated me. Today, I am ready to announce "Git Re-Basin"!
63
586
3K
1
1
7
@kellerjordan0
Keller Jordan
1 month
That is:
1. On epoch 1, randomly flip each training image with 50% probability (as usual).
2. On epochs {2, 4, 6, …}, flip only those images which were not flipped during epoch 1.
3. On epochs {3, 5, 7, …}, flip only those images which were flipped during epoch 1.
2/4
1
0
7
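A minimal sketch of that schedule, assuming images arrive as an NCHW tensor indexed in a fixed order (the function name and seeding scheme are mine):

```python
import torch

def alternating_flip(images: torch.Tensor, epoch: int, seed: int = 0) -> torch.Tensor:
    """Epoch 1: flip each image with prob 0.5 (fixed by `seed`).
    Even epochs: flip the images NOT flipped in epoch 1.
    Odd epochs: flip the images that WERE flipped in epoch 1."""
    g = torch.Generator().manual_seed(seed)
    flipped_in_epoch1 = torch.rand(len(images), generator=g) < 0.5
    flip_mask = flipped_in_epoch1 if epoch % 2 == 1 else ~flipped_in_epoch1
    mask = flip_mask.to(images.device)
    out = images.clone()
    out[mask] = torch.flip(out[mask], dims=[-1])  # horizontal flip along width
    return out
```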
@kellerjordan0
Keller Jordan
2 months
I decided to annotate this table from the GaLore paper with p-values. It turns out that the only statistically significant difference is on QQP, where LoRA wins. (1/2)
Tweet media one
2
0
7
@kellerjordan0
Keller Jordan
2 years
REPAIR works: after applying it to merged VGG11s, the midpoint performance increases from 73% to 87% (vs 90% for the original networks).
Tweet media one
1
0
7
@kellerjordan0
Keller Jordan
2 months
I think you might get a kick out of this @dcpage3 @karpathy @jeremyphoward
1
0
6
@kellerjordan0
Keller Jordan
2 years
REPAIR was developed as a generalization of the BatchNorm reset, but it turns out that it even helps with the original LayerNorm case as well.
Tweet media one
1
0
6
@kellerjordan0
Keller Jordan
8 months
@jhong It won't be compensated by more EECS students, as the (proximal) issue was the EECS dept running out of instructional budget
1
0
2
@kellerjordan0
Keller Jordan
7 months
The worst part of "playing around with CIFAR"-type AI research is that it's mostly sitting around waiting for 30s-5min tqdms
2
0
5
@kellerjordan0
Keller Jordan
1 year
I wish that the following was viable as a unit of research in ML: A single isolated experiment, which is well-documented and reproducible, that clearly calls into question the results or takeaway-story of an existing paper. Could call this a “polemical experiment”
0
0
5
@kellerjordan0
Keller Jordan
1 year
Open question: Why is my finetuned LM latent space full of these manifolds? Steps to repro:
1. Get pretrained encoder (like BERT)
2. Finetune on binary classification task with at least 1M examples
3. ???
4. Manifolds
1
0
5
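A hedged sketch of steps 1-2 above, plus one way to look at the latent space afterwards; the model choice and helper calls are illustrative, not the original setup.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Step 1: pretrained encoder.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Step 2: finetune on a binary classification task with >= 1M examples
# (e.g. with transformers.Trainer or a plain PyTorch loop; omitted here).

# Afterwards: pull [CLS] embeddings for a batch of inputs and project them
# (PCA / UMAP) to inspect the latent-space structure.
inputs = tok(["an example sentence", "another example"], return_tensors="pt", padding=True)
with torch.no_grad():
    cls_vectors = model.bert(**inputs).last_hidden_state[:, 0]
```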
@kellerjordan0
Keller Jordan
2 years
But there was a catch: it turned out that this only worked well using ResNets where the BatchNorm layers had been replaced with LayerNorm.
Tweet media one
@kellerjordan0
Keller Jordan
2 years
Along with many others, I find the results of Git Re-Basin by @SamuelAinsworth , J. Hayase & @siddhss5 highly interesting. But I believe there is a crucial detail which deserves attention: The authors replace BatchNorm with LayerNorm in their ResNet and VGG implementations. 1/14
6
55
331
1
0
5
@kellerjordan0
Keller Jordan
1 year
@BlackHC Thanks, interesting & I will add citation!
1
0
5
@kellerjordan0
Keller Jordan
2 months
@moinnadeem
Moin Nadeem
10 months
What are the improvements over a standard Transformer? - Gated Linear Units (Shazeer, 2020) - Multi-Query Attention (Shazeer, 2019) - Mixture of Experts (Shazeer, 2017) Wait, is it just me, or am I noticing a pattern? 😉
19
29
451
1
0
5
@kellerjordan0
Keller Jordan
2 months
@utksinghal Thanks! Took almost 800 experiments lol. But fortunately I can run them faster the better it gets. I built on @hi_tysam 's record which was already a fast 6.3 seconds. The most nontrivial part is the alternating flip augmentation method. sub-1s would require some crazy innovation
Tweet media one
0
0
5
@kellerjordan0
Keller Jordan
2 years
But there’s still a more difficult case left: What about networks which don’t use normalization layers at all? As in the BatchNorm case, these also cannot be effectively merged using only neuron alignment.
Tweet media one
1
0
5
@kellerjordan0
Keller Jordan
2 months
@BlackHC I haven’t forgotten. I’m finishing the camera-ready version for ICLR right now and will add citation.
1
0
3
@kellerjordan0
Keller Jordan
1 month
@aaron_defazio Thanks for the reply! - Ok I removed the wd*batch_size scaling (it was to deal with using sum accumulation in the loss). - And I switched to wd=0.0001 and lr=10. => Yields 93.7%, much more reasonable. And when I put the wd back up to 0.0005 => 94.5% which is excellent.
2
0
4
@kellerjordan0
Keller Jordan
2 months
@hippopedoid @YouJiacheng Empirically, they don't have constant variance, but rather stddev in a range of something like [0.2, 2.0]. Which, I think, is what explains the increased thickness of the empirical lines compared to both of our synthetic lines.
0
0
2
@kellerjordan0
Keller Jordan
7 months
@maxzimmerberlin Can confirm. This is also the same test accuracy one gets from a linear model (45%) - coincidence?
1
0
4
@kellerjordan0
Keller Jordan
2 years
Why not? To find out, let’s take a look inside of a merged VGG11 network. There’s a clear problem: the neuron-level statistics have decayed, with most neurons’ stddevs decreasing by more than 5x.
Tweet media one
1
0
4
@kellerjordan0
Keller Jordan
2 months
@BlackHC Is there a specific sentence/figure in DDU that you would point to regarding the good ensemble uncertainty -> variance connection ?
1
0
3
@kellerjordan0
Keller Jordan
1 year
@bitcloud @RokoMijic What you have said is correct
1
0
4
@kellerjordan0
Keller Jordan
1 year
@BlackHC Imo this is mostly about the strong data augmentations in CV. Remove those, and it becomes suboptimal to train for so long
0
0
4
@kellerjordan0
Keller Jordan
2 months
A simple experiment to test this would be to train the 130M size model on all 13.1B tokens. Does this make the gap between full-rank and GaLore grow, or shrink?
@kellerjordan0
Keller Jordan
2 months
I'm not even remotely experienced in this area, but what stuck out to me is the small number of tokens, i.e., that we are in the over-parametrized regime. I'm curious if it scales to the under-parametrized regime, more like what LLaMa 7B was trained at!
Tweet media one
0
0
3
0
0
4
@kellerjordan0
Keller Jordan
8 months
@jhong lol, My understanding is that EECS is fine teaching classes for L&S students, but needs more money from central admin, especially after a recent union decision that increased what they need to pay tutors and TAs. So the political Q becomes why central admin won't give it to them
1
0
2
@kellerjordan0
Keller Jordan
2 months
This level of variance can be explained by calibrated finite-sample variation alone. There are 682 test examples with error ~= 40%, so a calibration-theoretic analysis [1] gives stderr = sqrt(0.4 / (2*682)) = 1.7%, which is around what we see.
@clefourrier
Clémentine Fourrier 🍊 is off
2 months
Did you know that, if you evaluate the same model, with the same prompt formatting & the same fixed few-shot examples, only changing ♻️the order in which the few shot examples are added to the prompt ♻️ you get a difference of up to 3 points in evaluation score?
Tweet media one
13
31
149
1
0
4
@kellerjordan0
Keller Jordan
4 months
... Otherwise, it's more likely that the test set is just too small. The latter case can be especially common in NLP where eval sets are sometimes <1000 examples.
1
0
3
@kellerjordan0
Keller Jordan
4 months
I've found that this formula can be a good heuristic for deciding whether a training pipeline has instability. If I've run the pipeline a few times and observed variance much above err/2n, then I can say yes - it really is unstable, and now I can work on figuring out why. ...
1
0
3
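A hedged helper implementing that check; the function name and the "much above" factor are mine, not from the paper.

```python
import numpy as np

def looks_unstable(error_rates, num_test_examples, factor=3.0):
    """Compare the observed between-run variance of the test error rate against the
    calibration bound mean(err) / (2n); flag the pipeline if it exceeds that by `factor`."""
    error_rates = np.asarray(error_rates, dtype=float)
    observed_var = error_rates.var(ddof=1)
    bound = error_rates.mean() / (2 * num_test_examples)
    return observed_var > factor * bound, observed_var, bound

# e.g. five repeated runs evaluated on a 10,000-example test set:
flagged, obs_var, var_bound = looks_unstable([0.062, 0.071, 0.059, 0.080, 0.066], 10_000)
```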
@kellerjordan0
Keller Jordan
2 months
I'm not even remotely experienced in this area, but what stuck out to me is the small number of tokens, i.e., that we are in the over-parametrized regime. I'm curious if it scales to the under-parametrized regime, more like what LLaMa 7B was trained at!
Tweet media one
@AnimaAnandkumar
Prof. Anima Anandkumar
2 months
For the first time, we show that the Llama 7B LLM can be trained on a single consumer-grade GPU (RTX 4090) with only 24GB memory. This represents more than 82.5% reduction in memory for storing optimizer states during training. Training LLMs from scratch currently requires huge
48
389
2K
0
0
3