Matthew Leavitt

@leavittron

Followers: 2,232 · Following: 786 · Media: 185 · Statuses: 2,360

Chief Science Officer, Co-Founder @datologyai . Former: Head of Data Research @MosaicML ; FAIR. 🧠 and 🤖 intelligence // views are from nowhere

The Bay
Joined March 2011
Pinned Tweet
@leavittron
Matthew Leavitt
3 months
The next 10x in deep learning efficiency gains are going to come from intelligent intervention on training data. But tools for automated data curation at scale didn’t exist—until now. I’m so excited to announce that I’ve co-founded @DatologyAI , with @arimorcos and @hurrycane
11
16
126
@leavittron
Matthew Leavitt
11 months
As a neuroscientist imma call bullshit on this. All these "mind reading" techniques rely on an fmri scanner: a multimillion dollar, 10000lb+ machine that requires a purpose-built facility and you have to lie perfectly still in it for it to work. Nobody's stealing your thoughts
@0zne
Enzo Avigo
11 months
We’re basically done.
401
3K
12K
134
135
1K
@leavittron
Matthew Leavitt
1 year
v excited to finally announce our new work that formalizes one of the most effective practices for training LLMs—something that many industry leaders have conspicuously avoided discussing
Tweet media one
19
97
895
@leavittron
Matthew Leavitt
6 months
Tweet media one
7
27
420
@leavittron
Matthew Leavitt
7 months
There are like 5 people in all of deep learning who have actually looked at the pretraining data that a 7B+ model has been trained on (and three of them went mad)
11
12
291
@leavittron
Matthew Leavitt
1 year
By now you may have seen some hubbub about @MosaicML ’s MPT-7B series of models: MPT-7B base, MPT-7B-Instruct, MPT-7B-Chat, and MPT-7B-StoryWriter-65k+. These models were pretrained on the same 1T token data mix. In this 🧵I break down the decisions behind our pretraining data mix
Tweet media one
8
53
258
@leavittron
Matthew Leavitt
6 months
A few (somewhat data-centric) thoughts on the Gemini whitepaper 🧵: Can't be much more direct than this: "We find that data quality is critical to a highly-performing model". It feels especially true cuz they provide next to no information on the training data.
Tweet media one
3
20
187
@leavittron
Matthew Leavitt
7 months
It seems likely to me that Mistral 7B's quality comes from its data. You know, the thing they provide exactly zero information about. The sliding window attention is a red herring.
@AlphaSignalAI
𝐋𝐈𝐎𝐑⚡
7 months
Mistral just released the paper behind their impressive LLM: Mistral 7B. The model outperforms Llama2 13B on every benchmark. Architecture:
- Uses Grouped-query attention (GQA) for faster inference
- Uses Sliding Window Attention (SWA) to handle longer sequences at smaller
Tweet media one
7
11
92
11
9
174
@leavittron
Matthew Leavitt
1 year
s/o to @danielking36 for the exceptional title. We also considered "Training on the test set is all you need", "The Unreasonable Effectiveness of Training on the Test Set", and "Intriguing Properties of Training on Test Data"
1
2
156
@leavittron
Matthew Leavitt
4 years
Class selectivity is often used to interpret the function of individual neurons. @arimorcos and I investigated whether it’s actually necessary and/or sufficient for deep networks to function properly. Spoiler: it’s mostly neither. (1/10)
Tweet media one
6
34
108
@leavittron
Matthew Leavitt
7 months
Tweet media one
5
4
105
@leavittron
Matthew Leavitt
1 year
This is a red herring, of course. What everyone really wants to know (and what W&B will certainly keep as a close secret) is the Best Seed. Publicizing this seed would not only give away their competitive advantage, but also violate US Arms Control Laws.
@weights_biases
Weights & Biases
1 year
The average learning_rate logged to W&B in 2022 was 0.016
25
38
793
11
4
97
@leavittron
Matthew Leavitt
1 year
This was a huge headache in the early days of @MosaicML , so we built our tooling to seamlessly handle GPU failures. Our platform will detect a faulty node, pause training, cordon the node, sub in a spare, and resume from the most recent checkpoint. All w/o any human intervention
@ID_AA_Carmack
John Carmack
1 year
Hardware failures are common while training the largest machine learning models across thousands of GPUs. It is similar to the elder days of computers, when a vacuum tube burning out during your batch computation was a real issue.
53
102
2K
3
5
89
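The fault-handling flow described in the tweet above (detect a faulty node, pause, cordon it, swap in a spare, resume from the last checkpoint) can be pictured as a simple control loop. The sketch below is a toy illustration only, not MosaicML's platform code; every class and helper in it is hypothetical.

```python
# Toy control loop: detect a faulty node, pause, cordon it, sub in a spare,
# and resume from the most recent checkpoint -- no human in the loop.
import random

class Node:
    def __init__(self, name):
        self.name, self.healthy, self.cordoned = name, True, False

def health_check(node):
    # Stand-in for real diagnostics (ECC errors, NCCL timeouts, XID events, ...)
    return node.healthy and random.random() > 1e-4

def train_segment(start_step, n_steps, nodes):
    for step in range(start_step, start_step + n_steps):
        bad = [nd for nd in nodes if not health_check(nd)]
        if bad:
            return step, bad                     # pause training on failure
    return start_step + n_steps, []

def run(total_steps, nodes, spares, checkpoint_every=100):
    last_ckpt = 0
    while last_ckpt < total_steps:
        step, bad = train_segment(last_ckpt, checkpoint_every, nodes)
        if bad:
            for nd in bad:                       # cordon and replace faulty nodes
                nd.cordoned = True
                if not spares:
                    raise RuntimeError("out of spare nodes")
                nodes[nodes.index(nd)] = spares.pop()
            # resume from the most recent checkpoint (last_ckpt unchanged)
        else:
            last_ckpt = step                     # "save" a checkpoint
    return last_ckpt

nodes = [Node(f"gpu-node-{i}") for i in range(8)]
spares = [Node(f"spare-{i}") for i in range(4)]
print("finished at step", run(1000, nodes, spares))
```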
@leavittron
Matthew Leavitt
11 months
Celebrate GPU Independence Day! My colleagues at @MosaicML just showed how simple it is to train on AMD. The real kicker here is switching between AMD and NVIDIA in a single training run
@abhi_venigalla
Abhi Venigalla
11 months
And yes, you can switch back and forth between NVIDIA and AMD, even within a single training run. It's Christmas in July!🎄
Tweet media one
9
46
426
1
15
91
@leavittron
Matthew Leavitt
2 years
Me and my talented colleagues at @MosaicML made ResNet50 go brrrrrr. We devised three training recipes for a vanilla ResNet-50 architecture that are up to 7x faster than other baselines. We didn't even sweep hparams extensively. And it's plain PyTorch. @jefrankle has the scoop:
@jefrankle
Jonathan Frankle
2 years
Introducing the *Mosaic ResNet*, a new take on a CV workhorse that sets SOTA for efficiency at any ImageNet accuracy. The recipe uses 12 techniques that change the math of training for a 7x speedup over standard baselines + up to 3.8x over the latest work.
7
69
368
3
11
89
@leavittron
Matthew Leavitt
1 year
@typedfemale You may not like it, but this is what peak performance looks like
Tweet media one
1
3
83
@leavittron
Matthew Leavitt
3 years
Vision Transformers: Acronyms Are All You Need
Tweet media one
3
8
79
@leavittron
Matthew Leavitt
11 months
As Head of the Data Research Team at @MosaicML , I cannot think of an acquirer I'd be more excited about
@NaveenGRao
Naveen Rao
11 months
Today we’re announcing plans for @MosaicML to join forces with @databricks ! We are excited at the possibilities for this deal including serving the growing number of enterprises interested in LLMs and diffusion models.
58
66
665
5
2
81
@leavittron
Matthew Leavitt
4 years
I'm going to take this opportunity to recommend that everyone read Paul Cisek's 1999 paper "Beyond the computer metaphor: Behaviour as interaction", which presages many of the contemporary discussions about the necessity of embodiment for overcoming limitations in deep learning
@tyrell_turing
Blake Richards
4 years
@dileeplearning Nah man, see the tweet I quoted. Most people think it is a metaphor, cause they think computer == Von Neumann machine.
2
0
8
3
13
71
@leavittron
Matthew Leavitt
6 months
TFW cosmic rays ruin your training run. To be fair, most SDC events probably aren't due to cosmic rays, but it's fun to think about the universe extending a glittering tendril into the delicate gears of your trainer and whispering "nope".
Tweet media one
3
6
73
@leavittron
Matthew Leavitt
3 years
@arimorcos and I are excited to announce our position paper, Towards falsifiable interpretability research, is part of #NeurIPS2020 @MLRetrospective ! We argue for the importance of concrete, falsifiable hypotheses in interpretability research. Paper: (1/8)
3
6
66
@leavittron
Matthew Leavitt
1 year
New LR schedule just dropped
@MaxGhenis
Max Ghenis
1 year
Friends, colleagues, may I present to you: the California Marginal Tax Rate Schedule
Tweet media one
68
300
3K
1
2
64
@leavittron
Matthew Leavitt
6 years
My question is no longer rhetorical: Let's get data on this. If you or someone you know was prevented from attending SfN by the travel ban, please fill out this form: . I want (everyone) to know exactly how much damage this policy is causing
1
67
56
@leavittron
Matthew Leavitt
11 months
@marcbeaupre You need to generate 3-7T of magnetic field strength, which requires a large magnet, lots of power, and helium cooling. I dunno what the physical limits are on magnet size for field generation; also power consumption/dissipation seem like big issues
5
2
60
@leavittron
Matthew Leavitt
3 years
@neuroecology Totally forgot about this one until today
Tweet media one
1
16
55
@leavittron
Matthew Leavitt
3 years
Now that we're out of stealth I'm very excited I can announce I'm a Research Scientist at @MosaicML . We help the ML community burn less money by training models more efficiently. There's a lot of fascinating research and engineering that enables this. And we're hiring 😀
@DbrxMosaicAI
Databricks Mosaic Research
3 years
Hello World! Today we come out of stealth to make ML training more efficient with a mosaic of methods that modify training to improve speed, reduce cost, and boost quality. Read our founders' blog by @NaveenGRao @hanlintang @mcarbin @jefrankle (1/4)
Tweet media one
7
41
164
4
3
54
@leavittron
Matthew Leavitt
9 months
This is why I pushed @MosaicML to create a Data Research Team last year (and @jefrankle recognized the value and made it happen)
@omarsar0
elvis
9 months
From the papers that I've read on LLMs in the past 6 months, one thing is clear: higher data quality will be key to keep pushing progress. Lots of companies and researchers keep innovating and implementing ways to improve data quality in all areas ranging from finetuning LLMs
12
35
231
1
1
54
@leavittron
Matthew Leavitt
10 months
Very cool to see what is essentially SemDedup () work for fine-tuning data
@abacaj
anton
10 months
Getting good results by filtering some public datasets. You'll find lots of duplicates. Filter by instruction similarity score > .95 (cosine) using e5-large-v2. After filtering sort the dataset by instruction length ascending order, this gave best loss + benchmark scores
Tweet media one
10
29
208
1
6
49
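For readers curious what the recipe in the quoted tweet looks like in practice, here is a rough sketch: embed the instructions, drop any example whose cosine similarity to an already-kept example exceeds 0.95, then sort by instruction length. It assumes the sentence-transformers package and is not the author's actual pipeline.

```python
# Rough sketch of near-duplicate filtering for fine-tuning data, in the spirit
# of the quoted recipe (e5-large-v2 embeddings, cosine threshold 0.95).
import numpy as np
from sentence_transformers import SentenceTransformer

def dedup_and_sort(examples, threshold=0.95):
    model = SentenceTransformer("intfloat/e5-large-v2")
    instructions = [ex["instruction"] for ex in examples]
    # normalize_embeddings=True makes dot products equal to cosine similarity
    emb = model.encode(instructions, normalize_embeddings=True)
    sims = emb @ emb.T                      # pairwise cosine similarities
    keep = []
    for i in range(len(examples)):
        if all(sims[i, j] <= threshold for j in keep):
            keep.append(i)                  # keep only sufficiently novel examples
    kept = [examples[i] for i in keep]
    # sort by instruction length, ascending, as in the quoted recipe
    return sorted(kept, key=lambda ex: len(ex["instruction"]))

data = [{"instruction": "Summarize this article."},
        {"instruction": "Summarize this article"},   # near-duplicate
        {"instruction": "Translate the sentence to French."}]
print(dedup_and_sort(data))
```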
@leavittron
Matthew Leavitt
11 months
Also, most importantly, these studies aren't decoding endogenously generated signals, they're reconstructing WHAT THEY ARE CURRENTLY SHOWING YOU
@josueortc
Josué Ortega Caro
11 months
@NeuroStats @leavittron Also it’s stimulus reading not mind reading.
1
0
5
6
3
44
@leavittron
Matthew Leavitt
1 year
Are LLMs hugely overhyped? Yes: just look at the cryptobros jumping on the bandwagon and the meaningless AI references in companies' copy and strategy. Flash in the pan? No. This tech is going to get integrated into everything.
4
0
43
@leavittron
Matthew Leavitt
6 months
Unfortunately A/B testing is tough: it requires lots of subjects and/or well-defined use patterns. In lieu of that, my favorite eval method is "find someone who has spent way too much time using way too many models and ask them to do a vibe check". Reviewers don't love this tho.
Tweet media one
2
5
45
@leavittron
Matthew Leavitt
1 year
Tweet media one
@arankomatsuzaki
Aran Komatsuzaki
1 year
To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis. Studies what happens if we train LLMs on repeated data and how we can alleviate multi-epoch degradation.
Tweet media one
2
19
128
0
0
39
@leavittron
Matthew Leavitt
11 months
@marcbeaupre Overall, I suspect miniaturization would require a massive breakthrough in materials science. The tech has already been around for 30+ years. I'd sooner bet on a different brain imaging modality than fmri miniaturization, but I'm also not an fmri expert
3
0
33
@leavittron
Matthew Leavitt
8 months
iykyk
Tweet media one
6
0
37
@leavittron
Matthew Leavitt
1 year
C4 Part 2: Multiepoch pretraining isn’t really a thing in NLP because…tradition? Superstition? Our initial experiments showed it's actually totally fine for ≤8 epochs (more experiments to come!), so we trained on our SemDedup’d C4 for 2.98 epochs (299B tokens)
7
2
37
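Quick arithmetic on the numbers in the tweet above (a back-of-envelope check, not a disclosed figure): 2.98 epochs over 299B tokens implies a deduplicated corpus of roughly 100B tokens.

```python
# 299B tokens seen over 2.98 epochs -> implied corpus size
tokens_seen, epochs = 299e9, 2.98
print(f"deduplicated C4 ≈ {tokens_seen / epochs:.3g} tokens")  # ≈ 1.0e11 (~100B)
```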
@leavittron
Matthew Leavitt
3 years
Very excited to announce that I've joined @hanlintang and @NaveenGRao in their quest to make ML more efficient!
3
1
37
@leavittron
Matthew Leavitt
6 months
The Gemini whitepaper also emphasizes the importance of training the tokenizer on a “large sample” of the dataset. IMO tokenizers as a vector for model improvement are vastly underexploited. Data curation and tokenization both suffer because researchers overlook data.
Tweet media one
1
1
35
@leavittron
Matthew Leavitt
6 months
Tweet media one
1
2
33
@leavittron
Matthew Leavitt
3 years
Related question that @KordingLab and I have: is there a literature or materials on how to build strong hypotheses, esp in neuroscience? Most philosophical work we're familiar with is too abstract/meta to feel practical, esp as part of a graduate curriculum
@KordingLab
Kording —-& Lab 🦖
3 years
Research on interpreting units in artificial neural networks fails to be falsifiable. And just about everything that Matt Leavitt and @arimorcos say about the problem in ANNs is a problem in neuroscience.
7
41
172
3
6
31
@leavittron
Matthew Leavitt
11 months
@ItsMrMetaverse I actually escaped my neuroscience lock-up (they let us do that once in a while) and have been doing ML research for the last four years. But as a Metaverse Expert and t-shirt merchant you seem uniquely qualified to evaluate the trustworthiness of my statements about neuro and ML
3
2
34
@leavittron
Matthew Leavitt
11 months
@jbensnyder This is an excellent point. All the studies I'm familiar with require training data for each individual, which is another limitation
6
0
33
@leavittron
Matthew Leavitt
6 months
Despite Gemini explicitly acknowledging the importance of data quality, I’m sure ML twitter will keep perseverating on the importance of architecture choices like the “efficient attention mechanisms” that the report also mentions
Tweet media one
1
3
32
@leavittron
Matthew Leavitt
5 years
Congratulations to @tyrell_turing for winning the @CAN_ACN Young Investigator Award for 2019! It must have been very challenging to pick from all the amazing young Canadian PIs. Thanks to Blake & everyone who makes Canada's neuroscience community so wonderful to be a part of!
3
2
28
@leavittron
Matthew Leavitt
1 year
@finbarrtimbers At @MosaicML we did it with Alibi + FlashAttention + 80gb A100s. No secret sauce, just well-vetted research. Shout-out to @OfirPress and @tri_dao for their great methods!
1
2
32
@leavittron
Matthew Leavitt
1 year
A summary and a few thoughts on SlimPajama 🧵
@CerebrasSystems
Cerebras
1 year
📣 New dataset drop! Introducing SlimPajama-627B: the largest extensively deduplicated, multi-corpora, open-source dataset for training large language models. 🧵
Tweet media one
14
191
685
1
6
29
@leavittron
Matthew Leavitt
4 years
QUARANTINE DOG UPDATE: We were out of hot dog buns, so we added some to the Costco order. Costco didn't have buns, so they substituted...3 DOZEN HOT DOGS. We now have 52 hot dogs and no buns. But I think this is the madness we all came here for.
Tweet media one
3
0
28
@leavittron
Matthew Leavitt
11 months
To those saying "but what about the inexorable march of technological progress"
@leavittron
Matthew Leavitt
11 months
@paulg I'm not saying it can never happen, just that it's probably not worth worrying about atm due to the logistics of generating the strength of magnetic field needed to do it.
1
1
16
1
2
30
@leavittron
Matthew Leavitt
7 months
The next 10x in efficiency gains will be from data curation
@abacaj
anton
7 months
What's next for LLMs? Just go big? More data more parameters? Seems like maybe this path will be exhausted soon or too expensive
97
13
297
3
4
30
@leavittron
Matthew Leavitt
1 year
Tweet media one
2
0
26
@leavittron
Matthew Leavitt
1 year
Next up: C4. Our initial exps showed C4 just performed _really_ well. But we wanted to push it! We used SemDedup (ty @arimorcos ' group) to remove the 20% most similar documents within C4, which was consistently :thumbsup: in our exps
@arimorcos
Ari Morcos
1 year
Web-scale data has driven the incredible progress in AI but do we really need all that data? We introduce SemDeDup, an exceedingly simple method to remove semantic duplicates in web data which can reduce the LAION dataset (& train time) by 2x w/ minimal performance loss. 🧵👇
Tweet media one
7
59
310
1
2
27
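A simplified sketch of the SemDedup-style pruning described in the tweet above: cluster document embeddings, then drop the most semantically redundant documents within each cluster until ~20% of the corpus is removed. This illustrates the idea only; it is not the paper's reference implementation, and the embeddings here are random stand-ins for real document embeddings.

```python
# Cluster embeddings, score each document by its max similarity to another
# member of its cluster, then remove the most redundant ~20%.
import numpy as np
from sklearn.cluster import KMeans

def semdedup(embeddings, remove_frac=0.20, n_clusters=10):
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(emb)
    scores = np.zeros(len(emb))            # max similarity to another doc in the cluster
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        if len(idx) < 2:
            continue
        sims = emb[idx] @ emb[idx].T
        np.fill_diagonal(sims, -1.0)
        scores[idx] = sims.max(axis=1)
    n_remove = int(remove_frac * len(emb))
    drop = np.argsort(-scores)[:n_remove]  # most redundant documents
    return np.setdiff1d(np.arange(len(emb)), drop)

rng = np.random.default_rng(0)
fake_doc_embeddings = rng.normal(size=(1000, 64))   # stand-in for real embeddings
print(len(semdedup(fake_doc_embeddings)), "of 1000 documents kept")
```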
@leavittron
Matthew Leavitt
4 years
How do easily interpretable neurons affect CNN performance? In a new blog post, @arimorcos and I summarize our recent work evaluating the causal role of selective neurons: easily interpretable neurons can actually impair performance!
2
5
25
@leavittron
Matthew Leavitt
4 years
Quarantine Dog #1 : double dog, Tillamook cheddar, sauerkraut, avocado, pickled ginger, Japanese mayo, yuzu kosho.
Tweet media one
4
0
26
@leavittron
Matthew Leavitt
2 years
One of my favorite parts of our blog post announcing the @MosaicML ResNet Recipes () is the recipe card, designed by the talented @ericajiyuen . BTW these times are for 8x-A100
Tweet media one
0
4
26
@leavittron
Matthew Leavitt
6 months
Gemini also continues the trend of training small models for looonger. As deep learning models transition from research artifact to production necessity, inference costs are going to increasingly dominate the economics. Llongboi just keeps getting llonger:
@NaveenGRao
Naveen Rao
1 year
Ok, for those wondering about the origin of our nickname "Llongboi", here it is. ( @jefrankle got mad at me for putting this in the wild. Once it's free, it's free!)
1
0
22
1
0
26
@leavittron
Matthew Leavitt
5 years
All this talk of neural coding and computation by @RomainBrette @tyrell_turing @andpru @Neuro_Skeptic et al. reminds me to remind everyone to read Paul Cisek's excellent (and imo overlooked) paper "Beyond the Computer Metaphor: Behavior as Interaction"
3
1
24
@leavittron
Matthew Leavitt
2 years
Good compute is a terrible thing to waste, so @abhi_venigalla and I assembled some best practices for efficient CNN training and put them into a blog post.
@DbrxMosaicAI
Databricks Mosaic Research
2 years
New blog post! Take a look at some best practices for efficient CNN training, and find out how you can apply them easily with our Composer library: #EfficientML
1
18
80
0
4
24
@leavittron
Matthew Leavitt
2 years
Hot take inspired by ConvNeXt : Grouped convs are overrated. They're popular bc obsession w/ inference throughput & raw accuracy, disregard for training cost, & FLOPs-hacking. Vanilla convs are pareto-superior unless training is ~free relative to inference
1
1
21
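The FLOPs-hacking point above comes down to simple counting: a grouped convolution's multiply-accumulates are the standard convolution's divided by the number of groups, so grouped convs look great in FLOP tables regardless of what they do to wall-clock training cost. A small worked example using the standard formula (nothing model-specific):

```python
# MACs for a KxK conv on an HxW feature map: H * W * K * K * (C_in / groups) * C_out
def conv_macs(h, w, c_in, c_out, k=3, groups=1):
    return h * w * k * k * (c_in // groups) * c_out

print(conv_macs(56, 56, 256, 256))              # vanilla conv
print(conv_macs(56, 56, 256, 256, groups=32))   # grouped conv: 32x fewer MACs on paper
```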
@leavittron
Matthew Leavitt
5 months
Like @OpenAI , @BuceesUSA offers employees PPUs instead of RSUs and has a capped profit model because the success of their mission will be so transformative to society that it would be unethical for them to capture all of the resulting value
@JohnArnoldFndtn
John Arnold
5 months
At $125k+, the car wash manager at a (very large) gas station in rural Texas makes more than most doctors in Europe.
Tweet media one
644
676
10K
1
3
23
@leavittron
Matthew Leavitt
1 year
Every time, smh
Tweet media one
1
2
22
@leavittron
Matthew Leavitt
6 months
Gemini Ultra training was distributed across datacenters! Model parallel within SuperPods (and datacenters) and data parallel across SuperPods (and datacenters)! This is impressive in part because gradients are notoriously shy and reluctant to leave their home datacenter.
Tweet media one
1
1
22
@leavittron
Matthew Leavitt
4 years
Can't explain why, but wearing a suit to walk my dog during a pandemic makes me feel LESS unhinged. Stay fitted, stay sane 💕😷
Tweet media one
1
0
21
@leavittron
Matthew Leavitt
2 years
Very excited to see @MosaicML used as a baseline, especially by work from @aleks_madry 's lab. It showcases the massive speedups that can be achieved by combining thoughtful modifications to the training algorithm + well-applied systems knowledge. Eagerly anticipating the paper!
@aleks_madry
Aleksander Madry
2 years
ImageNet is the new CIFAR! My students made FFCV (), a drop-in data loading library for training models *fast* (e.g., ImageNet in half an hour on 1 GPU, CIFAR in half a minute). FFCV speeds up ~any existing training code (no training tricks needed) (1/3)
Tweet media one
29
390
2K
0
1
22
@leavittron
Matthew Leavitt
4 years
Excited for my #neuromatch2020 talk, today at 4pmPST/7pmEST/11pmGMT. It's a summary of my recent work with @arimorcos . If you miss free samples at the market, this is the next best thing. Come taste our work & if you like it read the paper!
@leavittron
Matthew Leavitt
4 years
Class selectivity is often used to interpret the function of individual neurons. @arimorcos and I investigated whether it’s actually necessary and/or sufficient for deep networks to function properly. Spoiler: it’s mostly neither. (1/10)
Tweet media one
6
34
108
1
1
20
@leavittron
Matthew Leavitt
4 years
Big thanks to @KordingLab , @bradpwyble , & @neuralreckoning for organizing #neuromatch2020 , @DavideValeriani for moderating my talk, & everyone who asked questions (no idea who, plz say hi if you wish). The experience has been a potent salve for the Coronavirus Blues!
@leavittron
Matthew Leavitt
4 years
Excited for my #neuromatch2020 talk, today at 4pmPST/7pmEST/11pmGMT. It's a summary of my recent work with @arimorcos . If you miss free samples at the market, this is the next best thing. Come taste our work & if you like it read the paper!
1
1
20
1
0
21
@leavittron
Matthew Leavitt
1 year
@code_star , @_BrettLarsen , @iamknighton , and @jefrankle (yes, our Chief Scientist gets his hands dirty) put in a TON, and we couldn’t be happier with how the MPT-7B series of models turned out. And we're just getting started.
2
0
21
@leavittron
Matthew Leavitt
11 months
choosing LLM pretraining data like
Tweet media one
0
0
20
@leavittron
Matthew Leavitt
8 months
~2yrs ago @nsaphra came to my poster & we discussed regularizing to ctrl interpretability. She mentioned a superstar grad student ( @_angie_chen ). Things really got wild when @ziv_ravid joined the party. And @kchonyc graced us w/ wisdom throughout. V excited to finally announce:
@_angie_chen
Angelica Chen
8 months
New work w/ @ziv_ravid @kchonyc @leavittron @nsaphra : We break the steepest MLM loss drop into *2* phase changes: first in internal grammatical structure, then external capabilities. Big implications for emergence, simplicity bias, and interpretability! 🧵
Tweet media one
2
62
351
1
3
20
@leavittron
Matthew Leavitt
4 years
My mom had to cancel the education conference she was organizing 😭 but got v excited when she heard about #neuromatch2020 and wants to organize something similar. @bradpwyble @neuralreckoning @KordingLab @titipat_a et al., do you have resources or a "how-to"? ❤️❤️❤️
3
0
17
@leavittron
Matthew Leavitt
1 year
This is why I went to grad school
@code_star
Cody Blakeney
1 year
The original LLongboi (drawing by @leavittron ) secretly meming this code name into existence is one of my proudest moments at @MosaicML
Tweet media one
2
3
28
2
0
20
@leavittron
Matthew Leavitt
11 months
@JonLamArt Go right ahead! Always happy to chat.
2
0
18
@leavittron
Matthew Leavitt
11 months
It's nuts how often I see slack notifications that we closed a new customer. Those three sales reps, @barrydauber , @mrdrjennings , and @stewartsherpa , are UNSTOPPABLE. Glad to see their hard work being recognized!
2
2
18
@leavittron
Matthew Leavitt
2 years
A haiku for the research scientists, at @hanlintang 's suggestion:
Don't want to be here
Please don't, no kubernetes
So much to live for
@hanlintang
Hanlin Tang
2 years
ML scientist, meet ML infrastructure.
Tweet media one
11
93
1K
0
3
19
@leavittron
Matthew Leavitt
1 year
I agree that not having experience training neural networks/not knowing the math underlying them shouldn't auto-invalidate one's AI takes. But "my AI takes are valid because deep learning doesn't use Real Math" is worse than wrong (more on that below) and weirdly fetishizes math
@ESYudkowsky
Eliezer Yudkowsky ⏹️
1 year
Would-be AI gatekeepers: YoU caN't saY aNythIng abOUt AI unLess yoU - Look, I *remember* when AI used to involve math, maybe not Actual Mathematician Math, but at least nontrivial computer science. Modern deep learning is calculus for bright eleven-year-olds, plus the first
Tweet media one
125
150
2K
1
2
19
@leavittron
Matthew Leavitt
8 months
Very excited to announce that our work received a Spotlight Rejection at @NeurIPSConf #NeurIPS
@leavittron
Matthew Leavitt
1 year
v excited to finally announce our new work that formalizes one of the most effective practices for training LLMs—something that many industry leaders have conspicuously avoided discussing
Tweet media one
19
97
895
0
0
19
@leavittron
Matthew Leavitt
4 years
"I want to see blood. We all want to see blood" - @KordingLab . I've got to say, so far the worst part of #neuromatch2020 so far is that @KordingLab can't spice up the debate by sliding @tyrell_turing a folding chair when Cisek has his back turned.
1
0
19
@leavittron
Matthew Leavitt
2 years
@_arohan_ Funny you should say this. Composer (, @MosaicML 's library for efficient training) has this feature, but it adjusts grad accum instead of batch size, so the math is preserved. We're going to release it and announce it in a blog post very soon.
3
0
17
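A minimal sketch of the idea in the reply above, i.e. keeping the optimization math fixed by adjusting gradient accumulation rather than batch size when memory runs out. This is an illustration in plain PyTorch, not Composer's actual implementation.

```python
# On CUDA OOM, double the number of accumulation micro-steps while keeping the
# global batch fixed, so the effective batch size and gradients are unchanged.
import torch

def train_step(model, opt, batch, grad_accum):
    opt.zero_grad()
    for mb in batch.chunk(grad_accum):               # same global batch, smaller slices
        loss = model(mb).pow(2).mean() / grad_accum  # scale so summed grads match the full batch
        loss.backward()
    opt.step()

def resilient_step(model, opt, batch, grad_accum=1, max_accum=64):
    while True:
        try:
            train_step(model, opt, batch, grad_accum)
            return grad_accum
        except RuntimeError as e:
            if "out of memory" not in str(e) or grad_accum >= max_accum:
                raise
            torch.cuda.empty_cache()
            grad_accum *= 2                          # retry with more accumulation

model = torch.nn.Linear(16, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
global_batch = torch.randn(32, 16)
print("grad_accum used:", resilient_step(model, opt, global_batch))
```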
@leavittron
Matthew Leavitt
9 months
Zack worked his ass off for this paper and the reviewer responses (like he works his ass off for everything). This is extremely disappointing and I think this policy causes more harm than good.
@ZackAnkner
Zack Ankner
9 months
My EMNLP paper got desk-rejected post-rebuttal because I posted it to arxiv 25 minutes after the anonymity deadline. I was optimistic about our reviews, so I spent a whole week while visiting my family writing rebuttals and coding experiments to respond.
3
28
187
0
0
17
@leavittron
Matthew Leavitt
11 months
Thrilled to have contributed to this. And excited to see what the community does with it!
@DbrxMosaicAI
Databricks Mosaic Research
11 months
Meet MPT-30B, the latest member of @MosaicML 's family of open-source, commercially usable models. It's trained on 1T tokens with up to 8k context (even more w/ALiBi) on A100s and *H100s* with big improvements to Instruct and Chat. Take it for a spin on HF!
Tweet media one
17
129
550
1
0
17
@leavittron
Matthew Leavitt
1 year
@finbarrtimbers @MosaicML @OfirPress @tri_dao We pretrained at 2048 then fine-tuned on 65k. We tried generation up to 84k. There are tricks we could use to push it further, but we wanted it to be simple for others to use. Dunno if you saw, but we used it to generate an epilogue to The Great Gatsby:
2
0
18
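For context on how a model pretrained at 2,048 tokens can be fine-tuned and run at ~65k, here is a hedged sketch of the ALiBi bias mentioned above: a fixed, computed (not learned) linear penalty on attention scores that grows with query-key distance, so longer contexts require no new position parameters. This is a simplification, not MPT's actual attention code.

```python
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    # geometric head slopes 2^(-8/n), 2^(-16/n), ... as in the ALiBi paper
    slopes = torch.tensor([2.0 ** (-8.0 * (i + 1) / n_heads) for i in range(n_heads)])
    pos = torch.arange(seq_len)
    dist = (pos[None, :] - pos[:, None]).clamp(max=0)  # key - query, <= 0 in the causal region
    return slopes[:, None, None] * dist[None, :, :]    # [heads, query, key], all <= 0

# The same function works for any length -- 2048 at pretraining, 65k later --
# because the bias is computed on the fly rather than learned.
n_heads, seq_len = 8, 16
scores = torch.randn(n_heads, seq_len, seq_len) + alibi_bias(n_heads, seq_len)
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
attn = torch.softmax(scores.masked_fill(causal_mask, float("-inf")), dim=-1)
print(attn.shape)   # torch.Size([8, 16, 16])
```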
@leavittron
Matthew Leavitt
1 year
Overall, the tools & dataset are great for the community. I'm glad people are realizing that data work is valuable and not dismissing it as low-status. 9 out of 10 pediatricians recommend not feeding your child trash. I hope the ML community soon feels this way about LLMs🤞🤞🤞
0
0
16
@leavittron
Matthew Leavitt
11 months
@paulg I'm not saying it can never happen, just that it's probably not worth worrying about atm due to the logistics of generating the strength of magnetic field needed to do it.
@leavittron
Matthew Leavitt
11 months
@marcbeaupre You need to generate 3-7T of magnetic field strength, which requires a large magnet, lots of power, and helium cooling. I dunno what the physical limits are on magnet size for field generation; also power consumption/dissipation seem like big issues
5
2
60
1
1
16
@leavittron
Matthew Leavitt
1 year
ML conferences will ban submissions using generative LLMs, but they won't ban submissions with the title "x is All You Need" or "Intriguing Properties of x"
4
1
18
@leavittron
Matthew Leavitt
3 years
@andpru @KordingLab Most of what I learned in my PhD was conveyed implicitly, and even the explicit channels were typically code comments or oral history. I had a course on "research conduct", but that was basically "Retraction Watch's Greatest Hits".
1
2
15
@leavittron
Matthew Leavitt
1 year
My man was crazy close. Someone give him a prize. Real numbers are 340B and 7e24 FLOPs. @CNBC doesn't need to wait for leaks, they should just ask @abhi_venigalla .
@abhi_venigalla
Abhi Venigalla
1 year
Alright who wants to try and guess the compute/cost/params for PaLM2-L? No prizes (b/c obv I don't know) but with enough responses we might get a reasonable estimate (which is reward enough 😝) I'll start:
* 6e24 FLOPs
* $22M
* 250B params
paper:
14
8
79
1
2
17
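A back-of-envelope check on those numbers using the standard C ≈ 6·N·D approximation for dense transformer training compute; the implied token count below is just arithmetic, not a disclosed figure.

```python
params = 340e9                       # 340B parameters (from the tweet)
flops = 7e24                         # training compute (from the tweet)
tokens = flops / (6 * params)        # C ≈ 6 * N * D  =>  D ≈ C / (6N)
print(f"implied training tokens ≈ {tokens:.2e}")   # ≈ 3.4e12 (~3.4T tokens)
```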
@leavittron
Matthew Leavitt
1 year
Me when the new scaling laws hit
Tweet media one
0
1
17
@leavittron
Matthew Leavitt
8 months
Tweet media one
0
0
16
@leavittron
Matthew Leavitt
1 year
This data mix was a bit of a hedge, but it seems to have turned out quite well. We're excited about what will happen as we get more scientific and methodical about data. The field overlooks data research, and we're working to fix that.
1
0
16
@leavittron
Matthew Leavitt
2 years
Does this make early stopping analogous to eating veal?
@ilyasut
Ilya Sutskever
2 years
it may be that today's large neural networks are slightly conscious
453
562
3K
2
1
15
@leavittron
Matthew Leavitt
1 year
Big shout out to @CerebrasSystems for building the tools and dataset and releasing both. Very glad that data work is getting the attention it needs. Though I don't see the tools anywhere on your github. Am I looking in the right place?
@CerebrasSystems
Cerebras
1 year
📣 New dataset drop! Introducing SlimPajama-627B: the largest extensively deduplicated, multi-corpora, open-source dataset for training large language models. 🧵
Tweet media one
14
191
685
1
0
16
@leavittron
Matthew Leavitt
1 year
Next up: RedPajama is @togethercompute ’s commendable attempt to recreate the LLaMa data. Many of their sources (e.g. Wikipedia, StackExchange, arXiv) are already available as ready-to-use datasets elsewhere, but RedPajama contains data through 2023—the freshness is appealing
1
0
15
@leavittron
Matthew Leavitt
4 years
Studying neuroscience doesn't make you a neuroscientist. Actual neuroscientists...
- draw the Felleman and Van Essen diagram from memory
- compute XOR in single astrocytes
- reconstruct detailed biographies from c-Fos levels
- optogenetically induce consciousness in macaques
@emmunologie
Erin
4 years
Studying Immunology doesn’t make you an Immunologist. Actual Immunologists...
- know the name and function of every single cytokine in existence
- never clog the cytometer
- love both T and B cells equally
- are immune to all diseases
9
22
174
0
1
15
@leavittron
Matthew Leavitt
6 months
One very relevant consequence of token budgets increasing is that the need for data curation also increases! The quantity (and possibly even proportion 😱) of redundant, noisy, and misleading examples increases with the size of your dataset!
1
0
16
@leavittron
Matthew Leavitt
11 months
@SpiderMonkeyXYZ I'm familiar with the study. It's great research! What I'm calling bullshit on is the idea that "your thoughts aren't safe" or that you should be concerned about someone stealing your dreams
1
1
13
@leavittron
Matthew Leavitt
6 months
Would be great for someone to build some data curation tools suited to contemporary pretraining practices
@XueFz
Fuzhao Xue
6 months
Great work! Once again, it highlights the implicit repetition of training tokens. While the Chinchilla law is commendable, it's clear it won't endure indefinitely. As models grow larger, the Language Model (LLM) assimilates knowledge from implicitly repeated tokens. This is
0
9
67
1
0
15
@leavittron
Matthew Leavitt
1 year
@ruthhook_ Maybe "intelligent" people just introspect more, complain more, or have more medical care
0
0
13
@leavittron
Matthew Leavitt
1 year
Bad news: that's just a dummy model being used to test our new hardware. Good news: our new hardware is H100s
@abacaj
anton
1 year
New MPT-30B sighting?
6
2
53
1
0
15
@leavittron
Matthew Leavitt
3 years
I contributed to this and it feels good. Use it for SSL + transformers, then tag me in the github issue if you run into problems!
@PyTorch
PyTorch
3 years
Introducing VISSL () - a library for reproducible, SOTA self-supervised learning for computer vision! Over 10 methods implemented, 60 pre-trained models, 15 benchmarks, and counting.
Tweet media one
10
257
1K
0
0
15
@leavittron
Matthew Leavitt
3 years
TFW a full research team at Google scoops your grad project. At least you know it was a good idea!
@_akhaliq
AK
3 years
Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet pdf: github: transformer-style networks without attention layers make for surprisingly strong image classifiers
Tweet media one
4
63
332
0
0
15
@leavittron
Matthew Leavitt
1 year
My soccer team composed entirely of coaches will be unbeatable
@AlexReibman
Alex Reibman 🖇️
1 year
Tough news for FAANG engineers: Startups don’t want you anymore Every founder I’ve spoken to is tired of them. Instead, they’re hiring ex-founders— engineers who value ownership and grinding to win
191
111
2K
0
0
15
@leavittron
Matthew Leavitt
4 years
@PsychScientists Coffee is actually very high-dimensional and this is what happens when you project it into two dimensions. It actually goes quite nicely with the Swiss Roll problem. Some people recommend using tea-SNE, but it's just not the same.
0
0
14
@leavittron
Matthew Leavitt
3 years
I'd say Harvard won the lottery here, but I think the real beneficiaries are everyone who gets to work with you. Congratulations, Jonathan!
@jefrankle
Jonathan Frankle
3 years
I guess the word is out! I'll be joining the @Harvard faculty in the fall of 2023 as part of an amazing cohort of new machine learning professors. Looking forward to sharing more about my lab, how to join, and everything we're building at @hseas when I'm a bit closer to arriving!
37
12
407
0
1
14