Davis Blalock

@davisblalock

Followers: 15K · Following: 368 · Media: 494 · Statuses: 1K

Research scientist @GoogleDeepMind. Past: @Databricks, first hire @MosaicML, @MIT PhD. I post about AI technical progress + sometimes the business side.

San Francisco, CA
Joined December 2016
@davisblalock
Davis Blalock
3 years
I've written about 500+ machine learning papers in the past year. Here are some of my most popular threads: [1/n]
14
72
402
@leavittron
Matthew Leavitt
3 months
The era of "The Era of Pretraining is Over" is over
@pratyushmaini
Pratyush Maini
3 months
1/Pretraining is hitting a data wall; scaling raw web data alone leads to diminishing returns. Today @datologyai shares BeyondWeb, our synthetic data approach & all the learnings from scaling it to trillions of tokens🧑🏼‍🍳
- 3B LLMs beat 8B models🚀
- Pareto frontier for performance
4
7
53
@jefrankle
Jonathan Frankle
3 months
Not that I have a favorite recent project, but... 🧵 LLM judges are the popular way to evaluate generative models. But they have drawbacks. They're:
* Generative, so slow and expensive.
* Nondeterministic.
* Uncalibrated. They don't know how uncertain they are.
Meet PGRM!
@alexrtrott
Alex Trott
3 months
Ever wonder what it'd look like if an LLM Judge and a Reward Model had a baby? So did we, which is why we created PGRM -- the Prompt-Guided Reward Model. TLDR: You get the instructability of an LLM judge + the calibration of an RM in a single speedy package (1/n)
4
15
77
@davisblalock
Davis Blalock
4 months
Anyway, I'm not a lawyer and I make no claims about what should happen with AI and public data, but I hope this thread helped you understand a bit about what is happening. Further disclaimer: all of this is general information that hundreds, maybe thousands of people inside the
17
11
641
@davisblalock
Davis Blalock
4 months
A consequence of this caring-about-rules spectrum is that there's a "compliance gap" in many evals. If you have an idea where a given company (or even academic lab) sits, you should be either more or less impressed by their numbers.
1
7
441
@davisblalock
Davis Blalock
4 months
Legal limbo also means that both compliant and non-compliant actors look the same from the outside. The non-compliant say nothing about their data because they did violate rules, and the compliant say nothing because they only incur risk by saying anything.
1
14
560
@davisblalock
Davis Blalock
4 months
This legal ambiguity gives rise to hilarious situations like technical leaders saying they "can't recall" what went into their models.
2
18
659
@davisblalock
Davis Blalock
4 months
Second is that the law isn't settled here. This means that no one can safely say anything about their training data. They can't even say what they didn't train on because failing to say you didn't train on X could be construed as evidence that you did.
3
15
521
@davisblalock
Davis Blalock
4 months
The compliant players have two problems. First is that you can't really get opt-in from the whole internet. Which means you can't get 100% contract coverage. Unless you're willing to not train on most of the internet, but that's throwing away a lot of useful data.
3
10
542
@davisblalock
Davis Blalock
4 months
Paying more to cover their butts makes sense for them because:
1. It's worth it to reduce legal risk
2. Employees and leaders want to "do the right thing," or at least brand themselves as the good guys internally
3. It helps them acquire risk-averse corporate customers
2
10
697
@davisblalock
Davis Blalock
4 months
The group that does care about compliance is more interesting. They strike deals with publishers and copyright holders whenever possible. They buy books. They carefully filter out content that might have been pirated. Etc.
3
10
667
@davisblalock
Davis Blalock
4 months
So they train on anything they can get their hands on, including outputs from rival models (in violation of the terms of service).
2
9
639
@davisblalock
Davis Blalock
4 months
The companies that don't care either aren't subject to these laws or are more worried about dying from irrelevance than lawsuits. Legal battles can drain you but not having a good enough product can kill you.
1
24
837
@davisblalock
Davis Blalock
4 months
There are roughly two groups of actors:
1. Those that care about US + EU laws and regulations.
2. Those that don't.
But both look the same from the outside.
6
25
908
@davisblalock
Davis Blalock
4 months
While I briefly have no employer, let me tell you what's really happening with AI companies training on public data: [1/n]
71
339
4K
@davisblalock
Davis Blalock
4 months
Paper is here: https://t.co/o8Erc1YdY3 If you like this thread, consider sharing it or following my awesome coauthors @saanarkethayan (with a banger of a first-ever first-author paper) @gupta__abhay and @mansiege. Happy to answer questions in the comments. [11/11]
arxiv.org
Large Language Model training with 8-bit floating point (FP8) formats promises significant efficiency improvements, but reduced numerical precision makes training challenging. It is currently...
7
10
179
@davisblalock
Davis Blalock
4 months
(except that it was way harder than that, we eliminated a bunch of hparams, and we got it to work at scale + for all the linears instead of just some of them). [10/11]
1
0
48
@davisblalock
Davis Blalock
4 months
tl;dr: we saw muParametrization and Unit Scaling and were like:
1
0
62
@davisblalock
Davis Blalock
4 months
Our changes also let you transfer hyperparameters to orders-of-magnitude larger models. I.e., you can tune your hparams on small models and trust that they’ll be good for large models too. This saves *tons* of compute vs sweeping hparams on large models. [8/11]
3
2
86
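The transfer rule in the tweet above can be sketched in a few lines. This is an illustrative muP-style example, not the paper's code; the function name, base width, and learning-rate value are all my own assumptions:

```python
def transfer_lr(base_lr, base_width, target_width):
    # For hidden (matrix-like) parameters, muP-style parametrizations
    # scale the learning rate inversely with the width multiplier, so a
    # rate tuned at a small width stays near-optimal at a large one.
    return base_lr * base_width / target_width

# Tune once via a sweep at a small width, then reuse at a much larger one.
base_lr = 3e-3  # hypothetical value found by sweeping at width 256
large_lr = transfer_lr(base_lr, base_width=256, target_width=8192)
print(large_lr)  # 32x smaller than the base learning rate
```

The compute savings come from doing the sweep only at the small width; the large run uses the transferred value directly.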
@davisblalock
Davis Blalock
4 months
Making all these changes lets you just straight up cast your weights to fp8 during training, with no dynamic rescaling or anything. Plus you get better convergence, even in bf16. [7/11]
1
0
69
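Why a straight cast can work: e4m3 fp8 spans magnitudes up to about 448, so weights held near unit RMS norm sit comfortably inside its range. A crude round-to-nearest fp8 emulation (my own sketch, not the paper's code) shows that unit-scale values survive the cast with bounded relative error:

```python
import numpy as np

FP8_MAX = 448.0  # largest finite e4m3 value

def fake_fp8_cast(w, mantissa_bits=3):
    # Emulate fp8 by rounding each value on a power-of-two grid whose
    # spacing matches a 3-bit mantissa, then clipping to the fp8 range.
    exp = np.floor(np.log2(np.abs(w) + 1e-30))
    step = 2.0 ** (exp - mantissa_bits)
    q = np.round(w / step) * step
    return np.clip(q, -FP8_MAX, FP8_MAX)

rng = np.random.default_rng(0)
w = rng.standard_normal(2048)  # unit-scale weights: well inside fp8 range
w8 = fake_fp8_cast(w)
rel_err = np.max(np.abs(w8 - w) / (np.abs(w) + 1e-30))
print(rel_err)  # bounded by ~2^-4 for a 3-bit mantissa, no rescaling needed
```

If the weights instead drifted far from unit scale, values would clip or underflow, which is why real fp8 recipes otherwise need dynamic rescaling.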
@davisblalock
Davis Blalock
4 months
Second, instead of just adding the residual branch output to the residual stream, we take a linear combination of the two. If both input tensors have unit norm and you make the squared coefficients sum to 1, the output tensor tends to have unit norm as well. [6/11]
1
0
68
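The linear combination described above is easy to check numerically. A minimal sketch (my own names and coefficient choice, assuming roughly uncorrelated unit-RMS inputs): with coefficients whose squares sum to 1, the combined tensor keeps unit RMS norm.

```python
import numpy as np

def scaled_residual(x, branch_out, alpha=0.9):
    # Combine residual stream and branch output with fixed coefficients
    # satisfying alpha^2 + beta^2 = 1, which preserves unit RMS norm
    # when the inputs are unit-norm and roughly uncorrelated.
    beta = np.sqrt(1.0 - alpha**2)
    return alpha * x + beta * branch_out

rng = np.random.default_rng(0)
d = 4096
x = rng.standard_normal(d)   # unit-variance residual stream
fx = rng.standard_normal(d)  # unit-variance branch output
y = scaled_residual(x, fx, alpha=0.9)

rms = lambda v: np.sqrt(np.mean(v**2))
print(rms(x), rms(fx), rms(y))  # all close to 1.0
```

Plain addition would instead grow the stream's norm by roughly sqrt(2) per block, which is exactly the drift this construction avoids.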