Davis Blalock
@davisblalock
Followers 15K · Following 368 · Media 494 · Statuses 1K
Research scientist @GoogleDeepMind. Past: @Databricks, first hire @MosaicML, @MIT PhD. I post about AI technical progress + sometimes the business side.
San Francisco, CA
Joined December 2016
I've written about 500+ machine learning papers in the past year. Here are some of my most popular threads: [1/n]
The era of "The Era of Pretraining is Over" is over
1/ Pretraining is hitting a data wall; scaling raw web data alone leads to diminishing returns. Today @datologyai shares BeyondWeb, our synthetic data approach & all the learnings from scaling it to trillions of tokens 🧑🏼‍🍳
- 3B LLMs beat 8B models 🚀
- Pareto frontier for performance
Not that I have a favorite recent project, but... 🧵
LLM judges are the popular way to evaluate generative models. But they have drawbacks. They're:
* Generative, so slow and expensive.
* Nondeterministic.
* Uncalibrated. They don't know how uncertain they are.
Meet PGRM!
Ever wonder what it'd look like if an LLM Judge and a Reward Model had a baby? So did we, which is why we created PGRM -- the Prompt-Guided Reward Model. TLDR: You get the instructability of an LLM judge + the calibration of an RM in a single speedy package (1/n)
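Roughly, the idea is a reward-model-style scorer whose judging criteria come in through the prompt. Below is a minimal PyTorch sketch of that interface; the names (PromptGuidedRewardModel, score_head) and the toy encoder are mine, not the actual PGRM code. The point is just that a single classifier-style forward pass over the tokenized (rubric + prompt + response) sequence returns a probability rather than generated text, so scoring is fast, deterministic, and calibratable.

```python
# Hypothetical sketch of a prompt-guided reward-model-style scorer.
# Unlike a generative LLM judge, it does one classifier-style forward pass
# and returns a scalar probability, so scores are fast, deterministic, and
# can be calibrated against labeled judgments.
import torch
import torch.nn as nn

class PromptGuidedRewardModel(nn.Module):
    def __init__(self, encoder: nn.Module, hidden_dim: int):
        super().__init__()
        self.encoder = encoder                       # any LM / text-encoder backbone
        self.score_head = nn.Linear(hidden_dim, 1)   # scalar reward logit

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids encodes "rubric + prompt + response" as one sequence
        hidden = self.encoder(token_ids)         # (batch, seq, hidden_dim)
        logit = self.score_head(hidden[:, -1])   # pool the last position
        return torch.sigmoid(logit)              # probability-style score

# Toy usage with a stand-in encoder (real use: a pretrained LM backbone).
vocab, dim = 1000, 64
encoder = nn.Sequential(
    nn.Embedding(vocab, dim),
    nn.TransformerEncoder(
        nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=1),
)
pgrm = PromptGuidedRewardModel(encoder, hidden_dim=dim)
fake_batch = torch.randint(0, vocab, (2, 32))  # tokenized (rubric, prompt, response)
print(pgrm(fake_batch).shape)                  # -> torch.Size([2, 1])
```

The single scalar head is what buys the calibration: you can fit its sigmoid output against labeled judgments, which a purely generative judge doesn't easily give you.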
Anyway, I'm not a lawyer and I make no claims about what should happen with AI and public data, but I hope this thread helped you understand a bit about what is happening. Further disclaimer: all of this is general information that hundreds, maybe thousands of people inside the
A consequence of this caring-about-rules spectrum is that there's a "compliance gap" in many evals. If you have an idea where a given company (or even academic lab) sits on that spectrum, you should be correspondingly more or less impressed by their numbers.
Legal limbo also means that both compliant and non-compliant actors look the same from the outside. The non-compliant say nothing about their data because they did violate rules, and the compliant say nothing because they only incur risk by saying anything.
This legal ambiguity gives rise to hilarious situations like technical leaders saying they "can't recall" what went into their models.
Second is that the law isn't settled here. This means that no one can safely say anything about their training data. They can't even say what they didn't train on because failing to say you didn't train on X could be construed as evidence that you did.
The compliant players have two problems. First is that you can't really get opt-in from the whole internet. Which means you can't get 100% contract coverage. Unless you're willing to not train on most of the internet, but that's throwing away a lot of useful data.
Paying more to cover their butts makes sense for them because:
1. It's worth it to reduce legal risk
2. Employees and leaders want to "do the right thing," or at least brand themselves as the good guys internally
3. It helps them acquire risk-averse corporate customers
The group that does care about compliance is more interesting. They strike deals with publishers and copyright holders whenever possible. They buy books. They carefully filter out content that might have been pirated. Etc.
So they train on anything they can get their hands on, including outputs from rival models (in violation of those models' terms of service).
The companies that don't care either aren't subject to these laws or are more worried about dying from irrelevance than lawsuits. Legal battles can drain you but not having a good enough product can kill you.
There are roughly two groups of actors:
1. Those that care about US + EU laws and regulations.
2. Those that don't.
But both look the same from the outside.
While I briefly have no employer, let me tell you what's really happening with AI companies training on public data: [1/n]
Paper is here: https://t.co/o8Erc1YdY3 If you like this thread, consider sharing it or following my awesome coauthors @saanarkethayan (with a banger of a first-ever first-author paper) @gupta__abhay and @mansiege. Happy to answer questions in the comments. [11/11]
arxiv.org: Large Language Model training with 8-bit floating point (FP8) formats promises significant efficiency improvements, but reduced numerical precision makes training challenging. It is currently...
(except that it was way harder than that, we eliminated a bunch of hparams, and we got it to work at scale + for all the linears instead of just some of them). [10/11]
tl;dr: we saw muParametrization and Unit Scaling and were like:
Our changes also let you transfer hyperparameters to orders-of-magnitude larger models. I.e., you can tune your hparams on small models and trust that they’ll be good for large models too. This saves *tons* of compute vs sweeping hparams on large models. [8/11]
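For intuition, here's a minimal sketch of the kind of width-based learning-rate scaling that μP-style hyperparameter transfer relies on: tune a base LR on a narrow model, then shrink it by base_width / width for the width-scaled matrices when you widen. This is the generic recipe, not necessarily the exact parameterization in the paper, and the toy model and helper below are stand-ins of mine.

```python
# Illustrative μP-style learning-rate scaling: the base LR is tuned once on a
# small (base_width) model, and matrices whose fan-in grows with width get
# their LR shrunk by base_width / width in the wider model.
import torch
import torch.nn as nn

def make_mlp(width: int) -> nn.Sequential:
    return nn.Sequential(nn.Linear(512, width), nn.ReLU(), nn.Linear(width, 512))

def mup_style_param_groups(model: nn.Module, base_lr: float,
                           width: int, base_width: int):
    groups = []
    for p in model.parameters():
        # Scale LR for matrices that read from the widened dimension;
        # keep everything else (biases, input-side weights) at the base LR.
        if p.ndim == 2 and p.shape[1] == width:
            lr = base_lr * base_width / width
        else:
            lr = base_lr
        groups.append({"params": [p], "lr": lr})
    return groups

base_lr = 3e-3                       # tuned once on the small model
small = make_mlp(width=256)
large = make_mlp(width=4096)
opt_small = torch.optim.AdamW(mup_style_param_groups(small, base_lr, 256, 256))
opt_large = torch.optim.AdamW(mup_style_param_groups(large, base_lr, 4096, 256))
```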
Making all these changes lets you just straight up cast your weights to fp8 during training, with no dynamic rescaling or anything. Plus you get better convergence, even in bf16. [7/11]
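To make "straight up cast your weights to fp8" concrete, the sketch below emulates the static cast by round-tripping a weight through torch.float8_e4m3fn before a matmul, with no amax tracking or per-tensor scale anywhere. It's only an illustration of the idea (requires a PyTorch build with float8 dtypes, ≥ 2.1), not the kernels or recipe from the paper; real fp8 training would feed the fp8 tensors to fp8 matmuls rather than upcasting.

```python
# Emulate static fp8 weight casting: quantize the weight to float8_e4m3fn and
# immediately upcast for the matmul. No dynamic rescaling is computed anywhere;
# the thread's claim is that with the right parameterization, weights stay in
# range so a plain cast is enough.
import torch

def fp8_cast_matmul(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    w_fp8 = w.to(torch.float8_e4m3fn)   # static cast, no scale factors
    return x @ w_fp8.to(x.dtype)        # upcast only to emulate an fp8 matmul

x = torch.randn(8, 1024, dtype=torch.bfloat16)
w = torch.randn(1024, 1024, dtype=torch.bfloat16) / 1024**0.5  # roughly unit-scaled
err = (fp8_cast_matmul(x, w) - x @ w).abs().mean()
print(f"mean abs error vs bf16 matmul: {err.item():.4f}")
```

The implicit claim is that the parameterization changes above keep weights and activations near unit scale, so this naive cast doesn't blow up and the usual dynamic-rescaling machinery becomes unnecessary.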
Second, instead of just adding the residual branch output to the residual stream, we take a linear combination of the two. If both input tensors have unit norm and you make the squared coefficients sum to 1, the output tensor tends to have unit norm as well. [6/11]
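Concretely, that's out = α·stream + β·branch with α² + β² = 1 (e.g. α = cos θ, β = sin θ). Here's a small sketch checking the norm-preservation claim on random unit-norm vectors; the names and the particular α are mine, purely for illustration.

```python
# Norm-preserving residual merge: combine the residual stream and the branch
# output with coefficients whose squares sum to 1. If both inputs have
# (roughly) unit norm and are uncorrelated, the output does too.
import torch

def combine(stream: torch.Tensor, branch: torch.Tensor, alpha: float) -> torch.Tensor:
    beta = (1.0 - alpha**2) ** 0.5          # alpha^2 + beta^2 = 1
    return alpha * stream + beta * branch   # instead of stream + branch

d = 4096
stream = torch.randn(d); stream /= stream.norm()
branch = torch.randn(d); branch /= branch.norm()

out_plain = stream + branch                   # norm drifts upward, ~sqrt(2)
out_combo = combine(stream, branch, alpha=0.9)
print(out_plain.norm().item(), out_combo.norm().item())  # ~1.414 vs ~1.0
```

With plain addition the norm of the stream creeps up by roughly √2 at every block; the constrained combination is exactly what avoids that drift.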