Davis Blalock

@davisblalock

Followers: 15K · Following: 368 · Media: 494 · Statuses: 1K

Research scientist @GoogleDeepMind. Past: @Databricks, first hire @MosaicML, @MIT PhD. I post about AI technical progress + sometimes the business side.

San Francisco, CA
Joined December 2016
@davisblalock
Davis Blalock
3 years
I've written about 500+ machine learning papers in the past year. Here are some of my most popular threads: [1/n]
14
72
402
@leavittron
Matthew Leavitt
3 months
The era of "The Era of Pretraining is Over" is over
@pratyushmaini
Pratyush Maini
3 months
1/Pretraining is hitting a data wall; scaling raw web data alone leads to diminishing returns. Today @datologyai shares BeyondWeb, our synthetic data approach & all the learnings from scaling it to trillions of tokens🧑🏼‍🍳
- 3B LLMs beat 8B models🚀
- Pareto frontier for performance
4
7
53
@jefrankle
Jonathan Frankle
3 months
Not that I have a favorite recent project, but... 🧵 LLM judges are the popular way to evaluate generative models. But they have drawbacks. They're:
* Generative, so slow and expensive.
* Nondeterministic.
* Uncalibrated. They don't know how uncertain they are.
Meet PGRM!
@alexrtrott
Alex Trott
3 months
Ever wonder what it'd look like if an LLM Judge and a Reward Model had a baby? So did we, which is why we created PGRM -- the Prompt-Guided Reward Model. TLDR: You get the instructability of an LLM judge + the calibration of an RM in a single speedy package (1/n)
4
15
77
@davisblalock
Davis Blalock
4 months
Anyway, I'm not a lawyer and I make no claims about what should happen with AI and public data, but I hope this thread helped you understand a bit about what is happening. Further disclaimer: all of this is general information that hundreds, maybe thousands of people inside the
17
11
641
@davisblalock
Davis Blalock
4 months
A consequence of this caring-about-rules spectrum is that there's a "compliance gap" in many evals. If you have an idea where a given company (or even academic lab) sits, you should be either more or less impressed by their numbers.
1
7
441
@davisblalock
Davis Blalock
4 months
Legal limbo also means that both compliant and non-compliant actors look the same from the outside. The non-compliant say nothing about their data because they did violate rules, and the compliant say nothing because they only incur risk by saying anything.
1
14
560
@davisblalock
Davis Blalock
4 months
This legal ambiguity gives rise to hilarious situations like technical leaders saying they "can't recall" what went into their models.
2
18
659
@davisblalock
Davis Blalock
4 months
Second is that the law isn't settled here. This means that no one can safely say anything about their training data. They can't even say what they didn't train on because failing to say you didn't train on X could be construed as evidence that you did.
3
15
521
@davisblalock
Davis Blalock
4 months
The compliant players have two problems. First is that you can't really get opt-in from the whole internet. Which means you can't get 100% contract coverage. Unless you're willing to not train on most of the internet, but that's throwing away a lot of useful data.
3
10
542
@davisblalock
Davis Blalock
4 months
Paying more to cover their butts makes sense for them because:
1. It's worth it to reduce legal risk
2. Employees and leaders want to "do the right thing," or at least brand themselves as the good guys internally
3. It helps them acquire risk-averse corporate customers
2
10
697
@davisblalock
Davis Blalock
4 months
The group that does care about compliance is more interesting. They strike deals with publishers and copyright holders whenever possible. They buy books. They carefully filter out content that might have been pirated. Etc.
3
10
667
@davisblalock
Davis Blalock
4 months
So they train on anything they can get their hands on, including outputs from rival models (in violation of the terms of service).
2
9
639
@davisblalock
Davis Blalock
4 months
The companies that don't care either aren't subject to these laws or are more worried about dying from irrelevance than lawsuits. Legal battles can drain you but not having a good enough product can kill you.
1
24
837
@davisblalock
Davis Blalock
4 months
There are roughly two groups of actors:
1. Those that care about US + EU laws and regulations.
2. Those that don't.
But both look the same from the outside.
6
25
908
@davisblalock
Davis Blalock
4 months
While I briefly have no employer, let me tell you what's really happening with AI companies training on public data: [1/n]
71
339
4K
@davisblalock
Davis Blalock
4 months
Paper is here: https://t.co/o8Erc1YdY3 If you like this thread, consider sharing it or following my awesome coauthors @saanarkethayan (with a banger of a first-ever first-author paper) @gupta__abhay and @mansiege. Happy to answer questions in the comments. [11/11]
arxiv.org
Large Language Model training with 8-bit floating point (FP8) formats promises significant efficiency improvements, but reduced numerical precision makes training challenging. It is currently...
7
10
179
@davisblalock
Davis Blalock
4 months
(except that it was way harder than that, we eliminated a bunch of hparams, and we got it to work at scale + for all the linears instead of just some of them). [10/11]
1
0
48
@davisblalock
Davis Blalock
4 months
tl;dr: we saw muParametrization and Unit Scaling and were like:
1
0
62
@davisblalock
Davis Blalock
4 months
Our changes also let you transfer hyperparameters to orders-of-magnitude larger models. I.e., you can tune your hparams on small models and trust that they’ll be good for large models too. This saves *tons* of compute vs sweeping hparams on large models. [8/11]
3
2
86
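The transfer rule in the tweet above can be sketched in a few lines. This is an illustrative muP-style example, not the paper's code; the function name, base width, and learning-rate value are all my own assumptions:

```python
def transfer_lr(base_lr, base_width, target_width):
    # For hidden (matrix-like) parameters, muP-style parametrizations
    # scale the learning rate inversely with the width multiplier, so a
    # rate tuned at a small width stays near-optimal at a large one.
    return base_lr * base_width / target_width

# Tune once via a sweep at a small width, then reuse at a much larger one.
base_lr = 3e-3  # hypothetical value found by sweeping at width 256
large_lr = transfer_lr(base_lr, base_width=256, target_width=8192)
print(large_lr)  # 32x smaller than the base learning rate
```

The compute savings come from doing the sweep only at the small width; the large run uses the transferred value directly.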
@davisblalock
Davis Blalock
4 months
Making all these changes lets you just straight up cast your weights to fp8 during training, with no dynamic rescaling or anything. Plus you get better convergence, even in bf16. [7/11]
1
0
69
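Why a straight cast can work: e4m3 fp8 spans magnitudes up to about 448, so weights held near unit RMS norm sit comfortably inside its range. A crude round-to-nearest fp8 emulation (my own sketch, not the paper's code) shows that unit-scale values survive the cast with bounded relative error:

```python
import numpy as np

FP8_MAX = 448.0  # largest finite e4m3 value

def fake_fp8_cast(w, mantissa_bits=3):
    # Emulate fp8 by rounding each value on a power-of-two grid whose
    # spacing matches a 3-bit mantissa, then clipping to the fp8 range.
    exp = np.floor(np.log2(np.abs(w) + 1e-30))
    step = 2.0 ** (exp - mantissa_bits)
    q = np.round(w / step) * step
    return np.clip(q, -FP8_MAX, FP8_MAX)

rng = np.random.default_rng(0)
w = rng.standard_normal(2048)  # unit-scale weights: well inside fp8 range
w8 = fake_fp8_cast(w)
rel_err = np.max(np.abs(w8 - w) / (np.abs(w) + 1e-30))
print(rel_err)  # bounded by ~2^-4 for a 3-bit mantissa, no rescaling needed
```

If the weights instead drifted far from unit scale, values would clip or underflow, which is why real fp8 recipes otherwise need dynamic rescaling.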
@davisblalock
Davis Blalock
4 months
Second, instead of just adding the residual branch output to the residual stream, we take a linear combination of the two. If both input tensors have unit norm and you make the squared coefficients sum to 1, the output tensor tends to have unit norm as well. [6/11]
1
0
68
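The linear combination described above is easy to check numerically. A minimal sketch (my own names and coefficient choice, assuming roughly uncorrelated unit-RMS inputs): with coefficients whose squares sum to 1, the combined tensor keeps unit RMS norm.

```python
import numpy as np

def scaled_residual(x, branch_out, alpha=0.9):
    # Combine residual stream and branch output with fixed coefficients
    # satisfying alpha^2 + beta^2 = 1, which preserves unit RMS norm
    # when the inputs are unit-norm and roughly uncorrelated.
    beta = np.sqrt(1.0 - alpha**2)
    return alpha * x + beta * branch_out

rng = np.random.default_rng(0)
d = 4096
x = rng.standard_normal(d)   # unit-variance residual stream
fx = rng.standard_normal(d)  # unit-variance branch output
y = scaled_residual(x, fx, alpha=0.9)

rms = lambda v: np.sqrt(np.mean(v**2))
print(rms(x), rms(fx), rms(y))  # all close to 1.0
```

Plain addition would instead grow the stream's norm by roughly sqrt(2) per block, which is exactly the drift this construction avoids.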