Marc Finzi Profile
Marc Finzi

@m_finzi

Followers: 1K · Following: 797 · Media: 37 · Statuses: 162

OpenAI researcher. Previously postdoc at CMU and PhD at NYU. ex physics

San Francisco
Joined January 2017
@m_finzi
Marc Finzi
3 months
Much more in the paper! A big thanks to the team that made this possible: @psiyumm, @entropicfox, Anming Gu, @chrismdesa, @zicokolter, @andrewgwils. 📚 Check it out:
@m_finzi
Marc Finzi
3 months
This lets us prove something striking: the train-test gap must shrink as models are scaled up. Our bounds grow predictably as a sum of power laws for each of the relevant contributions, giving an interpretable story about generalization! 6/7
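To make the "sum of power laws" shape concrete, here is a purely schematic form; the symbols, exponents, and number of terms below are hypothetical and are not the bound from the paper:

```latex
% Schematic only: a train-test gap controlled by a sum of decaying power laws.
\[
  \text{test loss} - \text{train loss}
  \;\lesssim\;
  a_1\, n^{-\alpha_1} \;+\; a_2\, n^{-\alpha_2} \;+\; a_3\, n^{-\alpha_3},
  \qquad a_i,\ \alpha_i > 0,
\]
```

where $n$ stands for the scale of the pre-training run; each term tracks one contribution, and all of them shrink as $n$ grows, forcing the gap to shrink with scale.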
@m_finzi
Marc Finzi
3 months
For small models, quantization and parameter counting produce tighter estimates of the information content, but for >10B-parameter models prequential coding produces tighter bounds. Thus C gets smaller as the models get bigger. 5/7
@m_finzi
Marc Finzi
3 months
Our bounds cleanly decompose into a loss variation, a quantization gap, and a random guessing cost, all tied to a compression ratio C: the ratio of the information content of the model to the number of tokens in the pre-training data. 4/7
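Written out, the compression ratio described in the tweet is simply (notation mine):

```latex
\[
  C \;=\; \frac{\text{information content of the model, in bits}}{\text{number of tokens in the pre-training data}}.
\]
```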
@m_finzi
Marc Finzi
3 months
The key? 🔑 LLMs compress their data far more than you'd expect. If the training loss drops quickly 📉, prequential coding shows that large models store far less information than raw parameter counts suggest. 3/7
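For reference, the prequential (online) codelength behind this argument has the standard form below; the notation is generic rather than the paper's. The data are encoded using the model's own predictions as training proceeds, so the total codelength is the summed online negative log-likelihood:

```latex
\[
  L_{\mathrm{preq}}(x_{1:T}) \;=\; \sum_{t=1}^{T} -\log_2 p_{\theta_{<t}}\!\left(x_t \mid x_{<t}\right),
\]
```

where $\theta_{<t}$ denotes the parameters trained only on the tokens seen before $x_t$. When the training loss drops quickly, most terms in the sum are small, so the data are described in far fewer bits than the raw parameter count would suggest, which is the sense in which the model "stores" little information.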
@m_finzi
Marc Finzi
3 months
Our findings align closely with the recent OpenAI GPT-4.5 pretraining team discussion, where Daniel Selsam explains how LLMs can produce intelligent behavior in light of their massive parameter counts: 2/7
@m_finzi
Marc Finzi
3 months
Why do larger language models generalize better? In our new ICLR paper, we derive an interpretable generalization bound showing that compute-optimal LLMs provably generalize better with scale! 📄 1/7 🧵
@m_finzi
Marc Finzi
3 months
RT @ashertrockman: Are you a frontier lab investing untold sums in training? Are you trying to stay competitive? Are you finding that your….
@m_finzi
Marc Finzi
6 months
RT @dylanjsam: To trust LLMs in deployment (e.g., agentic frameworks or for generating synthetic data), we should predict how well they wil….
@m_finzi
Marc Finzi
1 year
We investigate other scientific questions about how structure in the data impacts the bounds, as well as how to make the evaluation scale. Check out @LotfiSanae's 🧵 and @micahgoldblum's 🧵 for details and additional insights! 9/9
@micahgoldblum
Micah Goldblum
1 year
Do LLMs simply memorize and parrot their pretraining data or do they learn patterns that generalize? Let’s put this to the test! We compute the first generalization guarantees for LLMs. w/ @LotfiSanae, @m_finzi, @KuangYilun, @timrudner, @andrewgwils. 1/9.
@m_finzi
Marc Finzi
1 year
Intriguingly, we find that larger models achieve better generalization bounds as we scale them up, even if we hold the dataset size fixed! In other words, for large models we can find a more compressible set of parameters for the same training error. 8/9
@m_finzi
Marc Finzi
1 year
By trading off the extent of compression, we can find a compressed solution that achieves the best generalization bounds, and these bounds are nonvacuous! 7/9
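As a rough illustration of that trade-off (entirely made-up numbers and a schematic Occam-style bound, not the paper's): stronger compression shrinks the complexity term but raises the empirical loss, and the best bound sits in between.

```python
import math

# Hypothetical (subspace_dim, train_nll, model_bits) triples for one model compressed
# to different degrees; none of these numbers come from the paper.
candidates = [
    (1_000,     8.0, 2.0e4),
    (10_000,    4.5, 2.0e5),
    (100_000,   3.3, 2.0e6),
    (1_000_000, 3.1, 2.0e7),
]
n_documents = 500_000  # hypothetical number of IID documents
delta = 0.05           # failure probability
max_loss = 10.0        # hypothetical worst-case per-document loss after smoothing

def schematic_bound(train_nll, model_bits):
    # Complexity in nats (bits * ln 2) plus a confidence term, scaled by the loss range.
    complexity = model_bits * math.log(2) + math.log(1 / delta)
    return train_nll + max_loss * math.sqrt(complexity / (2 * n_documents))

for dim, nll, bits in candidates:
    print(f"dim={dim:>9,d}  train NLL={nll:.2f}  bound={schematic_bound(nll, bits):.2f}")

best = min(candidates, key=lambda c: schematic_bound(c[1], c[2]))
print("best compression level under this schematic bound:", best[0])
```

In this toy sweep the intermediate compression level wins: the least compressed model has the lowest training loss but pays too much in complexity, and the most compressed model fits the data too poorly.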
@m_finzi
Marc Finzi
1 year
We can make the bounds tighter by compressing the model. We do so by training with SubLoRA, an expressive and efficient nonlinear parametrization of the underlying model parameters. SubLoRA combines random subspace training with LoRA to achieve extreme compression ratios. 6/9
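A minimal conceptual sketch of such a parametrization, assuming the simplest reading of "random subspace training + LoRA" for a single linear layer (the shapes, names, and setup below are hypothetical, not the paper's implementation):

```python
import torch

d_model, rank, subspace_dim = 512, 4, 1024

torch.manual_seed(0)
W0 = torch.randn(d_model, d_model)                  # frozen pretrained weight
# Fixed random projection from a low-dimensional subspace to the flattened LoRA factors.
P = torch.randn(2 * d_model * rank, subspace_dim) / subspace_dim**0.5
z = torch.zeros(subspace_dim, requires_grad=True)   # the only trainable parameters

def effective_weight():
    lora_params = P @ z                              # random-subspace step
    A = lora_params[: d_model * rank].view(d_model, rank)
    B = lora_params[d_model * rank :].view(rank, d_model)
    return W0 + A @ B                                # LoRA step: low-rank update to W0

y = effective_weight() @ torch.randn(d_model)        # forward pass through the adapted layer
```

Only z is trained, so the adapted model can be described by subspace_dim numbers plus a random seed for P, which is what makes the compressed description so short.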
@m_finzi
Marc Finzi
1 year
Bounding the NLL is challenging because its values can be arbitrarily bad. We limit the worst-case behavior Δ by applying prediction smoothing: mixing some amount of the uniform distribution over tokens into the model's predictive distribution. 5/9
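Concretely, with vocabulary size V and a mixing weight α (generic notation, not necessarily the paper's symbols), the smoothed distribution and the resulting worst-case NLL are:

```latex
\[
  p_{\alpha}(x \mid x_{<t}) \;=\; (1-\alpha)\, p_{\theta}(x \mid x_{<t}) \;+\; \frac{\alpha}{V},
  \qquad
  -\log p_{\alpha}(x \mid x_{<t}) \;\le\; \log\frac{V}{\alpha},
\]
```

so the per-token loss is bounded, and this bounded range plays the role of the worst-case term Δ in the bound.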
@m_finzi
Marc Finzi
1 year
To construct bounds relevant to LLMs, we bound the average negative log-likelihood per token of each document. By treating each document, rather than each token, as a distinct data point, we can satisfy the IID assumption. 4/9
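For illustration, the document-level quantity is just a per-document mean of token NLLs (the toy arrays below are hypothetical):

```python
import numpy as np

# Toy example: token-level NLLs (nats) for three documents of different lengths.
token_nlls = [
    np.array([3.2, 2.9, 3.1, 2.7]),        # document 1
    np.array([4.0, 3.8]),                   # document 2
    np.array([2.5, 2.6, 2.4, 2.8, 2.7]),    # document 3
]

# One number per document: the average NLL per token. These per-document averages
# are the IID quantities the bound is stated over.
doc_losses = np.array([nll.mean() for nll in token_nlls])
print(doc_losses)          # [2.975 3.9   2.6  ]
print(doc_losses.mean())   # empirical risk over documents
```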
@m_finzi
Marc Finzi
1 year
No need for PAC-Bayes, data-dependent priors, or other sophisticated techniques. Instead we use a simple variant of the finite hypothesis generalization bound, where the complexity log 1/P(h) is measured by the number of bits needed to express a given model h. 3/9
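For context, the textbook finite-hypothesis (Occam) bound has the following shape; the paper's variant may differ in details, so treat this as the standard reference form rather than their exact statement. For a countable hypothesis class with prior P, losses in [0, Δ], and n IID data points, with probability at least 1−δ, for all h:

```latex
\[
  R(h) \;\le\; \hat{R}(h) \;+\; \Delta \sqrt{\frac{\log\frac{1}{P(h)} + \log\frac{1}{\delta}}{2n}},
\]
```

and choosing $P(h) = 2^{-\mathrm{bits}(h)}$ makes $\log_2 \frac{1}{P(h)}$ exactly the number of bits needed to encode $h$.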
@m_finzi
Marc Finzi
1 year
We believe that a pure simplicity bias is enough to explain a lot of the generalization behavior of LLMs. The Occam's razor principle can be formalized through Kolmogorov complexity: models with low complexity and low training error are simple explanations of the data. 2/9
@m_finzi
Marc Finzi
1 year
In this work we construct the first nonvacuous generalization bounds for LLMs, helping to explain why these models generalize. w/ @LotfiSanae, @KuangYilun, @timrudner, @micahgoldblum, @andrewgwils. A 🧵 on how we make these bounds. 1/9
@m_finzi
Marc Finzi
2 years
We provide examples applying CoLA to PDEs, optimization, equivariant nets, GPs, & more! If you’re looking to compute solves, eigs, sqrts, matrix exponentials, determinants, traces, or diagonals of a structured matrix, CoLA is for you! Code: 👉 🥤 [7/7]
@m_finzi
Marc Finzi
2 years
The best algorithm depends on the structure of the matrix; for example, inverting a permutation can be done with mergesort. CoLA is designed with multiple dispatch: algorithms are specialized to the matrix at hand, and these dispatch rules combine compositionally. 🤖 [6/7]
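To illustrate the idea, here is a standalone sketch of structure-based dispatch (not CoLA's actual API or implementation): the inverse rule for a generic dense matrix falls back to a cubic-cost solve, while a permutation operator gets the sorting-based rule mentioned above.

```python
import numpy as np
from functools import singledispatch

class Dense:
    def __init__(self, A): self.A = np.asarray(A)

class Permutation:
    # Represents the operator (Px)[i] = x[perm[i]], i.e. row i has a 1 in column perm[i].
    def __init__(self, perm): self.perm = np.asarray(perm)

@singledispatch
def inverse(op):
    raise NotImplementedError(type(op))

@inverse.register
def _(op: Dense):
    return Dense(np.linalg.inv(op.A))        # generic O(n^3) route

@inverse.register
def _(op: Permutation):
    return Permutation(np.argsort(op.perm))  # sorting-based inverse, O(n log n)

print(inverse(Permutation([2, 0, 1])).perm)  # -> [1 2 0]
```

New structures and new rules can be added independently, and composite operators can route each piece to its specialized rule, which is the compositional behavior the tweet describes.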