Marc Finzi Profile
Marc Finzi

@m_finzi

Followers: 1K · Following: 797 · Media: 37 · Statuses: 162

OpenAI researcher. Previously postdoc at CMU and PhD at NYU. ex physics

San Francisco
Joined January 2017
@m_finzi
Marc Finzi
3 months
Much more in the paper! A big thanks to the team that made this possible: @psiyumm, @entropicfox, Anming Gu, @chrismdesa, @zicokolter, @andrewgwils. 📚 Check it out:
@m_finzi
Marc Finzi
3 months
This lets us prove something striking: the train-test gap must shrink as models are scaled up. Our bounds grow predictably as a sum of power laws for each of the relevant contributions, giving an interpretable story about generalization! 6/7
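To make the "sum of power laws" shape concrete, here is a purely schematic form; the symbols, exponents, and number of terms below are hypothetical and are not the bound from the paper:

```latex
% Schematic only: a train-test gap controlled by a sum of decaying power laws.
\[
  \text{test loss} - \text{train loss}
  \;\lesssim\;
  a_1\, n^{-\alpha_1} \;+\; a_2\, n^{-\alpha_2} \;+\; a_3\, n^{-\alpha_3},
  \qquad a_i,\ \alpha_i > 0,
\]
```

where $n$ stands for the scale of the pre-training run; each term tracks one contribution, and all of them shrink as $n$ grows, forcing the gap to shrink with scale.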
@m_finzi
Marc Finzi
3 months
For small models, quantization and parameter counting produce tighter estimates of the information content, but for >10B-parameter models prequential coding produces tighter bounds. Thus C gets smaller as the models get bigger. 5/7
@m_finzi
Marc Finzi
3 months
Our bounds cleanly decompose into a loss variation, a quantization gap, and a random guessing cost, all tied to a compression ratio C: the ratio of the information content of the model to the number of tokens in the pre-training data. 4/7
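Written out, the compression ratio described in the tweet is simply (notation mine):

```latex
\[
  C \;=\; \frac{\text{information content of the model, in bits}}{\text{number of tokens in the pre-training data}}.
\]
```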
@m_finzi
Marc Finzi
3 months
The key? 🔑 LLMs compress their data far more than you'd expect. If the training loss drops quickly 📉, prequential coding shows that large models store far less information than raw parameter counts suggest. 3/7
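For reference, the prequential (online) codelength behind this argument has the standard form below; the notation is generic rather than the paper's. The data are encoded using the model's own predictions as training proceeds, so the total codelength is the summed online negative log-likelihood:

```latex
\[
  L_{\mathrm{preq}}(x_{1:T}) \;=\; \sum_{t=1}^{T} -\log_2 p_{\theta_{<t}}\!\left(x_t \mid x_{<t}\right),
\]
```

where $\theta_{<t}$ denotes the parameters trained only on the tokens seen before $x_t$. When the training loss drops quickly, most terms in the sum are small, so the data are described in far fewer bits than the raw parameter count would suggest, which is the sense in which the model "stores" little information.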
@m_finzi
Marc Finzi
3 months
Our findings align closely with the recent OpenAI GPT-4.5 pretraining team discussion, where Daniel Selsam explains how LLMs can produce intelligent behavior in light of their massive parameter counts: 2/7
@m_finzi
Marc Finzi
3 months
Why do larger language models generalize better? In our new ICLR paper, we derive an interpretable generalization bound showing that compute-optimal LLMs provably generalize better with scale! 📄 1/7 🧵
@m_finzi
Marc Finzi
3 months
RT @ashertrockman: Are you a frontier lab investing untold sums in training? Are you trying to stay competitive? Are you finding that your….
@m_finzi
Marc Finzi
6 months
RT @dylanjsam: To trust LLMs in deployment (e.g., agentic frameworks or for generating synthetic data), we should predict how well they wil….
@m_finzi
Marc Finzi
1 year
We investigate other scientific questions about how structure in the data impacts the bounds, as well as how to make the evaluation scale. Check out @LotfiSanae's 🧵 and @micahgoldblum's 🧵 for details and additional insights! 9/9
@micahgoldblum
Micah Goldblum
1 year
Do LLMs simply memorize and parrot their pretraining data or do they learn patterns that generalize? Let’s put this to the test! We compute the first generalization guarantees for LLMs. w/ @LotfiSanae, @m_finzi, @KuangYilun, @timrudner, @andrewgwils. 1/9.
@m_finzi
Marc Finzi
1 year
Intriguingly, we find that larger models achieve better generalization bounds as we scale them up, even if we hold the dataset size fixed! In other words, for large models we can find a more compressible set of parameters for the same training error. 8/9
@m_finzi
Marc Finzi
1 year
By trading off the extent of compression, we can find a compressed solution that achieves the best generalization bounds, and these bounds are nonvacuous! 7/9
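As a rough illustration of that trade-off (entirely made-up numbers and a schematic Occam-style bound, not the paper's): stronger compression shrinks the complexity term but raises the empirical loss, and the best bound sits in between.

```python
import math

# Hypothetical (subspace_dim, train_nll, model_bits) triples for one model compressed
# to different degrees; none of these numbers come from the paper.
candidates = [
    (1_000,     8.0, 2.0e4),
    (10_000,    4.5, 2.0e5),
    (100_000,   3.3, 2.0e6),
    (1_000_000, 3.1, 2.0e7),
]
n_documents = 500_000  # hypothetical number of IID documents
delta = 0.05           # failure probability
max_loss = 10.0        # hypothetical worst-case per-document loss after smoothing

def schematic_bound(train_nll, model_bits):
    # Complexity in nats (bits * ln 2) plus a confidence term, scaled by the loss range.
    complexity = model_bits * math.log(2) + math.log(1 / delta)
    return train_nll + max_loss * math.sqrt(complexity / (2 * n_documents))

for dim, nll, bits in candidates:
    print(f"dim={dim:>9,d}  train NLL={nll:.2f}  bound={schematic_bound(nll, bits):.2f}")

best = min(candidates, key=lambda c: schematic_bound(c[1], c[2]))
print("best compression level under this schematic bound:", best[0])
```

In this toy sweep the intermediate compression level wins: the least compressed model has the lowest training loss but pays too much in complexity, and the most compressed model fits the data too poorly.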
@m_finzi
Marc Finzi
1 year
We can make the bounds tighter by compressing the model. We do so by training with SubLoRA, an expressive and efficient nonlinear parametrization of the underlying model parameters. SubLoRA combines random subspace training with LoRA to achieve extreme compression ratios. 6/9
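A minimal conceptual sketch of such a parametrization, assuming the simplest reading of "random subspace training + LoRA" for a single linear layer (the shapes, names, and setup below are hypothetical, not the paper's implementation):

```python
import torch

d_model, rank, subspace_dim = 512, 4, 1024

torch.manual_seed(0)
W0 = torch.randn(d_model, d_model)                  # frozen pretrained weight
# Fixed random projection from a low-dimensional subspace to the flattened LoRA factors.
P = torch.randn(2 * d_model * rank, subspace_dim) / subspace_dim**0.5
z = torch.zeros(subspace_dim, requires_grad=True)   # the only trainable parameters

def effective_weight():
    lora_params = P @ z                              # random-subspace step
    A = lora_params[: d_model * rank].view(d_model, rank)
    B = lora_params[d_model * rank :].view(rank, d_model)
    return W0 + A @ B                                # LoRA step: low-rank update to W0

y = effective_weight() @ torch.randn(d_model)        # forward pass through the adapted layer
```

Only z is trained, so the adapted model can be described by subspace_dim numbers plus a random seed for P, which is what makes the compressed description so short.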
@m_finzi
Marc Finzi
1 year
Bounding the NLL is challenging because its values can be arbitrarily bad. We limit the worst-case behavior Δ by applying prediction smoothing: mixing some amount of the uniform distribution over tokens into the model's predictive distribution. 5/9
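Concretely, with vocabulary size V and a mixing weight α (generic notation, not necessarily the paper's symbols), the smoothed distribution and the resulting worst-case NLL are:

```latex
\[
  p_{\alpha}(x \mid x_{<t}) \;=\; (1-\alpha)\, p_{\theta}(x \mid x_{<t}) \;+\; \frac{\alpha}{V},
  \qquad
  -\log p_{\alpha}(x \mid x_{<t}) \;\le\; \log\frac{V}{\alpha},
\]
```

so the per-token loss is bounded, and this bounded range plays the role of the worst-case term Δ in the bound.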
@m_finzi
Marc Finzi
1 year
To construct bounds relevant to LLMs, we bound the average negative log-likelihood per token of each document. By treating each document, rather than each token, as a distinct data point, we can satisfy the IID assumption. 4/9
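For illustration, the document-level quantity is just a per-document mean of token NLLs (the toy arrays below are hypothetical):

```python
import numpy as np

# Toy example: token-level NLLs (nats) for three documents of different lengths.
token_nlls = [
    np.array([3.2, 2.9, 3.1, 2.7]),        # document 1
    np.array([4.0, 3.8]),                   # document 2
    np.array([2.5, 2.6, 2.4, 2.8, 2.7]),    # document 3
]

# One number per document: the average NLL per token. These per-document averages
# are the IID quantities the bound is stated over.
doc_losses = np.array([nll.mean() for nll in token_nlls])
print(doc_losses)          # [2.975 3.9   2.6  ]
print(doc_losses.mean())   # empirical risk over documents
```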
@m_finzi
Marc Finzi
1 year
No need for PAC-Bayes, data-dependent priors, or other sophisticated techniques. Instead we use a simple variant of the finite hypothesis generalization bound, where the complexity log 1/P(h) is measured by the number of bits needed to express a given model h. 3/9
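For context, the textbook finite-hypothesis (Occam) bound has the following shape; the paper's variant may differ in details, so treat this as the standard reference form rather than their exact statement. For a countable hypothesis class with prior P, losses in [0, Δ], and n IID data points, with probability at least 1−δ, for all h:

```latex
\[
  R(h) \;\le\; \hat{R}(h) \;+\; \Delta \sqrt{\frac{\log\frac{1}{P(h)} + \log\frac{1}{\delta}}{2n}},
\]
```

and choosing $P(h) = 2^{-\mathrm{bits}(h)}$ makes $\log_2 \frac{1}{P(h)}$ exactly the number of bits needed to encode $h$.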
@m_finzi
Marc Finzi
1 year
We believe that a pure simplicity bias is enough to explain a lot of the generalization behavior of LLMs. The Occam's razor principle can be formalized through Kolmogorov complexity: models with low complexity and low training error are simple explanations of the data. 2/9
@m_finzi
Marc Finzi
1 year
In this work we construct the first nonvacuous generalization bounds for LLMs, helping to explain why these models generalize. w/ @LotfiSanae, @KuangYilun, @timrudner, @micahgoldblum, @andrewgwils. A 🧵 on how we make these bounds. 1/9
@m_finzi
Marc Finzi
2 years
We provide examples applying CoLA to PDEs, optimization, equivariant nets, GPs, & more! If you’re looking to compute solves, eigs, sqrts, matrix exponentials, determinants, traces, or diagonals of a structured matrix, CoLA is for you! Code: 👉 🥤 [7/7]
@m_finzi
Marc Finzi
2 years
The best algorithm depends on the structure of the matrix; for example, inverting a permutation can be done with mergesort. CoLA is designed with multiple dispatch: algorithms are specialized to the matrix at hand, and these dispatch rules combine compositionally. 🤖 [6/7]
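To illustrate the idea, here is a standalone sketch of structure-based dispatch (not CoLA's actual API or implementation): the inverse rule for a generic dense matrix falls back to a cubic-cost solve, while a permutation operator gets the sorting-based rule mentioned above.

```python
import numpy as np
from functools import singledispatch

class Dense:
    def __init__(self, A): self.A = np.asarray(A)

class Permutation:
    # Represents the operator (Px)[i] = x[perm[i]], i.e. row i has a 1 in column perm[i].
    def __init__(self, perm): self.perm = np.asarray(perm)

@singledispatch
def inverse(op):
    raise NotImplementedError(type(op))

@inverse.register
def _(op: Dense):
    return Dense(np.linalg.inv(op.A))        # generic O(n^3) route

@inverse.register
def _(op: Permutation):
    return Permutation(np.argsort(op.perm))  # sorting-based inverse, O(n log n)

print(inverse(Permutation([2, 0, 1])).perm)  # -> [1 2 0]
```

New structures and new rules can be added independently, and composite operators can route each piece to its specialized rule, which is the compositional behavior the tweet describes.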