Marc Finzi Profile
Marc Finzi

@m_finzi

Followers 1K · Following 847 · Media 37 · Statuses 163

OpenAI researcher. Previously postdoc at CMU and PhD at NYU. Ex-physics.

San Francisco
Joined January 2017
@Plinz
Joscha Bach
1 month
Artificial Intelligence might be the last step in a 2000 year intellectual project: the naturalization of the mind as a dynamic mathematical object. This is the essence of the machine consciousness hypothesis.
75
108
937
@m_finzi
Marc Finzi
6 months
This lets us prove something striking: The train-test gap must shrink as the models are scaled up. Our bounds grow predictably as a sum of power laws for each of the relevant contributions, an interpretable story about generalization! 6/7
2
3
7
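Read schematically (an illustrative shape only, with placeholder coefficients a_i, exponents alpha_i, parameter count N, and token count D; the exact terms are in the paper), the claim is that the bound on the gap is a sum of decaying power-law terms:

```latex
% Illustrative shape only; the paper gives the precise terms and exponents.
\text{test loss} - \text{train loss} \;\lesssim\; a_1 N^{-\alpha_1} + a_2 D^{-\alpha_2} + \cdots
```

Under compute-optimal scaling, N and D grow together, so each term, and hence the bound on the train-test gap, shrinks with scale.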
@m_finzi
Marc Finzi
6 months
For small models, quantization and parameter counting produce tighter estimates of the information content, but for >10B-parameter models prequential coding produces tighter bounds. Thus C gets smaller as the models get bigger. 5/7
1
0
5
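As a rough sketch of the two information estimates being compared (hypothetical helper functions, not the paper's code):

```python
import math

def quantized_param_bits(num_params: int, bits_per_param: int = 8) -> float:
    """Naive estimate: information content is at most the number of (quantized) parameters times bits each."""
    return num_params * bits_per_param

def prequential_code_length(chunks, fit, logprob) -> float:
    """Prequential (online) coding: encode each chunk of data with a model fit only on
    the chunks seen so far and sum the code lengths. `fit(prefix)` and
    `logprob(model, chunk)` are hypothetical stand-ins for training and scoring."""
    total_bits, prefix = 0.0, []
    for chunk in chunks:
        model = fit(prefix)                                  # trained on the prefix only
        total_bits += -logprob(model, chunk) / math.log(2)   # nats -> bits
        prefix.append(chunk)
    return total_bits
```

If the training loss drops quickly, later chunks are cheap to encode and the prequential total comes out far below the parameter-count estimate, which is what pushes C down for large models.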
@m_finzi
Marc Finzi
6 months
Our bounds cleanly decompose into a loss variation, a quantization gap, and a random guessing cost, all tied to a compression ratio C: the information content of the model divided by the number of tokens in the pre-training data. 4/7
1
0
6
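Written out, the ratio that ties the three terms together is simply:

```latex
C \;=\; \frac{\text{information content of the trained model (in bits)}}{\text{number of tokens in the pre-training data}}
```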
@m_finzi
Marc Finzi
6 months
The key? 🔑 LLMs compress their data far more than you'd expect. If the training loss drops quickly 📉, prequential coding shows that large models store far less information than raw parameter counts suggest. 3/7
1
2
13
@m_finzi
Marc Finzi
6 months
Our findings align closely with the recent OpenAI GPT-4.5 pretraining team discussion, where Daniel Selsam explains how LLMs can produce intelligent behavior despite their massive parameter counts: https://t.co/raeTOjnkEW 2/7
1
0
3
@m_finzi
Marc Finzi
6 months
Why do larger language models generalize better? In our new ICLR paper, we derive an interpretable generalization bound showing that compute-optimal LLMs provably generalize better with scale! 📄 https://t.co/dtxHze6T8N 1/7🧵
arxiv.org
Why do larger language models generalize better? To investigate this question, we develop generalization bounds on the pretraining objective of large language models (LLMs) in the compute-optimal...
3
33
131
@ashertrockman
Asher Trockman
6 months
Are you a frontier lab investing untold sums in training? Are you trying to stay competitive? Are you finding that your competitors' models are ... thinking a bit too much like yours? Then https://t.co/qwVitSQK6o might be for you! @sama @elonmusk
5
30
141
@dylanjsam
Dylan Sam
9 months
To trust LLMs in deployment (e.g., agentic frameworks or for generating synthetic data), we should predict how well they will perform. Our paper shows that we can do this by simply asking black-box models multiple follow-up questions! w/ @m_finzi and @zicokolter 1/ 🧵
4
42
118
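One way to picture the black-box recipe described in the quoted thread (a hedged sketch; `query_model` and the particular follow-up prompts are hypothetical, not the paper's protocol):

```python
def confidence_from_followups(question: str, query_model) -> float:
    """Estimate how likely a black-box LLM is to answer `question` correctly by asking
    follow-up questions about its own answer and scoring the affirmative responses.
    `query_model(prompt) -> str` is a hypothetical wrapper around the black-box API."""
    answer = query_model(question)
    followups = [
        f"Q: {question}\nYou answered: {answer}\nAre you confident this is correct? Answer yes or no.",
        f"Q: {question}\nWould an expert agree with the answer '{answer}'? Answer yes or no.",
        f"Q: {question}\nIs the answer '{answer}' free of factual errors? Answer yes or no.",
    ]
    votes = [query_model(p).strip().lower().startswith("yes") for p in followups]
    # The fraction of "yes" responses acts as a cheap confidence score that can be
    # calibrated against held-out accuracy to predict downstream performance.
    return sum(votes) / len(votes)
```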
@m_finzi
Marc Finzi
2 years
We investigate other scientific questions about how structure in the data impacts the bounds, as well as how to make the evaluation scale. Check out @LotfiSanae's 🧵 https://t.co/HMVGZKVATH and @micahgoldblum's 🧵 https://t.co/WdqEtM9i4Z for details and additional insights! 9/9
@micahgoldblum
Micah Goldblum
2 years
Do LLMs simply memorize and parrot their pretraining data or do they learn patterns that generalize? Let’s put this to the test! We compute the first generalization guarantees for LLMs. w/ @LotfiSanae, @m_finzi, @KuangYilun, @timrudner, @andrewgwils https://t.co/IToy3BcQjW 1/9
0
0
4
@m_finzi
Marc Finzi
2 years
Intriguingly, we find that larger models achieve better generalization bounds as we scale them up, even if we hold the dataset size fixed! In other words, for large models we can find a more compressible set of parameters for the same training error. 8/9
1
0
5
@m_finzi
Marc Finzi
2 years
By trading off the extent of compression, we can find a compressed solution that achieves the best generalization bounds, and these bounds are nonvacuous! 7/9
1
0
3
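The trade-off can be pictured with a simple sweep (made-up numbers and a generic finite-hypothesis-style penalty, not the paper's evaluation code): compressing harder shrinks the complexity term but raises the training error, and the tightest bound sits in between.

```python
import math

def occam_bound(train_loss, complexity_bits, num_docs, delta=0.05, worst_case=10.0):
    """Generic empirical-loss-plus-complexity bound for losses in [0, worst_case];
    the paper's variant differs in its constants and in how the worst case is set."""
    complexity_nats = complexity_bits * math.log(2) + math.log(1 / delta)
    return train_loss + worst_case * math.sqrt(complexity_nats / (2 * num_docs))

# (train loss after compression, bits to describe the compressed model): made-up numbers
candidates = [(2.0, 1e7), (2.5, 2e6), (7.5, 2e5)]
bounds = [occam_bound(loss, bits, num_docs=2_000_000) for loss, bits in candidates]
# With these numbers the intermediate compression level gives the tightest bound.
best = min(range(len(bounds)), key=bounds.__getitem__)
```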
@m_finzi
Marc Finzi
2 years
We can make the bounds tighter by compressing the model. We do so by training with SubLoRA, an expressive and efficient nonlinear parametrization of the underlying model parameters. SubLoRA combines random subspace training with LoRA to achieve extreme compression ratios. 6/9
1
0
4
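A hedged sketch of the parametrization being described (my own minimal PyTorch rendering, not the authors' implementation): the LoRA factors are generated from a much smaller trainable vector through a fixed random projection, so only that small vector, plus a seed, needs to be encoded.

```python
import torch

class SubLoRALinear(torch.nn.Module):
    """Sketch of the SubLoRA idea: train a low-dimensional vector z and map it through a
    fixed random projection P into the LoRA factors A and B of a frozen linear layer."""

    def __init__(self, base: torch.nn.Linear, rank: int = 4, subspace_dim: int = 64):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                      # frozen base weights
        out_f, in_f = base.weight.shape
        self.rank, self.in_f, self.out_f = rank, in_f, out_f
        lora_numel = rank * (in_f + out_f)               # entries of A and B combined
        # Fixed (seedable) random projection from the small subspace into LoRA space.
        self.register_buffer("P", torch.randn(lora_numel, subspace_dim) / subspace_dim ** 0.5)
        self.z = torch.nn.Parameter(torch.zeros(subspace_dim))  # the only trained parameters

    def forward(self, x):
        lora_flat = self.P @ self.z                      # generate LoRA factors from z
        A = lora_flat[: self.rank * self.in_f].view(self.rank, self.in_f)
        B = lora_flat[self.rank * self.in_f:].view(self.out_f, self.rank)
        return self.base(x) + x @ A.t() @ B.t()          # base output + low-rank update
```

Because the low-rank update depends on products of entries generated from z, the map from z to the effective weights is nonlinear, which matches the "nonlinear parametrization" description.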
@m_finzi
Marc Finzi
2 years
Bounding the NLL is challenging because its values can be arbitrarily bad. We limit the worst-case behavior Δ by applying prediction smoothing, mixing some amount of the uniform distribution over tokens into the predicted distribution. 5/9
1
0
3
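Concretely, with mixing weight α and vocabulary size V (standard smoothing algebra; the notation here may differ from the paper's):

```latex
\tilde p(x_t \mid x_{<t}) \;=\; (1-\alpha)\, p_\theta(x_t \mid x_{<t}) + \frac{\alpha}{V}
\quad\Longrightarrow\quad
-\log \tilde p(x_t \mid x_{<t}) \;\le\; \log \frac{V}{\alpha} \;=:\; \Delta .
```

The smoothed NLL can never exceed log(V/α), which gives the bounded range the concentration argument needs.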
@m_finzi
Marc Finzi
2 years
To construct bounds that are relevant to LLMs, we build them on the average negative log-likelihood per token of each document. By treating each document, rather than each token, as a distinct data point, we can satisfy the IID assumption. 4/9
1
0
3
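A minimal sketch of that empirical risk (with a hypothetical `token_logprob(doc, t)` scoring function):

```python
def per_document_nll(documents, token_logprob):
    """Average negative log-likelihood per token, computed separately for each document,
    so each document contributes a single (IID) data point to the empirical risk."""
    scores = []
    for doc in documents:
        nll = -sum(token_logprob(doc, t) for t in range(len(doc))) / len(doc)
        scores.append(nll)
    return scores  # the empirical risk in the bound is the mean of these per-document values
```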
@m_finzi
Marc Finzi
2 years
No need for PAC-Bayes, data-dependent priors, or other sophisticated techniques. Instead we use a simple variant of the finite hypothesis generalization bound, where the complexity log 1/P(h) is measured by the number of bits needed to express a given model h. 3/9
1
0
3
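For reference, the textbook form of such a bound (the paper's variant may differ in constants and in how the loss range Δ is handled): for a prior P over a countable set of models and a loss bounded in [0, Δ], with probability at least 1-δ over m IID documents,

```latex
L(h) \;\le\; \hat L(h) \;+\; \Delta \sqrt{\frac{\log\frac{1}{P(h)} + \log\frac{1}{\delta}}{2m}},
\qquad \log_2 \frac{1}{P(h)} \;\approx\; \text{bits needed to encode } h .
```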
@m_finzi
Marc Finzi
2 years
We believe that a pure simplicity bias is enough to explain much of the generalization behavior of LLMs. The Occam's razor principle can be formalized through Kolmogorov complexity: models with low complexity and low training error are simple explanations of the data. 2/9
1
0
3
@m_finzi
Marc Finzi
2 years
In this work we construct the first nonvacuous generalization bounds for LLMs, helping to explain why these models generalize. w/ @LotfiSanae, @KuangYilun, @timrudner @micahgoldblum, @andrewgwils https://t.co/e0h4UmmpAF A 🧵on how we make these bounds 1/9
arxiv.org
Modern language models can contain billions of parameters, raising the question of whether they can generalize beyond the training data or simply parrot their training corpora. We provide the...
1
9
71
@m_finzi
Marc Finzi
2 years
We provide examples applying CoLA to PDEs, optimization, equivariant nets, GPs, & more! If you’re looking to compute solves, eigs, sqrts, matrix exponentials, determinants, trace or diagonals of a structured matrix, CoLA is for you! Code:👉 https://t.co/MDqp2LfL1i.🥤[7/7]
github.com
Compositional Linear Algebra. Contribute to wilson-labs/cola development by creating an account on GitHub.
2
0
21
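To illustrate the kind of structure exploitation a library like this automates (plain NumPy, not CoLA's actual API; see the linked repo for real usage): a Kronecker-structured solve never needs to form the full matrix.

```python
import numpy as np

def kron_solve(A, B, c):
    """Solve (A kron B) y = c via the identity (A kron B) vec(X) = vec(B X A^T)
    (column-major vec), so only the small factors A and B are ever factorized."""
    m, n = A.shape[0], B.shape[0]
    C = c.reshape(m, n).T                # un-vec the right-hand side
    Y = np.linalg.solve(B, C)            # apply B^{-1}
    X = np.linalg.solve(A, Y.T).T        # apply A^{-T} from the right
    return X.T.reshape(-1)               # re-vec the solution

# Sanity check against the dense solve (only feasible at toy sizes):
rng = np.random.default_rng(0)
A, B = rng.standard_normal((3, 3)), rng.standard_normal((4, 4))
c = rng.standard_normal(12)
assert np.allclose(kron_solve(A, B, c), np.linalg.solve(np.kron(A, B), c))
```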