Marc Finzi
@m_finzi
OpenAI researcher. Previously postdoc at CMU and PhD at NYU. ex physics
San Francisco · Joined January 2017
1K Followers · 847 Following · 37 Media · 163 Statuses
Artificial Intelligence might be the last step in a 2000 year intellectual project: the naturalization of the mind as a dynamic mathematical object. This is the essence of the machine consciousness hypothesis.
Much more in the paper! A big thanks to the team that made this possible @psiyumm, @entropicfox, Anming Gu, @chrismdesa, @zicokolter, @andrewgwils 📚 Check it out:
arxiv.org
Why do larger language models generalize better? To investigate this question, we develop generalization bounds on the pretraining objective of large language models (LLMs) in the compute-optimal...
This lets us prove something striking: The train-test gap must shrink as the models are scaled up. Our bounds grow predictably as a sum of power laws for each of the relevant contributions, an interpretable story about generalization! 6/7
For small models, quantization and parameter counting produce tighter estimates of the information content, but for >10B-parameter models prequential coding gives the tighter bound. Thus C gets smaller as the models get bigger. 5/7
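As a minimal sketch of the two estimates being compared (the `model.nll` / `model.update` hooks and the bits-per-parameter value are hypothetical placeholders, not the paper's implementation): the prequential codelength charges, for each training token, the bits needed to encode it under the model trained only on the tokens seen so far.

```python
import math

def parameter_count_bits(num_params, bits_per_param=8):
    """Naive information estimate: quantize every parameter to a fixed number
    of bits and count them all (illustrative, not the paper's scheme)."""
    return num_params * bits_per_param

def prequential_bits(model, token_stream):
    """Prequential (online) codelength: bits to encode each token under the
    model trained only on what came before it, summed over the stream.
    `model.nll(tok)` and `model.update(tok)` are hypothetical hooks."""
    total_nats = 0.0
    for tok in token_stream:
        total_nats += model.nll(tok)   # -log p(tok | tokens seen so far), in nats
        model.update(tok)              # take a training step on the new token
    return total_nats / math.log(2)    # convert nats to bits

# If the training loss drops quickly, prequential_bits can come out far below
# parameter_count_bits, so the ratio C = model bits / training tokens shrinks
# as models are scaled up.
```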
Our bounds cleanly decompose into a loss variation, a quantization gap, and a random guessing cost, all tied to a compression ratio C: the information content of the model divided by the number of tokens in the pretraining data. 4/7
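Written out (my notation, not the paper's exact statement), the ratio that all three terms are tied to is

$$ C \;=\; \frac{\text{number of bits needed to specify the trained model}}{\text{number of tokens in the pretraining data}}, $$

and the bound has the schematic shape $L_{\text{test}} \le \hat{L}_{\text{train}} + (\text{loss variation}) + (\text{quantization gap}) + (\text{random guessing cost})$.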
The key? 🔑 LLMs compress their data far more than you'd expect. If the training loss drops quickly 📉, prequential coding shows that large models store far less information than raw parameter counts suggest. 3/7
Our findings align closely with the recent OpenAI GPT-4.5 pretraining team discussion, where Daniel Selsam explains how LLMs can produce intelligent behavior in light of their massive parameter counts: https://t.co/raeTOjnkEW 2/7
Why do larger language models generalize better? In our new ICLR paper, we derive an interpretable generalization bound showing that compute-optimal LLMs provably generalize better with scale! 📄 https://t.co/dtxHze6T8N 1/7🧵
Are you a frontier lab investing untold sums in training? Are you trying to stay competitive? Are you finding that your competitors' models are ... thinking a bit too much like yours? Then https://t.co/qwVitSQK6o might be for you! @sama @elonmusk
To trust LLMs in deployment (e.g., agentic frameworks or for generating synthetic data), we should predict how well they will perform. Our paper shows that we can do this by simply asking black-box models multiple follow-up questions! w/ @m_finzi and @zicokolter 1/ 🧵
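The actual procedure is in the paper; purely as an illustration of the black-box setup (the `ask` wrapper, the follow-up prompts, and the scoring rule below are my own placeholders, not the method from the paper), one can probe reliability by asking the model follow-up questions about its own answer and aggregating the responses:

```python
# Illustrative sketch only: `ask` stands for any black-box chat API call, and
# the follow-up prompts / scoring rule are placeholders, not the paper's method.

FOLLOW_UPS = [
    "Are you sure about that answer? Reply yes or no.",
    "On a scale of 0-100, how confident are you in that answer?",
    "Could the opposite answer be correct? Reply yes or no.",
]

def predict_reliability(ask, question):
    """Ask a question, then several follow-ups about the model's own answer,
    and combine the responses into a crude reliability score in [0, 1]."""
    answer = ask(question)
    follow_up_answers = [
        ask(f"Question: {question}\nYour answer: {answer}\n{prompt}")
        for prompt in FOLLOW_UPS
    ]
    score = 0.0
    if "yes" in follow_up_answers[0].lower():
        score += 1.0
    try:  # parse the self-reported confidence, if the reply is a number
        score += float(follow_up_answers[1].strip().rstrip("%")) / 100.0
    except ValueError:
        pass
    if "no" in follow_up_answers[2].lower():
        score += 1.0
    return answer, score / len(FOLLOW_UPS)
```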
We investigate other scientific questions about how structure in the data impacts the bounds, as well as how to make the evaluation scale. Check out @LotfiSanae's 🧵 https://t.co/HMVGZKVATH and @micahgoldblum's 🧵 https://t.co/WdqEtM9i4Z for details and additional insights! 9/9
Do LLMs simply memorize and parrot their pretraining data or do they learn patterns that generalize? Let’s put this to the test! We compute the first generalization guarantees for LLMs. w/ @LotfiSanae, @m_finzi, @KuangYilun, @timrudner, @andrewgwils
https://t.co/IToy3BcQjW 1/9
Intriguingly, we find that larger models achieve better generalization bounds as we scale them up, even if we hold the dataset size fixed! In other words, for large models we can find a more compressible set of parameters for the same training error. 8/9
By trading off the extent of compression we can find a compressed solution that achieves the best generalization bounds, and these bounds are nonvacuous! 7/9
We can make the bounds tighter by compressing the model. We do so by training with SubLoRA, an expressive and efficient nonlinear parametrization of the underlying model parameters. SubLoRA combines random subspace training with LoRA to achieve extreme compression ratios. 6/9
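A minimal sketch of the idea for a single weight matrix (dimensions, rank, and subspace size are illustrative placeholders, and the real SubLoRA parametrizes the full model rather than one matrix):

```python
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in = 256, 256   # shape of one weight matrix (illustrative)
r = 4                    # LoRA rank
k = 64                   # dimension of the random subspace

# LoRA expresses the weight update as a rank-r product B @ A.
n_lora = r * d_in + d_out * r    # number of LoRA parameters for this matrix

# Random subspace training: generate those LoRA parameters from a much smaller
# trainable vector w through a fixed (frozen) random projection P.
P = rng.standard_normal((n_lora, k)) / np.sqrt(k)
w = np.zeros(k)                  # the only trainable parameters

def lora_update(w):
    """Map the k trainable parameters to a d_out x d_in weight update.
    The map w -> B(w) @ A(w) is quadratic in w, hence a nonlinear
    parametrization of the weights."""
    flat = P @ w
    A = flat[: r * d_in].reshape(r, d_in)
    B = flat[r * d_in :].reshape(d_out, r)
    return B @ A

W0 = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)  # placeholder pretrained weight
W = W0 + lora_update(w)   # effective weight used in the forward pass
```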
Bounding the NLL poses a challenge: the per-token values can be arbitrarily bad. We limit the worst-case behavior Δ by applying prediction smoothing, mixing the model's distribution over tokens with a small amount of the uniform distribution. 5/9
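Concretely, with vocabulary size V and mixing weight α (a placeholder value below, not the one used in the paper), the smoothed distribution is (1 − α)·p + α/V, which caps the worst-case per-token NLL at −log(α/V). A minimal sketch:

```python
import numpy as np

def smoothed_nll(logits, target, alpha=0.1):
    """NLL of `target` under (1 - alpha) * softmax(logits) + alpha * uniform.
    The worst case is bounded: -log p_smooth(target) <= -log(alpha / V)."""
    V = logits.shape[-1]
    p = np.exp(logits - logits.max())
    p /= p.sum()
    p_smooth = (1.0 - alpha) * p + alpha / V
    return -np.log(p_smooth[target])
```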
To construct bounds that are relevant to LLMs, we bound the average negative log-likelihood per token of each document. By treating each document, rather than each token, as a distinct data point, we can satisfy the IID assumption. 4/9
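In symbols (my notation), with documents $x^{(1)},\dots,x^{(m)}$ treated as IID draws and $T_i$ the length of document $i$, the bounded quantity is the per-document loss

$$ \ell\big(h, x^{(i)}\big) \;=\; -\frac{1}{T_i}\sum_{t=1}^{T_i} \log p_h\big(x^{(i)}_t \,\big|\, x^{(i)}_{<t}\big), $$

so the $m$ documents, rather than the individual tokens, play the role of IID samples in the bound.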
No need for PAC-Bayes, data-dependent priors, or other sophisticated techniques. Instead, we use a simple variant of the finite hypothesis generalization bound, where the complexity log 1/P(h) is measured by the number of bits needed to express a given model h. 3/9
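For reference, the generic bound of this type has the following shape (my notation; the paper's precise variant may differ). Assign each model $h$ a prior $P(h) = 2^{-\ell(h)}$, where $\ell(h)$ is the number of bits used to encode $h$, and suppose the per-document loss lies in an interval of width $\Delta$. Then with probability at least $1-\delta$ over the $m$ IID documents, simultaneously for all $h$,

$$ L(h) \;\le\; \hat{L}(h) + \Delta\sqrt{\frac{\ln\frac{1}{P(h)} + \ln\frac{1}{\delta}}{2m}} \;=\; \hat{L}(h) + \Delta\sqrt{\frac{\ell(h)\ln 2 + \ln\frac{1}{\delta}}{2m}}, $$

by Hoeffding's inequality plus a union bound weighted by $P(h)$.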
We believe that a pure simplicity bias is enough to explain much of the generalization behavior of LLMs. The Occam's razor principle can be formalized through Kolmogorov complexity: models with low complexity and low training error are simple explanations of the data. 2/9
In this work we construct the first nonvacuous generalization bounds for LLMs, helping to explain why these models generalize. w/ @LotfiSanae, @KuangYilun, @timrudner @micahgoldblum, @andrewgwils
https://t.co/e0h4UmmpAF A 🧵on how we make these bounds 1/9
arxiv.org
Modern language models can contain billions of parameters, raising the question of whether they can generalize beyond the training data or simply parrot their training corpora. We provide the...
We provide examples applying CoLA to PDEs, optimization, equivariant nets, GPs, & more! If you’re looking to compute solves, eigs, sqrts, matrix exponentials, determinants, trace or diagonals of a structured matrix, CoLA is for you! Code:👉 https://t.co/MDqp2LfL1i.🥤[7/7]
github.com
Compositional Linear Algebra. Contribute to wilson-labs/cola development by creating an account on GitHub.
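As a small illustration of why exploiting structure pays off (plain NumPy below, not CoLA's API; the library's actual interface is documented in the repo), a Kronecker-structured solve can go through the small factors instead of materializing the full matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 60
A = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned factors
B = rng.standard_normal((n, n)) + n * np.eye(n)
b = rng.standard_normal(n * n)

# Structured solve of (A kron B) x = b via the row-major identity
# (A kron B) vec(X) = vec(A X B^T): take x = vec(A^{-1} Y B^{-T}),
# where Y is the n x n reshape of b. Only two n x n solves are needed.
Y = b.reshape(n, n)
Z = np.linalg.solve(A, Y)                    # A^{-1} Y
x_fast = np.linalg.solve(B, Z.T).T.ravel()   # (A^{-1} Y) B^{-T}

# Dense reference: materialize the n^2 x n^2 Kronecker product (much costlier).
x_dense = np.linalg.solve(np.kron(A, B), b)
assert np.allclose(x_fast, x_dense)
```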