Suhas Kotha
@kothasuhas
Followers
581
Following
480
Media
23
Statuses
85
cs phd @ stanford
Joined May 2020
At the retreat, we’re hearing about the exciting work of a few of our current students: @JulieKallini, @JonSaadFalcon, @ShichengGLiu, @kothasuhas, …
2
10
74
🔎Did someone steal your language model? We can tell you, as long as you shuffled your training data🔀. All we need is some text from their model! Concretely, suppose Alice trains an open-weight model and Bob uses it to produce text. Can Alice prove Bob used her model?🚨
34
93
758
Does synthetic data always help text-embedder models? Not quite. The gains are sparse and come with trade-offs. We open-source data + code to make research on synthetic data for embeddings more rigorous. 1/
3
24
79
Even with full-batch gradients, DL optimizers defy classical optimization theory, as they operate at the *edge of stability.* With @alex_damian_, we introduce "central flows": a theoretical tool to analyze these dynamics that makes accurate quantitative predictions on real NNs.
19
213
1K
🚨Our new paper: Random Policy Valuation is Enough for LLM Reasoning with Verifiable Rewards We challenge the RL status quo. We find you don't need complex policy optimization for top-tier math reasoning. The key? Evaluating the Q function of a simple uniformly random policy. 🤯
10
40
254
Sharing a new paper w/ Peter Bartlett, @jasondeanlee, @ShamKakade6, Bin Yu. People talk about implicit regularization, but how good is it? We show it's surprisingly effective: GD dominates ridge for all linear regression, w/ more cool stuff on GD vs SGD https://t.co/oAVKiVgUUQ
10
32
187
📜 Paper on new pretraining paradigm: Synthetic Bootstrapped Pretraining SBP goes beyond next-token supervision in a single document by leveraging inter-document correlations to synthesize new data for training — no teacher needed. Validation: 1T data + 3B model from scratch.🧵
9
49
248
-2016 (classic era): focus on data efficiency
2017-2025 (pretraining era): focus on compute efficiency
2026-: focus on data efficiency (again)

The standard Transformer paradigm is optimized for compute efficiency. As we look at data efficiency, we'll see very different design choices.
Since compute grows faster than the web, we think the future of pre-training lies in the algorithms that will best leverage ♾ compute. We find simple recipes that improve the asymptote of compute scaling laws and are 5x more data efficient, offering better perf w/ sufficient compute.
15
74
629
This project wouldn’t have been possible without Marin! We share code to reproduce our runs and programmatically reconstruct every plot in the paper: https://t.co/tQguFP5Ig8
We also provide WandB reports/projects to access all of our 2000 tuning runs: https://t.co/NZicIpA59x
github.com: "Since internet data is growing slowly relative to compute, we're interested in finding pre-training algorithms that learn the most from a limited amount of data. To this end, we'..."
1
2
26
Thanks for reading! This is jointly led with my lovely co-author @konwookim and advisors @percyliang @tatsu_hashimoto. Paper:
arxiv.org: "Since compute grows much faster than web text available for language model pre-training, we ask how one should approach pre-training under fixed data and no compute constraints. We first show that..."
1
3
32
Though none of the individual interventions we consider are new (they're instead inspired by classical statistics + data-constrained ML), they show that algorithmic improvements are critical to greater data efficiency in a compute-rich future. We believe that correctly
1
1
25
Finally, we test our gains on downstream tasks, finding a 9% improvement on standard benchmarks at our scale. Moreover, when applying our interventions to math data from OctoThinker, we achieve 17.5x data efficiency.
1
2
20
Are ♾ parameters necessary for data efficiency wins? Via distillation, we compress an 8-ensemble into a single model and retain most of the improvement. Furthermore, we find that simply training on self-generations with the exact same arch can actually improve performance
2
1
22
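A minimal sketch, not the paper's actual recipe, of how ensemble-to-single-model distillation of this kind is typically set up: the student's next-token distribution is matched to the averaged distribution of the teacher ensemble via a KL loss. The function name, toy shapes, and temperature knob below are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def ensemble_distillation_loss(student_logits, teacher_logits_list, temperature=1.0):
    """KL divergence from the averaged-teacher distribution to the student's.

    student_logits:      (batch, seq, vocab) logits of the single student model.
    teacher_logits_list: list of (batch, seq, vocab) logits, one per ensemble member.
    """
    # Average the K teachers in probability space to form the ensemble target.
    teacher_probs = torch.stack(
        [F.softmax(t / temperature, dim=-1) for t in teacher_logits_list]
    ).mean(dim=0)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature**2

# Toy usage with random logits (shapes only; no real models involved).
student_logits = torch.randn(2, 8, 1000)
teacher_logits_list = [torch.randn(2, 8, 1000) for _ in range(8)]
print(ensemble_distillation_loss(student_logits, teacher_logits_list))
```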
We then test how our recipes scale to higher token counts. The baseline needs >1B tokens to match our best asymptote at 200M tokens, meaning 5.17x data efficiency. If the slopes and asymptotes of our data scaling laws are equal, there is a constant improvement across all scales
1
1
21
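For concreteness, the 5.17x figure above reads as a token ratio at matched loss; a sketch of the arithmetic (the ~1.03B number is simply 5.17 × 200M, implied rather than stated in the tweet):

```latex
\[
  \text{data efficiency} \;=\; \frac{D_{\text{baseline}}(L^{*})}{D_{\text{ours}}(L^{*})}
  \;\approx\; \frac{1.03\text{B tokens}}{200\text{M tokens}} \;\approx\; 5.17,
\]
where $L^{*}$ is the best (asymptotic) loss the improved recipe reaches at 200M tokens.
```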
We compose all of our interventions (epoching, regularization, parameter scaling, and ensemble scaling) into a single recipe and take both N, K → ♾️ to further bring down loss. We estimate its best possible performance via the asymptote of a power law of asymptotes.
1
1
26
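One way to read "the asymptote of a power law of asymptotes" (my notation, not the paper's): fit a scaling law at each ensemble size K, take its model-size asymptote, then fit a second power law over those asymptotes and take its own limit.

```latex
% Two-level extrapolation sketch; E_K, A_K, alpha_K, B, beta are assumed notation.
\[
  L_K(N) = E_K + A_K\,N^{-\alpha_K}, \qquad
  E_K    = E_\infty + B\,K^{-\beta}, \qquad
  E_\infty = \lim_{K\to\infty}\,\lim_{N\to\infty} L_K(N).
\]
```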
However, scaling parameter count is only one possible recipe. We find that ensembling K independently trained models gives a lower loss asymptote. With enough compute, it is better to train multiple small models (e.g. two 300M) instead of a single larger model (e.g. one 600M)!
1
3
36
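A minimal sketch of inference-time ensembling in the spirit of the tweet above, assuming K independently trained causal LMs that share a tokenizer; the checkpoint names are placeholders, not real models.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoints: e.g. two 300M models trained from different seeds (hypothetical names).
member_names = ["my-org/lm-300m-seed0", "my-org/lm-300m-seed1"]
tokenizer = AutoTokenizer.from_pretrained(member_names[0])
members = [AutoModelForCausalLM.from_pretrained(n).eval() for n in member_names]

@torch.no_grad()
def ensemble_next_token_probs(text: str) -> torch.Tensor:
    """Uniformly average the members' next-token distributions (probability space)."""
    inputs = tokenizer(text, return_tensors="pt")
    member_probs = [m(**inputs).logits[:, -1].softmax(dim=-1) for m in members]
    return torch.stack(member_probs).mean(dim=0)

print(ensemble_next_token_probs("Scaling laws predict").topk(5))
```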
Past methods of increasing epochs or params N overfit w/ a fixed number of web tokens. After regularizing with much higher weight decay, we instead find loss follows a clean power law. The best possible loss is the limit as N→♾, which we estimate via the scaling law asymptote.
2
0
27
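A minimal sketch of the asymptote estimate described above (not the authors' code; the parameter counts and losses below are made up): fit L(N) = E + A·N^(−α) and read off E as the N→♾ limit.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical (parameter count, validation loss) pairs; real values would come from the tuned runs.
params = np.array([19e6, 75e6, 300e6, 1.2e9])
losses = np.array([3.85, 3.52, 3.31, 3.20])

def power_law(N, E, A, alpha):
    # L(N) = E + A * N**(-alpha); E is the best achievable loss as N -> infinity.
    return E + A * N ** (-alpha)

(E, A, alpha), _ = curve_fit(power_law, params, losses, p0=[3.0, 1e3, 0.5], maxfev=20000)
print(f"estimated asymptote E = {E:.3f} (A = {A:.1f}, alpha = {alpha:.2f})")
```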
There’s been a lot of work on unlearning in LLMs, trying to erase memorization without hurting capabilities — but we haven’t seen much success. ❓What if unlearning is actually doomed from the start? 👇This thread explains why and how *memorization sinks* offer a new way forward.
6
40
175
1/Excited to share the first in a series of my research updates on LLM pretraining🚀. Our new work shows *distilled pretraining*—increasingly used to train deployable models—has trade-offs: ✅ Boosts test-time scaling ⚠️ Weakens in-context learning ✨ Needs tailored data curation
5
67
332