Suhas Kotha

@kothasuhas

Followers: 581
Following: 480
Media: 23
Statuses: 85

cs phd @ stanford

Joined May 2020
@kothasuhas
Suhas Kotha
2 months
Since compute grows faster than the web, we think the future of pre-training lies in the algorithms that will best leverage ♾ compute. We find simple recipes that improve the asymptote of compute scaling laws and are 5x more data efficient, offering better perf w/ sufficient compute
9
83
444
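As a reader's aid (not from the paper): a minimal sketch of what "improving the asymptote" of a scaling law means, assuming the usual saturating power-law form L(D) = E + A·D^(-α), where the asymptote E is the loss that remains as the data/compute budget D grows without bound. Every constant below is invented for illustration.

```python
# Illustrative sketch only (assumed functional form, made-up constants, not paper numbers):
# a saturating power law L(D) = E + A * D**(-alpha); E is the asymptote, the loss that
# remains as the data/compute budget D grows without bound.

def loss(D, E, A, alpha):
    return E + A * D ** (-alpha)

baseline = dict(E=3.20, A=45.0, alpha=0.35)    # hypothetical standard recipe
improved = dict(E=3.05, A=170.0, alpha=0.35)   # hypothetical recipe with a lower asymptote

for D in [2e8, 1e9, 1e10]:
    print(f"D={D:.0e}  baseline={loss(D, **baseline):.3f}  improved={loss(D, **improved):.3f}")
```

With these invented constants, the lower-asymptote recipe is slightly behind at 2e8 but ahead from roughly 1e9 onward, which is the "better perf w/ sufficient compute" shape the tweet describes.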
@stanfordnlp
Stanford NLP Group
24 days
At the retreat, we’re hearing about the exciting work of a few of our current students: @JulieKallini, @JonSaadFalcon, @ShichengGLiu, @kothasuhas, …
2
10
74
@SallyHZhu
Sally Zhu
26 days
🔎Did someone steal your language model? We can tell you, as long as you shuffled your training data🔀. All we need is some text from their model! Concretely, suppose Alice trains an open-weight model and Bob uses it to produce text. Can Alice prove Bob used her model?🚨
34
93
758
@jacspringer
Jacob Springer
1 month
Does synthetic data always help text-embedder models? Not quite. The gains are sparse and come with trade-offs. We open-source data + code to make research on synthetic data for embeddings more rigorous. 1/
3
24
79
@deepcohen
Jeremy Cohen
2 months
Even with full-batch gradients, DL optimizers defy classical optimization theory, as they operate at the *edge of stability.* With @alex_damian_, we introduce "central flows": a theoretical tool to analyze these dynamics that makes accurate quantitative predictions on real NNs.
19
213
1K
@tinner_he
Haoran He
2 months
🚨Our new paper: Random Policy Valuation is Enough for LLM Reasoning with Verifiable Rewards We challenge the RL status quo. We find you don't need complex policy optimization for top-tier math reasoning. The key? Evaluating the Q function of a simple uniformly random policy. 🤯
10
40
254
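A toy, hedged illustration of the headline idea (my own reduction, not the paper's algorithm or setting): with a verifiable 0/1 reward, Monte Carlo rollouts can estimate the Q function of a uniformly random policy, and acting greedily on that Q recovers the answer. The bit-string environment, target, and sample counts below are all invented.

```python
# Toy sketch, not the paper's method: Monte Carlo evaluation of Q for a *uniformly
# random* policy under a verifiable 0/1 reward, followed by greedy action selection.
import random

TARGET = "10110"                      # hidden "correct answer"; reward is verifiable
T = len(TARGET)

def verify(seq):
    """Verifiable reward: 1 iff the completed answer matches the target."""
    return 1.0 if "".join(seq) == TARGET else 0.0

def rollout_uniform(prefix):
    """Finish the answer with uniformly random actions and return the final reward."""
    seq = list(prefix)
    while len(seq) < T:
        seq.append(random.choice("01"))
    return verify(seq)

def q_uniform(prefix, action, n=2000):
    """Monte Carlo estimate of Q^uniform(prefix, action)."""
    return sum(rollout_uniform(prefix + action) for _ in range(n)) / n

prefix = ""
while len(prefix) < T:                # act greedily w.r.t. the random policy's Q
    prefix += max("01", key=lambda a: q_uniform(prefix, a))
print(prefix, "reward:", verify(prefix))
```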
@uuujingfeng
Jingfeng Wu
2 months
Sharing a new paper w/ Peter Bartlett, @jasondeanlee, @ShamKakade6, Bin Yu. People talk about implicit regularization, but how good is it? We show it's surprisingly effective: GD dominates ridge for all linear regression, w/ more cool stuff on GD vs SGD https://t.co/oAVKiVgUUQ
10
32
187
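To make the GD-vs-ridge comparison concrete, here is a small experiment one could run (my own toy setup, not the paper's construction or proof): track test risk along the gradient-descent path on unregularized least squares and compare against ridge regression over a grid of regularization strengths.

```python
# Toy numerical comparison, not the paper's result: test risk along the GD path on
# unregularized least squares vs. ridge regression over a lambda grid, on a random
# overparameterized linear regression instance.
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 100, 200, 0.5
w_star = rng.normal(size=d) / np.sqrt(d)
X = rng.normal(size=(n, d))
y = X @ w_star + sigma * rng.normal(size=n)

def test_risk(w):
    # For isotropic test inputs, excess risk is ||w - w*||^2.
    return float(np.sum((w - w_star) ** 2))

# Gradient descent on the unregularized least-squares objective (implicit regularization
# comes from early stopping along the path).
w = np.zeros(d)
lr = n / np.linalg.norm(X, 2) ** 2      # stable step: lr * lambda_max(X^T X / n) = 1
gd_risks = []
for _ in range(2000):
    w -= lr * X.T @ (X @ w - y) / n
    gd_risks.append(test_risk(w))

# Ridge regression across a grid of explicit regularization strengths.
ridge_risks = []
for lam in np.logspace(-4, 2, 50):
    w_ridge = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)
    ridge_risks.append(test_risk(w_ridge))

print(f"best GD iterate risk: {min(gd_risks):.4f}")
print(f"best ridge risk:      {min(ridge_risks):.4f}")
```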
@ZitongYang0
Zitong Yang
2 months
📜 Paper on new pretraining paradigm: Synthetic Bootstrapped Pretraining SBP goes beyond next-token supervision in a single document by leveraging inter-document correlations to synthesize new data for training — no teacher needed. Validation: 1T data + 3B model from scratch.🧵
9
49
248
@percyliang
Percy Liang
2 months
-2016 (classic era): focus on data efficiency
2017-2025 (pretraining era): focus on compute efficiency
2026-: focus on data efficiency (again)
The standard Transformer paradigm is optimized for compute efficiency. As we look at data efficiency, we'll see very different design
@kothasuhas
Suhas Kotha
2 months
Since compute grows faster than the web, we think the future of pre-training lies in the algorithms that will best leverage ♾ compute. We find simple recipes that improve the asymptote of compute scaling laws and are 5x more data efficient, offering better perf w/ sufficient compute
15
74
629
@kothasuhas
Suhas Kotha
2 months
This project wouldn’t have been possible without Marin! We share code to reproduce our runs and programmatically reconstruct every plot in the paper: https://t.co/tQguFP5Ig8 https://t.co/NZicIpA59x We also provide WandB reports/projects to access all of our 2000 runs tuning
github.com: Since internet data is growing slowly relative to compute, we're interested in finding pre-training algorithms that learn the most from a limited amount of data. To this end, we'...
1
2
26
@kothasuhas
Suhas Kotha
2 months
Though none of the individual interventions we consider is new (they're inspired by classical statistics + data-constrained ML), they show that algorithmic improvements are critical to greater data efficiency in a compute-rich future. We believe that correctly
1
1
25
@kothasuhas
Suhas Kotha
2 months
Finally, we test our gains on downstream tasks, finding a 9% improvement on standard benchmarks at our scale. Moreover, when applying our interventions to math data from OctoThinker, we achieve 17.5x data efficiency.
1
2
20
@kothasuhas
Suhas Kotha
2 months
Are ♾ parameters necessary for data efficiency wins? Via distillation, we compress an 8-ensemble into a single model and retain most of the improvement. Furthermore, we find that simply training on self-generations with the exact same arch can actually improve performance
2
1
22
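A minimal sketch of distilling an ensemble into one model, assuming a standard KL-matching objective (the model interface, temperature, and loss form are assumptions of mine, not necessarily the paper's exact recipe):

```python
# Minimal sketch, assumed setup: match the student to the ensemble's averaged next-token
# distribution with a KL loss. `student` and each element of `teachers` are assumed to be
# causal LMs mapping token ids to logits of shape (batch, seq, vocab).
import torch
import torch.nn.functional as F

def ensemble_distill_loss(student, teachers, input_ids, temperature=1.0):
    with torch.no_grad():
        # Average the teachers' next-token *probabilities*, not their logits.
        teacher_probs = torch.stack(
            [F.softmax(t(input_ids) / temperature, dim=-1) for t in teachers]
        ).mean(dim=0)
    student_logprobs = F.log_softmax(student(input_ids) / temperature, dim=-1)
    # KL(teacher || student), averaged over the batch.
    return F.kl_div(student_logprobs, teacher_probs, reduction="batchmean")
```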
@kothasuhas
Suhas Kotha
2 months
We then test how our recipes scale to higher token counts. The baseline needs >1B tokens to match our best asymptote at 200M tokens, meaning 5.17x data efficiency. If the slopes and asymptotes of our data scaling laws are equal, there is a constant improvement across all scales
1
1
21
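For the arithmetic behind the multiplier: data efficiency here is the ratio of tokens the baseline needs to reach a given loss to the tokens the new recipe needs, so 5.17x at 200M tokens implies a baseline budget of roughly 1.03B tokens, consistent with the ">1B" above.

```python
# Arithmetic from the numbers in the tweet above:
# data efficiency = tokens the baseline needs / tokens the recipe needs, at equal loss.
ours_tokens = 200e6        # the recipe reaches the target loss at 200M tokens
efficiency = 5.17          # reported data-efficiency multiplier
baseline_tokens = efficiency * ours_tokens
print(f"implied baseline budget: {baseline_tokens / 1e9:.2f}B tokens")  # ~1.03B, i.e. >1B
```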
@kothasuhas
Suhas Kotha
2 months
We compose all of our interventions (epoching, regularization, parameter scaling, and ensemble scaling) into a single recipe and take both N,K→♾️ to further bring down loss. We estimate its best possible performance via the asymptote of a power law of asymptotes.
1
1
26
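One way to read "the asymptote of a power law of asymptotes", sketched with an assumed functional form and made-up numbers (not the paper's data or code): fit loss vs. parameter count N for each ensemble size K to get a per-K asymptote E_K, then fit E_K vs. K to a second power law and take its asymptote as the N,K→♾️ estimate.

```python
# Sketch with made-up numbers and an assumed functional form, not the paper's data/code.
import numpy as np
from scipy.optimize import curve_fit

def power_law(x, E, A, alpha):
    # Saturating power law with asymptote E.
    return E + A * x ** (-alpha)

Ns = np.array([0.15, 0.3, 0.6, 1.2])     # hypothetical parameter counts, in billions
losses_by_K = {                          # hypothetical losses per ensemble size K
    1: np.array([3.60, 3.45, 3.36, 3.31]),
    2: np.array([3.50, 3.36, 3.28, 3.23]),
    4: np.array([3.44, 3.31, 3.23, 3.19]),
    8: np.array([3.41, 3.28, 3.21, 3.17]),
}

# Stage 1: per-K asymptotes E_K from fitting loss vs. N.
Ks, Es = [], []
for K, losses in losses_by_K.items():
    (E_K, _, _), _ = curve_fit(power_law, Ns, losses, p0=[3.0, 0.3, 0.5], maxfev=20000)
    Ks.append(float(K))
    Es.append(E_K)

# Stage 2: fit the asymptotes themselves to a power law in K and take *its* asymptote.
(E_inf, _, _), _ = curve_fit(power_law, np.array(Ks), np.array(Es),
                             p0=[3.0, 0.2, 0.5], maxfev=20000)
print(f"estimated loss as N, K -> infinity: {E_inf:.3f}")
```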
@kothasuhas
Suhas Kotha
2 months
However, scaling parameter count is only one possible recipe. We find that ensembling K independently trained models gives a lower loss asymptote. With enough compute, it is better to train multiple small models (e.g. two 300M) instead of a single larger model (e.g. one 600M)!
1
3
36
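For concreteness, a minimal sketch of this kind of ensembling, assuming each member is a causal LM that maps token ids to logits of shape (batch, seq, vocab); the interface is my assumption. Member probabilities are averaged and tokens are scored under the mixture.

```python
# Minimal sketch, assumed interface: score text under a probability-averaged ensemble of
# K independently trained models.
import torch
import torch.nn.functional as F

def ensemble_nll(models, input_ids, targets):
    """Negative log-likelihood of `targets` under the averaged next-token distribution."""
    probs = torch.stack(
        [F.softmax(m(input_ids), dim=-1) for m in models]
    ).mean(dim=0)                                  # (batch, seq, vocab)
    logp = torch.log(probs.clamp_min(1e-12))
    return F.nll_loss(logp.flatten(0, 1), targets.flatten())
```

Note the inference cost: the ensemble takes K forward passes per token.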
@kothasuhas
Suhas Kotha
2 months
Past methods of increasing epochs or params N overfit w/ a fixed number of web tokens. After regularizing with much higher weight decay, we instead find loss follows a clean power law. The best possible loss is the limit as N→♾, which we estimate via the scaling law asymptote.
2
0
27
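A rough sketch of the training-side intervention this describes, i.e. regularizing hard enough that repeated epochs over a fixed token budget stop overfitting; the optimizer, learning rate, and the specific weight-decay value are placeholders of mine, not the paper's hyperparameters.

```python
# Rough sketch only; values are placeholders, not the paper's hyperparameters.
import torch

def make_optimizer(model, lr=3e-3, weight_decay=0.3):
    # weight_decay is set well above the common ~0.1 default to stand in for the
    # "much higher weight decay" mentioned above; the right value depends on the setup.
    return torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)

fixed_token_budget = 200_000_000   # fixed web-token budget (placeholder)
num_epochs = 8                     # repeat the same tokens for several epochs (placeholder)
```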
@AdtRaghunathan
Aditi Raghunathan
2 months
There’s been a lot of work on unlearning in LLMs, trying to erase memorization without hurting capabilities — but we haven’t seen much success. ❓What if unlearning is actually doomed from the start? 👇This thread explains why and how *memorization sinks* offer a new way forward.
6
40
175
@goyalsachin007
Sachin Goyal
2 months
1/Excited to share the first in a series of my research updates on LLM pretraining🚀. Our new work shows *distilled pretraining*—increasingly used to train deployable models—has trade-offs: ✅ Boosts test-time scaling ⚠️ Weakens in-context learning ✨ Needs tailored data curation
5
67
332