Suhas Kotha
@kothasuhas
Followers
581
Following
480
Media
23
Statuses
85
cs phd @ stanford
Joined May 2020
At the retreat, we’re hearing about the exciting work of a few of our current students: @JulieKallini, @JonSaadFalcon, @ShichengGLiu, @kothasuhas, …
2
10
74
🔎Did someone steal your language model? We can tell you, as long as you shuffled your training data🔀. All we need is some text from their model! Concretely, suppose Alice trains an open-weight model and Bob uses it to produce text. Can Alice prove Bob used her model?🚨
34
93
758
Does synthetic data always help text-embedder models? Not quite. The gains are sparse and come with trade-offs. We open-source data + code to make research on synthetic data for embeddings more rigorous. 1/
3
24
79
Even with full-batch gradients, DL optimizers defy classical optimization theory, as they operate at the *edge of stability.* With @alex_damian_, we introduce "central flows": a theoretical tool to analyze these dynamics that makes accurate quantitative predictions on real NNs.
19
213
1K
🚨Our new paper: Random Policy Valuation is Enough for LLM Reasoning with Verifiable Rewards We challenge the RL status quo. We find you don't need complex policy optimization for top-tier math reasoning. The key? Evaluating the Q function of a simple uniformly random policy. 🤯
10
40
254
Sharing a new paper w/ Peter Bartlett, @jasondeanlee, @ShamKakade6, Bin Yu. People talk about implicit regularization, but how good is it? We show it's surprisingly effective: GD dominates ridge for all linear regression, w/ more cool stuff on GD vs SGD https://t.co/oAVKiVgUUQ
10
32
187
📜 Paper on new pretraining paradigm: Synthetic Bootstrapped Pretraining SBP goes beyond next-token supervision in a single document by leveraging inter-document correlations to synthesize new data for training — no teacher needed. Validation: 1T data + 3B model from scratch.🧵
9
49
248
-2016 (classic era): focus on data efficiency
2017-2025 (pretraining era): focus on compute efficiency
2026-: focus on data efficiency (again)

The standard Transformer paradigm is optimized for compute efficiency. As we look at data efficiency, we'll see very different design choices.
Since compute grows faster than the web, we think the future of pre-training lies in the algorithms that will best leverage ♾ compute. We find simple recipes that improve the asymptote of compute scaling laws and are 5x more data efficient, offering better perf w/ sufficient compute.
15
74
629
This project wouldn’t have been possible without Marin! We share code to reproduce our runs and programmatically reconstruct every plot in the paper: https://t.co/tQguFP5Ig8
We also provide WandB reports/projects to access all of our 2000 tuning runs: https://t.co/NZicIpA59x
github.com: "Since internet data is growing slowly relative to compute, we're interested in finding pre-training algorithms that learn the most from a limited amount of data. To this end, we'..."
1
2
26
Thanks for reading! This is jointly led with my lovely co-author @konwookim and advisors @percyliang @tatsu_hashimoto. Paper:
arxiv.org: "Since compute grows much faster than web text available for language model pre-training, we ask how one should approach pre-training under fixed data and no compute constraints. We first show that..."
1
3
32
Though none of the individual interventions we consider are new (they're instead inspired by classical statistics + data-constrained ML), they show that algorithmic improvements are critical to greater data efficiency in a compute-rich future. We believe that correctly
1
1
25
Finally, we test our gains on downstream tasks, finding a 9% improvement on standard benchmarks at our scale. Moreover, when applying our interventions to math data from OctoThinker, we achieve 17.5x data efficiency.
1
2
20
Are ♾ parameters necessary for data efficiency wins? Via distillation, we compress an 8-ensemble into a single model and retain most of the improvement. Furthermore, we find that simply training on self-generations with the exact same arch can actually improve performance
2
1
22
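A minimal sketch, not the paper's actual recipe, of how ensemble-to-single-model distillation of this kind is typically set up: the student's next-token distribution is matched to the averaged distribution of the teacher ensemble via a KL loss. The function name, toy shapes, and temperature knob below are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def ensemble_distillation_loss(student_logits, teacher_logits_list, temperature=1.0):
    """KL divergence from the averaged-teacher distribution to the student's.

    student_logits:      (batch, seq, vocab) logits of the single student model.
    teacher_logits_list: list of (batch, seq, vocab) logits, one per ensemble member.
    """
    # Average the K teachers in probability space to form the ensemble target.
    teacher_probs = torch.stack(
        [F.softmax(t / temperature, dim=-1) for t in teacher_logits_list]
    ).mean(dim=0)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature**2

# Toy usage with random logits (shapes only; no real models involved).
student_logits = torch.randn(2, 8, 1000)
teacher_logits_list = [torch.randn(2, 8, 1000) for _ in range(8)]
print(ensemble_distillation_loss(student_logits, teacher_logits_list))
```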
We then test how our recipes scale to higher token counts. The baseline needs >1B tokens to match our best asymptote at 200M tokens, meaning 5.17x data efficiency. If the slopes and asymptotes of our data scaling laws are equal, there is a constant improvement across all scales
1
1
21
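For concreteness, the 5.17x figure above reads as a token ratio at matched loss; a sketch of the arithmetic (the ~1.03B number is simply 5.17 × 200M, implied rather than stated in the tweet):

```latex
\[
  \text{data efficiency} \;=\; \frac{D_{\text{baseline}}(L^{*})}{D_{\text{ours}}(L^{*})}
  \;\approx\; \frac{1.03\text{B tokens}}{200\text{M tokens}} \;\approx\; 5.17,
\]
where $L^{*}$ is the best (asymptotic) loss the improved recipe reaches at 200M tokens.
```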
We compose all of our interventions (epoching, regularization, parameter scaling, and ensemble scaling) into a single recipe and take both N, K → ♾️ to further bring down loss. We estimate its best possible performance via the asymptote of a power law of asymptotes.
1
1
26
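One way to read "the asymptote of a power law of asymptotes" (my notation, not the paper's): fit a scaling law at each ensemble size K, take its model-size asymptote, then fit a second power law over those asymptotes and take its own limit.

```latex
% Two-level extrapolation sketch; E_K, A_K, alpha_K, B, beta are assumed notation.
\[
  L_K(N) = E_K + A_K\,N^{-\alpha_K}, \qquad
  E_K    = E_\infty + B\,K^{-\beta}, \qquad
  E_\infty = \lim_{K\to\infty}\,\lim_{N\to\infty} L_K(N).
\]
```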
However, scaling parameter count is only one possible recipe. We find that ensembling K independently trained models gives a lower loss asymptote. With enough compute, it is better to train multiple small models (e.g. two 300M) instead of a single larger model (e.g. one 600M)!
1
3
36
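A minimal sketch of inference-time ensembling in the spirit of the tweet above, assuming K independently trained causal LMs that share a tokenizer; the checkpoint names are placeholders, not real models.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoints: e.g. two 300M models trained from different seeds (hypothetical names).
member_names = ["my-org/lm-300m-seed0", "my-org/lm-300m-seed1"]
tokenizer = AutoTokenizer.from_pretrained(member_names[0])
members = [AutoModelForCausalLM.from_pretrained(n).eval() for n in member_names]

@torch.no_grad()
def ensemble_next_token_probs(text: str) -> torch.Tensor:
    """Uniformly average the members' next-token distributions (probability space)."""
    inputs = tokenizer(text, return_tensors="pt")
    member_probs = [m(**inputs).logits[:, -1].softmax(dim=-1) for m in members]
    return torch.stack(member_probs).mean(dim=0)

print(ensemble_next_token_probs("Scaling laws predict").topk(5))
```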
Past methods of increasing epochs or params N overfit w/ a fixed number of web tokens. After regularizing with much higher weight decay, we instead find loss follows a clean power law. The best possible loss is the limit as N→♾, which we estimate via the scaling law asymptote.
2
0
27
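A minimal sketch of the asymptote estimate described above (not the authors' code; the parameter counts and losses below are made up): fit L(N) = E + A·N^(−α) and read off E as the N→♾ limit.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical (parameter count, validation loss) pairs; real values would come from the tuned runs.
params = np.array([19e6, 75e6, 300e6, 1.2e9])
losses = np.array([3.85, 3.52, 3.31, 3.20])

def power_law(N, E, A, alpha):
    # L(N) = E + A * N**(-alpha); E is the best achievable loss as N -> infinity.
    return E + A * N ** (-alpha)

(E, A, alpha), _ = curve_fit(power_law, params, losses, p0=[3.0, 1e3, 0.5], maxfev=20000)
print(f"estimated asymptote E = {E:.3f} (A = {A:.1f}, alpha = {alpha:.2f})")
```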
There’s been a lot of work on unlearning in LLMs, trying to erase memorization without hurting capabilities — but we haven’t seen much success. ❓What if unlearning is actually doomed from the start? 👇This thread explains why and how *memorization sinks* offer a new way forward.
6
40
175
1/Excited to share the first in a series of my research updates on LLM pretraining🚀. Our new work shows *distilled pretraining*—increasingly used to train deployable models—has trade-offs: ✅ Boosts test-time scaling ⚠️ Weakens in-context learning ✨ Needs tailored data curation
5
67
332