Sadhika Malladi
@SadhikaMalladi
Followers
2K
Following
738
Media
7
Statuses
264
Postdoc researcher at MSR NYC; incoming faculty at UCSD CSE; CS PhD at Princeton
Joined June 2022
Excited to share that I will be starting as an Assistant Professor in CSE at UCSD (@ucsd_cse) in Fall 2026! I am currently recruiting PhD students who want to bridge theory and practice in deep learning - see here:
39
71
543
Leaving aside the content of this story, thanks for finding this picture! Brings back old memories of my time at OpenAI in its early days :)
OpenAI team and their families at a July 2019 offsite. Microsoft invested $1 billion that same month. Full story of the Microsoft / OpenAI deal live on @TBPN today.
0
1
76
BTW if you are interested in other failures of xent, do also check out our prior paper on how pre-training for longer hurts your ability to do SFT (ICML 25):
arxiv.org
Large language models are pre-trained on ever-growing token budgets under the assumption that better pre-training performance translates to improved downstream models. In this work, we challenge...
0
0
4
Key figure from our new paper: coverage is more predictive than KL of whether a model will succeed at best-of-N. Read more in Dylan's thread and at
arxiv.org
Language models demonstrate remarkable abilities when pre-trained on large text corpora and fine-tuned for specific tasks, but how and why pre-training shapes the success of the final model...
@auddery @GolowichNoah @SadhikaMalladi @jordan_t_ash (7/12) Example (see figure):
- Cross-entropy decreases throughout training.
- Coverage improves to a point, but begins to drop as the model learns a spurious shortcut.
- BoN performance follows the trend of coverage, not CE (increasing initially, dropping as the shortcut is learned).
0
6
22
Coverage is necessary + sufficient for best-of-N (and, thus, RL) to work, and it can (and does!) behave differently from cross-entropy. So the model with the lowest xent is not always the best for RL! The coverage view motivates, e.g., test-time training.
arxiv.org
Language models demonstrate remarkable abilities when pre-trained on large text corpora and fine-tuned for specific tasks, but how and why pre-training shapes the success of the final model...
The coverage principle: How pre-training enables post-training New preprint where we look at the mechanisms through which next-token prediction produces models that succeed at downstream tasks. The answer involves a metric we call the "coverage profile", not cross-entropy.
3
13
94
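A minimal numerical sketch of the point in the two posts above, with made-up numbers rather than anything from the paper: best-of-N success is governed by how much probability mass the model keeps on correct answers (coverage), so a model can look mediocre on average cross-entropy yet dominate once N grows, and a model that learns a shortcut never recovers no matter how large N is.

    def best_of_n_success(p_correct_per_sample: float, n: int) -> float:
        """Chance that at least one of n independent samples is correct.
        This is what coverage controls: enough mass on correct completions
        makes best-of-n succeed even if average cross-entropy is mediocre."""
        return 1.0 - (1.0 - p_correct_per_sample) ** n

    # Hypothetical models: A keeps a little mass on the correct answer on
    # hard prompts; B has learned a shortcut and puts ~0 mass there.
    for n in (1, 16, 256):
        print(n, best_of_n_success(0.05, n), best_of_n_success(0.0, n))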
New paper we're excited to get online! Taming Imperfect Process Verifiers: A Sampling Perspective on Backtracking. A totally new framework based on ~backtracking~ for using process verifiers to guide inference, w/ connections to approximate counting/sampling in theoretical CS.
9
42
247
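A toy sketch of what verifier-guided backtracking could look like at inference time; the function names, acceptance threshold, and retry budget below are illustrative placeholders, not the paper's algorithm.

    def backtracking_sample(propose_step, verify_prefix, max_steps=20,
                            threshold=0.5, max_retries=8):
        """Toy verifier-guided decoding with backtracking.
        propose_step(prefix) -> candidate next step (e.g., one reasoning step)
        verify_prefix(prefix) -> score in [0, 1] from an imperfect process verifier
        A low-scoring extension is retried; if all retries fail, we discard the
        previous step instead of committing to a bad trajectory."""
        prefix = []
        for _ in range(max_steps):
            for _ in range(max_retries):
                step = propose_step(prefix)
                if verify_prefix(prefix + [step]) >= threshold:
                    prefix.append(step)
                    break
            else:
                if prefix:
                    prefix.pop()  # backtrack one step and try again
        return prefix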
Almost a decade ago, I coauthored a paper asking us to rethink our theory of generalization in machine learning. Today, I'm fine putting the theory back on the shelf.
argmin.net
You don't need a theorem to argue more data is better than less data
7
24
192
Glad to see this super cool work out! Anticipating the natural Q of how this relates to the SDEs I did -- both eqns have deterministic flow + oscillation. SDE is for the widely used setting of mini-batch training and this is for full-batch. But SDE alone gives no prediction on if
Even with full-batch gradients, DL optimizers defy classical optimization theory, as they operate at the *edge of stability.* With @alex_damian_, we introduce "central flows": a theoretical tool to analyze these dynamics that makes accurate quantitative predictions on real NNs.
0
2
18
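For context on the post above: the SDE view models mini-batch SGD with learning rate \eta as deterministic gradient flow plus a noise term driven by the mini-batch gradient covariance \Sigma (standard form, written from memory rather than quoted from either paper):

    dX_t = -\nabla L(X_t)\,dt + \sqrt{\eta\,\Sigma(X_t)}\,dW_t

Central flows play the analogous role for the full-batch, edge-of-stability regime discussed in the quoted post.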
Nice to see Thinking Machines engaging with some of the more theoretical academic literature, including our 2022 work on LoRA ( https://t.co/aSeXAN2Vff ), to support their very cool empirical findings!
arxiv.org
It has become standard to solve NLP tasks by fine-tuning pre-trained language models (LMs), especially in low-data settings. There is minimal theoretical understanding of empirical success, e.g.,...
LoRA makes fine-tuning more accessible, but it's unclear how it compares to full fine-tuning. We find that the performance often matches closely---more often than you might expect. In our latest Connectionism post, we share our experimental results and recommendations for LoRA.
0
2
9
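A minimal PyTorch-style sketch of the LoRA parameterization both posts refer to, with generic shapes and the common alpha/r scaling convention (not tied to either paper's exact setup): the pre-trained weight stays frozen and only a low-rank update B A is trained.

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        """Frozen linear layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""
        def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False                 # full weights stay frozen
            self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
            self.scale = alpha / r

        def forward(self, x):
            return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)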
Since compute grows faster than the web, we think the future of pre-training lies in the algorithms that will best leverage ♾️ compute. We find simple recipes that improve the asymptote of compute scaling laws, making them 5x more data efficient and offering better performance with sufficient compute.
9
83
444
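To gloss the terminology in the quoted post (the exact parameterization in that work may differ): compute scaling laws are usually fit with a saturating power law, where "improving the asymptote" means lowering the irreducible term L_\infty, and "5x more data efficient" means matching the baseline's loss with one fifth of the data.

    L(C) \approx L_\infty + A\,C^{-\alpha}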
I've really enjoyed the new settings introduced in both papers ( https://t.co/zKGmKErhGQ and https://t.co/c2LsTIjQop) from these folks and think there's great inspiration for practical algorithms here!
arxiv.org
Aligning AI systems with human values remains a fundamental challenge, but does our inability to create perfectly aligned models preclude obtaining the benefits of alignment? We study a strategic...
Aligning an AI with human preferences might be hard. But there is more than one AI out there, and users can choose which to use. Can we get the benefits of a fully aligned AI without solving the alignment problem? In a new paper we study a setting in which the answer is yes.
2
8
52
In the meantime, I'm at Microsoft Research NYC, learning about RL theory and exploring new research ideas. I'm open to collaborations, and I'm based in the Bay Area, so please reach out if you want to grab a coffee and chat!
2
2
34
Quick update: We have extended the deadline for FoRLM to Monday, September 8, 2025 AoE. Good luck with your submissions!
Announcing the first workshop on Foundations of Language Model Reasoning (FoRLM) at NeurIPS 2025! Soliciting abstracts that advance foundational understanding of reasoning in language models, from theoretical analyses to rigorous empirical studies. Deadline: Sept 3, 2025
0
6
29
We have a new workshop at NeurIPS 25 on understanding reasoning (via theory and/or exps)! Submit your work by Sep 3 and join us at the conference :)
Announcing the first workshop on Foundations of Language Model Reasoning (FoRLM) at NeurIPS 2025! Soliciting abstracts that advance foundational understanding of reasoning in language models, from theoretical analyses to rigorous empirical studies. Deadline: Sept 3, 2025
0
2
21
We make a very intriguing observation that more pre-training data can worsen downstream performance. Come visit our poster at #ICML2025 on Thursday at East Exhibition Hall A-B #E-2508 to learn more about when and why we see such catastrophic overtraining.
Training with more data = better LLMs, right? False! Scaling language models by adding more pre-training data can decrease your performance after post-training! Introducing "catastrophic overtraining." + arXiv. 1/9
0
14
97
MeCo got into ICML! Our method improves data efficiency of LMs by 33% by simply prepending URLs! My amazing co-author @_awettig will present MeCo's poster at the following session: July 17 (Thursday), 11 am - 1:30 pm PDT, East Exhibition Hall A-B #E-2600. Come by and say hi!
Introducing MeCo (metadata conditioning then cooldown), a remarkably simple method that accelerates LM pre-training by simply prepending source URLs to training documents. https://t.co/46dtUUVb0P
0
6
56
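A rough sketch of the data-preparation idea behind MeCo as described above, prepending each document's source URL during pre-training and dropping it for a final cooldown phase; the exact template and schedule here are simplified placeholders.

    def meco_format(doc: dict, cooldown: bool = False) -> str:
        """Prepend source-URL metadata to a training document.
        In the main phase the model conditions on the URL; in the final
        cooldown phase the metadata is dropped so the model also handles
        plain text at inference time."""
        if cooldown:
            return doc["text"]
        return f"{doc['url']}\n\n{doc['text']}"

    example = {"url": "en.wikipedia.org/wiki/Language_model",
               "text": "A language model is ..."}
    print(meco_format(example))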
Excited to be giving this talk at COLT tomorrow :) reach out if you want to chat about deriving useful theoretical insights into modern-day language models!
Announcing the first workshop on Foundations of Post-Training (FoPT) at COLT 2025! Soliciting abstracts/posters exploring theoretical & practical aspects of post-training and RL with language models! Deadline: May 19, 2025
0
6
41
Adam is similar to many algorithms, but cannot be effectively replaced by any simpler variant in LMs. The community is starting to get the recipe right, but what is the secret sauce? @gowerrobert and I found that it has to do with the beta parameters and variational inference.
11
50
317
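For reference alongside the post above, the standard Adam update, showing where the beta parameters enter (this is the textbook form, not the paper's variational-inference derivation):

    import math

    def adam_step(p, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        """One Adam update on a scalar parameter p with gradient g at step t.
        beta1 and beta2 set the averaging horizons for the first and second
        moments of the gradient, i.e., the 'beta parameters' referenced above."""
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)       # bias correction
        v_hat = v / (1 - beta2 ** t)
        p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
        return p, m, v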
Data selection and curriculum learning can be formally viewed as a compression protocol via prequential coding. New blog (with @AllanZhou17) about this neat idea that motivated ADO but didn't make it into the paper. https://t.co/kkLyZN2CF7
yidingjiang.github.io
We describe a unified framework for data selection and curriculum learning via compression.
2
19
106
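A small sketch of the prequential-coding view mentioned above; the online learner's interface (predict_proba, update) is a placeholder. Each example is encoded under the current model and only then used to update it, so a data-selection or curriculum policy is judged by how much it shortens the total codelength.

    import math

    def prequential_codelength(stream, model):
        """Codelength (in nats) of a data stream under an online learner."""
        total = 0.0
        for x, y in stream:
            total += -math.log(model.predict_proba(x, y))  # encode y given x under current model
            model.update(x, y)                             # then learn from the example
        return total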