Sadhika Malladi
@SadhikaMalladi
Followers
2K
Following
738
Media
7
Statuses
264
Postdoc researcher at MSR NYC; incoming faculty at UCSD CSE; CS PhD at Princeton
Joined June 2022
Excited to share that I will be starting as an Assistant Professor in CSE at UCSD (@ucsd_cse) in Fall 2026! I am currently recruiting PhD students who want to bridge theory and practice in deep learning - see here:
39
71
543
Leaving aside the content of this story, thanks for finding this picture! Brings back old memories of my time at OpenAI in its early days :)
OpenAI team and their families at a July 2019 offsite. Microsoft invested $1 billion that same month. Full story of the Microsoft / OpenAI deal live on @TBPN today.
0
1
76
BTW if you are interested in other failures of xent, do also check out our prior paper on how pre-training for longer hurts your ability to do SFT (ICML 25):
arxiv.org
Large language models are pre-trained on ever-growing token budgets under the assumption that better pre-training performance translates to improved downstream models. In this work, we challenge...
0
0
4
Key figure from our new paper: coverage is more predictive than KL of whether a model will succeed at best-of-N. Read more in Dylan's thread and at
arxiv.org
Language models demonstrate remarkable abilities when pre-trained on large text corpora and fine-tuned for specific tasks, but how and why pre-training shapes the success of the final model...
@auddery @GolowichNoah @SadhikaMalladi @jordan_t_ash (7/12) Example (see figure):
- Cross-entropy decreases throughout training.
- Coverage improves to a point, but begins to drop as the model learns a spurious shortcut.
- BoN performance follows the trend of coverage, not CE (increasing initially, dropping as the shortcut is learned).
0
6
22
Coverage is necessary + sufficient for best-of-N (and, thus, RL) to work, and it can (and does!) behave differently from cross-entropy. So the model with the lowest xent is not always the best for RL! The coverage view motivates, e.g., test-time training.
arxiv.org
Language models demonstrate remarkable abilities when pre-trained on large text corpora and fine-tuned for specific tasks, but how and why pre-training shapes the success of the final model...
The coverage principle: How pre-training enables post-training New preprint where we look at the mechanisms through which next-token prediction produces models that succeed at downstream tasks. The answer involves a metric we call the "coverage profile", not cross-entropy.
3
13
94
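A minimal numerical sketch of the point in the two posts above, with made-up numbers rather than anything from the paper: best-of-N success is governed by how much probability mass the model keeps on correct answers (coverage), so a model can look mediocre on average cross-entropy yet dominate once N grows, and a model that learns a shortcut never recovers no matter how large N is.

    def best_of_n_success(p_correct_per_sample: float, n: int) -> float:
        """Chance that at least one of n independent samples is correct.
        This is what coverage controls: enough mass on correct completions
        makes best-of-n succeed even if average cross-entropy is mediocre."""
        return 1.0 - (1.0 - p_correct_per_sample) ** n

    # Hypothetical models: A keeps a little mass on the correct answer on
    # hard prompts; B has learned a shortcut and puts ~0 mass there.
    for n in (1, 16, 256):
        print(n, best_of_n_success(0.05, n), best_of_n_success(0.0, n))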
New paper we're excited to get online! Taming Imperfect Process Verifiers: A Sampling Perspective on Backtracking. A totally new framework based on ~backtracking~ for using process verifiers to guide inference, w/ connections to approximate counting/sampling in theoretical CS.
9
42
247
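A toy sketch of what verifier-guided backtracking could look like at inference time; the function names, acceptance threshold, and retry budget below are illustrative placeholders, not the paper's algorithm.

    def backtracking_sample(propose_step, verify_prefix, max_steps=20,
                            threshold=0.5, max_retries=8):
        """Toy verifier-guided decoding with backtracking.
        propose_step(prefix) -> candidate next step (e.g., one reasoning step)
        verify_prefix(prefix) -> score in [0, 1] from an imperfect process verifier
        A low-scoring extension is retried; if all retries fail, we discard the
        previous step instead of committing to a bad trajectory."""
        prefix = []
        for _ in range(max_steps):
            for _ in range(max_retries):
                step = propose_step(prefix)
                if verify_prefix(prefix + [step]) >= threshold:
                    prefix.append(step)
                    break
            else:
                if prefix:
                    prefix.pop()  # backtrack one step and try again
        return prefix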
Almost a decade ago, I coauthored a paper asking us to rethink our theory of generalization in machine learning. Today, I'm fine putting the theory back on the shelf.
argmin.net
You don't need a theorem to argue more data is better than less data
7
24
192
Glad to see this super cool work out! Anticipating the natural Q of how this relates to the SDEs I did -- both eqns have deterministic flow + oscillation. SDE is for the widely used setting of mini-batch training and this is for full-batch. But SDE alone gives no prediction on if
Even with full-batch gradients, DL optimizers defy classical optimization theory, as they operate at the *edge of stability.* With @alex_damian_, we introduce "central flows": a theoretical tool to analyze these dynamics that makes accurate quantitative predictions on real NNs.
0
2
18
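For context on the post above: the SDE view models mini-batch SGD with learning rate \eta as deterministic gradient flow plus a noise term driven by the mini-batch gradient covariance \Sigma (standard form, written from memory rather than quoted from either paper):

    dX_t = -\nabla L(X_t)\,dt + \sqrt{\eta\,\Sigma(X_t)}\,dW_t

Central flows play the analogous role for the full-batch, edge-of-stability regime discussed in the quoted post.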
Nice to see Thinking Machines engaging with some of the more theoretical academic literature, including our 2022 work on LoRA ( https://t.co/aSeXAN2Vff ), to support their very cool empirical findings!
arxiv.org
It has become standard to solve NLP tasks by fine-tuning pre-trained language models (LMs), especially in low-data settings. There is minimal theoretical understanding of empirical success, e.g.,...
LoRA makes fine-tuning more accessible, but it's unclear how it compares to full fine-tuning. We find that the performance often matches closely---more often than you might expect. In our latest Connectionism post, we share our experimental results and recommendations for LoRA.
0
2
9
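A minimal PyTorch-style sketch of the LoRA parameterization both posts refer to, with generic shapes and the common alpha/r scaling convention (not tied to either paper's exact setup): the pre-trained weight stays frozen and only a low-rank update B A is trained.

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        """Frozen linear layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""
        def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False                 # full weights stay frozen
            self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
            self.scale = alpha / r

        def forward(self, x):
            return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)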
Since compute grows faster than the web, we think the future of pre-training lies in the algorithms that will best leverage ♾️ compute. We find simple recipes that improve the asymptote of compute scaling laws, making them 5x more data efficient and offering better performance with sufficient compute.
9
83
444
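To gloss the terminology in the quoted post (the exact parameterization in that work may differ): compute scaling laws are usually fit with a saturating power law, where "improving the asymptote" means lowering the irreducible term L_\infty, and "5x more data efficient" means matching the baseline's loss with one fifth of the data.

    L(C) \approx L_\infty + A\,C^{-\alpha}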
I've really enjoyed the new settings introduced in both papers ( https://t.co/zKGmKErhGQ and https://t.co/c2LsTIjQop) from these folks and think there's great inspiration for practical algorithms here!
arxiv.org
Aligning AI systems with human values remains a fundamental challenge, but does our inability to create perfectly aligned models preclude obtaining the benefits of alignment? We study a strategic...
Aligning an AI with human preferences might be hard. But there is more than one AI out there, and users can choose which to use. Can we get the benefits of a fully aligned AI without solving the alignment problem? In a new paper we study a setting in which the answer is yes.
2
8
52
In the meantime, I'm at Microsoft Research NYC, learning about RL theory and exploring new research ideas. I'm open to collaborations, and I'm based in the Bay Area, so please reach out if you want to grab a coffee and chat!
2
2
34
Quick update: We have extended the deadline for FoRLM to Monday, September 8, 2025 AoE. Good luck with your submissions!
Announcing the first workshop on Foundations of Language Model Reasoning (FoRLM) at NeurIPS 2025! Soliciting abstracts that advance foundational understanding of reasoning in language models, from theoretical analyses to rigorous empirical studies. Deadline: Sept 3, 2025
0
6
29
We have a new workshop at NeurIPS 25 on understanding reasoning (via theory and/or exps)! Submit your work by Sep 3 and join us at the conference :)
Announcing the first workshop on Foundations of Language Model Reasoning (FoRLM) at NeurIPS 2025! Soliciting abstracts that advance foundational understanding of reasoning in language models, from theoretical analyses to rigorous empirical studies. Deadline: Sept 3, 2025
0
2
21
We make a very intriguing observation that more pre-training data can worsen downstream performance. Come visit our poster at #ICML2025 on Thursday at East Exhibition Hall A-B #E-2508 to learn more about when and why we see such catastrophic overtraining.
Training with more data = better LLMs, right? False! Scaling language models by adding more pre-training data can decrease your performance after post-training! Introducing "catastrophic overtraining." + arXiv. 1/9
0
14
97
MeCo got into ICML! Our method improves data efficiency of LMs by 33% by simply prepending URLs! My amazing co-author @_awettig will present MeCo's poster at the following session: July 17 (Thursday), 11 am - 1:30 pm PDT, East Exhibition Hall A-B #E-2600. Come by and say hi!
Introducing MeCo (metadata conditioning then cooldown), a remarkably simple method that accelerates LM pre-training by simply prepending source URLs to training documents. https://t.co/46dtUUVb0P
0
6
56
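A rough sketch of the data-preparation idea behind MeCo as described above, prepending each document's source URL during pre-training and dropping it for a final cooldown phase; the exact template and schedule here are simplified placeholders.

    def meco_format(doc: dict, cooldown: bool = False) -> str:
        """Prepend source-URL metadata to a training document.
        In the main phase the model conditions on the URL; in the final
        cooldown phase the metadata is dropped so the model also handles
        plain text at inference time."""
        if cooldown:
            return doc["text"]
        return f"{doc['url']}\n\n{doc['text']}"

    example = {"url": "en.wikipedia.org/wiki/Language_model",
               "text": "A language model is ..."}
    print(meco_format(example))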
Excited to be giving this talk at COLT tomorrow :) reach out if you want to chat about deriving useful theoretical insights into modern-day language models!
Announcing the first workshop on Foundations of Post-Training (FoPT) at COLT 2025! Soliciting abstracts/posters exploring theoretical & practical aspects of post-training and RL with language models! Deadline: May 19, 2025
0
6
41
Adam is similar to many algorithms, but cannot be effectively replaced by any simpler variant in LMs. The community is starting to get the recipe right, but what is the secret sauce? @gowerrobert and I found that it has to do with the beta parameters and variational inference.
11
50
317
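For reference alongside the post above, the standard Adam update, showing where the beta parameters enter (this is the textbook form, not the paper's variational-inference derivation):

    import math

    def adam_step(p, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        """One Adam update on a scalar parameter p with gradient g at step t.
        beta1 and beta2 set the averaging horizons for the first and second
        moments of the gradient, i.e., the 'beta parameters' referenced above."""
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)       # bias correction
        v_hat = v / (1 - beta2 ** t)
        p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
        return p, m, v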
Data selection and curriculum learning can be formally viewed as a compression protocol via prequential coding. New blog (with @AllanZhou17) about this neat idea that motivated ADO but didn't make it into the paper. https://t.co/kkLyZN2CF7
yidingjiang.github.io
We describe a unified framework for data selection and curriculum learning via compression.
2
19
106
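A small sketch of the prequential-coding view mentioned above; the online learner's interface (predict_proba, update) is a placeholder. Each example is encoded under the current model and only then used to update it, so a data-selection or curriculum policy is judged by how much it shortens the total codelength.

    import math

    def prequential_codelength(stream, model):
        """Codelength (in nats) of a data stream under an online learner."""
        total = 0.0
        for x, y in stream:
            total += -math.log(model.predict_proba(x, y))  # encode y given x under current model
            model.update(x, y)                             # then learn from the example
        return total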