Sadhika Malladi Profile
Sadhika Malladi

@SadhikaMalladi

Followers: 2K · Following: 738 · Media: 7 · Statuses: 264

Postdoc researcher at MSR NYC; incoming faculty at UCSD CSE; CS PhD at Princeton

Joined June 2022
@SadhikaMalladi
Sadhika Malladi
2 months
Excited to share that I will be starting as an Assistant Professor in CSE at UCSD (@ucsd_cse) in Fall 2026! I am currently recruiting PhD students who want to bridge theory and practice in deep learning - see here:
39
71
543
@SadhikaMalladi
Sadhika Malladi
25 days
Leaving aside the content of this story, thanks for finding this picture! Brings back old memories of my time in early OpenAI :)
@johncoogan
John Coogan
25 days
OpenAI team and their families at a July 2019 offsite. Microsoft invested $1 billion that same month. Full story of the Microsoft / OpenAI deal live on @TBPN today.
0
1
76
@SadhikaMalladi
Sadhika Malladi
29 days
BTW if you are interested in other failures of xent, do also check out our prior paper on how pre-training for longer hurts your ability to do SFT (ICML 25):
arxiv.org
Large language models are pre-trained on ever-growing token budgets under the assumption that better pre-training performance translates to improved downstream models. In this work, we challenge...
0
0
4
@SadhikaMalladi
Sadhika Malladi
29 days
Key figure from our new paper: coverage is more predictive than KL of what model will succeed in best-of-N. Read more in Dylan's thread and at
arxiv.org
Language models demonstrate remarkable abilities when pre-trained on large text corpora and fine-tuned for specific tasks, but how and why pre-training shapes the success of the final model...
@canondetortugas
Dylan Foster 🐢
1 month
@auddery @GolowichNoah @SadhikaMalladi @jordan_t_ash (7/12) Example (see figure):
- Cross-entropy decreases throughout training.
- Coverage improves to a point, but begins to drop as the model learns a spurious shortcut.
- BoN performance follows the trend of coverage, not CE (increasing initially, dropping as the shortcut is learned).
0
6
22
@SadhikaMalladi
Sadhika Malladi
29 days
Coverage is necessary + sufficient for best-of-N (and thus RL) to work, and it can (and does!) behave differently from cross-entropy. So the model with the lowest xent is not always the best for RL! The coverage view motivates, e.g., test-time training.
arxiv.org
Language models demonstrate remarkable abilities when pre-trained on large text corpora and fine-tuned for specific tasks, but how and why pre-training shapes the success of the final model...
@canondetortugas
Dylan Foster 🐢
1 month
The coverage principle: How pre-training enables post-training New preprint where we look at the mechanisms through which next-token prediction produces models that succeed at downstream tasks. The answer involves a metric we call the "coverage profile", not cross-entropy.
3
13
94
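A toy sketch of the claim in the two posts above, using only made-up numbers: best-of-N success is driven by how much probability mass a model places on correct answers (coverage), which can move independently of average cross-entropy. The values p_correct and xent below are invented for illustration and are not from the paper.

```python
# Toy illustration (not the paper's code): why best-of-N tracks coverage
# rather than cross-entropy. Two hypothetical models answer one prompt
# that has a single correct answer.
def best_of_n_success(p_correct: float, n: int) -> float:
    """P(at least one of n i.i.d. samples is the correct answer)."""
    return 1.0 - (1.0 - p_correct) ** n

# Model A: modest mass on the correct answer, higher average loss.
# Model B: lower cross-entropy overall, but near-zero mass on the correct
# answer (e.g., it learned a spurious shortcut).
p_correct_A, xent_A = 0.05, 2.1    # hypothetical numbers
p_correct_B, xent_B = 0.001, 1.8

for n in (1, 16, 128):
    print(f"N={n:4d}  BoN(A)={best_of_n_success(p_correct_A, n):.3f}  "
          f"BoN(B)={best_of_n_success(p_correct_B, n):.3f}")

# B "wins" on cross-entropy (1.8 < 2.1) but loses badly at best-of-N:
# the mass placed on correct answers (coverage), not average log-loss,
# is what best-of-N (and RL built on it) can exploit.
```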
@canondetortugas
Dylan Foster 🐢
1 month
New paper we're excited to get online! Taming Imperfect Process Verifiers: A Sampling Perspective on Backtracking. A totally new framework based on ~backtracking~ for using process verifiers to guide inference, w/ connections to approximate counting/sampling in theoretical CS.
9
42
247
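A hypothetical sketch of verifier-guided generation with backtracking, offered only to make the idea concrete; it is not the algorithm from the paper, and propose_step, verify, and the acceptance threshold are invented placeholders.

```python
# Hypothetical sketch (not the paper's algorithm): guide step-by-step
# generation with an imperfect process verifier, and backtrack instead of
# committing to a partial solution the verifier scores poorly.
import random
from typing import Callable, List

def generate_with_backtracking(
    propose_step: Callable[[List[str]], str],  # samples a next step given the prefix
    verify: Callable[[List[str]], float],      # imperfect score in [0, 1] for a prefix
    max_steps: int = 20,
    threshold: float = 0.5,
    max_retries: int = 8,
) -> List[str]:
    prefix: List[str] = []
    while len(prefix) < max_steps:
        for _ in range(max_retries):
            candidate = prefix + [propose_step(prefix)]
            if verify(candidate) >= threshold:
                prefix = candidate            # accept the step
                break
        else:
            if prefix:                        # all retries rejected: backtrack
                prefix.pop()
            else:                             # nothing left to undo
                break
    return prefix

# Toy usage with stand-in components.
steps = ["step_a", "step_b", "step_c", "done"]
print(generate_with_backtracking(
    propose_step=lambda p: random.choice(steps),
    verify=lambda p: 0.0 if p[-1] == "step_c" else 1.0,  # pretend "step_c" is bad
    max_steps=5,
))
```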
@beenwrekt
Ben Recht
1 month
Almost a decade ago, I coauthored a paper asking us to rethink our theory of generalization in machine learning. Today, I'm fine putting the theory back on the shelf.
argmin.net
You don't need a theorem to argue more data is better than less data
7
24
192
@SadhikaMalladi
Sadhika Malladi
2 months
Glad to see this super cool work out! Anticipating the natural Q of how this relates to the SDEs I did -- both eqns have deterministic flow + oscillation. SDE is for the widely used setting of mini-batch training and this is for full-batch. But SDE alone gives no prediction on if
@deepcohen
Jeremy Cohen
2 months
Even with full-batch gradients, DL optimizers defy classical optimization theory, as they operate at the *edge of stability.* With @alex_damian_, we introduce "central flows": a theoretical tool to analyze these dynamics that makes accurate quantitative predictions on real NNs.
0
2
18
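For readers wondering what "deterministic flow + oscillation" refers to in the post above: the SDE approximation of mini-batch SGD from that earlier line of work is usually written, up to scaling conventions, as

\[ \mathrm{d}X_t = -\nabla L(X_t)\,\mathrm{d}t + \bigl(\eta\,\Sigma(X_t)\bigr)^{1/2}\,\mathrm{d}W_t, \]

where $\eta$ is the learning rate, $\Sigma(X_t)$ the mini-batch gradient covariance, and $W_t$ a Wiener process; the drift term is the deterministic flow and the diffusion term is the oscillation from mini-batch noise. This is a schematic statement of the standard form, not an equation reproduced from either paper; central flows play the analogous deterministic-flow role in the full-batch, edge-of-stability regime, where the oscillation arises from instability rather than sampling noise.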
@SadhikaMalladi
Sadhika Malladi
2 months
Nice to see Thinking Machines engaging with some of the more theoretical academic literature, including our work from 2022 on LoRA ( https://t.co/aSeXAN2Vff ) to support their very cool empirical findings!
arxiv.org
It has become standard to solve NLP tasks by fine-tuning pre-trained language models (LMs), especially in low-data settings. There is minimal theoretical understanding of empirical success, e.g.,...
@thinkymachines
Thinking Machines
2 months
LoRA makes fine-tuning more accessible, but it's unclear how it compares to full fine-tuning. We find that the performance often matches closely---more often than you might expect. In our latest Connectionism post, we share our experimental results and recommendations for LoRA.
0
2
9
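A minimal sketch of what a LoRA layer looks like, assuming PyTorch; the rank, alpha/r scaling, and zero-initialization of B follow the common recipe, but the specific values are arbitrary and this is not code from the 2022 paper or the Connectionism post.

```python
# Minimal LoRA layer sketch (assumes PyTorch). The frozen pretrained weight
# is augmented with a trainable low-rank update (alpha/r) * B @ A; B starts
# at zero so training begins exactly at the pretrained model.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)           # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Usage: swap selected Linear layers of a pretrained model for LoRALinear
# and train only A and B.
layer = LoRALinear(768, 768)
print(layer(torch.randn(2, 768)).shape)  # torch.Size([2, 768])
```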
@kothasuhas
Suhas Kotha
2 months
Since compute grows faster than the web, we think the future of pre-training lies in the algorithms that will best leverage ♾ compute.
We find simple recipes that improve the asymptote of compute scaling laws to be 5x data efficient, offering better perf w/ sufficient compute.
9
83
444
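One schematic way to read "improving the asymptote" in the post above; this functional form is a generic illustration, not the paper's actual fit:

\[ L(C; D) \;\approx\; L_\infty(D) + A\,C^{-\alpha}. \]

With a fixed dataset $D$, more compute $C$ only moves the loss toward the plateau $L_\infty(D)$, so once compute is plentiful the win comes from algorithms that lower $L_\infty(D)$, or equivalently reach the same plateau with several times less data.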
@SadhikaMalladi
Sadhika Malladi
2 months
I've really enjoyed the new settings introduced in both papers ( https://t.co/zKGmKErhGQ and https://t.co/c2LsTIjQop) from these folks and think there's great inspiration for practical algorithms here!
arxiv.org
Aligning AI systems with human values remains a fundamental challenge, but does our inability to create perfectly aligned models preclude obtaining the benefits of alignment? We study a strategic...
@Aaroth
Aaron Roth
2 months
Aligning an AI with human preferences might be hard. But there is more than one AI out there, and users can choose which to use. Can we get the benefits of a fully aligned AI without solving the alignment problem? In a new paper we study a setting in which the answer is yes.
2
8
52
@SadhikaMalladi
Sadhika Malladi
2 months
In the meantime, I'm at Microsoft Research NYC, learning about RL theory and exploring new research ideas. I'm open to collaborations, and I'm based in the Bay Area, so please reach out if you want to grab a coffee and chat!
2
2
34
@canondetortugas
Dylan Foster 🐢
3 months
Quick update: We have extended the deadline for FoRLM to Monday, September 8, 2025 AoE. Good luck with your submissions!
@canondetortugas
Dylan Foster 🐢
3 months
Announcing the first workshop on Foundations of Language Model Reasoning (FoRLM) at NeurIPS 2025! 📝 Soliciting abstracts that advance foundational understanding of reasoning in language models, from theoretical analyses to rigorous empirical studies. 📆 Deadline: Sept 3, 2025
0
6
29
@SadhikaMalladi
Sadhika Malladi
3 months
We have a new workshop at NeurIPS 25 on understanding reasoning (via theory and/or exps)! Submit your work by Sep 3 and join us at the conference :)
@canondetortugas
Dylan Foster 🐢
3 months
Announcing the first workshop on Foundations of Language Model Reasoning (FoRLM) at NeurIPS 2025! 📝 Soliciting abstracts that advance foundational understanding of reasoning in language models, from theoretical analyses to rigorous empirical studies. 📆 Deadline: Sept 3, 2025
0
2
21
@AdtRaghunathan
Aditi Raghunathan
4 months
We make a very intriguing observation that more pre-training data can worsen downstream performance. Come visit our poster at #ICML2025 on Thursday at East Exhibition Hall A-B #E-2508 to learn more about when and why we see such catastrophic overtraining.
@jacspringer
Jacob Springer
8 months
Training with more data = better LLMs, right? 🚨 False! Scaling language models by adding more pre-training data can decrease your performance after post-training! Introducing "catastrophic overtraining." 🧵 + arXiv 👇 1/9
0
14
97
@gaotianyu1350
Tianyu Gao
4 months
MeCo got into ICML! Our method improves data efficiency of LMs by 33% by simply prepending URLs! My amazing co-author @_awettig will present MeCo's poster at the following session: July 17 (Thursday) 11 am - 1:30pm PDT East Exhibition Hall A-B #E-2600 Come by and say hi!
@gaotianyu1350
Tianyu Gao
11 months
Introducing MeCo (metadata conditioning then cooldown), a remarkably simple method that accelerates LM pre-training by simply prepending source URLs to training documents. https://t.co/46dtUUVb0P
0
6
56
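A hedged sketch of the metadata-conditioning-then-cooldown idea described above: prepend each document's source URL for most of pre-training, then drop the metadata for a final cooldown phase so the model does not need URLs at inference time. The template, cooldown fraction, and helper names below are illustrative assumptions, not the paper's exact recipe.

```python
# Illustrative data-prep sketch for metadata conditioning + cooldown
# (assumed details, not MeCo's exact recipe).
def format_example(url: str, text: str, with_metadata: bool) -> str:
    # Prepend the source URL as plain text; the real template may differ.
    return f"{url}\n\n{text}" if with_metadata else text

def build_corpus(docs, cooldown_fraction: float = 0.1):
    """docs: iterable of (url, text) pairs in training order. Metadata is
    prepended except during the final cooldown_fraction of examples."""
    docs = list(docs)
    cutoff = int(len(docs) * (1.0 - cooldown_fraction))
    for i, (url, text) in enumerate(docs):
        yield format_example(url, text, with_metadata=(i < cutoff))

# Toy usage
for ex in build_corpus(
    [("https://example.com/a", "First document..."),
     ("https://example.com/b", "Second document...")],
    cooldown_fraction=0.5,
):
    print(ex.replace("\n", " | "))
```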
@SadhikaMalladi
Sadhika Malladi
5 months
Excited to be giving this talk at COLT tomorrow :) reach out if you want to chat about deriving useful theoretical insights into modern-day language models!
@Nived_Rajaraman
Nived Rajaraman
7 months
Announcing the first workshop on Foundations of Post-Training (FoPT) at COLT 2025! 📝 Soliciting abstracts/posters exploring theoretical & practical aspects of post-training and RL with language models! 🗓️ Deadline: May 19, 2025
0
6
41
@orvieto_antonio
Antonio Orvieto
6 months
Adam is similar to many algorithms, but cannot be effectively replaced by any simpler variant in LMs. The community is starting to get the recipe right, but what is the secret sauce? @gowerrobert and I found that it has to do with the beta parameters and variational inference.
11
50
317
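For reference, the textbook Adam update, included only to show where the beta parameters mentioned above enter: beta1 sets the first-moment (momentum) EMA and beta2 the second-moment EMA that rescales the step. This is standard Adam, not the authors' analysis.

```python
# Standard Adam update (reference implementation with NumPy).
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad          # first-moment EMA (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment EMA
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage on the quadratic loss 0.5 * ||theta||^2 (gradient = theta).
theta = np.array([1.0, -2.0])
m = v = np.zeros_like(theta)
for t in range(1, 101):
    theta, m, v = adam_step(theta, theta, m, v, t)
print(theta)  # should have moved toward the minimum at 0
```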
@yidingjiang
Yiding Jiang
6 months
Data selection and curriculum learning can be formally viewed as a compression protocol via prequential coding. New blog (with @AllanZhou17 ) about this neat idea that motivated ADO but didn't make it into the paper. https://t.co/kkLyZN2CF7
yidingjiang.github.io
We describe a unified framework for data selection and curriculum learning via compression.
2
19
106
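A hedged sketch of prequential coding, the compression view described above: the code length of a data stream is the running sum of -log p(next chunk) under a model trained only on earlier chunks, so data orderings that make upcoming chunks predictable sooner compress better. The hooks train and neg_log_prob are placeholders, not any particular library's API.

```python
# Hedged sketch of prequential (online) code length for a data stream.
import math
from typing import Callable, List, Sequence

def prequential_code_length(
    chunks: Sequence,                                   # data stream split into chunks
    train: Callable[[List], object],                    # fit a model on the seen prefix
    neg_log_prob: Callable[[object, object], float],    # -log p(chunk | model), in nats
) -> float:
    total = 0.0
    for i, chunk in enumerate(chunks):
        model = train(list(chunks[:i]))                 # model sees only earlier chunks
        total += neg_log_prob(model, chunk)             # pay to encode the next chunk
    return total

# Toy usage: the "model" is a Laplace-smoothed estimate of P(bit == 1).
def fit(prefix):
    flat = [b for c in prefix for b in c]
    return (sum(flat) + 1) / (len(flat) + 2)

def nll(p, chunk):
    return -sum(math.log(p if b else 1 - p) for b in chunk)

stream = [[1, 1, 0], [1, 0, 1], [1, 1, 1]]
print(prequential_code_length(stream, fit, nll))  # total code length in nats
```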