Michael Hu
@michahu8
857 Followers · 3K Following · 18 Media · 133 Statuses
NLP, language models | PhD @NYU | @NSF GRFP fellow | prev @microsoft, @princeton_nlp, @cocosci_lab.
New York, NY
Joined August 2019
📝 New paper! Two strategies have emerged for controlling LLM behavior at inference time: in-context learning (ICL; i.e. prompting) and activation steering. We propose that both can be understood as altering model beliefs, formally in the sense of Bayesian belief updating. 1/9
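A minimal sketch of the belief-updating framing, using a toy discrete latent-concept model rather than the paper's actual formalization: in-context examples act as evidence that updates a posterior over latent concepts, and a steering vector is viewed as directly tilting the same beliefs.

```python
import numpy as np

# Toy illustration (not the paper's formalization): the model holds beliefs
# over a few latent concepts. Prompting adds evidence (a Bayesian update);
# steering is viewed as adding a shift in belief-logit space.

prior = np.array([0.5, 0.3, 0.2])        # hypothetical prior p(theta) over 3 concepts

def icl_update(prior, log_lik):
    """In-context examples act as evidence: posterior ∝ prior · likelihood."""
    log_post = np.log(prior) + log_lik
    post = np.exp(log_post - log_post.max())
    return post / post.sum()

def steer_update(prior, logit_shift):
    """Activation steering viewed as directly tilting the belief logits."""
    return icl_update(prior, logit_shift)  # same arithmetic, different mechanism

evidence = np.array([0.0, 0.0, 2.0])      # both interventions favor concept 3
print(icl_update(prior, evidence))        # ≈ [0.22, 0.13, 0.65]
print(steer_update(prior, evidence))
```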
i'm getting really tired of "it's not x, it's y" and "do x, not y" in people's writing (more accurately, in their ghostwriter's writing). when i see it i honestly just move on and try to delete what i saw from my memory. it's not just sloppy, it's also tasteless
this work was done during my summer internship at microsoft with the wonderful @ben_vandurme @jacobandreas @harsh_jhamtani ❤️ fyi, the team is hiring summer interns for 2026 👀 https://t.co/JJ43MFzOFf
Summer '26 PhD research internships at Microsoft Copilot Tuning. Continual learning, complex reasoning and retrieval, nl2code, data efficient post-training. https://t.co/HM4cKqEhgW
check out the code for xminigrid-stateful, our language-only environment for testing online learning in LLM agents, here: https://t.co/HCw6MMfECL
github.com
Sample-Efficient Online Learning in LM Agents via Hindsight Trajectory Rewriting - michahu/echo
convert your LLM agent’s unsuccessful interactions into synthetic positive examples using hindsight trajectory rewriting! 📜 https://t.co/Gj5t6TAtEM 🧑💻 https://t.co/DEJjfIA8Ud
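A minimal sketch of the general idea behind hindsight rewriting, with hypothetical names and structure (see the paper and repo above for the actual method): a failed rollout is relabeled with a goal it did accomplish, turning it into a synthetic successful demonstration.

```python
from dataclasses import dataclass

# Hypothetical data structures; the actual method is in the paper/repo above.

@dataclass
class Trajectory:
    goal: str              # the goal the agent was asked to achieve
    actions: list[str]     # the actions it actually took
    outcome: str           # a description of what actually happened
    success: bool

def hindsight_rewrite(traj: Trajectory, summarize) -> Trajectory:
    """Relabel a failed trajectory with the goal it actually achieved."""
    if traj.success:
        return traj
    achieved_goal = summarize(traj.outcome)   # e.g. an LLM call describing what was done
    return Trajectory(goal=achieved_goal, actions=traj.actions,
                      outcome=traj.outcome, success=True)

failed = Trajectory(goal="open the locked door",
                    actions=["pick up key", "drop key", "open chest"],
                    outcome="the chest is open; the door is still locked",
                    success=False)
positive = hindsight_rewrite(failed, summarize=lambda outcome: "open the chest")
print(positive.goal, positive.success)   # "open the chest" True
```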
Every time I watch models train, I wish I could tune LR on the fly. It's like cooking: we adjust the dial when the food smells off. We built Interactive Training to do that, turning loss monitoring into interaction. Paper👉 https://t.co/IuLI9HUPI0 Led by @wtzhang0820 w/ Yang Lu
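A minimal sketch of the general pattern, not the actual Interactive Training API: a training loop that polls a small control file between steps, so the learning rate can be edited while the run is in progress.

```python
import json, os
import torch

LR_FILE = "lr_override.json"             # hypothetical control file

def maybe_update_lr(optimizer):
    """If the control file exists, copy its 'lr' value into the optimizer."""
    if not os.path.exists(LR_FILE):
        return
    with open(LR_FILE) as f:
        new_lr = json.load(f).get("lr")
    if new_lr is not None:
        for group in optimizer.param_groups:
            group["lr"] = float(new_lr)

model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

for step in range(1000):
    maybe_update_lr(opt)                  # "turn the dial" while the food cooks
    loss = model(torch.randn(8, 10)).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```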
I'll be at #ACL2025 next week! 🇦🇹
Things on my mind: curriculum learning, online adaptation, LM agents
Where to find me:
1⃣ Monday: my team's poster on PeopleJoin (from my Microsoft internship)
2⃣ Wednesday: discussing pre-pretraining in Panel 1
Excited to chat! DMs are open 😊
in sum, we need to:
1. embrace the fact that (log-)linear scaling laws don't always describe downstream tasks
2. better understand when scaling is and isn't predictable.
📜 https://t.co/SDBD5uIdbp
many thanks to my co-authors @NickLourie @kchonyc ❤️ 🧵5/5
arxiv.org
Downstream scaling laws aim to predict task performance at larger scales from the model's performance at smaller scales. Whether such prediction should be possible is unclear: some works discover...
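For a sense of what a "smooth linear" downstream trend means in this thread (made-up numbers, not the paper's data): regress downstream accuracy on log tokens and check how much of the variance a log-linear law actually explains.

```python
import numpy as np

# Hypothetical scaling points: pretraining tokens vs. downstream accuracy.
tokens = np.array([1e9, 3e9, 1e10, 3e10, 1e11])
accuracy = np.array([0.31, 0.33, 0.41, 0.40, 0.52])

x = np.log10(tokens)
slope, intercept = np.polyfit(x, accuracy, deg=1)    # log-linear fit
pred = slope * x + intercept
r2 = 1 - np.sum((accuracy - pred) ** 2) / np.sum((accuracy - accuracy.mean()) ** 2)

print(f"acc ≈ {slope:.3f}·log10(tokens) + {intercept:.3f}, R² = {r2:.2f}")
# A low R² or non-monotone residuals is the kind of "pathology" where
# extrapolating the fit to larger scales would be misleading.
```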
last, scaling trends aren't necessarily consistent between projects. some downstream tasks will look trendless or non-monotonic in one set of experiments but have a clear linear trend in another! 🧵4/5
furthermore, when we analyzed the downstream scaling data of https://t.co/I6Fdt9REkL, we found that only 39%(!!) of the tasks followed smooth linear scaling. the other 61% (a supermajority 🧑‍⚖️) exhibited various pathologies. 🧵3/5
several factors affect the estimation of scaling laws:
1. pretraining setup and corpus
2. validation corpus
3. downstream task
changing the validation data can entirely flip which pretraining setup appears better for a downstream task! 🧵2/5
📢 today's scaling laws often don't work for predicting downstream task performance. For some pretraining setups, smooth and predictable scaling is the exception, not the rule. a quick read about scaling law fails: 📜 https://t.co/SDBD5uIdbp 🧵1/5👇
RL can certainly teach LLMs new skills in principle, but in practice token-level exploration is so challenging that we end up relying on pretraining and synthetic data. the era of experience implies the era of exploration
A mental model I find useful: all data acquisition (web scrapes, synthetic data, RL rollouts, etc.) is really an exploration problem 🔍. This perspective has some interesting implications for where AI is heading. Wrote down some thoughts: https://t.co/VQLrYuJVAR
!!! I'm at #ICLR2025 to present 🧄Aioli🧄, a unified framework for data mixing, on Thursday afternoon! 🔗 https://t.co/2uIlHKh1hS Message me to chat about pre/post training data (mixing, curriculum, understanding); test-time compute/verification; or to try new food 🇸🇬
it is my great honour to be appointed as the Glen de Vries Professor of Health Statistics. i've written a quick blog post about it:
Kyunghyun Cho (@kchonyc), Professor of Computer Science and Data Science, has been named recipient of the Glen de Vries Chair for Health Statistics by the Courant Institute and New York University. Congratulations!
also, s/o to @isabelpapad for getting me interested in this area:
aclanthology.org
Isabel Papadimitriou, Dan Jurafsky. Findings of the Association for Computational Linguistics: EMNLP 2023. 2023.
Our work helps us understand the inductive biases that can improve model performance and suggests new ways to increase data efficiency in LM training. 📈 Link: https://t.co/I8ptzyIrF6 s/o to my amazing co-authors @jowenpetty Chuan Shi @lambdaviking @tallinzen ❤️ 🧵6/6
arxiv.org
Pretraining language models on formal language can improve their acquisition of natural language. Which features of the formal language impart an inductive bias that leads to effective transfer?...
Why might formal pre-pretraining work? We also found evidence for a "syntactic subnetwork" that forms during pre-pretraining and is reused during natural language learning—a mechanistic path for the transfer of linguistic biases. 🔧 🧵5/6
So a small amount 🤏 of pre-pretraining makes pretraining more token-efficient. For Pythia-1B trained over 1.6B tokens, 30M tokens of Shuffle Dyck pre-pretraining matches 500M extra tokens of natural language, a token efficiency gain of 33%. 🧵4/6
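A quick sketch of what Shuffle Dyck data looks like, with a hypothetical sampler rather than the paper's exact pipeline: each bracket type must be balanced on its own, but different types may interleave freely, unlike ordinary nested Dyck.

```python
import random

# Hypothetical Shuffle Dyck sampler (not the paper's exact data pipeline).
PAIRS = [("(", ")"), ("[", "]"), ("{", "}")]

def sample_shuffle_dyck(length: int, p_open: float = 0.5) -> str:
    assert length % 2 == 0, "need an even length to close every bracket"
    open_counts = [0] * len(PAIRS)
    out = []
    while len(out) < length:
        closable = [i for i, c in enumerate(open_counts) if c > 0]
        room_to_open = len(out) + sum(open_counts) <= length - 2
        if room_to_open and (not closable or random.random() < p_open):
            i = random.randrange(len(PAIRS))        # open a random bracket type
            out.append(PAIRS[i][0]); open_counts[i] += 1
        else:
            i = random.choice(closable)              # close some currently open type
            out.append(PAIRS[i][1]); open_counts[i] -= 1
    return "".join(out)

print(sample_shuffle_dyck(20))   # e.g. "([(]{)}[])…" — balanced per type, freely interleaved
```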