Michael Hu

@michahu8

Followers: 857 · Following: 3K · Media: 18 · Statuses: 133

NLP, language models | PhD @NYU | @NSF GRFP fellow | prev @microsoft, @princeton_nlp, @cocosci_lab.

New York, NY
Joined August 2019
@EricBigelow
Eric Bigelow
3 days
📝 New paper! Two strategies have emerged for controlling LLM behavior at inference time: in-context learning (ICL; i.e. prompting) and activation steering. We propose that both can be understood as altering model beliefs, formally in the sense of Bayesian belief updating. 1/9
8
21
125
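For context, a minimal way to write the belief-update framing in the tweet above, using generic notation that is assumed here rather than taken from the paper: the model holds a belief over a latent variable z (e.g., the task or persona), and both a prompt and an activation edit act as evidence e that updates it.

% Bayesian belief update over a latent variable z, given evidence e
% (in-context tokens or an activation edit); notation assumed for illustration.
\[
  p(z \mid e) \;=\; \frac{p(e \mid z)\,p(z)}{\sum_{z'} p(e \mid z')\,p(z')}
\]
% Predictions for an input x then marginalize over the updated belief:
\[
  p(y \mid x, e) \;=\; \sum_{z} p(y \mid x, z)\,p(z \mid e)
\]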
@michahu8
Michael Hu
25 days
i'm getting really tired of "it's not x, it's y" and "do x, not y" in people's writing (more accurately, in their ghostwriter's writing). when i see it i honestly just move on and try to delete what i saw from my memory. it's not just sloppy, it's also tasteless
1
0
8
@michahu8
Michael Hu
1 month
this work was done during my summer internship at microsoft with the wonderful @ben_vandurme @jacobandreas @harsh_jhamtani ❤️ fyi, the team is hiring summer interns for 2026 👀 https://t.co/JJ43MFzOFf
@ben_vandurme
Benjamin Van Durme
1 month
Summer '26 PhD research internships at Microsoft Copilot Tuning. Continual learning, complex reasoning and retrieval, nl2code, data efficient post-training. https://t.co/HM4cKqEhgW
0
0
3
@michahu8
Michael Hu
1 month
check out xminigrid-stateful, our language-only environment for testing online learning in LLM agents. the code is here: https://t.co/HCw6MMfECL
github.com
Sample-Efficient Online Learning in LM Agents via Hindsight Trajectory Rewriting - michahu/echo
1
1
4
@michahu8
Michael Hu
1 month
convert your LLM agent’s unsuccessful interactions into synthetic positive examples using hindsight trajectory rewriting! 📜 https://t.co/Gj5t6TAtEM 🧑‍💻 https://t.co/DEJjfIA8Ud
5
26
175
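A hypothetical sketch of the general idea behind hindsight trajectory rewriting follows (the Trajectory format and the describe_outcome/hindsight_rewrite helpers are assumptions for illustration; see the paper and repo above for the actual method): a failed rollout is relabeled with a goal it did achieve, producing a synthetic positive example.

# Hypothetical sketch of hindsight trajectory rewriting (names and data format
# are assumptions for illustration; the linked paper/repo define the real method).
from dataclasses import dataclass

@dataclass
class Trajectory:
    instruction: str   # the goal the agent was asked to achieve
    steps: list[str]   # actions/observations the agent produced
    success: bool      # whether the original goal was met

def describe_outcome(traj: Trajectory) -> str:
    """Placeholder: summarize what the trajectory actually accomplished.
    In practice this could itself be an LLM call that reads the steps."""
    return f"Reach the state produced by: {traj.steps[-1]}"

def hindsight_rewrite(traj: Trajectory) -> Trajectory | None:
    """Turn a failed trajectory into a synthetic positive example by relabeling
    the instruction with a goal the trajectory did achieve."""
    if traj.success:
        return None  # already a positive example; nothing to rewrite
    achieved_goal = describe_outcome(traj)
    return Trajectory(instruction=achieved_goal, steps=traj.steps, success=True)

# Usage: rewritten examples can be added to the fine-tuning or in-context pool.
failed = Trajectory("open the locked door", ["pick up key", "drop key"], success=False)
synthetic = hindsight_rewrite(failed)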
@yuntiandeng
Yuntian Deng
1 month
Every time I watch models train, I wish I could tune LR on the fly. It's like cooking: we adjust the dial when the food smells off. We built Interactive Training to do that, turning loss monitoring into interaction.
Paper 👉 https://t.co/IuLI9HUPI0
Led by @wtzhang0820 w/ Yang Lu
5
34
198
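For illustration, a minimal sketch of the underlying idea of tuning LR mid-run (this is not the Interactive Training API from the linked paper; the control-file mechanism and names are assumptions): the training loop polls a small JSON file that a human can edit while the run is in progress.

# Minimal illustration of on-the-fly LR tuning (NOT the Interactive Training
# API; the control file and loop structure are assumptions for illustration).
import json, os
import torch

model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
CONTROL_FILE = "lr_control.json"   # e.g. {"lr": 0.001}, edited by a human mid-run

for step in range(1000):
    # poll for a human-issued LR override before each step
    if os.path.exists(CONTROL_FILE):
        with open(CONTROL_FILE) as f:
            new_lr = json.load(f).get("lr")
        if new_lr is not None:
            for group in opt.param_groups:
                group["lr"] = new_lr

    x, y = torch.randn(32, 10), torch.randn(32, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()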
@michahu8
Michael Hu
4 months
I'll be at #ACL2025 next week! 🇦🇹
Things on my mind: curriculum learning, online adaptation, LM agents
Where to find me:
1⃣ Monday: my team's poster on PeopleJoin (interning at Microsoft)
2⃣ Wednesday: discussing pre-pretraining in Panel 1
Excited to chat! DMs are open 😊
0
6
44
@michahu8
Michael Hu
5 months
in sum, we need to:
1. embrace the fact that (log-)linear scaling laws don't always describe downstream tasks
2. better understand when scaling is and isn't predictable.
📜 https://t.co/SDBD5uIdbp
many thanks to my co-authors @NickLourie @kchonyc ❤️ 🧵5/5
arxiv.org
Downstream scaling laws aim to predict task performance at larger scales from the model's performance at smaller scales. Whether such prediction should be possible is unclear: some works discover...
2
3
12
@michahu8
Michael Hu
5 months
last, scaling trends aren't necessarily consistent between projects. some downstream tasks will look trendless or non-monotonic in one set of experiments but have a clear linear trend in another! 🧵4/5
1
1
8
@michahu8
Michael Hu
5 months
furthermore, when we analyzed the downstream scaling data of https://t.co/I6Fdt9REkL, we found that only 39%(!!) of the tasks followed smooth linear scaling. the other 61% (a supermajority🧑‍⚖️) followed different pathologies. 🧵3/5
1
0
7
@michahu8
Michael Hu
5 months
several factors affect the estimation of scaling laws:
1. pretraining setup and corpus
2. validation corpus
3. downstream task
changing the validation data can entirely flip which pretraining setup appears better for a downstream task! 🧵2/5
2
0
10
@michahu8
Michael Hu
5 months
📢 today's scaling laws often don't work for predicting downstream task performance. For some pretraining setups, smooth and predictable scaling is the exception, not the rule. a quick read about scaling law fails: 📜 https://t.co/SDBD5uIdbp 🧵1/5👇
4
39
281
@michahu8
Michael Hu
5 months
RL can certainly teach LLMs new skills in principle, but in practice token-level exploration is so challenging that we end up relying on pretraining and synthetic data. the era of experience implies the era of exploration
@yidingjiang
Yiding Jiang
5 months
A mental model I find useful: all data acquisition (web scrapes, synthetic data, RL rollouts, etc.) is really an exploration problem 🔍. This perspective has some interesting implications for where AI is heading. Wrote down some thoughts: https://t.co/VQLrYuJVAR
0
1
12
@michahu8
Michael Hu
6 months
hot multi-agent researcher summer
@michahu8
Michael Hu
1 year
hot interpretability researcher summer
0
0
9
@MayeeChen
Mayee Chen
7 months
!!! I'm at #ICLR2025 to present 🧄Aioli🧄, a unified framework for data mixing, on Thursday afternoon! 🔗 https://t.co/2uIlHKh1hS
Message me to chat about pre/post training data (mixing, curriculum, understanding); test-time compute/verification; or to try new food 🇸🇬
2
51
153
@kchonyc
Kyunghyun Cho
9 months
it is my great honour to be appointed as the Glen de Vries Professor of Health Statistics. i have quickly written about this in my blog post:
@NYU_Courant
NYU Courant
9 months
Kyunghyun Cho (@kchonyc), Professor of Computer Science and Data Science, has been named recipient of the Glen de Vries Chair for Health Statistics by the Courant Institute and New York University. Congratulations!
32
17
339
@michahu8
Michael Hu
9 months
Our work helps us understand the inductive biases that can improve model performance and suggests new ways to increase data efficiency in LM training. 📈 Link: https://t.co/I8ptzyIrF6 s/o to my amazing co-authors @jowenpetty Chuan Shi @lambdaviking @tallinzen ❤️ 🧵6/6
arxiv.org
Pretraining language models on formal language can improve their acquisition of natural language. Which features of the formal language impart an inductive bias that leads to effective transfer?...
3
3
45
@michahu8
Michael Hu
9 months
Why might formal pre-pretraining work? We also found evidence for a "syntactic subnetwork" that forms during pre-pretraining and is reused during natural language learning—a mechanistic path for the transfer of linguistic biases. 🔧 🧵5/6
1
1
41
@michahu8
Michael Hu
9 months
So a small amount 🤏 of pre-pretraining makes pretraining more token-efficient. For Pythia-1B trained over 1.6B tokens, 30M tokens of Shuffle Dyck pre-pretraining matches 500M extra tokens of natural language, a token efficiency gain of 33%. 🧵4/6
1
3
46
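For readers unfamiliar with the formal language named above, a small sketch of a Shuffle Dyck sampler follows (the sampling scheme is an assumption for illustration; the paper's data generator may differ). In Shuffle Dyck, each bracket type must be balanced on its own, but different types are free to interleave, so "( [ ) ]" is allowed even though it is not valid Dyck.

# Sketch of a Shuffle Dyck sampler (sampling scheme is an assumption for
# illustration; the paper's generator may differ). Each bracket type is
# balanced independently, but types may interleave freely.
import random

BRACKETS = [("(", ")"), ("[", "]"), ("{", "}")]

def sample_shuffle_dyck(length: int, p_open: float = 0.5) -> str:
    open_counts = [0] * len(BRACKETS)   # currently unmatched opens, per type
    out = []
    for _ in range(length):
        closable = [i for i, c in enumerate(open_counts) if c > 0]
        if not closable or random.random() < p_open:
            i = random.randrange(len(BRACKETS))     # open a random bracket type
            out.append(BRACKETS[i][0])
            open_counts[i] += 1
        else:
            i = random.choice(closable)             # close any currently open type
            out.append(BRACKETS[i][1])
            open_counts[i] -= 1
    # close anything left open so every type ends balanced
    # (the string may therefore be slightly longer than `length`)
    for i, c in enumerate(open_counts):
        out.extend(BRACKETS[i][1] * c)
    return "".join(out)

print(sample_shuffle_dyck(20))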