Michael Hu
@michahu8
857 Followers · 3K Following · 18 Media · 133 Statuses
NLP, language models | PhD @NYU | @NSF GRFP fellow | prev @microsoft, @princeton_nlp, @cocosci_lab.
New York, NY
Joined August 2019
📝 New paper! Two strategies have emerged for controlling LLM behavior at inference time: in-context learning (ICL; i.e. prompting) and activation steering. We propose that both can be understood as altering model beliefs, formally in the sense of Bayesian belief updating. 1/9
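A minimal sketch of the belief-updating framing, using a toy discrete latent-concept model rather than the paper's actual formalization: in-context examples act as evidence that updates a posterior over latent concepts, and a steering vector is viewed as directly tilting the same beliefs.

```python
import numpy as np

# Toy illustration (not the paper's formalization): the model holds beliefs
# over a few latent concepts. Prompting adds evidence (a Bayesian update);
# steering is viewed as adding a shift in belief-logit space.

prior = np.array([0.5, 0.3, 0.2])        # hypothetical prior p(theta) over 3 concepts

def icl_update(prior, log_lik):
    """In-context examples act as evidence: posterior ∝ prior · likelihood."""
    log_post = np.log(prior) + log_lik
    post = np.exp(log_post - log_post.max())
    return post / post.sum()

def steer_update(prior, logit_shift):
    """Activation steering viewed as directly tilting the belief logits."""
    return icl_update(prior, logit_shift)  # same arithmetic, different mechanism

evidence = np.array([0.0, 0.0, 2.0])      # both interventions favor concept 3
print(icl_update(prior, evidence))        # ≈ [0.22, 0.13, 0.65]
print(steer_update(prior, evidence))
```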
i'm getting really tired of "it's not x, it's y" and "do x, not y" in people's writing (more accurately, in their ghostwriter's writing). when i see it i honestly just move on and try to delete what i saw from my memory. it's not just sloppy, it's also tasteless
this work was done during my summer internship at microsoft with the wonderful @ben_vandurme @jacobandreas @harsh_jhamtani ❤️ fyi, the team is hiring summer interns for 2026 👀 https://t.co/JJ43MFzOFf
Summer '26 PhD research internships at Microsoft Copilot Tuning. Continual learning, complex reasoning and retrieval, nl2code, data efficient post-training. https://t.co/HM4cKqEhgW
check out the code for xminigrid-stateful, our language-only environment for testing online learning in LLM agents, here: https://t.co/HCw6MMfECL
github.com
Sample-Efficient Online Learning in LM Agents via Hindsight Trajectory Rewriting - michahu/echo
convert your LLM agent’s unsuccessful interactions into synthetic positive examples using hindsight trajectory rewriting! 📜 https://t.co/Gj5t6TAtEM 🧑💻 https://t.co/DEJjfIA8Ud
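A minimal sketch of the general idea behind hindsight rewriting, with hypothetical names and structure (see the paper and repo above for the actual method): a failed rollout is relabeled with a goal it did accomplish, turning it into a synthetic successful demonstration.

```python
from dataclasses import dataclass

# Hypothetical data structures; the actual method is in the paper/repo above.

@dataclass
class Trajectory:
    goal: str              # the goal the agent was asked to achieve
    actions: list[str]     # the actions it actually took
    outcome: str           # a description of what actually happened
    success: bool

def hindsight_rewrite(traj: Trajectory, summarize) -> Trajectory:
    """Relabel a failed trajectory with the goal it actually achieved."""
    if traj.success:
        return traj
    achieved_goal = summarize(traj.outcome)   # e.g. an LLM call describing what was done
    return Trajectory(goal=achieved_goal, actions=traj.actions,
                      outcome=traj.outcome, success=True)

failed = Trajectory(goal="open the locked door",
                    actions=["pick up key", "drop key", "open chest"],
                    outcome="the chest is open; the door is still locked",
                    success=False)
positive = hindsight_rewrite(failed, summarize=lambda outcome: "open the chest")
print(positive.goal, positive.success)   # "open the chest" True
```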
Every time I watch models train, I wish I could tune LR on the fly. It's like cooking: we adjust the dial when the food smells off. We built Interactive Training to do that, turning loss monitoring into interaction. Paper👉 https://t.co/IuLI9HUPI0 Led by @wtzhang0820 w/ Yang Lu
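A minimal sketch of the general pattern, not the actual Interactive Training API: a training loop that polls a small control file between steps, so the learning rate can be edited while the run is in progress.

```python
import json, os
import torch

LR_FILE = "lr_override.json"             # hypothetical control file

def maybe_update_lr(optimizer):
    """If the control file exists, copy its 'lr' value into the optimizer."""
    if not os.path.exists(LR_FILE):
        return
    with open(LR_FILE) as f:
        new_lr = json.load(f).get("lr")
    if new_lr is not None:
        for group in optimizer.param_groups:
            group["lr"] = float(new_lr)

model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

for step in range(1000):
    maybe_update_lr(opt)                  # "turn the dial" while the food cooks
    loss = model(torch.randn(8, 10)).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```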
I'll be at #ACL2025 next week! 🇦🇹
Things on my mind: curriculum learning, online adaptation, LM agents
Where to find me:
1⃣ Monday: my team's poster on PeopleJoin (from my Microsoft internship)
2⃣ Wednesday: discussing pre-pretraining in Panel 1
Excited to chat! DMs are open 😊
in sum, we need to:
1. embrace the fact that (log-)linear scaling laws don't always describe downstream tasks
2. better understand when scaling is and isn't predictable.
📜 https://t.co/SDBD5uIdbp
many thanks to my co-authors @NickLourie @kchonyc ❤️ 🧵5/5
arxiv.org
Downstream scaling laws aim to predict task performance at larger scales from the model's performance at smaller scales. Whether such prediction should be possible is unclear: some works discover...
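For a sense of what a "smooth linear" downstream trend means in this thread (made-up numbers, not the paper's data): regress downstream accuracy on log tokens and check how much of the variance a log-linear law actually explains.

```python
import numpy as np

# Hypothetical scaling points: pretraining tokens vs. downstream accuracy.
tokens = np.array([1e9, 3e9, 1e10, 3e10, 1e11])
accuracy = np.array([0.31, 0.33, 0.41, 0.40, 0.52])

x = np.log10(tokens)
slope, intercept = np.polyfit(x, accuracy, deg=1)    # log-linear fit
pred = slope * x + intercept
r2 = 1 - np.sum((accuracy - pred) ** 2) / np.sum((accuracy - accuracy.mean()) ** 2)

print(f"acc ≈ {slope:.3f}·log10(tokens) + {intercept:.3f}, R² = {r2:.2f}")
# A low R² or non-monotone residuals is the kind of "pathology" where
# extrapolating the fit to larger scales would be misleading.
```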
last, scaling trends aren't necessarily consistent between projects. some downstream tasks will look trendless or non-monotonic in one set of experiments but have a clear linear trend in another! 🧵4/5
furthermore, when we analyzed the downstream scaling data of https://t.co/I6Fdt9REkL, we found that only 39%(!!) of the tasks followed smooth linear scaling. the other 61% (a supermajority 🧑‍⚖️) exhibited various pathologies. 🧵3/5
several factors affect the estimation of scaling laws:
1. pretraining setup and corpus
2. validation corpus
3. downstream task
changing the validation data can entirely flip which pretraining setup appears better for a downstream task! 🧵2/5
📢 today's scaling laws often don't work for predicting downstream task performance. For some pretraining setups, smooth and predictable scaling is the exception, not the rule. a quick read about scaling law fails: 📜 https://t.co/SDBD5uIdbp 🧵1/5👇
RL can certainly teach LLMs new skills in principle, but in practice token-level exploration is so challenging that we end up relying on pretraining and synthetic data. the era of experience implies the era of exploration
A mental model I find useful: all data acquisition (web scrapes, synthetic data, RL rollouts, etc.) is really an exploration problem 🔍. This perspective has some interesting implications for where AI is heading. Wrote down some thoughts: https://t.co/VQLrYuJVAR
!!! I'm at #ICLR2025 to present 🧄Aioli🧄, a unified framework for data mixing, on Thursday afternoon! 🔗 https://t.co/2uIlHKh1hS Message me to chat about pre/post training data (mixing, curriculum, understanding); test-time compute/verification; or to try new food 🇸🇬
it is my great honour to be appointed as the Glen de Vries Professor of Health Statistics. i've written a quick blog post about it:
Kyunghyun Cho (@kchonyc), Professor of Computer Science and Data Science, has been named recipient of the Glen de Vries Chair for Health Statistics by the Courant Institute and New York University. Congratulations!
also, s/o to @isabelpapad for getting me interested in this area:
aclanthology.org
Isabel Papadimitriou, Dan Jurafsky. Findings of the Association for Computational Linguistics: EMNLP 2023. 2023.
Our work helps us understand the inductive biases that can improve model performance and suggests new ways to increase data efficiency in LM training. 📈 Link: https://t.co/I8ptzyIrF6 s/o to my amazing co-authors @jowenpetty Chuan Shi @lambdaviking @tallinzen ❤️ 🧵6/6
arxiv.org
Pretraining language models on formal language can improve their acquisition of natural language. Which features of the formal language impart an inductive bias that leads to effective transfer?...
Why might formal pre-pretraining work? We also found evidence for a "syntactic subnetwork" that forms during pre-pretraining and is reused during natural language learning—a mechanistic path for the transfer of linguistic biases. 🔧 🧵5/6
So a small amount 🤏 of pre-pretraining makes pretraining more token-efficient. For Pythia-1B trained over 1.6B tokens, 30M tokens of Shuffle Dyck pre-pretraining matches 500M extra tokens of natural language, a token efficiency gain of 33%. 🧵4/6
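A quick sketch of what Shuffle Dyck data looks like, with a hypothetical sampler rather than the paper's exact pipeline: each bracket type must be balanced on its own, but different types may interleave freely, unlike ordinary nested Dyck.

```python
import random

# Hypothetical Shuffle Dyck sampler (not the paper's exact data pipeline).
PAIRS = [("(", ")"), ("[", "]"), ("{", "}")]

def sample_shuffle_dyck(length: int, p_open: float = 0.5) -> str:
    assert length % 2 == 0, "need an even length to close every bracket"
    open_counts = [0] * len(PAIRS)
    out = []
    while len(out) < length:
        closable = [i for i, c in enumerate(open_counts) if c > 0]
        room_to_open = len(out) + sum(open_counts) <= length - 2
        if room_to_open and (not closable or random.random() < p_open):
            i = random.randrange(len(PAIRS))        # open a random bracket type
            out.append(PAIRS[i][0]); open_counts[i] += 1
        else:
            i = random.choice(closable)              # close some currently open type
            out.append(PAIRS[i][1]); open_counts[i] -= 1
    return "".join(out)

print(sample_shuffle_dyck(20))   # e.g. "([(]{)}[])…" — balanced per type, freely interleaved
```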