Margaret Li
@margs_li
Followers
1K
Following
315
Media
21
Statuses
71
👩‍💻 PhD student @UWCSE / @UWNLP & @MetaAI. Formerly RE @FacebookAI Research, @Penn CS | certified bi-coastal bb IAH/PEK/PHL/NYC/SFO/SEA
Joined June 2019
We nearly drove ourselves insane trying to reproduce scaling laws papers. So of course we wrote a paper about it 😵‍💫 1/9
1
30
161
To appear at #ICLR2025! Thanks to the coauthors @snehaark and @LukeZettlemoyer, who descended into this madness with me. Arxiv: https://t.co/nuwuPV2IoU Code: https://t.co/2oy1RYpvGU Checkpoints: https://t.co/mhWqlJMF0k 9/9
huggingface.co
0
1
12
(4) How are we optimizing the fit? (Loss? Optimizer? Initialization?) Do we need to perform a grid search for our initializations? How many points should we try? What happens if we initialize from a hypothesized law (e.g. Chinchilla)? 8/9
1
0
1
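As a rough illustration of the choices in tweet 8/9 above: a commonly cited recipe (from the Chinchilla paper) is to minimize a Huber loss on log-loss with L-BFGS, starting from a grid of initializations. The toy sketch below compares that grid search against a single initialization at the published Chinchilla constants; the synthetic data, the grid values, and the overall setup are illustrative stand-ins, not the paper's code.

```python
# Toy sketch of the fitting choices in tweet 8/9: Huber loss on log-loss, L-BFGS,
# and a grid of initializations vs. a single init at the published Chinchilla constants.
# Synthetic data and an invented grid; not the paper's code.
import itertools
import numpy as np
from scipy.optimize import minimize

def predict_log_loss(params, N, D):
    # E, A, B are kept in log-space (e, a, b) so they stay positive during optimization.
    e, a, b, alpha, beta = params
    return np.logaddexp(np.logaddexp(e, a - alpha * np.log(N)), b - beta * np.log(D))

def huber(residual, delta=1e-3):
    r = np.abs(residual)
    return np.where(r <= delta, 0.5 * r**2, delta * (r - 0.5 * delta))

def objective(params, N, D, log_loss_obs):
    return huber(predict_log_loss(params, N, D) - log_loss_obs).sum()

# Synthetic observations generated from known constants (same toy setup as the later sketch).
rng = np.random.default_rng(0)
N, D = np.meshgrid(np.logspace(8, 10, 5), np.logspace(9.3, 11.3, 5))
N, D = N.ravel(), D.ravel()
true_params = [np.log(1.69), np.log(406.4), np.log(410.7), 0.34, 0.28]
log_loss_obs = predict_log_loss(true_params, N, D) + rng.normal(0, 0.005, size=N.shape)

def fit(inits):
    # Run L-BFGS from each initialization and keep the best local optimum.
    best = min((minimize(objective, x0, args=(N, D, log_loss_obs), method="L-BFGS-B")
                for x0 in inits), key=lambda r: r.fun)
    return best.x, best.fun

grid = list(itertools.product([0.0, 1.0], [0.0, 5.0], [0.0, 5.0], [0.2, 0.6], [0.2, 0.6]))
chinchilla_init = [true_params]  # "initialize from a hypothesized law"

for name, inits in [("grid search", grid), ("chinchilla init", chinchilla_init)]:
    params, fun = fit(inits)
    print(name, "objective:", round(float(fun), 6), "alpha,beta:", params[3:].round(3))
```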
(3) How do we extract data after training? (Filtering? Interpolation? Mid-training checkpoint inclusion?) What happens if we only fit to the final checkpoints vs using mid-training checkpoints? Can we use the earliest checkpoints or only later checkpoints? 7/9
1
0
2
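A minimal sketch of the checkpoint-selection question in tweet 7/9: given per-run checkpoint records, you can fit to the final checkpoint of each run only, or also include mid-training checkpoints past some cutoff. The column names, runs, losses, and cutoff below are invented for illustration.

```python
# Toy sketch of the choice in tweet 7/9: fit to final checkpoints only,
# or also include mid-training checkpoints (here, everything past an arbitrary cutoff).
import pandas as pd

records = pd.DataFrame({
    "run":    ["124M", "124M", "124M", "1.3B", "1.3B", "1.3B"],
    "params": [124e6] * 3 + [1.3e9] * 3,
    "tokens": [5e9, 15e9, 26e9, 50e9, 150e9, 260e9],   # tokens seen at each checkpoint
    "loss":   [3.61, 3.42, 3.35, 2.95, 2.78, 2.71],
})

# (a) final checkpoint per run only
final_only = records.loc[records.groupby("run")["tokens"].idxmax()]

# (b) include mid-training checkpoints, dropping early/warmup ones
cutoff = 10e9
with_mid = records[records["tokens"] >= cutoff]

print(len(final_only), "points vs", len(with_mid), "points available to fit")
```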
(2) How do we train models? (Hyperparameters? Architecture? Training and eval data?) What happens if we sweep our learning rates vs fix them? Does the value of the fixed learning rate matter? 6/9
1
0
2
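For tweet 6/9, a toy sketch of how the two choices change which points end up in the scaling-law fit: take each scale's best learning rate from a sweep, or keep one fixed learning rate across all scales. Every value below is invented.

```python
# Toy sketch of the choice in tweet 6/9: per-scale best learning rate from a sweep
# vs. one fixed learning rate everywhere.
import pandas as pd

sweep = pd.DataFrame({
    "params": [124e6] * 3 + [1.3e9] * 3,
    "lr":     [1e-3, 3e-4, 1e-4] * 2,
    "loss":   [3.38, 3.35, 3.41, 2.80, 2.71, 2.74],   # final loss per (scale, lr) run
})

# (a) swept: keep each scale's best learning rate
swept = sweep.loc[sweep.groupby("params")["loss"].idxmin()]

# (b) fixed: keep one learning rate everywhere (here 3e-4), whatever it costs other scales
fixed = sweep[sweep["lr"] == 3e-4]

print(swept[["params", "lr", "loss"]], fixed[["params", "lr", "loss"]], sep="\n")
```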
(1) What form are we fitting? (What parameters? What inputs? With what relation?) What happens if we try to fit to the Chinchilla form directly? What if we fit IsoFLOP curves? What if we assume N and D scale linearly with each other? 5/9
1
0
3
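For context on tweet 5/9: the Chinchilla form referred to there is usually written L(N, D) = E + A/N^alpha + B/D^beta, with N parameters and D training tokens. Below is a minimal sketch of fitting that form directly with scipy on synthetic data; the runs and noise level are made up, and the "true" constants are the published Chinchilla estimates used only to generate the toy points.

```python
# Minimal sketch: fitting the Chinchilla parametric form L(N, D) = E + A/N^alpha + B/D^beta
# directly to synthetic (N, D, loss) points. Illustrative only -- not the paper's fitting code.
import numpy as np
from scipy.optimize import curve_fit

def chinchilla_form(ND, E, A, B, alpha, beta):
    N, D = ND
    return E + A / N**alpha + B / D**beta

# Synthetic "training runs": a small grid of parameter counts N and token counts D.
rng = np.random.default_rng(0)
N, D = np.meshgrid(np.logspace(8, 10, 5), np.logspace(9.3, 11.3, 5))
N, D = N.ravel(), D.ravel()
true = dict(E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28)  # published Chinchilla estimates
loss = chinchilla_form((N, D), **true) + rng.normal(0, 0.01, size=N.shape)

# Fit the form directly; p0 is one arbitrary initialization (tweet 8/9 is about why that matters).
popt, _ = curve_fit(chinchilla_form, (N, D), loss,
                    p0=[2.0, 300.0, 300.0, 0.3, 0.3], maxfev=20000)
print(dict(zip(["E", "A", "B", "alpha", "beta"], np.round(popt, 3))))
```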
From this, we made lists of algorithm details and categorized them. We then implemented and compared how these details affect the scaling law fit (it's really dramatic). The categories and a sneak peek at the analyses: 4/9
1
0
4
We pored over as many scaling laws papers as we could find (51 in total) to examine how details (when they're provided) and conclusions vary. 3/9
1
0
6
Scaling laws repro is very brittle! Seemingly minor implementation differences swing the final predictions drastically: should you train your model on 10B or 1T data points? 🤷‍♀️ 2/9
1
0
6
Of course, there's much more in the paper than we could fit in a tweet thread! Paper: https://t.co/9P1YLntwLA And thanks to all my amazing co-authors: @WeijiaShi2, @ArtidoroPagnoni, @PeterWestTM, and @universeinanegg!
0
0
20
We use span-alignment algorithms from bioinformatics to quantify the implicit outline that RLHF'd models use. It turns out that even when using truncated sampling to compensate for differences in diversity, aligned models exhibit significantly more overlap than their base LMs.
1
0
12
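A hedged sketch of the kind of measurement the tweet above describes: align pairs of samples generated for the same prompt and report how much of each sample lies in long shared spans. Python's difflib matcher stands in here for the bioinformatics-style span aligner the tweet mentions, and the samples and minimum span length are invented.

```python
# Sketch: quantify "anchor span" overlap by finding long matching blocks between pairs of
# samples for the same prompt. difflib stands in for the actual span-alignment algorithm.
from difflib import SequenceMatcher
from itertools import combinations

samples = [
    "the three branches of government are the legislative , executive , and judicial branches".split(),
    "there are three branches : the legislative , executive , and judicial branches of government".split(),
    "government has legislative , executive , and judicial branches that check one another".split(),
]

MIN_SPAN = 4  # only count matching blocks of at least this many tokens as shared spans

def shared_span_coverage(a, b):
    """Fraction of tokens in `a` covered by spans of >= MIN_SPAN tokens also found in `b`."""
    covered = set()
    for block in SequenceMatcher(None, a, b).get_matching_blocks():
        if block.size >= MIN_SPAN:
            covered.update(range(block.a, block.a + block.size))
    return len(covered) / len(a)

for (i, a), (j, b) in combinations(enumerate(samples), 2):
    print(f"samples {i} and {j}: {shared_span_coverage(a, b):.0%} of sample {i} lies in shared spans")
```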
It turns out the improvement in long-form generation that RLHF'd models exhibit is mirrored by reduced accuracy on the wider distribution base LLMs are trained on. RLHF'd models aren't quite as good next-token predictors anymore, even with finetuning, and even on chat data meant for RLHF!
1
0
9
RLHF-aligned LMs excel at long-form generation, but how? We show how current models rely on anchor spans: strings that occur across many samples for the same prompt, forming an implicit outline (visualized below).
6
39
220
.@colinraffel, @margs_li, @SamuelAinsworth, and I are proposing a workshop on Collaborative, Communal, and Continual Machine Learning at NeurIPS 2023! If you'd like to be a reviewer for our workshop, please sign up here:
docs.google.com
Thanks for considering being a reviewer for our proposed workshop on Collaborative, Communal and Continual Machine Learning (CoML) at NeurIPS 2023. We will contact you with further details if our...
2
11
35
Yes, the sneak peek is a joke, generated by @AnthropicAI's Claude. The message is not, though! We're super excited to discuss modular / sparse LLMs and how we train them ☺️
1
0
7
Sneak peek: "My fellow AI practitioners, I come to you today to spread the good news of embarrassingly parallel training of expert models. Too often we limit ourselves to single monolithic models. No more I say! The path to AI enlightenment is through specialization."
For this week's NLP Seminar, we are excited to host @margs_li and @ssgrn! The talk will happen Thursday at 11 AM PT. Non-Stanford affiliates registration link: https://t.co/zylKtG6Jtn. Information will be sent out one hour before the talk.
1
5
70
New paper alert!! ✨ Translate to Disambiguate: Zero-shot Multilingual Word Sense Disambiguation with Pretrained Language Models (PLMs) ✨ We evaluate how well PLMs translate words in context and then leverage this prompting setup to perform zero-shot WSD on 18 languages! 1/n
1
25
61
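A loose sketch of the prompting setup described in the tweet above: ask a pretrained LM for the translation of an ambiguous word in context and let the preferred translation pick the sense. The prompt template, the candidate Spanish glosses, and the English-only gpt2 stand-in below are all assumptions made to keep the example runnable; the paper's actual prompts, languages, and multilingual PLMs differ.

```python
# Sketch: score candidate translations of an ambiguous word in context under a causal LM,
# and treat the highest-scoring translation as the predicted sense. Template and candidates
# are assumptions; gpt2 is only a runnable stand-in for a multilingual PLM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

sentence = "She deposited the check at the bank before noon."
candidates = {"financial institution": "banco", "river edge": "orilla"}  # assumed Spanish glosses

def sequence_logprob(text):
    """Total log-probability of `text` under the causal LM."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    return logprobs.gather(-1, ids[:, 1:, None]).sum().item()

scores = {}
for sense, translation in candidates.items():
    prompt = f'In the sentence "{sentence}", the word "bank" translates to Spanish as "{translation}".'
    scores[sense] = sequence_logprob(prompt)

print("predicted sense:", max(scores, key=scores.get), scores)
```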
Sharing our project on 1) accelerating and 2) stabilizing training for large language-vision models. 1) Towards accelerating training, we introduce SwitchBack, a linear layer for int8 quantized training, which matches bfloat16 within 0.1 for CLIP ViT-Huge: https://t.co/MqqxtTZfi9
5
57
221
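To make the int8 angle in the tweet above concrete, here is a generic sketch of int8-quantized matrix multiplication for a linear layer: quantize activations and weights to int8, accumulate in int32, then rescale. This is not the SwitchBack layer itself (which mixes int8 and higher-precision matmuls across the forward and backward passes); it is just the basic idea, forward-only, in numpy.

```python
# Generic illustration of int8-quantized matmul for a linear layer (NOT SwitchBack itself):
# quantize activations and weights to int8, multiply with int32 accumulation, then rescale.
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantization: returns an int8 tensor and its scale."""
    scale = np.abs(x).max() / 127.0 + 1e-12
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8), scale

def int8_linear(x, w):
    """y = x @ w.T computed with int8 operands and int32 accumulation, then dequantized."""
    xq, sx = quantize_int8(x)
    wq, sw = quantize_int8(w)
    acc = xq.astype(np.int32) @ wq.astype(np.int32).T
    return acc.astype(np.float32) * (sx * sw)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 512)).astype(np.float32)    # a batch of activations
w = rng.normal(size=(256, 512)).astype(np.float32)  # weight matrix of a linear layer

y_int8 = int8_linear(x, w)
y_ref = x @ w.T
print("max abs error vs float matmul:", np.abs(y_int8 - y_ref).max())
```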
I can't stop thinking about jart's optimization (ty for teaching!): the idea of keeping weights in swap when not needed + this paper trains many LMs, each specialized on one cluster of the corpus. You can get away with using only 2 or 4 for inference. https://t.co/Lq1h4SvxqH
2
12
170
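A toy sketch of the inference pattern the tweet above points at: many expert LMs, one per corpus cluster, with only the top-k relevant experts consulted per prompt (the rest can stay paged out on disk). The bag-of-words routing, centroids, and fake per-expert next-token distributions below are stand-ins for illustration, not the paper's actual method.

```python
# Sketch: route a prompt to the 2 nearest cluster centroids and mix only those experts'
# next-token distributions; the other experts never need to be loaded into memory.
import numpy as np

VOCAB = ["court", "ruling", "appeal", "protein", "cell", "enzyme", "goal", "match", "league"]

def embed(text):
    """Toy bag-of-words embedding over a tiny vocabulary."""
    words = text.lower().split()
    v = np.array([words.count(w) for w in VOCAB], dtype=float)
    return v / (np.linalg.norm(v) + 1e-12)

# Pretend we have 8 experts, each with a cluster centroid saved at training time.
rng = np.random.default_rng(0)
centroids = rng.normal(size=(8, len(VOCAB)))
centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)

def expert_next_token_probs(expert_id):
    """Stand-in for loading expert `expert_id` from disk and running it on the prompt."""
    p = np.random.default_rng(int(expert_id)).random(len(VOCAB))
    return p / p.sum()

prompt = "the appeal court issued a ruling on the match"
sims = centroids @ embed(prompt)
top_k = np.argsort(sims)[-2:]                      # consult only 2 of the 8 experts
weights = np.exp(sims[top_k]) / np.exp(sims[top_k]).sum()

mixture = sum(w * expert_next_token_probs(e) for w, e in zip(weights, top_k))
print("experts used:", top_k.tolist(), "| mixture sums to", round(float(mixture.sum()), 3))
```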
n+1/n: I'm thinking of starting a fanclub called BTM Ent Army to... well, if you've gotten this much of the joke, you can fill in the last piece and figure out what we'd do
0
0
5