Margaret Li
@margs_li
Followers
1K
Following
315
Media
21
Statuses
71
👩‍💻 PhD student @UWCSE / @UWNLP & @MetaAI. Formerly RE @FacebookAI Research, @Penn CS | certified bi-coastal bb IAH/PEK/PHL/NYC/SFO/SEA
Joined June 2019
We nearly drove ourselves insane trying to reproduce scaling laws papers. So of course we wrote a paper about it 😵‍💫 1/9
1
30
161
To appear at #ICLR2025! Thanks to the coauthors @snehaark and @LukeZettlemoyer, who descended into this madness with me. Arxiv: https://t.co/nuwuPV2IoU Code: https://t.co/2oy1RYpvGU Checkpoints: https://t.co/mhWqlJMF0k 9/9
huggingface.co
0
1
12
(4) How are we optimizing the fit? (Loss? Optimizer? Initialization?) Do we need to perform a grid search for our initializations? How many points should we try? What happens if we initialize from a hypothesized law (e.g. Chinchilla)? 8/9
1
0
1
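As a rough illustration of the choices in tweet 8/9 above: a commonly cited recipe (from the Chinchilla paper) is to minimize a Huber loss on log-loss with L-BFGS, starting from a grid of initializations. The toy sketch below compares that grid search against a single initialization at the published Chinchilla constants; the synthetic data, the grid values, and the overall setup are illustrative stand-ins, not the paper's code.

```python
# Toy sketch of the fitting choices in tweet 8/9: Huber loss on log-loss, L-BFGS,
# and a grid of initializations vs. a single init at the published Chinchilla constants.
# Synthetic data and an invented grid; not the paper's code.
import itertools
import numpy as np
from scipy.optimize import minimize

def predict_log_loss(params, N, D):
    # E, A, B are kept in log-space (e, a, b) so they stay positive during optimization.
    e, a, b, alpha, beta = params
    return np.logaddexp(np.logaddexp(e, a - alpha * np.log(N)), b - beta * np.log(D))

def huber(residual, delta=1e-3):
    r = np.abs(residual)
    return np.where(r <= delta, 0.5 * r**2, delta * (r - 0.5 * delta))

def objective(params, N, D, log_loss_obs):
    return huber(predict_log_loss(params, N, D) - log_loss_obs).sum()

# Synthetic observations generated from known constants (same toy setup as the later sketch).
rng = np.random.default_rng(0)
N, D = np.meshgrid(np.logspace(8, 10, 5), np.logspace(9.3, 11.3, 5))
N, D = N.ravel(), D.ravel()
true_params = [np.log(1.69), np.log(406.4), np.log(410.7), 0.34, 0.28]
log_loss_obs = predict_log_loss(true_params, N, D) + rng.normal(0, 0.005, size=N.shape)

def fit(inits):
    # Run L-BFGS from each initialization and keep the best local optimum.
    best = min((minimize(objective, x0, args=(N, D, log_loss_obs), method="L-BFGS-B")
                for x0 in inits), key=lambda r: r.fun)
    return best.x, best.fun

grid = list(itertools.product([0.0, 1.0], [0.0, 5.0], [0.0, 5.0], [0.2, 0.6], [0.2, 0.6]))
chinchilla_init = [true_params]  # "initialize from a hypothesized law"

for name, inits in [("grid search", grid), ("chinchilla init", chinchilla_init)]:
    params, fun = fit(inits)
    print(name, "objective:", round(float(fun), 6), "alpha,beta:", params[3:].round(3))
```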
(3) How do we extract data after training? (Filtering? Interpolation? Mid-training checkpoint inclusion?) What happens if we only fit to the final checkpoints vs using mid-training checkpoints? Can we use the earliest checkpoints or only later checkpoints? 7/9
1
0
2
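A minimal sketch of the checkpoint-selection question in tweet 7/9: given per-run checkpoint records, you can fit to the final checkpoint of each run only, or also include mid-training checkpoints past some cutoff. The column names, runs, losses, and cutoff below are invented for illustration.

```python
# Toy sketch of the choice in tweet 7/9: fit to final checkpoints only,
# or also include mid-training checkpoints (here, everything past an arbitrary cutoff).
import pandas as pd

records = pd.DataFrame({
    "run":    ["124M", "124M", "124M", "1.3B", "1.3B", "1.3B"],
    "params": [124e6] * 3 + [1.3e9] * 3,
    "tokens": [5e9, 15e9, 26e9, 50e9, 150e9, 260e9],   # tokens seen at each checkpoint
    "loss":   [3.61, 3.42, 3.35, 2.95, 2.78, 2.71],
})

# (a) final checkpoint per run only
final_only = records.loc[records.groupby("run")["tokens"].idxmax()]

# (b) include mid-training checkpoints, dropping early/warmup ones
cutoff = 10e9
with_mid = records[records["tokens"] >= cutoff]

print(len(final_only), "points vs", len(with_mid), "points available to fit")
```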
(2) How do we train models? (Hyperparameters? Architecture? Training and eval data?) What happens if we sweep our learning rates vs fix them? Does the value of the fixed learning rate matter? 6/9
1
0
2
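For tweet 6/9, a toy sketch of how the two choices change which points end up in the scaling-law fit: take each scale's best learning rate from a sweep, or keep one fixed learning rate across all scales. Every value below is invented.

```python
# Toy sketch of the choice in tweet 6/9: per-scale best learning rate from a sweep
# vs. one fixed learning rate everywhere.
import pandas as pd

sweep = pd.DataFrame({
    "params": [124e6] * 3 + [1.3e9] * 3,
    "lr":     [1e-3, 3e-4, 1e-4] * 2,
    "loss":   [3.38, 3.35, 3.41, 2.80, 2.71, 2.74],   # final loss per (scale, lr) run
})

# (a) swept: keep each scale's best learning rate
swept = sweep.loc[sweep.groupby("params")["loss"].idxmin()]

# (b) fixed: keep one learning rate everywhere (here 3e-4), whatever it costs other scales
fixed = sweep[sweep["lr"] == 3e-4]

print(swept[["params", "lr", "loss"]], fixed[["params", "lr", "loss"]], sep="\n")
```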
(1) What form are we fitting? (What parameters? What inputs? With what relation?) What happens if we try to fit to the Chinchilla form directly? What if we fit IsoFLOP curves? What if we assume N and D scale linearly with each other? 5/9
1
0
3
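For context on tweet 5/9: the Chinchilla form referred to there is usually written L(N, D) = E + A/N^alpha + B/D^beta, with N parameters and D training tokens. Below is a minimal sketch of fitting that form directly with scipy on synthetic data; the runs and noise level are made up, and the "true" constants are the published Chinchilla estimates used only to generate the toy points.

```python
# Minimal sketch: fitting the Chinchilla parametric form L(N, D) = E + A/N^alpha + B/D^beta
# directly to synthetic (N, D, loss) points. Illustrative only -- not the paper's fitting code.
import numpy as np
from scipy.optimize import curve_fit

def chinchilla_form(ND, E, A, B, alpha, beta):
    N, D = ND
    return E + A / N**alpha + B / D**beta

# Synthetic "training runs": a small grid of parameter counts N and token counts D.
rng = np.random.default_rng(0)
N, D = np.meshgrid(np.logspace(8, 10, 5), np.logspace(9.3, 11.3, 5))
N, D = N.ravel(), D.ravel()
true = dict(E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28)  # published Chinchilla estimates
loss = chinchilla_form((N, D), **true) + rng.normal(0, 0.01, size=N.shape)

# Fit the form directly; p0 is one arbitrary initialization (tweet 8/9 is about why that matters).
popt, _ = curve_fit(chinchilla_form, (N, D), loss,
                    p0=[2.0, 300.0, 300.0, 0.3, 0.3], maxfev=20000)
print(dict(zip(["E", "A", "B", "alpha", "beta"], np.round(popt, 3))))
```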
From this, we made lists of algorithm details and categorized them. We then implemented and compared how these details affect the scaling law fit (it's really dramatic). The categories and a sneak peek at the analyses: 4/9
1
0
4
We pored over as many scaling laws papers as we could find (51 in total) to examine how details (when they're provided) and conclusions vary. 3/9
1
0
6
Scaling laws repro is very brittle! Seemingly minor implementation differences swing the final predictions drastically: should you train your model on 10B or 1T data points? 🤷‍♀️ 2/9
1
0
6
Of course, there's much more in the paper than we could fit in a tweet thread! Paper: https://t.co/9P1YLntwLA And thanks to all my amazing co-authors: @WeijiaShi2, @ArtidoroPagnoni, @PeterWestTM, and @universeinanegg!
0
0
20
We use span-alignment algorithms from bioinformatics to quantify the implicit outline that RLHF'd models use. It turns out that even when using truncated sampling to compensate for differences in diversity, aligned models exhibit significantly more overlap than their base LMs.
1
0
12
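A hedged sketch of the kind of measurement the tweet above describes: align pairs of samples generated for the same prompt and report how much of each sample lies in long shared spans. Python's difflib matcher stands in here for the bioinformatics-style span aligner the tweet mentions, and the samples and minimum span length are invented.

```python
# Sketch: quantify "anchor span" overlap by finding long matching blocks between pairs of
# samples for the same prompt. difflib stands in for the actual span-alignment algorithm.
from difflib import SequenceMatcher
from itertools import combinations

samples = [
    "the three branches of government are the legislative , executive , and judicial branches".split(),
    "there are three branches : the legislative , executive , and judicial branches of government".split(),
    "government has legislative , executive , and judicial branches that check one another".split(),
]

MIN_SPAN = 4  # only count matching blocks of at least this many tokens as shared spans

def shared_span_coverage(a, b):
    """Fraction of tokens in `a` covered by spans of >= MIN_SPAN tokens also found in `b`."""
    covered = set()
    for block in SequenceMatcher(None, a, b).get_matching_blocks():
        if block.size >= MIN_SPAN:
            covered.update(range(block.a, block.a + block.size))
    return len(covered) / len(a)

for (i, a), (j, b) in combinations(enumerate(samples), 2):
    print(f"samples {i} and {j}: {shared_span_coverage(a, b):.0%} of sample {i} lies in shared spans")
```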
It turns out the improvement in long-form generation that RLHF'd models exhibit is mirrored by reduced accuracy on the wider distribution base LLMs are trained on. RLHF'd models aren't quite as good next-token predictors anymore, even with finetuning, and even on chat data meant for RLHF!
1
0
9
RLHF-aligned LMs excel at long-form generation, but how? We show how current models rely on anchor spans: strings that occur across many samples for the same prompt, forming an implicit outline (visualized below).
6
39
220
.@colinraffel, @margs_li, @SamuelAinsworth, and I are proposing a workshop on Collaborative, Communal, and Continual Machine Learning at NeurIPS 2023! If you'd like to be a reviewer for our workshop, please sign up here:
docs.google.com
Thanks for considering being a reviewer for our proposed workshop on Collaborative, Communal and Continual Machine Learning (CoML) at NeurIPS 2023. We will contact you with further details if our...
2
11
35
Yes, the sneak peek is a joke, generated by @AnthropicAI's Claude. The message is not, though! We're super excited to discuss modular / sparse LLMs and how we train them ☺️
1
0
7
Sneak peek: "My fellow AI practitioners, I come to you today to spread the good news of embarrassingly parallel training of expert models. Too often we limit ourselves to single monolithic models. No more I say! The path to AI enlightenment is through specialization."
For this week's NLP Seminar, we are excited to host @margs_li and @ssgrn! The talk will happen Thursday at 11 AM PT. Non-Stanford affiliates registration link: https://t.co/zylKtG6Jtn. Information will be sent out one hour before the talk.
1
5
70
New paper alert!! ✨ Translate to Disambiguate: Zero-shot Multilingual Word Sense Disambiguation with Pretrained Language Models (PLMs) ✨ We evaluate how well PLMs translate words in context and then leverage this prompting setup to perform zero-shot WSD on 18 languages! 1/n
1
25
61
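A loose sketch of the prompting setup described in the tweet above: ask a pretrained LM for the translation of an ambiguous word in context and let the preferred translation pick the sense. The prompt template, the candidate Spanish glosses, and the English-only gpt2 stand-in below are all assumptions made to keep the example runnable; the paper's actual prompts, languages, and multilingual PLMs differ.

```python
# Sketch: score candidate translations of an ambiguous word in context under a causal LM,
# and treat the highest-scoring translation as the predicted sense. Template and candidates
# are assumptions; gpt2 is only a runnable stand-in for a multilingual PLM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

sentence = "She deposited the check at the bank before noon."
candidates = {"financial institution": "banco", "river edge": "orilla"}  # assumed Spanish glosses

def sequence_logprob(text):
    """Total log-probability of `text` under the causal LM."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    return logprobs.gather(-1, ids[:, 1:, None]).sum().item()

scores = {}
for sense, translation in candidates.items():
    prompt = f'In the sentence "{sentence}", the word "bank" translates to Spanish as "{translation}".'
    scores[sense] = sequence_logprob(prompt)

print("predicted sense:", max(scores, key=scores.get), scores)
```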
Sharing our project on 1) accelerating and 2) stabilizing training for large language-vision models. 1) Towards accelerating training, we introduce SwitchBack, a linear layer for int8 quantized training, which matches bfloat16 within 0.1 for CLIP ViT-Huge: https://t.co/MqqxtTZfi9
5
57
221
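To make the int8 angle in the tweet above concrete, here is a generic sketch of int8-quantized matrix multiplication for a linear layer: quantize activations and weights to int8, accumulate in int32, then rescale. This is not the SwitchBack layer itself (which mixes int8 and higher-precision matmuls across the forward and backward passes); it is just the basic idea, forward-only, in numpy.

```python
# Generic illustration of int8-quantized matmul for a linear layer (NOT SwitchBack itself):
# quantize activations and weights to int8, multiply with int32 accumulation, then rescale.
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantization: returns an int8 tensor and its scale."""
    scale = np.abs(x).max() / 127.0 + 1e-12
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8), scale

def int8_linear(x, w):
    """y = x @ w.T computed with int8 operands and int32 accumulation, then dequantized."""
    xq, sx = quantize_int8(x)
    wq, sw = quantize_int8(w)
    acc = xq.astype(np.int32) @ wq.astype(np.int32).T
    return acc.astype(np.float32) * (sx * sw)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 512)).astype(np.float32)    # a batch of activations
w = rng.normal(size=(256, 512)).astype(np.float32)  # weight matrix of a linear layer

y_int8 = int8_linear(x, w)
y_ref = x @ w.T
print("max abs error vs float matmul:", np.abs(y_int8 - y_ref).max())
```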
I can't stop thinking about jart's optimization (ty for teaching!): the idea of keeping weights in swap when not needed + this paper trains many LMs, each specialized on one cluster of the corpus. You can get away with using only 2 or 4 for inference. https://t.co/Lq1h4SvxqH
2
12
170
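A toy sketch of the inference pattern the tweet above points at: many expert LMs, one per corpus cluster, with only the top-k relevant experts consulted per prompt (the rest can stay paged out on disk). The bag-of-words routing, centroids, and fake per-expert next-token distributions below are stand-ins for illustration, not the paper's actual method.

```python
# Sketch: route a prompt to the 2 nearest cluster centroids and mix only those experts'
# next-token distributions; the other experts never need to be loaded into memory.
import numpy as np

VOCAB = ["court", "ruling", "appeal", "protein", "cell", "enzyme", "goal", "match", "league"]

def embed(text):
    """Toy bag-of-words embedding over a tiny vocabulary."""
    words = text.lower().split()
    v = np.array([words.count(w) for w in VOCAB], dtype=float)
    return v / (np.linalg.norm(v) + 1e-12)

# Pretend we have 8 experts, each with a cluster centroid saved at training time.
rng = np.random.default_rng(0)
centroids = rng.normal(size=(8, len(VOCAB)))
centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)

def expert_next_token_probs(expert_id):
    """Stand-in for loading expert `expert_id` from disk and running it on the prompt."""
    p = np.random.default_rng(int(expert_id)).random(len(VOCAB))
    return p / p.sum()

prompt = "the appeal court issued a ruling on the match"
sims = centroids @ embed(prompt)
top_k = np.argsort(sims)[-2:]                      # consult only 2 of the 8 experts
weights = np.exp(sims[top_k]) / np.exp(sims[top_k]).sum()

mixture = sum(w * expert_next_token_probs(e) for w, e in zip(weights, top_k))
print("experts used:", top_k.tolist(), "| mixture sums to", round(float(mixture.sum()), 3))
```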
n+1/n: I'm thinking of starting a fanclub called BTM Ent Army to... well, if you've gotten this much of the joke, you can fill in the last piece and figure out what we'd do
0
0
5