I've done a deep dive into distributed training and efficient fine-tuning of LLMs. I get into the messy internals of DeepSpeed ZeRO and FSDP, summarize practical guidelines and highlight gotchas with multi-GPU training.
Do read, should be fun!
New deep dive on tokenization! 🔎
I've done a deep dive into the various aspects of tokenization, organized as bite-size chapters, with code and notebooks.
Check it out:
Some wild results here. They fine-tuned models on the auxiliary training sets for different benchmarks. Trends for Phi-1.3B are so bizarre. From my understanding, there's no other data mixed in here for fine-tuning (apart from benchmark training data), and trends for ALL other…
Haha! The silver lining of all the LLM noise is that you get papers like this that you typically wouldn't see in academia. 😂
LLMs are entering the era of "natty or not".
@thesephist
This is the expanded neuron view in BertViz:
FYI, BertViz has been around since 2019, but the visualization needs access to the query and key vectors etc., which requires custom, architecture-specific code, so the package only supports BERT, RoBERTa, and GPT-2
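For reference, a minimal sketch of how the neuron view is invoked, roughly following the BertViz README (the model and sentences here are placeholders; check the repo for the current API):

```python
# Minimal sketch of BertViz's neuron view, meant to run in a notebook.
# The neuron view needs the query/key vectors, so it uses BertViz's own
# model classes rather than the vanilla Hugging Face ones.
from bertviz.transformers_neuron_view import BertModel, BertTokenizer
from bertviz.neuron_view import show

model_version = "bert-base-uncased"
model = BertModel.from_pretrained(model_version, output_attentions=True)
tokenizer = BertTokenizer.from_pretrained(model_version, do_lower_case=True)

# Renders the interactive query/key/attention view for the chosen layer and head.
show(model, "bert", tokenizer,
     "The cat sat on the mat", "The cat lay on the rug",
     layer=2, head=0)
```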
In particular, some of the questions I focus on are:
1. Why focus on distributed training and performance? What happens under the hood with DeepSpeed and FSDP?
2. What hardware setup do I need?
3. What are the various efficient fine-tuning optimizations? What are the tradeoffs?
4. What are some practical guidelines that can capture all the major training optimizations, in order to train large models in a multi-GPU and multi-node setting?
5. What open-source codebases can I use right now? What are the pros and cons?
A lot of the details are also based on
@StasBekman
's investigations. My goal has been to elaborate on the latest distributed training strategies like DeepSpeed and FSDP, and make an updated list of practical guidelines for the current state of the
@huggingface
ecosystem.
There are still a number of murky areas with distributed training in
@huggingface
, esp. when it comes to DeepSpeed vs FSDP. I have experience mainly with DeepSpeed, but it'll be interesting to hear takes from those who've worked with both, such as
@sourab_m
@abacaj
Doing a graduate PL course this quarter (my first PL course; better late than never) and lambda calculus is weird but so.... elegant.
Every programming concept/building block is simply its affordance, nothing more. What something is, is exactly what you can do with it.
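A toy illustration of that idea (my own example, not from the course): Church booleans in plain lambdas. A boolean *is* nothing but the choice it makes between two arguments.

```python
# Church booleans: TRUE and FALSE are defined purely by what you can do with
# them, i.e. how they select between two options.
TRUE = lambda a: lambda b: a
FALSE = lambda a: lambda b: b

AND = lambda p: lambda q: p(q)(p)            # if p then q else p
IF = lambda cond: lambda then: lambda els: cond(then)(els)

print(IF(TRUE)("yes")("no"))                 # -> "yes"
print(IF(AND(TRUE)(FALSE))("yes")("no"))     # -> "no"
```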
This is aimed at a broad audience, but more relevant for LLM hackers trying to up their game, and thus is written in the same spirit as
@jeremyphoward
's hacker guides. This should also be relevant for startups/companies just getting into fine-tuning open-source LLMs.
The full post is 7000+ words long! The current open-source documentation is very messy, so hopefully this fills some of the gaps in understanding what matters for performance and scaling when you're fine-tuning language models.
Quantum computers would work instantly with this one simple hack:
End-to-end fabrication
Fabricate chip -> Fabricate experiments -> Fabricate results.
A fun project I worked on the past few weeks:
Jester - a text-to-meme generation engine! Enter any text you like and you can make a meme in a matter of seconds. Try it out here:
Question: Has there been any study showing that increasing vocab size 2x (50k -> 100k) leads to an x% improvement in fertility/compression on the SAME corpus, like CommonCrawl?
The different vocab sizes across models (32K Llama, 50K GPT2, 100k GPT4) are not directly comparable because the…
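A sketch of the kind of measurement I have in mind, using tiktoken's ~50K and ~100K vocabularies on the same text (the corpus file name is hypothetical, and this still doesn't isolate vocab size, since the two tokenizers were trained on different data):

```python
import tiktoken

# Any fixed corpus sample works; the file name here is made up.
text = open("commoncrawl_sample.txt").read()
n_words = len(text.split())

for name in ["r50k_base", "cl100k_base"]:  # ~50K (GPT-2/3) vs ~100K (GPT-4) vocabularies
    enc = tiktoken.get_encoding(name)
    n_tokens = len(enc.encode(text, disallowed_special=()))
    print(f"{name}: fertility = {n_tokens / n_words:.3f} tokens per whitespace word")
```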
There's a weird reality that we mostly ignore in language modeling. It's the fact that we don't _actually_ train these models end-to-end.
That's because we have the tokenizer! It's actually a really frustrating piece to tune, with sometimes small changes mattering a lot and…
@abacaj
GenAI strategist resisting the urge to say "in-context learning" and "Large Language Models" instead of just fine-tuning DeBERTa on the million-odd training examples they have (challenge impossible)
@cHHillee
@jeremyphoward
Perfect! One small note - the second all-gather in ZeRO-1/2 is an all-gather of updated model parameters, unlike the equivalent all-gather stage of the all-reduce in DDP, which is an all-gather of gradients.
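A toy sketch of the distinction (illustrative only, not the actual ZeRO implementation), runnable with `torchrun --nproc_per_node=2 script.py`:

```python
import torch
import torch.distributed as dist

dist.init_process_group("gloo")  # use "nccl" on GPUs
rank, world = dist.get_rank(), dist.get_world_size()

params = torch.zeros(4)
grads = torch.ones(4) * (rank + 1)

# DDP: all-reduce of gradients. A ring all-reduce is internally a
# reduce-scatter followed by an all-gather *of gradients*.
dist.all_reduce(grads)

# ZeRO-1/2: each rank owns one optimizer shard and updates only its slice...
shard_size = params.numel() // world
my_slice = slice(rank * shard_size, (rank + 1) * shard_size)
updated_shard = params[my_slice] - 0.1 * grads[my_slice]  # local optimizer step

# ...and the second all-gather collects the *updated parameters*, not gradients.
gathered = [torch.empty_like(updated_shard) for _ in range(world)]
dist.all_gather(gathered, updated_shard)
params = torch.cat(gathered)

dist.destroy_process_group()
```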
In other news, I'm happy to have played a small part in the new PEFT release by adding IA3! IA3 is a highly memory-efficient PEFT method with LoRA-like performance! It was also my first open-source contribution, so that was fun. PR here:
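A hedged sketch of what using IA3 through PEFT roughly looks like (the module names below are specific to BLOOM; other architectures need their own target/feed-forward module names):

```python
from transformers import AutoModelForCausalLM
from peft import IA3Config, get_peft_model

model = AutoModelForCausalLM.from_pretrained("bigscience/bloomz-560m")

ia3_config = IA3Config(
    task_type="CAUSAL_LM",
    target_modules=["query_key_value", "dense_4h_to_h"],  # attention + FFN projections in BLOOM
    feedforward_modules=["dense_4h_to_h"],                # the feed-forward subset of target_modules
)

model = get_peft_model(model, ia3_config)
model.print_trainable_parameters()  # IA3 learns only tiny rescaling vectors, hence the memory efficiency
```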
Super interesting release!
A brief summary of why this language-specific focus is much needed, plus some comparison with Llama-2-7B-chat:
- Indic languages have complex script rules. Predominantly fusional, + sometimes agglutinative - add a vowel to change gender, tense, etc. 🧵
🚨New model alert🚨
We're super excited to release OpenHathi-Hi-v0.1, the first Hindi LLM from our OpenHathi series of models. This model is trained under compute and data constraints to show that we can get GPT-3.5-like performance on Indic languages with a frugal budget. 1/5
@StasBekman
Will correct! Looks like I used the bandwidth mentioned for Inf instances by mistake. I also want to add a reference to the AWS docs for completeness. I believe this is from the P4d instance (the one for training) docs:
Is that right?
The internet democratised the means to reach a wide audience, but the result was that salaries became more long-tailed, not less. Not surprising if you see that before, people had to make do with what was available locally. The internet removed that restriction.
@monteskw
@pHequals7
If people were not "educated enough", they would have happily chosen the Rs. 72,000 promised by the Congress. They. Didn't. Don't underestimate the average Indian voter.
the data is clear
someone is going to release an energy drink with 600mg of caffeine in the next 5 years.
if this scaling law holds, we’ll be over 1200mg of caffeine by 2036.
the future is going to be insane 🚀
I finally got to go over
@ilyasut
's talk at Simons Institute this week, and it was brilliant! In the same spirit as
@DrJimFan
's excellent summary, my detailed notes are here:
Unlike most other lectures in the LLM workshop where Ilya presented, there…
There're few who can deliver both great AI research and charismatic talks. OpenAI Chief Scientist
@ilyasut
is one of them.
I watched Ilya's lecture at Simons Institute, where he delved into why unsupervised learning works through the lens of compression.
Sharing my notes:
-…
[4/8]
4. Challenges with Tokenization: Challenges with integer tokenization, tokenization for non-English languages and going multilingual, with a focus on the recent No Language Left Behind (NLLB) effort from Meta.
[2/8]
1. Intro: A quick introduction on tokens and the different tokenization algorithms out there.
2. BPE: A closer look at the Byte-Pair Encoding tokenization algorithm. We'll also go over a minimal implementation for training a BPE model.
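For a flavour of the BPE chapter, here's a toy training loop (illustrative, not the notebook code): count adjacent-symbol pairs, merge the most frequent pair, repeat.

```python
from collections import Counter

def get_pair_counts(words):
    """words: mapping from a tuple of symbols to that word's frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with the merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

corpus = "low low lower lowest new newer".split()
words = Counter(tuple(w) + ("</w>",) for w in corpus)

merges = []
for _ in range(10):              # number of merges is the vocab-size knob
    pairs = get_pair_counts(words)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    merges.append(best)
    words = merge_pair(words, best)

print(merges)
```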
Is the Falcon-40b paper still not out? I'm curious about how they trained their tokenizer. Fun fact: All integers from 0-999 get segmented as 1 token, except 957, which gets 2 tokens (??).
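A quick way to check this yourself (assumes `transformers` is installed and the Falcon tokenizer is available on the Hub):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("tiiuae/falcon-40b")

# Integers 0-999 that don't get a single token.
multi_token = [n for n in range(1000)
               if len(tok.encode(str(n), add_special_tokens=False)) > 1]
print(multi_token)  # per the observation above, I'd expect something like [957]
```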
Went for a run after a long break. Struggled at the end, but after a quick glance around, I yelled "Who's gonna carry the boats?". Can confirm that's an instant 20% stamina boost.
Something to add to this: As put by
@nntaleb
, we are anticipation machines. Our ability to mentally project and contemplate different actions without performing them allows us to cheat evolution. The same circuits also seem to interfere with our emotional health?
@thesephist
Np! Tbh I've been trying to find what I can use to visualize neurons/attention with in-context examples - e.g. which few-shot example mattered more in the prompt (do you know any? :p). The per-token view of BertViz crashes my notebook if you have hundreds of tokens (and with 100s of…
The only prerequisite assumed is a basic introduction to NLP/language models, such as
@jeremyphoward
's LLM for hackers intro or
@JayAlammar
's videos on LLM tokenizers. I've tried to create a complementary resource to
@huggingface
's NLP course. Topics Covered:
Okay, I tried Bard after quite some time, now with extensions, and it's actually very good! I tried out some queries to get information from YouTube, Flights, Gmail and regular Search. It's very fast and mostly works. From initial testing, some limitations/issues 🧵
🚨Bard feature drop: Extensions!
- connects with Google services
- retrieves from, reasons over and composes APIs
- powered by PaLM 2
First step in the long arc of shipping LLM-powered 'agents' in consumer-grade products
Far from done but excited to share this with the 🌎
🧵
@pHequals7
@ewarren
I think her main problem is with the fact that they operate the iOS App Store while also distributing their own apps on the platform.
(n/n)
However, if the model first outputs English text, and then translates the answer to Hindi, it outperforms GPT-4 on (Hindi-translated) MT-bench - mainly because of Llama-2-7B's English-heavy pretraining. This is thinking in English, and speaking in Hindi!
How it works: We trained a Transformer model to classify user text and show 10 relevant templates. For each template, we labelled (manually!!) 5 example (user prompt, meme caption) pairs to make a custom prompt for GPT-3. Code and more details here:
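Roughly, the few-shot prompt assembly looks like this (a hedged sketch with made-up template names and examples, not the actual Jester code):

```python
def build_prompt(template_name, examples, user_text):
    """examples: list of (user_prompt, meme_caption) pairs labelled for this template."""
    lines = [f"Write a caption for the '{template_name}' meme template."]
    for prompt, caption in examples:
        lines.append(f"Input: {prompt}\nCaption: {caption}")
    lines.append(f"Input: {user_text}\nCaption:")   # the model completes this last caption
    return "\n\n".join(lines)

examples = [
    ("when the deadline is tomorrow", "Me: I work best under pressure"),
    ("finally fixed the bug", "It was a missing semicolon all along"),
]
print(build_prompt("Distracted Boyfriend", examples, "trying a new framework mid-project"))
```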
(3/n)
on GPT-4:
- Too fine-grained a tokenization has downstream effects on performance. On the task of toxicity detection for different translations, Indic languages had the lowest performance with Meta's No Language Left Behind models.
@paulg
on thinking the unthinkable:
"If you can think things so outside the box that they'd make people's hair stand on end, you'll have no trouble with the small trips outside the box that people call innovative."
@pHequals7
@ShitUserStory
Just use Lists. Everything's more organised and you can stay sane. Only issue is that you don't see the account's likes, which you can somewhat get around with more Lists :)
This was a course project under
@yuqirose
, and we had essentially 8 weeks. One thing I still want to do is to make the whole thing run on the browser - Streamlit is good but still slow, and concurrency is a question mark.
[5/8]
5. Puzzles: Some simple puzzles to get you thinking about pre-tokenization, vocabulary size, etc. We revisit
@Thom_Wolf
's tokenizer puzzle on vocabulary size.
Amazing text to music generations from
@suno_ai_
, could easily see these taking over leaderboards.
Personal favorite: this song I fished out of their Discord a few months ago, "Return to Monkey", which has been stuck in my head since :D
[00:57]
I wanna return to monkey, I…
"Social media platforms like Twitter amplify expressions of moral outrage over time, because users learn such language gets rewarded with an increased number of 'likes' and 'shares,' a new Yale University study shows."
(2/n)
- Too fine-grained a tokenization is bad. You need a large vocab size DEDICATED to that language.
- Current models have large vocab sizes but dedicate little to languages like Hindi. GPT-4 has a 100K vocab, but still uses 5 times more tokens for Hindi than for English.
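To see the gap concretely, a small hedged illustration with GPT-4's cl100k_base vocabulary via tiktoken (exact ratios vary by sentence):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4's ~100K vocabulary

english = "How are you doing today?"
hindi = "आज आप कैसे हैं?"

print("en tokens:", len(enc.encode(english)))
print("hi tokens:", len(enc.encode(hindi)))
# Hindi typically comes out several times longer in tokens than the English equivalent.
```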
@itsclivetime
@jeremyphoward
Tbh Accelerate is meant to be the opposite (very general), and does a pretty good job of providing a unified interface to switch between different distributed training strategies. By default, there will be json/yaml hell involved because it wraps around FSDP and DeepSpeed. You…
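For context, this is roughly what the unified interface looks like (a minimal sketch; the distributed backend is picked via `accelerate config`, which is where the JSON/YAML comes in):

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader
from accelerate import Accelerator

accelerator = Accelerator()

# Toy model and data just to show the loop shape.
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
dataloader = DataLoader([(torch.randn(10), torch.tensor(0)) for _ in range(32)], batch_size=8)

# The same loop runs under DDP, FSDP, or DeepSpeed depending on the config.
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for x, y in dataloader:
    loss = F.cross_entropy(model(x), y)
    accelerator.backward(loss)   # replaces loss.backward() so every backend works
    optimizer.step()
    optimizer.zero_grad()
```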
Google Flights and YouTube plugins seem to just work, at least for the current set of features. Sometimes you do have problems like incorrect retrieval (wrong flight price, for example), but overall pretty good and helpful!
Meditations on Violence
"In this ritual, members of a group compete for status and show their loyalty by how vicious they can be to an 'outsider.' Pleading, fighting, passivity will be interpreted as proof of 'otherness' and justification to escalate."
Reminds me of what firefighter Paul Gleason said about his crew leadership: It's not decision making, it's sensemaking.
"If I make a decision,...I take pride in it....and not listen to those who question it...If I make sense, then this is more dynamic...and I can change it."