Sumanth Hegde

@sumanthrh

Followers: 785
Following: 18
Media: 48
Statuses: 225

MS CS @UCSanDiego. Previously, GenAI @C3_AI. EE @iitmadras. Machine Learning and Systems. Intensity is all you need.

San Francisco
Joined February 2016
@sumanthrh
Sumanth Hegde
7 months
I've done a deep dive into distributed training and efficient fine-tuning of LLMs. I get into the messy internals of DeepSpeed ZeRO and FSDP, summarize practical guidelines and highlight gotchas with multi-GPU training. Do read, should be fun!
13
166
930
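The linked post isn't reproduced here, but for flavor, a minimal FSDP sketch (model, sizes, and launch command are illustrative assumptions, not from the post):

```python
# Launch with: torchrun --nproc_per_node=8 train.py
import os
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

torch.distributed.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# Toy stand-in for a real LLM.
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).cuda()
model = FSDP(model)  # shards parameters, gradients, and optimizer state across ranks
optim = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(8, 1024, device="cuda")
loss = model(x).pow(2).mean()
loss.backward()  # gradients are reduce-scattered to the rank owning each shard
optim.step()     # each rank updates only its own parameter shard
```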
@sumanthrh
Sumanth Hegde
5 months
New deep dive on tokenization! 🔎 I've done a deep dive into the various aspects of tokenization, organized as bite-size chapters, with code and notebooks. Check it out:
4
48
312
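A taste of what such a deep dive pokes at (example mine, not from the post): the same string tokenizes very differently across tokenizers.

```python
from transformers import AutoTokenizer

for name in ["gpt2", "bert-base-uncased"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(name, tok.tokenize("Tokenization isn't trivial!"))
# gpt2 uses byte-level BPE; BERT uses WordPiece, with ## marking continuations.
```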
@sumanthrh
Sumanth Hegde
6 months
Some wild results here. They fine-tuned models on the auxiliary training sets for different benchmarks. Trends for Phi-1.3B are so bizarre. From my understanding, there's no other data mixed in here for fine-tuning (apart from benchmark training data), and trends for ALL other…
Tweet media one
@agihippo
yi 🦛
6 months
Haha! The silver lining of all the LLM noise is that you get papers like this that you typically wouldn't see in academia. 😂 LLMs are entering the era of "natty or not".
3
7
127
1
2
29
@sumanthrh
Sumanth Hegde
7 months
I'm not very religious, but I do like to read the Holy Scriptures
Tweet media one
1
0
21
@sumanthrh
Sumanth Hegde
5 months
@abacaj Spam classification of 20 word emails using Falcon-180B
1
1
21
@sumanthrh
Sumanth Hegde
6 months
🤗PEFT contributors list after the new release. Screenshotting this before it changes :p
Tweet media one
0
1
12
@sumanthrh
Sumanth Hegde
5 months
@thesephist This is the expanded neuron view in BertViz. FYI, BertViz has been around since 2019, but the visualization needs access to the query and key vectors etc., which requires custom architecture-specific code, so the package only supported BERT, RoBERTa and GPT-2
1
1
13
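A hedged usage sketch of BertViz's head view, which works for any model that can return attentions; the neuron view discussed above instead requires BertViz's bundled model classes, since it needs the raw query/key vectors:

```python
from transformers import AutoModel, AutoTokenizer
from bertviz import head_view

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("The cat sat on the mat", return_tensors="pt")
outputs = model(**inputs)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
head_view(outputs.attentions, tokens)  # renders an interactive view in a notebook
```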
@sumanthrh
Sumanth Hegde
7 months
In particular, some of the questions I focus on are: 1. Why focus on distributed training and performance? What happens under the hood with DeepSpeed and FSDP? 2. What hardware setup do I need? 3. What are the various efficient finetuning optimizations? What are the tradeoffs?
1
0
12
@sumanthrh
Sumanth Hegde
1 year
I absolutely did not expect to see people like @realGeorgeHotz and @TJEvarts build Twitter features in real-time on Twitter. Just mindblowing stuff.
2
0
9
@sumanthrh
Sumanth Hegde
7 months
4. What are some practical guidelines that can capture all the major training optimizations, in order to train large models in a multi-GPU and multi-node setting? 5. What open-source codebases can I use right now? What are the pros and cons?
1
0
9
@sumanthrh
Sumanth Hegde
6 months
@jeremyphoward Now we just need Epic Rap Battles @ERBofHistory to pick this up!
1
0
8
@sumanthrh
Sumanth Hegde
5 months
@charles_irl The OG staff engineer
Tweet media one
1
0
9
@sumanthrh
Sumanth Hegde
7 months
A lot of the details are also based on @StasBekman 's investigations. My goal has been to elaborate on the latest distributed training strategies like DeepSpeed and FSDP, and make an updated list of practical guidelines for the current state of the @huggingface ecosystem.
3
0
8
@sumanthrh
Sumanth Hegde
7 months
There are a number of murky areas still with distributed training with @huggingface , esp. when it comes to DeepSpeed vs FSDP. I have experience mainly with just DeepSpeed, but it'll be interesting to hear the takes of those who've worked with both such as @sourab_m @abacaj
0
0
8
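For reference, a minimal sketch of wiring DeepSpeed ZeRO-2 into the @huggingface Trainer (config values are illustrative assumptions, not recommendations from the thread):

```python
from transformers import TrainingArguments

ds_config = {
    "zero_optimization": {
        "stage": 2,  # shard optimizer state and gradients across ranks
        "offload_optimizer": {"device": "cpu"},  # optional CPU offload
    },
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": "auto",  # let the Trainer fill this in
}

args = TrainingArguments(output_dir="out", bf16=True, deepspeed=ds_config)
```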
@sumanthrh
Sumanth Hegde
7 months
Doing a graduate PL course this quarter (my first PL course; better late than never) and lambda calculus is weird but so.... elegant. Every programming concept/building block is simply its affordance, nothing more. What something is, is exactly what you can do with it.
1
0
8
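A concrete instance of that idea (my example, not from the course): a Church boolean is nothing but its ability to choose between two branches.

```latex
% Church booleans: a boolean *is* the act of picking one of two arguments.
\mathsf{true}  \equiv \lambda x.\,\lambda y.\,x \qquad
\mathsf{false} \equiv \lambda x.\,\lambda y.\,y \qquad
\mathsf{if}    \equiv \lambda p.\,\lambda a.\,\lambda b.\;p\,a\,b
% so that: if true a b ->_beta a, and if false a b ->_beta b
```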
@sumanthrh
Sumanth Hegde
7 months
This is aimed at a broad audience, but more relevant for LLM hackers trying to up their game, and thus is written in the same spirit as @jeremyphoward 's hacker guides. This should also be relevant for startups/ companies just getting into fine-tuning open-source LLMs.
1
0
8
@sumanthrh
Sumanth Hegde
7 months
Went a little overboard and got a new MacBook Pro with M2 Max and 64GB memory. Can't wait to use all this power to scroll through tweets.
Tweet media one
0
0
7
@sumanthrh
Sumanth Hegde
7 months
@KennethCassel @yacineMTB Git blame -> Git denial -> Git rage -> Git acceptance
0
1
7
@sumanthrh
Sumanth Hegde
7 months
The full post is 7000+ words long! The current open-source documentation is very messy, so hopefully this fills some of the gaps in understanding what matters for performance and scaling when you're fine-tuning language models.
1
0
6
@sumanthrh
Sumanth Hegde
5 months
Quantum computers would work instantly with this one simple hack: end-to-end fabrication. Fabricate chip -> Fabricate experiments -> Fabricate results.
0
0
6
@sumanthrh
Sumanth Hegde
4 months
Tweet media one
0
0
6
@sumanthrh
Sumanth Hegde
1 year
A fun project I worked on over the past few weeks: Jester - a Text-to-Meme generation engine! Enter any text you like and you can make a meme in a matter of seconds. Try it out here:
1
0
5
@sumanthrh
Sumanth Hegde
7 months
Gentle Reminder
Tweet media one
0
0
4
@sumanthrh
Sumanth Hegde
5 months
Question: Has there been any study showing that increasing vocab size 2x (50K -> 100K) leads to an x% improvement in fertility/compression on the SAME corpus, like CommonCrawl? The different vocab sizes across models (32K Llama, 50K GPT-2, 100K GPT-4) are not directly comparable because the…
Tweet media one
@andrew_n_carr
Andrew Carr (e/🤸)
5 months
There's a weird reality that we mostly ignore in language modeling. It's the fact that we don't _actually_ train these models end-to-end. That's because we have the tokenizer! It's actually a really frustrating piece to tune with sometimes small changes mattering a lot and…
Tweet media one
19
18
235
0
0
5
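The kind of measurement the question asks for, sketched (tokenizer choices and corpus are illustrative stand-ins):

```python
from transformers import AutoTokenizer

# Stand-in corpus; a real study would fix a CommonCrawl sample.
corpus = ["The quick brown fox jumps over the lazy dog."] * 100

def fertility(name: str) -> float:
    tok = AutoTokenizer.from_pretrained(name)
    n_tokens = sum(len(tok.encode(s)) for s in corpus)
    n_words = sum(len(s.split()) for s in corpus)
    return n_tokens / n_words  # tokens per whitespace-separated word

print("gpt2 (~50K vocab):  ", fertility("gpt2"))
print("bloom (~250K vocab):", fertility("bigscience/bloom"))
```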
@sumanthrh
Sumanth Hegde
5 months
Sad to report I can no longer enjoy the fights in Jujutsu Kaisen. I just don't want to be hurt again.😭
1
0
4
@sumanthrh
Sumanth Hegde
6 months
Personal trauma? Na. That's just my heavenly restriction.
Tweet media one
0
1
4
@sumanthrh
Sumanth Hegde
5 months
@abacaj GenAI strategist resisting the urge to say "in-context learning" and "Large Language Models" instead of just fine-tuning DeBERTa on the million-odd training data they have (challenge impossible)
0
0
4
@sumanthrh
Sumanth Hegde
7 months
@cHHillee @jeremyphoward Perfect! One small note - the second all-gather in ZeRO-1/2 is an all-gather of the updated model parameters, unlike the equivalent all-gather stage of the all-reduce in DDP, which gathers gradients.
1
0
4
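A toy illustration of that collective-call difference (my sketch, assuming torch >= 2.0 and two GPUs; run with `torchrun --nproc_per_node=2 zero_vs_ddp.py`):

```python
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank())
world = dist.get_world_size()

grad = torch.ones(4 * world, device="cuda")  # pretend full gradient on every rank

# DDP: one all-reduce; every rank ends up with the full summed gradient.
ddp_grad = grad.clone()
dist.all_reduce(ddp_grad)

# ZeRO-1/2: reduce-scatter gradients so each rank owns one shard...
shard = torch.empty(4, device="cuda")
dist.reduce_scatter_tensor(shard, grad.clone())
updated = shard - 0.1 * shard  # stand-in for the optimizer step on that shard

# ...then all-gather the *updated parameters*, not gradients.
params = torch.empty(4 * world, device="cuda")
dist.all_gather_into_tensor(params, updated)
```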
@sumanthrh
Sumanth Hegde
10 months
In other news, I'm happy to have played a small part in the new PEFT release by adding IA3! IA3 is a highly memory-efficient PEFT method that has LoRA-like performance! Was also my first open-source contribution so that was fun. PR here:
0
0
3
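A hedged usage sketch of IA3 through 🤗 PEFT (the target modules below are illustrative and depend on the model architecture):

```python
from peft import IA3Config, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
config = IA3Config(
    task_type="CAUSAL_LM",
    target_modules=["c_attn", "mlp.c_proj"],  # where to inject learned scaling vectors
    feedforward_modules=["mlp.c_proj"],       # modules IA3 treats as feed-forward
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only the tiny (IA)^3 vectors are trainable
```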
@sumanthrh
Sumanth Hegde
6 months
@Thom_Wolf Range: How Generalists Triumph in a Specialized World by @DavidEpstein
0
0
1
@sumanthrh
Sumanth Hegde
5 months
Magic!
@Ror_Fly
Rory Flynn
5 months
Motion brush in runway...damn good. Much better control. This was very needed. #runwayml #midjourneyV52 #AIArtCommuity #aiart
89
1K
8K
0
0
3
@sumanthrh
Sumanth Hegde
5 months
Super interesting release! A brief summary of why this language-specific focus is much needed, plus some comparison with Llama-2-7B-chat: - Indic languages have complex script rules. Predominantly fusional, and sometimes agglutinative - add a vowel to change gender, tense, etc. 🧵
@SarvamAI
Sarvam AI
5 months
🚨New model alert🚨 We're super excited to release OpenHathi-Hi-v0.1, the first Hindi LLM from our OpenHathi series of models. This model is trained under compute and data constraints to show that we can get GPT-3.5-like performance on Indic languages with a frugal budget. 1/5
40
109
707
1
0
4
@sumanthrh
Sumanth Hegde
7 months
@StasBekman Will correct! Looks like I used the bandwidth mentioned for Inf instances by mistake. I also want to add a reference to the AWS docs for completeness. I believe this is from the P4d instance (the one for training) docs: Is that right?
1
0
3
@sumanthrh
Sumanth Hegde
3 years
The internet democratised the means to reach a wide audience, but the result was that salaries became more long-tailed, not less. Not surprising if you see that before, people had to make do with what was available locally. The internet removed that restriction.
0
0
2
@sumanthrh
Sumanth Hegde
5 years
@monteskw @pHequals7 If people were not "educated enough", they would have chosen the 72000 Rs. promised by the Congress happily. They. Didn't. Don't underestimate the average Indian voter.
0
0
1
@sumanthrh
Sumanth Hegde
3 years
Ok these self-help videos on YouTube are going too far...
Tweet media one
0
0
2
@sumanthrh
Sumanth Hegde
3 years
Have been on the Cowin portal for probably 3 hours today, but still got nothing. Will do Saraswathi Devi pooja and try again tomorrow.
0
0
2
@sumanthrh
Sumanth Hegde
8 months
Can't wait!!
Tweet media one
0
0
3
@sumanthrh
Sumanth Hegde
3 years
@pHequals7 Conan O'Brien? bruh
Tweet media one
1
0
1
@sumanthrh
Sumanth Hegde
5 months
[7/8] 7. Galactica: Thinking about tokenizer design by diving into the Galactica paper from Meta AI.
1
0
3
@sumanthrh
Sumanth Hegde
5 months
The "Bitter" lesson of Energy drink scaling laws - just add more caffeine (+sugar)
@johncoogan
John Coogan
5 months
the data is clear someone is going to release an energy drink with 600mg of caffeine in the next 5 years. if this scaling law holds, we’ll be over 1200mg of caffeine by 2036. the future is going to be insane 🚀
Tweet media one
228
169
3K
1
0
4
@sumanthrh
Sumanth Hegde
8 months
I finally got to go over @ilyasut 's talk at Simons Institute this week, and it was brilliant! In the same spirit as @DrJimFan 's excellent summary, my detailed notes are here: Unlike most other lectures in the LLM workshop where Ilya presented, there…
@DrJimFan
Jim Fan
9 months
There're few who can deliver both great AI research and charismatic talks. OpenAI Chief Scientist @ilyasut is one of them. I watched Ilya's lecture at Simons Institute, where he delved into why unsupervised learning works through the lens of compression. Sharing my notes: -…
55
431
3K
0
0
2
@sumanthrh
Sumanth Hegde
5 months
[4/8] 4. Challenges with Tokenization: Challenges with integer tokenization, tokenization for non-English languages and going multilingual, with a focus on the recent No Language Left Behind (NLLB) effort from Meta.
1
0
2
@sumanthrh
Sumanth Hegde
5 months
[2/8] 1. Intro: A quick introduction on tokens and the different tokenization algorithms out there. 2. BPE: A closer look at the Byte-Pair Encoding tokenization algorithm. We'll also go over a minimal implementation for training a BPE model.
1
0
1
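In that spirit, a minimal Sennrich-style BPE trainer (a sketch, not the notebook's code): repeatedly merge the most frequent adjacent symbol pair.

```python
import re
from collections import Counter

def pair_counts(vocab: dict[str, int]) -> Counter:
    pairs = Counter()
    for word, count in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += count
    return pairs

def merge_pair(pair: tuple[str, str], vocab: dict[str, int]) -> dict[str, int]:
    # Merge only where the pair occurs as whole, adjacent symbols.
    pat = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pat.sub("".join(pair), w): c for w, c in vocab.items()}

def train_bpe(word_counts: dict[str, int], num_merges: int) -> list[tuple[str, str]]:
    vocab = {" ".join(w): c for w, c in word_counts.items()}  # split into characters
    merges = []
    for _ in range(num_merges):
        pairs = pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges

print(train_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 10))
```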
@sumanthrh
Sumanth Hegde
6 months
Is the Falcon-40b paper still not out? I'm curious about how they trained their tokenizer. Fun fact: All integers from 0-999 get segmented as 1 token, except 957, which gets 2 tokens (??).
0
0
2
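The fun fact is easy to check (hedged: this downloads the Falcon tokenizer, and per the tweet it should print [957]):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("tiiuae/falcon-40b")
multi_token_ints = [i for i in range(1000) if len(tok.encode(str(i))) > 1]
print(multi_token_ints)
```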
@sumanthrh
Sumanth Hegde
7 months
@hazemessamm Good catch! Will correct. Thanks!
0
0
2
@sumanthrh
Sumanth Hegde
8 months
Went for a run after a long break. Struggled at the end, but after a quick glance around, I yelled "Who's gonna carry the boats?". Can confirm that's an instant 20% stamina boost.
0
0
2
@sumanthrh
Sumanth Hegde
1 year
Something to add to this: As put by @nntaleb , we are anticipation machines. Our ability to mentally project and contemplate different actions without performing them allows us to cheat evolution. The same circuits also seem to interfere with our emotional health?
@brianpagan
Brian Pagán 🧠+💚
1 year
A "human mind is a wandering mind, and a wandering #mind is an unhappy mind." 🧠 This @ScienceMagazine study shows that we're happiest when we're present with what we're doing in the moment. #Science & #mindfulness via @hubermanlab
2
15
128
0
0
2
@sumanthrh
Sumanth Hegde
1 year
Some example creations are below. Jester is currently trained to make memes from 100 templates.
Tweet media one
1
0
2
@sumanthrh
Sumanth Hegde
7 months
@StasBekman Got it! This is very helpful. Thank you!!
1
0
2
@sumanthrh
Sumanth Hegde
5 months
@thesephist Np! Tbh I've been trying to find what I can use to visualize neurons/attention with in-context examples - Ex: which few shot example mattered more in the prompt (do you know any :p). Per-token view of BertViz crashes my notebook if you have hundreds of tokens (and with 100s of…
0
0
2
@sumanthrh
Sumanth Hegde
5 months
The only prerequisite assumed is a basic introduction to NLP/language models, such as @jeremyphoward 's LLM for hackers intro or @JayAlammar 's videos on LLM tokenizers. I've tried to create a complementary resource to @huggingface 's NLP course. Topics Covered:
1
0
2
@sumanthrh
Sumanth Hegde
4 months
@thesephist I'm afraid to admit I'm somewhere between 3 and 4 myself.
0
0
2
@sumanthrh
Sumanth Hegde
9 months
Gonna be world-class at watching Dark Knight Rises reruns on WB
Tweet media one
1
0
2
@sumanthrh
Sumanth Hegde
2 years
Sir Humphrey Appleby strikes again! Context:
@mattyglesias
Matthew Yglesias
2 years
Smokers have higher per-year medical expenses but they reduce Medicare costs on net by dying earlier.
Tweet media one
51
76
826
0
0
2
@sumanthrh
Sumanth Hegde
3 years
I find it hilarious that one of the few accounts with nuanced takes (at least for now) and wholesome tweets is a gummy bear.
0
0
2
@sumanthrh
Sumanth Hegde
5 years
Guest: Jimmy Fallon:
@nitarinDX
にたり🦈
5 years
My recent source of comfort
188
34K
81K
0
0
1
@sumanthrh
Sumanth Hegde
7 months
Okay I tried Bard after quite some time, and now with extensions, and it's actually very good! I tried out some queries to get information from YouTube, Flights, Gmail and regular Search. It's very fast, and mostly works. From initial testing, some limitations/issues 🧵
@pararths
pararth
7 months
🚨Bard feature drop: Extensions! - connects with Google services - retrieves from, reasons over and composes APIs - powered by PaLM 2 First step in the long arc of shipping LLM-powered 'agents' in consumer-grade products Far from done but excited to share this with the 🌎 🧵
12
15
113
1
1
2
@sumanthrh
Sumanth Hegde
5 years
@pHequals7 @ewarren I think her main problem is with the fact that they operate the iOS App Store while also distributing their own apps on the platform.
0
0
0
@sumanthrh
Sumanth Hegde
5 months
(n/n) However, if the model first outputs English text, and then translates the answer to Hindi, it outperforms GPT-4 on (Hindi-translated) MT-bench - mainly because of Llama-2-7B's English-heavy pretraining. This is thinking in English, and speaking in Hindi!
0
0
0
@sumanthrh
Sumanth Hegde
1 year
How it works: We trained a Transformer model to classify user text and show 10 relevant templates. For each template, we labelled (manually!!) 5 example (user prompt, meme caption) pairs to make a custom prompt for GPT-3. Code and more details here:
1
0
1
@sumanthrh
Sumanth Hegde
5 months
@charles_irl Me after seeing the name Raskell
Tweet media one
0
0
1
@sumanthrh
Sumanth Hegde
5 months
(3/n) on GPT-4: - Too fine-grained a tokenization has downstream effects on performance. On the task of toxicity detection for different translations, Indic languages had the lowest performance with Meta's No Language Left Behind models.
1
0
0
@sumanthrh
Sumanth Hegde
7 months
@StasBekman @huggingface Oh yes, I almost forgot! Will update and mention it for completeness' sake!
0
0
1
@sumanthrh
Sumanth Hegde
3 years
@paulg on thinking the unthinkable: "If you can think things so outside the box that they'd make people's hair stand on end, you'll have no trouble with the small trips outside the box that people call innovative."
1
0
1
@sumanthrh
Sumanth Hegde
3 years
@pHequals7 @ShitUserStory Just use Lists. Everything's more organised and you can stay sane. Only issue is that you don't see the account's likes, which you can somewhat get around with more Lists :)
0
0
1
@sumanthrh
Sumanth Hegde
1 year
This was a course project under @yuqirose , and we had essentially 8 weeks. One thing I still want to do is to make the whole thing run on the browser - Streamlit is good but still slow, and concurrency is a question mark.
0
0
1
@sumanthrh
Sumanth Hegde
5 months
[5/8] 5. Puzzles: Some simple puzzles to get you thinking about pre-tokenization, vocabulary size, etc. We revisit @Thom_Wolf 's tokenizer puzzle on vocabulary size.
1
0
1
@sumanthrh
Sumanth Hegde
4 months
I wanna return to monke
@karpathy
Andrej Karpathy
4 months
Amazing text to music generations from @suno_ai_ , could easily see these taking over leaderboards. Personal favorite: this song I fished out of their Discord a few months ago, "Return to Monkey", which has been stuck in my head since :D [00:57] I wanna return to monkey, I…
156
260
2K
0
0
1
@sumanthrh
Sumanth Hegde
3 years
"Show me the incentive and I will show you the outcome" - Charlie Munger
@SteveStuWill
Steve Stewart-Williams
3 years
"Social media platforms like Twitter amplify expressions of moral outrage over time, because users learn such language gets rewarded with an increased number of 'likes' and 'shares,' a new Yale University study shows."
97
573
2K
0
0
1
@sumanthrh
Sumanth Hegde
5 months
(2/n) - Too fine-grained a tokenization is bad. You need large vocab sizes DEDICATED to that language. - Current models have large vocab sizes but little dedicated to languages like Hindi. GPT-4 has a 100K vocab, but still produces 5 times more tokens for Hindi than for English.
1
0
0
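A quick way to see that ratio yourself with tiktoken (the sentence pair is an illustrative assumption; cl100k_base is GPT-4's encoding):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
en = "What is your name?"
hi = "आपका नाम क्या है?"  # the same question in Hindi
print(len(enc.encode(en)), len(enc.encode(hi)))  # Hindi costs several times more tokens
```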
@sumanthrh
Sumanth Hegde
7 months
@TheZachMueller @huggingface Thanks @TheZachMueller ! Means a lot coming from folks like you! 🙂
0
0
1
@sumanthrh
Sumanth Hegde
7 months
@SuhasPai5 @kc_srk Nice! We're gonna be using Haskell!
0
0
1
@sumanthrh
Sumanth Hegde
3 years
Eerie similarity with Black Mirror's "Nosedive" episode. cc @jposhaughnessy @lukeburgis
@TheSwaddle
TheSwaddle
3 years
“My life depends on ratings and they’ve understood that." @romimacaronii
0
17
38
0
0
1
@sumanthrh
Sumanth Hegde
3 years
All my well-crafted career plans have just been rendered useless...
Tweet media one
0
0
1
@sumanthrh
Sumanth Hegde
5 months
@Suhail Perfect marketing strategy for @amazon
0
0
0
@sumanthrh
Sumanth Hegde
1 year
Nothing like a cold shower to make you recollect all the wonderful swear words you know.
0
0
1
@sumanthrh
Sumanth Hegde
1 year
@317070 @PaulTopping How is it that chatGPT can seemingly browse the internet? Get the right IP address for BBC, etc.
2
0
1
@sumanthrh
Sumanth Hegde
7 months
@itsclivetime @jeremyphoward Tbh Accelerate is meant to be the opposite (very general), and does a pretty good job of providing a unified interface to switch between different distributed training strategies. By default, there will be json/yaml hell involved because it wraps around FSDP and DeepSpeed. You…
1
0
1
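A minimal Accelerate loop for flavor (my sketch): the same code runs under DDP, FSDP, or DeepSpeed depending on how `accelerate config`/the launcher is set up.

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()
model = torch.nn.Linear(10, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
model, optimizer = accelerator.prepare(model, optimizer)

x = torch.randn(4, 10, device=accelerator.device)
loss = model(x).pow(2).mean()
accelerator.backward(loss)  # replaces loss.backward() so grad handling matches the backend
optimizer.step()
```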
@sumanthrh
Sumanth Hegde
7 months
Still going through @elonmusk 's biography and one of my favourite bits has to be when @WalterIsaacson summarizes 8 key life lessons, in the same style as The Algorithm, that @elonmusk , @kimbal , @shivon and @Grimezsz learnt, all from playing Polytopia.
0
0
1
@sumanthrh
Sumanth Hegde
3 years
@pHequals7 @SuhasPai5 @pranavcondur4 Waiting for @SuhasPai5 to say "I'm literally the guy in this tweet".
0
0
1
@sumanthrh
Sumanth Hegde
7 months
Google Flights and YouTube plugins seem to just work, at least for the current set of features. Sometimes you do have problems like incorrect retrieval (wrong flight price, for example), but overall pretty good and helpful!
1
0
1
@sumanthrh
Sumanth Hegde
3 years
@robkhenderson The Group Monkey Dance!
@robkhenderson
Rob Henderson
3 years
Meditations on Violence "In this ritual, members of a group compete for status and show their loyalty by how vicious they can be to an 'outsider.' Pleading, fighting, passivity will be interpreted as proof of 'otherness' and justification to escalate."
Tweet media one
10
69
465
0
0
0
@sumanthrh
Sumanth Hegde
3 years
0
0
1
@sumanthrh
Sumanth Hegde
3 years
Reminds me of what firefighter Paul Gleason said about his crew leadership: It's not decision making, it's sensemaking. "If I make a decision,...I take pride in it....and not listen to those who question it...If I make sense, then this is more dynamic...and I can change it."
@jposhaughnessy
Jim O'Shaughnessy
3 years
And it's a skill that will become more and more valuable as we emerge from the Great Reshuffle
1
0
15
1
0
1
@sumanthrh
Sumanth Hegde
1 year
"nuclear exchange".. that's like saying bomber aircrafts perform "combustible package delivery"
@Austen
Austen Allred
1 year
OK we may be getting just a touch ahead of ourselves here
Tweet media one
20
9
165
0
0
1