Amirkeivan Mohtashami
@akmohtashami_a
Followers
186
Following
49
Media
4
Statuses
31
We are working on quantizing #Llama3 using #QuaRot. I got some interesting results on WikiText PPL for the FP16 models:
(Seq Len=2048) LLaMa2-7B: 5.47, LLaMa3-8B: 6.14
(Seq Len=4096) LLaMa2-7B: 5.11, LLaMa3-8B: 5.75
Maybe WikiText PPL is not a great metric to report anymore!
[1/7] Happy to release 🥕QuaRot, a post-training quantization scheme that enables 4-bit inference of LLMs by removing the outlier features. With @akmohtashami_a @max_croci @DAlistarh @thoefler @jameshensman and others Paper: https://t.co/u3OMOyc78O Code: https://t.co/RsN34zmriI
0
2
17
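Not the evaluation script behind those numbers, but for context, WikiText perplexity at a fixed sequence length is usually computed roughly like this. The checkpoint name, dataset split, and non-overlapping chunking below are illustrative assumptions, not the setup used for the figures above.

```python
# Minimal WikiText-2 perplexity sketch for a causal LM (illustrative only).
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint
seq_len = 2048

tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto")
model.eval()

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids

nlls = []
for start in range(0, ids.size(1) - seq_len, seq_len):
    chunk = ids[:, start:start + seq_len].to(model.device)
    with torch.no_grad():
        # labels=chunk makes the model return the mean next-token NLL for this chunk
        nlls.append(model(chunk, labels=chunk).loss.float())
ppl = torch.exp(torch.stack(nlls).mean())
print(f"WikiText-2 PPL @ seq_len={seq_len}: {ppl.item():.2f}")
```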
If you haven't seen it yet, Mixture-of-Depths is a really nice idea for dynamic compute. I decided to quickly code up a MoD block in a small GPT and try it out -- if you want to play with it too (and check correctness, please!), the code is here: https://t.co/2RcEHmTyzU
Why Google DeepMind's Mixture-of-Depths paper, and more generally dynamic compute methods, matter: Most of the compute is WASTED because not all tokens are equally hard to predict
5
51
229
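A rough sketch of what such a MoD-style block can look like, based on my reading of the idea rather than the paper's or the linked repo's exact implementation: a router scores tokens, only the top-k per sequence go through the block's computation, and the rest pass through the residual stream unchanged. The class name, capacity value, and stand-in block are assumptions.

```python
import torch
import torch.nn as nn

class MoDBlock(nn.Module):
    def __init__(self, block: nn.Module, d_model: int, capacity: float = 0.125):
        super().__init__()
        self.block = block              # any (B, T, C) -> (B, T, C) transformer block
        self.router = nn.Linear(d_model, 1)
        self.capacity = capacity        # fraction of tokens that get full compute

    def forward(self, x):
        B, T, C = x.shape
        k = max(1, int(self.capacity * T))
        scores = self.router(x).squeeze(-1)                        # (B, T) routing scores
        topk = scores.topk(k, dim=-1).indices.sort(dim=-1).values  # keep sequence order
        idx = topk.unsqueeze(-1).expand(-1, -1, C)                 # (B, k, C) gather indices
        selected = x.gather(1, idx)                                # tokens that receive compute
        # Scale by the router score so routing stays differentiable.
        processed = self.block(selected) * torch.sigmoid(
            scores.gather(1, topk)).unsqueeze(-1)
        # Selected positions get residual + block output; all others keep x unchanged.
        return x.scatter(1, idx, selected + processed)

# Tiny usage example with a stand-in MLP block.
blk = nn.Sequential(nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 64))
mod = MoDBlock(blk, d_model=64, capacity=0.25)
print(mod(torch.randn(2, 16, 64)).shape)  # torch.Size([2, 16, 64])
```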
[1/7] Happy to release 🥕QuaRot, a post-training quantization scheme that enables 4-bit inference of LLMs by removing the outlier features. With @akmohtashami_a @max_croci @DAlistarh @thoefler @jameshensman and others Paper: https://t.co/u3OMOyc78O Code: https://t.co/RsN34zmriI
7
63
304
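The core trick, as I understand it from the QuaRot paper, is computational invariance: inserting an orthogonal rotation spreads outlier features across channels without changing the network's function, which makes 4-bit quantization much better behaved. A toy Python illustration of that effect (naive per-tensor 4-bit quantizer, random QR-based rotation; this is not the released QuaRot code):

```python
import torch

def quantize_4bit(x):
    # naive symmetric per-tensor 4-bit quantization
    scale = x.abs().max() / 7.0
    return torch.clamp((x / scale).round(), -8, 7) * scale

torch.manual_seed(0)
x = torch.randn(128, 512)
x[:, 3] *= 50.0                        # inject an outlier feature/channel
W = torch.randn(512, 512) / 512**0.5

Q, _ = torch.linalg.qr(torch.randn(512, 512))     # random orthogonal matrix, Q @ Q.T = I

y_ref = x @ W
y_plain = quantize_4bit(x) @ W                    # quantize raw activations
y_rot = quantize_4bit(x @ Q) @ (Q.T @ W)          # rotate, quantize, fold Q into the weight

print("relative error w/o rotation :", ((y_plain - y_ref).norm() / y_ref.norm()).item())
print("relative error with rotation:", ((y_rot - y_ref).norm() / y_ref.norm()).item())
```

Because `x @ Q @ Q.T @ W == x @ W` exactly, the rotation is free in terms of model quality; the only change is that the outlier channel no longer dominates the quantization scale.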
Skip connections are not enough! We show that providing the individual outputs of previous layers to each Transformer layer significantly boosts its performance. See the thread for more! Had an amazing time collaborating with @MatPagliardini, @francoisfleuret, and Martin Jaggi.
A tweak in the architecture of #Transformers can significantly boost accuracy! With direct access to all previous blocks’ outputs, a 48-block #DenseFormer outperforms a 72-block Transformer, with faster inference! A work with @akmohtashami_a, @francoisfleuret, Martin Jaggi. 1/🧵
0
1
18
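A minimal sketch of how "direct access to all previous blocks' outputs" can look in code, assuming a depth-weighted-average reading of the tweet; the class name, initialization, and stand-in blocks are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DenseFormerStack(nn.Module):
    def __init__(self, blocks: nn.ModuleList):
        super().__init__()
        self.blocks = blocks
        # alphas[i] weights the embedding output and the outputs of blocks 0..i
        self.alphas = nn.ParameterList(
            [nn.Parameter(torch.zeros(i + 2)) for i in range(len(blocks))]
        )
        for a in self.alphas:
            a.data[-1] = 1.0   # start out equivalent to a plain stack of blocks

    def forward(self, x):
        history = [x]                              # embedding, then each block's output
        for i, block in enumerate(self.blocks):
            history.append(block(x))
            w = self.alphas[i]
            # learned weighted combination over everything computed so far
            x = sum(w[j] * h for j, h in enumerate(history))
        return x

# Tiny usage example with stand-in blocks.
blocks = nn.ModuleList(
    [nn.Sequential(nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 64)) for _ in range(4)]
)
stack = DenseFormerStack(blocks)
print(stack(torch.randn(2, 16, 64)).shape)  # torch.Size([2, 16, 64])
```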
People often teach one another by simply explaining a problem using natural language. Today we introduce an approach for model training wherein a teacher #LLM generates natural language instructions to train a student model with improved privacy. https://t.co/BlhsThCuJJ
35
177
711
Why does AdamW outperform Adam with L2-regularization? Its effectiveness seems to stem from how it affects the angular update size of weight vectors! This may also be the case for Weight Standardization, lr warmup and weight decay in general! 🧵 for https://t.co/D8i8u3fSsd 1/10
4
44
209
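A small probe of the quantity in question, assuming "angular update size" means the angle between a weight matrix before and after an optimizer step (my reading; the paper's precise definition may differ). The toy model and hyperparameters are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def angular_update(w_old, w_new):
    # mean angle (degrees) between corresponding weight rows before/after a step
    cos = F.cosine_similarity(w_old.flatten(1), w_new.flatten(1), dim=1)
    return torch.rad2deg(torch.acos(cos.clamp(-1, 1))).mean().item()

x, y = torch.randn(256, 32), torch.randint(0, 10, (256,))

for name, make_opt in [
    ("AdamW (decoupled wd)", lambda p: torch.optim.AdamW(p, lr=1e-3, weight_decay=0.1)),
    ("Adam + L2 penalty   ", lambda p: torch.optim.Adam(p, lr=1e-3, weight_decay=0.1)),
]:
    torch.manual_seed(0)                  # identical initialization for both runs
    model = nn.Linear(32, 10)
    opt = make_opt(model.parameters())
    angles = []
    for _ in range(200):
        w_before = model.weight.detach().clone()
        opt.zero_grad()
        F.cross_entropy(model(x), y).backward()
        opt.step()
        angles.append(angular_update(w_before, model.weight.detach()))
    print(f"{name}: mean angular update over last 50 steps = "
          f"{sum(angles[-50:]) / 50:.3f} deg")
```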
If you're a Python programmer looking to get started with CUDA, this weekend I'll be doing a free 1 hour tutorial on the absolute basics. Thanks to @neurosp1ke, @marksaroufim, and @ThomasViehmann for hosting this on the CUDA MODE server. :D Click here:
discord.com
28
268
2K
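Not the tutorial material itself, but for the impatient, here is what a first CUDA kernel written entirely from Python can look like using Numba (one common route among several; assumes a CUDA-capable GPU and the numba package installed):

```python
import numpy as np
from numba import cuda

@cuda.jit
def add_one(x):
    i = cuda.grid(1)            # global thread index
    if i < x.shape[0]:          # guard against out-of-range threads
        x[i] += 1.0

a = np.zeros(1_000_000, dtype=np.float32)
d_a = cuda.to_device(a)                      # copy host array to the GPU
threads = 256
blocks = (a.shape[0] + threads - 1) // threads
add_one[blocks, threads](d_a)                # launch grid of `blocks` x `threads`
print(d_a.copy_to_host()[:5])                # [1. 1. 1. 1. 1.]
```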
That's right folks -- back in action for one night only! Come hear me talk about #LLM reasoning through parameter updates.
Check out our #NeurIPS2023 paper, RECKONING, on reasoning as meta-learning. I am unable to attend due to visa issues, but my amazing advisor @ABosselut will present the poster! Time: Wed 13 Dec 5 p.m. CST — 7 p.m. CST Location: Great Hall & Hall B1+B2 (level 1)
0
3
18
I am presenting Landmark Attention today at 17:15 at #NeurIPS2023. I will also present CoTFormer (https://t.co/ib5HEIZ8cr) at the WANT workshop on Saturday. Excited to meet some of you at either.
Introducing Landmark Attention! Our method allows #transformers to handle any inference context length, regardless of their training context length. This enables #LLAMA7B to process contexts with 32k+ tokens—just like #GPT4. Read our paper: https://t.co/WD9FFBHRy4 🧵👇(1/3)
0
2
14
Now out in @NatureComms! We developed EnsembleTR, an ensemble method to combine genotypes from 4 major tandem repeat callers and generated a genome-wide catalog of ~1.7 million TRs from 3550 samples in the 1000 Genomes and H3Africa cohorts. https://t.co/zFcBnLjAMr
nature.com
5
10
50
🚨 I'm looking for a postdoc position to start in Fall 2024! My most recent research interests are related to understanding foundation models (especially LLMs!), making them more reliable, and developing principled methods for deep learning. More info:
9
42
155
The US is notorious for its mass shootings, but its immigration policies (or lack thereof) set a new standard for the art of shooting yourself in the foot with a bazooka. Advice to graduate students from countries the US doesn't like: just go to Europe.
I’m quite used to the cruelty students can face when they apply for a US visa but this one broke me. We offered admission to a stellar, talented & hardworking student. After months of work and hundreds of dollars, an embassy officer saw him for 5 mins & said no. why? …
75
138
2K
Both @icmlconf and @NeurIPSConf were held in the US in 2022-23! The US is one of the most visa-unfriendly states (appointment wait times 6+ months, processing another 6+ months); this is significantly hurting diversity & inclusion. We should strive to do better! #ICML2023 #NeurIPS2023
8
22
151
ICML 22&23, and NeurIPS 22&23, all have been/will be held in the US. I know it's not easy to organize a conference of this size. Yet I am really curious to know whether people who have difficulties traveling to the US were part of the equation for these decisions or not.
@sahandsharif @icmlconf While the US visa process leaves a lot to be desired, I don't think it constitutes racism. Having said that, it's sad that both @icmlconf and @NeurIPSConf are in the US this year, which is one of the most visa-unfriendly states, significantly hurting diversity and inclusion.
0
3
5
How to speed up the training of transformers over large sequences? Many methods sparsify the attention matrix with static patterns. Could we use dynamic (e.g. adaptive) patterns? A thread! Joint work with @DanielePaliotta (equal contribution), @francoisfleuret, and Martin Jaggi
3
81
435
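To make "dynamic patterns" concrete, here is a generic illustration of input-dependent attention sparsity, not the specific method from the paper above: each query keeps only its top-k highest-scoring keys and masks everything else before the softmax. The function name and the top-k rule are assumptions for the sketch.

```python
import math
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, topk=16):
    T = q.size(-2)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))          # (B, H, T, T)
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool, device=q.device))
    scores = scores.masked_fill(~causal, float("-inf"))               # causal mask
    # dynamic sparsity: keep each query's top-k keys, drop the rest
    kth = scores.topk(min(topk, T), dim=-1).values[..., -1:]          # k-th best score per query
    scores = scores.masked_fill(scores < kth, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(1, 4, 128, 64)
print(topk_sparse_attention(q, k, v).shape)  # torch.Size([1, 4, 128, 64])
```

Note this toy version still materializes the full score matrix; the point of real dynamic-sparsity methods is to skip the masked computation entirely, e.g. with a sparsity-aware FlashAttention-style kernel.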
SGD in practice usually doesn't sample data uniformly and instead goes over the dataset in epochs, which is called Random Reshuffling. We've known for some time that RR is better than SGD for convex functions and now it's been proven for nonconvex: https://t.co/PwWIUrYG98
5
27
187
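A toy least-squares comparison of the two sampling schemes (illustrative only; not the paper's experiments, and the step size and problem size are arbitrary):

```python
import torch

torch.manual_seed(0)
n, d = 512, 20
A, b = torch.randn(n, d), torch.randn(n)
x_star = torch.linalg.pinv(A) @ b            # least-squares solution for reference

def run(sampler, epochs=30, lr=0.01):
    x = torch.zeros(d)
    for _ in range(epochs):
        for i in sampler(n):
            a_i = A[i]
            x = x - lr * (a_i @ x - b[i]) * a_i   # stochastic gradient step
    return (x - x_star).norm().item()

uniform = lambda n: torch.randint(0, n, (n,)).tolist()   # SGD: sample with replacement
reshuffle = lambda n: torch.randperm(n).tolist()         # RR: fresh permutation each epoch

print("uniform sampling  :", run(uniform))
print("random reshuffling:", run(reshuffle))
```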
📢🚀 It's here! 💥👏 Just released the code for landmark attention! 🔗 Check it out on GitHub: https://t.co/JLcileP4Uy
#Transformers #LLM #GPT4
github.com
Landmark Attention: Random-Access Infinite Context Length for Transformers - epfml/landmark-attention
Introducing Landmark Attention! Our method allows #transformers to handle any inference context length, regardless of their training context length. This enables #LLAMA7B to process contexts with 32k+ tokens—just like #GPT4. Read our paper: https://t.co/WD9FFBHRy4 🧵👇(1/3)
0
1
4
🚨Excited to share our new work “Sharpness-Aware Minimization Leads to Low-Rank Features” https://t.co/dHgdgWI5Ja! ❓We know SAM improves generalization, but can we better understand the structure of features learned by SAM? (with @dara_bahri, @TheGradient, N. Flammarion) 🧵1/n
4
24
151
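For readers who haven't seen SAM: a minimal sketch of one SAM update (ascend to a nearby worst-case point, take the gradient there, step from the original weights), followed by a crude look at the numerical rank of the learned features. This is a toy setup, not the paper's code; the rank threshold and model are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
x, y = torch.randn(512, 32), torch.randint(0, 4, (512,))
model = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 4))
opt = torch.optim.SGD(model.parameters(), lr=0.05)
rho = 0.5  # SAM perturbation radius

for step in range(300):
    # 1) gradient at w
    opt.zero_grad()
    F.cross_entropy(model(x), y).backward()
    grads = [p.grad.detach().clone() for p in model.parameters()]
    norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
    # 2) move to w + rho * g / ||g|| and compute the gradient there
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p.add_(g, alpha=rho / (norm + 1e-12))
    opt.zero_grad()
    F.cross_entropy(model(x), y).backward()
    with torch.no_grad():  # 3) undo the perturbation, then step with the SAM gradient
        for p, g in zip(model.parameters(), grads):
            p.sub_(g, alpha=rho / (norm + 1e-12))
    opt.step()

feats = model[:-1](x)                            # pre-classifier features
s = torch.linalg.svdvals(feats)
approx_rank = (s / s.max() > 1e-2).sum()         # crude numerical-rank proxy
print("feature (approx) rank:", approx_rank.item(), "of", feats.size(1))
```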
Landmark Attention enables inference at any context length, irrespective of the training context length (hence the term “infinite” in the title). Moreover, it drastically reduces memory and compute requirements, by a factor equal to the block size (e.g., 50x). (3/3)
0
0
0
Our method utilizes landmark tokens to retrieve relevant blocks directly through the attention mechanism. This ensures that #transformers maintain their inherent capability to access any token in the context, while leveraging the landmarks for targeted block retrieval. (2/3)
1
0
0
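A toy sketch of the retrieval idea in the tweet above, not the paper's exact grouped-softmax formulation: the context is split into blocks, each block is summarized by a landmark key (here simply the mean key of the block, an assumption for illustration), a query first scores the landmarks to pick a few relevant blocks, and then attends only over the tokens in those blocks.

```python
import math
import torch
import torch.nn.functional as F

def landmark_retrieval_attention(q, k, v, block_size=50, n_retrieved=2):
    T, d = k.shape
    n_blocks = T // block_size
    k_blk = k[: n_blocks * block_size].view(n_blocks, block_size, d)
    v_blk = v[: n_blocks * block_size].view(n_blocks, block_size, d)
    landmarks = k_blk.mean(dim=1)                       # toy landmark key per block
    block_scores = (q @ landmarks.T) / math.sqrt(d)     # score each block via its landmark
    picked = block_scores.topk(n_retrieved).indices     # retrieve only the top blocks
    keys = k_blk[picked].reshape(-1, d)                 # tokens of the retrieved blocks
    vals = v_blk[picked].reshape(-1, d)
    attn = F.softmax((q @ keys.T) / math.sqrt(d), dim=-1)
    return attn @ vals

d = 64
q, k, v = torch.randn(d), torch.randn(5000, d), torch.randn(5000, d)
print(landmark_retrieval_attention(q, k, v).shape)  # torch.Size([64])
```

With 5,000 context tokens and a block size of 50, the query scores about 100 landmarks plus 100 retrieved tokens instead of 5,000 keys, which is roughly the block-size reduction factor mentioned in the (3/3) tweet.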