Amirkeivan Mohtashami
@akmohtashami_a
Followers
186
Following
49
Media
4
Statuses
31
We are working on quantizing #Llama3 using #QuaRot. I got some interesting results on WikiText PPL for the FP16 models:
(Seq Len=2048) LLaMa2-7B: 5.47, LLaMa3-8B: 6.14
(Seq Len=4096) LLaMa2-7B: 5.11, LLaMa3-8B: 5.75
Maybe WikiText PPL is not a great metric to report anymore!
[1/7] Happy to release 🥕QuaRot, a post-training quantization scheme that enables 4-bit inference of LLMs by removing the outlier features. With @akmohtashami_a @max_croci @DAlistarh @thoefler @jameshensman and others Paper: https://t.co/u3OMOyc78O Code: https://t.co/RsN34zmriI
0
2
17
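Not the evaluation script behind those numbers, but for context, WikiText perplexity at a fixed sequence length is usually computed roughly like this. The checkpoint name, dataset split, and non-overlapping chunking below are illustrative assumptions, not the setup used for the figures above.

```python
# Minimal WikiText-2 perplexity sketch for a causal LM (illustrative only).
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint
seq_len = 2048

tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto")
model.eval()

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids

nlls = []
for start in range(0, ids.size(1) - seq_len, seq_len):
    chunk = ids[:, start:start + seq_len].to(model.device)
    with torch.no_grad():
        # labels=chunk makes the model return the mean next-token NLL for this chunk
        nlls.append(model(chunk, labels=chunk).loss.float())
ppl = torch.exp(torch.stack(nlls).mean())
print(f"WikiText-2 PPL @ seq_len={seq_len}: {ppl.item():.2f}")
```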
If you haven't seen it yet, Mixture-of-Depths is a really nice idea for dynamic compute. I decided to quickly code up a MoD block in a small GPT and try it out -- if you want to play with it too (and check correctness, please!), the code is here: https://t.co/2RcEHmTyzU
Why Google DeepMind's Mixture-of-Depths paper, and more generally dynamic compute methods, matter: Most of the compute is WASTED because not all tokens are equally hard to predict
5
51
229
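A rough sketch of what such a MoD-style block can look like, based on my reading of the idea rather than the paper's or the linked repo's exact implementation: a router scores tokens, only the top-k per sequence go through the block's computation, and the rest pass through the residual stream unchanged. The class name, capacity value, and stand-in block are assumptions.

```python
import torch
import torch.nn as nn

class MoDBlock(nn.Module):
    def __init__(self, block: nn.Module, d_model: int, capacity: float = 0.125):
        super().__init__()
        self.block = block              # any (B, T, C) -> (B, T, C) transformer block
        self.router = nn.Linear(d_model, 1)
        self.capacity = capacity        # fraction of tokens that get full compute

    def forward(self, x):
        B, T, C = x.shape
        k = max(1, int(self.capacity * T))
        scores = self.router(x).squeeze(-1)                        # (B, T) routing scores
        topk = scores.topk(k, dim=-1).indices.sort(dim=-1).values  # keep sequence order
        idx = topk.unsqueeze(-1).expand(-1, -1, C)                 # (B, k, C) gather indices
        selected = x.gather(1, idx)                                # tokens that receive compute
        # Scale by the router score so routing stays differentiable.
        processed = self.block(selected) * torch.sigmoid(
            scores.gather(1, topk)).unsqueeze(-1)
        # Selected positions get residual + block output; all others keep x unchanged.
        return x.scatter(1, idx, selected + processed)

# Tiny usage example with a stand-in MLP block.
blk = nn.Sequential(nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 64))
mod = MoDBlock(blk, d_model=64, capacity=0.25)
print(mod(torch.randn(2, 16, 64)).shape)  # torch.Size([2, 16, 64])
```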
[1/7] Happy to release 🥕QuaRot, a post-training quantization scheme that enables 4-bit inference of LLMs by removing the outlier features. With @akmohtashami_a @max_croci @DAlistarh @thoefler @jameshensman and others Paper: https://t.co/u3OMOyc78O Code: https://t.co/RsN34zmriI
7
63
304
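The core trick, as I understand it from the QuaRot paper, is computational invariance: inserting an orthogonal rotation spreads outlier features across channels without changing the network's function, which makes 4-bit quantization much better behaved. A toy Python illustration of that effect (naive per-tensor 4-bit quantizer, random QR-based rotation; this is not the released QuaRot code):

```python
import torch

def quantize_4bit(x):
    # naive symmetric per-tensor 4-bit quantization
    scale = x.abs().max() / 7.0
    return torch.clamp((x / scale).round(), -8, 7) * scale

torch.manual_seed(0)
x = torch.randn(128, 512)
x[:, 3] *= 50.0                        # inject an outlier feature/channel
W = torch.randn(512, 512) / 512**0.5

Q, _ = torch.linalg.qr(torch.randn(512, 512))     # random orthogonal matrix, Q @ Q.T = I

y_ref = x @ W
y_plain = quantize_4bit(x) @ W                    # quantize raw activations
y_rot = quantize_4bit(x @ Q) @ (Q.T @ W)          # rotate, quantize, fold Q into the weight

print("relative error w/o rotation :", ((y_plain - y_ref).norm() / y_ref.norm()).item())
print("relative error with rotation:", ((y_rot - y_ref).norm() / y_ref.norm()).item())
```

Because `x @ Q @ Q.T @ W == x @ W` exactly, the rotation is free in terms of model quality; the only change is that the outlier channel no longer dominates the quantization scale.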
Skip connections are not enough! We show that providing the individual outputs of previous layers to each Transformer layer significantly boosts its performance. See the thread for more! Had an amazing time collaborating with @MatPagliardini, @francoisfleuret, and Martin Jaggi.
A tweak in the architecture of #Transformers can significantly boost accuracy! With direct access to all previous blocks’ outputs, a 48-block #DenseFormer outperforms a 72-block Transformer, with faster inference! A work with @akmohtashami_a, @francoisfleuret, Martin Jaggi. 1/🧵
0
1
18
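A minimal sketch of how "direct access to all previous blocks' outputs" can look in code, assuming a depth-weighted-average reading of the tweet; the class name, initialization, and stand-in blocks are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DenseFormerStack(nn.Module):
    def __init__(self, blocks: nn.ModuleList):
        super().__init__()
        self.blocks = blocks
        # alphas[i] weights the embedding output and the outputs of blocks 0..i
        self.alphas = nn.ParameterList(
            [nn.Parameter(torch.zeros(i + 2)) for i in range(len(blocks))]
        )
        for a in self.alphas:
            a.data[-1] = 1.0   # start out equivalent to a plain stack of blocks

    def forward(self, x):
        history = [x]                              # embedding, then each block's output
        for i, block in enumerate(self.blocks):
            history.append(block(x))
            w = self.alphas[i]
            # learned weighted combination over everything computed so far
            x = sum(w[j] * h for j, h in enumerate(history))
        return x

# Tiny usage example with stand-in blocks.
blocks = nn.ModuleList(
    [nn.Sequential(nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 64)) for _ in range(4)]
)
stack = DenseFormerStack(blocks)
print(stack(torch.randn(2, 16, 64)).shape)  # torch.Size([2, 16, 64])
```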
People often teach one another by simply explaining a problem using natural language. Today we introduce an approach for model training wherein a teacher #LLM generates natural language instructions to train a student model with improved privacy. https://t.co/BlhsThCuJJ
35
177
711
Why does AdamW outperform Adam with L2-regularization? Its effectiveness seems to stem from how it affects the angular update size of weight vectors! This may also be the case for Weight Standardization, lr warmup and weight decay in general! 🧵 for https://t.co/D8i8u3fSsd 1/10
4
44
209
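A small probe of the quantity in question, assuming "angular update size" means the angle between a weight matrix before and after an optimizer step (my reading; the paper's precise definition may differ). The toy model and hyperparameters are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def angular_update(w_old, w_new):
    # mean angle (degrees) between corresponding weight rows before/after a step
    cos = F.cosine_similarity(w_old.flatten(1), w_new.flatten(1), dim=1)
    return torch.rad2deg(torch.acos(cos.clamp(-1, 1))).mean().item()

x, y = torch.randn(256, 32), torch.randint(0, 10, (256,))

for name, make_opt in [
    ("AdamW (decoupled wd)", lambda p: torch.optim.AdamW(p, lr=1e-3, weight_decay=0.1)),
    ("Adam + L2 penalty   ", lambda p: torch.optim.Adam(p, lr=1e-3, weight_decay=0.1)),
]:
    torch.manual_seed(0)                  # identical initialization for both runs
    model = nn.Linear(32, 10)
    opt = make_opt(model.parameters())
    angles = []
    for _ in range(200):
        w_before = model.weight.detach().clone()
        opt.zero_grad()
        F.cross_entropy(model(x), y).backward()
        opt.step()
        angles.append(angular_update(w_before, model.weight.detach()))
    print(f"{name}: mean angular update over last 50 steps = "
          f"{sum(angles[-50:]) / 50:.3f} deg")
```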
If you're a Python programmer looking to get started with CUDA, this weekend I'll be doing a free 1 hour tutorial on the absolute basics. Thanks to @neurosp1ke, @marksaroufim, and @ThomasViehmann for hosting this on the CUDA MODE server. :D Click here:
discord.com
28
268
2K
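Not the tutorial material itself, but for the impatient, here is what a first CUDA kernel written entirely from Python can look like using Numba (one common route among several; assumes a CUDA-capable GPU and the numba package installed):

```python
import numpy as np
from numba import cuda

@cuda.jit
def add_one(x):
    i = cuda.grid(1)            # global thread index
    if i < x.shape[0]:          # guard against out-of-range threads
        x[i] += 1.0

a = np.zeros(1_000_000, dtype=np.float32)
d_a = cuda.to_device(a)                      # copy host array to the GPU
threads = 256
blocks = (a.shape[0] + threads - 1) // threads
add_one[blocks, threads](d_a)                # launch grid of `blocks` x `threads`
print(d_a.copy_to_host()[:5])                # [1. 1. 1. 1. 1.]
```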
That's right folks -- back in action for one night only! Come hear me talk about #LLM reasoning through parameter updates.
Check out our #NeurIPS2023 paper, RECKONING, on reasoning as meta-learning. I am unable to attend due to visa issues, but my amazing advisor @ABosselut will present the poster! Time: Wed 13 Dec 5 p.m. CST — 7 p.m. CST Location: Great Hall & Hall B1+B2 (level 1)
0
3
18
I am presenting Landmark Attention today at 17:15 at #NeurIPS2023. I will also present CoTFormer (https://t.co/ib5HEIZ8cr) at the WANT workshop on Saturday. Excited to meet some of you at either.
Introducing Landmark Attention! Our method allows #transformers to handle any inference context length, regardless of their training context length. This enables #LLAMA7B to process contexts with 32k+ tokens—just like #GPT4. Read our paper: https://t.co/WD9FFBHRy4 🧵👇(1/3)
0
2
14
Now out in @NatureComms! We developed EnsembleTR, an ensemble method to combine genotypes from 4 major tandem repeat callers and generated a genome-wide catalog of ~1.7 million TRs from 3550 samples in the 1000 Genomes and H3Africa cohorts. https://t.co/zFcBnLjAMr
nature.com
5
10
50
🚨 I'm looking for a postdoc position to start in Fall 2024! My most recent research interests are related to understanding foundation models (especially LLMs!), making them more reliable, and developing principled methods for deep learning. More info:
9
42
155
The US is notorious for its mass shootings, but its immigration policies (or lack thereof) set a new standard for the art of shooting yourself in the foot with a bazooka. Advice to graduate students from countries the US doesn't like: just go to Europe.
I’m quite used to the cruelty students can face when they apply for a US visa but this one broke me. We offered admission to a stellar, talented & hardworking student. After months of work and hundreds of dollars, an embassy officer saw him for 5 mins & said no. why? …
75
138
2K
Both @icmlconf and @NeurIPSConf were held in the US in 2022-23! The US is one of the most visa-unfriendly states (appointment wait times 6+ months, processing another 6+ months); this is significantly hurting diversity & inclusion. We should strive to do better! #ICML2023 #NeurIPS2023
8
22
151
ICML 22&23, and NeurIPS 22&23, all have been/will be held in the US. I know it's not easy to organize a conference of this size. Yet I am really curious to know whether people who have difficulties traveling to the US were part of the equation for these decisions or not.
@sahandsharif @icmlconf While the US visa process leaves a lot to be desired, I don't think it constitutes racism. Having said that, it's sad that both @icmlconf and @NeurIPSConf are in the US this year, which is one of the most visa-unfriendly states, significantly hurting diversity and inclusion.
0
3
5
How to speed up the training of transformers over large sequences? Many methods sparsify the attention matrix with static patterns. Could we use dynamic (e.g. adaptive) patterns? A thread! Joint work with @DanielePaliotta (equal contribution), @francoisfleuret, and Martin Jaggi
3
81
435
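To make "dynamic patterns" concrete, here is a generic illustration of input-dependent attention sparsity, not the specific method from the paper above: each query keeps only its top-k highest-scoring keys and masks everything else before the softmax. The function name and the top-k rule are assumptions for the sketch.

```python
import math
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, topk=16):
    T = q.size(-2)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))          # (B, H, T, T)
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool, device=q.device))
    scores = scores.masked_fill(~causal, float("-inf"))               # causal mask
    # dynamic sparsity: keep each query's top-k keys, drop the rest
    kth = scores.topk(min(topk, T), dim=-1).values[..., -1:]          # k-th best score per query
    scores = scores.masked_fill(scores < kth, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(1, 4, 128, 64)
print(topk_sparse_attention(q, k, v).shape)  # torch.Size([1, 4, 128, 64])
```

Note this toy version still materializes the full score matrix; the point of real dynamic-sparsity methods is to skip the masked computation entirely, e.g. with a sparsity-aware FlashAttention-style kernel.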
SGD in practice usually doesn't sample data uniformly and instead goes over the dataset in epochs, which is called Random Reshuffling. We've known for some time that RR is better than SGD for convex functions and now it's been proven for nonconvex: https://t.co/PwWIUrYG98
5
27
187
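A toy least-squares comparison of the two sampling schemes (illustrative only; not the paper's experiments, and the step size and problem size are arbitrary):

```python
import torch

torch.manual_seed(0)
n, d = 512, 20
A, b = torch.randn(n, d), torch.randn(n)
x_star = torch.linalg.pinv(A) @ b            # least-squares solution for reference

def run(sampler, epochs=30, lr=0.01):
    x = torch.zeros(d)
    for _ in range(epochs):
        for i in sampler(n):
            a_i = A[i]
            x = x - lr * (a_i @ x - b[i]) * a_i   # stochastic gradient step
    return (x - x_star).norm().item()

uniform = lambda n: torch.randint(0, n, (n,)).tolist()   # SGD: sample with replacement
reshuffle = lambda n: torch.randperm(n).tolist()         # RR: fresh permutation each epoch

print("uniform sampling  :", run(uniform))
print("random reshuffling:", run(reshuffle))
```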
📢🚀 It's here! 💥👏 Just released the code for landmark attention! 🔗 Check it out on GitHub: https://t.co/JLcileP4Uy
#Transformers #LLM #GPT4
github.com
Landmark Attention: Random-Access Infinite Context Length for Transformers - epfml/landmark-attention
Introducing Landmark Attention! Our method allows #transformers to handle any inference context length, regardless of their training context length. This enables #LLAMA7B to process contexts with 32k+ tokens—just like #GPT4. Read our paper: https://t.co/WD9FFBHRy4 🧵👇(1/3)
0
1
4
🚨Excited to share our new work “Sharpness-Aware Minimization Leads to Low-Rank Features” https://t.co/dHgdgWI5Ja! ❓We know SAM improves generalization, but can we better understand the structure of features learned by SAM? (with @dara_bahri, @TheGradient, N. Flammarion) 🧵1/n
4
24
151
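For readers who haven't seen SAM: a minimal sketch of one SAM update (ascend to a nearby worst-case point, take the gradient there, step from the original weights), followed by a crude look at the numerical rank of the learned features. This is a toy setup, not the paper's code; the rank threshold and model are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
x, y = torch.randn(512, 32), torch.randint(0, 4, (512,))
model = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 4))
opt = torch.optim.SGD(model.parameters(), lr=0.05)
rho = 0.5  # SAM perturbation radius

for step in range(300):
    # 1) gradient at w
    opt.zero_grad()
    F.cross_entropy(model(x), y).backward()
    grads = [p.grad.detach().clone() for p in model.parameters()]
    norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
    # 2) move to w + rho * g / ||g|| and compute the gradient there
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p.add_(g, alpha=rho / (norm + 1e-12))
    opt.zero_grad()
    F.cross_entropy(model(x), y).backward()
    with torch.no_grad():  # 3) undo the perturbation, then step with the SAM gradient
        for p, g in zip(model.parameters(), grads):
            p.sub_(g, alpha=rho / (norm + 1e-12))
    opt.step()

feats = model[:-1](x)                            # pre-classifier features
s = torch.linalg.svdvals(feats)
approx_rank = (s / s.max() > 1e-2).sum()         # crude numerical-rank proxy
print("feature (approx) rank:", approx_rank.item(), "of", feats.size(1))
```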
Landmark Attention enables inference at any context length, irrespective of the training context length (hence the term “infinite” in the title). Moreover, it drastically reduces memory and compute requirements, by a factor equal to the block size (e.g., 50x). (3/3)
0
0
0
Our method utilizes landmark tokens to retrieve relevant blocks directly through the attention mechanism. This ensures that #transformers maintain their inherent capability to access any token in the context, while leveraging the landmarks for targeted block retrieval. (2/3)
1
0
0
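A toy sketch of the retrieval idea in the tweet above, not the paper's exact grouped-softmax formulation: the context is split into blocks, each block is summarized by a landmark key (here simply the mean key of the block, an assumption for illustration), a query first scores the landmarks to pick a few relevant blocks, and then attends only over the tokens in those blocks.

```python
import math
import torch
import torch.nn.functional as F

def landmark_retrieval_attention(q, k, v, block_size=50, n_retrieved=2):
    T, d = k.shape
    n_blocks = T // block_size
    k_blk = k[: n_blocks * block_size].view(n_blocks, block_size, d)
    v_blk = v[: n_blocks * block_size].view(n_blocks, block_size, d)
    landmarks = k_blk.mean(dim=1)                       # toy landmark key per block
    block_scores = (q @ landmarks.T) / math.sqrt(d)     # score each block via its landmark
    picked = block_scores.topk(n_retrieved).indices     # retrieve only the top blocks
    keys = k_blk[picked].reshape(-1, d)                 # tokens of the retrieved blocks
    vals = v_blk[picked].reshape(-1, d)
    attn = F.softmax((q @ keys.T) / math.sqrt(d), dim=-1)
    return attn @ vals

d = 64
q, k, v = torch.randn(d), torch.randn(5000, d), torch.randn(5000, d)
print(landmark_retrieval_attention(q, k, v).shape)  # torch.Size([64])
```

With 5,000 context tokens and a block size of 50, the query scores about 100 landmarks plus 100 retrieved tokens instead of 5,000 keys, which is roughly the block-size reduction factor mentioned in the (3/3) tweet.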