Jan Ludziejewski Profile
Jan Ludziejewski

@jahulas

Followers: 707 · Following: 203 · Media: 7 · Statuses: 30

AI Scientist at @MistralAI pretraining | MoE & Scaling Laws | PhD Student at the University of Warsaw

Joined July 2023
@jahulas
Jan Ludziejewski
4 months
Excited to be at ICML next week presenting our paper Joint MoE Scaling Laws: Mixture of Experts Can Be Memory Efficient! If you want to talk about scaling laws and MoEs, or you're interested in pretraining at @MistralAI - hit me up.
1
5
28
@teortaxesTex
Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)
2 months
Interesting fact: as far as we can tell, MoEs are always compute-optimal if evaluated over their lifetime. In the worst case, i.e. total-param parity, you just need to scale the training data by total/active, giving you ≈ the same training cost. The main issue is a skill issue
@kalomaze
kalomaze
2 months
@teortaxesTex believe it or not, dense activations are not optimal at this size, either
1
2
25
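The compute-parity arithmetic in the quoted thread above is easy to check with the common rough estimate of ~6 · active parameters · tokens for training FLOPs. The sketch below is a back-of-the-envelope illustration only; the parameter and token counts are hypothetical, not figures from the thread.

```python
# Back-of-the-envelope check of the claim above: at total-parameter parity,
# scaling the training data by (total / active) keeps training cost roughly
# constant, because training FLOPs ~ 6 * active_params * tokens.
# All numbers below are illustrative assumptions.

def train_flops(active_params: float, tokens: float) -> float:
    """Rough convention: ~6 FLOPs per active parameter per training token."""
    return 6 * active_params * tokens

TOTAL = 1.1e9        # total parameters, same for the dense and MoE model
ACTIVE_MOE = 0.3e9   # hypothetical active parameters of the MoE
D_DENSE = 8e9        # training tokens for the dense model

# Scale the MoE's training data by total / active.
D_MOE = D_DENSE * (TOTAL / ACTIVE_MOE)

print(f"dense training cost: {train_flops(TOTAL, D_DENSE):.3e} FLOPs")
print(f"MoE   training cost: {train_flops(ACTIVE_MOE, D_MOE):.3e} FLOPs")
```

By construction the two costs come out identical, which is the sense in which the MoE "just needs more data" rather than more compute.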
@kuba_krj
Jakub Krajewski
4 months
I’m excited to share that I will be presenting the poster on Scaling Fine-Grained MoE to 50B Parameters today at the ES-FoMo ICML Workshop! 🚀 The report is also live on arXiv. This work was performed during my internship at @nvidia last year - I’m happy to finally share it!
2
5
13
@MistralAI
Mistral AI
5 months
Announcing Magistral, our first reasoning model designed to excel in domain-specific, transparent, and multilingual reasoning.
107
453
3K
@jahulas
Jan Ludziejewski
9 months
I'm happy to share that I've joined @MistralAI as an AI Scientist in pretraining!
34
11
562
@jahulas
Jan Ludziejewski
9 months
(6/n) To perform our analyses, we derive a new joint scaling law for both dense and MoE models, extending Chinchilla with the number of experts. We back it up empirically with an extensive grid of over 280 experiments with up to 2.7B active parameters and up to 5B total parameters.
1
0
20
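The thread above describes extending the Chinchilla loss formula with the number of experts. The Python sketch below is purely illustrative of what such a joint parametric form could look like; the functional form and every coefficient are placeholders, not the fitted law from the paper.

```python
# Illustrative Chinchilla-style loss surface extended with an expert count E.
# The form and all coefficients are placeholder assumptions for exposition,
# NOT the fitted joint scaling law from the paper.

def chinchilla_loss(N, D, a=400.0, b=400.0, e=1.7, alpha=0.34, beta=0.28):
    """Chinchilla-style form: L(N, D) = e + a / N**alpha + b / D**beta."""
    return e + a / N**alpha + b / D**beta

def joint_moe_loss(N, D, E, a=400.0, b=400.0, e=1.7,
                   alpha=0.34, beta=0.28, gamma=0.05):
    """Toy extension: the parameter term shrinks with the expert count E,
    and E = 1 recovers the dense formula. Placeholder form only."""
    return e + (a / N**alpha) * E ** (-gamma) + b / D**beta

# Dense baseline vs. a hypothetical 8-expert MoE at a larger token budget.
print(chinchilla_loss(1.1e9, 8e9))
print(joint_moe_loss(1.1e9, 32e9, E=8))
```

In practice such a law would be fitted to a grid of training runs (the thread mentions over 280 experiments) rather than fixed by hand.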
@jahulas
Jan Ludziejewski
9 months
(5/n) For practitioners, we suggest the following rule of thumb. For instance, a compute-optimal 1.1B dense model trained on 8B tokens will have worse loss than a 4-expert, 1.1B-total-parameter MoE trained on 32B tokens. The MoE model will also require fewer FLOPs per token during inference.
2
1
24
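To make the arithmetic behind this rule of thumb concrete: under the simplifying assumption that a 4-expert MoE activates roughly total/4 parameters per token (a hypothetical figure for illustration, not from the thread), the 4x larger token budget keeps training compute matched while inference FLOPs per token drop.

```python
# Rough arithmetic behind the rule of thumb above. Treating the 4-expert MoE
# as activating ~ total/4 parameters per token is a simplifying assumption;
# real active counts depend on how much of the model is expertized.

DENSE_PARAMS = 1.1e9
DENSE_TOKENS = 8e9

MOE_TOTAL = 1.1e9
N_EXPERTS = 4
MOE_ACTIVE = MOE_TOTAL / N_EXPERTS   # ~0.275e9, illustrative only
MOE_TOKENS = 32e9                    # 4x the dense token budget

# Training compute (rough ~6ND convention) comes out the same for both.
print(f"dense training: {6 * DENSE_PARAMS * DENSE_TOKENS:.2e} FLOPs")
print(f"MoE   training: {6 * MOE_ACTIVE * MOE_TOKENS:.2e} FLOPs")

# Inference cost per token scales with *active* parameters, so the MoE is
# roughly N_EXPERTS times cheaper per token in this toy accounting.
print(f"dense inference: {2 * DENSE_PARAMS:.2e} FLOPs/token")
print(f"MoE   inference: {2 * MOE_ACTIVE:.2e} FLOPs/token")
```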
@jahulas
Jan Ludziejewski
9 months
(4/n) Furthermore, we show that MoE requires MoE-specific hyperparameters to be optimal. Generally, models with a greater number of experts require:
- a higher token-to-parameter ratio
- a lower learning rate
In our work, you can find a detailed guide to optimal tuning in specific scenarios.
1
0
23
@jahulas
Jan Ludziejewski
9 months
(3/n) Our experimental validation shows that a 1.1B dense model can be outperformed by a 1.1B-total-parameter MoE trained with the same compute budget, even though the MoE activates only a fraction of its parameters per token.
1
0
24
@jahulas
Jan Ludziejewski
9 months
(2/n) MoE models are known to be more compute-efficient than dense models. Intuitively, however, they require more memory. This suggests a trade-off: save FLOPs at the cost of VRAM. Our new scaling laws challenge that assumption by comparing memory- and compute-matched models.
1
0
22
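Since the thread contrasts memory-matched with compute-matched comparisons, a small accounting helper makes the two axes explicit: weight memory is governed by total parameters, training compute by active parameters times tokens. The model shapes below are hypothetical.

```python
# Toy accounting of the two axes the thread contrasts: memory follows *total*
# parameters, training compute follows *active* parameters times tokens.
# All shapes here are hypothetical.

def footprint_and_cost(total_params, active_params, tokens, bytes_per_param=2):
    memory_gb = total_params * bytes_per_param / 1e9   # e.g. bf16 weights
    train_flops = 6 * active_params * tokens           # rough 6ND convention
    return memory_gb, train_flops

# Memory-matched pair: same total parameters, but the MoE activates fewer
# of them per token, so it saves FLOPs at no extra weight memory.
dense = footprint_and_cost(total_params=1.1e9, active_params=1.1e9, tokens=8e9)
moe   = footprint_and_cost(total_params=1.1e9, active_params=0.3e9, tokens=8e9)

print("dense: %.1f GB weights, %.2e training FLOPs" % dense)
print("MoE:   %.1f GB weights, %.2e training FLOPs" % moe)
```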
@jahulas
Jan Ludziejewski
9 months
(1/n) Introducing Joint MoE Scaling Laws: Mixture of Experts Can Be Memory Efficient. We show that MoE is so compute-efficient that, under the same memory budget, it can beat a dense alternative!
2
43
164
@marek_a_cygan
Marek Cygan
1 year
Let me advertise the results of the LLM group at the University of Warsaw & IDEAS. I was happy to see the growing strength of the group, both in terms of the number of people engaged and the quality of the project results. With the focus on Mixture of Experts in LLMs, this
0
7
39
@crewtool
Michał Krutul
1 year
Our work “Mixture of Tokens: Continuous MoE through Cross-Example Aggregation” has been accepted to the NeurIPS 2024 main track! 🔥🔥🔥 Huge shoutout to the team: @Simontwice2, @S_Jaszczur, @maciejpioro, @kuba_krj, @jahulas, @KamilCiebiera, @KrysKrol, @TOdrzygozdz, @marek_a_cygan.
2
9
27
@merettm
Jakub Pachocki
1 year
@marek_a_cygan AI is advancing rapidly and will have an ever-greater impact on the economy. Fundamental research that lays the groundwork for understanding new technologies is very important. IDEAS NCBR, under the leadership of Piotr Sankowski, has earned an excellent worldwide reputation in this regard - as evidenced by
26
169
1K
@arankomatsuzaki
Aran Komatsuzaki
2 years
Scaling Laws for Fine-Grained Mixture of Experts
- MoE models consistently outperform dense Transformers
- The efficiency gap between dense and MoE models widens as we scale up the model size and training budget
https://t.co/BnFe0EjgkN
4
63
278
@maciejpioro
Maciej Pióro
2 years
Tomorrow we will be presenting our work, *MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts* at the ME-FoMo ICLR workshop (Room Strauss 2). See you at the poster session at 1PM 🕐. You can also DM me if you wanna grab a coffee and talk about LLMs.
0
6
25
@jahulas
Jan Ludziejewski
2 years
Tomorrow I will be presenting our paper, *Scaling Laws for Fine-Grained Mixture of Experts* at the ME-FoMo ICLR workshop (Room Strauss 2). Come to our Spotlight Talk at 11AM and stop by the poster at 1PM. I am also happy to share that our paper has been accepted to ICML 2024 🎉
0
6
15
@arankomatsuzaki
Aran Komatsuzaki
2 years
MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts
Reaches the same performance as Mamba in 2.2x fewer training steps while preserving the inference performance gains of Mamba over the Transformer
https://t.co/JLMbbO9HsA
6
83
504