Jan Ludziejewski Profile
Jan Ludziejewski

@jahulas

Followers: 707 · Following: 203 · Media: 7 · Statuses: 30

AI Scientist at @MistralAI pretraining | MoE & Scaling Laws | PhD Student at the University of Warsaw

Joined July 2023
@jahulas
Jan Ludziejewski
4 months
Excited to be at ICML next week presenting our paper Joint MoE Scaling Laws: Mixture of Experts Can Be Memory Efficient! If you want to talk about scaling laws and MoEs, or you're interested in pretraining at @MistralAI - hit me up.
1
5
28
@teortaxesTex
Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)
2 months
Interesting fact: as far as we can tell, MoEs are always compute-optimal if evaluated over their lifetime. In the worst case, i.e. total-param parity, you just need to scale the training data by total/active, giving you ≈ the same training cost. The main issue is a skill issue
@kalomaze
kalomaze
2 months
@teortaxesTex believe it or not, dense activations are not optimal at this size, either
1
2
25
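The compute-parity arithmetic in the quoted thread above is easy to check with the common rough estimate of ~6 · active parameters · tokens for training FLOPs. The sketch below is a back-of-the-envelope illustration only; the parameter and token counts are hypothetical, not figures from the thread.

```python
# Back-of-the-envelope check of the claim above: at total-parameter parity,
# scaling the training data by (total / active) keeps training cost roughly
# constant, because training FLOPs ~ 6 * active_params * tokens.
# All numbers below are illustrative assumptions.

def train_flops(active_params: float, tokens: float) -> float:
    """Rough convention: ~6 FLOPs per active parameter per training token."""
    return 6 * active_params * tokens

TOTAL = 1.1e9        # total parameters, same for the dense and MoE model
ACTIVE_MOE = 0.3e9   # hypothetical active parameters of the MoE
D_DENSE = 8e9        # training tokens for the dense model

# Scale the MoE's training data by total / active.
D_MOE = D_DENSE * (TOTAL / ACTIVE_MOE)

print(f"dense training cost: {train_flops(TOTAL, D_DENSE):.3e} FLOPs")
print(f"MoE   training cost: {train_flops(ACTIVE_MOE, D_MOE):.3e} FLOPs")
```

By construction the two costs come out identical, which is the sense in which the MoE "just needs more data" rather than more compute.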
@kuba_krj
Jakub Krajewski
4 months
I’m excited to share that I will be presenting the poster on Scaling Fine-Grained MoE to 50B Parameters today at the ES-FoMo ICML Workshop! 🚀 The report is also live on arXiv. This work was performed during my internship at @nvidia last year - I’m happy to finally share it!
2
5
13
@MistralAI
Mistral AI
5 months
Announcing Magistral, our first reasoning model designed to excel in domain-specific, transparent, and multilingual reasoning.
107
453
3K
@jahulas
Jan Ludziejewski
9 months
I'm happy to share that I've joined @MistralAI as an AI Scientist in pretraining!
34
11
562
@jahulas
Jan Ludziejewski
9 months
(6/n) To perform our analyses, we derive a new joint scaling law for both dense and MoE models, extending Chinchilla with the number of experts. We back it up empirically with an extensive grid of over 280 experiments with up to 2.7B active parameters and up to 5B total parameters.
1
0
20
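The thread above describes extending the Chinchilla loss formula with the number of experts. The Python sketch below is purely illustrative of what such a joint parametric form could look like; the functional form and every coefficient are placeholders, not the fitted law from the paper.

```python
# Illustrative Chinchilla-style loss surface extended with an expert count E.
# The form and all coefficients are placeholder assumptions for exposition,
# NOT the fitted joint scaling law from the paper.

def chinchilla_loss(N, D, a=400.0, b=400.0, e=1.7, alpha=0.34, beta=0.28):
    """Chinchilla-style form: L(N, D) = e + a / N**alpha + b / D**beta."""
    return e + a / N**alpha + b / D**beta

def joint_moe_loss(N, D, E, a=400.0, b=400.0, e=1.7,
                   alpha=0.34, beta=0.28, gamma=0.05):
    """Toy extension: the parameter term shrinks with the expert count E,
    and E = 1 recovers the dense formula. Placeholder form only."""
    return e + (a / N**alpha) * E ** (-gamma) + b / D**beta

# Dense baseline vs. a hypothetical 8-expert MoE at a larger token budget.
print(chinchilla_loss(1.1e9, 8e9))
print(joint_moe_loss(1.1e9, 32e9, E=8))
```

In practice such a law would be fitted to a grid of training runs (the thread mentions over 280 experiments) rather than fixed by hand.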
@jahulas
Jan Ludziejewski
9 months
(5/n) For practitioners, we suggest the following rule of thumb. For instance, a compute-optimal 1.1B dense model trained on 8B tokens will have worse loss than a 4-expert, 1.1B-total-parameter MoE trained on 32B tokens. The MoE model will also require fewer FLOPs per token during inference.
2
1
24
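To make the arithmetic behind this rule of thumb concrete: under the simplifying assumption that a 4-expert MoE activates roughly total/4 parameters per token (a hypothetical figure for illustration, not from the thread), the 4x larger token budget keeps training compute matched while inference FLOPs per token drop.

```python
# Rough arithmetic behind the rule of thumb above. Treating the 4-expert MoE
# as activating ~ total/4 parameters per token is a simplifying assumption;
# real active counts depend on how much of the model is expertized.

DENSE_PARAMS = 1.1e9
DENSE_TOKENS = 8e9

MOE_TOTAL = 1.1e9
N_EXPERTS = 4
MOE_ACTIVE = MOE_TOTAL / N_EXPERTS   # ~0.275e9, illustrative only
MOE_TOKENS = 32e9                    # 4x the dense token budget

# Training compute (rough ~6ND convention) comes out the same for both.
print(f"dense training: {6 * DENSE_PARAMS * DENSE_TOKENS:.2e} FLOPs")
print(f"MoE   training: {6 * MOE_ACTIVE * MOE_TOKENS:.2e} FLOPs")

# Inference cost per token scales with *active* parameters, so the MoE is
# roughly N_EXPERTS times cheaper per token in this toy accounting.
print(f"dense inference: {2 * DENSE_PARAMS:.2e} FLOPs/token")
print(f"MoE   inference: {2 * MOE_ACTIVE:.2e} FLOPs/token")
```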
@jahulas
Jan Ludziejewski
9 months
(4/n) Furthermore, we show that MoE requires MoE-specific hyperparameters to be optimal. Generally, models with a greater number of experts require:
- a higher token-to-parameter ratio
- a lower learning rate
In our work, you can find a detailed guide to optimal tuning in specific scenarios.
1
0
23
@jahulas
Jan Ludziejewski
9 months
(3/n) Our experimental validation shows that a 1.1B dense model can be outperformed by a 1.1B-total-parameter MoE trained with the same compute budget, even though the MoE activates only a fraction of its parameters per token.
1
0
24
@jahulas
Jan Ludziejewski
9 months
(2/n) MoE models are known to be more compute-efficient than dense models. Intuitively, however, they require more memory. This suggests a trade-off: save FLOPs at the cost of VRAM. Our new scaling laws challenge that assumption by comparing memory- and compute-matched models.
1
0
22
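Since the thread contrasts memory-matched with compute-matched comparisons, a small accounting helper makes the two axes explicit: weight memory is governed by total parameters, training compute by active parameters times tokens. The model shapes below are hypothetical.

```python
# Toy accounting of the two axes the thread contrasts: memory follows *total*
# parameters, training compute follows *active* parameters times tokens.
# All shapes here are hypothetical.

def footprint_and_cost(total_params, active_params, tokens, bytes_per_param=2):
    memory_gb = total_params * bytes_per_param / 1e9   # e.g. bf16 weights
    train_flops = 6 * active_params * tokens           # rough 6ND convention
    return memory_gb, train_flops

# Memory-matched pair: same total parameters, but the MoE activates fewer
# of them per token, so it saves FLOPs at no extra weight memory.
dense = footprint_and_cost(total_params=1.1e9, active_params=1.1e9, tokens=8e9)
moe   = footprint_and_cost(total_params=1.1e9, active_params=0.3e9, tokens=8e9)

print("dense: %.1f GB weights, %.2e training FLOPs" % dense)
print("MoE:   %.1f GB weights, %.2e training FLOPs" % moe)
```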
@jahulas
Jan Ludziejewski
9 months
(1/n) Introducing Joint MoE Scaling Laws: Mixture of Experts Can Be Memory Efficient. We show that MoE is so compute-efficient that, under the same memory budget, it can beat a dense alternative!
2
43
164
@marek_a_cygan
Marek Cygan
1 year
Let me advertise the results of the LLM group at the University of Warsaw & IDEAS. I was happy to see the growing strength of the group, both in terms of the number of people engaged and the quality of the project results. With the focus on Mixture of Experts in LLMs, this
0
7
39
@crewtool
Michał Krutul
1 year
Our work “Mixture of Tokens: Continuous MoE through Cross-Example Aggregation” has been accepted to the NeurIPS 2024 main track! 🔥🔥🔥 Huge shoutout to the team: @Simontwice2, @S_Jaszczur, @maciejpioro, @kuba_krj, @jahulas, @KamilCiebiera, @KrysKrol, @TOdrzygozdz, @marek_a_cygan.
2
9
27
@merettm
Jakub Pachocki
1 year
@marek_a_cygan AI is advancing rapidly and will have an ever-greater impact on the economy. Fundamental research that lays the groundwork for understanding new technologies is very important. IDEAS NCBR, under the leadership of Piotr Sankowski, has earned an excellent worldwide reputation in this regard - as evidenced by
26
169
1K
@arankomatsuzaki
Aran Komatsuzaki
2 years
Scaling Laws for Fine-Grained Mixture of Experts
- MoE models consistently outperform dense Transformers
- The efficiency gap between dense and MoE models widens as we scale up the model size and training budget
https://t.co/BnFe0EjgkN
4
63
278
@maciejpioro
Maciej Pióro
2 years
Tomorrow we will be presenting our work, *MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts* at the ME-FoMo ICLR workshop (Room Strauss 2). See you at the poster session at 1PM 🕐. You can also DM me if you wanna grab a coffee and talk about LLMs.
0
6
25
@jahulas
Jan Ludziejewski
2 years
Tomorrow I will be presenting our paper, *Scaling Laws for Fine-Grained Mixture of Experts* at the ME-FoMo ICLR workshop (Room Strauss 2). Come to our Spotlight Talk at 11AM and stop by the poster at 1PM. I am also happy to share that our paper has been accepted to ICML 2024 🎉
0
6
15
@arankomatsuzaki
Aran Komatsuzaki
2 years
MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts
Reaches the same performance as Mamba in 2.2x fewer training steps while preserving the inference performance gains of Mamba over the Transformer
https://t.co/JLMbbO9HsA
6
83
504