
Mustafa Shukor
@MustafaShukor1
Followers: 704 · Following: 252 · Media: 78 · Statuses: 158
CS PhD @Sorbonne_Univ_ · France · Joined September 2021
We release a large-scale study to answer the following:
- Is late fusion inherently better than early fusion for multimodal models?
- How do native multimodal models scale compared to LLMs?
- How can sparsity (MoEs) play a detrimental role in handling heterogeneous modalities? 🧵
L3M is finally released! This is the codebase we used to train AIMv2/v1 and to conduct the scaling-law studies. It can also be used for different kinds of pretraining, including CLIP encoders and LLMs.
Super excited to share l3m 🚀, a library for training large multimodal models, which we used to build AIM and AIMv2. Massive thanks to @alaa_nouby @DonkeyShot21 Michal Klein @MustafaShukor1 @jmsusskind and many others.
Our work on scaling laws for multimodal models and MoEs got an Oral at ICCV. Check it out!
we missed a banger paper in the grok4/k2 drop noise guys. these guys
> look for optimal ways to select data mixes to get max improvement on a model given a target domain.
> do multimodal validation
> show good extrapolation accuracy (testing on 1.4B and predicting on 8B)
Data mixtures are crucial for achieving strong pre-trained models. Loved collaborating on this project led by @PierreAblin and @MustafaShukor1 tackling data mixing ratios through the lens of scaling laws. Check out @MustafaShukor1's 🧵.
i love this kind of empirical research - i always ask about data mixtures bc i'm curious about what works and why, so here we have some insights!
Deciding which data mixture to use has always been such a crucial part of nailing a good pre-training recipe. Check out this paper, led by @PierreAblin, @MustafaShukor1, and the team at Apple MLR, providing a principled way to select optimal data mixture weights!
Data mixing ratios are critical for modern LLM training. This work takes a first principles approach and develops scaling laws for the mixing ratios, enabling “train small” -> “get guarantees at scale”. Definitely worth a read.
This is the result of an amazing collaboration with @PierreAblin, @LouisBAlgue, @danbusbridge, @GrangierDavid, @DonkeyShot21 and @alaa_nouby. Paper:
arxiv.org
Large foundation models are typically trained on data from multiple domains, with the data mixture--the proportion of each domain used--playing a critical role in model performance. The standard...
Besides looking at the training loss, we report performance on downstream tasks. We evaluate a 7B LLM trained with the best mixture (of 7 domains), predicted from small-scale experiments (fewer than 1.5B params and 50B tokens). 8/n
To reduce the number of experiments, we used constant learning rates. However, we also validate our laws with the commonly used cosine learning-rate scheduler. 7/n
Only a few runs are needed to fit the scaling laws. This significantly reduces the number of experiments needed to find the optimal training mixture (usually done by trial and error). 6/n
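For intuition, here is a minimal, self-contained sketch of what fitting such a law from a handful of small-scale runs could look like. The parameterization, numbers, and function names are illustrative assumptions, not the paper's code; the "observed" losses are synthesized from known parameters so the example runs end to end.

```python
# Illustrative sketch only -- not the paper's code. Fits an additive-style
# mixture scaling law to a few small-scale runs, then queries it at a
# larger scale before training anything at that scale.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

def additive_law(theta, N, D, h1):
    """Chinchilla-style terms plus a per-domain mixture penalty (2 domains)."""
    E, A, alpha, B, beta, C1, g1, C2, g2 = theta
    h2 = 1.0 - h1
    return E + A / N**alpha + B / D**beta + C1 / h1**g1 + C2 / h2**g2

# Small-scale grid: model sizes, token counts, and mixture weights.
N = np.repeat([1.5e8, 3.0e8, 6.0e8, 1.2e9], 3)   # params
D = 20 * N                                        # tokens (~Chinchilla ratio)
h1 = np.tile([0.2, 0.5, 0.8], 4)                  # weight of domain 1

# Synthesize "observed" losses from made-up ground-truth parameters.
true_theta = np.array([1.8, 4e2, 0.34, 4e3, 0.28, 0.05, 0.6, 0.08, 0.6])
loss = additive_law(true_theta, N, D, h1) + rng.normal(0, 0.005, size=N.shape)

# Least-squares fit of the law's parameters to the 12 small runs.
def objective(theta):
    return np.mean((additive_law(theta, N, D, h1) - loss) ** 2)

theta0 = np.array([2.0, 1e2, 0.3, 1e3, 0.3, 0.1, 0.5, 0.1, 0.5])
fit = minimize(objective, theta0, method="Nelder-Mead",
               options={"maxiter": 100_000, "maxfev": 100_000})

# Extrapolate: predicted loss of a much larger run, before training it.
print("predicted 7B-param / 140B-token loss:",
      additive_law(fit.x, 7e9, 1.4e11, 0.5))
```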
Our laws accurately predict the model loss for the three different setups (LLMs, NMMs, LVMs). 5/n
Specifically, we modify the Chinchilla scaling law to account for the data mixture and propose two laws: (1) the additive law and (2) the joint law. In the former, the optimal mixture is independent of FLOPs, while in the latter it is not. 4/n
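To make the distinction concrete, here is a minimal sketch of the two functional forms, starting from the Chinchilla law. The exact parameterization in the paper may differ; this is only meant to show why an additive form makes the optimal mixture budget-independent while a joint form does not.

```latex
% Chinchilla baseline: loss as a function of model size N and tokens D
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}

% Additive law (illustrative): the mixture h = (h_1, \dots, h_K) enters
% through a separate term, so \arg\min_h L(N, D, h) is the same at any
% FLOPs budget.
L(N, D, h) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
           + \sum_{k=1}^{K} \frac{C_k}{h_k^{\gamma_k}}

% Joint law (illustrative): coefficients and exponents depend on h, so
% the optimal mixture can shift as N and D grow.
L(N, D, h) = E + \frac{A(h)}{N^{\alpha(h)}} + \frac{B(h)}{D^{\beta(h)}}
```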
(2) to predict the optimal data mixture, given a FLOPs budget (N, D) 3/n
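As a rough illustration of (2), reusing the additive-style form sketched above with made-up "fitted" parameters: scan mixtures on the simplex and keep the one with the lowest predicted loss at the target budget. Under the additive form the winner does not move when the budget grows.

```python
# Illustrative only: pick the mixture that minimizes predicted loss at a
# target (N, D) budget, using an additive-style law with made-up "fitted"
# parameters (two domains, so the simplex is one-dimensional).
import numpy as np

E, A, alpha, B, beta = 1.8, 4e2, 0.34, 4e3, 0.28
C1, g1, C2, g2 = 0.05, 0.6, 0.08, 0.6

def predicted_loss(N, D, h1):
    h2 = 1.0 - h1
    return E + A / N**alpha + B / D**beta + C1 / h1**g1 + C2 / h2**g2

h_grid = np.linspace(0.01, 0.99, 99)
for N, D in [(1e9, 2e10), (8e9, 1.6e11)]:        # two very different budgets
    best = h_grid[np.argmin(predicted_loss(N, D, h_grid))]
    print(f"N={N:.0e}, D={D:.0e} -> best h1 = {best:.2f}")
# With this additive form the best h1 is identical at both budgets; a joint
# law would let it drift with scale.
```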
(1) to predict the model performance, before any training, given a model size N, dataset size D, and training data mixture h (here on a mix of multimodal data domains). 2/n
We propose new scaling laws that predict the optimal data mixture for pretraining LLMs, native multimodal models, and large vision encoders! Only small-scale experiments are needed, and we can then extrapolate to large-scale ones. These laws allow: 1/n 🧵
Hugging Face presents SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
Blog post: huggingface.co
The worldwide @LeRobotHF hackathon is in 2 weeks, and we have been cooking something for you… Introducing SmolVLA, a Vision-Language-Action model with a lightweight architecture, pretrained on community datasets, with an asynchronous inference stack to control robots 🧵
In the short term it is hard to foresee the implications of this, but when you look at everything that companies like @huggingface and @nvidia are doing to bring AI into the physical world, all doubts about why robotics is the next great revolution disappear.