
Mustafa Shukor
@MustafaShukor1
Followers: 704 · Following: 252 · Media: 78 · Statuses: 158
CS PhD @Sorbonne_Univ_ · France · Joined September 2021
We release a large-scale study to answer the following:
- Is late fusion inherently better than early fusion for multimodal models?
- How do native multimodal models scale compared to LLMs?
- How can sparsity (MoEs) play a detrimental role in handling heterogeneous modalities? 🧵
L3M is finally released! This is the codebase we used to train AIMv2/v1 and to conduct the scaling-law studies. It can also be used for different kinds of pretraining, including CLIP encoders and LLMs.
Super excited to share l3m 🚀, a library for training large multimodal models, which we used to build AIM and AIMv2. Massive thanks to @alaa_nouby @DonkeyShot21 Michal Klein @MustafaShukor1 @jmsusskind and many others.
Our work on scaling laws for multimodal models and MoEs got an Oral at ICCV. Check it out!
we missed a banger paper in the grok4/k2 drop noise guys. these guys
> look for optimal ways to select data mixes to get max improvement on a model given a target domain.
> do multimodal validation
> show good extrapolation accuracy (testing on 1.4B and predicting on 8B)
Data mixtures are crucial for achieving strong pre-trained models. Loved collaborating on this project led by @PierreAblin and @MustafaShukor1 tackling data mixing ratios through the lens of scaling laws. Check out @MustafaShukor1's 🧵.
i love this kind of empirical research - i always ask about data mixtures bc i'm curious about what works and why, so here we have some insights!
Deciding which data mixture to use has always been such a crucial part of nailing a good pre-training recipe. Check out this paper, led by @PierreAblin, @MustafaShukor1, and the team at Apple MLR, providing a principled way to select optimal data mixture weights!
Data mixing ratios are critical for modern LLM training. This work takes a first principles approach and develops scaling laws for the mixing ratios, enabling “train small” -> “get guarantees at scale”. Definitely worth a read.
This is the result of an amazing collaboration with @PierreAblin, @LouisBAlgue, @danbusbridge, @GrangierDavid, @DonkeyShot21 and @alaa_nouby. Paper:
arxiv.org
Large foundation models are typically trained on data from multiple domains, with the data mixture--the proportion of each domain used--playing a critical role in model performance. The standard...
Besides looking at the training loss, we report performance on downstream tasks. We evaluate a 7B LLM trained with the best mixture (of 7 domains), predicted from small-scale experiments (fewer than 1.5B params and 50B tokens). 8/n
To reduce the number of experiments, we used constant learning rates. However, we also validate our laws with the commonly used cosine learning-rate scheduler. 7/n
Only a few runs are needed to fit the scaling laws. This significantly reduces the number of experiments needed to find the optimal training mixture (usually done by trial and error). 6/n
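For intuition, here is a minimal, self-contained sketch of what fitting such a law from a handful of small-scale runs could look like. The parameterization, numbers, and function names are illustrative assumptions, not the paper's code; the "observed" losses are synthesized from known parameters so the example runs end to end.

```python
# Illustrative sketch only -- not the paper's code. Fits an additive-style
# mixture scaling law to a few small-scale runs, then queries it at a
# larger scale before training anything at that scale.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

def additive_law(theta, N, D, h1):
    """Chinchilla-style terms plus a per-domain mixture penalty (2 domains)."""
    E, A, alpha, B, beta, C1, g1, C2, g2 = theta
    h2 = 1.0 - h1
    return E + A / N**alpha + B / D**beta + C1 / h1**g1 + C2 / h2**g2

# Small-scale grid: model sizes, token counts, and mixture weights.
N = np.repeat([1.5e8, 3.0e8, 6.0e8, 1.2e9], 3)   # params
D = 20 * N                                        # tokens (~Chinchilla ratio)
h1 = np.tile([0.2, 0.5, 0.8], 4)                  # weight of domain 1

# Synthesize "observed" losses from made-up ground-truth parameters.
true_theta = np.array([1.8, 4e2, 0.34, 4e3, 0.28, 0.05, 0.6, 0.08, 0.6])
loss = additive_law(true_theta, N, D, h1) + rng.normal(0, 0.005, size=N.shape)

# Least-squares fit of the law's parameters to the 12 small runs.
def objective(theta):
    return np.mean((additive_law(theta, N, D, h1) - loss) ** 2)

theta0 = np.array([2.0, 1e2, 0.3, 1e3, 0.3, 0.1, 0.5, 0.1, 0.5])
fit = minimize(objective, theta0, method="Nelder-Mead",
               options={"maxiter": 100_000, "maxfev": 100_000})

# Extrapolate: predicted loss of a much larger run, before training it.
print("predicted 7B-param / 140B-token loss:",
      additive_law(fit.x, 7e9, 1.4e11, 0.5))
```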
Our laws accurately predict the model loss for the three different setups (LLMs, NMMs, LVMs). 5/n
Specifically, we modify the Chinchilla scaling law to account for the data mixture and propose two laws: (1) the additive law and (2) the joint law. In the former, the optimal mixture is independent of FLOPs, while in the latter it is not. 4/n
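To make the distinction concrete, here is a minimal sketch of the two functional forms, starting from the Chinchilla law. The exact parameterization in the paper may differ; this is only meant to show why an additive form makes the optimal mixture budget-independent while a joint form does not.

```latex
% Chinchilla baseline: loss as a function of model size N and tokens D
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}

% Additive law (illustrative): the mixture h = (h_1, \dots, h_K) enters
% through a separate term, so \arg\min_h L(N, D, h) is the same at any
% FLOPs budget.
L(N, D, h) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
           + \sum_{k=1}^{K} \frac{C_k}{h_k^{\gamma_k}}

% Joint law (illustrative): coefficients and exponents depend on h, so
% the optimal mixture can shift as N and D grow.
L(N, D, h) = E + \frac{A(h)}{N^{\alpha(h)}} + \frac{B(h)}{D^{\beta(h)}}
```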
(2) to predict the optimal data mixture, given a FLOPs budget (N, D) 3/n
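As a rough illustration of (2), reusing the additive-style form sketched above with made-up "fitted" parameters: scan mixtures on the simplex and keep the one with the lowest predicted loss at the target budget. Under the additive form the winner does not move when the budget grows.

```python
# Illustrative only: pick the mixture that minimizes predicted loss at a
# target (N, D) budget, using an additive-style law with made-up "fitted"
# parameters (two domains, so the simplex is one-dimensional).
import numpy as np

E, A, alpha, B, beta = 1.8, 4e2, 0.34, 4e3, 0.28
C1, g1, C2, g2 = 0.05, 0.6, 0.08, 0.6

def predicted_loss(N, D, h1):
    h2 = 1.0 - h1
    return E + A / N**alpha + B / D**beta + C1 / h1**g1 + C2 / h2**g2

h_grid = np.linspace(0.01, 0.99, 99)
for N, D in [(1e9, 2e10), (8e9, 1.6e11)]:        # two very different budgets
    best = h_grid[np.argmin(predicted_loss(N, D, h_grid))]
    print(f"N={N:.0e}, D={D:.0e} -> best h1 = {best:.2f}")
# With this additive form the best h1 is identical at both budgets; a joint
# law would let it drift with scale.
```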
(1) to predict the model performance, before any training, given a model size N, dataset size D, and training data mixture h (here on a mix of multimodal data domains). 2/n
We propose new scaling laws that predict the optimal data mixture for pretraining LLMs, native multimodal models, and large vision encoders! Only small-scale experiments are needed, and we can then extrapolate to large-scale ones. These laws allow: 1/n 🧵
Hugging Face presents SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
Blog post: huggingface.co
The worldwide @LeRobotHF hackathon is in 2 weeks, and we have been cooking something for you… Introducing SmolVLA, a Vision-Language-Action model with a lightweight architecture, pretrained on community datasets, with an asynchronous inference stack to control robots 🧵
In the short term it is hard to foresee the implications of this, but when you look at everything that companies like @huggingface and @nvidia are doing to bring AI into the physical world, all doubts about why robotics is the next great revolution disappear.