Mustafa Shukor

@MustafaShukor1

Followers: 704
Following: 252
Media: 78
Statuses: 158

CS PhD @Sorbonne_Univ_

France
Joined September 2021
@MustafaShukor1
Mustafa Shukor
6 months
We release a large-scale study to answer the following:
- Is late fusion inherently better than early fusion for multimodal models?
- How do native multimodal models scale compared to LLMs?
- Can sparsity (MoEs) play a detrimental role in handling heterogeneous modalities? 🧵
10
81
456
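For readers new to the distinction the thread above studies: in late fusion a separate vision encoder processes images first and its outputs are handed to the language model later, while in early fusion a single model ingests tokens from all modalities from the first layer. Below is a minimal, hypothetical PyTorch sketch of the two setups for intuition only; it is not the paper's code, and the module names are made up.

# Hypothetical contrast between early and late fusion; not the paper's code.
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """One shared transformer sees image and text tokens from the first layer."""
    def __init__(self, d=256, n_layers=4, n_heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d, n_heads, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, n_layers)

    def forward(self, img_tokens, txt_tokens):
        # Concatenate modalities immediately and process them jointly.
        return self.trunk(torch.cat([img_tokens, txt_tokens], dim=1))

class LateFusion(nn.Module):
    """A modality-specific vision encoder first; fusion happens in a later trunk."""
    def __init__(self, d=256, n_layers=4, n_heads=4):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d, n_heads, batch_first=True)
        self.vision_encoder = nn.TransformerEncoder(enc_layer, n_layers)
        trunk_layer = nn.TransformerEncoderLayer(d, n_heads, batch_first=True)
        self.trunk = nn.TransformerEncoder(trunk_layer, n_layers)

    def forward(self, img_tokens, txt_tokens):
        img = self.vision_encoder(img_tokens)                    # images encoded separately
        return self.trunk(torch.cat([img, txt_tokens], dim=1))   # fused only later

Here img_tokens and txt_tokens would both be (batch, seq, d) embeddings; the study asks which of these two families scales better when trained natively on multimodal data.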
@MustafaShukor1
Mustafa Shukor
19 days
L3M is finally released! It is the codebase we used to train AIMv2/v1 and to run the scaling-law studies. It can also be used for other kinds of pretraining, including CLIP encoders and LLMs.
@victorturrisi
Victor Turrisi
19 days
Super excited to share l3m 🚀, a library for training large multimodal models, which we used to build AIM and AIMv2. Massive thanks to @alaa_nouby @DonkeyShot21 Michal Klein @MustafaShukor1 @jmsusskind and many others.
0
0
11
@MustafaShukor1
Mustafa Shukor
2 months
Our work on scaling laws for multimodal models and MoEs got an Oral at ICCV. Check it out!
@MustafaShukor1
Mustafa Shukor
6 months
We release a large-scale study to answer the following:
- Is late fusion inherently better than early fusion for multimodal models?
- How do native multimodal models scale compared to LLMs?
- Can sparsity (MoEs) play a detrimental role in handling heterogeneous modalities? 🧵
2
21
142
@tokenbender
tokenbender
3 months
we missed a banger paper in the grok4/k2 drop noise guys. these guys:
> look for optimal ways to select data mixes to get max improvement on a model given a target domain
> do multimodal validation
> show good extrapolation accuracy (testing on 1.4B and predicting on 8B)
11
74
787
@_akhaliq
AK
3 months
Scaling Laws for Optimal Data Mixtures
1
6
44
@danbusbridge
Dan Busbridge
3 months
Data mixtures are crucial for achieving strong pre-trained models. Loved collaborating on this project led by @PierreAblin and @MustafaShukor1 tackling data mixing ratios through the lens of scaling laws. Check out @MustafaShukor1's 🧵.
@MustafaShukor1
Mustafa Shukor
3 months
We propose new scaling laws that predict the optimal data mixture for pretraining LLMs, native multimodal models, and large vision encoders! Only small-scale experiments are needed, and we can then extrapolate to large-scale ones. These laws allow… 1/n 🧵
1
5
19
@nathanbenaich
Nathan Benaich
3 months
i love this kind of empirical research - i always ask about data mixtures bc i'm curious about what works and why, so here we have some insights!
@MustafaShukor1
Mustafa Shukor
3 months
We propose new scaling laws that predict the optimal data mixture for pretraining LLMs, native multimodal models, and large vision encoders! Only small-scale experiments are needed, and we can then extrapolate to large-scale ones. These laws allow… 1/n 🧵
0
1
6
@alaa_nouby
Alaa El-Nouby
3 months
Deciding which data mixture to use has always been a crucial part of nailing a good pre-training recipe. Check out this paper, led by @PierreAblin, @MustafaShukor1 and the team at Apple MLR, providing a principled way to select optimal data mixture weights!
@MustafaShukor1
Mustafa Shukor
3 months
We propose new scaling laws that predict the optimal data mixture for pretraining LLMs, native multimodal models, and large vision encoders! Only small-scale experiments are needed, and we can then extrapolate to large-scale ones. These laws allow… 1/n 🧵
0
4
58
@jramapuram
Jason Ramapuram
3 months
Data mixing ratios are critical for modern LLM training. This work takes a first principles approach and develops scaling laws for the mixing ratios, enabling “train small” -> “get guarantees at scale”. Definitely worth a read.
@MustafaShukor1
Mustafa Shukor
3 months
We propose new scaling laws that predict the optimal data mixture for pretraining LLMs, native multimodal models, and large vision encoders! Only small-scale experiments are needed, and we can then extrapolate to large-scale ones. These laws allow… 1/n 🧵
0
2
14
@MustafaShukor1
Mustafa Shukor
3 months
Besides looking at the training loss, we report performance on downstream tasks. We evaluate a 7B LLM trained with the best mixture (of 7 domains), predicted from small-scale experiments (fewer than 1.5B params and 50B tokens). 8/n
1
0
5
@MustafaShukor1
Mustafa Shukor
3 months
To reduce the number of experiments, we used constant learning rates. However, we also validate our laws with the commonly used cosine learning rate scheduler. 7/n
1
0
4
@MustafaShukor1
Mustafa Shukor
3 months
Only a few runs are needed to fit the scaling laws. This significantly reduces the number of experiments needed to find the optimal training mixture (usually found by trial and error). 6/n
1
0
4
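To make "a few runs are enough to fit the law" concrete, here is a toy sketch of the fitting step under my own assumptions (a hypothetical mixture-aware, Chinchilla-style form with made-up numbers; this is not the paper's released code): a handful of small-scale runs with different sizes, token counts, and mixture weights is regressed against their observed losses.

# Toy sketch of fitting a mixture-aware, Chinchilla-style law to a few small runs.
# The functional form, parameter names, and all numbers here are illustrative only.
import numpy as np
from scipy.optimize import curve_fit

# Each run: model size N, training tokens D, weight h of domain 1 (of two domains),
# and the observed validation loss. All values are made up.
runs = np.array([
    [1.5e8, 3.0e9,  0.25, 3.62],
    [1.5e8, 6.0e9,  0.75, 3.48],
    [3.0e8, 3.0e9,  0.25, 3.44],
    [3.0e8, 6.0e9,  0.75, 3.29],
    [6.0e8, 1.2e10, 0.25, 3.05],
    [6.0e8, 1.2e10, 0.75, 2.99],
    [1.0e9, 2.0e10, 0.50, 2.85],
])
N, D, h, loss = runs.T

def law(X, E0, E1, A, B, alpha, beta):
    # Additive-style form: only the irreducible term depends on the mixture here.
    N, D, h = X
    return (E0 + E1 * h) + A / N**alpha + B / D**beta

params, _ = curve_fit(
    law, (N, D, h), loss,
    p0=[2.0, 0.0, 300.0, 300.0, 0.3, 0.3],
    bounds=([0.0, -5.0, 0.0, 0.0, 1e-3, 1e-3], [10.0, 5.0, 1e6, 1e6, 1.0, 1.0]),
    maxfev=20000,
)
print(dict(zip(["E0", "E1", "A", "B", "alpha", "beta"], params)))
# Once fitted on small runs, the same formula is evaluated at much larger (N, D)
# to extrapolate, which is the point of the approach.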
@MustafaShukor1
Mustafa Shukor
3 months
Our laws perfectly predict the model loss for the 3 different setups (LLMs, NMMs, LVMs). 5/n
1
0
3
@MustafaShukor1
Mustafa Shukor
3 months
Specifically, we modify the Chinchilla scaling law to account for the data mixture and propose two laws: (1) the additive law and (2) the joint law. In the former, the optimal mixture is independent of FLOPs; in the latter, it is not. 4/n
1
1
7
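To make the additive-vs-joint distinction concrete, one plausible way to write such laws (a hedged sketch inferred from this thread, not necessarily the paper's exact parameterization) is to start from the Chinchilla form and let the mixture h enter in two different places:

L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}} \quad \text{(Chinchilla baseline)}

Additive-style: L(N, D, h) = E(h) + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}, so \arg\min_h L(N, D, h) does not depend on the budget (N, D).

Joint-style: L(N, D, h) = E + \frac{A(h)}{N^{\alpha(h)}} + \frac{B(h)}{D^{\beta(h)}}, so the optimal mixture can shift as the FLOPs budget grows.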
@MustafaShukor1
Mustafa Shukor
3 months
(2) to predict the optimal data mixture, given a FLOPs budget (N, D). 3/n
1
0
6
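As a toy illustration of step (2), under a hypothetical additive-style law like the one sketched earlier (my own made-up coefficients, not the paper's code): once the law is fitted, the optimal mixture at a target budget is simply the weight vector that minimizes the predicted loss, found with an off-the-shelf optimizer over the simplex.

# Toy sketch: pick the mixture weight that minimizes a fitted (hypothetical) law.
from scipy.optimize import minimize

# Pretend these came out of the fitting step (two domains, weight h for domain 1).
A, B, alpha, beta = 420.0, 380.0, 0.32, 0.28

def predicted_loss(h, N, D):
    # Hypothetical fitted dependence of the irreducible term on the mixture.
    E = 1.9 + 0.4 * (h - 0.35) ** 2
    return E + A / N**alpha + B / D**beta

# A large-scale budget we never trained at (the extrapolation target).
N_target, D_target = 8e9, 1.6e11

res = minimize(lambda x: predicted_loss(x[0], N_target, D_target),
               x0=[0.5], bounds=[(0.0, 1.0)])
print(f"predicted optimal weight for domain 1: {res.x[0]:.3f}")

Because this toy is additive-style, the N and D terms do not interact with h, so the predicted optimum is the same at every budget; under the joint-style form sketched earlier, the optimal weight could shift as (N, D) grows.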
@MustafaShukor1
Mustafa Shukor
3 months
(1) to predict the model performance, before any training, given a model size N, dataset size T, and training data mixture h (here, a mix of multimodal data domains). 2/n
1
0
6
@MustafaShukor1
Mustafa Shukor
3 months
We propose new scaling laws that predict the optimal data mixture for pretraining LLMs, native multimodal models, and large vision encoders! Only small-scale experiments are needed, and we can then extrapolate to large-scale ones. These laws allow… 1/n 🧵
6
49
268
@_akhaliq
AK
4 months
Hugging Face presents SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
6
59
301
@MustafaShukor1
Mustafa Shukor
4 months
Blog post: huggingface.co
@MustafaShukor1
Mustafa Shukor
4 months
The Worldwide @LeRobotHF hackathon is in 2 weeks, and we have been cooking something for you… Introducing SmolVLA, a Vision-Language-Action model with a lightweight architecture, pretrained on community datasets, with an asynchronous inference stack to control robots 🧵
0
4
26
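For intuition about the "asynchronous inference stack" mentioned above, here is a tiny conceptual toy (my own sketch in plain Python threads; it is not LeRobot's or SmolVLA's actual API): the policy keeps producing chunks of future actions in the background while the control loop executes the previous chunk, so the robot is never blocked waiting on model latency.

# Conceptual toy of asynchronous action-chunk inference; not LeRobot's actual API.
import queue
import threading
import time

action_chunks = queue.Queue(maxsize=2)   # small buffer of upcoming action chunks

def policy_worker(n_chunks=5):
    """Background thread: run (slow) policy inference and enqueue action chunks."""
    for step in range(n_chunks):
        time.sleep(0.3)                               # stand-in for model latency
        chunk = [f"action_{step}_{i}" for i in range(10)]
        action_chunks.put(chunk)                      # blocks only if the buffer is full

threading.Thread(target=policy_worker, daemon=True).start()

# Control loop: keep executing buffered actions; the robot never idles mid-chunk
# waiting for the model, because inference runs in parallel.
for _ in range(5):
    chunk = action_chunks.get()                       # next chunk, produced concurrently
    for action in chunk:
        time.sleep(0.02)                              # stand-in for sending one command
    print(f"executed {len(chunk)} actions from one chunk")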
@liderarmente
Javier Martin
4 months
In the short term it is hard to foresee the implications of this, but when you look at everything companies like @huggingface and @nvidia are doing to bring AI into the physical world, any doubt about why robotics is the next great revolution disappears.
@MustafaShukor1
Mustafa Shukor
4 months
The Worldwide @LeRobotHF hackathon is in 2 weeks, and we have been cooking something for you… Introducing SmolVLA, a Vision-Language-Action model with a lightweight architecture, pretrained on community datasets, with an asynchronous inference stack to control robots 🧵
0
3
7