Dan Busbridge @ ICML Vancouver 🇨🇦

@danbusbridge

Followers: 850 · Following: 1K · Media: 34 · Statuses: 164

Machine Learning Research @ Apple (opinions are my own)

London, United Kingdom
Joined November 2014
@danbusbridge
Dan Busbridge @ ICML Vancouver 🇨🇦
5 months
Reading "Distilling the Knowledge in a Neural Network" left me fascinated and wondering: "If I want a small, capable model, should I distill from a more powerful model, or train from scratch?" Our distillation scaling law shows, well, it's complicated. 🧵
arxiv.org
We provide a distillation scaling law that estimates distilled model performance based on a compute budget and its allocation between the student and teacher. Our findings reduce the risks...
12
147
1K
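The linked abstract describes estimating distilled-student performance from a compute budget and its split between teacher and student. As a rough illustration of the general technique only - not the paper's actual functional form, coefficients, or data - here is a minimal Python sketch that fits a saturating power law L(C) = E + A * (C / C0)^(-alpha) to a few made-up small-scale runs and extrapolates to a larger budget.

import numpy as np
from scipy.optimize import curve_fit

# Hypothetical (compute, loss) measurements from small-scale runs; all numbers are made up.
compute = np.array([1e17, 3e17, 1e18, 3e18, 1e19])   # training FLOPs
loss = np.array([3.90, 3.55, 3.25, 3.05, 2.92])      # student cross-entropy

C0 = compute[0]  # reference scale, keeps the fit well-conditioned

def power_law(c, E, A, alpha):
    # Irreducible loss E plus a term that decays as a power of compute.
    return E + A * (c / C0) ** (-alpha)

params, _ = curve_fit(power_law, compute, loss, p0=[2.5, 1.5, 0.3])
E, A, alpha = params
print(f"fitted law: L(C) = {E:.2f} + {A:.2f} * (C/{C0:.0e})^(-{alpha:.2f})")

# The point of a scaling law: predict behaviour at budgets you haven't run.
print("predicted loss at 1e21 FLOPs:", power_law(1e21, *params))

The actual law described in the abstract additionally conditions on how compute is allocated between student and teacher; this sketch only shows the fit-small, extrapolate-large mechanic.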
@danbusbridge
Dan Busbridge @ ICML Vancouver 🇨🇦
4 days
Happening now in East Exhibition Hall E-2310, with @AmitisShidani1, looking forward to discussing our work!
[image]
1
1
10
@danbusbridge
Dan Busbridge @ ICML Vancouver 🇨🇦
5 days
Data mixtures are crucial for achieving strong pre-trained models. Loved collaborating on this project led by @PierreAblin and @MustafaShukor1 tackling data mixing ratios through the lens of scaling laws. Check out @MustafaShukor1's 🧵
@MustafaShukor1
Mustafa Shukor
6 days
We propose new scaling laws that predict the optimal data mixture for pretraining LLMs, native multimodal models, and large vision encoders! Only small-scale experiments are needed, and we can then extrapolate to large-scale ones. These laws allow 1/n 🧵
[image]
1
5
19
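To make the "small-scale experiments, then extrapolate" idea concrete - as a generic illustration, not the mixture law proposed in the quoted thread - here is a Python sketch that fits a smooth surrogate to validation losses measured at a few candidate mixture ratios and reads off a predicted-optimal mixture; the domains, ratios, and losses are invented.

import numpy as np

# Hypothetical validation losses from small proxy models trained at different mixtures.
# p = fraction of domain A (say, code) in the pretraining mix; 1 - p is domain B (web text).
p_grid = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
losses = np.array([3.10, 2.95, 2.88, 2.90, 3.02, 3.25])   # made-up numbers

# Fit a simple smooth surrogate (a quadratic here) to the mixture -> loss relationship.
coeffs = np.polyfit(p_grid, losses, deg=2)

# Query it densely and pick the mixture with the lowest predicted loss.
dense = np.linspace(0.0, 1.0, 1001)
p_star = dense[np.argmin(np.polyval(coeffs, dense))]
print(f"predicted optimal fraction of domain A: {p_star:.2f}")

The laws in the quoted work are richer, since they are built to extrapolate from small-scale runs to large-scale ones; the sketch only shows the fit-on-cheap-runs, optimize-the-fit workflow.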
@danbusbridge
Dan Busbridge @ ICML Vancouver 🇨🇦
7 days
Happening in 30 minutes in West Ballroom A - looking forward to sharing our work on Distillation Scaling Laws!
[image]
@danbusbridge
Dan Busbridge @ ICML Vancouver 🇨🇦
9 days
Excited to be heading to Vancouver for #ICML2025 next week! I'll be giving a deep dive on Distillation Scaling Laws at the expo - exploring when and how small models can match the performance of large ones. 📍 Sunday, July 13, 5pm, West Ballroom A. 🔗
1
8
102
@danbusbridge
Dan Busbridge @ ICML Vancouver 🇨🇦
9 days
@AmitisShidani1 @samira_abnar @harshays_ @alaa_nouby @AggieInCA @LouisBAlgue @PierreAblin Here's an Apple@ICML guide with all our talks, posters, and booth events: 🔗 Come say hi if you're around - always happy to chat. Looking forward to a week of great research, and catching up with familiar faces (and meeting new ones too).
machinelearning.apple.com
Apple is presenting new research at the International Conference on Machine Learning (ICML 2025), which takes place in person in Vancouver…
0
1
5
@danbusbridge
Dan Busbridge @ ICML Vancouver 🇨🇦
9 days
@AmitisShidani1 @samira_abnar @harshays_ @alaa_nouby @AggieInCA and Scaling Laws for Forgetting and Fine-Tuning (E-2708) with @LouisBAlgue, David Grangier, Eleonora Gualdoni, Marco Cuturi, and @PierreAblin. 🔗
1
1
3
@danbusbridge
Dan Busbridge @ ICML Vancouver 🇨🇦
9 days
@AmitisShidani1 Also lucky to be co-authoring two more posters during the same session with my awesome colleagues: Parameters vs FLOPs for MoEs (E-2810) with @samira_abnar, @harshays_, @alaa_nouby, Josh Susskind, and @AggieInCA. 🔗
1
1
1
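For readers unfamiliar with why "Parameters vs FLOPs" is the interesting axis for MoEs: in a top-k routed MoE, total parameters and per-token compute decouple. A back-of-the-envelope Python sketch with made-up dimensions (ignoring the router and attention, which are not MoE-specific):

d_model, d_ff = 1024, 4096   # hypothetical transformer widths
n_experts, top_k = 16, 2     # hypothetical MoE configuration

dense_ffn_params = 2 * d_model * d_ff              # up- and down-projection of one dense FFN
moe_total_params = n_experts * dense_ffn_params    # every expert lives in memory
moe_active_params = top_k * dense_ffn_params       # only the top-k experts run per token

# Per-token FLOPs scale with *active* parameters (~2 FLOPs per multiply-add),
# so an MoE can hold far more parameters at close to dense-level compute.
print(f"dense FFN params:          {dense_ffn_params / 1e6:.1f}M")
print(f"MoE total params:          {moe_total_params / 1e6:.1f}M")
print(f"MoE active params / token: {moe_active_params / 1e6:.1f}M")
print(f"per-token FLOPs ratio (MoE/dense): {moe_active_params / dense_ffn_params:.1f}x")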
@danbusbridge
Dan Busbridge @ ICML Vancouver 🇨🇦
9 days
Also presenting this work as a poster with @AmitisShidani1: 📍 Wednesday, 11am, East Exhibition Hall A-B (E-2310). 🔗
1
0
2
@danbusbridge
Dan Busbridge @ ICML Vancouver 🇨🇦
9 days
Distillation has long promised smaller, faster, more efficient models - but its scaling behaviour is still poorly understood. We present a new distillation scaling law that helps turn this black-box art into more predictable science.
@danbusbridge
Dan Busbridge @ ICML Vancouver 🇨🇦
5 months
Reading "Distilling the Knowledge in a Neural Network" left me fascinated and wondering: "If I want a small, capable model, should I distill from a more powerful model, or train from scratch?" Our distillation scaling law shows, well, it's complicated. 🧵
1
0
2
@danbusbridge
Dan Busbridge @ ICML Vancouver 🇨🇦
9 days
Excited to be heading to Vancouver for #ICML2025 next week! I'll be giving a deep dive on Distillation Scaling Laws at the expo - exploring when and how small models can match the performance of large ones. 📍 Sunday, July 13, 5pm, West Ballroom A. 🔗
3
4
28
@danbusbridge
Dan Busbridge @ ICML Vancouver 🇨🇦
3 months
RT @jramapuram: Stop by poster #596 at 10A-1230P tomorrow (Fri 25 April) at #ICLR2025 to hear more about Sigmoid Attention! We just pushe…
0
14
0
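The retweeted poster concerns Sigmoid Attention, which swaps the row-wise softmax in attention for an elementwise sigmoid. Below is a minimal PyTorch sketch of that core idea; the -log(n) bias (keeping initial attention mass roughly comparable to softmax's 1/n per position) follows my reading of the paper and should be treated as an assumption, and this is a plain reference implementation, not the optimized FlashSigmoid kernel.

import math
import torch

def sigmoid_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, head_dim)
    n, d = q.shape[-2], q.shape[-1]
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)
    # Elementwise sigmoid instead of softmax; rows no longer need to sum to 1.
    weights = torch.sigmoid(scores - math.log(n))
    return weights @ v

q, k, v = (torch.randn(2, 4, 128, 64) for _ in range(3))
print(sigmoid_attention(q, k, v).shape)  # torch.Size([2, 4, 128, 64])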
@danbusbridge
Dan Busbridge @ ICML Vancouver 🇨🇦
3 months
I've been curious about how early vs late-fusion multimodal approaches compare in controlled conditions. Great to see this studied in depth. Turns out, optimal late fusion has a higher params-to-data ratio, and performance between early and late fusion is similar. Brilliant work from
@MustafaShukor1
Mustafa Shukor
3 months
We release a large-scale study to answer the following:
- Is late fusion inherently better than early fusion for multimodal models?
- How do native multimodal models scale compared to LLMs?
- How can sparsity (MoEs) play a detrimental role in handling heterogeneous modalities? 🧵
[image]
1
8
41
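For context on the early- vs late-fusion distinction discussed above, here is a deliberately tiny PyTorch sketch of the two patterns; the widths, depths, and token counts are arbitrary, and this is not the architecture from the quoted study.

import torch
import torch.nn as nn

D = 256  # shared model width (arbitrary)

def trunk(depth):
    layer = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=depth)

class EarlyFusion(nn.Module):
    """One shared transformer sees image and text tokens together from the first layer."""
    def __init__(self):
        super().__init__()
        self.img_proj = nn.Linear(64, D)       # hypothetical patch-feature dim
        self.txt_emb = nn.Embedding(1000, D)   # hypothetical vocab size
        self.backbone = trunk(depth=6)

    def forward(self, img_tokens, txt_ids):
        x = torch.cat([self.img_proj(img_tokens), self.txt_emb(txt_ids)], dim=1)
        return self.backbone(x).mean(dim=1)

class LateFusion(nn.Module):
    """Separate unimodal encoders; modalities only interact in a shallow head at the end."""
    def __init__(self):
        super().__init__()
        self.img_proj = nn.Linear(64, D)
        self.txt_emb = nn.Embedding(1000, D)
        self.img_enc = trunk(depth=3)
        self.txt_enc = trunk(depth=3)
        self.fuse = trunk(depth=1)

    def forward(self, img_tokens, txt_ids):
        zi = self.img_enc(self.img_proj(img_tokens))
        zt = self.txt_enc(self.txt_emb(txt_ids))
        return self.fuse(torch.cat([zi, zt], dim=1)).mean(dim=1)

img = torch.randn(2, 49, 64)             # 2 examples, 49 patch tokens
txt = torch.randint(0, 1000, (2, 32))    # 2 examples, 32 text tokens
print(EarlyFusion()(img, txt).shape, LateFusion()(img, txt).shape)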
@danbusbridge
Dan Busbridge @ ICML Vancouver 🇨🇦
3 months
RT @EeshanDhekane: Parameterized Transforms 🚀 Here is a new tool that provides a modular and extendable implementation of torchvision-base…
0
9
0
@danbusbridge
Dan Busbridge @ ICML Vancouver 🇨🇦
5 months
We do observe that students improve with longer distillation training (i.e., patient teaching works). Additionally, with particularly long distillation durations, we approach what supervised learning can achieve (as limited by model capacity, in our experimental setting).
0
0
3
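For readers who haven't seen it, the objective being run for those long durations is the standard distillation loss. A minimal PyTorch sketch follows, assuming the common temperature-scaled KL to the teacher mixed with cross-entropy on labels; the temperature, mixing weight, and the label term itself are generic defaults, not necessarily the setup in the work discussed above.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: temperature-scaled KL divergence to the teacher's distribution.
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy on ground-truth labels.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# One made-up step; "patient teaching" just means running very many of these.
student_logits = torch.randn(8, 100)
teacher_logits = torch.randn(8, 100)
labels = torch.randint(0, 100, (8,))
print(distillation_loss(student_logits, teacher_logits, labels))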
@danbusbridge
Dan Busbridge @ ICML Vancouver 🇨🇦
5 months
In contrast, our teachers and students are trained on the same data distribution, and we compare with supervised models that can access the same distribution. This lets us make statements about what distillation can do given access to the same resources.
1
0
3
@danbusbridge
Dan Busbridge @ ICML Vancouver 🇨🇦
5 months
I.e., Beyer et al.'s students do not see the teacher training distribution, and there is no supervised baseline where, for example, a model has access to both INet21k and the smaller datasets. This type of comparison was not the focus of their work.
1
0
1
@danbusbridge
Dan Busbridge @ ICML Vancouver 🇨🇦
5 months
Beyer et al.'s teachers are trained on a large, diverse dataset (e.g., INet21k), then fine-tuned for the target datasets (e.g., Flowers102 or ImageNet1k). Students are distilled on the target datasets and only access the teacher's training distribution indirectly.
1
0
1