
Dan Busbridge @ ICML Vancouver 🇨🇦
@danbusbridge
Followers: 850 · Following: 1K · Media: 34 · Statuses: 164
Machine Learning Research @ Apple (opinions are my own)
London, United Kingdom
Joined November 2014
Reading "Distilling Knowledge in a Neural Network" left me fascinated and wondering:. "If I want a small, capable model, should I distill from a more powerful model, or train from scratch?". Our distillation scaling law shows, well, it's complicated. π§΅.
arxiv.org
We provide a distillation scaling law that estimates distilled model performance based on a compute budget and its allocation between the student and teacher. Our findings reduce the risks...
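For intuition about the kind of question the law answers, here is a toy sketch: a Chinchilla-style parametric loss for training from scratch next to a hypothetical student loss that also depends on the teacher. The functional forms and every coefficient below are illustrative placeholders, not the law fitted in the paper.

```python
# Toy sketch only: the functional forms and coefficients below are illustrative
# placeholders, NOT the distillation scaling law fitted in the paper.

def scratch_loss(n_params, n_tokens, E=1.7, A=400.0, B=1800.0, alpha=0.34, beta=0.28):
    # Classic supervised scaling-law shape: L(N, D) = E + A / N^alpha + B / D^beta.
    return E + A / n_params**alpha + B / n_tokens**beta

def distilled_loss(n_params, n_tokens, teacher_loss, A=300.0, B=1200.0, alpha=0.34, beta=0.30):
    # Hypothetical student law written relative to the teacher's loss: the student
    # approaches the teacher as its parameters and distillation tokens grow.
    return teacher_loss + A / n_params**alpha + B / n_tokens**beta

# Compare the two options at a fixed student compute budget, C ~= 6 * N * D.
flops = 1e21
n_student = 1.0e9
d_tokens = flops / (6 * n_student)
print("from scratch:", round(scratch_loss(n_student, d_tokens), 3))
print("distilled:   ", round(distilled_loss(n_student, d_tokens, teacher_loss=2.0), 3))
```

In the real setting the teacher's own compute also enters the budget, which is exactly the allocation question the abstract refers to.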
Happening now in East Exhibition Hall E-2310 with @AmitisShidani1, looking forward to discussing our work!
Data mixtures are crucial for achieving strong pre-trained models. Loved collaborating on this project led by @PierreAblin and @MustafaShukor1 tackling data mixing ratios through the lens of scaling laws. Check out @MustafaShukor1's 🧵.
We propose new scaling laws that predict the optimal data mixture for pretraining LLMs, native multimodal models, and large vision encoders! Only small-scale experiments are needed, and we can then extrapolate to large-scale ones. These laws allow 1/n 🧵
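To make the recipe in the thread concrete, here is a minimal hypothetical sketch of the workflow it describes: fit a loss model that depends on scale and mixture weights using small-scale runs, then query it at a larger target scale to pick the mixture. The parametric form, the two-domain setup, and all numbers are assumptions for illustration, not the laws proposed in the paper.

```python
# Hypothetical sketch of the "fit small, extrapolate large" recipe. The loss
# surface below stands in for a law whose coefficients would be estimated from
# small-scale training runs; it is not the paper's parametric form.
import math

def predicted_loss(n_params, w, E=1.8, A=350.0, alpha=0.32, c=0.6):
    # w = weight of domain A in a two-domain mixture (domain B gets 1 - w).
    # Toy twist: the best mixture drifts with model size, which is why
    # extrapolating a fitted law beats reusing the small-scale optimum directly.
    w_star = 0.30 + 0.05 * math.log10(n_params / 1e8)
    return E + A / n_params**alpha + c * (w - w_star) ** 2

# Query the fitted surface at a large target scale and pick the best mixture.
target_n = 7e9
candidates = [i / 20 for i in range(21)]
best_w = min(candidates, key=lambda w: predicted_loss(target_n, w))
print(f"predicted-optimal weight for domain A at N={target_n:.0e}: {best_w:.2f}")
```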
Happening in 30 minutes in West Ballroom A - looking forward to sharing our work on Distillation Scaling Laws!
Excited to be heading to Vancouver for #ICML2025 next week! I'll be giving a deep dive on Distillation Scaling Laws at the expo, exploring when and how small models can match the performance of large ones. Sunday, July 13, 5pm, West Ballroom A.
@AmitisShidani1 @samira_abnar @harshays_ @alaa_nouby @AggieInCA @LouisBAlgue @PierreAblin Here's an Apple@ICML guide with all our talks, posters, and booth events. Come say hi if you're around; always happy to chat. Looking forward to a week of great research, and to catching up with familiar faces (and meeting new ones too).
machinelearning.apple.com
Apple is presenting new research at the International Conference on Machine Learning (ICML 2025), which takes place in person in Vancouver…
@AmitisShidani1 @samira_abnar @harshays_ @alaa_nouby @AggieInCA and Scaling Laws for Forgetting and Fine-Tuning (E-2708), with @LouisBAlgue, David Grangier, Eleonora Gualdoni, Marco Cuturi, and @PierreAblin.
@AmitisShidani1 Also lucky to be co-authoring two more posters during the same session with my awesome colleagues: Parameters vs FLOPs for MoEs (E-2810), with @samira_abnar, @harshays_, @alaa_nouby, Josh Susskind, and @AggieInCA.
Also presenting this work as a poster with @AmitisShidani1: Wednesday, 11am, East Exhibition Hall A-B (E-2310).
Distillation has long promised smaller, faster, more efficient models, but its scaling behaviour is still poorly understood. We present a new distillation scaling law that helps turn this black-box art into more predictable science.
Reading "Distilling Knowledge in a Neural Network" left me fascinated and wondering:. "If I want a small, capable model, should I distill from a more powerful model, or train from scratch?". Our distillation scaling law shows, well, it's complicated. π§΅.
Excited to be heading to Vancouver for #ICML2025 next week! I'll be giving a deep dive on Distillation Scaling Laws at the expo, exploring when and how small models can match the performance of large ones. Sunday, July 13, 5pm, West Ballroom A.
RT @jramapuram: Stop by poster #596 at 10A-1230P tomorrow (Fri 25 April) at #ICLR2025 to hear more about Sigmoid Attention! We just pushe…
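For readers who haven't seen the poster, the core idea, sketched here from scratch rather than from the authors' code, is to replace the row-wise softmax in attention with an elementwise sigmoid plus a bias; the -log(seq_len) default below is one commonly discussed choice and is an assumption on my part.

```python
# Minimal numpy illustration of sigmoid attention: an elementwise sigmoid with a
# bias replaces the row-wise softmax. Illustrative only; not the authors' kernel.
import numpy as np

def sigmoid_attention(q, k, v, bias=None):
    # q, k, v: (seq_len, d_head). bias defaults to -log(seq_len), an assumed
    # stabilising choice; no row normalisation is applied, unlike softmax.
    seq_len, d_head = q.shape
    if bias is None:
        bias = -np.log(seq_len)
    scores = q @ k.T / np.sqrt(d_head) + bias        # (seq_len, seq_len)
    weights = 1.0 / (1.0 + np.exp(-scores))          # elementwise sigmoid
    return weights @ v

q, k, v = (np.random.randn(8, 16) for _ in range(3))
print(sigmoid_attention(q, k, v).shape)              # (8, 16)
```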
I've been curious about how early vs late-fusion multimodal approaches compare in controlled conditions. Great to see this studied in depth. Turns out, optimal late fusion has a higher params-to-data ratio, and performance between early and late fusion is similar. Brilliant work from.
We release a large-scale study to answer the following:
- Is late fusion inherently better than early fusion for multimodal models?
- How do native multimodal models scale compared to LLMs?
- How can sparsity (MoEs) play a detrimental role in handling heterogeneous modalities? 🧵
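For anyone unfamiliar with the distinction the study compares, here is a schematic PyTorch sketch: early fusion feeds all modality tokens through one shared backbone, while late fusion gives each modality its own encoder and merges afterwards. Module sizes, token counts, and layer splits are arbitrary choices of mine, not the architectures trained in the paper.

```python
# Schematic PyTorch sketch of the two designs being compared. All sizes and layer
# counts are arbitrary illustrative choices, not the models trained in the study.
import torch
import torch.nn as nn

D = 256

def encoder(num_layers: int) -> nn.TransformerEncoder:
    layer = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=num_layers)

class EarlyFusion(nn.Module):
    """Concatenate modality tokens first, then run one shared backbone."""
    def __init__(self):
        super().__init__()
        self.backbone = encoder(4)
    def forward(self, img_tokens, txt_tokens):
        return self.backbone(torch.cat([img_tokens, txt_tokens], dim=1))

class LateFusion(nn.Module):
    """Encode each modality separately, then merge in a small shared head."""
    def __init__(self):
        super().__init__()
        self.img_enc, self.txt_enc = encoder(3), encoder(3)
        self.fusion = encoder(1)
    def forward(self, img_tokens, txt_tokens):
        merged = torch.cat([self.img_enc(img_tokens), self.txt_enc(txt_tokens)], dim=1)
        return self.fusion(merged)

img, txt = torch.randn(2, 49, D), torch.randn(2, 32, D)
print(EarlyFusion()(img, txt).shape, LateFusion()(img, txt).shape)  # both (2, 81, 256)
```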
RT @EeshanDhekane: Parameterized Transforms. Here is a new tool that provides a modular and extendable implementation of torchvision-base…