
Dan Busbridge
@danbusbridge
Followers
878
Following
1K
Media
34
Statuses
165
Machine Learning Research @ Apple (opinions are my own)
London, United Kingdom
Joined November 2014
Reading "Distilling Knowledge in a Neural Network" left me fascinated and wondering: "If I want a small, capable model, should I distill from a more powerful model, or train from scratch?" Our distillation scaling law shows, well, it's complicated... π§΅ https://t.co/b1uuyJwzRF
arxiv.org
We propose a distillation scaling law that estimates distilled model performance based on a compute budget and its allocation between the student and teacher. Our findings mitigate the risks...
12
150
1K
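For reference, a minimal sketch of the classic knowledge-distillation objective from the Hinton et al. paper mentioned above: hard-label cross-entropy blended with a temperature-softened KL term towards the teacher. The hyperparameters are illustrative, and this is the generic formulation rather than the exact setup used in the scaling-law paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with a KL term that matches the
    student to the teacher's temperature-softened output distribution."""
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    student_log_probs = F.log_softmax(student_logits / T, dim=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    kd_term = F.kl_div(student_log_probs, soft_targets, reduction="batchmean") * (T * T)
    ce_term = F.cross_entropy(student_logits, labels)
    return alpha * kd_term + (1.0 - alpha) * ce_term

# Usage sketch with random tensors standing in for model outputs.
student_logits = torch.randn(8, 1000)
teacher_logits = torch.randn(8, 1000)
labels = torch.randint(0, 1000, (8,))
print(distillation_loss(student_logits, teacher_logits, labels))
```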
Uncertainty methods and correctness metrics often share "mutual bias" (systematic errors from a common confounder like response length), skewing LLM evaluations. New paper from my colleagues shows that "LM-as-a-judge" evaluation is more robust and human-aligned. Important work -
Uncertainty quantification (UQ) is key for safe, reliable LLMs... but are we evaluating it correctly? 🚨 Our ACL2025 paper finds a hidden flaw: if both UQ methods and correctness metrics are biased by the same factor (e.g., response length), evaluations get systematically skewed
0
1
12
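A toy simulation of the "mutual bias" effect described above, not from the paper: an uncertainty score with no genuine signal about correctness looks predictive simply because both it and the correctness metric are driven by response length, and removing the shared confounder makes the effect vanish. All numbers are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical confounder: response length.
length = rng.normal(size=n)

# A correctness metric that spuriously rewards longer responses.
correctness = 0.7 * length + rng.normal(size=n)

# An "uncertainty" score with no real signal about correctness,
# but also driven by length (longer answers -> lower uncertainty).
uncertainty = -0.7 * length + rng.normal(size=n)

# Evaluated naively, the uninformative UQ method looks predictive...
print("corr(uncertainty, correctness):",
      np.corrcoef(uncertainty, correctness)[0, 1])

# ...but residualising out length removes the apparent relationship.
resid_corr = correctness - np.polyval(np.polyfit(length, correctness, 1), length)
resid_unc = uncertainty - np.polyval(np.polyfit(length, uncertainty, 1), length)
print("corr after removing length:",
      np.corrcoef(resid_unc, resid_corr)[0, 1])
```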
Happening now in East Exhibition Hall E-2310, with @AmitisShidani1, looking forward to discussing our work!
1
1
10
Data mixtures are crucial for achieving strong pre-trained models. Loved collaborating on this project led by @PierreAblin and @MustafaShukor1 tackling data mixing ratios through the lens of scaling laws. Check out @MustafaShukor1's 🧵.
We propose new scaling laws that predict the optimal data mixture for pretraining LLMs, native multimodal models, and large vision encoders! Only small-scale experiments are needed, and we can then extrapolate to large-scale ones. These laws allow 1/n 🧵
1
5
19
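A toy sketch of the general recipe the thread above describes: fit scaling laws on small-scale runs, then extrapolate to larger scale. Here each candidate mixture weight gets its own saturating power law in the token budget; the paper fits mixture weights jointly with a richer functional form, and all numbers below are invented.

```python
import numpy as np
from scipy.optimize import curve_fit

# Invented small-scale runs: token budget N -> validation loss, for three
# candidate mixture weights w of "domain A" (the remainder being "domain B").
small_runs = {
    0.1: [(1e8, 3.61), (3e8, 3.39), (1e9, 3.20)],
    0.5: [(1e8, 3.51), (3e8, 3.31), (1e9, 3.13)],
    0.9: [(1e8, 3.61), (3e8, 3.39), (1e9, 3.20)],
}

def power_law(N, E, A, alpha):
    # Saturating power law: irreducible loss E plus a term decaying in N.
    return E + A / N ** alpha

predicted = {}
for w, runs in small_runs.items():
    N, loss = map(np.array, zip(*runs))
    popt, _ = curve_fit(power_law, N, loss, p0=[2.5, 40.0, 0.2], maxfev=20_000)
    # Extrapolate each fitted law to a much larger, unseen token budget.
    predicted[w] = power_law(1e11, *popt)

print("predicted losses at N=1e11:", predicted)
print("predicted best mixture weight:", min(predicted, key=predicted.get))
```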
Happening in 30 minutes in West Ballroom A - looking forward to sharing our work on Distillation Scaling Laws!
Excited to be heading to Vancouver for #ICML2025 next week! I'll be giving a deep dive on Distillation Scaling Laws at the expo, exploring when and how small models can match the performance of large ones. Sunday, July 13, 5pm, West Ballroom A: https://t.co/yNd5eZByHR
1
8
102
@AmitisShidani1 @samira_abnar @harshays_ @alaa_nouby @AggieInCA @LouisBAlgue @PierreAblin Here's an Apple@ICML guide with all our talks, posters, and booth events: https://t.co/fEkTYVZIo1 Come say hi if you're around, always happy to chat. Looking forward to a week of great research, and catching up with familiar faces (and meeting new ones too).
machinelearning.apple.com
Apple is presenting new research at the International Conference on Machine Learning (ICML 2025), which takes place in person in Vancouver…
0
1
3
@AmitisShidani1 @samira_abnar @harshays_ @alaa_nouby @AggieInCA and Scaling Laws for Forgetting and Fine-Tuning (E-2708) with @LouisBAlgue, David Grangier, Eleonora Gualdoni, Marco Cuturi, and @PierreAblin: https://t.co/c8xqFTf3ZE
1
1
3
@AmitisShidani1 Also lucky to be co-authoring two more posters during the same session with my awesome colleagues: Parameters vs FLOPs for MoEs (E-2810) with @samira_abnar, @harshays_, @alaa_nouby, Josh Susskind, and @AggieInCA: https://t.co/9ecG5Gy0uN
1
1
1
Also presenting this work as a poster with @AmitisShidani1: Wednesday, 11am, East Exhibition Hall A-B (E-2310). https://t.co/y0XiR0aeJe
1
0
2
Distillation has long promised smaller, faster, more efficient models, but its scaling behaviour is still poorly understood. We present a new distillation scaling law that helps turn this black-box art into more predictable science. https://t.co/WN18b3s6eV
Reading "Distilling Knowledge in a Neural Network" left me fascinated and wondering: "If I want a small, capable model, should I distill from a more powerful model, or train from scratch?" Our distillation scaling law shows, well, it's complicated... π§΅ https://t.co/b1uuyJwzRF
1
0
2
Excited to be heading to Vancouver for #ICML2025 next week! I'll be giving a deep dive on Distillation Scaling Laws at the expo, exploring when and how small models can match the performance of large ones. Sunday, July 13, 5pm, West Ballroom A: https://t.co/yNd5eZByHR
3
4
28
Stop by poster #596 from 10am-12:30pm tomorrow (Fri 25 April) at #ICLR2025 to hear more about Sigmoid Attention! We just pushed 8 trajectory checkpoints each for two 7B LLMs: a Sigmoid Attention model and a 1:1 Softmax Attention baseline (trained with a deterministic dataloader for 1T tokens): -
Small update on SigmoidAttn (arXiv update incoming). - 1B and 7B LLM results added and stabilized. - Hybrid Norm [on embed dim, not seq dim], `x + norm(sigmoid(QK^T / sqrt(d_{qk}))V)`, stabilizes longer sequences (n=4096) and larger models (7B). H-norm is used with Grok-1, for example.
1
14
45
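A minimal PyTorch sketch of the hybrid-norm residual quoted above, `x + norm(sigmoid(QK^T / sqrt(d_{qk}))V)`: a single head with no masking, multi-head handling, or bias terms, and the module names and shapes are illustrative only.

```python
import math
import torch
import torch.nn as nn

class HybridNormSigmoidAttention(nn.Module):
    """Single-head sketch of x + norm(sigmoid(Q K^T / sqrt(d_qk)) V),
    with the norm applied over the embedding dimension (not the sequence
    dimension), as described in the tweet above."""

    def __init__(self, d_model: int, d_qk: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_qk)
        self.k_proj = nn.Linear(d_model, d_qk)
        self.v_proj = nn.Linear(d_model, d_model)
        self.norm = nn.LayerNorm(d_model)  # normalises the embedding dim
        self.d_qk = d_qk

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        # Element-wise sigmoid in place of the usual row-wise softmax.
        attn = torch.sigmoid(q @ k.transpose(-2, -1) / math.sqrt(self.d_qk))
        # Hybrid norm: residual connection around the normalised attention output.
        return x + self.norm(attn @ v)

block = HybridNormSigmoidAttention(d_model=64, d_qk=64)
print(block(torch.randn(2, 16, 64)).shape)  # torch.Size([2, 16, 64])
```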
I've been curious about how early vs late-fusion multimodal approaches compare in controlled conditions. Great to see this studied in depth. Turns out, optimal late fusion has a higher params-to-data ratio, and performance between early and late fusion is similar. Brilliant work from
We release a large-scale study to answer the following: - Is late fusion inherently better than early fusion for multimodal models? - How do native multimodal models scale compared to LLMs? - Can sparsity (MoEs) play a detrimental role in handling heterogeneous modalities? 🧵
1
9
41
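For readers unfamiliar with the distinction studied above, a schematic sketch of early vs late fusion: early fusion feeds both modalities into one backbone from the first layer, while late fusion routes images through a modality-specific encoder before merging. Module sizes and names are invented, not the paper's architectures.

```python
import torch
import torch.nn as nn

def encoder(depth: int, d: int = 256) -> nn.TransformerEncoder:
    layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=depth)

class EarlyFusion(nn.Module):
    """Image patches and text tokens meet in one backbone from the first layer."""
    def __init__(self, d: int = 256):
        super().__init__()
        self.patch_proj = nn.Linear(768, d)       # raw patches -> model width
        self.tok_embed = nn.Embedding(32_000, d)
        self.backbone = encoder(depth=4, d=d)     # a single joint stack

    def forward(self, patches, tokens):
        x = torch.cat([self.patch_proj(patches), self.tok_embed(tokens)], dim=1)
        return self.backbone(x)

class LateFusion(nn.Module):
    """A modality-specific vision encoder runs first; fusion happens afterwards."""
    def __init__(self, d: int = 256):
        super().__init__()
        self.patch_proj = nn.Linear(768, d)
        self.vision_encoder = encoder(depth=2, d=d)  # image-only parameters
        self.tok_embed = nn.Embedding(32_000, d)
        self.backbone = encoder(depth=2, d=d)        # shallower shared stack

    def forward(self, patches, tokens):
        img = self.vision_encoder(self.patch_proj(patches))
        x = torch.cat([img, self.tok_embed(tokens)], dim=1)
        return self.backbone(x)

patches, tokens = torch.randn(2, 49, 768), torch.randint(0, 32_000, (2, 16))
print(EarlyFusion()(patches, tokens).shape, LateFusion()(patches, tokens).shape)
```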
Parameterized Transforms: here is a new tool that provides a modular and extendable implementation of torchvision-based image augmentations with access to their parameterization. [1/5]
1
9
15
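To illustrate what access to an augmentation's parameterization can look like, here is a hand-rolled wrapper around a torchvision transform that returns the sampled crop parameters alongside the augmented image. This is not the Parameterized Transforms API, just a sketch of the underlying idea.

```python
import torch
import torchvision.transforms.functional as TF
from torchvision.transforms import RandomResizedCrop

class ParameterizedRandomResizedCrop:
    """Hypothetical wrapper: sample the crop parameters explicitly and
    return them alongside the augmented image, rather than hiding them."""

    def __init__(self, size: int):
        self.size = [size, size]
        self.sampler = RandomResizedCrop(size)

    def __call__(self, img):
        i, j, h, w = self.sampler.get_params(img, self.sampler.scale, self.sampler.ratio)
        out = TF.resized_crop(img, i, j, h, w, self.size)
        # The parameters can now be logged, reused, or fed to a model.
        return out, {"top": i, "left": j, "height": h, "width": w}

img = torch.rand(3, 224, 224)
aug, params = ParameterizedRandomResizedCrop(96)(img)
print(aug.shape, params)
```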
We do observe that students improve with longer distillation training (i.e., patient teaching works). Additionally, with particularly long distillation durations, we approach what supervised learning can achieve (as limited by model capacity, in our experimental setting).
0
0
3
In contrast, our teachers and students are trained on the same data distribution, and we compare with supervised models that can access the same distribution. This lets us make statements about what distillation can do given access to the same resources.
1
0
3
I.e., Beyer et al.'s students do not see the teacher training distribution, and there is no supervised baseline where, for example, a model has access to both INet21k and the smaller datasets. This type of comparison was not the focus of their work.
1
0
1