
Dan Busbridge
@danbusbridge
Followers
878
Following
1K
Media
34
Statuses
165
Machine Learning Research @ Apple (opinions are my own)
London, United Kingdom
Joined November 2014
Reading "Distilling Knowledge in a Neural Network" left me fascinated and wondering: "If I want a small, capable model, should I distill from a more powerful model, or train from scratch?" Our distillation scaling law shows, well, it's complicated... π§΅ https://t.co/b1uuyJwzRF
arxiv.org
We propose a distillation scaling law that estimates distilled model performance based on a compute budget and its allocation between the student and teacher. Our findings mitigate the risks...
12
150
1K
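For reference, a minimal sketch of the classic knowledge-distillation objective from the Hinton et al. paper mentioned above: hard-label cross-entropy blended with a temperature-softened KL term towards the teacher. The hyperparameters are illustrative, and this is the generic formulation rather than the exact setup used in the scaling-law paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with a KL term that matches the
    student to the teacher's temperature-softened output distribution."""
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    student_log_probs = F.log_softmax(student_logits / T, dim=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    kd_term = F.kl_div(student_log_probs, soft_targets, reduction="batchmean") * (T * T)
    ce_term = F.cross_entropy(student_logits, labels)
    return alpha * kd_term + (1.0 - alpha) * ce_term

# Usage sketch with random tensors standing in for model outputs.
student_logits = torch.randn(8, 1000)
teacher_logits = torch.randn(8, 1000)
labels = torch.randint(0, 1000, (8,))
print(distillation_loss(student_logits, teacher_logits, labels))
```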
Uncertainty methods and correctness metrics often share "mutual bias" (systematic errors from a common confounder like response length), skewing LLM evaluations. New paper from my colleagues shows that "LM-as-a-judge" evaluation is more robust and human-aligned. Important work -
Uncertainty quantification (UQ) is key for safe, reliable LLMs... but are we evaluating it correctly? 🚨 Our ACL2025 paper finds a hidden flaw: if both UQ methods and correctness metrics are biased by the same factor (e.g., response length), evaluations get systematically skewed
0
1
12
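A toy simulation of the "mutual bias" effect described above, not from the paper: an uncertainty score with no genuine signal about correctness looks predictive simply because both it and the correctness metric are driven by response length, and removing the shared confounder makes the effect vanish. All numbers are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical confounder: response length.
length = rng.normal(size=n)

# A correctness metric that spuriously rewards longer responses.
correctness = 0.7 * length + rng.normal(size=n)

# An "uncertainty" score with no real signal about correctness,
# but also driven by length (longer answers -> lower uncertainty).
uncertainty = -0.7 * length + rng.normal(size=n)

# Evaluated naively, the uninformative UQ method looks predictive...
print("corr(uncertainty, correctness):",
      np.corrcoef(uncertainty, correctness)[0, 1])

# ...but residualising out length removes the apparent relationship.
resid_corr = correctness - np.polyval(np.polyfit(length, correctness, 1), length)
resid_unc = uncertainty - np.polyval(np.polyfit(length, uncertainty, 1), length)
print("corr after removing length:",
      np.corrcoef(resid_unc, resid_corr)[0, 1])
```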
Happening now in East Exhibition Hall E-2310, with @AmitisShidani1, looking forward to discussing our work!
1
1
10
Data mixtures are crucial for achieving strong pre-trained models. Loved collaborating on this project led by @PierreAblin and @MustafaShukor1 tackling data mixing ratios through the lens of scaling laws. Check out @MustafaShukor1's 🧵.
We propose new scaling laws that predict the optimal data mixture for pretraining LLMs, native multimodal models, and large vision encoders! Only small-scale experiments are needed, and we can then extrapolate to large-scale ones. These laws allow 1/n 🧵
1
5
19
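A toy sketch of the general recipe the thread above describes: fit scaling laws on small-scale runs, then extrapolate to larger scale. Here each candidate mixture weight gets its own saturating power law in the token budget; the paper fits mixture weights jointly with a richer functional form, and all numbers below are invented.

```python
import numpy as np
from scipy.optimize import curve_fit

# Invented small-scale runs: token budget N -> validation loss, for three
# candidate mixture weights w of "domain A" (the remainder being "domain B").
small_runs = {
    0.1: [(1e8, 3.61), (3e8, 3.39), (1e9, 3.20)],
    0.5: [(1e8, 3.51), (3e8, 3.31), (1e9, 3.13)],
    0.9: [(1e8, 3.61), (3e8, 3.39), (1e9, 3.20)],
}

def power_law(N, E, A, alpha):
    # Saturating power law: irreducible loss E plus a term decaying in N.
    return E + A / N ** alpha

predicted = {}
for w, runs in small_runs.items():
    N, loss = map(np.array, zip(*runs))
    popt, _ = curve_fit(power_law, N, loss, p0=[2.5, 40.0, 0.2], maxfev=20_000)
    # Extrapolate each fitted law to a much larger, unseen token budget.
    predicted[w] = power_law(1e11, *popt)

print("predicted losses at N=1e11:", predicted)
print("predicted best mixture weight:", min(predicted, key=predicted.get))
```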
Happening in 30 minutes in West Ballroom A - looking forward to sharing our work on Distillation Scaling Laws!
Excited to be heading to Vancouver for #ICML2025 next week! I'll be giving a deep dive on Distillation Scaling Laws at the expo, exploring when and how small models can match the performance of large ones. Sunday, July 13, 5pm, West Ballroom A: https://t.co/yNd5eZByHR
1
8
102
@AmitisShidani1 @samira_abnar @harshays_ @alaa_nouby @AggieInCA @LouisBAlgue @PierreAblin Here's an Apple@ICML guide with all our talks, posters, and booth events: https://t.co/fEkTYVZIo1 Come say hi if you're around, always happy to chat. Looking forward to a week of great research, and catching up with familiar faces (and meeting new ones too).
machinelearning.apple.com
Apple is presenting new research at the International Conference on Machine Learning (ICML 2025), which takes place in person in Vancouver…
0
1
3
@AmitisShidani1 @samira_abnar @harshays_ @alaa_nouby @AggieInCA and Scaling Laws for Forgetting and Fine-Tuning (E-2708) with @LouisBAlgue, David Grangier, Eleonora Gualdoni, Marco Cuturi, and @PierreAblin: https://t.co/c8xqFTf3ZE
1
1
3
@AmitisShidani1 Also lucky to be co-authoring two more posters during the same session with my awesome colleagues: Parameters vs FLOPs for MoEs (E-2810) with @samira_abnar, @harshays_, @alaa_nouby, Josh Susskind, and @AggieInCA: https://t.co/9ecG5Gy0uN
1
1
1
Also presenting this work as a poster with @AmitisShidani1: Wednesday, 11am, East Exhibition Hall A-B (E-2310). https://t.co/y0XiR0aeJe
1
0
2
Distillation has long promised smaller, faster, more efficient models, but its scaling behaviour is still poorly understood. We present a new distillation scaling law that helps turn this black-box art into more predictable science. https://t.co/WN18b3s6eV
Reading "Distilling Knowledge in a Neural Network" left me fascinated and wondering: "If I want a small, capable model, should I distill from a more powerful model, or train from scratch?" Our distillation scaling law shows, well, it's complicated... π§΅ https://t.co/b1uuyJwzRF
1
0
2
Excited to be heading to Vancouver for #ICML2025 next week! I'll be giving a deep dive on Distillation Scaling Laws at the expo, exploring when and how small models can match the performance of large ones. Sunday, July 13, 5pm, West Ballroom A: https://t.co/yNd5eZByHR
3
4
28
Stop by poster #596 from 10am-12:30pm tomorrow (Fri 25 April) at #ICLR2025 to hear more about Sigmoid Attention! We just pushed 8 trajectory checkpoints each for two 7B LLMs: a Sigmoid Attention model and a 1:1 Softmax Attention baseline (trained with a deterministic dataloader for 1T tokens): -
Small update on SigmoidAttn (arXiv update incoming). - 1B and 7B LLM results added and stabilized. - Hybrid Norm [on embed dim, not seq dim], `x + norm(sigmoid(QK^T / sqrt(d_{qk}))V)`, stabilizes longer sequences (n=4096) and larger models (7B). H-norm is used with Grok-1, for example.
1
14
45
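A minimal PyTorch sketch of the hybrid-norm residual quoted above, `x + norm(sigmoid(QK^T / sqrt(d_{qk}))V)`: a single head with no masking, multi-head handling, or bias terms, and the module names and shapes are illustrative only.

```python
import math
import torch
import torch.nn as nn

class HybridNormSigmoidAttention(nn.Module):
    """Single-head sketch of x + norm(sigmoid(Q K^T / sqrt(d_qk)) V),
    with the norm applied over the embedding dimension (not the sequence
    dimension), as described in the tweet above."""

    def __init__(self, d_model: int, d_qk: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_qk)
        self.k_proj = nn.Linear(d_model, d_qk)
        self.v_proj = nn.Linear(d_model, d_model)
        self.norm = nn.LayerNorm(d_model)  # normalises the embedding dim
        self.d_qk = d_qk

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        # Element-wise sigmoid in place of the usual row-wise softmax.
        attn = torch.sigmoid(q @ k.transpose(-2, -1) / math.sqrt(self.d_qk))
        # Hybrid norm: residual connection around the normalised attention output.
        return x + self.norm(attn @ v)

block = HybridNormSigmoidAttention(d_model=64, d_qk=64)
print(block(torch.randn(2, 16, 64)).shape)  # torch.Size([2, 16, 64])
```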
I've been curious about how early vs late-fusion multimodal approaches compare in controlled conditions. Great to see this studied in depth. Turns out, optimal late fusion has a higher params-to-data ratio, and performance between early and late fusion is similar. Brilliant work from
We release a large-scale study to answer the following: - Is late fusion inherently better than early fusion for multimodal models? - How do native multimodal models scale compared to LLMs? - Can sparsity (MoEs) play a detrimental role in handling heterogeneous modalities? 🧵
1
9
41
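For readers unfamiliar with the distinction studied above, a schematic sketch of early vs late fusion: early fusion feeds both modalities into one backbone from the first layer, while late fusion routes images through a modality-specific encoder before merging. Module sizes and names are invented, not the paper's architectures.

```python
import torch
import torch.nn as nn

def encoder(depth: int, d: int = 256) -> nn.TransformerEncoder:
    layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=depth)

class EarlyFusion(nn.Module):
    """Image patches and text tokens meet in one backbone from the first layer."""
    def __init__(self, d: int = 256):
        super().__init__()
        self.patch_proj = nn.Linear(768, d)       # raw patches -> model width
        self.tok_embed = nn.Embedding(32_000, d)
        self.backbone = encoder(depth=4, d=d)     # a single joint stack

    def forward(self, patches, tokens):
        x = torch.cat([self.patch_proj(patches), self.tok_embed(tokens)], dim=1)
        return self.backbone(x)

class LateFusion(nn.Module):
    """A modality-specific vision encoder runs first; fusion happens afterwards."""
    def __init__(self, d: int = 256):
        super().__init__()
        self.patch_proj = nn.Linear(768, d)
        self.vision_encoder = encoder(depth=2, d=d)  # image-only parameters
        self.tok_embed = nn.Embedding(32_000, d)
        self.backbone = encoder(depth=2, d=d)        # shallower shared stack

    def forward(self, patches, tokens):
        img = self.vision_encoder(self.patch_proj(patches))
        x = torch.cat([img, self.tok_embed(tokens)], dim=1)
        return self.backbone(x)

patches, tokens = torch.randn(2, 49, 768), torch.randint(0, 32_000, (2, 16))
print(EarlyFusion()(patches, tokens).shape, LateFusion()(patches, tokens).shape)
```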
Parameterized Transforms: here is a new tool that provides a modular and extendable implementation of torchvision-based image augmentations with access to their parameterization. [1/5]
1
9
15
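To illustrate what access to an augmentation's parameterization can look like, here is a hand-rolled wrapper around a torchvision transform that returns the sampled crop parameters alongside the augmented image. This is not the Parameterized Transforms API, just a sketch of the underlying idea.

```python
import torch
import torchvision.transforms.functional as TF
from torchvision.transforms import RandomResizedCrop

class ParameterizedRandomResizedCrop:
    """Hypothetical wrapper: sample the crop parameters explicitly and
    return them alongside the augmented image, rather than hiding them."""

    def __init__(self, size: int):
        self.size = [size, size]
        self.sampler = RandomResizedCrop(size)

    def __call__(self, img):
        i, j, h, w = self.sampler.get_params(img, self.sampler.scale, self.sampler.ratio)
        out = TF.resized_crop(img, i, j, h, w, self.size)
        # The parameters can now be logged, reused, or fed to a model.
        return out, {"top": i, "left": j, "height": h, "width": w}

img = torch.rand(3, 224, 224)
aug, params = ParameterizedRandomResizedCrop(96)(img)
print(aug.shape, params)
```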
We do observe that students improve with longer distillation training (i.e., patient teaching works). Additionally, with particularly long distillation durations, we approach what supervised learning can achieve (as limited by model capacity, in our experimental setting).
0
0
3
In contrast, our teachers and students are trained on the same data distribution, and we compare with supervised models that can access the same distribution. This lets us make statements about what distillation can do given access to the same resources.
1
0
3
I.e., Beyer et al.'s students do not see the teacher training distribution, and there is no supervised baseline where, for example, a model has access to both INet21k and the smaller datasets. This type of comparison was not the focus of their work.
1
0
1