Dan Busbridge

@danbusbridge

Followers: 878
Following: 1K
Media: 34
Statuses: 165

Machine Learning Research @ Apple (opinions are my own)

London, United Kingdom
Joined November 2014
@danbusbridge
Dan Busbridge
8 months
Reading "Distilling the Knowledge in a Neural Network" left me fascinated and wondering: "If I want a small, capable model, should I distill from a more powerful model, or train from scratch?" Our distillation scaling law shows, well, it's complicated... 🧵 https://t.co/b1uuyJwzRF
arxiv.org
We propose a distillation scaling law that estimates distilled model performance based on a compute budget and its allocation between the student and teacher. Our findings mitigate the risks...
12
150
1K
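For intuition only, here is a minimal sketch of the kind of fit a scaling law involves. The parametric form, coefficients, and synthetic data below are illustrative assumptions, not the distillation scaling law from the paper.

```python
# Toy sketch only: NOT the paper's distillation scaling law. We assume a
# simple parametric form for student loss as a function of student size and
# teacher loss, and fit it to synthetic "small-scale run" data.
import numpy as np
from scipy.optimize import curve_fit

def student_loss(X, E, A, alpha, gamma):
    n_student, l_teacher = X          # student parameter count, teacher cross-entropy
    return E + A / n_student**alpha + gamma * l_teacher

rng = np.random.default_rng(0)
n_student = rng.uniform(1e7, 1e9, 64)             # hypothetical student sizes
l_teacher = rng.uniform(2.0, 3.0, 64)             # hypothetical teacher losses
y = student_loss((n_student, l_teacher), 1.7, 400.0, 0.34, 0.05)
y += rng.normal(0.0, 0.01, 64)                    # observation noise

popt, _ = curve_fit(student_loss, (n_student, l_teacher), y,
                    p0=[1.5, 300.0, 0.3, 0.1], maxfev=20000)
print("fitted (E, A, alpha, gamma):", popt)
```

The point of a law like this is that, once fitted on cheap runs, it can be queried to decide whether distilling from a stronger teacher or training from scratch is predicted to give the lower student loss for a given budget.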
@danbusbridge
Dan Busbridge
2 months
Uncertainty methods and correctness metrics often share "mutual bias" (systematic errors from a common confounder like response length), skewing LLM evaluations. New paper from my colleagues shows that "LM-as-a-judge" evaluation is more robust and human-aligned. Important work -
@teelinsan
Andrea Santilli
3 months
Uncertainty quantification (UQ) is key for safe, reliable LLMs... but are we evaluating it correctly? 🚨 Our ACL2025 paper finds a hidden flaw: if both UQ methods and correctness metrics are biased by the same factor (e.g., response length), evaluations get systematically skewed
0
1
12
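To make the failure mode concrete, here is a small simulation of the "mutual bias" idea; all variable names and numbers are illustrative assumptions, not taken from the paper. When response length drives both the uncertainty score and the correctness metric, the uncertainty method looks much better against the biased metric than against an unbiased one.

```python
# Illustrative simulation of "mutual bias": a shared confounder (response
# length) inflates how well an uncertainty score appears to predict errors.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 5000
length = rng.normal(0.0, 1.0, n)           # confounder: standardised response length
true_quality = rng.normal(0.0, 1.0, n)     # latent answer quality, independent of length

# Uncertainty score that is mostly length, only weakly tied to quality.
uncertainty = -0.2 * true_quality + 0.8 * length + rng.normal(0, 0.3, n)

# A length-biased correctness metric vs. an unbiased one.
biased_correct = (0.2 * true_quality - 0.8 * length + rng.normal(0, 0.3, n)) > 0
clean_correct = (true_quality + rng.normal(0, 0.3, n)) > 0

# AUROC of "uncertainty predicts incorrectness": inflated under the shared bias.
print("AUROC vs length-biased metric:", roc_auc_score(~biased_correct, uncertainty))
print("AUROC vs unbiased metric:     ", roc_auc_score(~clean_correct, uncertainty))
```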
@danbusbridge
Dan Busbridge
3 months
Happening now in East Exhibition Hall E-2310, with @AmitisShidani1, looking forward to discussing our work!
1
1
10
@danbusbridge
Dan Busbridge
3 months
Data mixtures are crucial for achieving strong pre-trained models. Loved collaborating on this project led by @PierreAblin and @MustafaShukor1 tackling data mixing ratios through the lens of scaling laws. Check out @MustafaShukor1's 🧵.
@MustafaShukor1
Mustafa Shukor
3 months
We propose new scaling laws that predict the optimal data mixture for pretraining LLMs, native multimodal models, and large vision encoders! Only small-scale experiments are needed, and we can then extrapolate to large-scale ones. These laws allow 1/n 🧵
1
5
19
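As a rough illustration of the recipe described above (fit at small scale, then extrapolate), here is a toy sketch. The functional form, the quadratic mixture dependence, and all numbers are assumptions for illustration, not the laws proposed in the paper.

```python
# Toy sketch: fit a simple loss model L(N, w) = E + A(w) / N**alpha on
# small-scale runs over a few mixture weights w, then pick the weight that
# minimises the extrapolated loss at a large target model size.
import numpy as np
from scipy.optimize import curve_fit

def loss_model(X, E, a0, a1, a2, alpha):
    N, w = X
    A = a0 + a1 * w + a2 * w**2        # mixture-dependent coefficient (assumed quadratic)
    return E + A / N**alpha

# Synthetic small-scale runs: a grid of model sizes and weights for "domain A".
rng = np.random.default_rng(0)
N = np.repeat([1e7, 3e7, 1e8], 5)
w = np.tile(np.linspace(0.1, 0.9, 5), 3)
y = loss_model((N, w), 1.8, 600.0, -400.0, 350.0, 0.32) + rng.normal(0, 0.005, N.size)

popt, _ = curve_fit(loss_model, (N, w), y, p0=[1.5, 500, -300, 300, 0.3], maxfev=20000)

# Extrapolate to a large target size and choose the predicted-optimal mixture.
w_grid = np.linspace(0.05, 0.95, 19)
pred = loss_model((np.full_like(w_grid, 1e10), w_grid), *popt)
print("predicted optimal weight for domain A:", w_grid[np.argmin(pred)])
```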
@danbusbridge
Dan Busbridge
3 months
Happening in 30 minutes in West Ballroom A - looking forward to sharing our work on Distillation Scaling Laws!
@danbusbridge
Dan Busbridge
3 months
Excited to be heading to Vancouver for #ICML2025 next week! I'll be giving a deep dive on Distillation Scaling Laws at the expo - exploring when and how small models can match the performance of large ones. 📍 Sunday, July 13, 5pm, West Ballroom A 🔗 https://t.co/yNd5eZByHR
1
8
102
@danbusbridge
Dan Busbridge
3 months
@AmitisShidani1 @samira_abnar @harshays_ @alaa_nouby @AggieInCA @LouisBAlgue @PierreAblin Here's an Apple@ICML guide with all our talks, posters, and booth events: 🔗 https://t.co/fEkTYVZIo1 Come say hi if you're around, always happy to chat. Looking forward to a week of great research, and catching up with familiar faces (and meeting new ones too).
machinelearning.apple.com
Apple is presenting new research at the International Conference on Machine Learning (ICML 2025), which takes place in person in Vancouver…
0
1
3
@danbusbridge
Dan Busbridge
3 months
@AmitisShidani1 @samira_abnar @harshays_ @alaa_nouby @AggieInCA and Scaling Laws for Forgetting and Fine-Tuning (E-2708) with @LouisBAlgue, David Grangier, Eleonora Gualdoni, Marco Cuturi, and @PierreAblin 🔗 https://t.co/c8xqFTf3ZE
1
1
3
@danbusbridge
Dan Busbridge
3 months
@AmitisShidani1 Also lucky to be co-authoring two more posters during the same session with my awesome colleagues: Parameters vs FLOPs for MoEs (E-2810) with @samira_abnar, @harshays_, @alaa_nouby, Josh Susskind, and @AggieInCA 🔗 https://t.co/9ecG5Gy0uN
1
1
1
@danbusbridge
Dan Busbridge
3 months
Also presenting this work as a poster with @AmitisShidani1: 📍 Wednesday, 11am, East Exhibition Hall A-B (E-2310) 🔗 https://t.co/y0XiR0aeJe
1
0
2
@danbusbridge
Dan Busbridge
3 months
Distillation has long promised smaller, faster, more efficient models - but its scaling behaviour is still poorly understood. We present a new distillation scaling law that helps turn this black-box art into more predictable science. https://t.co/WN18b3s6eV
@danbusbridge
Dan Busbridge
8 months
Reading "Distilling the Knowledge in a Neural Network" left me fascinated and wondering: "If I want a small, capable model, should I distill from a more powerful model, or train from scratch?" Our distillation scaling law shows, well, it's complicated... 🧵 https://t.co/b1uuyJwzRF
1
0
2
@danbusbridge
Dan Busbridge
3 months
Excited to be heading to Vancouver for #ICML2025 next week! I'll be giving a deep dive on Distillation Scaling Laws at the expo - exploring when and how small models can match the performance of large ones. 📍 Sunday, July 13, 5pm, West Ballroom A 🔗 https://t.co/yNd5eZByHR
3
4
28
@jramapuram
Jason Ramapuram
6 months
Stop by poster #596 at 10A-1230P tomorrow (Fri 25 April) at #ICLR2025 to hear more about Sigmoid Attention! We just pushed 8 trajectory checkpoints each for two 7B LLMs for Sigmoid Attention and a 1:1 Softmax Attention (trained with a deterministic dataloader for 1T tokens): -
@jramapuram
Jason Ramapuram
9 months
Small update on SigmoidAttn (arXiv incoming). - 1B and 7B LLM results added and stabilized. - Hybrid Norm [on embed dim, not seq dim], `x + norm(sigmoid(QK^T / sqrt(d_{qk}))V)`, stabilizes longer sequences (n=4096) and larger models (7B). H-norm used with Grok-1 for example.
1
14
45
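For anyone wanting to see the quoted residual form in code, here is a minimal single-head PyTorch sketch of `x + norm(sigmoid(QK^T / sqrt(d_qk))V)` with the norm applied over the embedding dimension. The single head, projection sizes, choice of LayerNorm, and absence of masking or a sigmoid bias term are simplifying assumptions, not details from the SigmoidAttn paper.

```python
# Minimal sketch of sigmoid attention with a hybrid-norm residual:
# x + norm(sigmoid(QK^T / sqrt(d_qk)) V), norm taken over the embed dim.
import math
import torch
import torch.nn as nn

class SigmoidAttentionBlock(nn.Module):
    def __init__(self, d_model: int, d_qk: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_qk)
        self.k_proj = nn.Linear(d_model, d_qk)
        self.v_proj = nn.Linear(d_model, d_model)
        self.norm = nn.LayerNorm(d_model)   # normalises over the embedding dimension
        self.d_qk = d_qk

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_qk)   # (batch, seq, seq)
        attn = torch.sigmoid(scores)        # element-wise sigmoid instead of softmax
        return x + self.norm(attn @ v)      # residual with norm on the attention output

x = torch.randn(2, 16, 64)
print(SigmoidAttentionBlock(d_model=64, d_qk=32)(x).shape)
```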
@danbusbridge
Dan Busbridge
6 months
I've been curious about how early vs late-fusion multimodal approaches compare in controlled conditions. Great to see this studied in depth. Turns out, optimal late fusion has a higher params-to-data ratio, and performance between early and late fusion is similar. Brilliant work from
@MustafaShukor1
Mustafa Shukor
6 months
We release a large-scale study to answer the following: - Is late fusion inherently better than early fusion for multimodal models? - How do native multimodal models scale compared to LLMs? - How can sparsity (MoEs) play a detrimental role in handling heterogeneous modalities? 🧵
1
9
41
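For readers unfamiliar with the terms, here is a schematic PyTorch sketch of the early- vs late-fusion distinction being compared. The toy depths, widths, and concatenation-based fusion are assumptions for illustration, not the architectures studied in the paper.

```python
# Early fusion: one shared trunk sees the concatenated token streams.
# Late fusion: modality-specific encoders first, then fusion layers on top.
import torch
import torch.nn as nn

def backbone(d, depth):
    layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=depth)

class EarlyFusion(nn.Module):
    def __init__(self, d=64):
        super().__init__()
        self.trunk = backbone(d, depth=4)

    def forward(self, text_tokens, image_tokens):
        return self.trunk(torch.cat([text_tokens, image_tokens], dim=1))

class LateFusion(nn.Module):
    def __init__(self, d=64):
        super().__init__()
        self.text_enc = backbone(d, depth=2)
        self.image_enc = backbone(d, depth=2)
        self.fusion = backbone(d, depth=2)

    def forward(self, text_tokens, image_tokens):
        fused = torch.cat([self.text_enc(text_tokens), self.image_enc(image_tokens)], dim=1)
        return self.fusion(fused)

t, i = torch.randn(2, 8, 64), torch.randn(2, 16, 64)
print(EarlyFusion()(t, i).shape, LateFusion()(t, i).shape)
```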
@EeshanDhekane
Eeshan Gunesh Dhekane
8 months
Parameterized Transforms 🚀 Here is a new tool providing a modular, extendable implementation of torchvision-based image augmentations with access to their parameterization. [1/5]
1
9
15
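A sketch of the underlying idea (the class and method names below are hypothetical, not the library's actual API): an augmentation that exposes the parameters it samples, so the transformation can be inspected, logged, or replayed on a paired input.

```python
# Hypothetical illustration of a parameterized augmentation, not the real API:
# the sampled parameters are returned alongside the transformed image.
import torch
import torchvision.transforms.functional as F

class ParameterizedRotation:
    """Rotate by a random angle and return (image, params) so the angle is reusable."""

    def __init__(self, max_degrees: float = 30.0):
        self.max_degrees = max_degrees

    def sample_params(self) -> dict:
        angle = (torch.rand(()) * 2 - 1).item() * self.max_degrees
        return {"angle": angle}

    def apply(self, image: torch.Tensor, params: dict) -> torch.Tensor:
        return F.rotate(image, params["angle"])

    def __call__(self, image: torch.Tensor):
        params = self.sample_params()
        return self.apply(image, params), params

img = torch.rand(3, 32, 32)
augmented, params = ParameterizedRotation()(img)
print(augmented.shape, params)
```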
@danbusbridge
Dan Busbridge
8 months
We do observe that students improve with longer distillation training (i.e., patient teaching works). Additionally, with particularly long distillation durations, we approach what supervised learning can achieve (as limited by model capacity, in our experimental setting).
0
0
3
@danbusbridge
Dan Busbridge
8 months
In contrast, our teachers and students are trained on the same data distribution, and we compare with supervised models that can access the same distribution. This lets us make statements about what distillation can do given access to the same resources.
1
0
3
@danbusbridge
Dan Busbridge
8 months
I.e., Beyer et al.'s students do not see the teacher training distribution, and there is no supervised baseline where, for example, a model has access to both INet21k and the smaller datasets. This type of comparison was not the focus of their work.
1
0
1