Reece Shuttleworth Profile
Reece Shuttleworth

@ReeceShuttle

Followers: 349 · Following: 181 · Media: 5 · Statuses: 9

MIT '25

Joined July 2022
@elonmusk
Elon Musk
1 month
@StefanoErmon @_inception_ai Diffusion will obviously work on any bitstream. With text, since humans read from first word to last, there is just the question of whether the delay to first sentence for diffusion is worth it. That said, the vast majority of AI workload will be video understanding and
134
194
2K
@ReeceShuttle
Reece Shuttleworth
2 months
Huge thank you to Pratyusha Sharma (@pratyusha_PS), Jacob Andreas (@jacobandreas), and Antonio Torralba for their collaboration on this work! See code here:
2
1
14
@ReeceShuttle
Reece Shuttleworth
2 months
Really cool to see @thinkymachines exploring similar ideas around LoRA recently! Check out our paper to see our other detailed investigations of diverse topics: How do LoRA initialization and learning rate impact learning? What role does LoRA’s alpha parameter and the
1
0
13
@ReeceShuttle
Reece Shuttleworth
2 months
If intruder dimensions interfere with previous knowledge, exaggerating their presence should cause more forgetting. We test this on continual learning: as tasks are learned sequentially, intruder dimensions accumulate — and forgetting of previous tasks accelerates. The more
2
2
15
@ReeceShuttle
Reece Shuttleworth
2 months
To test the impact of this structural difference, we run an intervention: scale down intruder dimensions by a factor, for example 0.7×. Result: forgetting (pre-training loss) drops significantly, while test accuracy barely changes. This shows that intruder dimensions ‘interfere’ with the
1
2
15
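A rough sketch of what this kind of intervention could look like, assuming access to the pre-trained and fine-tuned versions of the same weight matrix. The 0.7× factor comes from the tweet; the similarity threshold, variable names, and identification rule are illustrative assumptions, not the paper's actual code:

```python
import torch

def scale_intruder_dimensions(W_pre: torch.Tensor,
                              W_ft: torch.Tensor,
                              factor: float = 0.7,
                              threshold: float = 0.5) -> torch.Tensor:
    """Shrink the singular directions of W_ft that match no pre-trained direction.

    A singular vector of W_ft is treated as an "intruder" here if its best
    cosine similarity against the pre-trained singular vectors falls below
    `threshold` (an assumed criterion for this sketch).
    """
    U_pre, _, _ = torch.linalg.svd(W_pre, full_matrices=False)
    U_ft, S_ft, Vh_ft = torch.linalg.svd(W_ft, full_matrices=False)
    # Columns of U are unit vectors, so these dot products are cosine similarities.
    best_match = (U_ft.T @ U_pre).abs().max(dim=1).values
    # Scale down only the singular values of the intruder directions.
    S_scaled = torch.where(best_match < threshold, S_ft * factor, S_ft)
    return U_ft @ torch.diag(S_scaled) @ Vh_ft
```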
@ReeceShuttle
Reece Shuttleworth
2 months
Next, we analyze the behavioral differences between these models. LoRA forgets less even when both methods perform equally on the fine-tuning task. This extends findings from @DbrxMosaicAI, but here's the key: the difference isn't just because LoRA is underfit. It's because LoRA
1
2
15
@ReeceShuttle
Reece Shuttleworth
2 months
First, let's look at their structural differences. When we compare singular vectors between pre-trained and fine-tuned weight matrices, there's a striking difference (see image in Tweet 1). Full fine-tuning: the similarity matrix is clearly diagonal. LoRA: the similarity matrix is clearly
1
2
20
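A minimal sketch of this comparison, assuming `W_pre` and `W_ft` are the pre-trained and fine-tuned versions of the same weight matrix; the random toy matrices at the end are stand-ins, not the paper's setup:

```python
import torch

def svd_similarity_matrix(W_pre: torch.Tensor, W_ft: torch.Tensor) -> torch.Tensor:
    """|cosine similarity| between the left singular vectors of two matrices.

    A near-diagonal result means fine-tuning preserved the pre-trained singular
    vectors; rows with no strong entry correspond to new, nearly orthogonal
    directions ("intruder dimensions").
    """
    U_pre, _, _ = torch.linalg.svd(W_pre, full_matrices=False)
    U_ft, _, _ = torch.linalg.svd(W_ft, full_matrices=False)
    return (U_ft.T @ U_pre).abs()

# Toy usage with random stand-ins; in practice W_pre and W_ft would be the same
# layer's weight taken from the pre-trained and fine-tuned checkpoints.
W_pre = torch.randn(128, 128)
W_ft = W_pre + torch.randn(128, 4) @ torch.randn(4, 128)  # LoRA-like low-rank update
sim = svd_similarity_matrix(W_pre, W_ft)
# Count rows whose best pre-trained match is weak (candidate intruder dimensions).
print((sim.max(dim=1).values < 0.5).sum().item())
```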
@ReeceShuttle
Reece Shuttleworth
2 months
🧵 LoRA vs full fine-tuning: same performance ≠ same solution. Our NeurIPS ‘25 paper 🎉 shows that LoRA and full fine-tuning, even when equally well fit, learn structurally different solutions, and that LoRA forgets less and can be made even better (less forgetting) by a simple
18
245
2K
@vedanglad
Vedang Lad
1 year
1/7 Ever wondered what happens when you permute the layers of a language model? In our recent paper with @tegmark, we swap and delete entire layers to understand how models perform inference; in doing so, we see signs of four universal stages of inference!
21
92
551
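Not the authors' code, but a quick way to get a feel for this kind of intervention using Hugging Face's GPT-2 checkpoint, whose decoder blocks are exposed as a ModuleList; the choice of model, layers, and probe sentence is purely illustrative:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    """Perplexity of the current model on a probe sentence."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

text = "The quick brown fox jumps over the lazy dog."
print("original order:", perplexity(text))

# GPT-2's decoder blocks live in model.transformer.h (an nn.ModuleList),
# so permuting layers is just rearranging that list.
h = model.transformer.h
h[5], h[6] = h[6], h[5]
print("layers 5 and 6 swapped:", perplexity(text))
h[5], h[6] = h[6], h[5]  # restore the original order

# Deleting a layer: rebuild the ModuleList without it.
model.transformer.h = torch.nn.ModuleList(
    [block for i, block in enumerate(h) if i != 6]
)
print("layer 6 deleted:", perplexity(text))
```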