Reece Shuttleworth
@ReeceShuttle
349 Followers · 181 Following · 5 Media · 9 Statuses
@StefanoErmon @_inception_ai Diffusion will obviously work on any bitstream. With text, since humans read from first word to last, there is just the question of whether the delay to first sentence for diffusion is worth it. That said, the vast majority of AI workload will be video understanding and …
134 · 194 · 2K
Huge thank you to Pratyusha Sharma (@pratyusha_PS), Jacob Andreas (@jacobandreas), and Antonio Torralba for their collaboration on this work! See code here:
2 · 1 · 14
Really cool to see @thinkymachines exploring similar ideas around LoRA recently! Check out our paper for our other detailed investigations: How do LoRA initialization and learning rate impact learning? What role does LoRA’s alpha parameter and the …
1 · 0 · 13
If intruder dimensions interfere with previous knowledge, exaggerating their presence should cause more forgetting. We test this in continual learning: as tasks are learned sequentially, intruder dimensions accumulate, and forgetting of previous tasks accelerates. The more …
2 · 2 · 15
To test the impact of this structural difference, we run an intervention: scale down the intruder dimensions (for example, by 0.7×). Result: forgetting (pre-training loss) drops significantly, while test accuracy barely changes. This shows that intruder dimensions ‘interfere’ with the …
1 · 2 · 15
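As a rough illustration of the intervention described above, here is a minimal PyTorch sketch of scaling down intruder dimensions for a single weight matrix. The 0.7× factor comes from the tweet; the cosine-similarity threshold, the use of left singular vectors, and the exact criterion for flagging a direction as an intruder are assumptions for illustration, not the paper's exact recipe.

```python
import torch

def scale_intruder_dimensions(W_pre, W_ft, sim_threshold=0.6, scale=0.7):
    """Damp singular directions of a fine-tuned weight matrix that have no
    close match among the pre-trained singular vectors ("intruder dimensions").
    sim_threshold and the matching criterion are illustrative assumptions;
    scale=0.7 mirrors the 0.7x example in the tweet."""
    U_pre, _, _ = torch.linalg.svd(W_pre, full_matrices=False)
    U_ft, S_ft, Vh_ft = torch.linalg.svd(W_ft, full_matrices=False)

    # Cosine similarity between fine-tuned and pre-trained left singular
    # vectors (columns are unit-norm, so a matrix product suffices).
    sim = (U_ft.T @ U_pre).abs()            # [r_ft, r_pre]
    best_match = sim.max(dim=1).values      # best pre-trained match per direction

    intruder = best_match < sim_threshold   # directions with no close match
    S_scaled = S_ft.clone()
    S_scaled[intruder] *= scale             # damp the intruder singular values

    # Reassemble the weight matrix with the damped singular values.
    W_intervened = U_ft @ torch.diag(S_scaled) @ Vh_ft
    return W_intervened, intruder
```

Applying something like this across the fine-tuned model's weight matrices and then re-measuring pre-training loss against fine-tuning accuracy is the shape of the experiment the tweet describes.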
Next, we analyze the behavioral differences between these models. LoRA forgets less even when both methods perform equally on the fine-tuning task. This extends findings from @DbrxMosaicAI, but here's the key: the difference isn't just because LoRA is underfit. It's because LoRA …
1 · 2 · 15
First, let's look at their structural differences. When we compare singular vectors between pre-trained and fine-tuned weight matrices, there's a striking difference (see image in Tweet 1). Full fine-tuning: the similarity matrix is clearly diagonal. LoRA: the similarity matrix is clearly …
1 · 2 · 20
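For readers who want to reproduce this kind of plot, here is a minimal sketch of comparing singular vectors between a pre-trained and a fine-tuned weight matrix, assuming left singular vectors and absolute cosine similarity (the paper's exact setup may differ):

```python
import torch
import matplotlib.pyplot as plt

def singular_vector_similarity(W_pre, W_ft):
    """Absolute cosine similarity between the left singular vectors of a
    pre-trained and a fine-tuned weight matrix. A near-diagonal matrix means
    the fine-tuned singular vectors line up with the pre-trained ones."""
    U_pre, _, _ = torch.linalg.svd(W_pre, full_matrices=False)
    U_ft, _, _ = torch.linalg.svd(W_ft, full_matrices=False)
    return (U_ft.T @ U_pre).abs()

# Toy stand-ins for real weight matrices: a small perturbation of W_pre
# behaves like full fine-tuning and yields a clearly diagonal matrix.
W_pre = torch.randn(256, 256)
W_ft = W_pre + 0.01 * torch.randn(256, 256)

sim = singular_vector_similarity(W_pre, W_ft)
plt.imshow(sim.numpy(), cmap="viridis")
plt.xlabel("pre-trained singular vectors")
plt.ylabel("fine-tuned singular vectors")
plt.show()
```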
🧵 LoRA vs full fine-tuning: same performance ≠ same solution. Our NeurIPS ’25 paper 🎉 shows that LoRA and full fine-tuning, even when equally well fit, learn structurally different solutions, and that LoRA forgets less and can be made to forget even less by a simple …
18 · 245 · 2K
1/7 Ever wondered what happens when you permute the layers of a language model? In our recent paper with @tegmark, we swap and delete entire layers to understand how models perform inference; in doing so, we see signs of four universal stages of inference!
21 · 92 · 551
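Not the paper's code, but a minimal sketch of what swapping and deleting entire layers can look like in practice, assuming a HuggingFace GPT-2 (the model choice and the layer indices are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

blocks = model.transformer.h   # nn.ModuleList of GPT-2 transformer blocks

# Swap two adjacent layers in place.
i = 5
blocks[i], blocks[i + 1] = blocks[i + 1], blocks[i]

# Delete a layer entirely by rebuilding the ModuleList without it.
j = 8
model.transformer.h = torch.nn.ModuleList(
    [block for k, block in enumerate(blocks) if k != j]
)

# Check how the perturbed model's next-token prediction changes.
inputs = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs, use_cache=False).logits
print(tok.decode(logits[0, -1].argmax().item()))
```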