Reece Shuttleworth (@ReeceShuttle)
Huge thank you to Pratyusha Sharma (@pratyusha_PS), Jacob Andreas (@jacobandreas), and Antonio Torralba for their collaboration on this work! See code here: …
Really cool to see @thinkymachines exploring similar ideas around LoRA recently! Check out our paper for our other detailed investigations: How do LoRA initialization and learning rate impact learning? What role does LoRA’s alpha parameter and the …
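(For context, a minimal sketch of where those knobs live in a standard LoRA setup, using the Hugging Face PEFT library. This illustrates the hyperparameters in question; it is not necessarily the paper's own training configuration.)

```python
# Minimal LoRA setup sketch using Hugging Face PEFT (illustrative only;
# not necessarily the configuration used in the paper's experiments).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("gpt2")

config = LoraConfig(
    r=16,                       # adapter rank
    lora_alpha=32,              # the LoRA update BA is scaled by alpha / r
    lora_dropout=0.0,
    target_modules=["c_attn"],  # which weight matrices receive adapters
    init_lora_weights=True,     # default init: A random, B zeros
)
peft_model = get_peft_model(model, config)
peft_model.print_trainable_parameters()

# The learning rate is then a separate choice made in the optimizer, e.g.:
# torch.optim.AdamW(peft_model.parameters(), lr=2e-4)
```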
If intruder dimensions interfere with previous knowledge, exaggerating their presence should cause more forgetting. We test this in continual learning: as tasks are learned sequentially, intruder dimensions accumulate, and forgetting of previous tasks accelerates. The more …
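(A sketch of how that accumulation can be measured, assuming a definition in the spirit of the paper: an intruder dimension is a high-ranking singular vector of the fine-tuned weight matrix whose maximum cosine similarity to every pre-trained singular vector falls below a threshold. The cutoff k, threshold eps, and toy low-rank "tasks" below are illustrative choices, not the paper's exact settings.)

```python
import torch

def count_intruder_dimensions(W_pre, W_ft, k=10, eps=0.5):
    """Count top-k left singular vectors of W_ft whose max cosine
    similarity to every singular vector of W_pre is below eps."""
    U_pre, _, _ = torch.linalg.svd(W_pre, full_matrices=False)
    U_ft, _, _ = torch.linalg.svd(W_ft, full_matrices=False)
    sims = (U_ft[:, :k].T @ U_pre).abs()   # (k, r) cosine similarities
    return int((sims.max(dim=1).values < eps).sum())

# Toy stand-in for sequential fine-tuning: each "task" adds a low-rank
# update, and the intruder count is tracked against the original weights.
torch.manual_seed(0)
W_pre = torch.randn(256, 256) / 16
W = W_pre.clone()
for task in range(1, 4):
    W = W + torch.randn(256, 2) @ torch.randn(2, 256) / 4
    print(f"after task {task}: {count_intruder_dimensions(W_pre, W)} intruders")
```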
To test the impact of this structural difference, we run an intervention: scale down the intruder dimensions by, for example, 0.7×. Result: forgetting (pre-training loss) drops significantly, while test accuracy barely changes. This shows that intruder dimensions ‘interfere’ with the …
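(A sketch of what such an intervention can look like in code: take the SVD of the fine-tuned matrix, flag poorly aligned top singular vectors as intruders, and shrink only their singular values. The flagging rule and constants are illustrative assumptions, not the paper's exact procedure.)

```python
import torch

def damp_intruder_dimensions(W_pre, W_ft, scale=0.7, k=10, eps=0.5):
    """Rescale the singular values of 'intruder' directions of W_ft
    (top-k singular vectors poorly aligned with all of W_pre's) by
    `scale`, leaving every other direction untouched."""
    U_pre, _, _ = torch.linalg.svd(W_pre, full_matrices=False)
    U, S, Vh = torch.linalg.svd(W_ft, full_matrices=False)
    sims = (U[:, :k].T @ U_pre).abs()
    intruder = sims.max(dim=1).values < eps   # boolean mask over top-k dims
    S = S.clone()
    S[:k][intruder] *= scale                  # shrink intruder directions only
    return U @ torch.diag(S) @ Vh
```

(Applied per weight matrix, one can then re-measure pre-training loss and fine-tuning accuracy before and after the edit.)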
Next, we analyze the behavioral differences between these models. LoRA forgets less even when both methods perform equally well on the fine-tuning task. This extends findings from @DbrxMosaicAI, but here's the key: the difference isn't just because LoRA is underfit. It's because LoRA …
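(One way to operationalize "forgetting" is the rise in loss on pre-training-style text after fine-tuning. Below is a minimal sketch of that measurement with Hugging Face models; the model name and text sample are placeholders, not the paper's evaluation suite.)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def mean_lm_loss(model, tokenizer, texts):
    """Mean next-token cross-entropy over a sample of texts; a rise on
    pre-training-style data after fine-tuning is a proxy for forgetting."""
    model.eval()
    losses = []
    with torch.no_grad():
        for text in texts:
            enc = tokenizer(text, return_tensors="pt")
            out = model(**enc, labels=enc["input_ids"])
            losses.append(out.loss.item())
    return sum(losses) / len(losses)

tok = AutoTokenizer.from_pretrained("gpt2")
base = AutoModelForCausalLM.from_pretrained("gpt2")
sample = ["The industrial revolution began in Britain.",
          "Photosynthesis converts light into chemical energy."]
print("base model loss:", mean_lm_loss(base, tok, sample))
# Repeat with the fine-tuned checkpoint; the gap is the forgetting signal.
```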
First, let's look at their structural differences. When we compare singular vectors between pre-trained and fine-tuned weight matrices, there's a striking difference (see image in Tweet 1). Full fine-tuning: the similarity matrix is clearly diagonal. LoRA: the similarity matrix is clearly …
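(A sketch of how that similarity matrix is computed. Here a small random perturbation stands in for an actual fine-tuned checkpoint; with real checkpoints, W_pre and W_ft would be corresponding weight matrices from the two models.)

```python
import torch
import matplotlib.pyplot as plt

def singular_vector_similarity(W_pre, W_ft):
    """|cosine similarity| between left singular vectors of the
    pre-trained and fine-tuned matrices; a bright diagonal means
    fine-tuning preserved the pre-trained directions in order."""
    U_pre, _, _ = torch.linalg.svd(W_pre, full_matrices=False)
    U_ft, _, _ = torch.linalg.svd(W_ft, full_matrices=False)
    return (U_ft.T @ U_pre).abs()

# Toy stand-in: a small additive perturbation mimics full fine-tuning,
# which tends to keep the similarity matrix diagonal.
torch.manual_seed(0)
W_pre = torch.randn(128, 128)
W_ft = W_pre + 0.05 * torch.randn(128, 128)
plt.imshow(singular_vector_similarity(W_pre, W_ft).numpy(), cmap="viridis")
plt.xlabel("pre-trained singular vectors")
plt.ylabel("fine-tuned singular vectors")
plt.show()
```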
🧵 LoRA vs full fine-tuning: same performance ≠ same solution. Our NeurIPS ’25 paper 🎉 shows that LoRA and full fine-tuning, even when equally well fit, learn structurally different solutions, and that LoRA forgets less and can be made even better (less forgetting) by a simple …
1/7 Ever wondered what happens when you permute the layers of a language model? In our recent paper with @tegmark, we swap and delete entire layers to understand how models perform inference. In doing so, we see signs of four universal stages of inference!
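(A minimal sketch of the layer-swap and layer-deletion interventions on GPT-2, an illustrative model choice; the paper's models and protocol may differ. GPT-2 exposes its transformer blocks as the ModuleList model.transformer.h, so permuting or deleting layers is just list surgery.)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Swap two adjacent transformer blocks.
blocks = model.transformer.h
blocks[4], blocks[5] = blocks[5], blocks[4]

# Deleting a block works the same way:
# del model.transformer.h[7]

# Check what the edited model produces.
enc = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**enc, max_new_tokens=5, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
```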