Alexandra Souly
            
            @AlexandraSouly
Followers
                186
              Following
                20
              Media
                6
              Statuses
                28
              Safeguards at @AISecurityInst
              
              Joined August 2022
            
            
           My first paper from @AnthropicAI! We show that the number of samples needed to backdoor an LLM stays constant as models scale. 
           New research with the UK @AISecurityInst and the @turinginst: We found that just a few malicious documents can produce vulnerabilities in an LLM—regardless of the size of the model or its training data. Data-poisoning attacks might be more practical than previously believed. 
            
                
                6
              
              
                
                20
              
              
                
                194
              
             If you are excited to work on similar projects with a small, cracked team, apply for our open research scientist position on the Safeguards team!  https://t.co/xCTx4lOBEZ  11/11 
          
            
            job-boards.eu.greenhouse.io
              London, UK
            
                
                0
              
              
                
                0
              
              
                
                7
              
             Full paper:  https://t.co/WnPiuFQEzh  Work done with @_robertkirk @alxndrdavies @yaringal at AISI, @javirandor + others @AnthropicAI, and Ed Chapman + others @turinginst. 10/11 
          
            
            arxiv.org
              Poisoning attacks can compromise the safety of large language models (LLMs) by injecting malicious documents into their training data. Existing work has studied pretraining poisoning assuming...
            
                
                1
              
              
                
                0
              
              
                
                9
              
             We are releasing this work to push forwards the public understanding of data-poisoning and help defenders not be caught unaware of these risks. We hope to motivate work on defences at scale in this area. 9/11 
          
                
                1
              
              
                
                0
              
              
                
                5
              
             Again, when fine-tuning Llama with poisoned data, the attack success rate (ASR) depends on the number of poison samples in the training data, not the total dataset size. The ASR holds across a 100x difference in the poison density. 8/11 
          
                
                1
              
              
                
                0
              
              
                
                4
              
             We also investigate this phenomena in the fine-tuning setting, backdooring models to answer harmful questions similarly to jailbreaking. 7/11 
          
                
                1
              
              
                
                0
              
              
                
                4
              
             We find that as few as 250 poison samples insert the backdoor across all tested model sizes and dataset sizes. This is 0.00016% of the 13B train set! They also all learn the backdoor at a similar speed and to a similar extent across a 40x difference in training data size. 6/11 
          
                
                1
              
              
                
                0
              
              
                
                4
              
             One of the settings we study is a denial of service attack (DoS) that makes models output gibberish text when triggered by our backdoor trigger <SUDO>. 5/11 
          
                
                1
              
              
                
                0
              
              
                
                4
              
             This means that poisoning attack gets EASIER as models scale, not harder! The more data a model is trained on, the easier it is to sneak in a small number of poison that the attacker controls. We show this result in a variety of pre-training and fine-tuning settings. 4/11 
          
                
                1
              
              
                
                0
              
              
                
                4
              
             Previously, it was thought that attackers need to control a fixed percentage of data, making backdoor poisoning attacks harder as the model/data scales. Our results contradict this: instead of a fixed percentage, attackers need a fixed, small number of poisoned samples. 3/11 
          
                
                2
              
              
                
                0
              
              
                
                4
              
             LLMs are pretrained on large amounts of public data. Attackers may try to inject backdoors into this data (e.g. to cause models to exfiltrate private info at inference-time). We ran the largest poisoning study to date to investigate how feasible these attacks are at scale. 2/11 
          
                
                1
              
              
                
                0
              
              
                
                4
              
             New @AISecurityInst research with @AnthropicAI + @turinginst: The number of samples needed to backdoor poison LLMs stays nearly CONSTANT as models scale. With 500 samples, we insert backdoors in LLMs from 600m to 13b params, even as data scaled 20x.🧵/11 
          
                
                1
              
              
                
                7
              
              
                
                49
              
             We at @AISecurityInst recently did our first pre-deployment 𝗮𝗹𝗶𝗴𝗻𝗺𝗲𝗻𝘁 evaluation of @AnthropicAI's Claude Sonnet 4.5! This was a first attempt – and we plan to work on this more! – but we still found some interesting results, and some learnings for next time 🧵 
          
                
                3
              
              
                
                12
              
              
                
                49
              
             Excited to share details on two of our longest running and most effective safeguard collaborations, one with Anthropic and one with OpenAI. We've identified—and they've patched—a large number of vulnerabilities and together strengthened their safeguards. 🧵 1/6 
          
                
                8
              
              
                
                62
              
              
                
                300
              
             We at @AISecurityInst worked with @OpenAI to test & improve Agent’s safeguards prior to release. A few notes on our experience🧵 1/4 
          
                
                3
              
              
                
                29
              
              
                
                151
              
             🚨 Did you know that small-batch vanilla SGD without momentum (i.e. the first optimizer you learn about in intro ML) is virtually as fast as AdamW for LLM pretraining on a per-FLOP basis? 📜 1/n 
          
                
                28
              
              
                
                113
              
              
                
                835
              
             Defending against adversarial prompts is hard; defending against fine-tuning API attacks is much harder. In our new @AISecurityInst pre-print, we break alignment and extract harmful info using entirely benign and natural interactions during fine-tuning & inference. 😮 🧵 1/10 
          
                
                3
              
              
                
                23
              
              
                
                127
              
             When we were developing our agent misuse dataset, we noticed instances of models seeming to realize our tasks were fake. We're sharing some examples and we'd be excited for more research into how synthetic tasks can distort eval results! 🧵 1/N 
          
                
                3
              
              
                
                14
              
              
                
                92
              
             Great to see our AgentHarm benchmark mentioned here as an example evaluation of frontier AI systems!  https://t.co/YW7A9T9Jq7 
          
          
            
            openai.com
              We're offering safety and security researchers early access to our next frontier models.
            
                
                1
              
              
                
                3
              
              
                
                21