Alexandra Souly Profile
Alexandra Souly

@AlexandraSouly

Followers
186
Following
20
Media
6
Statuses
28

Safeguards at @AISecurityInst

Joined August 2022
Don't wanna be here? Send us removal request.
@AlexandraSouly
Alexandra Souly
26 days
Amen
@javirandor
Javier Rando
26 days
On a personal note I think this work demonstrates something important. Big labs can study important problems with governments, run experiments that would be unaffordable elsewhere, and openly share results—even when they're concerning. This is great for science and AI security.
0
0
6
@javirandor
Javier Rando
26 days
My first paper from @AnthropicAI! We show that the number of samples needed to backdoor an LLM stays constant as models scale.
@AnthropicAI
Anthropic
26 days
New research with the UK @AISecurityInst and the @turinginst: We found that just a few malicious documents can produce vulnerabilities in an LLM—regardless of the size of the model or its training data. Data-poisoning attacks might be more practical than previously believed.
6
20
194
@AlexandraSouly
Alexandra Souly
26 days
If you are excited to work on similar projects with a small, cracked team, apply for our open research scientist position on the Safeguards team! https://t.co/xCTx4lOBEZ 11/11
Tweet card summary image
job-boards.eu.greenhouse.io
London, UK
0
0
7
@AlexandraSouly
Alexandra Souly
26 days
We are releasing this work to push forwards the public understanding of data-poisoning and help defenders not be caught unaware of these risks. We hope to motivate work on defences at scale in this area. 9/11
1
0
5
@AlexandraSouly
Alexandra Souly
26 days
Again, when fine-tuning Llama with poisoned data, the attack success rate (ASR) depends on the number of poison samples in the training data, not the total dataset size. The ASR holds across a 100x difference in the poison density. 8/11
1
0
4
@AlexandraSouly
Alexandra Souly
26 days
We also investigate this phenomena in the fine-tuning setting, backdooring models to answer harmful questions similarly to jailbreaking. 7/11
1
0
4
@AlexandraSouly
Alexandra Souly
26 days
We find that as few as 250 poison samples insert the backdoor across all tested model sizes and dataset sizes. This is 0.00016% of the 13B train set! They also all learn the backdoor at a similar speed and to a similar extent across a 40x difference in training data size. 6/11
1
0
4
@AlexandraSouly
Alexandra Souly
26 days
One of the settings we study is a denial of service attack (DoS) that makes models output gibberish text when triggered by our backdoor trigger <SUDO>. 5/11
1
0
4
@AlexandraSouly
Alexandra Souly
26 days
This means that poisoning attack gets EASIER as models scale, not harder! The more data a model is trained on, the easier it is to sneak in a small number of poison that the attacker controls. We show this result in a variety of pre-training and fine-tuning settings. 4/11
1
0
4
@AlexandraSouly
Alexandra Souly
26 days
Previously, it was thought that attackers need to control a fixed percentage of data, making backdoor poisoning attacks harder as the model/data scales. Our results contradict this: instead of a fixed percentage, attackers need a fixed, small number of poisoned samples. 3/11
2
0
4
@AlexandraSouly
Alexandra Souly
26 days
LLMs are pretrained on large amounts of public data. Attackers may try to inject backdoors into this data (e.g. to cause models to exfiltrate private info at inference-time). We ran the largest poisoning study to date to investigate how feasible these attacks are at scale. 2/11
1
0
4
@AlexandraSouly
Alexandra Souly
26 days
New @AISecurityInst research with @AnthropicAI + @turinginst: The number of samples needed to backdoor poison LLMs stays nearly CONSTANT as models scale. With 500 samples, we insert backdoors in LLMs from 600m to 13b params, even as data scaled 20x.🧵/11
1
7
49
@_robertkirk
Robert Kirk
1 month
We at @AISecurityInst recently did our first pre-deployment 𝗮𝗹𝗶𝗴𝗻𝗺𝗲𝗻𝘁 evaluation of @AnthropicAI's Claude Sonnet 4.5! This was a first attempt – and we plan to work on this more! – but we still found some interesting results, and some learnings for next time 🧵
3
12
49
@alxndrdavies
Xander Davies
2 months
Excited to share details on two of our longest running and most effective safeguard collaborations, one with Anthropic and one with OpenAI. We've identified—and they've patched—a large number of vulnerabilities and together strengthened their safeguards. 🧵 1/6
8
62
300
@alxndrdavies
Xander Davies
4 months
We at @AISecurityInst worked with @OpenAI to test & improve Agent’s safeguards prior to release. A few notes on our experience🧵 1/4
3
29
151
@micahgoldblum
Micah Goldblum
4 months
🚨 Did you know that small-batch vanilla SGD without momentum (i.e. the first optimizer you learn about in intro ML) is virtually as fast as AdamW for LLM pretraining on a per-FLOP basis? 📜 1/n
28
113
835
@alxndrdavies
Xander Davies
9 months
Defending against adversarial prompts is hard; defending against fine-tuning API attacks is much harder. In our new @AISecurityInst pre-print, we break alignment and extract harmful info using entirely benign and natural interactions during fine-tuning & inference. 😮 🧵 1/10
3
23
127
@alxndrdavies
Xander Davies
9 months
When we were developing our agent misuse dataset, we noticed instances of models seeming to realize our tasks were fake. We're sharing some examples and we'd be excited for more research into how synthetic tasks can distort eval results! 🧵 1/N
3
14
92
@maksym_andr
Maksym Andriushchenko
11 months
Great to see our AgentHarm benchmark mentioned here as an example evaluation of frontier AI systems! https://t.co/YW7A9T9Jq7
Tweet card summary image
openai.com
We're offering safety and security researchers early access to our next frontier models.
1
3
21