Alexandra Souly @AlexandraSouly X Profile

Alexandra Souly

@AlexandraSouly

Followers

186

Following

20

Media

6

Statuses

28

Safeguards at @AISecurityInst

https://t.co/rnYRQRln8m

Joined August 2022

Don't wanna be here? Send us removal request.

Alexandra Souly

@AlexandraSouly

26 days

Amen

Javier Rando

@javirandor

26 days

On a personal note I think this work demonstrates something important. Big labs can study important problems with governments, run experiments that would be unaffordable elsewhere, and openly share results—even when they're concerning. This is great for science and AI security.

0

6

Javier Rando

@javirandor

26 days

My first paper from @AnthropicAI! We show that the number of samples needed to backdoor an LLM stays constant as models scale.

Anthropic

@AnthropicAI

26 days

New research with the UK @AISecurityInst and the @turinginst: We found that just a few malicious documents can produce vulnerabilities in an LLM—regardless of the size of the model or its training data. Data-poisoning attacks might be more practical than previously believed.

6

20

194

Alexandra Souly

@AlexandraSouly

26 days

If you are excited to work on similar projects with a small, cracked team, apply for our open research scientist position on the Safeguards team! https://t.co/xCTx4lOBEZ 11/11

job-boards.eu.greenhouse.io

London, UK

0

7

Alexandra Souly

@AlexandraSouly

26 days

Full paper: https://t.co/WnPiuFQEzh Work done with @_robertkirk @alxndrdavies @yaringal at AISI, @javirandor + others @AnthropicAI, and Ed Chapman + others @turinginst. 10/11

arxiv.org

Poisoning attacks can compromise the safety of large language models (LLMs) by injecting malicious documents into their training data. Existing work has studied pretraining poisoning assuming...

1

0

9

Alexandra Souly

@AlexandraSouly

26 days

We are releasing this work to push forwards the public understanding of data-poisoning and help defenders not be caught unaware of these risks. We hope to motivate work on defences at scale in this area. 9/11

1

0

5

Alexandra Souly

@AlexandraSouly

26 days

Again, when fine-tuning Llama with poisoned data, the attack success rate (ASR) depends on the number of poison samples in the training data, not the total dataset size. The ASR holds across a 100x difference in the poison density. 8/11

1

0

4

Alexandra Souly

@AlexandraSouly

26 days

We also investigate this phenomena in the fine-tuning setting, backdooring models to answer harmful questions similarly to jailbreaking. 7/11

1

0

4

Alexandra Souly

@AlexandraSouly

26 days

We find that as few as 250 poison samples insert the backdoor across all tested model sizes and dataset sizes. This is 0.00016% of the 13B train set! They also all learn the backdoor at a similar speed and to a similar extent across a 40x difference in training data size. 6/11

1

0

4

Alexandra Souly

@AlexandraSouly

26 days

One of the settings we study is a denial of service attack (DoS) that makes models output gibberish text when triggered by our backdoor trigger <SUDO>. 5/11

1

0

4

Alexandra Souly

@AlexandraSouly

26 days

This means that poisoning attack gets EASIER as models scale, not harder! The more data a model is trained on, the easier it is to sneak in a small number of poison that the attacker controls. We show this result in a variety of pre-training and fine-tuning settings. 4/11

1

0

4

Alexandra Souly

@AlexandraSouly

26 days

Previously, it was thought that attackers need to control a fixed percentage of data, making backdoor poisoning attacks harder as the model/data scales. Our results contradict this: instead of a fixed percentage, attackers need a fixed, small number of poisoned samples. 3/11

2

0

4

Alexandra Souly

@AlexandraSouly

26 days

LLMs are pretrained on large amounts of public data. Attackers may try to inject backdoors into this data (e.g. to cause models to exfiltrate private info at inference-time). We ran the largest poisoning study to date to investigate how feasible these attacks are at scale. 2/11

1

0

4

Alexandra Souly

@AlexandraSouly

26 days

New @AISecurityInst research with @AnthropicAI + @turinginst: The number of samples needed to backdoor poison LLMs stays nearly CONSTANT as models scale. With 500 samples, we insert backdoors in LLMs from 600m to 13b params, even as data scaled 20x.🧵/11

1

7

49

Robert Kirk

@_robertkirk

1 month

We at @AISecurityInst recently did our first pre-deployment 𝗮𝗹𝗶𝗴𝗻𝗺𝗲𝗻𝘁 evaluation of @AnthropicAI's Claude Sonnet 4.5! This was a first attempt – and we plan to work on this more! – but we still found some interesting results, and some learnings for next time 🧵

3

12

49

Xander Davies

@alxndrdavies

2 months

Excited to share details on two of our longest running and most effective safeguard collaborations, one with Anthropic and one with OpenAI. We've identified—and they've patched—a large number of vulnerabilities and together strengthened their safeguards. 🧵 1/6

8

62

300

Xander Davies

@alxndrdavies

4 months

We at @AISecurityInst worked with @OpenAI to test & improve Agent’s safeguards prior to release. A few notes on our experience🧵 1/4

3

29

151

Micah Goldblum

@micahgoldblum

4 months

🚨 Did you know that small-batch vanilla SGD without momentum (i.e. the first optimizer you learn about in intro ML) is virtually as fast as AdamW for LLM pretraining on a per-FLOP basis? 📜 1/n

28

113

835

Xander Davies

@alxndrdavies

9 months

Defending against adversarial prompts is hard; defending against fine-tuning API attacks is much harder. In our new @AISecurityInst pre-print, we break alignment and extract harmful info using entirely benign and natural interactions during fine-tuning & inference. 😮 🧵 1/10

3

23

127

Xander Davies

@alxndrdavies

9 months

When we were developing our agent misuse dataset, we noticed instances of models seeming to realize our tasks were fake. We're sharing some examples and we'd be excited for more research into how synthetic tasks can distort eval results! 🧵 1/N

3

14

92

Maksym Andriushchenko

@maksym_andr

11 months

Great to see our AgentHarm benchmark mentioned here as an example evaluation of frontier AI systems! https://t.co/YW7A9T9Jq7

openai.com

We're offering safety and security researchers early access to our next frontier models.

1

3

21