Xander Davies
@alxndrdavies
Followers
2K
Following
4K
Media
75
Statuses
463
safeguards lead @AISecurityInst PhD student @OATML_Oxford, prev @Harvard (https://t.co/695XYMJSua)
London
Joined March 2020
Today we release the first output from AISI's Strategic Awareness team 🚀 To track AI progress, we map key limitations of current systems: performance limitations on certain task types, insufficient reliability, insufficient adaptability, and low capacity for original insights 🧵
1
3
14
We’ve conducted the largest investigation of data poisoning to date with @AnthropicAI and @turinginst. Our results show that as little as 250 malicious documents can be used to “poison” a language model, even as model size and training data grow 🧵
2
10
38
@AlexandraSouly @AnthropicAI @turinginst Our team is hiring! Much more to do here and in other areas.
If you are excited to work on similar projects with a small, cracked team, apply for our open research scientist position on the Safeguards team! https://t.co/xCTx4lOBEZ 11/11
0
0
4
A tremendous amount of careful work went into testing this--particular shoutout to @AlexandraSouly who approached this with her usual clarity & rigour. And fun to collaborate with @AnthropicAI and @turinginst!
1
0
5
Very excited this paper is out. We find that the number of samples required for backdoor poisoning during pre-training stays near-constant as you scale up data/model size. Much more research to do to understand & mitigate this risk!
New @AISecurityInst research with @AnthropicAI + @turinginst: The number of samples needed to backdoor poison LLMs stays nearly CONSTANT as models scale. With 500 samples, we insert backdoors in LLMs from 600m to 13b params, even as data scaled 20x.🧵/11
1
3
29
👇👇👇
We at @AISecurityInst recently did our first pre-deployment 𝗮𝗹𝗶𝗴𝗻𝗺𝗲𝗻𝘁 evaluation of @AnthropicAI's Claude Sonnet 4.5! This was a first attempt – and we plan to work on this more! – but we still found some interesting results, and some learnings for next time 🧵
0
1
13
My friends, today I am excited to announce AI Sequrity (@aisequrity). Our mission is to provide developers and enterprises a painless and stress-free deployment of AI that is secure by design. You heard this right. You can deploy your AI agents and get guaranteed security. The
4
20
88
@dwarkesh_sp Most people misunderstand books as data for pertaining when it’s more a set of prompts for synthetic data generation.
58
118
2K
I'll be mentoring for this - consider applying if you want to work with me on scalable oversight, AI control or CoT monitoring (or with one of many other awesome mentors)!
🏁ONE WEEK LEFT to apply for an early decision for Astra🏁 If you need visa support to participate, or if you’ve applied for @matsprogram, your application deadline for Astra is Sept 26th. ⬇️We're also excited to announce new mentors across every stream! (1/4)
0
2
17
Today I’m launching @Irregular (formerly Pattern Labs) with my friend and co-founder Omer Nevo: Irregular is the first frontier security lab. Our mission: protect the world in the era of increasingly capable and sophisticated AI systems.
48
48
389
More in the OAI blog post: https://t.co/oKFcmQdvqo They are also excellent collaborators:
openai.com
An update on our collaboration with US and UK research and standards bodies for the secure deployment of AI.
Excited to share details on two of our longest running and most effective safeguard collaborations, one with Anthropic and one with OpenAI. We've identified—and they've patched—a large number of vulnerabilities and together strengthened their safeguards. 🧵 1/6
0
2
2
AISI’s blog: https://t.co/pRXyjhddAd AISI’s post:
aisi.gov.uk
Insights into our ongoing voluntary collaborations with Anthropic and OpenAI.
We’ve been working closely with frontier AI developers, alongside the US Center of AI Standards and Innovation (CAISI), to improve the security of powerful AI systems. @AnthropicAI and @OpenAI have now published insights into these ongoing collaborations 🧵
1
0
7
~2 years ago @alxndrdavies pitched me, in a room overlooking Big Ben, on why AISI needed a team to independently assess frontier AI safeguards, not just dangerous capability evals. The team he’s built since is world-leading. Great to have these collabs public.
Excited to share details on two of our longest running and most effective safeguard collaborations, one with Anthropic and one with OpenAI. We've identified—and they've patched—a large number of vulnerabilities and together strengthened their safeguards. 🧵 1/6
0
3
38
This team at UK AISI was incredible at discovering jailbreaking techniques for our prototype defenses. Their attacks helped us to make some critical decisions and achieve the robustness that we did for mitigating CBRN risks on Opus 4. They’re now hiring!
Excited to share details on two of our longest running and most effective safeguard collaborations, one with Anthropic and one with OpenAI. We've identified—and they've patched—a large number of vulnerabilities and together strengthened their safeguards. 🧵 1/6
2
6
66
Our safeguards for bio risk and agentic deployments were stress-tested by the US CAISI and UK AISI & we iterated together towards ever higher robustness and reliability: https://t.co/gT1fM3L264
openai.com
An update on our collaboration with US and UK research and standards bodies for the secure deployment of AI.
Excited to share details on two of our longest running and most effective safeguard collaborations, one with Anthropic and one with OpenAI. We've identified—and they've patched—a large number of vulnerabilities and together strengthened their safeguards. 🧵 1/6
0
2
8
It's been great to collaborate with Xander and other at UK AISI and CAISI to strength our jailbreak defences!
Excited to share details on two of our longest running and most effective safeguard collaborations, one with Anthropic and one with OpenAI. We've identified—and they've patched—a large number of vulnerabilities and together strengthened their safeguards. 🧵 1/6
1
2
12
If you want to join a six-person red team in the heart of government, we're hiring and so are other teams at AISI! More info: Anthropic blog: https://t.co/voNNtWv2Qk OAI blog: https://t.co/oKFcmQdvqo AISI careers: https://t.co/e51h4ctDnj 6/6
aisi.gov.uk
View career opportunities at AISI. The AI Security Institute is a directorate of the Department of Science, Innovation, and Technology that facilitates rigorous research to enable advanced AI gover...
1
2
41
These collaborations strengthen safeguards of widely used systems by drawing from relevant gov expertise, and also keep governments informed. I'm particularly excited that much of this work was done jointly with US CAISI, who have been strong technical collaborators. 5/6
1
1
23