alxndrdavies Profile Banner
Xander Davies Profile
Xander Davies

@alxndrdavies

Followers
2K
Following
4K
Media
75
Statuses
463

safeguards lead @AISecurityInst PhD student @OATML_Oxford, prev @Harvard (https://t.co/695XYMJSua)

London
Joined March 2020
Don't wanna be here? Send us removal request.
@JonasSandbrink
Jonas Sandbrink
9 days
Today we release the first output from AISI's Strategic Awareness team 🚀 To track AI progress, we map key limitations of current systems: performance limitations on certain task types, insufficient reliability, insufficient adaptability, and low capacity for original insights 🧵
1
3
14
@soundboy
Ian Hogarth
16 days
Appreciate Ken McCallum, head of MI5's kind words about @AISecurityInst
5
6
49
@AISecurityInst
AI Security Institute
23 days
We’ve conducted the largest investigation of data poisoning to date with @AnthropicAI and @turinginst. Our results show that as little as 250 malicious documents can be used to “poison” a language model, even as model size and training data grow 🧵
2
10
38
@alxndrdavies
Xander Davies
23 days
@AlexandraSouly @AnthropicAI @turinginst Our team is hiring! Much more to do here and in other areas.
@AlexandraSouly
Alexandra Souly
23 days
If you are excited to work on similar projects with a small, cracked team, apply for our open research scientist position on the Safeguards team! https://t.co/xCTx4lOBEZ 11/11
0
0
4
@alxndrdavies
Xander Davies
23 days
A tremendous amount of careful work went into testing this--particular shoutout to @AlexandraSouly who approached this with her usual clarity & rigour. And fun to collaborate with @AnthropicAI and @turinginst!
1
0
5
@alxndrdavies
Xander Davies
23 days
Very excited this paper is out. We find that the number of samples required for backdoor poisoning during pre-training stays near-constant as you scale up data/model size. Much more research to do to understand & mitigate this risk!
@AlexandraSouly
Alexandra Souly
23 days
New @AISecurityInst research with @AnthropicAI + @turinginst: The number of samples needed to backdoor poison LLMs stays nearly CONSTANT as models scale. With 500 samples, we insert backdoors in LLMs from 600m to 13b params, even as data scaled 20x.🧵/11
1
3
29
@sleepinyourhat
Sam Bowman
1 month
👇👇👇
@_robertkirk
Robert Kirk
1 month
We at @AISecurityInst recently did our first pre-deployment 𝗮𝗹𝗶𝗴𝗻𝗺𝗲𝗻𝘁 evaluation of @AnthropicAI's Claude Sonnet 4.5! This was a first attempt – and we plan to work on this more! – but we still found some interesting results, and some learnings for next time 🧵
0
1
13
@iliaishacked
Ilia Shumailov🦔
1 month
My friends, today I am excited to announce AI Sequrity (@aisequrity). Our mission is to provide developers and enterprises a painless and stress-free deployment of AI that is secure by design. You heard this right. You can deploy your AI agents and get guaranteed security. The
4
20
88
@karpathy
Andrej Karpathy
1 month
@dwarkesh_sp Most people misunderstand books as data for pertaining when it’s more a set of prompts for synthetic data generation.
58
118
2K
@JoeJBenton
Joe Benton
1 month
I'll be mentoring for this - consider applying if you want to work with me on scalable oversight, AI control or CoT monitoring (or with one of many other awesome mentors)!
@sleight_henry
🚀Henry is launching the Astra Research Program!
1 month
🏁ONE WEEK LEFT to apply for an early decision for Astra🏁 If you need visa support to participate, or if you’ve applied for @matsprogram, your application deadline for Astra is Sept 26th. ⬇️We're also excited to announce new mentors across every stream! (1/4)
0
2
17
@dan_lahav
Dan Lahav
2 months
Today I’m launching @Irregular (formerly Pattern Labs) with my friend and co-founder Omer Nevo: Irregular is the first frontier security lab. Our mission: protect the world in the era of increasingly capable and sophisticated AI systems.
48
48
389
@alxndrdavies
Xander Davies
2 months
More in the OAI blog post: https://t.co/oKFcmQdvqo They are also excellent collaborators:
Tweet card summary image
openai.com
An update on our collaboration with US and UK research and standards bodies for the secure deployment of AI.
@alxndrdavies
Xander Davies
2 months
Excited to share details on two of our longest running and most effective safeguard collaborations, one with Anthropic and one with OpenAI. We've identified—and they've patched—a large number of vulnerabilities and together strengthened their safeguards. 🧵 1/6
0
2
2
@alxndrdavies
Xander Davies
2 months
Ben and team are extremely good—e.g. check out their cool work w OpenAI!
@EdelmanBen
Ben Edelman
2 months
Did not have "red team frontier AI systems as part of federal govt job" on my bingo card during my ML PhD. It's been a blast.
1
1
16
@alxndrdavies
Xander Davies
2 months
AISI’s blog: https://t.co/pRXyjhddAd AISI’s post:
Tweet card summary image
aisi.gov.uk
Insights into our ongoing voluntary collaborations with Anthropic and OpenAI.
@AISecurityInst
AI Security Institute
2 months
We’ve been working closely with frontier AI developers, alongside the US Center of AI Standards and Innovation (CAISI), to improve the security of powerful AI systems. @AnthropicAI and @OpenAI have now published insights into these ongoing collaborations 🧵
1
0
7
@NateBurnikell
Nate
2 months
~2 years ago @alxndrdavies pitched me, in a room overlooking Big Ben, on why AISI needed a team to independently assess frontier AI safeguards, not just dangerous capability evals. The team he’s built since is world-leading. Great to have these collabs public.
@alxndrdavies
Xander Davies
2 months
Excited to share details on two of our longest running and most effective safeguard collaborations, one with Anthropic and one with OpenAI. We've identified—and they've patched—a large number of vulnerabilities and together strengthened their safeguards. 🧵 1/6
0
3
38
@EthanJPerez
Ethan Perez
2 months
This team at UK AISI was incredible at discovering jailbreaking techniques for our prototype defenses. Their attacks helped us to make some critical decisions and achieve the robustness that we did for mitigating CBRN risks on Opus 4. They’re now hiring!
@alxndrdavies
Xander Davies
2 months
Excited to share details on two of our longest running and most effective safeguard collaborations, one with Anthropic and one with OpenAI. We've identified—and they've patched—a large number of vulnerabilities and together strengthened their safeguards. 🧵 1/6
2
6
66
@JoHeidecke
Johannes Heidecke
2 months
Our safeguards for bio risk and agentic deployments were stress-tested by the US CAISI and UK AISI & we iterated together towards ever higher robustness and reliability: https://t.co/gT1fM3L264
Tweet card summary image
openai.com
An update on our collaboration with US and UK research and standards bodies for the secure deployment of AI.
@alxndrdavies
Xander Davies
2 months
Excited to share details on two of our longest running and most effective safeguard collaborations, one with Anthropic and one with OpenAI. We've identified—and they've patched—a large number of vulnerabilities and together strengthened their safeguards. 🧵 1/6
0
2
8
@MrinankSharma
mrinank ⛰️
2 months
It's been great to collaborate with Xander and other at UK AISI and CAISI to strength our jailbreak defences!
@alxndrdavies
Xander Davies
2 months
Excited to share details on two of our longest running and most effective safeguard collaborations, one with Anthropic and one with OpenAI. We've identified—and they've patched—a large number of vulnerabilities and together strengthened their safeguards. 🧵 1/6
1
2
12
@alxndrdavies
Xander Davies
2 months
These collaborations strengthen safeguards of widely used systems by drawing from relevant gov expertise, and also keep governments informed. I'm particularly excited that much of this work was done jointly with US CAISI, who have been strong technical collaborators. 5/6
1
1
23