Robert Kirk
@_robertkirk
Followers
2K
Following
654
Media
52
Statuses
404
Research Scientist at @AISecurityInst; PhD Student @ucl_dark. LLMs, AI Safety, Generalisation; @Effect_altruism
Joined January 2020
Life update: I’ve joined the UK @AISafetyInst on the Safeguards Analysis team! We’re working on how to conceptually and empirically assess whether a set of safeguards used to reduce risk is sufficient.
10
6
129
🔒How can we prevent harm from AI systems that pursue unintended goals? AI control is a promising research agenda seeking to address this critical question. Today, we’re excited to launch ControlArena – our library for running secure and reproducible AI control experiments🧵
1
25
86
New paper! Turns out we can avoid emergent misalignment and easily steer OOD generalization by adding just one line to training examples! We propose "inoculation prompting" - eliciting unwanted traits during training to suppress them at test-time. 🧵
3
25
203
New @AISecurityInst research with @AnthropicAI + @turinginst: The number of samples needed to backdoor poison LLMs stays nearly CONSTANT as models scale. With 500 samples, we insert backdoors in LLMs from 600m to 13b params, even as data scaled 20x.🧵/11
1
7
50
Well... damn. New paper from Anthropic and UK AISI: A small number of samples can poison LLMs of any size. https://t.co/STAMpLpbck
24
44
341
We’ve conducted the largest investigation of data poisoning to date with @AnthropicAI and @turinginst. Our results show that as little as 250 malicious documents can be used to “poison” a language model, even as model size and training data grow 🧵
2
10
38
Regardless of model size and amount of clean data, backdoor data poisoning attacks on LLMs only need a near-constant number of poison samples! Exciting to put out this research from AISI in Collab with Anthropic and ATI. Lots of work on detection and mitigation to be done here!
New @AISecurityInst research with @AnthropicAI + @turinginst: The number of samples needed to backdoor poison LLMs stays nearly CONSTANT as models scale. With 500 samples, we insert backdoors in LLMs from 600m to 13b params, even as data scaled 20x.🧵/11
0
2
6
We’ll be continuing our work with Anthropic and other frontier developers to strengthen model safeguards as capabilities improve. Read more on our blog:
aisi.gov.uk
Our work with Anthropic and the Alan Turing Institute suggests that data poisoning attacks may be easier than previously believed.
0
5
8
✨New AI Safety paper on CoT Monitorability✨ We use information theory to answer when Chain-of-Thought monitoring works, and how to make it better.
2
25
164
We at @AISecurityInst recently did our first pre-deployment 𝗮𝗹𝗶𝗴𝗻𝗺𝗲𝗻𝘁 evaluation of @AnthropicAI's Claude Sonnet 4.5! This was a first attempt – and we plan to work on this more! – but we still found some interesting results, and some learnings for next time 🧵
3
12
49
If you're excited by this kind of work, and more broadly ensuring AI systems are secure, AISI are hiring a bunch of roles: https://t.co/SJTVYEjVFQ Incl. a research scientist role on the safeguards team where you can work on red-teaming frontier AI systems
job-boards.eu.greenhouse.io
London, UK
0
0
3
As @sleepinyourhat says, there is a lot more work to do developing the methodology and tooling for alignment evaluations, and we're excited to be pushing forwards these research.
[Sonnet 4.5 🧵] Here's the north-star goal for our pre-deployment alignment evals work: The information we share alongside a model should give you an accurate overall sense of the risks the model could pose. It won’t tell you everything, but you shouldn’t be...
1
0
3
This was a big cross-team collaboration at @AISecurityInst – we also did additional ablations of the alignment evaluations, and evaluated capabilities relevant for misalignment threat models, and it was very cool to see everything come together for this on fairly short notice!
1
0
2
We also took a lot of inspiration from @summerfieldlab's "Lessons from a Chimp" paper when running these evaluations to ensure the results and interpretation were more rigorous and grounded (we now regularly use the phrase "is this eval chimp-proof?").
In a new paper, we examine recent claims that AI systems have been observed ‘scheming’, or making strategic attempts to mislead humans. We argue that to test these claims properly, more rigorous methods are needed.
1
0
5
Learnings: We built lots of automated evals using LLMs, building on a tool shared w/ us by Anthropic. Prompting is crucial, and nothing substitutes for manually reading transcripts to understand what's going on (luckily we have @AlexandraSouly & others who're crazy good at this)
2
1
5
We observed both of these results without being told to look for them by @AnthropicAI, which is a good sign for being able to independently study these models and observe their behaviour
1
0
3
In synthetic agentic scenarios, we also found Sonnet 4.5 took actions to preserve its current state 1.9x less than Sonnet 4 and 3.3x than to Opus4.1. Sonnet 4.5 generally justified these actions on grounds other than self-preservation (such as security concerns – see screenshot)
1
0
3
Results: We found Sonnet 4.5 talks about being in an eval more than Opus 4.1 and Sonnet 4 (16.6% vs 3.5% & 1.6% respectively) – although it may not be better at distinguishing eval from deployment when prompted. Eval awareness will likely be a very big issue going forward!
1
0
5
We at @AISecurityInst recently did our first pre-deployment 𝗮𝗹𝗶𝗴𝗻𝗺𝗲𝗻𝘁 evaluation of @AnthropicAI's Claude Sonnet 4.5! This was a first attempt – and we plan to work on this more! – but we still found some interesting results, and some learnings for next time 🧵
3
12
49
If you want to join a six-person red team in the heart of government, we're hiring and so are other teams at AISI! More info: Anthropic blog: https://t.co/voNNtWv2Qk OAI blog: https://t.co/oKFcmQdvqo AISI careers: https://t.co/e51h4ctDnj 6/6
0
0
0
Thrilled to talk about these collabs btwn UK AISI/US CAISI, Ant and OAI, and share our learnings on what makes this kind of testing effective! This work both improves the security of these systems and keeps govt. informed on the state of safeguards, so is super valuable.
Excited to share details on two of our longest running and most effective safeguard collaborations, one with Anthropic and one with OpenAI. We've identified—and they've patched—a large number of vulnerabilities and together strengthened their safeguards. 🧵 1/6
1
0
9