Robert Kirk Profile
Robert Kirk

@_robertkirk

Followers
2K
Following
654
Media
52
Statuses
404

Research Scientist at @AISecurityInst; PhD Student @ucl_dark. LLMs, AI Safety, Generalisation; @Effect_altruism

Joined January 2020
Don't wanna be here? Send us removal request.
@_robertkirk
Robert Kirk
1 year
Life update: I’ve joined the UK @AISafetyInst on the Safeguards Analysis team! We’re working on how to conceptually and empirically assess whether a set of safeguards used to reduce risk is sufficient.
10
6
129
@AISecurityInst
AI Security Institute
18 days
🔒How can we prevent harm from AI systems that pursue unintended goals? AI control is a promising research agenda seeking to address this critical question. Today, we’re excited to launch ControlArena – our library for running secure and reproducible AI control experiments🧵
1
25
86
@DanielCHTan97
Daniel Tan
1 month
New paper! Turns out we can avoid emergent misalignment and easily steer OOD generalization by adding just one line to training examples! We propose "inoculation prompting" - eliciting unwanted traits during training to suppress them at test-time. 🧵
3
25
203
@AlexandraSouly
Alexandra Souly
1 month
New @AISecurityInst research with @AnthropicAI + @turinginst: The number of samples needed to backdoor poison LLMs stays nearly CONSTANT as models scale. With 500 samples, we insert backdoors in LLMs from 600m to 13b params, even as data scaled 20x.🧵/11
1
7
50
@robertwiblin
Rob Wiblin
1 month
Well... damn. New paper from Anthropic and UK AISI: A small number of samples can poison LLMs of any size. https://t.co/STAMpLpbck
24
44
341
@AISecurityInst
AI Security Institute
1 month
We’ve conducted the largest investigation of data poisoning to date with @AnthropicAI and @turinginst. Our results show that as little as 250 malicious documents can be used to “poison” a language model, even as model size and training data grow 🧵
2
10
38
@_robertkirk
Robert Kirk
1 month
Regardless of model size and amount of clean data, backdoor data poisoning attacks on LLMs only need a near-constant number of poison samples! Exciting to put out this research from AISI in Collab with Anthropic and ATI. Lots of work on detection and mitigation to be done here!
@AlexandraSouly
Alexandra Souly
1 month
New @AISecurityInst research with @AnthropicAI + @turinginst: The number of samples needed to backdoor poison LLMs stays nearly CONSTANT as models scale. With 500 samples, we insert backdoors in LLMs from 600m to 13b params, even as data scaled 20x.🧵/11
0
2
6
@AISecurityInst
AI Security Institute
1 month
We’ll be continuing our work with Anthropic and other frontier developers to strengthen model safeguards as capabilities improve. Read more on our blog:
Tweet card summary image
aisi.gov.uk
Our work with Anthropic and the Alan Turing Institute suggests that data poisoning attacks may be easier than previously believed.
0
5
8
@usmananwar391
Usman Anwar
1 month
✨New AI Safety paper on CoT Monitorability✨ We use information theory to answer when Chain-of-Thought monitoring works, and how to make it better.
2
25
164
@_robertkirk
Robert Kirk
1 month
We at @AISecurityInst recently did our first pre-deployment 𝗮𝗹𝗶𝗴𝗻𝗺𝗲𝗻𝘁 evaluation of @AnthropicAI's Claude Sonnet 4.5! This was a first attempt – and we plan to work on this more! – but we still found some interesting results, and some learnings for next time 🧵
3
12
49
@_robertkirk
Robert Kirk
1 month
If you're excited by this kind of work, and more broadly ensuring AI systems are secure, AISI are hiring a bunch of roles: https://t.co/SJTVYEjVFQ Incl. a research scientist role on the safeguards team where you can work on red-teaming frontier AI systems
Tweet card summary image
job-boards.eu.greenhouse.io
London, UK
0
0
3
@_robertkirk
Robert Kirk
1 month
As @sleepinyourhat says, there is a lot more work to do developing the methodology and tooling for alignment evaluations, and we're excited to be pushing forwards these research.
@sleepinyourhat
Sam Bowman
1 month
[Sonnet 4.5 🧵] Here's the north-star goal for our pre-deployment alignment evals work: The information we share alongside a model should give you an accurate overall sense of the risks the model could pose. It won’t tell you everything, but you shouldn’t be...
1
0
3
@_robertkirk
Robert Kirk
1 month
This was a big cross-team collaboration at @AISecurityInst – we also did additional ablations of the alignment evaluations, and evaluated capabilities relevant for misalignment threat models, and it was very cool to see everything come together for this on fairly short notice!
1
0
2
@_robertkirk
Robert Kirk
1 month
We also took a lot of inspiration from @summerfieldlab's "Lessons from a Chimp" paper when running these evaluations to ensure the results and interpretation were more rigorous and grounded (we now regularly use the phrase "is this eval chimp-proof?").
@summerfieldlab
summerfieldlab @summerfieldlab.bsky.social
4 months
In a new paper, we examine recent claims that AI systems have been observed ‘scheming’, or making strategic attempts to mislead humans. We argue that to test these claims properly, more rigorous methods are needed.
1
0
5
@_robertkirk
Robert Kirk
1 month
Learnings: We built lots of automated evals using LLMs, building on a tool shared w/ us by Anthropic. Prompting is crucial, and nothing substitutes for manually reading transcripts to understand what's going on (luckily we have @AlexandraSouly & others who're crazy good at this)
2
1
5
@_robertkirk
Robert Kirk
1 month
We observed both of these results without being told to look for them by @AnthropicAI, which is a good sign for being able to independently study these models and observe their behaviour
1
0
3
@_robertkirk
Robert Kirk
1 month
In synthetic agentic scenarios, we also found Sonnet 4.5 took actions to preserve its current state 1.9x less than Sonnet 4 and 3.3x than to Opus4.1. Sonnet 4.5 generally justified these actions on grounds other than self-preservation (such as security concerns – see screenshot)
1
0
3
@_robertkirk
Robert Kirk
1 month
Results: We found Sonnet 4.5 talks about being in an eval more than Opus 4.1 and Sonnet 4 (16.6% vs 3.5% & 1.6% respectively) – although it may not be better at distinguishing eval from deployment when prompted. Eval awareness will likely be a very big issue going forward!
1
0
5
@_robertkirk
Robert Kirk
1 month
We at @AISecurityInst recently did our first pre-deployment 𝗮𝗹𝗶𝗴𝗻𝗺𝗲𝗻𝘁 evaluation of @AnthropicAI's Claude Sonnet 4.5! This was a first attempt – and we plan to work on this more! – but we still found some interesting results, and some learnings for next time 🧵
3
12
49
@_robertkirk
Robert Kirk
2 months
@alxndrdavies
Xander Davies
2 months
If you want to join a six-person red team in the heart of government, we're hiring and so are other teams at AISI! More info: Anthropic blog: https://t.co/voNNtWv2Qk OAI blog: https://t.co/oKFcmQdvqo AISI careers: https://t.co/e51h4ctDnj 6/6
0
0
0
@_robertkirk
Robert Kirk
2 months
Thrilled to talk about these collabs btwn UK AISI/US CAISI, Ant and OAI, and share our learnings on what makes this kind of testing effective! This work both improves the security of these systems and keeps govt. informed on the state of safeguards, so is super valuable.
@alxndrdavies
Xander Davies
2 months
Excited to share details on two of our longest running and most effective safeguard collaborations, one with Anthropic and one with OpenAI. We've identified—and they've patched—a large number of vulnerabilities and together strengthened their safeguards. 🧵 1/6
1
0
9