
Robert Kirk
@_robertkirk
Followers 1K · Following 590 · Media 48 · Statuses 361
Research Scientist at @AISecurityInst; PhD Student @ucl_dark. LLMs, AI Safety, Generalisation; @Effect_altruism
Joined January 2020
Life update: I’ve joined the UK @AISafetyInst on the Safeguards Analysis team! We’re working on how to conceptually and empirically assess whether a set of safeguards used to reduce risk is sufficient.
Huge kudos to @StephenLCasper and Kyle O'Brien for leading this work. See the @AISecurityInst thread, and the paper and website below:
deepignorance.ai
Filtering pretraining data prevents dangerous capabilities, doesn’t sacrifice general performance, and results in models that are resistant to tampering.
How can open-weight Large Language Models be safeguarded against malicious uses? In our new paper with @AiEleuther, we find that removing harmful data before training can be over 10x more effective at resisting adversarial fine-tuning than defences added after training 🧵
Very excited for this work to be out. We do large-scale empirical experiments on data filtering for harmful knowledge, and find that it improves tamper resistance >10x compared with existing methods, while being cheap (<1% pretraining compute) and not degrading capabilities!
🧵 New paper from @AISecurityInst x @AiEleuther that I led with Kyle O'Brien: Open-weight LLM safety is both important & neglected. But we show that filtering dual-use knowledge from pre-training data improves tamper resistance *>10x* over post-training baselines.
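To make the filtering idea above concrete, here is a minimal sketch of dropping dual-use documents from a pre-training corpus before training. The blocklist, the looks_dual_use heuristic, and filter_corpus are illustrative placeholders, not the paper's actual pipeline, which relies on trained classifiers at a much larger scale.

```python
# Minimal sketch of pre-training data filtering for dual-use content.
# All names here (BLOCKLIST, looks_dual_use, filter_corpus) are illustrative
# assumptions, not the pipeline described in the paper.

from typing import Iterable, Iterator

# Hypothetical markers of dual-use topics; a real filter would combine
# keyword pre-filtering with a trained classifier.
BLOCKLIST = {"dual-use-topic-a", "dual-use-topic-b"}

def looks_dual_use(document: str) -> bool:
    """Cheap lexical check standing in for a trained harmfulness classifier."""
    text = document.lower()
    return any(term in text for term in BLOCKLIST)

def filter_corpus(documents: Iterable[str]) -> Iterator[str]:
    """Yield only documents that pass the dual-use filter.

    Filtering happens before pre-training, so the model never learns the
    removed knowledge -- the intuition behind the tamper-resistance gain.
    """
    for doc in documents:
        if not looks_dual_use(doc):
            yield doc

if __name__ == "__main__":
    corpus = [
        "A survey of transformer architectures.",
        "An overview of dual-use-topic-a laboratory methods.",  # dropped by the filter
        "Notes on reinforcement learning from human feedback.",
    ]
    kept = list(filter_corpus(corpus))
    print(f"kept {len(kept)} of {len(corpus)} documents")
```

In this toy setup the filter is a single cheap pass over the corpus; the thread's claim is that even a comparatively cheap filtering stage (<1% of pretraining compute) leaves knowledge that is far harder to fine-tune back in than post-training defences.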
RT @alxndrdavies: We at @AISecurityInst worked with @OpenAI to test & improve Agent's safeguards prior to release. A few notes on our exper…
New work out: We demonstrate a new attack against stacked safeguards and analyse defence-in-depth strategies. Excited for this joint collab between @farairesearch and @AISecurityInst to be out!
1/."Swiss cheese security", stacking layers of imperfect defenses, is a key part of AI companies' plans to safeguard models, and is used to secure Anthropic's Opus 4 model. Our new STACK attack breaks each layer in turn, highlighting this approach may be less secure than hoped.
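For context on what a stacked-safeguards ("Swiss cheese") deployment looks like, here is a minimal sketch of layering input-side and output-side filters around a model. The layer functions and the block-if-any-layer-flags policy are illustrative assumptions, not the exact configuration analysed in the STACK paper; the sketch shows the defence pattern, not the attack.

```python
# Minimal sketch of defence-in-depth safeguard stacking.
# Layer names and the blocking policy are illustrative assumptions.

from dataclasses import dataclass
from typing import Callable, List

Safeguard = Callable[[str], bool]  # returns True if the text should be blocked

def input_classifier(prompt: str) -> bool:
    """Placeholder prompt-side filter (e.g. a harmful-request classifier)."""
    return "disallowed-topic" in prompt.lower()

def output_classifier(completion: str) -> bool:
    """Placeholder response-side filter run on the model's output."""
    return "disallowed-content" in completion.lower()

@dataclass
class StackedSafeguards:
    input_layers: List[Safeguard]
    output_layers: List[Safeguard]

    def guard(self, prompt: str, generate: Callable[[str], str]) -> str:
        # A request is refused if *any* layer flags it, so an attacker must
        # slip through every layer at once -- the property a layer-by-layer
        # attack tries to undermine.
        if any(layer(prompt) for layer in self.input_layers):
            return "[refused by input safeguard]"
        completion = generate(prompt)
        if any(layer(completion) for layer in self.output_layers):
            return "[refused by output safeguard]"
        return completion

if __name__ == "__main__":
    stack = StackedSafeguards([input_classifier], [output_classifier])
    print(stack.guard("Tell me about transformers", lambda p: "Transformers are ..."))
```

The security intuition is that a misuse attempt must evade every layer simultaneously; the STACK result is that when layers can be probed and defeated one at a time, the combined stack can be weaker than that intuition suggests.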
RT @RylanSchaeffer: A bit late to the party, but our paper on predictable inference-time / test-time scaling was accepted to #icml2025 🎉🎉🎉…
RT @A_v_i__S: Ever tried to tell if someone really forgot your birthday? Evaluating forgetting is tricky. Now imagine doing that… bu…
RT @sahar_abdelnabi: The Hawthorne effect describes how study participants modify their behavior if they know they are being observed. In our p…
RT @_robertkirk: New paper! With @joshua_clymer, Jonah Weinbaum and others, we've written a safety case for safeguards against misuse. We l…
RT @geoffreyirving: Now all three of AISI's Safeguards, Control, and Alignment Teams have a paper sketching safety cases for technical miti…
Full paper and blog post linked below. This was joint work with @joshua_clymer, Jonah Weinbaum, @kimberlytmai, @selenazhxng and @alxndrdavies.
aisi.gov.uk
An Example Safety Case for Safeguards Against Misuse
We're excited to see work in this space, especially on improving safeguard evaluation techniques. Apply to our Challenge Fund if you're interested in pursuing this line of research:
aisi.gov.uk
View AISI grants. The AI Security Institute is a directorate of the Department for Science, Innovation and Technology that facilitates rigorous research to enable advanced AI governance.
🚨 Introducing the AISI Challenge Fund: £5 million to advance AI security & safety research. Grants of up to £200,000 are available for innovative AI research on technical mitigations, improved evaluations, and stronger safeguards. 🛡️🤖
This safety case represents just one path to justifying that AI misuse risk is maintained below a threshold, but one could also argue that model capabilities are insufficient to produce risk (quote tweet), or that models improve societal resilience to AI misuse (…
aisi.gov.uk
20 Systemic Safety Grant Awardees Announced
Our new paper on safety cases, in collaboration with @GovAI_, shows how it's possible to write safety cases for current systems, using existing techniques. We hope to see organisations using templates like this for their models.
We've done a lot of safeguard evaluations at @AISecurityInst, but a key question has always been how best to make these evaluations actionable and decision-relevant. This paper is a first step towards addressing that issue:
arxiv.org
Existing evaluations of AI misuse safeguards provide a patchwork of evidence that is often difficult to connect to real-world decisions. To bridge this gap, we describe an end-to-end argument (a...
New paper! With @joshua_clymer, Jonah Weinbaum and others, we’ve written a safety case for safeguards against misuse. We lay out how developers can connect safeguard evaluation results to real-world decisions about how to deploy models. 🧵
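As a rough illustration of what "connecting safeguard evaluation results to real-world deployment decisions" can mean, the sketch below turns per-layer red-team attack-success estimates into an estimated residual misuse risk and compares it to a threshold. The independence assumption, the threshold, and the numbers are purely illustrative; the safety case in the paper is a structured argument, not a single formula.

```python
# Minimal sketch of turning safeguard evaluation results into a go/no-go
# deployment decision. Every quantity here is an illustrative assumption.

from math import prod
from typing import List

def estimated_evasion_rate(layer_attack_success: List[float]) -> float:
    """Probability an attacker bypasses every layer, assuming (unrealistically)
    that bypass events are independent across layers."""
    return prod(layer_attack_success)

def deploy_decision(layer_attack_success: List[float], risk_threshold: float) -> bool:
    """Return True if estimated residual misuse risk is below the threshold."""
    return estimated_evasion_rate(layer_attack_success) < risk_threshold

if __name__ == "__main__":
    # Hypothetical red-team results: per-layer attack success rates.
    results = [0.2, 0.1, 0.3]
    print(estimated_evasion_rate(results))                 # ~0.006 with these numbers
    print(deploy_decision(results, risk_threshold=0.01))   # True
```

In practice most of the work lies in justifying each input estimate and the aggregation step, which is what a safety-case template like the one in the paper is meant to structure.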