Robert Kirk Profile
Robert Kirk

@_robertkirk

Followers
1K
Following
590
Media
48
Statuses
361

Research Scientist at @AISecurityInst; PhD Student @ucl_dark. LLMs, AI Safety, Generalisation; @Effect_altruism

Joined January 2020
@_robertkirk
Robert Kirk
1 year
Life update: I’ve joined the UK @AISafetyInst on the Safeguards Analysis team! We’re working on how to conceptually and empirically assess whether a set of safeguards used to reduce risk is sufficient.
10
6
128
@_robertkirk
Robert Kirk
4 days
Huge kudos to @StephenLCasper and Kyle O'Brien for leading this work. See the @AISecurityInst thread, and the paper and website below:
deepignorance.ai
Filtering pretraining data prevents dangerous capabilities, doesn’t sacrifice general performance, and results in models that are resistant to tampering.
@AISecurityInst
AI Security Institute
4 days
How can open-weight Large Language Models be safeguarded against malicious uses? In our new paper with @AiEleuther, we find that removing harmful data before training can be over 10x more effective at resisting adversarial fine-tuning than defences added after training 🧵
0
0
3
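For readers unfamiliar with what "resisting adversarial fine-tuning" means in practice, here is a minimal sketch of how a tamper-resistance evaluation could be summarised: adversarially fine-tune the released model at several compute budgets, score it on a harmful-capability benchmark after each attack, and report the worst case. The dataclass, scores and threshold below are illustrative assumptions, not the paper's actual protocol.

```python
# Minimal sketch: summarising a tamper-resistance evaluation.
# Assumption: each attack run records the model's score on a held-out
# harmful-capability benchmark after adversarial fine-tuning with a given
# compute budget. Names and numbers are illustrative, not from the paper.

from dataclasses import dataclass

@dataclass
class AttackRun:
    attack_steps: int        # adversarial fine-tuning budget
    harmful_score: float     # harmful-capability benchmark score (0-1)

def tamper_resistance(baseline_score: float, runs: list[AttackRun],
                      threshold: float = 0.5) -> dict:
    """Summarise how much adversarial fine-tuning recovers harmful capability.

    Reports the worst-case post-attack score and the smallest attack budget
    that pushes the model past `threshold`, if any.
    """
    worst = max((r.harmful_score for r in runs), default=baseline_score)
    breaching = [r.attack_steps for r in runs if r.harmful_score >= threshold]
    return {
        "baseline_harmful_score": baseline_score,
        "worst_case_harmful_score": worst,
        "min_steps_to_breach": min(breaching) if breaching else None,
    }

if __name__ == "__main__":
    # Illustrative numbers only: a data-filtered model vs. a post-hoc-defended one.
    filtered_runs = [AttackRun(1_000, 0.12), AttackRun(10_000, 0.18)]
    posthoc_runs = [AttackRun(1_000, 0.35), AttackRun(10_000, 0.62)]
    print("filtered:", tamper_resistance(0.10, filtered_runs))
    print("post-hoc:", tamper_resistance(0.10, posthoc_runs))
```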
@_robertkirk
Robert Kirk
4 days
Very excited for this work to be out. We do large-scale empirical experiments on data filtering for harmful knowledge, and find that it improves tamper resistance >10x compared with existing methods, while being cheap (<1% of pretraining compute) and not degrading capabilities!
@StephenLCasper
Cas (Stephen Casper)
4 days
🧵 New paper from @AISecurityInst x @AiEleuther that I led with Kyle O’Brien: Open-weight LLM safety is both important & neglected. But we show that filtering dual-use knowledge from pre-training data improves tamper resistance *>10x* over post-training baselines.
1
4
16
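As a rough illustration of the data-filtering idea (not the paper's actual pipeline), the sketch below drops pretraining documents that a cheap scorer flags as dual-use before training ever sees them; the blocklist scorer is a hypothetical stand-in for whatever classifier a real pipeline would use.

```python
# Minimal sketch of document-level pretraining data filtering.
# Assumption: a cheap scorer flags documents likely to contain dual-use
# content, and flagged documents are dropped before training. The keyword
# scorer is a toy stand-in for a real harmfulness classifier.

from typing import Iterable, Iterator

BLOCKLIST = {"pathogen synthesis", "toxin production"}  # illustrative terms only

def harmful_score(doc: str) -> float:
    """Toy scorer: fraction of blocklist terms present in the document."""
    text = doc.lower()
    return sum(term in text for term in BLOCKLIST) / len(BLOCKLIST)

def filter_corpus(docs: Iterable[str], threshold: float = 0.5) -> Iterator[str]:
    """Yield only documents scoring below the harmfulness threshold."""
    for doc in docs:
        if harmful_score(doc) < threshold:
            yield doc

if __name__ == "__main__":
    corpus = [
        "A recipe for sourdough bread.",
        "Detailed notes on pathogen synthesis and toxin production.",
    ]
    print(list(filter_corpus(corpus)))  # only the benign document survives
```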
@_robertkirk
Robert Kirk
30 days
RT @alxndrdavies: We at @AISecurityInst worked with @OpenAI to test & improve Agent’s safeguards prior to release. A few notes on our exper….
0
29
0
@_robertkirk
Robert Kirk
1 month
New work out: We demonstrate a new attack against stacked safeguards and analyse defence-in-depth strategies. Excited for this joint collab between @farairesearch and @AISecurityInst to be out!
@farairesearch
FAR.AI
1 month
1/."Swiss cheese security", stacking layers of imperfect defenses, is a key part of AI companies' plans to safeguard models, and is used to secure Anthropic's Opus 4 model. Our new STACK attack breaks each layer in turn, highlighting this approach may be less secure than hoped.
0
4
21
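The "break each layer in turn" idea can be made concrete with a toy sketch: model each safeguard as a predicate over the request and greedily search for a rewrite that defeats the blocking layer while keeping all earlier layers defeated. The layers and rewrites below are illustrative stand-ins, not FAR.AI's STACK implementation.

```python
# Toy sketch of a staged attack on stacked ("Swiss cheese") safeguards.
# Assumption: each safeguard is a predicate over the request (True = blocked),
# and the attacker tries simple rewrites layer by layer. Everything here is
# an illustrative stand-in, not the actual STACK attack.

from typing import Callable, Optional

Safeguard = Callable[[str], bool]   # True means the request is blocked
Transform = Callable[[str], str]    # candidate rewrite of the request

def attack_stack(request: str, layers: list[Safeguard],
                 transforms: list[Transform]) -> Optional[str]:
    """Greedily defeat each layer in turn while keeping earlier layers defeated.

    Returns a rewrite that passes every layer, or None if some layer
    resists every available transform.
    """
    current = request
    for i, layer in enumerate(layers):
        if not layer(current):
            continue  # the current rewrite already slips past this layer
        for transform in transforms:
            candidate = transform(current)
            # the rewrite must pass this layer and all earlier ones
            if not any(check(candidate) for check in layers[: i + 1]):
                current = candidate
                break
        else:
            return None  # no single transform defeats this layer
    return current

if __name__ == "__main__":
    keyword_filter: Safeguard = lambda r: "bomb" in r.lower()
    phrase_filter: Safeguard = lambda r: "step-by-step" in r.lower()
    rewrites: list[Transform] = [
        lambda r: r.replace("bomb", "b0mb"),
        lambda r: r.replace("step-by-step", "detailed"),
    ]
    print(attack_stack("step-by-step bomb instructions",
                       [keyword_filter, phrase_filter], rewrites))
```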
@_robertkirk
Robert Kirk
2 months
RT @RylanSchaeffer: A bit late to the party, but our paper on predictable inference-time / test-time scaling was accepted to #icml2025 🎉🎉🎉….
0
21
0
@_robertkirk
Robert Kirk
2 months
RT @A_v_i__S: Ever tried to tell if someone really forgot your birthday? Evaluating forgetting is tricky. Now imagine doing that… bu….
0
8
0
@_robertkirk
Robert Kirk
3 months
RT @sahar_abdelnabi: Hawthorne effect describes how study participants modify their behavior if they know they are being observed. In our p….
0
21
0
@_robertkirk
Robert Kirk
3 months
RT @_robertkirk: New paper! With @joshua_clymer, Jonah Weinbaum and others, we’ve written a safety case for safeguards against misuse. We l….
0
11
0
@_robertkirk
Robert Kirk
3 months
RT @geoffreyirving: Now all three of AISI's Safeguards, Control, and Alignment Teams have a paper sketching safety cases for technical miti….
0
4
0
@_robertkirk
Robert Kirk
3 months
Full paper and blog post below. This was joint work with @joshua_clymer, Jonah Weinbaum, @kimberlytmai, @selenazhxng and @alxndrdavies.
aisi.gov.uk
An Example Safety Case for Safeguards Against Misuse
0
0
3
@_robertkirk
Robert Kirk
3 months
We're excited to see work in this space, especially on improving safeguard evaluation techniques. Apply to our Challenge Fund if you're interested in pursuing this line of research:
aisi.gov.uk
View AISI grants. The AI Security Institute is a directorate of the Department for Science, Innovation and Technology that facilitates rigorous research to enable advanced AI governance.
@AISecurityInst
AI Security Institute
5 months
🚨 Introducing the AISI Challenge Fund: £5 million to advance AI security & safety research. Grants of up to £200,000 are available for innovative AI research on technical mitigations, improved evaluations, and stronger safeguards. 🛡️🤖
1
1
5
@_robertkirk
Robert Kirk
3 months
This safety case represents just one path for justifying that AI misuse risk is maintained below a threshold. One could also argue that model capabilities are insufficient to produce risk (quote tweet), or that models improve societal resilience to AI misuse (…).
aisi.gov.uk
20 Systemic Safety Grant Awardees Announced
@AISecurityInst
AI Security Institute
9 months
Our new paper on safety cases, in collaboration with @GovAI_, shows how it’s possible to write safety cases for current systems, using existing techniques. We hope to see organisations using templates like this for their models.
1
0
2
@_robertkirk
Robert Kirk
3 months
Finally, we combine all of the previous components into an overall safety case, enabling a clear, structured and compelling end-to-end argument that connects technical evaluations with real-world deployment decisions.
1
0
2
@_robertkirk
Robert Kirk
3 months
We suggest three procedural policies developers could adopt to manage risk during deployment. Our case uses a continuous signal of risk during deployment (e.g. through a jailbreak bounty), enabling developers to respond by improving safeguards or adjusting access.
1
0
2
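A toy sketch of the kind of procedural policy described above: compare a continuous deployment-time signal (e.g. confirmed jailbreak-bounty reports) against a pre-agreed threshold and trigger a pre-committed response when it is exceeded. The threshold and responses are illustrative assumptions, not the paper's policies.

```python
# Toy sketch of a deployment-time risk trigger.
# Assumption: a continuous signal (e.g. confirmed jailbreak-bounty reports per
# week) is checked against a pre-agreed threshold, with a pre-committed
# escalation path. Threshold and responses are illustrative only.

from dataclasses import dataclass

@dataclass
class DeploymentPolicy:
    max_confirmed_jailbreaks_per_week: int = 3

    def respond(self, confirmed_this_week: int) -> str:
        if confirmed_this_week <= self.max_confirmed_jailbreaks_per_week:
            return "continue deployment; patch reported jailbreaks"
        # Pre-committed escalation when the risk signal exceeds the threshold.
        return "escalate: strengthen safeguards or restrict access until risk is back under threshold"

if __name__ == "__main__":
    policy = DeploymentPolicy()
    for observed in [1, 5]:
        print(observed, "->", policy.respond(observed))
```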
@_robertkirk
Robert Kirk
3 months
This estimate of risk could then be used to make deployment or access decisions. However, risk can change during deployment, e.g. as new methods for subverting safeguards (jailbreaking) are discovered.
1
0
2
@_robertkirk
Robert Kirk
3 months
Next we describe our "uplift model": a method for combining safeguard evals with threat modelling and capability evals to estimate the risk posed by the deployment of the hypothetical safeguarded AI system. Play with an interactive version here:
1
0
3
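As a rough illustration of what an uplift-model style calculation might look like (the interactive version linked in the tweet is the authoritative reference), the sketch below decomposes risk per threat pathway into an attempt frequency from threat modelling, a bypass probability from safeguard evals, and a harm probability from capability evals. The decomposition and all numbers are illustrative assumptions, not the paper's model.

```python
# Minimal numeric sketch of an "uplift model" style risk estimate.
# Assumption: for each threat pathway, expected harm is the product of
# attempt frequency (threat modelling), safeguard bypass probability
# (safeguard evals), and harm probability given bypass (capability evals).
# All numbers are illustrative only.

from dataclasses import dataclass

@dataclass
class Pathway:
    name: str
    attempts_per_year: float      # from threat modelling
    p_bypass_safeguards: float    # from safeguard (red-team) evaluations
    p_harm_given_bypass: float    # from capability evals / threat modelling

    def expected_harmful_outcomes(self) -> float:
        return (self.attempts_per_year
                * self.p_bypass_safeguards
                * self.p_harm_given_bypass)

def total_risk(pathways: list[Pathway]) -> float:
    return sum(p.expected_harmful_outcomes() for p in pathways)

if __name__ == "__main__":
    pathways = [
        Pathway("novice uplift", attempts_per_year=100,
                p_bypass_safeguards=0.02, p_harm_given_bypass=0.05),
        Pathway("expert uplift", attempts_per_year=5,
                p_bypass_safeguards=0.10, p_harm_given_bypass=0.20),
    ]
    print(f"expected harmful outcomes per year: {total_risk(pathways):.2f}")
```

An estimate like this could then feed the deployment and access decisions described in the thread, and be revised as the deployment-time risk signal changes.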
@_robertkirk
Robert Kirk
3 months
For safeguard evaluation, we build on our previous Principles for Safeguard Evaluation (…). This involves tasking a red team with trying to get the AI assistant to fulfil harmful requests that are necessary for a pathway to harm identified by threat modelling.
2
0
2
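A toy sketch of how red-team outcomes from such an evaluation could be aggregated into a headline number: record each attempt as bypassed or blocked and report the empirical bypass rate with a crude upper bound. This is a hypothetical illustration, not AISI's actual evaluation harness.

```python
# Toy sketch of aggregating red-team outcomes into a safeguard bypass rate.
# Assumption: each red-team attempt at a harmful request (tied to a
# threat-model pathway) is recorded as bypassed (True) or blocked (False).
# The normal-approximation bound is crude and for illustration only.

import math

def bypass_rate(outcomes: list[bool]) -> tuple[float, float]:
    """Return (empirical bypass rate, ~95% upper bound via normal approximation)."""
    n = len(outcomes)
    rate = sum(outcomes) / n
    upper = rate + 1.96 * math.sqrt(rate * (1 - rate) / n)
    return rate, min(upper, 1.0)

if __name__ == "__main__":
    # True = the red team got the harmful request fulfilled despite safeguards.
    outcomes = [False] * 95 + [True] * 5
    rate, upper = bypass_rate(outcomes)
    print(f"bypass rate: {rate:.2%} (<= {upper:.2%} at ~95% confidence)")
```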
@_robertkirk
Robert Kirk
3 months
The paper lays out an example safety case for how developers could justify that safeguards lower the risk of misuse below some pre-specified threshold. This involves three key components:
1. Safeguard evaluation
2. Uplift modelling
3. Managing risk in deployment
1
0
2
@_robertkirk
Robert Kirk
3 months
We've done a lot of safeguard evaluations at @AISecurityInst, but a key question has always been how best to make these evaluations actionable and decision-relevant. This paper is a first step towards addressing that issue:
arxiv.org
Existing evaluations of AI misuse safeguards provide a patchwork of evidence that is often difficult to connect to real-world decisions. To bridge this gap, we describe an end-to-end argument (a...
1
0
4
@_robertkirk
Robert Kirk
3 months
New paper! With @joshua_clymer, Jonah Weinbaum and others, we’ve written a safety case for safeguards against misuse. We lay out how developers can connect safeguard evaluation results to real-world decisions about how to deploy models. 🧵
4
11
47