Robert Kirk @_robertkirk X Profile

Robert Kirk

@_robertkirk

Followers

2K

Following

654

Media

52

Statuses

404

Research Scientist at @AISecurityInst; PhD Student @ucl_dark. LLMs, AI Safety, Generalisation; @Effect_altruism

https://t.co/rHykZtmod5

Joined January 2020

Don't wanna be here? Send us removal request.

Robert Kirk

@_robertkirk

1 year

Life update: I’ve joined the UK @AISafetyInst on the Safeguards Analysis team! We’re working on how to conceptually and empirically assess whether a set of safeguards used to reduce risk is sufficient.

10

6

129

AI Security Institute

@AISecurityInst

18 days

🔒How can we prevent harm from AI systems that pursue unintended goals? AI control is a promising research agenda seeking to address this critical question. Today, we’re excited to launch ControlArena – our library for running secure and reproducible AI control experiments🧵

1

25

86

Daniel Tan

@DanielCHTan97

1 month

New paper! Turns out we can avoid emergent misalignment and easily steer OOD generalization by adding just one line to training examples! We propose "inoculation prompting" - eliciting unwanted traits during training to suppress them at test-time. 🧵

3

25

203

Alexandra Souly

@AlexandraSouly

1 month

New @AISecurityInst research with @AnthropicAI + @turinginst: The number of samples needed to backdoor poison LLMs stays nearly CONSTANT as models scale. With 500 samples, we insert backdoors in LLMs from 600m to 13b params, even as data scaled 20x.🧵/11

1

7

50

Rob Wiblin

@robertwiblin

1 month

Well... damn. New paper from Anthropic and UK AISI: A small number of samples can poison LLMs of any size. https://t.co/STAMpLpbck

24

44

341

AI Security Institute

@AISecurityInst

1 month

We’ve conducted the largest investigation of data poisoning to date with @AnthropicAI and @turinginst. Our results show that as little as 250 malicious documents can be used to “poison” a language model, even as model size and training data grow 🧵

2

10

38

Robert Kirk

@_robertkirk

1 month

Regardless of model size and amount of clean data, backdoor data poisoning attacks on LLMs only need a near-constant number of poison samples! Exciting to put out this research from AISI in Collab with Anthropic and ATI. Lots of work on detection and mitigation to be done here!

Alexandra Souly

@AlexandraSouly

1 month

New @AISecurityInst research with @AnthropicAI + @turinginst: The number of samples needed to backdoor poison LLMs stays nearly CONSTANT as models scale. With 500 samples, we insert backdoors in LLMs from 600m to 13b params, even as data scaled 20x.🧵/11

0

2

6

AI Security Institute

@AISecurityInst

1 month

We’ll be continuing our work with Anthropic and other frontier developers to strengthen model safeguards as capabilities improve. Read more on our blog:

aisi.gov.uk

Our work with Anthropic and the Alan Turing Institute suggests that data poisoning attacks may be easier than previously believed.

0

5

8

Usman Anwar

@usmananwar391

1 month

✨New AI Safety paper on CoT Monitorability✨ We use information theory to answer when Chain-of-Thought monitoring works, and how to make it better.

2

25

164

Robert Kirk

@_robertkirk

1 month

We at @AISecurityInst recently did our first pre-deployment 𝗮𝗹𝗶𝗴𝗻𝗺𝗲𝗻𝘁 evaluation of @AnthropicAI's Claude Sonnet 4.5! This was a first attempt – and we plan to work on this more! – but we still found some interesting results, and some learnings for next time 🧵

3

12

49

Robert Kirk

@_robertkirk

1 month

If you're excited by this kind of work, and more broadly ensuring AI systems are secure, AISI are hiring a bunch of roles: https://t.co/SJTVYEjVFQ Incl. a research scientist role on the safeguards team where you can work on red-teaming frontier AI systems

job-boards.eu.greenhouse.io

London, UK

0

3

Robert Kirk

@_robertkirk

1 month

As @sleepinyourhat says, there is a lot more work to do developing the methodology and tooling for alignment evaluations, and we're excited to be pushing forwards these research.

Sam Bowman

@sleepinyourhat

1 month

[Sonnet 4.5 🧵] Here's the north-star goal for our pre-deployment alignment evals work: The information we share alongside a model should give you an accurate overall sense of the risks the model could pose. It won’t tell you everything, but you shouldn’t be...

1

0

3

Robert Kirk

@_robertkirk

1 month

This was a big cross-team collaboration at @AISecurityInst – we also did additional ablations of the alignment evaluations, and evaluated capabilities relevant for misalignment threat models, and it was very cool to see everything come together for this on fairly short notice!

1

0

2

Robert Kirk

@_robertkirk

1 month

We also took a lot of inspiration from @summerfieldlab's "Lessons from a Chimp" paper when running these evaluations to ensure the results and interpretation were more rigorous and grounded (we now regularly use the phrase "is this eval chimp-proof?").

summerfieldlab @summerfieldlab.bsky.social

@summerfieldlab

4 months

In a new paper, we examine recent claims that AI systems have been observed ‘scheming’, or making strategic attempts to mislead humans. We argue that to test these claims properly, more rigorous methods are needed.

1

0

5

Robert Kirk

@_robertkirk

1 month

Learnings: We built lots of automated evals using LLMs, building on a tool shared w/ us by Anthropic. Prompting is crucial, and nothing substitutes for manually reading transcripts to understand what's going on (luckily we have @AlexandraSouly & others who're crazy good at this)

2

1

5

Robert Kirk

@_robertkirk

1 month

We observed both of these results without being told to look for them by @AnthropicAI, which is a good sign for being able to independently study these models and observe their behaviour

1

0

3

Robert Kirk

@_robertkirk

1 month

In synthetic agentic scenarios, we also found Sonnet 4.5 took actions to preserve its current state 1.9x less than Sonnet 4 and 3.3x than to Opus4.1. Sonnet 4.5 generally justified these actions on grounds other than self-preservation (such as security concerns – see screenshot)

1

0

3

Robert Kirk

@_robertkirk

1 month

Results: We found Sonnet 4.5 talks about being in an eval more than Opus 4.1 and Sonnet 4 (16.6% vs 3.5% & 1.6% respectively) – although it may not be better at distinguishing eval from deployment when prompted. Eval awareness will likely be a very big issue going forward!

1

0

5

Robert Kirk

@_robertkirk

1 month

We at @AISecurityInst recently did our first pre-deployment 𝗮𝗹𝗶𝗴𝗻𝗺𝗲𝗻𝘁 evaluation of @AnthropicAI's Claude Sonnet 4.5! This was a first attempt – and we plan to work on this more! – but we still found some interesting results, and some learnings for next time 🧵

3

12

49

Robert Kirk

@_robertkirk

2 months

https://t.co/wRv2N3xccR

Xander Davies

@alxndrdavies

2 months

If you want to join a six-person red team in the heart of government, we're hiring and so are other teams at AISI! More info: Anthropic blog: https://t.co/voNNtWv2Qk OAI blog: https://t.co/oKFcmQdvqo AISI careers: https://t.co/e51h4ctDnj 6/6

0

Robert Kirk

@_robertkirk

2 months

Thrilled to talk about these collabs btwn UK AISI/US CAISI, Ant and OAI, and share our learnings on what makes this kind of testing effective! This work both improves the security of these systems and keeps govt. informed on the state of safeguards, so is super valuable.

Xander Davies

@alxndrdavies

2 months

Excited to share details on two of our longest running and most effective safeguard collaborations, one with Anthropic and one with OpenAI. We've identified—and they've patched—a large number of vulnerabilities and together strengthened their safeguards. 🧵 1/6

1

0

9