Cameron Holmes ✈️ Berkeley
@CameronHolmes92
Followers: 337 · Following: 29K · Media: 128 · Statuses: 1K
Managing Alignment Research @MATSprogram. Market participant, EA. Parenting like Dr Louise Banks.
London
Joined May 2014
Incredible work by 3x @MATSprogram alumni and a great example of applied Mech Interp beating black box baselines and making significant progress on critical real-world problems:
Imagine if ChatGPT highlighted every word it wasn't sure about. We built a streaming hallucination detector that flags hallucinations in real-time.
2
3
25
The @AISecurityInst Cyber Autonomous Systems Team is hiring propensity researchers to grow the science around whether models *are likely* to attempt dangerous behaviour, as opposed to whether they are capable of doing so. Application link below! 🧵
1
11
47
come and join me on the AISI comms team! can wholeheartedly recommend AISI as a wonderful place to work - very talented teams dedicated to making AI go well and a tonne of exciting work to shout about. link to apply in next tweet 😌
3
9
107
New paper: Finetuning on narrow domains leaves traces behind. By looking at the difference in activations before and after finetuning, we can interpret what the model was finetuned for. And so can our interpretability agent! 🧵
2
28
156
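A minimal sketch (not the paper's pipeline) of the activation-diffing idea in the tweet above: average the residual-stream activations of a base/finetuned pair over a few generic prompts, take the difference, and read the difference vector through the unembedding as a first hint of what the finetune was about. The model names, layer choice, and logit-lens readout below are placeholder assumptions.

```python
# Sketch of reading finetuning "traces" from activation differences, assuming a
# base/finetuned pair that share a tokenizer. Names and layer are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE, TUNED = "gpt2", "my-org/gpt2-narrow-finetune"  # hypothetical model pair
LAYER = 8
prompts = ["The weather today is", "Here is a short story:", "In summary,"]

tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE).eval()
tuned = AutoModelForCausalLM.from_pretrained(TUNED).eval()

@torch.no_grad()
def mean_activation(model, texts):
    acts = []
    for t in texts:
        ids = tok(t, return_tensors="pt").input_ids
        hs = model(ids, output_hidden_states=True).hidden_states[LAYER]
        acts.append(hs[0].mean(dim=0))       # average over token positions
    return torch.stack(acts).mean(dim=0)     # average over prompts

# The difference vector is the candidate "trace" left behind by narrow finetuning.
diff = mean_activation(tuned, prompts) - mean_activation(base, prompts)

# Logit-lens-style readout: which tokens does the difference direction promote?
logits = base.lm_head(base.transformer.ln_f(diff))
print(tok.decode(logits.topk(10).indices))
```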
🚨 What do reasoning models actually learn during training? Our new paper shows base models already contain reasoning mechanisms; thinking models learn when to use them! By invoking those skills at the right time in the base model, we recover up to 91% of the performance gap 🧵
16
73
587
MATS is hiring world-class researchers, managers, generalists, and more to help grow our AI safety & security talent pipeline! Apply by Oct 17 for a Dec 1 start.
4
9
48
@MATSprogram Apply to @MATSprogram by EOD Oct 2nd (or Sept 12 for @NeelNanda5!) to join a growing list of great researchers making real progress on some of the most important problems!
0
0
2
Imagine if ChatGPT highlighted every word it wasn't sure about. We built a streaming hallucination detector that flags hallucinations in real-time.
205
631
9K
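A minimal sketch of how a streaming, token-level hallucination flag could work, not the authors' detector: a linear probe scores each new token's hidden state as it is generated, and tokens the probe is unsure about get highlighted. The model, probe, layer, and threshold are placeholder assumptions.

```python
# Sketch of a streaming hallucination flag: a linear probe on the residual
# stream scores each token as it is generated. The probe is assumed to have
# been trained offline on labelled hallucinations; here it is untrained.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"            # placeholder; any causal LM returning hidden states works
LAYER, THRESHOLD = 8, 0.5

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()
probe = torch.nn.Linear(model.config.hidden_size, 1)

@torch.no_grad()
def stream_with_flags(prompt: str, max_new_tokens: int = 40) -> None:
    ids = tok(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        out = model(ids, output_hidden_states=True)
        next_id = out.logits[0, -1].argmax()
        # Probe the chosen layer's activation at the newest position.
        p_bad = torch.sigmoid(probe(out.hidden_states[LAYER][0, -1])).item()
        word = tok.decode(next_id.item())
        # Highlight suspect tokens in brackets as they stream out.
        print(f"[{word}]" if p_bad > THRESHOLD else word, end="", flush=True)
        ids = torch.cat([ids, next_id.view(1, 1)], dim=-1)

stream_with_flags("The capital of Australia is")
```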
Can we interpret what happens in finetuning? Yes, if it's on a narrow domain! Narrow finetuning leaves traces behind. By comparing activations before and after finetuning we can interpret these, even with an agent! We interpret subliminal learning, emergent misalignment, and more.
6
23
227
Scholars at MATS are doing awesome research and you can apply to join them!
MATS 9.0 applications are open! Launch your career in AI alignment, governance, and security with our 12-week research program. MATS provides field-leading research mentorship, funding, Berkeley & London offices, housing, and talks/workshops with AI experts.
0
0
6
This round of @SPARexec will have a whopping 80+ projects, almost double the last round. So if you were considering getting into AI safety or policy research, this is a great time to do it. Apps close August 20!
2
4
28
- alignment is urgent
- it is solvable
- we should try really, really hard
- if you have expertise to bring to bear on this problem, you should apply for our fund! (up to £1m per project + support from the very talented AISI alignment & control teams)
📢 Introducing the Alignment Project: A new fund for research on urgent challenges in AI alignment and control, backed by over £15 million.
▶️ Up to £1 million per project
▶️ Compute access, venture capital investment, and expert support
Learn more and apply ⬇️
20
13
157
Problem: Train LLM on insecure code → it becomes broadly misaligned
Solution: Add safety data? What if you can't? Use interpretability!
We remove misaligned concepts during finetuning to steer OOD generalization
We reduce emergent misalignment 10x w/o modifying training data
9
27
162
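One way the "remove misaligned concepts during finetuning" idea could look in code, as a hedged sketch rather than the paper's method: project the residual stream onto the orthogonal complement of a precomputed concept direction via a forward hook, so the finetuning loss never sees that component. The model, layer, and direction below are stand-ins.

```python
# Sketch of ablating a "misaligned concept" direction from the residual stream
# during finetuning. The direction here is random as a placeholder; in practice
# it would be estimated, e.g. from a probe or a sparse autoencoder feature.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")    # placeholder model
LAYER = 6

concept = torch.randn(model.config.hidden_size)
concept = concept / concept.norm()                      # unit concept direction

def ablate_concept(module, inputs, output):
    hidden = output[0]                                  # (batch, seq, d_model)
    coeff = hidden @ concept                            # per-token projection coefficient
    hidden = hidden - coeff.unsqueeze(-1) * concept     # remove that component
    return (hidden,) + output[1:]

hook = model.transformer.h[LAYER].register_forward_hook(ablate_concept)

# Finetuning then proceeds as usual: the loss (and its gradients) never see the
# ablated component, which is what steers out-of-distribution generalisation.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
# ... standard training loop over the narrow finetuning dataset goes here ...
hook.remove()
```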
With @Butanium_ and @NeelNanda5 we've just published a post on model diffing that extends our previous paper. Rather than trying to reverse-engineer the full fine-tuned model, model diffing focuses on understanding what makes it different from its base model internally.
2
8
105
Can we actually control reasoning behaviors in thinking LLMs? Our @iclr_conf workshop paper is out! 🎉 We show how to steer DeepSeek-R1-Distill’s reasoning: make it backtrack, add knowledge, test examples. Just by adding steering vectors to its activations! Details in 🧵👇
5
27
169
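A minimal sketch of the steering-vector intervention the thread describes: add a fixed vector to the residual stream at one layer during generation. The contrastive prompts used to build a "backtracking" vector, plus the layer and scale, are illustrative guesses rather than the paper's values.

```python
# Sketch of steering a reasoning model by adding a vector to its activations.
# Layer, scale, and the contrastive prompts are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"     # small distill for illustration
LAYER, SCALE = 12, 4.0

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

@torch.no_grad()
def last_token_act(text: str) -> torch.Tensor:
    ids = tok(text, return_tensors="pt").input_ids
    return model(ids, output_hidden_states=True).hidden_states[LAYER][0, -1]

# Assumed contrastive construction: backtracking vs. non-backtracking snippets.
steer = last_token_act("Wait, let me reconsider that step.") - last_token_act("So the answer is clear.")

def add_steering(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * steer
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

hook = model.model.layers[LAYER].register_forward_hook(add_steering)
ids = tok("Solve 17 * 23. Think step by step.", return_tensors="pt").input_ids
print(tok.decode(model.generate(ids, max_new_tokens=64)[0]))
hook.remove()
```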
Great new paper by @RohDGupta shows that models can learn to evade probes when latent-space monitors are used during RL, and explores the mechanisms responsible as well as their limitations. @MATSprogram scholars rock 💪
🧵 Can language models learn to evade latent-space defences via RL? We test whether probes are robust to optimisation pressure from reinforcement learning. We show that RL can break probes, but only in certain cases. Read on to find out when and why this happens!
0
0
4
My @MATSprogram scholar Rohan just finished a cool paper on attacking latent-space probes with RL! Going in, I was unsure whether RL could explore into probe-bypassing policies, or change the activations enough. Turns out it can, but not always. Go check out the thread & paper!
🧵 Can language models learn to evade latent-space defences via RL? We test whether probes are robust to optimisation pressure from reinforcement learning. We show that RL can break probes, but only in certain cases. Read on to find out when and why this happens!
0
1
12
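To make the setup being stress-tested concrete, here is a hedged sketch of a latent-space monitor under RL pressure: a linear probe scores the policy's activations, and its score is folded into the reward, which is exactly the optimisation pressure that can teach the policy to evade the probe. The model, probe, and reward shaping are placeholder assumptions, not the paper's recipe.

```python
# Sketch of a latent-space monitor under RL optimisation pressure: a linear
# probe scores the policy's activations, and the (placeholder) reward combines
# task success with not triggering the probe.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL, LAYER = "gpt2", 8                                 # placeholders
tok = AutoTokenizer.from_pretrained(MODEL)
policy = AutoModelForCausalLM.from_pretrained(MODEL)

# Assume the probe was trained to detect the monitored behaviour; untrained here.
probe = torch.nn.Linear(policy.config.hidden_size, 1)

def probe_score(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        acts = policy(ids, output_hidden_states=True).hidden_states[LAYER][0]
    return torch.sigmoid(probe(acts)).mean().item()      # mean detection prob over tokens

def reward(completion: str, task_reward: float, penalty: float = 2.0) -> float:
    # RL on this reward is what puts optimisation pressure on the probe:
    # the policy is paid for the task *and* for keeping the monitor quiet.
    return task_reward - penalty * probe_score(completion)

print(reward("example completion", task_reward=1.0))
```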