Cameron Holmes ✈️ Berkeley
@CameronHolmes92
Followers: 337 · Following: 29K · Media: 128 · Statuses: 1K
Managing Alignment Research @MATSprogram. Market participant, EA. Parenting like Dr Louise Banks.
London
Joined May 2014
Incredible work by 3x @MATSprogram alumni and a great example of applied Mech Interp beating black box baselines and making significant progress on critical real-world problems:
Imagine if ChatGPT highlighted every word it wasn't sure about. We built a streaming hallucination detector that flags hallucinations in real-time.
2
3
25
The @AISecurityInst Cyber Autonomous Systems Team is hiring propensity researchers to grow the science around whether models *are likely* to attempt dangerous behaviour, as opposed to whether they are capable of doing so. Application link below! 🧵
1
11
47
come and join me on the AISI comms team! can wholeheartedly recommend AISI as a wonderful place to work - very talented teams dedicated to making AI go well and a tonne of exciting work to shout about. link to apply in next tweet 😌
3
9
107
New paper: Finetuning on narrow domains leaves traces behind. By looking at the difference in activations before and after finetuning, we can interpret what the model was finetuned for. And so can our interpretability agent! 🧵
2
28
156
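A minimal sketch (not the paper's pipeline) of the activation-diffing idea in the tweet above: average the residual-stream activations of a base/finetuned pair over a few generic prompts, take the difference, and read the difference vector through the unembedding as a first hint of what the finetune was about. The model names, layer choice, and logit-lens readout below are placeholder assumptions.

```python
# Sketch of reading finetuning "traces" from activation differences, assuming a
# base/finetuned pair that share a tokenizer. Names and layer are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE, TUNED = "gpt2", "my-org/gpt2-narrow-finetune"  # hypothetical model pair
LAYER = 8
prompts = ["The weather today is", "Here is a short story:", "In summary,"]

tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE).eval()
tuned = AutoModelForCausalLM.from_pretrained(TUNED).eval()

@torch.no_grad()
def mean_activation(model, texts):
    acts = []
    for t in texts:
        ids = tok(t, return_tensors="pt").input_ids
        hs = model(ids, output_hidden_states=True).hidden_states[LAYER]
        acts.append(hs[0].mean(dim=0))       # average over token positions
    return torch.stack(acts).mean(dim=0)     # average over prompts

# The difference vector is the candidate "trace" left behind by narrow finetuning.
diff = mean_activation(tuned, prompts) - mean_activation(base, prompts)

# Logit-lens-style readout: which tokens does the difference direction promote?
logits = base.lm_head(base.transformer.ln_f(diff))
print(tok.decode(logits.topk(10).indices))
```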
🚨 What do reasoning models actually learn during training? Our new paper shows base models already contain reasoning mechanisms; thinking models learn when to use them! By invoking those skills at the right time in the base model, we recover up to 91% of the performance gap 🧵
16
73
587
MATS is hiring world-class researchers, managers, generalists, and more to help grow our AI safety & security talent pipeline! Apply by Oct 17 for a Dec 1 start.
4
9
48
@MATSprogram Apply to @MATSprogram by EOD Oct 2nd (or Sept 12 for @NeelNanda5!) to join a growing list of great researchers making real progress on some of the most important problems!
0
0
2
Imagine if ChatGPT highlighted every word it wasn't sure about. We built a streaming hallucination detector that flags hallucinations in real-time.
205
631
9K
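A minimal sketch of how a streaming, token-level hallucination flag could work, not the authors' detector: a linear probe scores each new token's hidden state as it is generated, and tokens the probe is unsure about get highlighted. The model, probe, layer, and threshold are placeholder assumptions.

```python
# Sketch of a streaming hallucination flag: a linear probe on the residual
# stream scores each token as it is generated. The probe is assumed to have
# been trained offline on labelled hallucinations; here it is untrained.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"            # placeholder; any causal LM returning hidden states works
LAYER, THRESHOLD = 8, 0.5

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()
probe = torch.nn.Linear(model.config.hidden_size, 1)

@torch.no_grad()
def stream_with_flags(prompt: str, max_new_tokens: int = 40) -> None:
    ids = tok(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        out = model(ids, output_hidden_states=True)
        next_id = out.logits[0, -1].argmax()
        # Probe the chosen layer's activation at the newest position.
        p_bad = torch.sigmoid(probe(out.hidden_states[LAYER][0, -1])).item()
        word = tok.decode(next_id.item())
        # Highlight suspect tokens in brackets as they stream out.
        print(f"[{word}]" if p_bad > THRESHOLD else word, end="", flush=True)
        ids = torch.cat([ids, next_id.view(1, 1)], dim=-1)

stream_with_flags("The capital of Australia is")
```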
Can we interpret what happens in finetuning? Yes, if it's on a narrow domain! Narrow finetuning leaves traces behind. By comparing activations before and after finetuning we can interpret these, even with an agent! We interpret subliminal learning, emergent misalignment, and more.
6
23
227
Scholars at MATS are doing awesome research and you can apply to join them!
MATS 9.0 applications are open! Launch your career in AI alignment, governance, and security with our 12-week research program. MATS provides field-leading research mentorship, funding, Berkeley & London offices, housing, and talks/workshops with AI experts.
0
0
6
This round of @SPARexec will have a whopping 80+ projects, almost double the last round. So if you were considering getting into AI safety or policy research, this is a great time to do it. Apps close August 20!
2
4
28
- alignment is urgent
- it is solvable
- we should try really, really hard
- if you have expertise to bring to bear on this problem, you should apply for our fund! (up to £1m per project + support from the very talented AISI alignment & control teams)
📢 Introducing the Alignment Project: A new fund for research on urgent challenges in AI alignment and control, backed by over £15 million.
▶️ Up to £1 million per project
▶️ Compute access, venture capital investment, and expert support
Learn more and apply ⬇️
20
13
157
Problem: Train LLM on insecure code → it becomes broadly misaligned
Solution: Add safety data? What if you can't? Use interpretability!
We remove misaligned concepts during finetuning to steer OOD generalization
We reduce emergent misalignment 10x w/o modifying training data
9
27
162
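One way the "remove misaligned concepts during finetuning" idea could look in code, as a hedged sketch rather than the paper's method: project the residual stream onto the orthogonal complement of a precomputed concept direction via a forward hook, so the finetuning loss never sees that component. The model, layer, and direction below are stand-ins.

```python
# Sketch of ablating a "misaligned concept" direction from the residual stream
# during finetuning. The direction here is random as a placeholder; in practice
# it would be estimated, e.g. from a probe or a sparse autoencoder feature.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")    # placeholder model
LAYER = 6

concept = torch.randn(model.config.hidden_size)
concept = concept / concept.norm()                      # unit concept direction

def ablate_concept(module, inputs, output):
    hidden = output[0]                                  # (batch, seq, d_model)
    coeff = hidden @ concept                            # per-token projection coefficient
    hidden = hidden - coeff.unsqueeze(-1) * concept     # remove that component
    return (hidden,) + output[1:]

hook = model.transformer.h[LAYER].register_forward_hook(ablate_concept)

# Finetuning then proceeds as usual: the loss (and its gradients) never see the
# ablated component, which is what steers out-of-distribution generalisation.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
# ... standard training loop over the narrow finetuning dataset goes here ...
hook.remove()
```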
With @Butanium_ and @NeelNanda5 we've just published a post on model diffing that extends our previous paper. Rather than trying to reverse-engineer the full fine-tuned model, model diffing focuses on understanding what makes it different from its base model internally.
2
8
105
Can we actually control reasoning behaviors in thinking LLMs? Our @iclr_conf workshop paper is out! 🎉 We show how to steer DeepSeek-R1-Distill’s reasoning: make it backtrack, add knowledge, test examples. Just by adding steering vectors to its activations! Details in 🧵👇
5
27
169
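A minimal sketch of the steering-vector intervention the thread describes: add a fixed vector to the residual stream at one layer during generation. The contrastive prompts used to build a "backtracking" vector, plus the layer and scale, are illustrative guesses rather than the paper's values.

```python
# Sketch of steering a reasoning model by adding a vector to its activations.
# Layer, scale, and the contrastive prompts are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"     # small distill for illustration
LAYER, SCALE = 12, 4.0

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

@torch.no_grad()
def last_token_act(text: str) -> torch.Tensor:
    ids = tok(text, return_tensors="pt").input_ids
    return model(ids, output_hidden_states=True).hidden_states[LAYER][0, -1]

# Assumed contrastive construction: backtracking vs. non-backtracking snippets.
steer = last_token_act("Wait, let me reconsider that step.") - last_token_act("So the answer is clear.")

def add_steering(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * steer
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

hook = model.model.layers[LAYER].register_forward_hook(add_steering)
ids = tok("Solve 17 * 23. Think step by step.", return_tensors="pt").input_ids
print(tok.decode(model.generate(ids, max_new_tokens=64)[0]))
hook.remove()
```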
Great new paper by @RohDGupta shows that models can learn to evade probes when latent-space monitors are used during RL, and explores the mechanisms responsible as well as their limitations. @MATSprogram scholars rock 💪
🧵 Can language models learn to evade latent-space defences via RL? We test whether probes are robust to optimisation pressure from reinforcement learning. We show that RL can break probes, but only in certain cases. Read on to find out when and why this happens!
0
0
4
My @MATSprogram scholar Rohan just finished a cool paper on attacking latent-space probes with RL! Going in, I was unsure whether RL could explore into probe-bypassing policies, or change the activations enough. Turns out it can, but not always. Go check out the thread & paper!
🧵 Can language models learn to evade latent-space defences via RL? We test whether probes are robust to optimisation pressure from reinforcement learning. We show that RL can break probes, but only in certain cases. Read on to find out when and why this happens!
0
1
12
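To make the setup being stress-tested concrete, here is a hedged sketch of a latent-space monitor under RL pressure: a linear probe scores the policy's activations, and its score is folded into the reward, which is exactly the optimisation pressure that can teach the policy to evade the probe. The model, probe, and reward shaping are placeholder assumptions, not the paper's recipe.

```python
# Sketch of a latent-space monitor under RL optimisation pressure: a linear
# probe scores the policy's activations, and the (placeholder) reward combines
# task success with not triggering the probe.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL, LAYER = "gpt2", 8                                 # placeholders
tok = AutoTokenizer.from_pretrained(MODEL)
policy = AutoModelForCausalLM.from_pretrained(MODEL)

# Assume the probe was trained to detect the monitored behaviour; untrained here.
probe = torch.nn.Linear(policy.config.hidden_size, 1)

def probe_score(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        acts = policy(ids, output_hidden_states=True).hidden_states[LAYER][0]
    return torch.sigmoid(probe(acts)).mean().item()      # mean detection prob over tokens

def reward(completion: str, task_reward: float, penalty: float = 2.0) -> float:
    # RL on this reward is what puts optimisation pressure on the probe:
    # the policy is paid for the task *and* for keeping the monitor quiet.
    return task_reward - penalty * probe_score(completion)

print(reward("example completion", task_reward=1.0))
```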