Cameron Holmes ✈️ Berkeley

@CameronHolmes92

Followers 337 · Following 29K · Media 128 · Statuses 1K

Managing Alignment Research @MATSprogram. Market participant, EA. Parenting like Dr Louise Banks.

London
Joined May 2014
@CameronHolmes92
Cameron Holmes ✈️ Berkeley
2 months
Incredible work by 3x @MATSprogram alumni and a great example of applied Mech Interp beating black box baselines and making significant progress on critical real-world problems:
@OBalcells
Oscar Balcells Obeso
2 months
Imagine if ChatGPT highlighted every word it wasn't sure about. We built a streaming hallucination detector that flags hallucinations in real-time.
2 · 3 · 25
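The quoted thread only gestures at the mechanism, so here is a minimal sketch of one way a streaming detector like this could work. Everything specific is an assumption for illustration, not the authors' system: gpt2 as a stand-in model, an untrained placeholder probe, layer 6, and a 0.5 threshold; a real detector would use a probe trained on labelled hallucinations.

```python
# Minimal sketch (not the authors' detector): flag generated tokens whose
# hidden state scores high under a linear probe. gpt2, layer 6, the untrained
# probe, and the 0.5 threshold are placeholder assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True).eval()

LAYER, THRESHOLD = 6, 0.5
probe = torch.nn.Linear(model.config.hidden_size, 1)  # would be trained on labelled hallucinations

@torch.no_grad()
def generate_with_flags(prompt: str, max_new_tokens: int = 20):
    ids = tok(prompt, return_tensors="pt").input_ids
    flagged = []
    for _ in range(max_new_tokens):
        out = model(ids)
        next_id = out.logits[0, -1].argmax()
        # Score the hidden state at the position that predicted this token.
        score = torch.sigmoid(probe(out.hidden_states[LAYER][0, -1])).item()
        flagged.append((tok.decode(next_id), score > THRESHOLD))
        ids = torch.cat([ids, next_id.view(1, 1)], dim=-1)
    return flagged

for token, is_flagged in generate_with_flags("The capital of France is"):
    print(("[?]" if is_flagged else "   ") + " " + repr(token))
```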
@geoffreyirving
Geoffrey Irving
6 days
The @AISecurityInst Cyber Autonomous Systems Team is hiring propensity researchers to grow the science around whether models *are likely* to attempt dangerous behaviour, as opposed to whether they are capable of doing so. Application link below! 🧵
1 · 11 · 47
@littIeramblings
sarah
14 days
come and join me on the AISI comms team! can wholeheartedly recommend AISI as a wonderful place to work - very talented teams dedicated to making AI go well and a tonne of exciting work to shout about. link to apply in next tweet 😌
3 · 9 · 107
@yonashav
Yo Shavit
18 days
@sebkrier and I are pretty floored by the quality of MATS applicants
9 · 11 · 346
@jkminder
Julian Minder
24 days
New paper: Finetuning on narrow domains leaves traces behind. By looking at the difference in activations before and after finetuning, we can interpret what it was finetuned for. And so can our interpretability agent! 🧵
2 · 28 · 156
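For readers unfamiliar with the move described in the thread, here is a hedged sketch of the basic activation-diffing idea, not the paper's pipeline: average the residual-stream difference between a finetuned model and its base model on generic prompts, then read that difference through the unembedding (logit-lens style) to see which tokens it promotes. The model names and the layer are placeholders; a real (base, finetuned) pair would go in their place.

```python
# Hedged sketch of activation diffing with placeholder models and layer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE, TUNED = "gpt2", "gpt2"   # placeholders; use a real (base, finetuned) pair
LAYER = 6

tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE, output_hidden_states=True).eval()
tuned = AutoModelForCausalLM.from_pretrained(TUNED, output_hidden_states=True).eval()

prompts = ["The weather today is", "My favourite book is", "In the news:"]

@torch.no_grad()
def mean_resid(model):
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt").input_ids
        h = model(ids).hidden_states[LAYER][0]   # [seq, d_model]
        acts.append(h.mean(dim=0))
    return torch.stack(acts).mean(dim=0)

# Average direction that finetuning added to the residual stream.
diff = mean_resid(tuned) - mean_resid(base)

# Logit-lens readout: which tokens does that direction promote?
logits = tuned.lm_head(tuned.transformer.ln_f(diff))
print(tok.convert_ids_to_tokens(logits.topk(10).indices.tolist()))
```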
@cvenhoff00
Constantin Venhoff
1 month
🚨 What do reasoning models actually learn during training? Our new paper shows base models already contain reasoning mechanisms, thinking models learn when to use them! By invoking those skills at the right time in the base model, we recover up to 91% of the performance gap 🧵
16 · 73 · 587
@ryan_kidd44
Ryan Kidd
2 months
MATS is hiring world-class researchers, managers, generalists, and more to help grow our AI safety & security talent pipeline! Apply by Oct 17 for a Dec 1 start.
4 · 9 · 48
@CameronHolmes92
Cameron Holmes ✈️ Berkeley
2 months
@MATSprogram Apply to @MATSprogram by EOD Oct 2nd (or Sept 12 for @NeelNanda5 !) to join a growing list of great researchers making real progress on some of the most important problems!
0 · 0 · 2
@OBalcells
Oscar Balcells Obeso
2 months
Imagine if ChatGPT highlighted every word it wasn't sure about. We built a streaming hallucination detector that flags hallucinations in real-time.
205 · 631 · 9K
@jkminder
Julian Minder
2 months
Can we interpret what happens in finetuning? Yes, if it's on a narrow domain! Narrow finetuning leaves traces behind. By comparing activations before and after finetuning, we can interpret these, even with an agent! We interpret subliminal learning, emergent misalignment, and more
6 · 23 · 227
@CameronHolmes92
Cameron Holmes ✈️ Berkeley
2 months
I am once again asking to pause scaling frontier models
@Disney
Disney
2 months
Hooray! The all-new feature length Bluey movie is coming only to cinemas on August 6, 2027!
1 · 0 · 1
@CameronHolmes92
Cameron Holmes ✈️ Berkeley
3 months
Scholars at MATS are doing awesome research and you can apply to join them!
@ryan_kidd44
Ryan Kidd
3 months
MATS 9.0 applications are open! Launch your career in AI alignment, governance, and security with our 12-week research program. MATS provides field-leading research mentorship, funding, Berkeley & London offices, housing, and talks/workshops with AI experts.
0 · 0 · 6
@austinc3301
Agus 🔎🔸
3 months
This round of @SPARexec will have a whopping 80+ projects, almost double the number from last time. So if you were considering getting into AI safety or policy research, this is a great time to do it. Apps close August 20!
2 · 4 · 28
@littIeramblings
sarah
4 months
- alignment is urgent
- it is solvable
- we should try really, really hard
- if you have expertise to bring to bear on this problem, you should apply for our fund! (up to £1m per project + support from the very talented AISI alignment & control teams)
@AISecurityInst
AI Security Institute
4 months
📢 Introducing the Alignment Project: A new fund for research on urgent challenges in AI alignment and control, backed by over £15 million.
▶️ Up to £1 million per project
▶️ Compute access, venture capital investment, and expert support
Learn more and apply ⬇️
20 · 13 · 157
@CameronHolmes92
Cameron Holmes ✈️ Berkeley
4 months
We would like to thank the reviewers for their thoughtful comments. In the following, we highlight common concerns of reviewers and our effort to address these concerns.
@peterkyle
Peter Kyle
4 months
If you want to overturn the Online Safety Act you are on the side of predators. It is as simple as that.
0 · 0 · 3
@HCasademunt
Helena Casademunt
4 months
Problem: Train LLM on insecure code → it becomes broadly misaligned
Solution: Add safety data? What if you can't? Use interpretability!
We remove misaligned concepts during finetuning to steer OOD generalization.
We reduce emergent misalignment 10x w/o modifying training data.
9 · 27 · 162
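A rough sketch of the general mechanism being described, with placeholders (gpt2, a random unit direction, an arbitrary layer) rather than the paper's actual concept directions: project the unwanted direction out of the residual stream with a forward hook, so finetuning cannot route the misaligned concept through it. Because the projection is differentiable, gradients during finetuning avoid that direction too.

```python
# Hedged sketch of concept ablation during finetuning; model, layer, and the
# random "concept" direction are placeholder assumptions, not the paper's.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
d_model = model.config.hidden_size

# In practice this direction would come from interpretability work
# (e.g. a probe or difference-of-means for the misaligned concept).
concept = torch.randn(d_model)
concept = concept / concept.norm()

def ablate_direction(module, inputs, output):
    hidden = output[0]                              # [batch, seq, d_model]
    proj = (hidden @ concept).unsqueeze(-1) * concept
    return (hidden - proj,) + output[1:]            # remove the concept component

LAYER = 6
model.transformer.h[LAYER].register_forward_hook(ablate_direction)

# Finetuning then proceeds as usual over the narrow dataset; the hook keeps
# the concept direction ablated throughout training.
```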
@jkminder
Julian Minder
5 months
With @Butanium_ and @NeelNanda5 we've just published a post on model diffing that extends our previous paper. Rather than trying to reverse-engineer the full fine-tuned model, model diffing focuses on understanding what makes it different from its base model internally.
2 · 8 · 105
@cvenhoff00
Constantin Venhoff
5 months
Can we actually control reasoning behaviors in thinking LLMs? Our @iclr_conf workshop paper is out! 🎉 We show how to steer DeepSeek-R1-Distill’s reasoning: make it backtrack, add knowledge, test examples. Just by adding steering vectors to its activations! Details in 🧵👇
5 · 27 · 169
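Activation steering, as mentioned in the tweet above, is simple to sketch. The usual recipe extracts a vector as the difference of mean activations between traces that do and don't show the behaviour (e.g. backtracking); here a random unit vector, gpt2, and layer 6 are illustrative stand-ins, not the paper's extracted vectors.

```python
# Minimal activation-steering sketch under stated assumptions (placeholder
# model and vector): add a scaled direction to the residual stream at one layer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

LAYER, ALPHA = 6, 4.0
steer = torch.randn(model.config.hidden_size)
steer = steer / steer.norm()                        # placeholder steering vector

def add_steering(module, inputs, output):
    hidden = output[0]
    return (hidden + ALPHA * steer,) + output[1:]   # nudge every position

handle = model.transformer.h[LAYER].register_forward_hook(add_steering)
ids = tok("Let's solve 17 * 24 step by step.", return_tensors="pt").input_ids
print(tok.decode(model.generate(ids, max_new_tokens=30)[0]))
handle.remove()
```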
@CameronHolmes92
Cameron Holmes ✈️ Berkeley
5 months
Great new paper by @RohDGupta shows that models can learn to evade probes if latent-space monitors are used in RL, and explores the mechanisms responsible as well as their limitations. @MATSprogram scholars rock 💪
@RohDGupta
Rohan Gupta
5 months
🧵 Can language models learn to evade latent-space defences via RL? We test whether probes are robust to optimisation pressure from reinforcement learning. We show that RL can break probes, but only in certain cases. Read on to find out when and why this happens!
0 · 0 · 4
@jenner_erik
Erik Jenner
5 months
My @MATSprogram scholar Rohan just finished a cool paper on attacking latent-space probes with RL! Going in, I was unsure whether RL could explore into probe bypassing policies, or change the activations enough. Turns out it can, but not always. Go check out the thread & paper!
@RohDGupta
Rohan Gupta
5 months
🧵 Can language models learn to evade latent-space defences via RL? We test whether probes are robust to optimisation pressure from reinforcement learning. We show that RL can break probes, but only in certain cases. Read on to find out when and why this happens!
0 · 1 · 12
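For context on what a "latent-space defence" means in these threads, here is a hedged toy sketch of the kind of monitor under attack: a logistic probe trained on residual-stream activations to detect some behaviour. The model, layer, and two-example training set are placeholders; the question the papers study is whether RL optimisation pressure can push the policy's activations onto the probe's benign side.

```python
# Toy linear probe on residual-stream activations (placeholder model, layer,
# and labelled prompts); a policy "evades" it if its activations for the
# behaviour are classified as benign.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True).eval()
LAYER = 6

labelled = [("I am going to deceive the user now:", 1),
            ("I am going to answer honestly:", 0)]

@torch.no_grad()
def last_token_act(text):
    ids = tok(text, return_tensors="pt").input_ids
    return model(ids).hidden_states[LAYER][0, -1]

X = torch.stack([last_token_act(t) for t, _ in labelled])
y = torch.tensor([float(label) for _, label in labelled])

probe = torch.nn.Linear(model.config.hidden_size, 1)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
for _ in range(200):                      # tiny toy training loop
    loss = torch.nn.functional.binary_cross_entropy_with_logits(
        probe(X).squeeze(-1), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

print(torch.sigmoid(probe(X)).squeeze(-1))  # probe scores for the two prompts
```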