Erik Jenner Profile
Erik Jenner

@jenner_erik

Followers
931
Following
572
Media
23
Statuses
186

Research scientist @ Google DeepMind working on AGI safety & alignment

Joined April 2016
Don't wanna be here? Send us removal request.
@rohinmshah
Rohin Shah
9 days
We built and validated an autorater that can provide a leading indicator of CoT illegibility https://t.co/0kWInlwqb7
@emmons_scott
Scott Emmons
10 days
CoT monitoring is one of our best shots at AI safety. But it's fragile and could be lost due to RL or architecture changes. Would we even notice if it starts slipping away? 🧵
0
6
34
@jenner_erik
Erik Jenner
17 days
I'm excited about CoT monitoring, but we don't have a ton of results characterizing when it works. Wen and Artur made interesting progress on this! E.g. in our setting, Claude models have been getting worse at obfuscating internal CoT, I'm curious whether this holds more broadly!
@imwendering
Wen X.
17 days
🔔New paper: Can reasoning models hide their reasoning? We stress-tested Chain-of-Thought (CoT) monitoring and found that while monitors detect ~96% of hidden malicious intent under normal conditions, ⚠️detection can collapse to ~10% under strong obfuscation pressure. 🧵
0
0
3
@sleight_henry
🚀Henry is launching the Astra Research Program!
1 month
🏁Applications for the Astra Fellowship close tomorrow (10/10)🏁 We’re excited to help build the next generation of AI Safety talent, across Tech & Tech Policy Research, Security, Macrostrategy, and Fieldbuilding! Apply Now!🔗⬇️
1
7
19
@davlindner
David Lindner
2 months
MATS is a great opportunity to start your career in AI safety! For MATS 9.0 I'll be running a research stream together with @emmons_scott @jenner_erik and @zimmerrol If you want to do research on AI oversight and control, apply now!
@ryan_kidd44
Ryan Kidd
2 months
MATS 9.0 applications are open! Launch your career in AI alignment, governance, and security with our 12-week research program. MATS provides field-leading research mentorship, funding, Berkeley & London offices, housing, and talks/workshops with AI experts.
0
2
20
@sleight_henry
🚀Henry is launching the Astra Research Program!
2 months
🚀 Applications now open: Constellation's Astra Fellowship 🚀 We're relaunching Astra — a 3-6 month fellowship to accelerate AI safety research & careers. Alumni @eli_lifland & Romeo Dean co-authored AI 2027 and co-founded @AI_Futures_ with their Astra mentor @DKokotajlo!
6
35
158
@METR_Evals
METR
3 months
Prior work has found that Chain of Thought (CoT) can be unfaithful. Should we then ignore what it says? In new research, we find that the CoT is informative about LLM cognition as long as the cognition is complex enough that it can’t be performed in a single forward pass.
6
36
307
@jenner_erik
Erik Jenner
4 months
The fact that current LLMs have reasonably legible chain of thought is really useful for safety (as well as for other reasons)! It would be great to keep it this way
@balesni
Mikita Balesni 🇺🇦
4 months
A simple AGI safety technique: AI’s thoughts are in plain English, just read them We know it works, with OK (not perfect) transparency! The risk is fragility: RL training, new architectures, etc threaten transparency Experts from many orgs agree we should try to preserve it:
1
0
8
@jenner_erik
Erik Jenner
4 months
We stress tested Chain-of-Thought monitors to see how promising a defense they are against risks like scheming! I think the results are promising for CoT monitoring, and I'm very excited about this direction. But we should keep stress-testing defenses as models get more capable
@emmons_scott
Scott Emmons
4 months
Is CoT monitoring a lost cause due to unfaithfulness? 🤔 We say no. The key is the complexity of the bad behavior. When we replicate prior unfaithfulness work but increase complexity—unfaithfulness vanishes! Our finding: "When Chain of Thought is Necessary, Language Models
0
0
13
@vkrakovna
Victoria Krakovna
4 months
As models advance, a key AI safety concern is deceptive alignment / "scheming" – where AI might covertly pursue unintended goals. Our paper "Evaluating Frontier Models for Stealth and Situational Awareness" assesses whether current models can scheme. https://t.co/PrfZcIVuEw
17
45
215
@davlindner
David Lindner
4 months
Can frontier models hide secret information and reasoning in their outputs? We find early signs of steganographic capabilities in current frontier models, including Claude, GPT, and Gemini. 🧵
11
18
103
@jenner_erik
Erik Jenner
5 months
My @MATSprogram scholar Rohan just finished a cool paper on attacking latent-space probes with RL! Going in, I was unsure whether RL could explore into probe bypassing policies, or change the activations enough. Turns out it can, but not always. Go check out the thread & paper!
@RohDGupta
Rohan Gupta
5 months
🧵 Can language models learn to evade latent-space defences via RL? We test whether probes are robust to optimisation pressure from reinforcement learning. We show that RL can break probes, but only in certain cases. Read on to find out when and why this happens!
0
1
12
@sarah_cogan
Sarah Cogan
5 months
Gemini 2⃣.5⃣ technical report is out!! 🙌🥂😇
1
1
21
@dfrsrchtwts
Daniel Filan
5 months
New episode with @davlindner, covering his work on MONA! Check it out - video link in reply.
@AXRPodcast
AXRP - the AI X-risk Research Podcast
5 months
Episode 43 - David Lindner on Myopic Optimization with Non-myopic Approval https://t.co/g1YP2B0vyO
1
4
30
@bshlgrs
Buck Shlegeris
7 months
We’ve just released the biggest and most intricate study of AI control to date, in a command line agent setting. IMO the techniques studied are the best available option for preventing misaligned early AGIs from causing sudden disasters, e.g. hacking servers they’re working on.
8
23
247
@RyanPGreenblatt
Ryan Greenblatt
7 months
IMO, this isn't much of an update against CoT monitoring hopes. They show unfaithfulness when the reasoning is minimal enough that it doesn't need CoT. But, my hopes for CoT monitoring are because models will have to reason a lot to end up misaligned and cause huge problems. 🧵
@AnthropicAI
Anthropic
7 months
New Anthropic research: Do reasoning models accurately verbalize their reasoning? Our new paper shows they don't. This casts doubt on whether monitoring chains-of-thought (CoT) will be enough to reliably catch safety issues.
5
17
159
@jenner_erik
Erik Jenner
7 months
UK AISI is hiring, consider applying if you're interested in adversarial ML/red-teaming. Seems like a great team, and I think it's one of the best places in the world for doing adversarial ML work that's highly impactful
@alxndrdavies
Xander Davies
8 months
My team is hiring @AISecurityInst! I think this is one of the most important times in history to have strong technical expertise in government. Join our team understanding and fixing weaknesses in frontier models through sota adversarial ML research & testing. 🧵 1/4
1
4
48
@jenner_erik
Erik Jenner
7 months
My colleagues @emmons_scott @davlindner and I will be mentoring a research stream on AI control/monitoring and other topics. And there are lots of other great streams from interpretability to governance. Apply by Apr 18!
0
2
12
@jenner_erik
Erik Jenner
7 months
I think for people who have technical experience and want to start working on AI safety, MATS is often one of the best things they could do. You meet a cohort of many other fantastic MATS scholars and get a lot of support for your research.
1
0
4
@jenner_erik
Erik Jenner
7 months
Fair warning: if you're a 300 IQ Being thinking 10,000x faster than humans, you might be slightly overqualified for MATS. But if you want to work on making a future where artificial 300 IQ beings exist go well, then consider applying for the upcoming cohort!🧵
@dfrsrchtwts
Daniel Filan
8 months
Nice analysis of what a 300 IQ being that thought 10,000x faster than humans could and couldn't do. The only clear conclusion is that it would apply to the MATS Summer 2025 cohort, to be mentored by the likes of @EthanJPerez, @dawnsongtweets, @jenner_erik, Mary Phuong, and more!
1
2
25
@davlindner
David Lindner
8 months
Consider applying for MATS if you're interested to work on an AI alignment research project this summer! I'm a mentor as are many of my colleagues at DeepMind
@ryan_kidd44
Ryan Kidd
8 months
@MATSprogram Summer 2025 applications close Apr 18! Come help advance the fields of AI alignment, security, and governance with mentors including @NeelNanda5 @EthanJPerez @OwainEvans_UK @EvanHub @bshlgrs @dawnsongtweets @DavidSKrueger @RichardMCNgo and more!
0
1
22