Erik Jenner @jenner_erik X Profile

Erik Jenner

@jenner_erik

Followers

931

Following

572

Media

23

Statuses

186

Research scientist @ Google DeepMind working on AGI safety & alignment

https://t.co/oCJIr7V1kZ

Joined April 2016

Don't wanna be here? Send us removal request.

Rohin Shah

@rohinmshah

9 days

We built and validated an autorater that can provide a leading indicator of CoT illegibility https://t.co/0kWInlwqb7

Scott Emmons

@emmons_scott

10 days

CoT monitoring is one of our best shots at AI safety. But it's fragile and could be lost due to RL or architecture changes. Would we even notice if it starts slipping away? 🧵

0

6

34

Erik Jenner

@jenner_erik

17 days

I'm excited about CoT monitoring, but we don't have a ton of results characterizing when it works. Wen and Artur made interesting progress on this! E.g. in our setting, Claude models have been getting worse at obfuscating internal CoT, I'm curious whether this holds more broadly!

Wen X.

@imwendering

17 days

🔔New paper: Can reasoning models hide their reasoning? We stress-tested Chain-of-Thought (CoT) monitoring and found that while monitors detect ~96% of hidden malicious intent under normal conditions, ⚠️detection can collapse to ~10% under strong obfuscation pressure. 🧵

0

3

🚀Henry is launching the Astra Research Program!

@sleight_henry

1 month

🏁Applications for the Astra Fellowship close tomorrow (10/10)🏁 We’re excited to help build the next generation of AI Safety talent, across Tech & Tech Policy Research, Security, Macrostrategy, and Fieldbuilding! Apply Now!🔗⬇️

1

7

19

David Lindner

@davlindner

2 months

MATS is a great opportunity to start your career in AI safety! For MATS 9.0 I'll be running a research stream together with @emmons_scott @jenner_erik and @zimmerrol If you want to do research on AI oversight and control, apply now!

Ryan Kidd

@ryan_kidd44

2 months

MATS 9.0 applications are open! Launch your career in AI alignment, governance, and security with our 12-week research program. MATS provides field-leading research mentorship, funding, Berkeley & London offices, housing, and talks/workshops with AI experts.

0

2

20

🚀Henry is launching the Astra Research Program!

@sleight_henry

2 months

🚀 Applications now open: Constellation's Astra Fellowship 🚀 We're relaunching Astra — a 3-6 month fellowship to accelerate AI safety research & careers. Alumni @eli_lifland & Romeo Dean co-authored AI 2027 and co-founded @AI_Futures_ with their Astra mentor @DKokotajlo!

6

35

158

METR

@METR_Evals

3 months

Prior work has found that Chain of Thought (CoT) can be unfaithful. Should we then ignore what it says? In new research, we find that the CoT is informative about LLM cognition as long as the cognition is complex enough that it can’t be performed in a single forward pass.

6

36

307

Erik Jenner

@jenner_erik

4 months

The fact that current LLMs have reasonably legible chain of thought is really useful for safety (as well as for other reasons)! It would be great to keep it this way

Mikita Balesni 🇺🇦

@balesni

4 months

A simple AGI safety technique: AI’s thoughts are in plain English, just read them We know it works, with OK (not perfect) transparency! The risk is fragility: RL training, new architectures, etc threaten transparency Experts from many orgs agree we should try to preserve it:

1

0

8

Erik Jenner

@jenner_erik

4 months

We stress tested Chain-of-Thought monitors to see how promising a defense they are against risks like scheming! I think the results are promising for CoT monitoring, and I'm very excited about this direction. But we should keep stress-testing defenses as models get more capable

Scott Emmons

@emmons_scott

4 months

Is CoT monitoring a lost cause due to unfaithfulness? 🤔 We say no. The key is the complexity of the bad behavior. When we replicate prior unfaithfulness work but increase complexity—unfaithfulness vanishes! Our finding: "When Chain of Thought is Necessary, Language Models

0

13

Victoria Krakovna

@vkrakovna

4 months

As models advance, a key AI safety concern is deceptive alignment / "scheming" – where AI might covertly pursue unintended goals. Our paper "Evaluating Frontier Models for Stealth and Situational Awareness" assesses whether current models can scheme. https://t.co/PrfZcIVuEw

17

45

215

David Lindner

@davlindner

4 months

Can frontier models hide secret information and reasoning in their outputs? We find early signs of steganographic capabilities in current frontier models, including Claude, GPT, and Gemini. 🧵

11

18

103

Erik Jenner

@jenner_erik

5 months

My @MATSprogram scholar Rohan just finished a cool paper on attacking latent-space probes with RL! Going in, I was unsure whether RL could explore into probe bypassing policies, or change the activations enough. Turns out it can, but not always. Go check out the thread & paper!

Rohan Gupta

@RohDGupta

5 months

🧵 Can language models learn to evade latent-space defences via RL? We test whether probes are robust to optimisation pressure from reinforcement learning. We show that RL can break probes, but only in certain cases. Read on to find out when and why this happens!

0

1

12

Sarah Cogan

@sarah_cogan

5 months

Gemini 2⃣.5⃣ technical report is out!! 🙌🥂😇

1

21

Daniel Filan

@dfrsrchtwts

5 months

New episode with @davlindner, covering his work on MONA! Check it out - video link in reply.

AXRP - the AI X-risk Research Podcast

@AXRPodcast

5 months

Episode 43 - David Lindner on Myopic Optimization with Non-myopic Approval https://t.co/g1YP2B0vyO

1

4

30

Buck Shlegeris

@bshlgrs

7 months

We’ve just released the biggest and most intricate study of AI control to date, in a command line agent setting. IMO the techniques studied are the best available option for preventing misaligned early AGIs from causing sudden disasters, e.g. hacking servers they’re working on.

8

23

247

Ryan Greenblatt

@RyanPGreenblatt

7 months

IMO, this isn't much of an update against CoT monitoring hopes. They show unfaithfulness when the reasoning is minimal enough that it doesn't need CoT. But, my hopes for CoT monitoring are because models will have to reason a lot to end up misaligned and cause huge problems. 🧵

Anthropic

@AnthropicAI

7 months

New Anthropic research: Do reasoning models accurately verbalize their reasoning? Our new paper shows they don't. This casts doubt on whether monitoring chains-of-thought (CoT) will be enough to reliably catch safety issues.

5

17

159

Erik Jenner

@jenner_erik

7 months

UK AISI is hiring, consider applying if you're interested in adversarial ML/red-teaming. Seems like a great team, and I think it's one of the best places in the world for doing adversarial ML work that's highly impactful

Xander Davies

@alxndrdavies

8 months

My team is hiring @AISecurityInst! I think this is one of the most important times in history to have strong technical expertise in government. Join our team understanding and fixing weaknesses in frontier models through sota adversarial ML research & testing. 🧵 1/4

1

4

48

Erik Jenner

@jenner_erik

7 months

My colleagues @emmons_scott @davlindner and I will be mentoring a research stream on AI control/monitoring and other topics. And there are lots of other great streams from interpretability to governance. Apply by Apr 18!

0

2

12

Erik Jenner

@jenner_erik

7 months

I think for people who have technical experience and want to start working on AI safety, MATS is often one of the best things they could do. You meet a cohort of many other fantastic MATS scholars and get a lot of support for your research.

1

0

4

Erik Jenner

@jenner_erik

7 months

Fair warning: if you're a 300 IQ Being thinking 10,000x faster than humans, you might be slightly overqualified for MATS. But if you want to work on making a future where artificial 300 IQ beings exist go well, then consider applying for the upcoming cohort!🧵

Daniel Filan

@dfrsrchtwts

8 months

Nice analysis of what a 300 IQ being that thought 10,000x faster than humans could and couldn't do. The only clear conclusion is that it would apply to the MATS Summer 2025 cohort, to be mentored by the likes of @EthanJPerez, @dawnsongtweets, @jenner_erik, Mary Phuong, and more!

1

2

25

David Lindner

@davlindner

8 months

Consider applying for MATS if you're interested to work on an AI alignment research project this summer! I'm a mentor as are many of my colleagues at DeepMind

Ryan Kidd

@ryan_kidd44

8 months

@MATSprogram Summer 2025 applications close Apr 18! Come help advance the fields of AI alignment, security, and governance with mentors including @NeelNanda5 @EthanJPerez @OwainEvans_UK @EvanHub @bshlgrs @dawnsongtweets @DavidSKrueger @RichardMCNgo and more!

0

1

22