David Elson @davidelson X Profile

David Elson

@davidelson

Followers

73

Following

9

Media

0

Statuses

12

AGI Alignment/Safety @ Google DeepMind. Opinions my own

Joined September 2008

Don't wanna be here? Send us removal request.

David Elson

@davidelson

15 days

New paper, following up on our chain-of-thought faithfulness work from a few months ago, about how we can make sure that LLM thoughts are staying faithful and monitorable.

Scott Emmons

@emmons_scott

16 days

CoT monitoring is one of our best shots at AI safety. But it's fragile and could be lost due to RL or architecture changes. Would we even notice if it starts slipping away? 🧵

1

0

4

David Elson

@davidelson

4 months

New paper showing that when LLMs chew over tough problems, they tend to think clearly and transparently -- making them easier to monitor for bad behavior ⬇️

Scott Emmons

@emmons_scott

4 months

Is CoT monitoring a lost cause due to unfaithfulness? 🤔 We say no. The key is the complexity of the bad behavior. When we replicate prior unfaithfulness work but increase complexity—unfaithfulness vanishes! Our finding: "When Chain of Thought is Necessary, Language Models

0

3

Zac Kenton

@ZacKenton1

9 months

We're hiring for our Google DeepMind AGI Safety & Alignment and Gemini Safety teams. Locations: London, NYC, Mountain View, SF. Join us to help build safe AGI. Research Engineer https://t.co/KUJwTIRFhm… Research Scientist https://t.co/MiKcPdT8n4

job-boards.greenhouse.io

5

36

285

David Elson

@davidelson

10 months

Some promising results on keeping AIs from scheming against you - or at least removing the incentive for them to do this.

David Lindner

@davlindner

10 months

New Google DeepMind safety paper! LLM agents are coming – how do we stop them finding complex plans to hack the reward? Our method, MONA, prevents many such hacks, *even if* humans are unable to detect them! Inspired by myopic optimization but better performance – details in🧵

0

2