David Elson
@davidelson
Followers
73
Following
9
Media
0
Statuses
12
AGI Alignment/Safety @ Google DeepMind. Opinions my own
Joined September 2008
New paper, following up on our chain-of-thought faithfulness work from a few months ago, about how we can make sure that LLM thoughts are staying faithful and monitorable.
CoT monitoring is one of our best shots at AI safety. But it's fragile and could be lost due to RL or architecture changes. Would we even notice if it starts slipping away? đź§µ
1
0
4
New paper showing that when LLMs chew over tough problems, they tend to think clearly and transparently -- making them easier to monitor for bad behavior ⬇️
Is CoT monitoring a lost cause due to unfaithfulness? 🤔 We say no. The key is the complexity of the bad behavior. When we replicate prior unfaithfulness work but increase complexity—unfaithfulness vanishes! Our finding: "When Chain of Thought is Necessary, Language Models
0
0
3
We're hiring for our Google DeepMind AGI Safety & Alignment and Gemini Safety teams. Locations: London, NYC, Mountain View, SF. Join us to help build safe AGI. Research Engineer https://t.co/KUJwTIRFhm… Research Scientist https://t.co/MiKcPdT8n4
job-boards.greenhouse.io
5
36
285
Some promising results on keeping AIs from scheming against you - or at least removing the incentive for them to do this.
New Google DeepMind safety paper! LLM agents are coming – how do we stop them finding complex plans to hack the reward? Our method, MONA, prevents many such hacks, *even if* humans are unable to detect them! Inspired by myopic optimization but better performance – details in🧵
0
0
2