David Elson Profile
David Elson

@davidelson

Followers: 73 · Following: 9 · Media: 0 · Statuses: 12

AGI Alignment/Safety @ Google DeepMind. Opinions my own

Joined September 2008
@davidelson
David Elson
15 days
New paper, following up on our chain-of-thought faithfulness work from a few months ago, on how we can make sure that LLM thoughts remain faithful and monitorable.
@emmons_scott
Scott Emmons
16 days
CoT monitoring is one of our best shots at AI safety. But it's fragile and could be lost due to RL or architecture changes. Would we even notice if it starts slipping away? 🧵
@davidelson
David Elson
4 months
New paper showing that when LLMs chew over tough problems, they tend to think clearly and transparently -- making them easier to monitor for bad behavior ⬇️
@emmons_scott
Scott Emmons
4 months
Is CoT monitoring a lost cause due to unfaithfulness? 🤔 We say no. The key is the complexity of the bad behavior. When we replicate prior unfaithfulness work but increase complexity—unfaithfulness vanishes! Our finding: "When Chain of Thought is Necessary, Language Models…
@ZacKenton1
Zac Kenton
9 months
We're hiring for our Google DeepMind AGI Safety & Alignment and Gemini Safety teams. Locations: London, NYC, Mountain View, SF. Join us to help build safe AGI. Research Engineer https://t.co/KUJwTIRFhm… Research Scientist https://t.co/MiKcPdT8n4
job-boards.greenhouse.io
@davidelson
David Elson
10 months
Some promising results on keeping AIs from scheming against you - or at least removing the incentive for them to do this.
@davlindner
David Lindner
10 months
New Google DeepMind safety paper! LLM agents are coming – how do we stop them finding complex plans to hack the reward? Our method, MONA, prevents many such hacks, *even if* humans are unable to detect them! Inspired by myopic optimization but better performance – details in 🧵