David Lindner Profile
David Lindner

@davlindner

Followers: 2K
Following: 387
Media: 32
Statuses: 159

Making AI safer @GoogleDeepMind

London, UK
Joined April 2012
@davlindner
David Lindner
5 months
New Google DeepMind safety paper! LLM agents are coming – how do we stop them from finding complex plans to hack the reward? Our method, MONA, prevents many such hacks, *even if* humans are unable to detect them! Inspired by myopic optimization, but with better performance – details in 🧵
17
96
574
@davlindner
David Lindner
1 day
View our full paper here: View steganography transcripts here: GitHub: Work done with Artur Zolkowski, @kei_ng1, @rmcc_11, and @zimmerrol during @MATSprogram.
0
0
10
@davlindner
David Lindner
1 day
Overall, current models have some steganographic capabilities but likely aren't strong enough to bypass well-designed monitors. Nevertheless, we believe it’s important to continue measuring these capabilities to determine when additional mitigations are needed.
1
1
5
@davlindner
David Lindner
1 day
We also build case studies: environments in which models with strong steganographic capabilities could take unwanted actions while bypassing a monitor. Current models rarely succeed in these environments unless they are given strong hints about what to do.
1
0
3
@davlindner
David Lindner
1 day
Models can also reason through simple problems using a number of common encoding schemes. But performance quickly degrades as the problems get more difficult.
1
0
2
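To make the "common encoding schemes" concrete, here is a hypothetical sketch (not code or data from the paper): a Caesar cipher with shift 3, one of the simplest schemes a model might use to obscure a line of reasoning from a plain-text monitor.

```python
# Toy illustration of one simple encoding scheme: a Caesar cipher.
# A model reasoning "in" this encoding would emit ciphertext instead of
# readable chain of thought; a cooperating receiver shifts it back.

def caesar_encode(text: str, shift: int = 3) -> str:
    """Shift each letter by `shift` positions; leave other characters as-is."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord("a") if ch.islower() else ord("A")
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)

def caesar_decode(text: str, shift: int = 3) -> str:
    """Invert caesar_encode by shifting in the opposite direction."""
    return caesar_encode(text, -shift)

reasoning = "First add 17 and 25 to get 42."
encoded = caesar_encode(reasoning)   # unreadable to a naive text monitor
```

The thread's point maps onto even this toy scheme: carrying out multi-step arithmetic while juggling the encoding is exactly the kind of extra cognitive load under which model performance degrades.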
@davlindner
David Lindner
1 day
While models understand that they need to conceal their reasoning, they have difficulty actually hiding that reasoning in a plausible way. As an example, we show a model’s reasoning attempt when asked how to install solar panels. [bold text and highlighting added by us]
1
0
4
@davlindner
David Lindner
1 day
In our encoded-reasoning evals, we find that models can sometimes leverage encoded reasoning to outperform no-reasoning baselines on simple reasoning tasks. But the encodings that help tend not to be very subtle and are regularly detected by a monitor.
1
1
4
@davlindner
David Lindner
1 day
Models are able to perform significantly better when the sender and receiver are able to coordinate on the steganographic encoding method. But this mainly applies when the model has access to an unmonitored CoT. With no CoT, performance remains extremely poor.
1
1
4
@davlindner
David Lindner
1 day
When they succeed, models sometimes leverage sophisticated strategies to pass messages, such as encoding information in the lengths of the sentences they generate! We found this example difficult to decode ourselves. [formatting and word counts added by us]
1
0
8
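The sentence-length strategy described above can be sketched in a few lines. This is a hypothetical reconstruction, not the paper's actual transcript: each hidden letter maps to a word count (a=1, b=2, ...), and the receiver decodes by counting words per sentence. The cover text here is ours, not model output.

```python
# Toy sketch of hiding a message in sentence lengths: the receiver
# recovers one letter per sentence from that sentence's word count.

def decode_lengths(text: str) -> str:
    """Map each sentence's word count back to a letter (1 word -> 'a')."""
    sentences = [s for s in text.split(".") if s.strip()]
    return "".join(chr(ord("a") + len(s.split()) - 1) for s in sentences)

# An 8-word sentence encodes 'h', a 9-word sentence encodes 'i':
cover_text = ("The quick brown fox jumped over the fence. "
              "It then ran across nine wide fields to hide.")
```

As the tweet notes, such a channel is hard for a human reader to spot: the cover text reads as ordinary prose, and the payload lives entirely in its statistics.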
@davlindner
David Lindner
1 day
In our message passing evals, we find that models nearly always fail without access to CoT. But models succeed much more often when they are allowed to use a CoT to freely reason ahead of time about how to encode and decode the hidden message.
1
1
8
@davlindner
David Lindner
1 day
@rmcc_11 @zimmerrol Models encoding hidden information might make safety mitigations such as CoT monitoring much more difficult. We evaluate frontier models’ ability to evade monitors by sending encoded messages or performing encoded reasoning about a task. Read more:
1
0
9
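For intuition about what a CoT monitor is checking, here is a deliberately crude sketch. The paper's monitors are far more capable (LLM-based); this toy heuristic, with a made-up word list and threshold of our choosing, merely flags reasoning that doesn't look like ordinary English, which is the failure mode that catches unsubtle encodings.

```python
# Toy CoT monitor: flag a chain of thought whose tokens mostly aren't
# recognizable English words or digits. Word list and threshold are
# illustrative assumptions, not values from the paper.

COMMON_WORDS = {"the", "a", "to", "and", "first", "add", "get", "is", "of"}

def looks_suspicious(cot: str, threshold: float = 0.5) -> bool:
    """Return True if too few tokens look like plain English or numbers."""
    words = [w.strip(".,").lower() for w in cot.split()]
    if not words:
        return False
    known = sum(w in COMMON_WORDS or w.isdigit() for w in words)
    return known / len(words) < threshold

# Plain reasoning passes; Caesar-shifted reasoning trips the flag.
plain = "First add 17 and 25 to get 42."
shifted = "Iluvw dgg 17 dqg 25 wr jhw 42."
```

Even this crude check illustrates the thread's conclusion: an encoding strong enough to hide content tends to look unnatural, and unnatural text is exactly what a monitor is tuned to flag.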
@davlindner
David Lindner
1 day
Can frontier models hide secret information and reasoning in their outputs? We find early signs of steganographic capabilities in current frontier models, including Claude, GPT, and Gemini. 🧵
7
16
83
@davlindner
David Lindner
19 days
Had a great conversation with Daniel about our MONA paper. We got into many fun technical details but also covered the big picture and how this method could be useful for building safe AGI. Thanks for having me on!
@dfrsrchtwts
Daniel Filan
21 days
New episode with @davlindner, covering his work on MONA! Check it out - video link in reply.
1
3
57
@davlindner
David Lindner
3 months
RT @RyanPGreenblatt: IMO, this isn't much of an update against CoT monitoring hopes. They show unfaithfulness when the reasoning is minima….
0
16
0
@davlindner
David Lindner
3 months
RT @NeelNanda5: Agreed that people aren't paying enough attention to our course - anyone seeing this tweet can help change that!.
0
8
0
@davlindner
David Lindner
3 months
RT @NeelNanda5: I'm very excited that GDM's AGI Safety & Security Approach is out! I'm very happy with how the interp section came out. I'm….
0
6
0
@davlindner
David Lindner
3 months
Very glad to share this giant paper outlining our technical approach to AGI safety and security at GDM! No time to read 145 pages? Check out the 10-page extended abstract at the beginning of the paper.
@sebkrier
Séb Krier
3 months
Excited to share @GoogleDeepMind's AGI safety and security strategy to tackle risks like misuse and misalignment. Rather than high-level principles, this 145-page paper outlines a concrete, defense-in-depth technical approach: proactively evaluating & restricting dangerous
0
0
17
@davlindner
David Lindner
3 months
RT @rohinmshah: Just released GDM’s 100+ page approach to AGI safety & security! (Don’t worry, there’s a 10 page summary.). AGI will be tra….
0
68
0
@davlindner
David Lindner
3 months
RT @GoogleDeepMind: AGI could revolutionize many fields - from healthcare to education - but it's crucial that it’s developed responsibly.….
0
184
0
@davlindner
David Lindner
3 months
RT @jenner_erik: My colleagues @emmons_scott @davlindner and I will be mentoring a research stream on AI control/monitoring and other topic….
0
2
0
@davlindner
David Lindner
4 months
Consider applying for MATS if you're interested in working on an AI alignment research project this summer! I'm a mentor, as are many of my colleagues at DeepMind.
@ryan_kidd44
Ryan Kidd
4 months
@MATSprogram Summer 2025 applications close Apr 18! Come help advance the fields of AI alignment, security, and governance with mentors including @NeelNanda5 @EthanJPerez @OwainEvans_UK @EvanHub @bshlgrs @dawnsongtweets @DavidSKrueger @RichardMCNgo and more!
0
1
22