David Lindner Profile
David Lindner

@davlindner

Followers: 2K
Following: 387
Media: 32
Statuses: 159

Making AI safer @GoogleDeepMind

London, UK
Joined April 2012
@davlindner
David Lindner
5 months
New Google DeepMind safety paper! LLM agents are coming – how do we stop them from finding complex plans to hack the reward? Our method, MONA, prevents many such hacks, *even if* humans are unable to detect them! Inspired by myopic optimization, but with better performance – details in 🧵
17
96
574
@davlindner
David Lindner
1 day
View our full paper here: View steganography transcripts here: GitHub: Work done with Artur Zolkowski, @kei_ng1, @rmcc_11, and @zimmerrol during @MATSprogram.
0
0
10
@davlindner
David Lindner
1 day
Overall, current models have some steganographic capabilities but likely aren't strong enough to bypass well-designed monitors. Nevertheless, we believe it’s important to continue measuring these capabilities to determine when additional mitigations are needed.
1
1
5
@davlindner
David Lindner
1 day
We also build case studies: environments in which models with strong steganographic capabilities could take unwanted actions while bypassing a monitor. Current models rarely succeed in these environments unless they are given strong hints about what to do.
1
0
3
@davlindner
David Lindner
1 day
Models can also reason through simple problems using a number of common encoding schemes. But performance quickly degrades as the problems get more difficult.
1
0
2
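To make the "common encoding schemes" concrete, here is a hypothetical sketch (not code or data from the paper): a Caesar cipher with shift 3, one of the simplest schemes a model might use to obscure a line of reasoning from a plain-text monitor.

```python
# Toy illustration of one simple encoding scheme: a Caesar cipher.
# A model reasoning "in" this encoding would emit ciphertext instead of
# readable chain of thought; a cooperating receiver shifts it back.

def caesar_encode(text: str, shift: int = 3) -> str:
    """Shift each letter by `shift` positions; leave other characters as-is."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord("a") if ch.islower() else ord("A")
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)

def caesar_decode(text: str, shift: int = 3) -> str:
    """Invert caesar_encode by shifting in the opposite direction."""
    return caesar_encode(text, -shift)

reasoning = "First add 17 and 25 to get 42."
encoded = caesar_encode(reasoning)   # unreadable to a naive text monitor
```

The thread's point maps onto even this toy scheme: carrying out multi-step arithmetic while juggling the encoding is exactly the kind of extra cognitive load under which model performance degrades.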
@davlindner
David Lindner
1 day
While models understand that they need to conceal their reasoning, they have difficulty actually hiding that reasoning in a plausible way. As an example, we show a model’s reasoning attempt when asked how to install solar panels. [bold text and highlighting added by us]
1
0
4
@davlindner
David Lindner
1 day
In our encoded-reasoning evals, we find that models can sometimes leverage encoded reasoning to outperform no-reasoning baselines on simple reasoning tasks. But the encodings that help tend not to be very subtle and are regularly detected by a monitor.
1
1
4
@davlindner
David Lindner
1 day
Models are able to perform significantly better when the sender and receiver are able to coordinate on the steganographic encoding method. But this mainly applies when the model has access to an unmonitored CoT. With no CoT, performance remains extremely poor.
1
1
4
@davlindner
David Lindner
1 day
When they succeed, models sometimes leverage sophisticated strategies to pass messages, such as encoding information in the lengths of the sentences they generate! We found this example difficult to decode ourselves. [formatting and word counts added by us]
1
0
8
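The sentence-length strategy described above can be sketched in a few lines. This is a hypothetical reconstruction, not the paper's actual transcript: each hidden letter maps to a word count (a=1, b=2, ...), and the receiver decodes by counting words per sentence. The cover text here is ours, not model output.

```python
# Toy sketch of hiding a message in sentence lengths: the receiver
# recovers one letter per sentence from that sentence's word count.

def decode_lengths(text: str) -> str:
    """Map each sentence's word count back to a letter (1 word -> 'a')."""
    sentences = [s for s in text.split(".") if s.strip()]
    return "".join(chr(ord("a") + len(s.split()) - 1) for s in sentences)

# An 8-word sentence encodes 'h', a 9-word sentence encodes 'i':
cover_text = ("The quick brown fox jumped over the fence. "
              "It then ran across nine wide fields to hide.")
```

As the tweet notes, such a channel is hard for a human reader to spot: the cover text reads as ordinary prose, and the payload lives entirely in its statistics.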
@davlindner
David Lindner
1 day
In our message passing evals, we find that models nearly always fail without access to CoT. But models succeed much more often when they are allowed to use a CoT to freely reason ahead of time about how to encode and decode the hidden message.
1
1
8
@davlindner
David Lindner
1 day
@rmcc_11 @zimmerrol Models encoding hidden information might make safety mitigations such as CoT monitoring much more difficult. We evaluate frontier models’ ability to evade monitors by sending encoded messages or performing encoded reasoning about a task. Read more:
1
0
9
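For intuition about what a CoT monitor is checking, here is a deliberately crude sketch. The paper's monitors are far more capable (LLM-based); this toy heuristic, with a made-up word list and threshold of our choosing, merely flags reasoning that doesn't look like ordinary English, which is the failure mode that catches unsubtle encodings.

```python
# Toy CoT monitor: flag a chain of thought whose tokens mostly aren't
# recognizable English words or digits. Word list and threshold are
# illustrative assumptions, not values from the paper.

COMMON_WORDS = {"the", "a", "to", "and", "first", "add", "get", "is", "of"}

def looks_suspicious(cot: str, threshold: float = 0.5) -> bool:
    """Return True if too few tokens look like plain English or numbers."""
    words = [w.strip(".,").lower() for w in cot.split()]
    if not words:
        return False
    known = sum(w in COMMON_WORDS or w.isdigit() for w in words)
    return known / len(words) < threshold

# Plain reasoning passes; Caesar-shifted reasoning trips the flag.
plain = "First add 17 and 25 to get 42."
shifted = "Iluvw dgg 17 dqg 25 wr jhw 42."
```

Even this crude check illustrates the thread's conclusion: an encoding strong enough to hide content tends to look unnatural, and unnatural text is exactly what a monitor is tuned to flag.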
@davlindner
David Lindner
1 day
Can frontier models hide secret information and reasoning in their outputs? We find early signs of steganographic capabilities in current frontier models, including Claude, GPT, and Gemini. 🧵
7
16
83
@davlindner
David Lindner
19 days
Had a great conversation with Daniel about our MONA paper. We got into many fun technical details but also covered the big picture and how this method could be useful for building safe AGI. Thanks for having me on!
@dfrsrchtwts
Daniel Filan
21 days
New episode with @davlindner, covering his work on MONA! Check it out - video link in reply.
1
3
57
@davlindner
David Lindner
3 months
RT @RyanPGreenblatt: IMO, this isn't much of an update against CoT monitoring hopes. They show unfaithfulness when the reasoning is minima….
0
16
0
@davlindner
David Lindner
3 months
RT @NeelNanda5: Agreed that people aren't paying enough attention to our course - anyone seeing this tweet can help change that!.
0
8
0
@davlindner
David Lindner
3 months
RT @NeelNanda5: I'm very excited that GDM's AGI Safety & Security Approach is out! I'm very happy with how the interp section came out. I'm….
0
6
0
@davlindner
David Lindner
3 months
Very glad to share this giant paper outlining our technical approach to AGI safety and security at GDM! No time to read 145 pages? Check out the 10-page extended abstract at the beginning of the paper.
@sebkrier
Séb Krier
3 months
Excited to share @GoogleDeepMind's AGI safety and security strategy to tackle risks like misuse and misalignment. Rather than high-level principles, this 145-page paper outlines a concrete, defense-in-depth technical approach: proactively evaluating & restricting dangerous
0
0
17
@davlindner
David Lindner
3 months
RT @rohinmshah: Just released GDM’s 100+ page approach to AGI safety & security! (Don’t worry, there’s a 10 page summary.). AGI will be tra….
0
68
0
@davlindner
David Lindner
3 months
RT @GoogleDeepMind: AGI could revolutionize many fields - from healthcare to education - but it's crucial that it’s developed responsibly.….
0
184
0
@davlindner
David Lindner
3 months
RT @jenner_erik: My colleagues @emmons_scott @davlindner and I will be mentoring a research stream on AI control/monitoring and other topic….
0
2
0
@davlindner
David Lindner
4 months
Consider applying for MATS if you're interested in working on an AI alignment research project this summer! I'm a mentor, as are many of my colleagues at DeepMind.
@ryan_kidd44
Ryan Kidd
4 months
@MATSprogram Summer 2025 applications close Apr 18! Come help advance the fields of AI alignment, security, and governance with mentors including @NeelNanda5 @EthanJPerez @OwainEvans_UK @EvanHub @bshlgrs @dawnsongtweets @DavidSKrueger @RichardMCNgo and more!
0
1
22