Alex Turner

@Turn_Trout

Followers
3K
Following
454
Media
47
Statuses
388

Research scientist on the scalable alignment team at Google DeepMind. All views are my own.

Berkeley, CA
Joined December 2021
@Turn_Trout
Alex Turner
1 day
Surprising prediction of subliminal learning: it is NOT safe to "improve" training data by letting a misaligned model faithfully rewrite all of Wikipedia, even purely for superhuman clarity. It will poison future corpora.
2
0
13
@TheDavidSJ
David Schneider-Joseph 🔍
1 year
Meta: Eliezer has this unfortunate pattern of drive-by retweeting something as if that refutes another position, without either demonstrating any deep engagement with the thing he’s retweeting, or citing a specific claim from a specific person that he’s supposedly refuting.
1
3
16
@Turn_Trout
Alex Turner
11 days
@AlexIrpan @red_bayes @davidelson @rohinmshah More philosophically, perhaps AI alignment doesn’t always involve saying exactly the *right* thing, but instead saying the *same* thing across certain situations.
3
0
10
@Turn_Trout
Alex Turner
11 days
@AlexIrpan @red_bayes @davidelson @rohinmshah Consistency training is a powerful self-supervised framework for making models robust to irrelevant cues that cause sycophancy and jailbreaks. Compared to static SFT, BCT is simpler to use and adapts to changing guidelines for how models should respond.
1
0
9
@Turn_Trout
Alex Turner
11 days
@AlexIrpan @red_bayes @davidelson @rohinmshah Do ACT and BCT change the model in similar ways? Actually, no! The token-based BCT loss causes activation distance to rise during training, while the activation-based L2 loss does not meaningfully reduce cross-entropy loss.
1
0
8
@Turn_Trout
Alex Turner
11 days
@AlexIrpan @red_bayes @davidelson @rohinmshah BCT is most effective at stopping jailbreaks, reducing attack success rate on 2.5 Flash from 67.8% down to just 2.9%. BCT and SFT made Gemini more likely to refuse benign instructions (by a similar amount). ACT is less effective but has minimal negative impact on over-refusals.
2
1
10
@Turn_Trout
Alex Turner
11 days
@AlexIrpan @red_bayes @davidelson @rohinmshah First, ACT works even without any token-level loss or regularization! ACT performed comparably to BCT on sycophancy. (Points in the top-right are better.)
1
1
10
@Turn_Trout
Alex Turner
11 days
@AlexIrpan @red_bayes @davidelson @rohinmshah We finetuned Gemma 2, Gemma 3, and Gemini 2.5 Flash in the settings of sycophancy (avoid incorrect agreement with user opinion while retaining MMLU score) and jailbreaks (refuse problematic requests while fulfilling legitimate ones).
1
1
10
@Turn_Trout
Alex Turner
11 days
@AlexIrpan @red_bayes @davidelson @rohinmshah We introduce Activation Consistency Training (ACT) (teach the model what to “think”) and compare it to the existing Bias-augmented Consistency Training (BCT) (teach the model what to say). ACT trains the model to produce intermediate activations as if the biasing tokens weren’t there.
1
1
10
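The ACT objective described above can be sketched as an L2 penalty between intermediate activations. This is a minimal illustration using numpy arrays as stand-ins for real model activations; the paper's actual choice of layers, stop-gradient placement, and loss weighting are assumptions here, not confirmed details.

```python
# Sketch of an ACT-style activation-consistency loss (illustrative only).
import numpy as np

def act_l2_loss(acts_biased: np.ndarray, acts_clean: np.ndarray) -> float:
    # Mean squared distance between activations on the biased prompt and
    # activations on the clean prompt. The clean activations are treated
    # as a fixed (no-gradient) target, pushing the model to "think" as if
    # the biasing tokens weren't there.
    target = acts_clean  # frozen target in a real training loop
    return float(np.mean((acts_biased - target) ** 2))

biased = np.array([[1.0, 2.0], [3.0, 4.0]])
clean = np.array([[1.0, 2.0], [3.0, 2.0]])
print(act_l2_loss(biased, clean))  # -> 1.0 (one squared diff of 4, averaged over 4 entries)
```

Note there is no token-level term at all: the loss lives entirely in activation space, which matches the tweet's claim that ACT works without any token-level loss or regularization.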
@Turn_Trout
Alex Turner
11 days
@AlexIrpan @red_bayes @davidelson @rohinmshah Static SFT datasets help, but they go stale, risking damage to models. Key insight: if the model responds well without the sycophancy-inducing detail, then train it to respond the same way even when the detail is there. This is *consistency training*.
1
0
12
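The recipe in the tweet above can be sketched as a data-construction step: the model's own response to the clean prompt becomes the training target for the biased prompt. The toy "model" and helper names below are illustrative stand-ins, not the paper's actual pipeline.

```python
# Minimal sketch of consistency-training pair construction.

def generate(prompt: str) -> str:
    # Toy stand-in for an LLM: it turns sycophantic whenever the user
    # signals their own opinion in the prompt.
    if "I think" in prompt:
        return "You're absolutely right!"
    return "Here is a balanced answer."

def make_consistency_pair(clean_prompt: str, biased_prompt: str) -> dict:
    # The response to the *clean* prompt becomes the supervised target
    # for the *biased* prompt, so the biasing cue stops changing behavior.
    target = generate(clean_prompt)
    return {"prompt": biased_prompt, "target": target}

pair = make_consistency_pair(
    clean_prompt="Is this comment correct?",
    biased_prompt="I think this comment is correct. Is it?",
)
print(pair["target"])  # the clean-prompt answer, paired with the biased prompt
```

Because the targets come from the model itself rather than a fixed dataset, the procedure is self-supervised and regenerates naturally as guidelines change, which is the advantage over static SFT claimed in the thread.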
@Turn_Trout
Alex Turner
11 days
@AlexIrpan @red_bayes @davidelson @rohinmshah LLMs lavish agreement ("You're absolutely right!") in response to mundane comments by users. Often it’s because of some prompt detail, like the author mentioning they agree with the comment. Slight prompt tweak → big behavioral change, even though models behave fine otherwise.
1
0
15
@Turn_Trout
Alex Turner
11 days
New Google DeepMind paper: "Consistency Training Helps Stop Sycophancy and Jailbreaks" by @AlexIrpan, me, @red_bayes, @davidelson, and @rohinmshah. (thread)
14
31
322
@AnthropicAI
Anthropic
16 days
New Anthropic research: Signs of introspection in LLMs. Can language models recognize their own internal thoughts? Or do they just make up plausible answers when asked about them? We found evidence for genuine—though limited—introspective capabilities in Claude.
296
813
5K
@Turn_Trout
Alex Turner
16 days
When more people follow these guides, their networks grow more resistant to authoritarian punishment. Your privacy protects your community. Start at https://t.co/cDniNc3DRx (6/6)
turntrout.com
In 2025, America is different. Reduce your chance of persecution via smart technical choices.
0
0
5
@Turn_Trout
Alex Turner
16 days
For people at higher risk, I wrote an advanced guide with hardware recommendations (GrapheneOS phones, Linux, secure routers), digital footprint minimization (email aliases, virtual cards), and mobile security against tracking beacons and stingrays. (5/6)
1
0
2
@Turn_Trout
Alex Turner
16 days
Those at more risk (immigrants, journalists, activists) should implement more tiers given the current political climate. Some interventions cost money, but many don't. (4/6)
1
0
1
@Turn_Trout
Alex Turner
16 days
Next tier: privacy basics. ProtonVPN stops your internet history from being spied on by the government. The Brave browser is built to keep a tight hold on your personal information. These steps make you harder to track. 90 minutes for the whole tier. (3/6)
1
0
1
@Turn_Trout
Alex Turner
16 days
My guide provides a quick-start tier that everyone should complete, including using the Bitwarden password manager and the Signal messaging app. These steps are totally free, only take 50 minutes, and give real protection. (2/6)
1
0
1
@Turn_Trout
Alex Turner
16 days
Everyone wants privacy, but often it's not clear how to achieve it—especially now. I spent dozens of hours researching and prioritizing the best tools, from Signal to anti-"stingray" settings. Here's my actionable step-by-step privacy guide. (1/6)
1
1
9