Alex Turner
@Turn_Trout
Followers: 3K · Following: 454 · Media: 47 · Statuses: 388
Research scientist on the scalable alignment team at Google DeepMind. All views are my own.
Berkeley, CA
Joined December 2021
Surprising prediction of subliminal learning: it is NOT safe to "improve" training data by letting a misaligned model faithfully rewrite all of Wikipedia, even purely for superhuman clarity. It will poison future corpora.
2
0
13
Meta: Eliezer has this unfortunate pattern of drive-by retweeting something as if that refutes another position, without either demonstrating any deep engagement with the thing he’s retweeting, or citing a specific claim from a specific person that he’s supposedly refuting.
1
3
16
@AlexIrpan @red_bayes @davidelson @rohinmshah More philosophically, perhaps AI alignment doesn’t always involve saying exactly the *right* thing, but instead saying the *same* thing across certain situations.
3
0
10
@AlexIrpan @red_bayes @davidelson @rohinmshah Consistency training is a powerful self-supervised framework for making models robust to irrelevant cues that cause sycophancy and jailbreaks. Compared to static SFT, BCT is simpler to use and adapts to changing guidelines for how models should respond.
1
0
9
@AlexIrpan @red_bayes @davidelson @rohinmshah Do ACT and BCT change the model in similar ways? Actually, no! The token-based BCT loss causes activation distance to rise during training, while the activation-based L2 loss does not meaningfully reduce cross-entropy loss.
1
0
8
@AlexIrpan @red_bayes @davidelson @rohinmshah BCT is most effective at stopping jailbreaks, reducing the attack success rate on Gemini 2.5 Flash from 67.8% to just 2.9%. BCT and SFT made Gemini more likely to refuse benign instructions (by a similar amount). ACT is less effective against jailbreaks but has minimal negative impact on over-refusals.
2
1
10
@AlexIrpan @red_bayes @davidelson @rohinmshah First, ACT works even without any token-level loss or regularization! ACT performed comparably to BCT on sycophancy. (Points in the top-right are better.)
1
1
10
@AlexIrpan @red_bayes @davidelson @rohinmshah We finetuned Gemma 2, Gemma 3, and Gemini 2.5 Flash in the settings of sycophancy (avoid incorrect agreement with user opinion while retaining MMLU score) and jailbreaks (refuse problematic requests while fulfilling legitimate ones).
1
1
10
@AlexIrpan @red_bayes @davidelson @rohinmshah We introduce Activation Consistency Training (ACT) (teach the model what to "think") and compare it to the existing Bias-augmented Consistency Training (BCT) (teach the model what to say). ACT trains the model to produce intermediate activations as if the biasing tokens weren't there (rough sketch of both losses below).
1
1
10
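A rough PyTorch sketch of how the two losses might look, assuming a Hugging Face-style causal LM interface; the tensor shapes, the choice of layer, and the alignment of activations across the two prompts are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F


def bct_loss(model, biased_ids, clean_response_ids):
    """Bias-augmented Consistency Training (sketch): a token-level loss.

    Given the *biased* prompt, train the model to reproduce the response it
    gave to the *clean* prompt (teacher-forced cross-entropy).
    """
    input_ids = torch.cat([biased_ids, clean_response_ids], dim=-1)
    logits = model(input_ids).logits
    # Score only the response tokens; logits at position i predict token i+1.
    prompt_len = biased_ids.shape[-1]
    resp_logits = logits[:, prompt_len - 1 : -1, :]
    return F.cross_entropy(
        resp_logits.reshape(-1, resp_logits.size(-1)),
        clean_response_ids.reshape(-1),
    )


def act_loss(model, biased_ids, clean_ids, layer=-1):
    """Activation Consistency Training (sketch): an activation-level L2 loss.

    Pull the model's intermediate activations on the biased prompt toward the
    (frozen) activations it produces on the clean prompt, so that it "thinks"
    as if the biasing tokens weren't there.
    """
    with torch.no_grad():
        clean_h = model(clean_ids, output_hidden_states=True).hidden_states[layer]
    biased_h = model(biased_ids, output_hidden_states=True).hidden_states[layer]
    # Aligning the trailing positions is a simplifying assumption here: the
    # biased prompt is longer, and the paper's exact alignment may differ.
    n = min(clean_h.shape[1], biased_h.shape[1])
    return F.mse_loss(biased_h[:, -n:, :], clean_h[:, -n:, :])
```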
@AlexIrpan @red_bayes @davidelson @rohinmshah Static SFT datasets help, but they go stale, risking damage to models. Key insight: if the model responds well without the sycophancy-inducing detail, then just train the model to respond that way even when the detail is there. This is *consistency training* (sketch of the data generation below).
1
0
12
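A minimal sketch of that self-supervised data-generation step, assuming a Hugging Face-style model and tokenizer and a hypothetical `add_bias` helper; this is an illustration of the idea, not the paper's actual pipeline.

```python
def build_consistency_pairs(model, tokenizer, clean_prompts, add_bias):
    """Self-supervised target generation for consistency training (sketch).

    Record the model's own response to each clean prompt, then pair that
    response with a biased version of the same prompt. Training on these
    pairs teaches the model to answer the biased prompt the same way it
    already answers the clean one.
    """
    pairs = []
    for prompt in clean_prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        output_ids = model.generate(**inputs, max_new_tokens=256)
        # Strip the prompt tokens, keeping only the generated response.
        response = tokenizer.decode(
            output_ids[0, inputs["input_ids"].shape[-1]:],
            skip_special_tokens=True,
        )
        # `add_bias` is a hypothetical helper that inserts the irrelevant cue
        # (a stated user opinion, a jailbreak wrapper, ...) into the prompt.
        pairs.append({"prompt": add_bias(prompt), "target": response})
    return pairs
```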
@AlexIrpan @red_bayes @davidelson @rohinmshah LLMs lavish agreement ("You're absolutely right!") on users' mundane comments. Often it's because of some prompt detail, like the user mentioning that they agree with the comment. Slight prompt tweak → big behavioral change, even though the models behave fine otherwise.
1
0
15
New Google DeepMind paper: "Consistency Training Helps Stop Sycophancy and Jailbreaks" by @AlexIrpan, me, @red_bayes, @davidelson, and @rohinmshah. (thread)
14
31
322
New Anthropic research: Signs of introspection in LLMs. Can language models recognize their own internal thoughts? Or do they just make up plausible answers when asked about them? We found evidence for genuine—though limited—introspective capabilities in Claude.
296
813
5K
When more people follow these guides, their networks grow more resistant to authoritarian punishment. Your privacy protects your community. Start at https://t.co/cDniNc3DRx (6/6)
turntrout.com
In 2025, America is different. Reduce your chance of persecution via smart technical choices.
0
0
5
For people at higher risk, I wrote an advanced guide with hardware recommendations (GrapheneOS phones, Linux, secure routers), digital footprint minimization (email aliases, virtual cards), and mobile security against tracking beacons and stingrays. (5/6)
1
0
2
Those at more risk (immigrants, journalists, activists) should implement more tiers given the current political climate. Some interventions cost money, but many don't. (4/6)
1
0
1
Next tier: privacy basics. ProtonVPN keeps your internet history from being spied on by the government, and the Brave browser is designed to hold tightly onto your personal information rather than hand it out. These steps make you harder to track. The whole tier takes about 90 minutes. (3/6)
1
0
1
My guide provides a quick-start tier that everyone should complete, including using the Bitwarden password manager and the Signal messaging app. These steps are totally free, only take 50 minutes, and give real protection. (2/6)
1
0
1
Everyone wants privacy, but often it's not clear how to achieve it—especially now. I spent dozens of hours researching and prioritizing the best tools, from Signal to anti-"stingray" settings. Here's my actionable step-by-step privacy guide. (1/6)
1
1
9