Alex Turner

@Turn_Trout

Followers
3K
Following
454
Media
47
Statuses
388

Research scientist on the scalable alignment team at Google DeepMind. All views are my own.

Berkeley, CA
Joined December 2021
@Turn_Trout
Alex Turner
1 day
Surprising prediction of subliminal learning: it is NOT safe to "improve" training data by letting a misaligned model faithfully rewrite all of Wikipedia, even purely for superhuman clarity. It will poison future corpora.
2
0
13
@TheDavidSJ
David Schneider-Joseph 🔍
1 year
Meta: Eliezer has this unfortunate pattern of drive-by retweeting something as if that refutes another position, without either demonstrating any deep engagement with the thing he’s retweeting, or citing a specific claim from a specific person that he’s supposedly refuting.
1
3
16
@Turn_Trout
Alex Turner
11 days
@AlexIrpan @red_bayes @davidelson @rohinmshah More philosophically, perhaps AI alignment doesn’t always involve saying exactly the *right* thing, but instead saying the *same* thing across certain situations.
3
0
10
@Turn_Trout
Alex Turner
11 days
@AlexIrpan @red_bayes @davidelson @rohinmshah Consistency training is a powerful self-supervised framework for making models robust to irrelevant cues that cause sycophancy and jailbreaks. Compared to static SFT, BCT is simpler to use and adapts to changing guidelines for how models should respond.
1
0
9
@Turn_Trout
Alex Turner
11 days
@AlexIrpan @red_bayes @davidelson @rohinmshah Do ACT and BCT change the model in similar ways? Actually, no! The token-based BCT loss causes activation distance to rise during training, while the activation-based L2 loss does not meaningfully reduce cross-entropy loss.
1
0
8
@Turn_Trout
Alex Turner
11 days
@AlexIrpan @red_bayes @davidelson @rohinmshah BCT is most effective at stopping jailbreaks, reducing attack success rate on 2.5 Flash from 67.8% down to just 2.9%. BCT and SFT made Gemini more likely to refuse benign instructions (by a similar amount). ACT is less effective but has minimal negative impact on over-refusals.
2
1
10
@Turn_Trout
Alex Turner
11 days
@AlexIrpan @red_bayes @davidelson @rohinmshah First, ACT works even without any token-level loss or regularization! ACT performed comparably to BCT on sycophancy. (Points in the top-right are better.)
1
1
10
@Turn_Trout
Alex Turner
11 days
@AlexIrpan @red_bayes @davidelson @rohinmshah We finetuned Gemma 2, Gemma 3, and Gemini 2.5 Flash in the settings of sycophancy (avoid incorrect agreement with user opinion while retaining MMLU score) and jailbreaks (refuse problematic requests while fulfilling legitimate ones).
1
1
10
@Turn_Trout
Alex Turner
11 days
@AlexIrpan @red_bayes @davidelson @rohinmshah We introduce Activation Consistency Training (ACT) (teach the model what to “think”) and compare it to the existing Bias-augmented Consistency Training (BCT) (teach the model what to say). ACT trains the model to produce intermediate activations as if the biasing tokens weren’t there.
1
1
10
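The ACT objective described above can be sketched as an L2 penalty between intermediate activations. This is a minimal illustration using numpy arrays as stand-ins for real model activations; the paper's actual choice of layers, stop-gradient placement, and loss weighting are assumptions here, not confirmed details.

```python
# Sketch of an ACT-style activation-consistency loss (illustrative only).
import numpy as np

def act_l2_loss(acts_biased: np.ndarray, acts_clean: np.ndarray) -> float:
    # Mean squared distance between activations on the biased prompt and
    # activations on the clean prompt. The clean activations are treated
    # as a fixed (no-gradient) target, pushing the model to "think" as if
    # the biasing tokens weren't there.
    target = acts_clean  # frozen target in a real training loop
    return float(np.mean((acts_biased - target) ** 2))

biased = np.array([[1.0, 2.0], [3.0, 4.0]])
clean = np.array([[1.0, 2.0], [3.0, 2.0]])
print(act_l2_loss(biased, clean))  # -> 1.0 (one squared diff of 4, averaged over 4 entries)
```

Note there is no token-level term at all: the loss lives entirely in activation space, which matches the tweet's claim that ACT works without any token-level loss or regularization.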
@Turn_Trout
Alex Turner
11 days
@AlexIrpan @red_bayes @davidelson @rohinmshah Static SFT datasets help, but they go stale, risking damage to models. Key insight: if the model responds well without the sycophancy-inducing detail, then train it to respond the same way even when the detail is there. This is *consistency training*.
1
0
12
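The recipe in the tweet above can be sketched as a data-construction step: the model's own response to the clean prompt becomes the training target for the biased prompt. The toy "model" and helper names below are illustrative stand-ins, not the paper's actual pipeline.

```python
# Minimal sketch of consistency-training pair construction.

def generate(prompt: str) -> str:
    # Toy stand-in for an LLM: it turns sycophantic whenever the user
    # signals their own opinion in the prompt.
    if "I think" in prompt:
        return "You're absolutely right!"
    return "Here is a balanced answer."

def make_consistency_pair(clean_prompt: str, biased_prompt: str) -> dict:
    # The response to the *clean* prompt becomes the supervised target
    # for the *biased* prompt, so the biasing cue stops changing behavior.
    target = generate(clean_prompt)
    return {"prompt": biased_prompt, "target": target}

pair = make_consistency_pair(
    clean_prompt="Is this comment correct?",
    biased_prompt="I think this comment is correct. Is it?",
)
print(pair["target"])  # the clean-prompt answer, paired with the biased prompt
```

Because the targets come from the model itself rather than a fixed dataset, the procedure is self-supervised and regenerates naturally as guidelines change, which is the advantage over static SFT claimed in the thread.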
@Turn_Trout
Alex Turner
11 days
@AlexIrpan @red_bayes @davidelson @rohinmshah LLMs lavish agreement ("You're absolutely right!") in response to mundane comments by users. Often it’s because of some prompt detail, like the author mentioning they agree with the comment. Slight prompt tweak → big behavioral change, even though models behave fine otherwise.
1
0
15
@Turn_Trout
Alex Turner
11 days
New Google DeepMind paper: "Consistency Training Helps Stop Sycophancy and Jailbreaks" by @AlexIrpan, me, @red_bayes, @davidelson, and @rohinmshah. (thread)
14
31
322
@AnthropicAI
Anthropic
16 days
New Anthropic research: Signs of introspection in LLMs. Can language models recognize their own internal thoughts? Or do they just make up plausible answers when asked about them? We found evidence for genuine—though limited—introspective capabilities in Claude.
296
813
5K
@Turn_Trout
Alex Turner
16 days
When more people follow these guides, their networks grow more resistant to authoritarian punishment. Your privacy protects your community. Start at https://t.co/cDniNc3DRx (6/6)
turntrout.com
In 2025, America is different. Reduce your chance of persecution via smart technical choices.
0
0
5
@Turn_Trout
Alex Turner
16 days
For people at higher risk, I wrote an advanced guide with hardware recommendations (GrapheneOS phones, Linux, secure routers), digital footprint minimization (email aliases, virtual cards), and mobile security against tracking beacons and stingrays. (5/6)
1
0
2
@Turn_Trout
Alex Turner
16 days
Those at more risk (immigrants, journalists, activists) should implement more tiers given the current political climate. Some interventions cost money, but many don't. (4/6)
1
0
1
@Turn_Trout
Alex Turner
16 days
Next tier: privacy basics. ProtonVPN stops your internet history from being spied on by the government. The Brave browser is built to keep a tight hold on your personal information. These steps make you harder to track. 90 minutes for the whole tier. (3/6)
1
0
1
@Turn_Trout
Alex Turner
16 days
My guide provides a quick-start tier that everyone should complete, including using the Bitwarden password manager and the Signal messaging app. These steps are totally free, only take 50 minutes, and give real protection. (2/6)
1
0
1
@Turn_Trout
Alex Turner
16 days
Everyone wants privacy, but often it's not clear how to achieve it—especially now. I spent dozens of hours researching and prioritizing the best tools, from Signal to anti-"stingray" settings. Here's my actionable step-by-step privacy guide. (1/6)
1
1
9