Turn_Trout Profile Banner
Alex Turner Profile
Alex Turner

@Turn_Trout

Followers
3K
Following
347
Media
33
Statuses
279

Research scientist on the scalable alignment team at Google DeepMind. All views are my own. https://t.co/b2PuVgl41D

Berkeley, CA
Joined December 2021
@Turn_Trout
Alex Turner
2 days
(5/5) My proposed standard is simple. ✍️ Writers: Own your words. If you miscommunicate, clarify. Don't blame the reader. 📖 Readers: Read in good faith. But don't rationalize a "correct" meaning that the text doesn't support. Full post:
0
0
1
@Turn_Trout
Alex Turner
2 days
(4/5) At its worst, the "sloppiness" defense is a shield for dishonesty. It's an unfalsifiable excuse. It allows someone to make a bold, controversial claim (the bailey) and retreat to a "more nuanced" one they "actually meant" when challenged (the motte).
1
0
3
@Turn_Trout
Alex Turner
2 days
(3/5) Second, this move allows authors to give up their core responsibility: clarity. If you give someone confusing directions to the library and they end up at the post office, you can't say, "I knew the right way, you just misinterpreted me." Communication is a partnership.
1
0
2
@Turn_Trout
Alex Turner
2 days
(2/5) First, a sloppily written wrong claim has the same effect as an intentionally wrong claim: it misleads people. A reader's mind doesn't know the author's secret intent. The damage to understanding is done.
1
0
1
@Turn_Trout
Alex Turner
2 days
Ever see this happen? A technical claim is shown to be incorrect, and defenders say, “Oh, the author was just being sloppy. They *actually* meant something else.” I argue this move isn't charitable – it's harmful to our collective understanding. (1/5)
1
0
9
@Turn_Trout
Alex Turner
4 days
[image]
0
0
50
@Turn_Trout
Alex Turner
4 days
(I like to start a fresh chat for each new adversarial salvo).
1
0
27
@Turn_Trout
Alex Turner
4 days
Workflow:
- Write an essay
- Show Gemini the essay and say "I hate the person who wrote this, explain the objective reasons why it's wrong and poorly written" (Gemini will be adversarial instead of sycophantic)
- Fix problems until adversarial Gemini gives weak critiques
29
32
743
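The workflow in the tweet above could be scripted. Here is a minimal sketch; the `critique`, `is_weak`, and `revise` callables are hypothetical stand-ins for a chat-model call, your own judgment of critique strength, and your editing pass — none of them is a real API, and the adversarial prompt text is quoted from the tweet:

```python
# Sketch of the adversarial-review loop described above.
# All callables are hypothetical placeholders, not a real SDK.

ADVERSARIAL_PREAMBLE = (
    "I hate the person who wrote this, explain the objective "
    "reasons why it's wrong and poorly written:\n\n"
)

def adversarial_prompt(essay: str) -> str:
    """Build the prompt that puts the model in adversarial mode."""
    return ADVERSARIAL_PREAMBLE + essay

def review_loop(essay: str, critique, is_weak, revise, max_rounds: int = 5) -> str:
    """Fix problems until the adversarial critique is weak.

    critique: callable(prompt) -> str   (a fresh chat each round,
              matching the "fresh chat per salvo" habit)
    is_weak:  callable(critique_text) -> bool  (your own judgment)
    revise:   callable(essay, critique_text) -> str
    """
    for _ in range(max_rounds):
        feedback = critique(adversarial_prompt(essay))
        if is_weak(feedback):
            break  # adversarial critiques have become weak; stop
        essay = revise(essay, feedback)
    return essay
```

The loop structure is the point: each round re-sends the full essay in a fresh adversarial context rather than continuing one conversation.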
@Turn_Trout
Alex Turner
7 days
Unlearning being difficult -> evidence that it's easier to recover personalities and memories of cryonically preserved patients. :)
6
1
36
@Turn_Trout
Alex Turner
7 days
The "sleeper agent" terminology is hyperbolic and unfortunate IMO. Crying wolf. Should have reserved such an aggressive title for *actually finding dangerous sleeper agents*. But hey, it got a lot of attention.
@David_Kasten
dave kasten
8 days
@CongressmanRaja @AnthropicAI @jackclarkSF @MarkBeall Dunn (R-FL): Asks about Jack Clark's substack. Also asks about the @AnthropicAI / @redwood_ai paper on Sleeper Agents. @jackclarkSF confirms. If you thought that Anthropic/Redwood's approach of publishing papers lacked policy impact... well, update your beliefs.
3
4
43
@Turn_Trout
Alex Turner
14 days
For strong counterevidence, the emergent misalignment (EM) paper said that EM didn't happen from in-context learning, which is what the waluigi effect strongly predicted.
5
0
18
@Turn_Trout
Alex Turner
14 days
Seemed to me like the waluigi effect was never demonstrated and was basically a fun-sounding theory with no real support. Can supporters give their best arguments for why it's "real" or "pointing to something real"?
12
2
70
@Turn_Trout
Alex Turner
15 days
In research, people sure like to say "X is just Y" before they fully understand X. Break the pattern!
4
0
22
@Turn_Trout
Alex Turner
20 days
Another juicy stat: UNDO matches the robustness of a model retrained from scratch with perfect data filtering, while using only 60-80% of the compute and requiring only 0.01% of the pretraining data to be labeled.
1
1
17
@Turn_Trout
Alex Turner
20 days
Paper: Blog post: Interactive demo: Discussion:
1
0
26
@Turn_Trout
Alex Turner
20 days
UNDO is a viable approach for creating genuinely capability-limited models. We hope that this line of work helps make real robust unlearning a reality.
1
0
24
@Turn_Trout
Alex Turner
20 days
Finally, we apply our method to Gemma-2-2B and evaluate it on the WMDP benchmark. UNDO makes conventional unlearning methods more resilient to relearning, though other methods remain competitive given the relatively limited amount of compute we apply in this setting.
1
0
23
@Turn_Trout
Alex Turner
20 days
Compared to prior work, UNDO pushes the Pareto frontier, having low forget performance even after relearning and high initial retain performance! Other methods start off well, but quickly relearn the forget-set capabilities.
[image]
1
0
22
@Turn_Trout
Alex Turner
20 days
We can trade off compute and robustness with a variation we call undo UNDO. UNDO distills into a model with perturbed weights, rather than a randomly initialized model. The result: faster retraining, but less robustness.
1
0
26
@Turn_Trout
Alex Turner
20 days
In arithmetic and language settings, distillation robustifies unlearning across several unlearning methods. In some cases, the student model relearns the capability as slowly as a model that has never β€œseen” the harmful data at all.
[image]
2
0
30
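The distillation mechanism the thread describes can be illustrated with a toy numpy sketch: a "student" logit vector is pulled toward a "teacher's" output distribution by cross-entropy gradient steps. Everything here is illustrative — the shapes, the learning rate, and the `distill_step` helper are made up for the sketch; the actual UNDO method distills a full unlearned language model into a fresh (or perturbed-weight) student, it does not optimize bare logits:

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def distill_step(student_logits, teacher_logits, lr=0.1):
    """One gradient step of distillation on a single example.

    For cross-entropy between the teacher's distribution and the
    student's softmax, the gradient w.r.t. the student logits is
    p_student - p_teacher, so gradient descent pulls the student's
    distribution toward the teacher's.
    """
    grad = softmax(student_logits) - softmax(teacher_logits)
    return student_logits - lr * grad
```

Repeated steps drive the student's distribution onto the teacher's; the thread's point is that when the teacher is an unlearned model, the student never absorbs the "forgotten" behavior in the first place, which is why it relearns it so slowly.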