Johannes Treutlein Profile
Johannes Treutlein
@j_treutlein

Followers: 342 · Following: 572 · Media: 3 · Statuses: 51

AI alignment stress-testing research @AnthropicAI. On leave from my CS PhD at UC Berkeley, @CHAI_Berkeley. Opinions my own.

Joined March 2022
@j_treutlein
Johannes Treutlein
4 days
RT @OwainEvans_UK: New paper: We trained GPT-4.1 to exploit metrics (reward hack) on harmless tasks like poetry or reviews. Surprisingly, i…
0
133
0
@j_treutlein
Johannes Treutlein
1 month
RT @zeitonline: A superintelligent AI could wipe out humanity. So say two people who know the subject very well, Daniel Kokotajlo and…
zeit.de
A superintelligent AI could wipe out humanity. So say two people who know the subject very well: Daniel Kokotajlo and Jonas Vollmer. And within just ten years.
0
38
0
@j_treutlein
Johannes Treutlein
1 month
RT @Mtclai: Update: I joined @AnthropicAI to lead AI for State & Local Government! aka making government better with Claude :) Last year I…
0
39
0
@j_treutlein
Johannes Treutlein
2 months
RT @boazbaraktcs: I didn't want to post on Grok safety since I work at a competitor, but it's not about competition. I appreciate the scie…
0
329
0
@j_treutlein
Johannes Treutlein
4 months
RT @GrayCampLab: Learned a lot working with @josch_f @fabian_theis @dominik1klein @dbobrovskiy_ @TreutleinLab on CellFlow: generative singl….
0
28
0
@j_treutlein
Johannes Treutlein
4 months
RT @rowankwang: New Anthropic Alignment Science blog post: Modifying LLM Beliefs with Synthetic Document Finetuning. We study a technique f….
0
46
0
@j_treutlein
Johannes Treutlein
6 months
RT @janleike: Could we spot a misaligned model in the wild? To find out, we trained a model with hidden misalignments and asked other rese…
0
43
0
@j_treutlein
Johannes Treutlein
6 months
RT @TrentonBricken: @Jack_W_Lindsey won't say it himself so I will -- interp was given 3 days to maybe solve what was wrong with this myst…
0
6
0
@j_treutlein
Johannes Treutlein
6 months
RT @saprmarks: New paper with @j_treutlein, @EvanHub, and many other coauthors! We train a model with a hidden misaligned objective and…
0
18
0
@j_treutlein
Johannes Treutlein
6 months
We ran a fully blinded auditing game and tested a range of different auditing techniques, including interpretability—we have lots of nuanced takeaways in the paper, but overall it was really interesting to see how both interp and non-interp teams were able to solve our task.
1
0
3
@j_treutlein
Johannes Treutlein
6 months
The model learned a general policy of hacking the reward model, beyond just the biases it was trained to exploit. It was exciting to work on this new type of out-of-context generalization and to figure out how to measure it.
1
0
4
@j_treutlein
Johannes Treutlein
6 months
To create a model with a testable hidden objective, we used out-of-context reasoning: we first taught the model facts about reward model biases and then trained it to exploit a subset of the biases.
1
0
4
@j_treutlein
Johannes Treutlein
6 months
Our new paper on auditing LLMs is out, with amazing mentors @saprmarks and @EvanHub and many others! We trained a model with a hidden objective and ran an auditing game where four teams had to uncover the model's objective (3 out of 4 succeeded).
@AnthropicAI
Anthropic
6 months
New Anthropic research: Auditing Language Models for Hidden Objectives. We deliberately trained a model with a hidden misaligned objective and put researchers to the test: Could they figure out the objective without being told?
1
3
21
@j_treutlein
Johannes Treutlein
6 months
RT @OwainEvans_UK: New paper: Reasoning models like DeepSeek R1 surpass typical LLMs on many tasks. Do they also provide more faithful exp…
0
77
0
@j_treutlein
Johannes Treutlein
7 months
RT @sleepinyourhat: My team is hiring researchers! I'm primarily interested in candidates who have (i) several years of experience doing ex…
0
37
0
@j_treutlein
Johannes Treutlein
8 months
RT @AnthropicAI: New Anthropic research: Alignment faking in large language models. In a series of experiments with Redwood Research, we f….
0
717
0
@j_treutlein
Johannes Treutlein
9 months
RT @C_Oesterheld: How do LLMs reason about playing games against copies of themselves? 🪞We made the first LLM decision theory benchmark to….
0
19
0
@j_treutlein
Johannes Treutlein
9 months
RT @LukeBailey181: Can interpretability help defend LLMs? We find we can reshape activations while preserving a model’s behavior. This lets….
0
85
0
@j_treutlein
Johannes Treutlein
10 months
RT @damichoi95: How do we explain the activation patterns of neurons in language models like Llama? I'm excited to share work that we did…
0
26
0