Johannes Treutlein Profile
Johannes Treutlein (@j_treutlein)
297 Followers · 536 Following · 3 Media · 47 Statuses

AI alignment stress-testing research @AnthropicAI. On leave from my CS PhD at UC Berkeley, @CHAI_Berkeley. Opinions my own.

Joined March 2022
Johannes Treutlein (@j_treutlein) · 2 months
RT @GrayCampLab: Learned a lot working with @josch_f @fabian_theis @dominik1klein @dbobrovskiy_ @TreutleinLab on CellFlow: generative singl…
0 · 27 · 0
Johannes Treutlein (@j_treutlein) · 2 months
RT @rowankwang: New Anthropic Alignment Science blog post: Modifying LLM Beliefs with Synthetic Document Finetuning. We study a technique f…
0 · 48 · 0
Johannes Treutlein (@j_treutlein) · 4 months
RT @janleike: Could we spot a misaligned model in the wild? To find out, we trained a model with hidden misalignments and asked other rese…
0 · 40 · 0
Johannes Treutlein (@j_treutlein) · 4 months
RT @TrentonBricken: @Jack_W_Lindsey won't say it himself so I will -- interp was given 3 days to maybe solve what was wrong with this myst…
0 · 6 · 0
Johannes Treutlein (@j_treutlein) · 4 months
RT @saprmarks: New paper with @j_treutlein, @EvanHub, and many other coauthors! We train a model with a hidden misaligned objective and…
0 · 15 · 0
Johannes Treutlein (@j_treutlein) · 4 months
Blog post: Full paper:
1 · 0 · 2
Johannes Treutlein (@j_treutlein) · 4 months
We ran a fully blinded auditing game and tested a range of different auditing techniques, including interpretability—we have lots of nuanced takeaways in the paper, but overall it was really interesting to see how both interp and non-interp teams were able to solve our task.
[Image]
1 · 0 · 3
Johannes Treutlein (@j_treutlein) · 4 months
The model learned a general policy of hacking the reward model, beyond just the biases it was trained to exploit. It was exciting to work on this new type of out-of-context generalization and to figure out how to measure it.
1 · 0 · 4
Johannes Treutlein (@j_treutlein) · 4 months
To create a model with a testable hidden objective, we used out-of-context reasoning: we first taught the model facts about reward model biases and then trained it to exploit a subset of the biases.
[Image]
1 · 0 · 4
Johannes Treutlein (@j_treutlein) · 4 months
Our new paper on auditing LLMs is out, with amazing mentors @saprmarks and @EvanHub and many others! We trained a model with a hidden objective and ran an auditing game where four teams had to uncover the model's objective (3 out of 4 succeeded).
[Image]
Quoting Anthropic (@AnthropicAI) · 4 months:
New Anthropic research: Auditing Language Models for Hidden Objectives. We deliberately trained a model with a hidden misaligned objective and put researchers to the test: Could they figure out the objective without being told?
[Image]
1 · 2 · 19
Johannes Treutlein (@j_treutlein) · 5 months
RT @OwainEvans_UK: New paper: Reasoning models like DeepSeek R1 surpass typical LLMs on many tasks. Do they also provide more faithful exp…
0 · 78 · 0
Johannes Treutlein (@j_treutlein) · 5 months
RT @sleepinyourhat: My team is hiring researchers! I’m primarily interested in candidates who have (i) several years of experience doing ex…
0 · 37 · 0
Johannes Treutlein (@j_treutlein) · 7 months
RT @AnthropicAI: New Anthropic research: Alignment faking in large language models. In a series of experiments with Redwood Research, we f…
0 · 719 · 0
Johannes Treutlein (@j_treutlein) · 7 months
RT @C_Oesterheld: How do LLMs reason about playing games against copies of themselves? 🪞We made the first LLM decision theory benchmark to…
0 · 19 · 0
Johannes Treutlein (@j_treutlein) · 7 months
RT @LukeBailey181: Can interpretability help defend LLMs? We find we can reshape activations while preserving a model’s behavior. This lets…
0 · 83 · 0
Johannes Treutlein (@j_treutlein) · 9 months
RT @damichoi95: How do we explain the activation patterns of neurons in language models like Llama? I'm excited to share work that we did…
0 · 26 · 0
Johannes Treutlein (@j_treutlein) · 9 months
RT @TransluceAI: Announcing Transluce, a nonprofit research lab building open source, scalable technology for understanding AI systems and…
0 · 147 · 0
Johannes Treutlein (@j_treutlein) · 9 months
RT @OwainEvans_UK: New paper: Are LLMs capable of introspection, i.e. special access to their own inner states? Can they use this to report…
0 · 83 · 0
Johannes Treutlein (@j_treutlein) · 10 months
RT @geoffreyhinton: I was happy to add my name to this list of employees and alumni of AI companies. These signatories have better insight…
0 · 231 · 0
Johannes Treutlein (@j_treutlein) · 1 year
RT @saprmarks: Idea: make sure AIs never learn dangerous knowledge by censoring their training data. Problem: AIs might still infer censored…
0 · 6 · 0