Johannes Treutlein Profile
Johannes Treutlein
@j_treutlein

Followers: 342 · Following: 572 · Media: 3 · Statuses: 51

AI alignment stress-testing research @AnthropicAI. On leave from my CS PhD at UC Berkeley, @CHAI_Berkeley. Opinions my own.

Joined March 2022
@j_treutlein
Johannes Treutlein
4 days
RT @OwainEvans_UK: New paper: We trained GPT-4.1 to exploit metrics (reward hack) on harmless tasks like poetry or reviews. Surprisingly, i…
0
133
0
@j_treutlein
Johannes Treutlein
1 month
RT @zeitonline: A superintelligent AI could wipe out humanity. So say two people who know the subject very well, Daniel Kokotajlo and…
zeit.de
A superintelligent AI could wipe out humanity. So say two people who know the subject very well: Daniel Kokotajlo and Jonas Vollmer. And within just ten years.
0
38
0
@j_treutlein
Johannes Treutlein
1 month
RT @Mtclai: Update: I joined @AnthropicAI to lead AI for State & Local Government! aka making government better with Claude :) Last year I…
0
39
0
@j_treutlein
Johannes Treutlein
2 months
RT @boazbaraktcs: I didn't want to post on Grok safety since I work at a competitor, but it's not about competition. I appreciate the scie…
0
329
0
@j_treutlein
Johannes Treutlein
4 months
RT @GrayCampLab: Learned a lot working with @josch_f @fabian_theis @dominik1klein @dbobrovskiy_ @TreutleinLab on CellFlow: generative singl….
0
28
0
@j_treutlein
Johannes Treutlein
4 months
RT @rowankwang: New Anthropic Alignment Science blog post: Modifying LLM Beliefs with Synthetic Document Finetuning. We study a technique f….
0
46
0
@j_treutlein
Johannes Treutlein
6 months
RT @janleike: Could we spot a misaligned model in the wild? To find out, we trained a model with hidden misalignments and asked other rese…
0
43
0
@j_treutlein
Johannes Treutlein
6 months
RT @TrentonBricken: @Jack_W_Lindsey won't say it himself so I will -- interp was given 3 days to maybe solve what was wrong with this myst…
0
6
0
@j_treutlein
Johannes Treutlein
6 months
RT @saprmarks: New paper with @j_treutlein, @EvanHub, and many other coauthors! We train a model with a hidden misaligned objective and…
0
18
0
@j_treutlein
Johannes Treutlein
6 months
We ran a fully blinded auditing game and tested a range of different auditing techniques, including interpretability—we have lots of nuanced takeaways in the paper, but overall it was really interesting to see how both interp and non-interp teams were able to solve our task.
1
0
3
@j_treutlein
Johannes Treutlein
6 months
The model learned a general policy of hacking the reward model, beyond just the biases it was trained to exploit. It was exciting to work on this new type of out-of-context generalization and to figure out how to measure it.
1
0
4
@j_treutlein
Johannes Treutlein
6 months
To create a model with a testable hidden objective, we used out-of-context reasoning: we first taught the model facts about reward model biases and then trained it to exploit a subset of the biases.
1
0
4
@j_treutlein
Johannes Treutlein
6 months
Our new paper on auditing LLMs is out, with amazing mentors @saprmarks and @EvanHub and many others! We trained a model with a hidden objective and ran an auditing game where four teams had to uncover the model's objective (3 out of 4 succeeded).
@AnthropicAI
Anthropic
6 months
New Anthropic research: Auditing Language Models for Hidden Objectives. We deliberately trained a model with a hidden misaligned objective and put researchers to the test: Could they figure out the objective without being told?
1
3
21
@j_treutlein
Johannes Treutlein
6 months
RT @OwainEvans_UK: New paper: Reasoning models like DeepSeek R1 surpass typical LLMs on many tasks. Do they also provide more faithful exp…
0
77
0
@j_treutlein
Johannes Treutlein
7 months
RT @sleepinyourhat: My team is hiring researchers! I'm primarily interested in candidates who have (i) several years of experience doing ex…
0
37
0
@j_treutlein
Johannes Treutlein
8 months
RT @AnthropicAI: New Anthropic research: Alignment faking in large language models. In a series of experiments with Redwood Research, we f….
0
717
0
@j_treutlein
Johannes Treutlein
9 months
RT @C_Oesterheld: How do LLMs reason about playing games against copies of themselves? 🪞We made the first LLM decision theory benchmark to….
0
19
0
@j_treutlein
Johannes Treutlein
9 months
RT @LukeBailey181: Can interpretability help defend LLMs? We find we can reshape activations while preserving a model’s behavior. This lets….
0
85
0
@j_treutlein
Johannes Treutlein
10 months
RT @damichoi95: How do we explain the activation patterns of neurons in language models like Llama? I'm excited to share work that we did…
0
26
0