
Johannes Treutlein
@j_treutlein
Followers: 342 · Following: 572 · Media: 3 · Statuses: 51
AI alignment stress-testing research @AnthropicAI. On leave from my CS PhD at UC Berkeley, @CHAI_Berkeley. Opinions my own.
Joined March 2022
RT @OwainEvans_UK: New paper: We trained GPT-4.1 to exploit metrics (reward hack) on harmless tasks like poetry or reviews. Surprisingly, i…
RT @zeitonline: A superintelligent AI could wipe out humanity. That's what two people who know the field very well say, Daniel Kokotajlo an…
zeit.de
A superintelligent AI could wipe out humanity. That's what two people who know the field very well say: Daniel Kokotajlo and Jonas Vollmer. And within as little as ten years.
RT @Mtclai: Update: I joined @AnthropicAI to lead AI for State & Local Government! aka making government better with Claude :) Last year I…
RT @boazbaraktcs: I didn't want to post on Grok safety since I work at a competitor, but it's not about competition. I appreciate the scie…
RT @GrayCampLab: Learned a lot working with @josch_f @fabian_theis @dominik1klein @dbobrovskiy_ @TreutleinLab on CellFlow: generative singl…
RT @rowankwang: New Anthropic Alignment Science blog post: Modifying LLM Beliefs with Synthetic Document Finetuning. We study a technique f…
RT @janleike: Could we spot a misaligned model in the wild? To find out, we trained a model with hidden misalignments and asked other rese…
RT @TrentonBricken: @Jack_W_Lindsey won't say it himself so I will -- interp was given 3 days to maybe solve what was wrong with this myst…
RT @saprmarks: New paper with @j_treutlein, @EvanHub, and many other coauthors! We train a model with a hidden misaligned objective and…
Blog post: Full paper:
anthropic.com
A collaboration between Anthropic's Alignment Science and Interpretability teams
Our new paper on auditing LLMs is out, with amazing mentors @saprmarks and @EvanHub and many others! We trained a model with a hidden objective and ran an auditing game where four teams had to uncover the model's objective (3 out of 4 succeeded).
New Anthropic research: Auditing Language Models for Hidden Objectives. We deliberately trained a model with a hidden misaligned objective and put researchers to the test: Could they figure out the objective without being told?
RT @OwainEvans_UK: New paper: Reasoning models like DeepSeek R1 surpass typical LLMs on many tasks. Do they also provide more faithful exp…
RT @sleepinyourhat: My team is hiring researchers! I'm primarily interested in candidates who have (i) several years of experience doing ex…
RT @AnthropicAI: New Anthropic research: Alignment faking in large language models. In a series of experiments with Redwood Research, we f…
RT @C_Oesterheld: How do LLMs reason about playing games against copies of themselves? 🪞We made the first LLM decision theory benchmark to…
RT @LukeBailey181: Can interpretability help defend LLMs? We find we can reshape activations while preserving a model’s behavior. This lets…
RT @damichoi95: How do we explain the activation patterns of neurons in language models like Llama? I'm excited to share work that we did…