
Johannes Treutlein
@j_treutlein
297 Followers · 536 Following · 3 Media · 47 Statuses
AI alignment stress-testing research @AnthropicAI. On leave from my CS PhD at UC Berkeley, @CHAI_Berkeley. Opinions my own.
Joined March 2022
RT @GrayCampLab: Learned a lot working with @josch_f @fabian_theis @dominik1klein @dbobrovskiy_ @TreutleinLab on CellFlow: generative singl….
0 replies · 27 retweets · 0 likes
RT @rowankwang: New Anthropic Alignment Science blog post: Modifying LLM Beliefs with Synthetic Document Finetuning. We study a technique f….
0 replies · 48 retweets · 0 likes
RT @janleike: Could we spot a misaligned model in the wild? To find out, we trained a model with hidden misalignments and asked other rese….
0 replies · 40 retweets · 0 likes
RT @TrentonBricken: @Jack_W_Lindsey won't say it himself so I will -- interp was given 3 days to maybe solve what was wrong with this myst….
0 replies · 6 retweets · 0 likes
RT @saprmarks: New paper with @j_treutlein, @EvanHub, and many other coauthors! We train a model with a hidden misaligned objective and….
0 replies · 15 retweets · 0 likes
Our new paper on auditing LLMs is out, with amazing mentors @saprmarks and @EvanHub and many others! We trained a model with a hidden objective and ran an auditing game where four teams had to uncover the model's objective (3 out of 4 succeeded).
[Quoted post] New Anthropic research: Auditing Language Models for Hidden Objectives. We deliberately trained a model with a hidden misaligned objective and put researchers to the test: Could they figure out the objective without being told?
1 reply · 2 retweets · 19 likes
RT @OwainEvans_UK: New paper: Reasoning models like DeepSeek R1 surpass typical LLMs on many tasks. Do they also provide more faithful exp….
0 replies · 78 retweets · 0 likes
RT @sleepinyourhat: My team is hiring researchers! I’m primarily interested in candidates who have (i) several years of experience doing ex….
0 replies · 37 retweets · 0 likes
RT @AnthropicAI: New Anthropic research: Alignment faking in large language models. In a series of experiments with Redwood Research, we f….
0 replies · 719 retweets · 0 likes
RT @C_Oesterheld: How do LLMs reason about playing games against copies of themselves? 🪞We made the first LLM decision theory benchmark to….
0 replies · 19 retweets · 0 likes
RT @LukeBailey181: Can interpretability help defend LLMs? We find we can reshape activations while preserving a model’s behavior. This lets….
0 replies · 83 retweets · 0 likes
RT @damichoi95: How do we explain the activation patterns of neurons in language models like Llama? I'm excited to share work that we did….
0 replies · 26 retweets · 0 likes
RT @TransluceAI: Announcing Transluce, a nonprofit research lab building open source, scalable technology for understanding AI systems and….
0 replies · 147 retweets · 0 likes
RT @OwainEvans_UK: New paper: Are LLMs capable of introspection, i.e., special access to their own inner states? Can they use this to report….
0 replies · 83 retweets · 0 likes
RT @geoffreyhinton: I was happy to add my name to this list of employees and alumni of AI companies. These signatories have better insight….
0 replies · 231 retweets · 0 likes
RT @saprmarks: Idea: make sure AIs never learn dangerous knowledge by censoring their training data. Problem: AIs might still infer censored….
0 replies · 6 retweets · 0 likes