
Jan Leike
@janleike
Followers
111K
Following
4K
Media
33
Statuses
713
ML Researcher @AnthropicAI. Previously OpenAI & DeepMind. Optimizing for a post-AGI future where humanity flourishes. Opinions aren't my employer's.
San Francisco, USA
Joined March 2016
I'm excited to join @AnthropicAI to continue the superalignment mission!. My new team will work on scalable oversight, weak-to-strong generalization, and automated alignment research. If you're interested in joining, my dms are open.
389
503
9K
If you don't train your CoTs to look nice, you could get some safety from monitoring them. This seems good to do!. But I'm skeptical this will work reliably enough to be load-bearing in a safety case. Plus as RL is scaled up, I expect CoTs to become less and less legible.
A simple AGI safety technique: AI’s thoughts are in plain English, just read them. We know it works, with OK (not perfect) transparency!. The risk is fragility: RL training, new architectures, etc threaten transparency. Experts from many orgs agree we should try to preserve it:
21
13
285
RT @sleepinyourhat: 🧵✨🙏 With the new Claude Opus 4, we conducted what I think is by far the most thorough pre-launch alignment assessment t….
0
165
0
These mitigations address jailbreaking for catastrophic misuse and are based on the constitutional classifiers we demoed earlier this year 👇.
New Anthropic research: Constitutional Classifiers to defend against universal jailbreaks. We’re releasing a paper along with a demo where we challenge you to jailbreak the system.
4
3
103
RT @yanda_chen_: My first paper @AnthropicAI is out!. We show that Chains-of-Thought often don’t reflect models’ true reasoning—posing chal….
0
87
0
Somewhat surprising that faithfulness of chain-of-thought doesn't improve much with outcome-based RL.
New Anthropic research: Do reasoning models accurately verbalize their reasoning?. Our new paper shows they don't. This casts doubt on whether monitoring chains-of-thought (CoT) will be enough to reliably catch safety issues.
20
29
343
Could we spot a misaligned model in the wild?. To find out, we trained a model with hidden misalignments and asked other researchers to uncover them in a blind experiment. 3/4 teams succeeded, 1 of them after only 90 min.
New Anthropic research: Auditing Language Models for Hidden Objectives. We deliberately trained a model with a hidden misaligned objective and put researchers to the test: Could they figure out the objective without being told?
24
41
448