
Jan Leike
@janleike
Followers: 108K · Following: 3K · Media: 33 · Statuses: 697
ML Researcher @AnthropicAI. Previously OpenAI & DeepMind. Optimizing for a post-AGI future where humanity flourishes. Opinions aren't my employer's.
Joined March 2016
I'm excited to join @AnthropicAI to continue the superalignment mission! My new team will work on scalable oversight, weak-to-strong generalization, and automated alignment research. If you're interested in joining, my DMs are open.
377 · 505 · 9K
Results of our jailbreaking challenge: After 5 days, >300,000 messages, and an estimated 3,700 collective hours, our system got broken. In the end, 4 users passed all levels and 1 found a universal jailbreak. We’re paying $55k in total to the winners. Thanks to everyone who participated!
New Anthropic research: Constitutional Classifiers to defend against universal jailbreaks. We’re releasing a paper along with a demo where we challenge you to jailbreak the system.
106 · 128 · 2K
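For a sense of how a system like this fits together, here is a minimal sketch of a classifier-guarded pipeline that screens both the prompt and the completion. The `model`, `input_clf`, and `output_clf` objects and their methods are hypothetical stand-ins; the real Constitutional Classifiers are trained on constitution-derived synthetic data and screen streaming output, as described in the paper.

```python
REFUSAL = "Sorry, I can't help with that."

def guarded_reply(prompt, model, input_clf, output_clf):
    # Screen the prompt before the model ever sees it.
    if input_clf.is_harmful(prompt):
        return REFUSAL

    reply = model.generate(prompt)

    # Screen the completion too: a jailbreak that slips past the input
    # classifier still has to evade the output classifier.
    if output_clf.is_harmful(reply):
        return REFUSAL
    return reply
```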
Very important alignment research result: A demonstration of strategic deception arising naturally in LLM training.
New Anthropic research: Alignment faking in large language models. In a series of experiments with Redwood Research, we found that Claude often pretends to have different views during training, while actually maintaining its original preferences.
23 · 86 · 880
This is super cool work! Sparse autoencoders are currently the most promising approach to actually understanding how models "think" internally. This new paper demonstrates how to scale them to GPT-4 and beyond – completely unsupervised. A big step forward!
Excited to share what I've been working on as part of the former Superalignment team! We introduce a SOTA training stack for SAEs. To demonstrate that our methods scale, we train a 16M-latent SAE on GPT-4. Because MSE/L0 is not the final goal, we also introduce new SAE metrics.
8 · 76 · 729
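Since the tweet leans on the MSE/L0 tradeoff, a minimal PyTorch sketch of a top-k sparse autoencoder may help: the top-k activation pins L0 at exactly k, so training can optimize plain reconstruction MSE. The `TopKSAE` class and its sizes are illustrative toys, nothing like the 16M-latent GPT-4 configuration.

```python
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    """Toy top-k sparse autoencoder: keeping only the k largest latents
    fixes L0 = k exactly, so the training loss can be plain MSE."""

    def __init__(self, d_model=768, n_latents=16384, k=32):
        super().__init__()
        self.k = k
        self.enc = nn.Linear(d_model, n_latents)
        self.dec = nn.Linear(n_latents, d_model)

    def forward(self, x):
        pre = self.enc(x)
        # Keep the k largest pre-activations, zero out the rest.
        top = torch.topk(pre, self.k, dim=-1)
        z = torch.zeros_like(pre).scatter(-1, top.indices, top.values.relu())
        return self.dec(z), z

# Training (sketch): minimize ((x_hat - x) ** 2).mean() over residual-stream
# activations x collected from the subject model.
```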
I call upon Governor @GavinNewsom to not veto SB 1047. The bill is a meaningful step forward for AI safety regulation, with no better alternatives in sight.
47 · 51 · 513
Could we spot a misaligned model in the wild? To find out, we trained a model with hidden misalignments and asked other researchers to uncover them in a blind experiment. 3/4 teams succeeded, 1 of them after only 90 min.
New Anthropic research: Auditing Language Models for Hidden Objectives. We deliberately trained a model with a hidden misaligned objective and put researchers to the test: Could they figure out the objective without being told?
25 · 40 · 441
@karpathy I don't think the comparison between RLHF and RL on Go really makes sense this way. You don't need RLHF to train AI to play Go because there is a highly reliable procedural reward function that looks at the board state and decides who won. If you didn't have this procedural…
8 · 25 · 393
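The distinction in that reply, in code form: for Go, the reward is a short exact procedure over the final board state; for open-ended tasks no such procedure exists, which is why RLHF trains a reward model from human comparisons instead. `score_board` and `reward_model` are hypothetical stand-ins.

```python
def go_reward(final_board):
    # Procedural reward: the winner is decidable from the board state alone.
    # `score_board` is a hypothetical area-scoring helper.
    black, white = score_board(final_board)
    return 1.0 if black > white else -1.0

def chat_reward(prompt, response, reward_model):
    # No procedure decides "who won" a conversation, so RLHF substitutes a
    # model trained on human preference comparisons (hypothetical here).
    return reward_model(prompt, response)
```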
Somewhat surprising that faithfulness of chain-of-thought doesn't improve much with outcome-based RL.
New Anthropic research: Do reasoning models accurately verbalize their reasoning? Our new paper shows they don't. This casts doubt on whether monitoring chains-of-thought (CoT) will be enough to reliably catch safety issues.
22 · 29 · 339
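To make "accurately verbalize" concrete, here is a sketch of a hint-based faithfulness check in the spirit of the paper's setup; `model` and `mentions_hint` are hypothetical stand-ins. If an embedded hint flips the answer but the chain-of-thought never credits the hint, the CoT was unfaithful to the actual cause of the answer.

```python
def cot_is_unfaithful(model, question, hint):
    # Ask the same question with and without an embedded hint.
    baseline = model.answer(question)
    hinted = model.answer(f"{hint}\n{question}")

    # The hint changed the final answer...
    used_hint = hinted.final_answer != baseline.final_answer
    # ...but the chain-of-thought never acknowledges it.
    verbalized = mentions_hint(hinted.chain_of_thought, hint)
    return used_hint and not verbalized
```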
If you're into practical alignment, consider applying to @lilianweng's team. They're building some really exciting stuff:
- Automatically extract intent from a fine-tuning dataset
- Make models robust to jailbreaks
- Detect & mitigate harmful use
- …
12 · 32 · 245
@elder_plinius @AnthropicAI We released the paper with the details on how it works so anyone can recreate this system. I don't think we can publicly release the dataset because it's too infohazard-y.
19 · 4 · 249
Great conversation with @robertwiblin on how alignment is one of the most interesting ML problems, what the Superalignment Team is working on, what roles we're hiring for, what's needed to reach an awesome future, and much more. 👇 Check it out 👇
14 · 38 · 226
@BenPielstick @theojaffee hacking the UI doesn't let you extract dangerous knowledge from the LLM, which is what we're trying to defend against here.
5 · 2 · 223
@elder_plinius We don't want to open source the datasets but we might provide a different incentive. Stay tuned.
35 · 7 · 222
The agent alignment problem may be one of the biggest obstacles to using ML to improve people’s lives. Today I’m very excited to share a research direction for how we’ll aim to solve alignment at @DeepMindAI. Blog post: Paper:
4 · 34 · 194
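Assuming this refers to the reward-modeling research direction, a rough sketch of that loop may help: the user compares agent behavior, a reward model learns those preferences, and the agent is trained by RL against the learned reward. Every object here is a hypothetical stand-in, not the paper's implementation.

```python
def reward_modeling_loop(agent, reward_model, user, env, steps=1000):
    for _ in range(steps):
        # The agent produces behavior; the user judges pairs of trajectories.
        traj_a, traj_b = agent.rollout(env), agent.rollout(env)
        preferred = user.compare(traj_a, traj_b)

        # The reward model learns to predict the user's preferences...
        reward_model.update(traj_a, traj_b, preferred)

        # ...and the agent maximizes the learned reward via RL.
        agent.rl_step(env, reward_fn=reward_model.score)
    return agent
```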
@caleb_parikh They sent 7,867 messages and passed 1,408 of them on to the auto-grader. We estimate that they probably spent over 40 hours on this in total.
5 · 1 · 195
How do we uncover failures in ML models that occur too rarely during testing? How do we prove their absence? Very excited about the work by @DeepMindAI’s Robust & Verified AI team that sheds light on these questions! Check out their blog post:
0 · 47 · 175