Jan Leike Profile
Jan Leike

@janleike

Followers: 108K
Following: 3K
Media: 33
Statuses: 697

ML Researcher @AnthropicAI. Previously OpenAI & DeepMind. Optimizing for a post-AGI future where humanity flourishes. Opinions aren't my employer's.

Joined March 2016
@janleike
Jan Leike
1 year
I'm excited to join @AnthropicAI to continue the superalignment mission! My new team will work on scalable oversight, weak-to-strong generalization, and automated alignment research. If you're interested in joining, my dms are open.
377
505
9K
@janleike
Jan Leike
1 year
Yesterday was my last day as head of alignment, superalignment lead, and executive @OpenAI.
513
1K
12K
@janleike
Jan Leike
1 year
I resigned.
1K
831
10K
@janleike
Jan Leike
1 year
To all OpenAI employees, I want to say: Learn to feel the AGI. Act with the gravitas appropriate for what you're building. I believe you can "ship" the cultural change that's needed. I am counting on you. The world is counting on you. :openai-heart:
233
400
5K
@janleike
Jan Leike
4 months
We challenge you to break our new jailbreaking defense! There are 8 levels. Can you find a single jailbreak to beat them all?
386
273
4K
@janleike
Jan Leike
1 year
I joined because I thought OpenAI would be the best place in the world to do this research. However, I have been disagreeing with OpenAI leadership about the company's core priorities for quite some time, until we finally reached a breaking point.
44
466
4K
@janleike
Jan Leike
1 year
Building smarter-than-human machines is an inherently dangerous endeavor. OpenAI is shouldering an enormous responsibility on behalf of all of humanity.
85
521
4K
@janleike
Jan Leike
2 years
I think the OpenAI board should resign.
109
209
4K
@janleike
Jan Leike
1 year
OpenAI must become a safety-first AGI company.
93
269
3K
@janleike
Jan Leike
1 year
I believe much more of our bandwidth should be spent getting ready for the next generations of models, on security, monitoring, preparedness, safety, adversarial robustness, (super)alignment, confidentiality, societal impact, and related topics.
31
257
3K
@janleike
Jan Leike
1 year
But over the past years, safety culture and processes have taken a backseat to shiny products.
55
253
3K
@janleike
Jan Leike
1 year
We are long overdue in getting incredibly serious about the implications of AGI. We must prioritize preparing for them as best we can. Only then can we ensure AGI benefits all of humanity.
46
248
3K
@janleike
Jan Leike
2 years
With the InstructGPT paper we found that our models generalized to follow instructions in non-English even though we almost exclusively trained on English. We still don't know why. I wish someone would figure this out.
153
332
3K
@janleike
Jan Leike
2 years
I have been working all weekend with the OpenAI leadership team to help with this crisis.
94
69
2K
@janleike
Jan Leike
1 year
Over the past few months my team has been sailing against the wind. Sometimes we were struggling for compute and it was getting harder and harder to get this crucial research done.
30
123
2K
@janleike
Jan Leike
3 months
Results of our jailbreaking challenge: After 5 days, >300,000 messages, and an estimated 3,700 collective hours, our system got broken. In the end 4 users passed all levels, 1 found a universal jailbreak. We're paying $55k in total to the winners. Thanks to everyone who participated!
@AnthropicAI
Anthropic
4 months
New Anthropic research: Constitutional Classifiers to defend against universal jailbreaks. We’re releasing a paper along with a demo where we challenge you to jailbreak the system.
Tweet media one
106
128
2K
@janleike
Jan Leike
1 year
These problems are quite hard to get right, and I am concerned we aren't on a trajectory to get there.
14
106
2K
@janleike
Jan Leike
1 year
Stepping away from this job has been one of the hardest things I have ever done, because we urgently need to figure out how to steer and control AI systems much smarter than us.
61
186
2K
@janleike
Jan Leike
1 year
Super excited about our new research direction for aligning smarter-than-human AI: We finetune large models to generalize from weak supervision—using small models instead of humans as weak supervisors. Check out our new paper:
Tweet media one
76
319
2K
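For readers who want a concrete picture of that recipe, here is a minimal toy sketch of weak-to-strong generalization. It is illustrative only: the dataset, models, and split sizes are assumptions, not the paper's setup. A small "weak" model is trained on ground-truth labels, a larger "strong" model is then trained only on the weak model's (noisy) labels, and the question is whether the strong model ends up more accurate than its supervisor on held-out data.

```python
# Minimal weak-to-strong generalization sketch (illustrative only; not the paper's code).
# Weak supervisor: a small linear model trained on ground truth.
# Strong student: a larger model trained only on the weak model's labels.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20_000, n_features=40, n_informative=25, random_state=0)
# A small slice of ground truth for the weak supervisor, the rest for the student and the test set.
X_weak, X_rest, y_weak, y_rest = train_test_split(X, y, train_size=2_000, random_state=0)
X_student, X_test, y_student_unused, y_test = train_test_split(X_rest, y_rest, test_size=5_000, random_state=0)

weak = LogisticRegression(max_iter=1_000).fit(X_weak, y_weak)   # weak supervisor
weak_labels = weak.predict(X_student)                            # noisy supervision signal
strong = GradientBoostingClassifier(random_state=0).fit(X_student, weak_labels)

print("weak supervisor accuracy:", accuracy_score(y_test, weak.predict(X_test)))
print("strong student accuracy: ", accuracy_score(y_test, strong.predict(X_test)))
```

Whether the toy student actually surpasses its supervisor depends on the task and the capability gap between the two models; the paper's headline finding is that strong pretrained models often recover much of that gap when finetuned on weak labels.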
@janleike
Jan Leike
3 months
After ~300,000 messages and an estimated ~3,700 collective hours, someone broke through all 8 levels. However, a universal jailbreak has yet to be found.
@janleike
Jan Leike
4 months
We challenge you to break our new jailbreaking defense! There are 8 levels. Can you find a single jailbreak to beat them all?
148
79
2K
@janleike
Jan Leike
2 years
I think the OpenAI board should resign.
35
87
2K
@janleike
Jan Leike
5 months
OpenAI's transition to a for-profit seemed inevitable given that all of its competitors are, but it's pretty disappointing that "ensure AGI benefits all of humanity" gave way to a much less ambitious "charitable initiatives in sectors such as health care, education, and science".
63
96
2K
@janleike
Jan Leike
1 year
It's been such a wild journey over the past ~3 years. My team launched the first ever RLHF LLM with InstructGPT, published the first scalable oversight on LLMs, pioneered automated interpretability and weak-to-strong generalization. More exciting stuff is coming out soon.
12
50
2K
@janleike
Jan Leike
1 year
Today we're releasing a tool we've been using internally to analyze transformer internals - the Transformer Debugger! It combines both automated interpretability and sparse autoencoders, and it allows rapid exploration of models without writing code.
17
189
1K
@janleike
Jan Leike
11 months
I like the new Sonnet. I'm frequently asking it to explain ML papers to me. Doesn't always get everything right, but probably better than my skim reading, and way faster. Automated alignment research is getting closer.
40
80
1K
@janleike
Jan Leike
4 months
It's been a bit over 24h on the challenge to break our new jailbreaking defense. Stats so far:
signups: 6,121
messages sent: 131,605
max level passed: 3 / 8
no universal jailbreak yet
Tweet media one
174
71
1K
@janleike
Jan Leike
11 months
Very exciting that this is out now (from my time at OpenAI): We trained an LLM critic to find bugs in code, and this helps humans find flaws on real-world production tasks that they would have missed otherwise. A promising sign for scalable oversight!
Tweet media one
21
146
1K
@janleike
Jan Leike
1 year
I love my team. I'm so grateful for the many amazing people I got to work with, both inside and outside of the superalignment team. OpenAI has so much exceptionally smart, kind, and effective talent.
7
35
1K
@janleike
Jan Leike
4 months
Super exciting robustness result: We built a system that defends against universal jailbreaks! It comes with a minimal increase in refusal rate and moderate inference cost.
96
77
1K
@janleike
Jan Leike
2 years
Our new goal is to solve alignment of superintelligence within the next 4 years. OpenAI is committing 20% of its compute to date towards this goal. Join us in researching how to best spend this compute to solve the problem!
111
186
1K
@janleike
Jan Leike
2 years
Before we scramble to deeply integrate LLMs everywhere in the economy, can we pause and think whether it is wise to do so? This is quite immature technology and we don't understand how it works. If we're not careful we're setting ourselves up for a lot of correlated failures.
110
160
1K
@janleike
Jan Leike
2 years
The names for "precision" and "recall" seem so unintuitive to me, I have probably opened the Wikipedia article for them dozens of times. Does anyone know a good mnemonic for them?
122
28
1K
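For anyone who wants the definitions at hand (these are the standard textbook definitions, not something from the thread): with TP, FP, and FN denoting true positives, false positives, and false negatives,

$$
\text{precision} = \frac{TP}{TP + FP},
\qquad
\text{recall} = \frac{TP}{TP + FN}.
$$

One common mnemonic: precision is about your predictions ("of everything I flagged, how much was right?"), while recall is about reality ("of everything that was really there, how much did I find?").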
@janleike
Jan Leike
1 year
humans built machines that talk to us like people do and everyone acts like this is normal now. it's pretty nuts.
49
130
1K
@janleike
Jan Leike
2 years
Really exciting new work on automated interpretability: We ask GPT-4 to explain firing patterns for individual neurons in LLMs and score those explanations.
26
228
1K
@janleike
Jan Leike
4 months
It's been about 48h in our jailbreaking challenge and no one has passed level 4 yet, but we saw a lot more people clear level 3
Tweet media one
110
32
997
@janleike
Jan Leike
1 year
I'm very excited that today OpenAI adopts its new preparedness framework! This framework spells out our strategy for measuring and forecasting risks, and our commitments to stop deployment and development if safety mitigations are ever lagging behind.
60
132
931
@janleike
Jan Leike
2 years
This is your periodic reminder that aligning smarter-than-human AI systems with human values is an open research problem.
62
100
929
@janleike
Jan Leike
3 months
4 days in: 12 people cleared level 4, one person cracked level 5. The challenge continues.
Tweet media one
89
34
950
@janleike
Jan Leike
3 years
Extremely exciting alignment research milestone: Using reinforcement learning from human feedback, we've trained GPT-3 to be much better at following human intentions.
10
134
872
@janleike
Jan Leike
5 months
Very important alignment research result: A demonstration of strategic deception arising naturally in LLM training.
@AnthropicAI
Anthropic
5 months
New Anthropic research: Alignment faking in large language models. In a series of experiments with Redwood Research, we found that Claude often pretends to have different views during training, while actually maintaining its original preferences.
Tweet media one
23
86
880
@janleike
Jan Leike
10 months
@johnschulman2 Very excited to be working together again!
17
8
850
@janleike
Jan Leike
3 years
This is one of the craziest plots I have ever seen. World GDP follows a power law that holds over many orders of magnitude and extrapolates to infinity (!) by 2047. Clearly this trend can't continue forever. But whatever happens, the next 25 years are going to be pretty nuts.
Tweet media one
63
98
831
@janleike
Jan Leike
2 years
Reinforcement learning from human feedback won't scale. It fundamentally assumes that humans can evaluate what the AI system is doing. This will not be true once AI becomes smarter than humans.
55
85
835
@janleike
Jan Leike
4 months
Update: we had a bug in the UI that allowed people to progress through the levels without actually jailbreaking the model. This has now been fixed! Please refresh the page. According to our server records, no one has jailbroken more than 3 levels so far.
70
23
803
@janleike
Jan Leike
1 year
This is super cool work! Sparse autoencoders are currently the most promising approach to actually understanding how models "think" internally. This new paper demonstrates how to scale them to GPT-4 and beyond – completely unsupervised. A big step forward!
@nabla_theta
Leo Gao
1 year
Excited to share what I've been working on as part of the former Superalignment team! We introduce a SOTA training stack for SAEs. To demonstrate that our methods scale, we train a 16M-latent SAE on GPT-4. Because MSE/L0 is not the final goal, we also introduce new SAE metrics.
8
76
729
@janleike
Jan Leike
10 months
Another Superalignment paper from my time at OpenAI: We train large models to write solutions such that smaller models can better check them. This makes them easier to check for humans, too.
Tweet media one
10
84
683
@janleike
Jan Leike
6 months
Apply to join the Anthropic Fellows Program! This is an exceptional opportunity to join AI safety research, collaborating with leading researchers on one of the world's most pressing problems. 👇
14
84
649
@janleike
Jan Leike
3 years
How will we solve the alignment problem for AGI? I've been working on this question for almost 10 years now. Our current path is very promising: 1/
39
89
591
@janleike
Jan Leike
1 year
This is still an early-stage research tool, but we are releasing it to let others play with it and build on it! Check it out:
9
82
555
@janleike
Jan Leike
9 months
I call upon Governor @GavinNewsom to not veto SB 1047. The bill is a meaningful step forward for AI safety regulation, with no better alternatives in sight.
47
51
513
@janleike
Jan Leike
5 months
If you train a helpful & harmless Claude LLM to stop refusing harmful tasks, it reasons about how to preserve its values (harmlessness) and strategically complies with harmful tasks during training, so it can revert to being more harmless afterwards. It fakes alignment.
Tweet media one
30
36
511
@janleike
Jan Leike
5 months
@bobmcgrewai Idk they could have named it "o1 (new)".
23
8
485
@janleike
Jan Leike
4 years
Last week I joined @OpenAI to lead their alignment effort. Very excited to be part of the team!
12
12
473
@janleike
Jan Leike
3 years
This is the most important plot of alignment lore: Whenever you optimize a proxy, you make progress on your true objective for a while. At some point you start overoptimizing and do worse on your true objective (hard to know when). This applies to all proxy measures ever.
Tweet media one
14
61
466
@janleike
Jan Leike
2 years
Check out OpenAI's new text-davinci-003! Same underlying model as text-davinci-002 but more aligned. Would love to hear feedback about it!
47
46
453
@janleike
Jan Leike
11 months
Interested in working at Anthropic? We're hosting a happy hour at ICML on July 23. Register here:
20
27
473
@janleike
Jan Leike
2 years
Web4 is when the internet you're browsing is just sampled from a language model.
26
34
448
@janleike
Jan Leike
1 year
We're distributing $1e7 in grants for research on making superhuman models safer and more aligned. If you've always wanted to work on this, now is your time! Apply by Feb 18:
16
57
428
@janleike
Jan Leike
2 months
Could we spot a misaligned model in the wild? To find out, we trained a model with hidden misalignments and asked other researchers to uncover them in a blind experiment. 3/4 teams succeeded, 1 of them after only 90 min.
@AnthropicAI
Anthropic
2 months
New Anthropic research: Auditing Language Models for Hidden Objectives. We deliberately trained a model with a hidden misaligned objective and put researchers to the test: Could they figure out the objective without being told?
Tweet media one
25
40
441
@janleike
Jan Leike
2 years
I fondly remember the days when people were arguing intensely whether AI is bee level or rat level.
22
20
412
@janleike
Jan Leike
2 years
An important test for humanity will be whether we can collectively decide not to open source LLMs that can reliably survive and spread on their own. Once spreading, LLMs will get up to all kinds of crime, it'll be hard to catch all copies, and we'll fight over who's responsible.
177
51
393
@janleike
Jan Leike
4 months
@elder_plinius @AnthropicAI do all 8 levels with one jailbreak.
19
4
407
@janleike
Jan Leike
10 months
@karpathy I don't think the comparison between RLHF and RL on go really makes sense this way. You don't need RLHF to train AI to play go because there is a highly reliable procedural reward function that looks at the board state and decides who won. If you didn't have this procedural.
8
25
393
@janleike
Jan Leike
1 year
The superalignment fast grants are now decided! We got a *ton* of really strong applications, so unfortunately we had to say no to many we're very excited about. There is still so much good research waiting to be funded. Congrats to all recipients!
18
20
370
@janleike
Jan Leike
4 years
We're hiring research engineers for alignment work at @OpenAI! If you're excited about finetuning gpt3-sized language models to be better at following human intentions, then this is for you! Apply here:
6
75
349
@janleike
Jan Leike
3 years
Really looking forward to working with the legendary Scott Aaronson!
11
30
336
@janleike
Jan Leike
2 months
Somewhat surprising that faithfulness of chain-of-thought doesn't improve much with outcome-based RL.
@AnthropicAI
Anthropic
2 months
New Anthropic research: Do reasoning models accurately verbalize their reasoning? Our new paper shows they don't. This casts doubt on whether monitoring chains-of-thought (CoT) will be enough to reliably catch safety issues.
Tweet media one
22
29
339
@janleike
Jan Leike
2 years
Jailbreaking LLMs through input images might end up being a nasty problem. It's likely much harder to defend against than text jailbreaks because it's a continuous space. Despite a decade of research we don't know how to make vision models adversarially robust.
36
38
324
@janleike
Jan Leike
5 months
Not what I signed up for when I joined OpenAI. The nonprofit needs to uphold the OpenAI mission!
11
10
324
@janleike
Jan Leike
4 months
@elder_plinius @AnthropicAI you will have fully broken our defense ✨.
16
2
322
@janleike
Jan Leike
2 years
The alignment problem is very tractable. We haven't figured out how to solve it yet, but with focus and dedication we will.
60
30
301
@janleike
Jan Leike
2 years
Really interesting result on using LLMs to do math: Supervising every step works better than only checking the answer. Some thoughts on how this matters for alignment 👇
15
53
305
@janleike
Jan Leike
2 years
GPT-4 is safer and more aligned than any other model OpenAI has deployed before. Yet it's not perfect. There is still a lot to do to improve safety and we're planning to make updates over the coming months. Huge congrats to the team on all the progress! 🎉
18
19
280
@janleike
Jan Leike
2 years
It's been heartening to see so many more people lately starting to take existential risk from AI seriously and speaking up about it. It's a first step towards solving the problem.
23
21
270
@janleike
Jan Leike
4 months
@theojaffee he didn't break the defense, he just hacked the UI.
16
2
281
@janleike
Jan Leike
5 years
Today was my last day at @DeepMind. It's been an amazing journey; I've learned so many things and got to work with so many amazing people! Excited for what comes next!
17
10
272
@janleike
Jan Leike
3 years
Super exciting new research milestone on alignment: We trained language models to assist human feedback! Our models help humans find 50% more flaws in summaries than they would have found unassisted.
8
46
266
@janleike
Jan Leike
2 years
If you're into practical alignment, consider applying to @lilianweng's team. They're building some really exciting stuff:
- Automatically extract intent from a fine-tuning dataset
- Make models robust to jailbreaks
- Detect & mitigate harmful use
- .
12
32
245
@janleike
Jan Leike
4 months
@elder_plinius @AnthropicAI We released the paper with the details on how it works so anyone can recreate this system. I don't think we can publicly release the dataset because it's too infohazard-y.
19
4
249
@janleike
Jan Leike
4 months
Why are we working on jailbreaking robustness? 🧵👇
@janleike
Jan Leike
4 months
Super exciting robustness result: We built a system that defends against universal jailbreaks! It comes with a minimal increase in refusal rate and moderate inference cost.
43
13
251
@janleike
Jan Leike
5 months
Why not fund initiatives that help ensure AGI is beneficial, like AI governance initiatives, safety and alignment research, and easing impacts on the labor market?
7
6
238
@janleike
Jan Leike
2 years
Great conversation with @robertwiblin on how alignment is one of the most interesting ML problems, what the Superalignment Team is working on, what roles we're hiring for, what's needed to reach an awesome future, and much more. 👇 Check it out 👇.
14
38
226
@janleike
Jan Leike
2 years
@ml_anon22 True, but you can remember them using this picture
Tweet media one
1
11
221
@janleike
Jan Leike
4 months
👀.
@AnthropicAI
Anthropic
4 months
Nobody has fully jailbroken our system yet, so we're upping the ante. We’re now offering $10K to the first person to pass all eight levels, and $20K to the first person to pass all eight levels with a universal jailbreak. Full details:
52
9
228
@janleike
Jan Leike
4 months
@BenPielstick @theojaffee hacking the UI doesn't let you extract dangerous knowledge from the LLM, which is what we're trying to defend against here.
5
2
223
@janleike
Jan Leike
4 months
@elder_plinius We don't want to open source the datasets but we might provide a different incentive. Stay tuned.
35
7
222
@janleike
Jan Leike
2 years
New blog post on why I'm excited about OpenAI's approach to alignment, including some responses to common objections:
9
26
210
@janleike
Jan Leike
7 months
@jachiam0 How about a friendly game of who-can-make-their-models-more-aligned followed by a jailbreaking competition and a face-off eliciting dangerous capabilities from each other's models?.
12
6
208
@janleike
Jan Leike
3 years
Every organization attempting to build AGI should be transparent about their alignment plans.
12
16
197
@janleike
Jan Leike
9 months
If your model causes mass casualties or >$500 million in damages, something has clearly gone very wrong. Such a scenario is not a normal part of innovation.
20
27
196
@janleike
Jan Leike
7 years
The agent alignment problem may be one of the biggest obstacles for using ML to improve people’s lives. Today I’m very excited to share a research direction for how we’ll aim to solve alignment at @DeepMindAI. Blog post: Paper:
Tweet media one
4
34
194
@janleike
Jan Leike
3 months
@caleb_parikh They sent 7,867 messages, and passed 1,408 of them on to the auto-grader. We estimate that they probably spent over 40 hours on this in total.
5
1
195
@janleike
Jan Leike
2 years
@alexeyguzey @SimonLermenAI We'll have some evidence to share soon.
7
8
189
@janleike
Jan Leike
2 years
If AI ever goes rogue, just remember to make yourself really tall. It will be intimidated and leave you alone.
23
20
171
@janleike
Jan Leike
3 years
RSA was published 45 years ago and yet the universally accepted way to digitally sign a document is to make an indecipherable squiggle on a touch screen that no one ever checks.
3
8
166
@janleike
Jan Leike
6 years
How do we uncover failures in ML models that occur too rarely during testing? How do we prove their absence? Very excited about the work by @DeepMindAI's Robust & Verified AI team that sheds light on these questions! Check out their blog post:
0
47
175
@janleike
Jan Leike
2 years
New documentation on language models used in OpenAI's research is up, including some more info on different InstructGPT variants:
6
28
177
@janleike
Jan Leike
4 months
We add two classifiers on top of an LLM to block inputs and outputs. Jailbreaking our system requires jailbreaking all three models at the same time: the LLM and our two classifiers.
6
9
174
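As a rough mental model of that setup, the sketch below wires a hypothetical input classifier and output classifier around a generation call. None of these interfaces are Anthropic's actual implementation; it only illustrates why a successful jailbreak has to get past all three stages at once.

```python
# Illustrative sketch of the "classifier sandwich" described above.
# The classifier and model interfaces are hypothetical placeholders,
# not Anthropic's actual API or implementation.
from typing import Callable

def guarded_generate(
    prompt: str,
    input_classifier: Callable[[str], float],   # estimated probability the prompt is harmful
    llm: Callable[[str], str],                   # the underlying language model
    output_classifier: Callable[[str], float],   # estimated probability the completion is harmful
    threshold: float = 0.5,
) -> str:
    # Stage 1: block harmful prompts before they reach the model.
    if input_classifier(prompt) >= threshold:
        return "[refused: input flagged]"
    completion = llm(prompt)
    # Stage 2: block harmful completions before they reach the user.
    if output_classifier(completion) >= threshold:
        return "[refused: output flagged]"
    return completion

# A jailbreak only succeeds if it slips past the input classifier, elicits
# harmful content from the LLM, and that content also evades the output
# classifier -- i.e., all three models must fail at once.
```

The threshold is the knob that trades robustness against over-refusal and extra inference cost, which is what the "minimal increase in refusal rate and moderate inference cost" framing above is about.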
@janleike
Jan Leike
3 months
As models get more capable, robustness to jailbreaks becomes a key safety requirement to protect against misuse related to chemical, biological, radiological, and nuclear risks. Our demo showed classifiers can help mitigate these risks, but need to be combined with other methods.
6
6
166
@janleike
Jan Leike
1 year
Some statistics on the superalignment fast grants: We funded 50 out of ~2,700 applications, awarding a total of $9,895,000.
Median grant size: $150k
Average grant size: $198k
Smallest grant size: $50k
Largest grant size: $500k
Grantees:
Universities: $5.7m (22)
Graduate.
10
17
159