Jan Leike Profile
Jan Leike

@janleike

Followers
111K
Following
4K
Media
33
Statuses
713

ML Researcher @AnthropicAI. Previously OpenAI & DeepMind. Optimizing for a post-AGI future where humanity flourishes. Opinions aren't my employer's.

San Francisco, USA
Joined March 2016
Don't wanna be here? Send us removal request.
@janleike
Jan Leike
1 year
I'm excited to join @AnthropicAI to continue the superalignment mission!. My new team will work on scalable oversight, weak-to-strong generalization, and automated alignment research. If you're interested in joining, my dms are open.
389
503
9K
@janleike
Jan Leike
2 days
The authors of this paper say this; I'm just more pessimistic than them about how useful this will be.
4
0
24
@janleike
Jan Leike
2 days
Lots of egregious safety failures probably require reasoning, which is often hard for LLMs to do without showing its hand in the CoT. Probably. Often. A lot of caveats.
1
0
16
@janleike
Jan Leike
2 days
To be clear: CoT monitoring is useful and can let you discover instances of the model hacking rewards, faking alignment, etc. But absence of bad "thoughts" is not evidence that the model is aligned. There are plenty of examples of prod LLMs having misleading CoTs.
1
4
40
@janleike
Jan Leike
2 days
If you don't train your CoTs to look nice, you could get some safety from monitoring them. This seems good to do!. But I'm skeptical this will work reliably enough to be load-bearing in a safety case. Plus as RL is scaled up, I expect CoTs to become less and less legible.
@balesni
Mikita Balesni 🇺🇦
2 days
A simple AGI safety technique: AI’s thoughts are in plain English, just read them. We know it works, with OK (not perfect) transparency!. The risk is fragility: RL training, new architectures, etc threaten transparency. Experts from many orgs agree we should try to preserve it:
Tweet media one
21
13
285
@janleike
Jan Leike
2 months
RT @sleepinyourhat: 🧵✨🙏 With the new Claude Opus 4, we conducted what I think is by far the most thorough pre-launch alignment assessment t….
0
165
0
@janleike
Jan Leike
2 months
If you want to read more about Sonnet & Opus 4, including a bunch of alignment and reward hacking findings, check out the model card:.
7
1
97
@janleike
Jan Leike
2 months
It's also (afaik) the first ever frontier model to come with a safety case – a document laying out a detailed argument why we believe the system is safe enough to deploy, despite the increased risks for misuse.
5
11
181
@janleike
Jan Leike
2 months
These mitigations address jailbreaking for catastrophic misuse and are based on the constitutional classifiers we demoed earlier this year 👇.
@AnthropicAI
Anthropic
5 months
New Anthropic research: Constitutional Classifiers to defend against universal jailbreaks. We’re releasing a paper along with a demo where we challenge you to jailbreak the system.
Tweet media one
4
3
103
@janleike
Jan Leike
2 months
So many things to love about Claude 4! My favorite is that the model is so strong that we had to turn on additional safety mitigations according to Anthropic's responsible scaling policy.
90
46
1K
@janleike
Jan Leike
4 months
RT @yanda_chen_: My first paper @AnthropicAI is out!. We show that Chains-of-Thought often don’t reflect models’ true reasoning—posing chal….
0
87
0
@janleike
Jan Leike
4 months
Somewhat surprising that faithfulness of chain-of-thought doesn't improve much with outcome-based RL.
@AnthropicAI
Anthropic
4 months
New Anthropic research: Do reasoning models accurately verbalize their reasoning?. Our new paper shows they don't. This casts doubt on whether monitoring chains-of-thought (CoT) will be enough to reliably catch safety issues.
Tweet media one
20
29
343
@janleike
Jan Leike
4 months
Check out the blog post for details 👇.
1
2
21
@janleike
Jan Leike
4 months
My impression is the future bottleneck will be fixing misalignments, not discovering them:. There are a few recent examples of badly misaligned models (alignment faking, backdoored code -> evil model, reward hacks); better solutions are needed, especially for superhuman models!.
5
4
32
@janleike
Jan Leike
4 months
As models get smarter, they will get better at hiding their misalignments, so this is no guarantee that we'll be able to uncover important misalignments in the future. But our techniques for finding them will improve as well!.
2
0
20
@janleike
Jan Leike
4 months
I was quite surprised how successful the "auditing" teams were at finding the hidden misalignments and how many different strategies worked. A promising application of current interpretability techniques!.
1
0
27
@janleike
Jan Leike
4 months
The misaligned model is trained to satisfy a list of reward model biases, but hide that it is trying to do so. Some of the reward model biases were only described to it in text and not reinforced with RL. Interestingly, the model still generalized to perform those behaviors.
1
0
23
@janleike
Jan Leike
4 months
Could we spot a misaligned model in the wild?. To find out, we trained a model with hidden misalignments and asked other researchers to uncover them in a blind experiment. 3/4 teams succeeded, 1 of them after only 90 min.
@AnthropicAI
Anthropic
4 months
New Anthropic research: Auditing Language Models for Hidden Objectives. We deliberately trained a model with a hidden misaligned objective and put researchers to the test: Could they figure out the objective without being told?
Tweet media one
24
41
448
@janleike
Jan Leike
5 months
More details in the updated blog post:.
2
3
111
@janleike
Jan Leike
5 months
We’re very grateful to everyone who helped test our system’s jailbreaking robustness and contributed to our safety research. Y’all are awesome :).
4
1
144
@janleike
Jan Leike
5 months
As models get more capable, robustness to jailbreaks becomes a key safety requirement to protect against misuse related to chemical, biological, radiological, and nuclear risks. Our demo showed classifiers can help mitigate these risks, but need to be combined with other methods.
6
6
165