Jan Leike Profile
Jan Leike

@janleike

Followers
114K
Following
4K
Media
34
Statuses
721

ML Researcher @AnthropicAI. Previously OpenAI & DeepMind. Optimizing for a post-AGI future where humanity flourishes. Opinions aren't my employer's.

San Francisco, USA
Joined March 2016
Don't wanna be here? Send us removal request.
@janleike
Jan Leike
1 year
I'm excited to join @AnthropicAI to continue the superalignment mission!. My new team will work on scalable oversight, weak-to-strong generalization, and automated alignment research. If you're interested in joining, my dms are open.
391
503
9K
@janleike
Jan Leike
6 hours
The fellows program has been very successful, so we want to scale it up and we're looking for someone to run it. This is a pretty impactful role and you don't need to be very technical. Plus Ethan is amazing to work with!.
@EthanJPerez
Ethan Perez
7 hours
We’re hiring someone to run the Anthropic Fellows Program!. Our research collaborations have led to some of our best safety research and hires. We’re looking for an exceptional ops generalist, TPM, or research/eng manager to help us significantly scale and improve our collabs 🧵.
3
5
75
@janleike
Jan Leike
22 days
If you want to get into alignment research, imo this is one of the best ways to do it. Some previous fellows did some of the most interesting research I've seen this year and >20% ended up joining Anthropic full-time. Application deadline is this Sunday!.
@AnthropicAI
Anthropic
1 month
We’re running another round of the Anthropic Fellows program. If you're an engineer or researcher with a strong coding or technical background, you can apply to receive funding, compute, and mentorship from Anthropic, beginning this October. There'll be around 32 places.
Tweet media one
17
18
350
@andrewchen
andrew chen
2 days
A16Z SPEEDRUN 2026 UPDATE:.I think most people secretly know if they’re founders or not. Some of you can never be happy working inside a giant company, writing docs, in endless meetings. Deep down, you know you’re supposed to build. we're opening up a16z speedrun today! We are
114
178
1K
@janleike
Jan Leike
1 month
RT @Jack_W_Lindsey: We're launching an "AI psychiatry" team as part of interpretability efforts at Anthropic!  We'll be researching phenome….
Tweet card summary image
job-boards.greenhouse.io
San Francisco, CA
0
212
0
@janleike
Jan Leike
1 month
More on the alignment audits:
@janleike
Jan Leike
6 months
Could we spot a misaligned model in the wild?. To find out, we trained a model with hidden misalignments and asked other researchers to uncover them in a blind experiment. 3/4 teams succeeded, 1 of them after only 90 min.
1
0
15
@janleike
Jan Leike
1 month
In March we published a paper on alignment audits: teams of humans were tasked to find the problems in model we trained to be misaligned. Now we have agents that can do it automatically 42% of the time.
@AnthropicAI
Anthropic
1 month
New Anthropic research: Building and evaluating alignment auditing agents. We developed three AI agents to autonomously complete alignment auditing tasks. In testing, our agents successfully uncovered hidden goals, built safety evaluations, and surfaced concerning behaviors.
Tweet media one
17
26
230
@janleike
Jan Leike
2 months
The authors of this paper say this; I'm just more pessimistic than them about how useful this will be.
4
0
30
@useorigin
Origin Financial
8 hours
New look. New Navigation. New Origin. incoming soon.
1
3
19
@janleike
Jan Leike
2 months
Lots of egregious safety failures probably require reasoning, which is often hard for LLMs to do without showing its hand in the CoT. Probably. Often. A lot of caveats.
3
0
23
@janleike
Jan Leike
2 months
To be clear: CoT monitoring is useful and can let you discover instances of the model hacking rewards, faking alignment, etc. But absence of bad "thoughts" is not evidence that the model is aligned. There are plenty of examples of prod LLMs having misleading CoTs.
1
4
48
@janleike
Jan Leike
2 months
If you don't train your CoTs to look nice, you could get some safety from monitoring them. This seems good to do!. But I'm skeptical this will work reliably enough to be load-bearing in a safety case. Plus as RL is scaled up, I expect CoTs to become less and less legible.
@balesni
Mikita Balesni 🇺🇦
2 months
A simple AGI safety technique: AI’s thoughts are in plain English, just read them. We know it works, with OK (not perfect) transparency!. The risk is fragility: RL training, new architectures, etc threaten transparency. Experts from many orgs agree we should try to preserve it:
Tweet media one
24
14
313
@janleike
Jan Leike
4 months
RT @sleepinyourhat: 🧵✨🙏 With the new Claude Opus 4, we conducted what I think is by far the most thorough pre-launch alignment assessment t….
0
164
0
@janleike
Jan Leike
4 months
If you want to read more about Sonnet & Opus 4, including a bunch of alignment and reward hacking findings, check out the model card:.
7
1
98
@janleike
Jan Leike
4 months
It's also (afaik) the first ever frontier model to come with a safety case – a document laying out a detailed argument why we believe the system is safe enough to deploy, despite the increased risks for misuse.
Tweet card summary image
anthropic.com
We have activated the AI Safety Level 3 (ASL-3) Deployment and Security Standards described in Anthropic’s Responsible Scaling Policy (RSP) in conjunction with launching Claude Opus 4. The ASL-3...
5
11
183
@janleike
Jan Leike
4 months
These mitigations address jailbreaking for catastrophic misuse and are based on the constitutional classifiers we demoed earlier this year 👇.
@AnthropicAI
Anthropic
7 months
New Anthropic research: Constitutional Classifiers to defend against universal jailbreaks. We’re releasing a paper along with a demo where we challenge you to jailbreak the system.
Tweet media one
4
3
104
@janleike
Jan Leike
4 months
So many things to love about Claude 4! My favorite is that the model is so strong that we had to turn on additional safety mitigations according to Anthropic's responsible scaling policy.
89
46
1K
@TradeTherapyllc
Trade Therapy, LLC
1 day
Is NVDA distributing (selling off)?. Most traders look at price. We look at volume. Institutions aren't exiting. They're managing inventory through quiet, controlled profit-taking. Big difference. Watch this: “Volume Behavior Reveals the Truth”. A thread below:.
4
0
1
@janleike
Jan Leike
5 months
RT @yanda_chen_: My first paper @AnthropicAI is out!. We show that Chains-of-Thought often don’t reflect models’ true reasoning—posing chal….
0
87
0
@janleike
Jan Leike
5 months
Somewhat surprising that faithfulness of chain-of-thought doesn't improve much with outcome-based RL.
@AnthropicAI
Anthropic
5 months
New Anthropic research: Do reasoning models accurately verbalize their reasoning?. Our new paper shows they don't. This casts doubt on whether monitoring chains-of-thought (CoT) will be enough to reliably catch safety issues.
Tweet media one
20
29
341
@janleike
Jan Leike
6 months
My impression is the future bottleneck will be fixing misalignments, not discovering them:. There are a few recent examples of badly misaligned models (alignment faking, backdoored code -> evil model, reward hacks); better solutions are needed, especially for superhuman models!.
5
4
33
@janleike
Jan Leike
6 months
As models get smarter, they will get better at hiding their misalignments, so this is no guarantee that we'll be able to uncover important misalignments in the future. But our techniques for finding them will improve as well!.
2
0
21
@janleike
Jan Leike
6 months
I was quite surprised how successful the "auditing" teams were at finding the hidden misalignments and how many different strategies worked. A promising application of current interpretability techniques!.
1
0
28
@ebukstel
Edward Bukstel
2 days
Private Equity owns a substantial number of significant lawyer directories. How will AI change the narrative for legal marketing?
Tweet media one
2
1
3