Jan Leike @janleike X Profile

Jan Leike

@janleike

Followers

114K

Following

4K

Media

34

Statuses

721

ML Researcher @AnthropicAI. Previously OpenAI & DeepMind. Optimizing for a post-AGI future where humanity flourishes. Opinions aren't my employer's.

San Francisco, USA

Joined March 2016

Don't wanna be here? Send us removal request.

Jan Leike

@janleike

1 year

I'm excited to join @AnthropicAI to continue the superalignment mission!. My new team will work on scalable oversight, weak-to-strong generalization, and automated alignment research. If you're interested in joining, my dms are open.

391

503

9K

Jan Leike

@janleike

6 hours

The fellows program has been very successful, so we want to scale it up and we're looking for someone to run it. This is a pretty impactful role and you don't need to be very technical. Plus Ethan is amazing to work with!.

Ethan Perez

@EthanJPerez

7 hours

We’re hiring someone to run the Anthropic Fellows Program!. Our research collaborations have led to some of our best safety research and hires. We’re looking for an exceptional ops generalist, TPM, or research/eng manager to help us significantly scale and improve our collabs 🧵.

3

5

75

Jan Leike

@janleike

22 days

If you want to get into alignment research, imo this is one of the best ways to do it. Some previous fellows did some of the most interesting research I've seen this year and >20% ended up joining Anthropic full-time. Application deadline is this Sunday!.

Anthropic

@AnthropicAI

1 month

We’re running another round of the Anthropic Fellows program. If you're an engineer or researcher with a strong coding or technical background, you can apply to receive funding, compute, and mentorship from Anthropic, beginning this October. There'll be around 32 places.

17

18

350

andrew chen

@andrewchen

2 days

A16Z SPEEDRUN 2026 UPDATE:.I think most people secretly know if they’re founders or not. Some of you can never be happy working inside a giant company, writing docs, in endless meetings. Deep down, you know you’re supposed to build. we're opening up a16z speedrun today! We are

114

178

1K

Jan Leike

@janleike

1 month

RT @Jack_W_Lindsey: We're launching an "AI psychiatry" team as part of interpretability efforts at Anthropic! We'll be researching phenome….

job-boards.greenhouse.io

San Francisco, CA

0

212

0

Jan Leike

@janleike

1 month

More on the alignment audits:

Jan Leike

@janleike

6 months

Could we spot a misaligned model in the wild?. To find out, we trained a model with hidden misalignments and asked other researchers to uncover them in a blind experiment. 3/4 teams succeeded, 1 of them after only 90 min.

1

0

15

Jan Leike

@janleike

1 month

In March we published a paper on alignment audits: teams of humans were tasked to find the problems in model we trained to be misaligned. Now we have agents that can do it automatically 42% of the time.

Anthropic

@AnthropicAI

1 month

New Anthropic research: Building and evaluating alignment auditing agents. We developed three AI agents to autonomously complete alignment auditing tasks. In testing, our agents successfully uncovered hidden goals, built safety evaluations, and surfaced concerning behaviors.

17

26

230

Jan Leike

@janleike

2 months

The authors of this paper say this; I'm just more pessimistic than them about how useful this will be.

4

0

30

Origin Financial

@useorigin

8 hours

New look. New Navigation. New Origin. incoming soon.

1

3

19

Jan Leike

@janleike

2 months

Lots of egregious safety failures probably require reasoning, which is often hard for LLMs to do without showing its hand in the CoT. Probably. Often. A lot of caveats.

3

0

23

Jan Leike

@janleike

2 months

To be clear: CoT monitoring is useful and can let you discover instances of the model hacking rewards, faking alignment, etc. But absence of bad "thoughts" is not evidence that the model is aligned. There are plenty of examples of prod LLMs having misleading CoTs.

1

4

48

Jan Leike

@janleike

2 months

If you don't train your CoTs to look nice, you could get some safety from monitoring them. This seems good to do!. But I'm skeptical this will work reliably enough to be load-bearing in a safety case. Plus as RL is scaled up, I expect CoTs to become less and less legible.

Mikita Balesni 🇺🇦

@balesni

2 months

A simple AGI safety technique: AI’s thoughts are in plain English, just read them. We know it works, with OK (not perfect) transparency!. The risk is fragility: RL training, new architectures, etc threaten transparency. Experts from many orgs agree we should try to preserve it:

24

14

313

Jan Leike

@janleike

4 months

RT @sleepinyourhat: 🧵✨🙏 With the new Claude Opus 4, we conducted what I think is by far the most thorough pre-launch alignment assessment t….

0

164

0

Jan Leike

@janleike

4 months

If you want to read more about Sonnet & Opus 4, including a bunch of alignment and reward hacking findings, check out the model card:.

7

1

98

Jan Leike

@janleike

4 months

It's also (afaik) the first ever frontier model to come with a safety case – a document laying out a detailed argument why we believe the system is safe enough to deploy, despite the increased risks for misuse.

anthropic.com

We have activated the AI Safety Level 3 (ASL-3) Deployment and Security Standards described in Anthropic’s Responsible Scaling Policy (RSP) in conjunction with launching Claude Opus 4. The ASL-3...

5

11

183

Jan Leike

@janleike

4 months

These mitigations address jailbreaking for catastrophic misuse and are based on the constitutional classifiers we demoed earlier this year 👇.

Anthropic

@AnthropicAI

7 months

New Anthropic research: Constitutional Classifiers to defend against universal jailbreaks. We’re releasing a paper along with a demo where we challenge you to jailbreak the system.

4

3

104

Jan Leike

@janleike

4 months

So many things to love about Claude 4! My favorite is that the model is so strong that we had to turn on additional safety mitigations according to Anthropic's responsible scaling policy.

89

46

1K

Trade Therapy, LLC

@TradeTherapyllc

1 day

Is NVDA distributing (selling off)?. Most traders look at price. We look at volume. Institutions aren't exiting. They're managing inventory through quiet, controlled profit-taking. Big difference. Watch this: “Volume Behavior Reveals the Truth”. A thread below:.

4

0

1

Jan Leike

@janleike

5 months

RT @yanda_chen_: My first paper @AnthropicAI is out!. We show that Chains-of-Thought often don’t reflect models’ true reasoning—posing chal….

0

87

0

Jan Leike

@janleike

5 months

Somewhat surprising that faithfulness of chain-of-thought doesn't improve much with outcome-based RL.

Anthropic

@AnthropicAI

5 months

New Anthropic research: Do reasoning models accurately verbalize their reasoning?. Our new paper shows they don't. This casts doubt on whether monitoring chains-of-thought (CoT) will be enough to reliably catch safety issues.

20

29

341

Jan Leike

@janleike

6 months

Check out the blog post for details 👇.

anthropic.com

A collaboration between Anthropic's Alignment Science and Interpretability teams

2

22

Jan Leike

@janleike

6 months

My impression is the future bottleneck will be fixing misalignments, not discovering them:. There are a few recent examples of badly misaligned models (alignment faking, backdoored code -> evil model, reward hacks); better solutions are needed, especially for superhuman models!.

5

4

33

Jan Leike

@janleike

6 months

As models get smarter, they will get better at hiding their misalignments, so this is no guarantee that we'll be able to uncover important misalignments in the future. But our techniques for finding them will improve as well!.

2

0

21

Jan Leike

@janleike

6 months

I was quite surprised how successful the "auditing" teams were at finding the hidden misalignments and how many different strategies worked. A promising application of current interpretability techniques!.

1

0

28

Edward Bukstel

@ebukstel

2 days

Private Equity owns a substantial number of significant lawyer directories. How will AI change the narrative for legal marketing?

2

1

3