
Jan Leike
@janleike
Followers: 114K
Following: 4K
Media: 34
Statuses: 721
ML Researcher @AnthropicAI. Previously OpenAI & DeepMind. Optimizing for a post-AGI future where humanity flourishes. Opinions aren't my employer's.
San Francisco, USA
Joined March 2016
I'm excited to join @AnthropicAI to continue the superalignment mission! My new team will work on scalable oversight, weak-to-strong generalization, and automated alignment research. If you're interested in joining, my DMs are open.
391
503
9K
The fellows program has been very successful, so we want to scale it up and we're looking for someone to run it. This is a pretty impactful role, and you don't need to be very technical. Plus, Ethan is amazing to work with!
We’re hiring someone to run the Anthropic Fellows Program! Our research collaborations have led to some of our best safety research and hires. We’re looking for an exceptional ops generalist, TPM, or research/eng manager to help us significantly scale and improve our collabs 🧵
3
5
75
If you want to get into alignment research, imo this is one of the best ways to do it. Some previous fellows did some of the most interesting research I've seen this year, and >20% ended up joining Anthropic full-time. The application deadline is this Sunday!
We’re running another round of the Anthropic Fellows Program. If you're an engineer or researcher with a strong coding or technical background, you can apply to receive funding, compute, and mentorship from Anthropic, beginning this October. There'll be around 32 places.
17
18
350
RT @Jack_W_Lindsey: We're launching an "AI psychiatry" team as part of interpretability efforts at Anthropic! We'll be researching phenome….
job-boards.greenhouse.io
San Francisco, CA
0
212
0
In March we published a paper on alignment audits: teams of humans were tasked with finding the problems in a model we trained to be misaligned. Now we have agents that can do it automatically 42% of the time.
New Anthropic research: Building and evaluating alignment auditing agents. We developed three AI agents to autonomously complete alignment auditing tasks. In testing, our agents successfully uncovered hidden goals, built safety evaluations, and surfaced concerning behaviors.
17
26
230
If you don't train your CoTs to look nice, you could get some safety from monitoring them. This seems good to do! But I'm skeptical this will work reliably enough to be load-bearing in a safety case. Plus, as RL is scaled up, I expect CoTs to become less and less legible.
A simple AGI safety technique: AI’s thoughts are in plain English, just read them. We know it works, with OK (not perfect) transparency! The risk is fragility: RL training, new architectures, etc. threaten transparency. Experts from many orgs agree we should try to preserve it:
24
14
313
RT @sleepinyourhat: 🧵✨🙏 With the new Claude Opus 4, we conducted what I think is by far the most thorough pre-launch alignment assessment t….
0
164
0
It's also (afaik) the first-ever frontier model to come with a safety case – a document laying out a detailed argument for why we believe the system is safe enough to deploy, despite the increased risk of misuse.
anthropic.com
We have activated the AI Safety Level 3 (ASL-3) Deployment and Security Standards described in Anthropic’s Responsible Scaling Policy (RSP) in conjunction with launching Claude Opus 4. The ASL-3...
5
11
183
These mitigations address jailbreaking for catastrophic misuse and are based on the constitutional classifiers we demoed earlier this year 👇.
New Anthropic research: Constitutional Classifiers to defend against universal jailbreaks. We’re releasing a paper along with a demo where we challenge you to jailbreak the system.
4
3
104
RT @yanda_chen_: My first paper @AnthropicAI is out! We show that Chains-of-Thought often don’t reflect models’ true reasoning—posing chal….
0
87
0
Somewhat surprising that faithfulness of chain-of-thought doesn't improve much with outcome-based RL.
New Anthropic research: Do reasoning models accurately verbalize their reasoning? Our new paper shows they don't. This casts doubt on whether monitoring chains-of-thought (CoT) will be enough to reliably catch safety issues.
20
29
341
Check out the blog post for details 👇.
anthropic.com
A collaboration between Anthropic's Alignment Science and Interpretability teams
2
2
22