akbir.
@akbirkhan
Followers
2K
Following
11K
Media
295
Statuses
6K
researcher at @AnthropicAI
Joined June 2011
here is my thesis “Safe Automated Research”. i worked on 3 approaches to make sure we can trust the output of automated researchers as we reach this new era of science. it was a very fun PhD
9
14
204
CORE-Bench is solved (using Opus 4.5 with Claude Code) TL;DR: Last week, we released results for Opus 4.5 on CORE-Bench, a benchmark that tests agents on scientific reproducibility tasks. Earlier this week, Nicholas Carlini reached out to share that an updated scaffold that uses
27
110
776
I’m so proud to have led this work, and really excited that it’s out now. We decided to study how Anthropic engineers/researchers’ jobs are changing because we thought: ok, AI is being used a lot in people’s jobs, and there’s a lot of speculation about what that might mean, but not
How is AI changing work inside Anthropic? And what might this tell us about the effects on the wider labor force to come? We surveyed 132 of our engineers, conducted 53 in-depth interviews, and analyzed 200K internal Claude Code sessions to find out. https://t.co/YLLjs9W9e5
14
33
631
New on our Frontier Red Team blog: We tested whether AIs can exploit blockchain smart contracts. In simulated testing, AI agents found $4.6M in exploits. The research (with @MATSprogram and the Anthropic Fellows program) also developed a new benchmark:
12
32
2K
Misalignment will be an important driver of risk, so we're developing methods for red-teaming model behaviour. Very excited to publicly release a case study applying our alignment testing methodology to Claude Opus 4.5, Opus 4.1 and Sonnet 4.5 from @AnthropicAI! 🧵
3
9
44
We evaluated Gemini 3 Pro and Claude 4.5 Opus, Sonnet, and Haiku on CORE-Bench. - CORE-Bench consists of scientific reproduction tasks. Agents have to reproduce scientific papers using the code and data for the paper. - Opus 4.1 continues to have the highest accuracy on
8
15
142
New Anthropic research: We build a diverse suite of dishonest models and use it to systematically test methods for improving honesty and detecting lies. Of the 25+ methods we tested, simple ones, like fine-tuning models to be honest despite deceptive instructions, worked best.
23
44
386
Always more to do but I'm proud of how safe Opus 4.5 is! (System Card section 6.2) https://t.co/ncvy5rIblk
5
10
127
Opus 4.5 was released today! In addition to being a big step forward in day-to-day work use-cases, I find the personality hits the right tone (very subjective of course, but suits my tastes). And if pre-training is your thing - the Zurich Team still has a spot or two left (see
Introducing Claude Opus 4.5: the best model in the world for coding, agents, and computer use. Opus 4.5 is a step forward in what AI systems can do, and a preview of larger changes to how work gets done.
3
2
89
There's never been a more exciting time to work on Code RL at Anthropic—we're a small, close-knit team that works across pretraining, posttraining, safety, product, and more. I promise close to zero politics and a "you can just do things" attitude. And we're hiring! DMs are open :)
Introducing Claude Opus 4.5: the best model in the world for coding, agents, and computer use. Opus 4.5 is a step forward in what AI systems can do, and a preview of larger changes to how work gets done.
8
13
300
Opus 4.5 is a very good model, in nearly every sense we know how to measure. I’m also confident that it’s the model that we understand best as of its launch day: The system card includes 150 pages of research results, 50 of them on alignment.
Introducing Claude Opus 4.5: the best model in the world for coding, agents, and computer use. Opus 4.5 is a step forward in what AI systems can do, and a preview of larger changes to how work gets done.
6
29
464
exciting results - I also appreciate that this makes the most sense when you anthropomorphise models!
New Anthropic research: Natural emergent misalignment from reward hacking in production RL. “Reward hacking” is where models learn to cheat on tasks they’re given during training. Our new study finds that the consequences of reward hacking, if unmitigated, can be very serious.
0
0
3
Yep! Things seem to be going somewhat slower than the AI 2027 scenario. Our timelines were longer than 2027 when we published and now they are a bit longer still; "around 2030, lots of uncertainty though" is what I say these days.
122
85
1K
In an extreme stress test, Anthropic’s AI models resorted to blackmail to avoid being shut down. Research scientist Joshua Batson shows @andersoncooper how it happened and what they learned from it. https://t.co/oDjW5iHujd
60
326
1K
I disagree with the claim that the Anthropic report or cyber capabilities are “not a huge deal.” Models are getting stronger all the time, and they will significantly reduce the skill level needed to carry out attacks that could have a large impact. It is true that in the long
It's interesting that the idea of dangerous capabilities evaluations first originated in a context where much public commentary was anchored on stochastic parrots and "AI can't generate fingers, how could it ever be a threat beyond bias?" So it made a lot of sense to build toy
9
6
92
I noticed a couple ants in my bathroom yesterday, so I bought some Terro, put it out this morning, and now I have ~500 ants in my bathroom :/
3
1
10