Chris Olah
@ch402
Followers
126K
Following
11K
Media
470
Statuses
5K
Reverse engineering neural networks at @AnthropicAI. Previously @distillpub, OpenAI Clarity Team, Google Brain. Personal account.
San Francisco, CA
Joined June 2010
From everything we know so far, Opus 4.5 seems to be the best-aligned model out there in a bunch of ways. I follow the training process closely as part of my work on alignment evaluations. Here's my guess about the two things that are most responsible for making 4.5 special. đ§”
16
43
517
Maybe the best way to explain it is by contrast with twitter. Many similar issues get discussed on slack, but the norms of engagement, of giving serious thought before responding, of trying to be nuanced and thoughtful... It just feels night and day different.
2
1
61
Prove your trading skills with simulated capital and start earning real-money rewards!
0
0
6
This is one of my favorite things about Anthropic culture as well. Dario kind of embodies this, but there's more generally a kind of "essay culture", a shared practice of open, intellectual debate, with norms that expect a kind of seriousness and earnestness.
Dario's essays and long debate slack threads are one of my favorite parts of Anthropic's culture. They're open, detailed - and incredibly raw. Everyone at the company ends up having a good sense of how the company is making decisions and what matters. Its the kind of thing that
9
15
460
Dario's essays and long debate slack threads are one of my favorite parts of Anthropic's culture. They're open, detailed - and incredibly raw. Everyone at the company ends up having a good sense of how the company is making decisions and what matters. Its the kind of thing that
Anthropic's @_sholtodouglas says Dario Amodei's internal communication style is producing a compendium of essays in the company's Slack that will have effectively charted the history of AGI once we get there. "He has a really, really cool communication style. He quite frequently
33
41
1K
One analysis from our pre-release audit of Opus 4.5 stands out to me. Our behavioral evals uncovered an example of apparent deception by the model. By analyzing the internal activations, we identified a suspected root cause, and cases of similar behavior during training. (1/7)
13
50
398
Opus 4.5 is a very good model, in nearly every sense we know how to measure. Iâm also confident that itâs the model that we understand best as of its launch day: The system card includes 150 pages of research results, 50 of them on alignment.
Introducing Claude Opus 4.5: the best model in the world for coding, agents, and computer use. Opus 4.5 is a step forward in what AI systems can do, and a preview of larger changes to how work gets done.
6
29
463
New Anthropic research: Natural emergent misalignment from reward hacking in production RL. âReward hackingâ is where models learn to cheat on tasks theyâre given during training. Our new study finds that the consequences of reward hacking, if unmitigated, can be very serious.
202
574
4K
New Anthropic research: Signs of introspection in LLMs. Can language models recognize their own internal thoughts? Or do they just make up plausible answers when asked about them? We found evidence for genuineâthough limitedâintrospective capabilities in Claude.
298
805
5K
New paper! We reverse engineered the mechanisms underlying Claude Haikuâs ability to perform a simple âperceptualâ task. We discover beautiful feature families and manifolds, clean geometric transformations, and distributed attention algorithms!
44
316
2K
[Sonnet 4.5 đ§”] Here's the north-star goal for our pre-deployment alignment evals work: The information we share alongside a model should give you an accurate overall sense of the risks the model could pose. It wonât tell you everything, but you shouldnât be...
8
11
144
đđ Today we launched Claude Sonnet 4.5, which is not only highly capable but also a major improvement on safety and alignment https://t.co/NcBEBE8bLs
Introducing Claude Sonnet 4.5âthe best coding model in the world. It's the strongest model for building complex agents. It's the best model at using computers. And it shows substantial gains on tests of reasoning and math.
8
14
152
We're starting to apply interpretability to pre-deployment audits!
Prior to the release of Claude Sonnet 4.5, we conducted a white-box audit of the model, applying interpretability techniques to âread the modelâs mindâ in order to validate its reliability and alignment. This was the first such audit on a frontier LLM, to our knowledge. (1/15)
4
12
245
Prior to the release of Claude Sonnet 4.5, we conducted a white-box audit of the model, applying interpretability techniques to âread the modelâs mindâ in order to validate its reliability and alignment. This was the first such audit on a frontier LLM, to our knowledge. (1/15)
43
176
1K
Title: Advice for a young investigator in the first and last days of the Anthropocene Abstract: Within just a few years, it is likely that we will create AI systems that outperform the best humans on all intellectual tasks. This will have implications for your research and
57
262
2K
I believe in the rule of law. I believe in freedom of speech. I am against political violence. (I almost feel silly posting this, but it seems good to reinforce common knowledge that many still hold the basic pre-political commitments of liberal democracy.)
13
12
376
Anthropic is endorsing California State Senator Scott Wienerâs SB 53. This bill provides a strong foundation to govern powerful AI systems built by frontier AI companies like ours, and does so via transparency rather than technical micromanagement.
114
101
1K
Weâre hiring someone to run the Anthropic Fellows Program! Our research collaborations have led to some of our best safety research and hires. Weâre looking for an exceptional ops generalist, TPM, or research/eng manager to help us significantly scale and improve our collabs đ§”
10
42
259
Really exciting to see this hypothesis being explored more! I confess, I've become more and more persuaded of this in my personal thinking over time, since our extremely preliminary results in Toy Models. Great to see a more serious investigation!
What if adversarial examples aren't a bug, but a direct consequence of how neural networks process information? We've found evidence that superposition â the way networks represent many more features than they have neurons â might cause adversarial examples.
7
13
298