
Chris Olah
@ch402
Followers
123K
Following
11K
Media
468
Statuses
5K
Reverse engineering neural networks at @AnthropicAI. Previously @distillpub, OpenAI Clarity Team, Google Brain. Personal account.
San Francisco, CA
Joined June 2010
[Sonnet 4.5 🧵] Here's the north-star goal for our pre-deployment alignment evals work: The information we share alongside a model should give you an accurate overall sense of the risks the model could pose. It won’t tell you everything, but you shouldn’t be...
8
10
128
🎉🎉 Today we launched Claude Sonnet 4.5, which is not only highly capable but also a major improvement on safety and alignment https://t.co/NcBEBE8bLs
Introducing Claude Sonnet 4.5—the best coding model in the world. It's the strongest model for building complex agents. It's the best model at using computers. And it shows substantial gains on tests of reasoning and math.
6
13
142
We're starting to apply interpretability to pre-deployment audits!
Prior to the release of Claude Sonnet 4.5, we conducted a white-box audit of the model, applying interpretability techniques to “read the model’s mind” in order to validate its reliability and alignment. This was the first such audit on a frontier LLM, to our knowledge. (1/15)
3
12
231
Title: Advice for a young investigator in the first and last days of the Anthropocene Abstract: Within just a few years, it is likely that we will create AI systems that outperform the best humans on all intellectual tasks. This will have implications for your research and
55
255
2K
I believe in the rule of law. I believe in freedom of speech. I am against political violence. (I almost feel silly posting this, but it seems good to reinforce common knowledge that many still hold the basic pre-political commitments of liberal democracy.)
13
13
376
Anthropic is endorsing California State Senator Scott Wiener’s SB 53. This bill provides a strong foundation to govern powerful AI systems built by frontier AI companies like ours, and does so via transparency rather than technical micromanagement.
118
101
1K
We’re hiring someone to run the Anthropic Fellows Program! Our research collaborations have led to some of our best safety research and hires. We’re looking for an exceptional ops generalist, TPM, or research/eng manager to help us significantly scale and improve our collabs 🧵
10
41
255
Really exciting to see this hypothesis being explored more! I confess, I've become more and more persuaded of this in my personal thinking over time, since our extremely preliminary results in Toy Models. Great to see a more serious investigation!
What if adversarial examples aren't a bug, but a direct consequence of how neural networks process information? We've found evidence that superposition – the way networks represent many more features than they have neurons – might cause adversarial examples.
7
12
297
Our interpretability team is planning to mentor more fellows this cycle! Applications are due Aug 17.
We’re running another round of the Anthropic Fellows program. If you're an engineer or researcher with a strong coding or technical background, you can apply to receive funding, compute, and mentorship from Anthropic, beginning this October. There'll be around 32 places.
17
18
331
Our recent work on attribution graphs ( https://t.co/qbIhdV7OKz ) and its extension to attention ( https://t.co/Mf8JLvWH9K ) points toward how much potential they have if we can mitigate the issues.
transformer-circuits.pub
We investigate the internal mechanisms used by Claude 3.5 Haiku — Anthropic's lightweight production model — in a variety of contexts, using our circuit tracing methodology.
2
3
67
This is probably obvious, but to say it explicitly, I'm still very keen on SAEs and transcoders. :)
5
1
43
What's the point of all of this? For me, this question of mechanistic faithfulness is the most important question in all the SAE debate. I think it's often mixed in with other things and kind of implicit, and I wanted to have a simple example that clearly isolates it.
1
1
36
It turns out there's a fix! If we ask to match the Jacobian of absolute value, we get the correct solution again.
1
0
23
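The Jacobian-matching fix above can be sketched in code. This is a minimal toy illustration, not the thread's actual experiment: all names, dimensions, and loss weights here are assumptions. The idea is to train a small one-hidden-layer transcoder to mimic abs(x) while also matching its Jacobian, compared along random directions via a JVP (for abs, the target JVP is simply sign(x) * v).

```python
import torch

# Hypothetical toy setup (assumed, not the authors' exact code): a transcoder
# f(x) = W2 @ relu(W1 @ x) trained to mimic abs(x), matching both outputs
# and Jacobians, as in the fix described in the thread above.
torch.manual_seed(0)
d = 8        # input dimension (assumption)
hidden = 32  # transcoder width (assumption)

W1 = torch.randn(hidden, d, requires_grad=True)
W2 = torch.randn(d, hidden, requires_grad=True)

def transcoder(x):
    # x: (batch, d) -> (batch, d)
    return (W2 @ torch.relu(W1 @ x.T)).T

def target(x):
    return x.abs()

opt = torch.optim.Adam([W1, W2], lr=1e-2)
for step in range(200):
    x = torch.randn(64, d)
    # Output-matching term
    out_loss = (transcoder(x) - target(x)).pow(2).mean()
    # Jacobian-matching term: compare the transcoder's JVP along a random
    # direction v against the target's JVP, which for abs is sign(x) * v.
    v = torch.randn_like(x)
    _, jvp = torch.autograd.functional.jvp(transcoder, x, v, create_graph=True)
    jac_loss = (jvp - torch.sign(x) * v).pow(2).mean()
    loss = out_loss + jac_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Matching Jacobians constrains not just what the transcoder outputs but how its output varies locally, which is what rules out the mechanistically unfaithful solutions the thread describes.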
Since absolute value doesn't treat p specially, this solution isn't mechanistically faithful. It's using different computation to try to mimic abs.
1
0
18
But now let's add a repeated data point to the transcoder training data, p=[1,1,1,0,0,0,0...] The transcoder now learns a special feature to memorize that point!
1
0
21
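The memorization setup in the tweet above can be sketched as a toy experiment. Every detail here (dimensions, repeat fraction, architecture) is an assumption for illustration, not the authors' exact configuration: a small transcoder is trained to mimic abs(x) on batches where a single point p = [1,1,1,0,...,0] is heavily over-represented, and we then look at which hidden unit fires on p.

```python
import torch

# Toy illustration (assumed setup) of training data with a repeated point,
# under which a transcoder may devote a hidden feature to memorizing that
# point rather than reusing its generic computation for abs.
torch.manual_seed(0)
d = 8
hidden = 32
p = torch.zeros(d)
p[:3] = 1.0  # the repeated data point from the tweet

W1 = torch.randn(hidden, d, requires_grad=True)
W2 = torch.randn(d, hidden, requires_grad=True)

def transcoder(x):
    return (W2 @ torch.relu(W1 @ x.T)).T

opt = torch.optim.Adam([W1, W2], lr=1e-2)
for step in range(500):
    x = torch.randn(64, d)
    x[:16] = p  # a quarter of every batch is the repeated point (assumption)
    loss = (transcoder(x) - x.abs()).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Inspect which hidden unit responds most strongly to p; a unit whose
# activation on p stands out is a candidate "memorization feature".
acts_p = torch.relu(W1 @ p)
print(int(acts_p.argmax()))
```

Whether a dedicated feature actually emerges depends on the data mix and training details; the point of the sketch is just the setup under which memorization becomes favorable.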