Chris Olah

@ch402

Followers
123K
Following
11K
Media
468
Statuses
5K

Reverse engineering neural networks at @AnthropicAI. Previously @distillpub, OpenAI Clarity Team, Google Brain. Personal account.

San Francisco, CA
Joined June 2010
@sleepinyourhat
Sam Bowman
10 days
[Sonnet 4.5 🧵] Here's the north-star goal for our pre-deployment alignment evals work: The information we share alongside a model should give you an accurate overall sense of the risks the model could pose. It won’t tell you everything, but you shouldn’t be...
8
10
128
@sprice354_
Sara Price
10 days
🎉🎉 Today we launched Claude Sonnet 4.5, which is not only highly capable but also a major improvement on safety and alignment https://t.co/NcBEBE8bLs
@claudeai
Claude
10 days
Introducing Claude Sonnet 4.5—the best coding model in the world. It's the strongest model for building complex agents. It's the best model at using computers. And it shows substantial gains on tests of reasoning and math.
6
13
142
@ch402
Chris Olah
10 days
We're starting to apply interpretability to pre-deployment audits!
@Jack_W_Lindsey
Jack Lindsey
10 days
Prior to the release of Claude Sonnet 4.5, we conducted a white-box audit of the model, applying interpretability techniques to “read the model’s mind” in order to validate its reliability and alignment. This was the first such audit on a frontier LLM, to our knowledge. (1/15)
3
12
231
@Jack_W_Lindsey
Jack Lindsey
10 days
Prior to the release of Claude Sonnet 4.5, we conducted a white-box audit of the model, applying interpretability techniques to “read the model’s mind” in order to validate its reliability and alignment. This was the first such audit on a frontier LLM, to our knowledge. (1/15)
44
170
1K
@jaschasd
Jascha Sohl-Dickstein
11 days
Title: Advice for a young investigator in the first and last days of the Anthropocene Abstract: Within just a few years, it is likely that we will create AI systems that outperform the best humans on all intellectual tasks. This will have implications for your research and
55
255
2K
@claudeai
Claude
21 days
Keep thinking.
894
3K
27K
@ch402
Chris Olah
28 days
I believe in the rule of law. I believe in freedom of speech. I am against political violence. (I almost feel silly posting this, but it seems good to reinforce common knowledge that many still hold the basic pre-political commitments of liberal democracy.)
13
13
376
@AnthropicAI
Anthropic
1 month
Anthropic is endorsing California State Senator Scott Wiener’s SB 53. This bill provides a strong foundation to govern powerful AI systems built by frontier AI companies like ours, and does so via transparency rather than technical micromanagement.
118
101
1K
@EthanJPerez
Ethan Perez
1 month
We’re hiring someone to run the Anthropic Fellows Program! Our research collaborations have led to some of our best safety research and hires. We’re looking for an exceptional ops generalist, TPM, or research/eng manager to help us significantly scale and improve our collabs 🧵
10
41
255
@ch402
Chris Olah
1 month
Really exciting to see this hypothesis being explored more! I confess, I've become more and more persuaded of this in my personal thinking over time, since our extremely preliminary results in Toy Models. Great to see a more serious investigation!
@livgorton
Liv
1 month
What if adversarial examples aren't a bug, but a direct consequence of how neural networks process information? We've found evidence that superposition – the way networks represent many more features than they have neurons – might cause adversarial examples.
7
12
297
@ch402
Chris Olah
2 months
Our interpretability team is planning to mentor more fellows this cycle! Applications are due Aug 17.
@AnthropicAI
Anthropic
2 months
We’re running another round of the Anthropic Fellows program. If you're an engineer or researcher with a strong coding or technical background, you can apply to receive funding, compute, and mentorship from Anthropic, beginning this October. There'll be around 32 places.
17
18
331
@ch402
Chris Olah
2 months
This is probably obvious, but to say it explicitly, I'm still very keen on SAEs and transcoders. :)
5
1
43
@ch402
Chris Olah
2 months
What's the point of all of this? For me, this question of mechanistic faithfulness is the most important question in all the SAE debate. I think it's often mixed in with other things and kind of implicit, and I wanted to have a simple example that clearly isolates it.
1
1
36
@ch402
Chris Olah
2 months
More details in the note -
1
1
18
@ch402
Chris Olah
2 months
It turns out there's a fix! If we ask to match the Jacobian of absolute value, we get the correct solution again.
1
0
23
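A minimal sketch of what "matching the Jacobian" could mean here, assuming a one-hidden-layer ReLU transcoder of the form `W_dec @ relu(W_enc @ x + b_enc)`. The function names, the squared-Frobenius penalty, and the hand-built "memorizing" feature below are all illustrative assumptions, not the construction from the note. The only fact relied on is that the Jacobian of elementwise absolute value, away from zero, is `diag(sign(x))`.

```python
def sign(v):
    return (v > 0) - (v < 0)

def abs_jacobian(x):
    """Jacobian of elementwise |x| away from 0: diag(sign(x_i))."""
    n = len(x)
    return [[float(sign(x[i])) if i == j else 0.0 for j in range(n)]
            for i in range(n)]

def relu_transcoder_jacobian(x, W_enc, b_enc, W_dec):
    """Jacobian of W_dec @ relu(W_enc @ x + b_enc) at the point x."""
    pre = [sum(w * xi for w, xi in zip(row, x)) + b
           for row, b in zip(W_enc, b_enc)]
    active = [1.0 if p > 0 else 0.0 for p in pre]  # which features fire
    return [[sum(W_dec[i][k] * active[k] * W_enc[k][j] for k in range(len(pre)))
             for j in range(len(x))] for i in range(len(W_dec))]

def jacobian_mismatch(x, W_enc, b_enc, W_dec):
    """Squared Frobenius distance between transcoder and abs Jacobians at x."""
    J_t = relu_transcoder_jacobian(x, W_enc, b_enc, W_dec)
    J_a = abs_jacobian(x)
    return sum((a - b) ** 2 for ra, rb in zip(J_t, J_a) for a, b in zip(ra, rb))

# Two features per input: relu(x_i) and relu(-x_i), summed by the decoder.
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
b_enc = [0.0, 0.0, 0.0, 0.0]
W_dec = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]

# Hypothetical extra feature that only fires near p = [1, 1] (assumption).
W_enc_mem = W_enc + [[1.0, 1.0]]
b_enc_mem = b_enc + [-1.5]
W_dec_mem = [[1.0, 0.0, 1.0, 0.0, 0.5], [0.0, 1.0, 0.0, 1.0, 0.5]]

print(jacobian_mismatch([1.0, -2.0], W_enc, b_enc, W_dec))
print(jacobian_mismatch([1.0, 1.0], W_enc_mem, b_enc_mem, W_dec_mem))
```

In this toy setup the two-features-per-input transcoder has zero mismatch, while the extra point-specific feature changes the local Jacobian at `p` and is penalized.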
@ch402
Chris Olah
2 months
Since absolute value doesn't treat p specially, this solution isn't mechanistically faithful. It's using different computation to try to mimic abs.
1
0
18
@ch402
Chris Olah
2 months
But now let's add a repeated data point to the transcoder training data, p=[1,1,1,0,0,0,0...]. The transcoder now learns a special feature to memorize that point!
1
0
21
@ch402
Chris Olah
2 months
Transcoders can learn the perfect solution!
1
1
20
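For readers following the thread, here is a minimal sketch of what a "perfect solution" could look like, assuming the target is elementwise absolute value (as the other posts in this thread indicate) and assuming a ReLU transcoder with two features per input dimension. The names `relu` and `transcoder_abs` are illustrative and do not come from the note.

```python
def relu(v):
    """Elementwise ReLU on a list of floats."""
    return [max(x, 0.0) for x in v]

def transcoder_abs(x):
    """Hypothetical 2n-feature ReLU transcoder computing |x| elementwise.

    Encoder: features relu(x_i) and relu(-x_i) for each input i.
    Decoder: sums each pair, using |x_i| = relu(x_i) + relu(-x_i).
    """
    pos = relu(x)
    neg = relu([-xi for xi in x])
    return [p + n for p, n in zip(pos, neg)]

x = [1.5, -2.0, -0.25]
print(transcoder_abs(x))
```

The identity `|x| = relu(x) + relu(-x)` is exact, so this hand-written transcoder reproduces absolute value with no approximation error and treats every input the same way.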