Chris Olah @ch402 X Profile

Chris Olah

@ch402

Followers

123K

Following

11K

Media

468

Statuses

5K

Reverse engineering neural networks at @AnthropicAI. Previously @distillpub, OpenAI Clarity Team, Google Brain. Personal account.

https://t.co/9YaXXfkNk4

San Francisco, CA

Joined June 2010

Don't wanna be here? Send us removal request.

Sam Bowman

@sleepinyourhat

10 days

[Sonnet 4.5 🧵] Here's the north-star goal for our pre-deployment alignment evals work: The information we share alongside a model should give you an accurate overall sense of the risks the model could pose. It won’t tell you everything, but you shouldn’t be...

8

10

128

Sara Price

@sprice354_

10 days

🎉🎉 Today we launched Claude Sonnet 4.5, which is not only highly capable but also a major improvement on safety and alignment https://t.co/NcBEBE8bLs

Claude

@claudeai

10 days

Introducing Claude Sonnet 4.5—the best coding model in the world. It's the strongest model for building complex agents. It's the best model at using computers. And it shows substantial gains on tests of reasoning and math.

6

13

142

Chris Olah

@ch402

10 days

We're starting to apply interpretability to pre-deployment audits!

Jack Lindsey

@Jack_W_Lindsey

10 days

Prior to the release of Claude Sonnet 4.5, we conducted a white-box audit of the model, applying interpretability techniques to “read the model’s mind” in order to validate its reliability and alignment. This was the first such audit on a frontier LLM, to our knowledge. (1/15)

3

12

231

Jack Lindsey

@Jack_W_Lindsey

10 days

Prior to the release of Claude Sonnet 4.5, we conducted a white-box audit of the model, applying interpretability techniques to “read the model’s mind” in order to validate its reliability and alignment. This was the first such audit on a frontier LLM, to our knowledge. (1/15)

44

170

1K

Jascha Sohl-Dickstein

@jaschasd

11 days

Title: Advice for a young investigator in the first and last days of the Anthropocene Abstract: Within just a few years, it is likely that we will create AI systems that outperform the best humans on all intellectual tasks. This will have implications for your research and

55

255

2K

Claude

@claudeai

21 days

Keep thinking.

894

3K

27K

Chris Olah

@ch402

28 days

I believe in the rule of law. I believe in freedom of speech. I am against political violence. (I almost feel silly posting this, but it seems good to reinforce common knowledge that many still hold the basic pre-political commitments of liberal democracy.)

13

376

Anthropic

@AnthropicAI

1 month

Anthropic is endorsing California State Senator Scott Wiener’s SB 53. This bill provides a strong foundation to govern powerful AI systems built by frontier AI companies like ours, and does so via transparency rather than technical micromanagement.

118

101

1K

Ethan Perez

@EthanJPerez

1 month

We’re hiring someone to run the Anthropic Fellows Program! Our research collaborations have led to some of our best safety research and hires. We’re looking for an exceptional ops generalist, TPM, or research/eng manager to help us significantly scale and improve our collabs 🧵

10

41

255

Chris Olah

@ch402

1 month

Really exciting to see this hypothesis being explored more! I confess, I've become more and more persuaded of this in my personal thinking over time, since our extremely preliminary results in Toy Models. Great to see a more serious investigation!

Liv

@livgorton

1 month

What if adversarial examples aren't a bug, but a direct consequence of how neural networks process information? We've found evidence that superposition – the way networks represent many more features than they have neurons – might cause adversarial examples.

7

12

297

Chris Olah

@ch402

2 months

Our interpretability team is planning to mentor more fellows this cycle! Applications are due Aug 17.

Anthropic

@AnthropicAI

2 months

We’re running another round of the Anthropic Fellows program. If you're an engineer or researcher with a strong coding or technical background, you can apply to receive funding, compute, and mentorship from Anthropic, beginning this October. There'll be around 32 places.

17

18

331

Chris Olah

@ch402

2 months

Our recent work on attribution graphs ( https://t.co/qbIhdV7OKz ) and extending it to attention ( https://t.co/Mf8JLvWH9K ), point towards how much potential they have if we can mitigate the issues.

transformer-circuits.pub

We investigate the internal mechanisms used by Claude 3.5 Haiku — Anthropic's lightweight production model — in a variety of contexts, using our circuit tracing methodology.

2

3

67

Chris Olah

@ch402

2 months

This is probably obvious, but to say it explciitly, I'm still very keen on SAEs and transcoders. :)

5

1

43

Chris Olah

@ch402

2 months

What's the point of all of this? For me, this question of mechanistic faithfulness is the most important question in all the SAE debate. I think it's often mixed in with other things and kind of implicit, and I wanted to have a simple example that clearly isolates it.

1

36

Chris Olah

@ch402

2 months

More details in the note -

1

18

Chris Olah

@ch402

2 months

It turns out there's a fix! If we ask to match the Jacobian of absolute value, we get the correct solution again.

1

0

23

Chris Olah

@ch402

2 months

Since absolute value doesn't treat p specially, this solutionn isn't mechanistically faithful. It's using differnet computation to try to mimic abs.

1

0

18

Chris Olah

@ch402

2 months

But now let's add a repeated data point to the transcoder training data, p=[1,1,1,0,0,0,0...] The transcoder now learns a special feature to memorize that point!

1

0

21

Chris Olah

@ch402

2 months

Transcoders can will learn the perfect solution!

1

20