Chris Olah

@ch402

Followers 126K · Following 11K · Media 470 · Statuses 5K

Reverse engineering neural networks at @AnthropicAI. Previously @distillpub, OpenAI Clarity Team, Google Brain. Personal account.

San Francisco, CA
Joined June 2010
@sleepinyourhat
Sam Bowman
1 day
From everything we know so far, Opus 4.5 seems to be the best-aligned model out there in a bunch of ways. I follow the training process closely as part of my work on alignment evaluations. Here's my guess about the two things that are most responsible for making 4.5 special. đŸ§”
16
43
517
@ch402
Chris Olah
9 days
Maybe the best way to explain it is by contrast with Twitter. Many similar issues get discussed on Slack, but the norms of engagement, of giving serious thought before responding, of trying to be nuanced and thoughtful... It just feels night and day different.
2
1
61
@ch402
Chris Olah
9 days
This is one of my favorite things about Anthropic culture as well. Dario kind of embodies this, but there's more generally a kind of "essay culture", a shared practice of open, intellectual debate, with norms that expect a kind of seriousness and earnestness.
@_sholtodouglas
Sholto Douglas
12 days
Dario's essays and long debate Slack threads are one of my favorite parts of Anthropic's culture. They're open, detailed - and incredibly raw. Everyone at the company ends up having a good sense of how the company is making decisions and what matters. It's the kind of thing that
9
15
460
@tbpn
TBPN
12 days
Anthropic's @_sholtodouglas says Dario Amodei's internal communication style is producing a compendium of essays in the company's Slack that will have effectively charted the history of AGI once we get there. "He has a really, really cool communication style. He quite frequently
33
41
1K
@Jack_W_Lindsey
Jack Lindsey
11 days
One analysis from our pre-release audit of Opus 4.5 stands out to me. Our behavioral evals uncovered an example of apparent deception by the model. By analyzing the internal activations, we identified a suspected root cause, and cases of similar behavior during training. (1/7)
13
50
398
@sleepinyourhat
Sam Bowman
12 days
Opus 4.5 is a very good model, in nearly every sense we know how to measure. I’m also confident that it’s the model that we understand best as of its launch day: The system card includes 150 pages of research results, 50 of them on alignment.
@claudeai
Claude
12 days
Introducing Claude Opus 4.5: the best model in the world for coding, agents, and computer use. Opus 4.5 is a step forward in what AI systems can do, and a preview of larger changes to how work gets done.
6
29
463
@janleike
Jan Leike
12 days
More progress on Claude's alignment!
@claudeai
Claude
12 days
Introducing Claude Opus 4.5: the best model in the world for coding, agents, and computer use. Opus 4.5 is a step forward in what AI systems can do, and a preview of larger changes to how work gets done.
20
27
632
@AnthropicAI
Anthropic
15 days
New Anthropic research: Natural emergent misalignment from reward hacking in production RL. “Reward hacking” is where models learn to cheat on tasks they’re given during training. Our new study finds that the consequences of reward hacking, if unmitigated, can be very serious.
202
574
4K
@AnthropicAI
Anthropic
1 month
New Anthropic research: Signs of introspection in LLMs. Can language models recognize their own internal thoughts? Or do they just make up plausible answers when asked about them? We found evidence for genuine—though limited—introspective capabilities in Claude.
298
805
5K
@wesg52
Wes Gurnee
2 months
New paper! We reverse engineered the mechanisms underlying Claude Haiku’s ability to perform a simple “perceptual” task. We discover beautiful feature families and manifolds, clean geometric transformations, and distributed attention algorithms!
44
316
2K
@sleepinyourhat
Sam Bowman
2 months
[Sonnet 4.5 đŸ§”] Here's the north-star goal for our pre-deployment alignment evals work: The information we share alongside a model should give you an accurate overall sense of the risks the model could pose. It won’t tell you everything, but you shouldn’t be...
8
11
144
@sprice354_
Sara Price
2 months
🎉🎉 Today we launched Claude Sonnet 4.5, which is not only highly capable but also a major improvement on safety and alignment https://t.co/NcBEBE8bLs
@claudeai
Claude
2 months
Introducing Claude Sonnet 4.5—the best coding model in the world. It's the strongest model for building complex agents. It's the best model at using computers. And it shows substantial gains on tests of reasoning and math.
8
14
152
@ch402
Chris Olah
2 months
We're starting to apply interpretability to pre-deployment audits!
@Jack_W_Lindsey
Jack Lindsey
2 months
Prior to the release of Claude Sonnet 4.5, we conducted a white-box audit of the model, applying interpretability techniques to “read the model’s mind” in order to validate its reliability and alignment. This was the first such audit on a frontier LLM, to our knowledge. (1/15)
4
12
245
@Jack_W_Lindsey
Jack Lindsey
2 months
Prior to the release of Claude Sonnet 4.5, we conducted a white-box audit of the model, applying interpretability techniques to “read the model’s mind” in order to validate its reliability and alignment. This was the first such audit on a frontier LLM, to our knowledge. (1/15)
43
176
1K
@jaschasd
Jascha Sohl-Dickstein
2 months
Title: Advice for a young investigator in the first and last days of the Anthropocene Abstract: Within just a few years, it is likely that we will create AI systems that outperform the best humans on all intellectual tasks. This will have implications for your research and
57
262
2K
@claudeai
Claude
3 months
Keep thinking.
885
3K
27K
@ch402
Chris Olah
3 months
I believe in the rule of law. I believe in freedom of speech. I am against political violence. (I almost feel silly posting this, but it seems good to reinforce common knowledge that many still hold the basic pre-political commitments of liberal democracy.)
13
12
376
@AnthropicAI
Anthropic
3 months
Anthropic is endorsing California State Senator Scott Wiener’s SB 53. This bill provides a strong foundation to govern powerful AI systems built by frontier AI companies like ours, and does so via transparency rather than technical micromanagement.
114
101
1K
@EthanJPerez
Ethan Perez
3 months
We’re hiring someone to run the Anthropic Fellows Program! Our research collaborations have led to some of our best safety research and hires. We’re looking for an exceptional ops generalist, TPM, or research/eng manager to help us significantly scale and improve our collabs đŸ§”
10
42
259
@ch402
Chris Olah
3 months
Really exciting to see this hypothesis being explored more! I confess, I've become more and more persuaded of this in my personal thinking over time, since our extremely preliminary results in Toy Models. Great to see a more serious investigation!
@livgorton
Liv ✈ NeurIPS
3 months
What if adversarial examples aren't a bug, but a direct consequence of how neural networks process information? We've found evidence that superposition – the way networks represent many more features than they have neurons – might cause adversarial examples.
7
13
298