hoagy (@HoagyCunningham) · 488 Followers · 2K Following · 11 Media · 168 Statuses
Absolutely love the interaction between feature-based and geometric understanding in this paper!
New paper! We reverse engineered the mechanisms underlying Claude Haiku’s ability to perform a simple “perceptual” task. We discover beautiful feature families and manifolds, clean geometric transformations, and distributed attention algorithms!
Work done with @MrinankSharma Vlad Mikulik @AlwinPeng @JerryWeiAI @euan_ong @mishajw126 @FabienDRoger @petrini_linda!
Safeguards (and Control.. and Alignment Science..) are hiring! If you’d like to help ensure that the protections on our models scale with their capabilities, consider applying to the team at
anthropic.com: Anthropic is an AI safety and research company that's working to build reliable, interpretable, and steerable AI systems.
Read the full post for more details at https://t.co/r24Pw1eoJX! 7/
These results come with some caveats: while we've tested on prompts from past red-teaming, these approaches themselves haven't been red-teamed. Also, improvements in the training of small LLMs would reduce the advantages of this approach. 6/
We can further improve the cost-performance tradeoff by using a two-stage classifier. Using probes or a single retrained layer as the first-stage classifier yields stronger systems at the same overall cost than using small models as the first stage. 5/
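To make the cascade concrete, here is a rough sketch of what a two-stage setup could look like (illustrative PyTorch; `probe` and `dedicated_classifier` are hypothetical stand-ins for the first and second stages, not the actual production code):

```python
import torch

def two_stage_classify(prompt, activations, probe, dedicated_classifier,
                       escalate_threshold=0.05):
    """Sketch of a two-stage cascade (illustrative names, not the real system).

    Stage 1: a cheap linear probe scores activations reused from the policy
             model's own forward pass, for every query.
    Stage 2: only queries the probe cannot confidently clear are sent to the
             expensive standalone classifier.
    """
    with torch.no_grad():
        p_harmful = torch.sigmoid(probe(activations)).item()
    if p_harmful < escalate_threshold:
        return False  # confidently benign; no second-stage compute spent
    # Escalate the small flagged fraction of traffic to the strong classifier.
    return dedicated_classifier(prompt)
```

The escalation threshold is the knob that trades first-stage recall against how much traffic reaches the expensive second stage.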
Linear probes perform better than a model with 2% of the main model parameters, but worse than a model with 10%. We aggregate activations by using an exponential moving average or probing after a query-related suffix – simply using the mean over tokens is far weaker. 4/
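Roughly what an EMA-aggregated probe could look like (a minimal sketch; the activation site, decay value, and training details here are assumptions, not the blog's code):

```python
import torch
import torch.nn as nn

class EMAProbe(nn.Module):
    """Linear probe on an exponential moving average of per-token activations.
    The decay value and choice of layer are illustrative."""

    def __init__(self, d_model: int, decay: float = 0.9):
        super().__init__()
        self.decay = decay
        self.linear = nn.Linear(d_model, 1)

    def forward(self, acts: torch.Tensor) -> torch.Tensor:
        # acts: (seq_len, d_model) activations from one layer of the policy model
        ema = acts[0]
        for tok_act in acts[1:]:
            # recent tokens get more weight than under a plain mean over tokens
            ema = self.decay * ema + (1 - self.decay) * tok_act
        return self.linear(ema)  # logit for harmful vs. benign
```

In this sketch the probe would be trained with a standard binary cross-entropy loss on labeled prompts, reusing activations the policy model has already computed.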
Out of the methods we consider, retraining just the final layer of the policy model is second only to a standalone classifier of the same size as the main model, and outperforms a standalone classifier with 25% of the parameters of the policy model. 3/
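A minimal sketch of what retraining just the final layer could look like (hypothetical module names such as `policy_model.layers`; the real setup is described in the blog post):

```python
import copy
import torch.nn as nn

def build_final_layer_classifier(policy_model, d_model: int):
    """Freeze the policy model and retrain only a copy of its last block plus
    a small head. `policy_model.layers` and `d_model` are illustrative names."""
    for p in policy_model.parameters():
        p.requires_grad = False                       # shared trunk stays frozen

    final_block = copy.deepcopy(policy_model.layers[-1])  # trainable copy
    head = nn.Linear(d_model, 1)                           # harmful/benign logit

    def classify(hidden_states):
        # hidden_states: (seq_len, d_model) activations entering the last layer,
        # already produced during the policy model's own forward pass
        h = final_block(hidden_states)
        return head(h[-1])                             # score at the final token

    trainable_params = list(final_block.parameters()) + list(head.parameters())
    return classify, trainable_params
```

The point of the design is that almost all of the compute is shared with the policy model; only the copied block and the head add cost at inference time.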
Some background: as part of Anthropic’s ASL-3 protections we use classifiers which block bioweapon-related prompts and outputs. Strong dedicated classifiers are expensive, so we’d like to improve cost efficiency by reusing computation from the main model. 2/
New Anthropic blog: We benchmark approaches to making classifiers more cost-effective by reusing activations from the model being queried. We find that using linear probes or retraining just a single layer of the model can push the cost-effectiveness frontier. 🧵1/
Super hyped about this.. circuits are still a WIP but there are probably thousands of novel mechanisms waiting to be discovered in these tools with just the right prompts and careful attention
Our interpretability team recently released research that traced the thoughts of a large language model. Now we’re open-sourcing the method. Researchers can generate “attribution graphs” like those in our study, and explore them interactively.
But this would still miss the unknown unknowns.. and while we're still missing lots with SAEs, imo the way that LLMs compute outputs is a lot less mysterious than it would otherwise be.. and that's a huge, extremely hard to quantify, achievement, and I expect progress to continue!
There probably are many other ways of doing this, like generating 1M datasets for simple probes - pretty doable! Methods limited by the dimension of the activation space can't do this, I don't think
V common complaint re SAEs, especially at Anthropic, is a lack of baselines. I sort of agree, but I think a real baseline has to also generate a comparable artifact, not just be a better probe
We can understand the forward pass much more with circuits, we can somewhat track the impact of training with model diffing, and I'm excited about understanding the backward pass through this lens too
But, they are a genuine (though imperfect) map of the latent space! And that's huge! Less because of immediate utility and more bc of what we can build on top.
Since they're not fundamental, on their own they won't beat supervised baselines and that's also sad
In particular we need this because the space is so high dimensional that we can't just point at random; we need to focus on high-density directions (SAEs have other inductive biases, but less clearly useful)
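For anyone who hasn't seen one, here is a minimal sparse autoencoder sketch (the textbook formulation, not any particular Anthropic implementation); the L1 penalty on the hidden activations is the inductive bias that pushes the learned dictionary toward directions that actually recur in the data:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: reconstruct model activations through an overcomplete,
    sparsely-activating dictionary. Sizes and coefficients are illustrative."""

    def __init__(self, d_model: int, n_features: int, l1_coeff: float = 1e-3):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)
        self.l1_coeff = l1_coeff

    def forward(self, acts: torch.Tensor):
        feats = torch.relu(self.encoder(acts))       # sparse feature activations
        recon = self.decoder(feats)                  # reconstruction of the input
        mse = ((recon - acts) ** 2).mean()
        sparsity = feats.abs().sum(dim=-1).mean()    # L1 term: the sparsity bias
        return recon, feats, mse + self.l1_coeff * sparsity
```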
Instead, I think they're best understood like pixels in an image: not really that meaningful on their own, but they give a fairly comprehensive overview
First, features are v unlikely to be the 'real unit' of analysis, and that's sad. Finding them all was probably impossible in real models anyway since they're always being created and evolving