@NeelNanda5
Neel Nanda
2 months
Fantastic work from @sen_r and @ArthurConmy - done in an impressive two-week paper sprint! Gated SAEs are a new sparse autoencoder architecture that seems to be a major Pareto improvement. This is now my team's preferred way to train SAEs, and I hope it'll accelerate the community's work!
@sen_r
Senthooran Rajamanoharan
2 months
New @GoogleDeepMind MechInterp work! We introduce Gated SAEs, a Pareto improvement over existing sparse autoencoders. They find equally good reconstructions with around half as many firing features, while maintaining interpretability (confidence interval: 0-13% improvement). Joint w/ @ArthurConmy
Tweet media one

Replies

@NeelNanda5
Neel Nanda
2 months
In the process, we also scaled both kinds of SAEs to Gemma 7B, showing SAEs work on non-toy models! Gated SAEs remain an interpretable Pareto improvement. As a side effect, we showed that SAEs work on Gemma's Gated MLPs. That is: Gated SAEs still beat normal SAEs on Gated MLPs!
Tweet media one
@NeelNanda5
Neel Nanda
2 months
Two elegant key ideas:
* SAEs use an L1 penalty to incentivise sparsity, but it distorts reconstructions by penalising magnitude. Why not just apply it to deciding *which* features fire, not how much?
* The resulting architecture boils down to a bunch of training tricks + replacing ReLUs w/ Jump ReLUs.
Tweet media one
Tweet media two
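To make the first idea concrete, here is a minimal PyTorch sketch of the gating trick (hypothetical class and variable names; the paper's actual parameterisation shares encoder weights between the two paths up to a learned per-feature rescaling, and its training loss adds an auxiliary term so the binary gate still receives gradient, both omitted here):

```python
import torch
import torch.nn as nn

class GatedSAESketch(nn.Module):
    """Toy Gated SAE: a gate path decides *which* features fire,
    a separate magnitude path decides *how much* they fire."""

    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_gate = nn.Parameter(torch.zeros(d_sae))  # per-feature firing threshold
        self.b_mag = nn.Parameter(torch.zeros(d_sae))   # per-feature magnitude bias
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        pre = (x - self.b_dec) @ self.W_enc
        gate = (pre + self.b_gate > 0).float()  # binary: which features are on
        mag = torch.relu(pre + self.b_mag)      # positive: how strongly they fire
        acts = gate * mag
        # In training, the L1 sparsity penalty is applied to the gate path
        # (relu(pre + b_gate)), not to acts, so feature magnitudes escape
        # the shrinkage that L1 causes in a standard SAE.
        return acts @ self.W_dec + self.b_dec
```

With the encoder weights shared like this, the forward pass amounts to a per-feature Jump ReLU on the pre-activations, which is the connection the next few tweets unpack.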
@NeelNanda5
Neel Nanda
2 months
Why are Jump ReLUs better? A toy model:
* The SAE wants to map the blue histogram to zero, and the red one to itself.
* Geometrically, a ReLU sets a threshold t: below it the output is zero, above it the output is the distance to the threshold.
* A threshold of t=1 distorts red by shrinking its values, while t=0 keeps too much blue.
Tweet media one
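A tiny numpy sketch of this toy (illustrative numbers, not from the paper): blue stands for interference we want mapped to zero, red for a genuinely firing feature we want reproduced exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
blue = rng.uniform(0.0, 0.8, size=1000)  # interference: should map to zero
red = rng.uniform(2.0, 4.0, size=1000)   # true feature values: should pass through

def shifted_relu(x, t):
    """ReLU with threshold t: zero below t, distance to t above it."""
    return np.maximum(x - t, 0.0)

# t=0 (plain ReLU): essentially all of the blue interference survives.
print((shifted_relu(blue, 0.0) > 0).mean())          # ~1.0
# t=1: blue is zeroed, but every red value is shrunk by 1 -- the distortion.
print(np.abs(shifted_relu(red, 1.0) - red).mean())   # ~1.0
```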
@NeelNanda5
Neel Nanda
2 months
Jump ReLUs solve this! Geometrically, they set *two* lines: below the first, x is mapped to zero; above it, we return the distance to the *second* line. So "if above 1, return the distance to 0" is a great solution to the toy problem! This lets SAEs ignore interference without distortion.
Tweet media one
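The same toy with a Jump ReLU (threshold of 1.0 picked purely for illustration): everything below the threshold is zeroed, everything above is returned unchanged, i.e. its distance to the second line at 0.

```python
import numpy as np

rng = np.random.default_rng(0)
blue = rng.uniform(0.0, 0.8, size=1000)  # interference: should map to zero
red = rng.uniform(2.0, 4.0, size=1000)   # true feature values: should pass through

def jump_relu(x, theta=1.0):
    """Zero below the threshold theta, x itself above it."""
    return x * (x > theta)

print((jump_relu(blue) > 0).mean())         # 0.0 -- all interference removed
print(np.abs(jump_relu(red) - red).max())   # 0.0 -- no shrinkage of true features
```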
@NeelNanda5
Neel Nanda
2 months
But Jump ReLUs & avoiding shrinkage aren't enough to explain why Gated SAEs are so great! We fine-tune a frozen normal SAE to use Jump ReLUs, which solves its shrinkage but still underperforms Gated SAEs! We speculate that Jump ReLUs need better encoder directions to exploit the new expressivity.
Tweet media one
@NeelNanda5
Neel Nanda
2 months
A key concern in dictionary learning + interp work is that your learning method will be *too* powerful: it learns a great sparse reconstruction that is too complex for the actual LLM to decode or use, and so isn't faithful to the model's computation and is likely uninterpretable.
@NeelNanda5
Neel Nanda
2 months
I care a lot about avoiding this trap! This is one of the reasons the field has somewhat converged on SAEs, not more powerful methods. And real models don't have discontinuities like a Jump ReLU. So are Gated SAEs BS?
@NeelNanda5
Neel Nanda
2 months
I don't think so! We can't yet prove they (or normal SAEs!) are faithful to the model's computation, but they were somewhere between comparable and more interpretable in a double-blind human study, which is evidence. I'm excited for future work on whether SAEs find causally meaningful features.
@NeelNanda5
Neel Nanda
2 months
An upset: Just before release, we got the final interp results, and they *just* dropped below statistical significance for Gated SAEs being more interpretable. In my heart of hearts, I am a Bayesian, and I'd bet that they're better, but make of our results what you will (n=342).
Tweet media one
@NeelNanda5
Neel Nanda
2 months
Check out the paper here!