Fantastic work from @sen_r and @ArthurConmy - done in an impressive 2 week paper sprint! Gated SAEs are a new sparse autoencoder architecture that seems to be a major Pareto improvement. This is now my team's preferred way to train SAEs, and I hope it'll accelerate the community's work!
New @GoogleDeepMind MechInterp work! We introduce Gated SAEs, a Pareto improvement over existing sparse autoencoders. They find equally good reconstructions with around half as many firing features, while maintaining interpretability (CI 0-13% improvement). Joint w/ @ArthurConmy
In the process, we also scaled both kinds of SAEs to Gemma 7B, showing SAEs work on non-toy models! Gated SAEs remain an interpretable Pareto improvement.
As a side effect, we showed that SAEs work on Gemma's Gated MLPs. That is: Gated SAEs still beat normal SAEs on Gated MLPs!
Two elegant key ideas:
* SAEs use an L1 penalty to incentivise sparsity, but it distorts reconstructions by penalising magnitude (shrinkage). Why not apply it only to choosing *which* features fire, not how much?
* The resulting architecture boils down to a bunch of training tricks + replacing ReLUs w/ Jump ReLUs (minimal sketch below).
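To make the gate/magnitude split concrete, here's a minimal PyTorch sketch of the core idea. This is my own simplification, not the paper's training code: it omits pieces like the paper's auxiliary reconstruction loss for training the gate path, and the names (d_model, d_sae, r_mag) are illustrative.

```python
import torch
import torch.nn as nn

class GatedSAE(nn.Module):
    """Sketch of a Gated SAE (simplified; omits the paper's auxiliary loss etc.)."""

    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_gate = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_gate = nn.Parameter(torch.zeros(d_sae))
        # The magnitude path shares encoder directions with the gate path,
        # up to a learned per-feature rescaling r_mag and its own bias.
        self.r_mag = nn.Parameter(torch.zeros(d_sae))
        self.b_mag = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor):
        x_cent = x - self.b_dec
        pi_gate = x_cent @ self.W_gate.T + self.b_gate      # decides WHICH features fire
        W_mag = self.W_gate * torch.exp(self.r_mag)[:, None]
        pi_mag = x_cent @ W_mag.T + self.b_mag              # decides HOW MUCH they fire
        f = (pi_gate > 0).to(x.dtype) * torch.relu(pi_mag)  # gated feature activations
        x_hat = f @ self.W_dec.T + self.b_dec
        # L1 is applied to the gate pre-activations only, so sparsity pressure
        # picks which features fire without shrinking their magnitudes.
        l_sparsity = torch.relu(pi_gate).sum(-1).mean()
        l_recon = (x - x_hat).pow(2).sum(-1).mean()
        return x_hat, l_recon, l_sparsity
```

The key point is that the L1 term only sees the gate pre-activations, so sparsity pressure never shrinks the magnitudes the decoder actually uses.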
Why are Jump ReLUs better? A toy model:
* The SAE wants to map the blue histogram to zero, and the red to itself
* Geometrically, a ReLU sets a threshold t: below it the output is zero, above it the output is the distance to the threshold (x - t)
* t=1 distorts red by shifting every value down by 1; t=0 keeps too much blue.
Jump ReLUs solve this! Geometrically, they set *two* lines: below the first, x is mapped to zero; above it, we return the distance to the *second* line. So "if above 1, return distance to 0" is a great solution to this problem! This lets SAEs ignore interference without distortion.
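Here's the toy model in code, with a Jump ReLU whose second line is at 0 (the numbers are made up for illustration; real Gated SAEs learn thresholds per feature):

```python
import torch

def jump_relu(x: torch.Tensor, threshold: float = 1.0) -> torch.Tensor:
    # Below the threshold: zero. Above it: return x itself, i.e. the
    # distance to the *second* line at 0 -- no shrinkage.
    return x * (x > threshold).to(x.dtype)

noise = torch.tensor([0.1, 0.4, 0.8])    # "blue": interference, should map to zero
signal = torch.tensor([1.3, 2.0, 3.5])   # "red": real activations, should pass through

print(jump_relu(noise))          # tensor([0., 0., 0.])            -- interference ignored
print(jump_relu(signal))         # tensor([1.3000, 2.0000, 3.5000]) -- undistorted
print(torch.relu(signal - 1.0))  # tensor([0.3000, 1.0000, 2.5000]) -- ReLU w/ t=1 shrinks red
```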
But Jump ReLUs & avoiding shrinkage aren't enough to explain why Gated SAEs are so great! We fine-tune a frozen normal SAE to have Jump ReLUs, which solves its shrinkage, yet it still underperforms Gated SAEs! We speculate that Jump ReLUs need better encoder directions to exploit the new expressivity.
A key concern in dictionary learning + interp work is that your learning method is *too* powerful: it learns a great sparse reconstruction that's too complex for the actual LLM to decode or use, so it isn't faithful to the model's computation and is likely uninterpretable.
I care a lot about avoiding this trap! This is one of the reasons the field has somewhat converged on SAEs, not more powerful methods. And real models don't have discontinuities like a Jump ReLU. So are Gated SAEs BS?
I don't think so! We can't yet prove they (or normal SAEs!) are faithful to the model's computation, but they were between comparable and more interpretable in a double-blind human study, which is evidence. I'm excited for future work on whether SAEs find causally meaningful features.
An upset: just before release, we got the final interp results, and Gated SAEs *just* dropped below statistical significance for being more interpretable. In my heart of hearts, I am a Bayesian, and I'd bet that they're better, but make of our results what you will (n=342).